Friday, July 6, 2012

Lessons Netflix Learned From The AWS Storm

by Greg Orzell & Ariel Tseitlin

Overview

On Friday, June 29th, we experienced one of the most significant outages in over a year. It started at about 8 PM Pacific Time and lasted for about three hours, affecting Netflix members in the Americas. We’ve written frequently about our resiliency efforts and our experience with the Amazon cloud. In the past, we’ve been able to withstand Amazon Web Services (AWS) availability zone outages with minimal impact. We wanted to take this opportunity to share our findings about why this particular zone outage had such an impact.

For background, you can read about Amazon’s root-cause analysis of their outage here: http://aws.amazon.com/message/67457/.  The short version is that one of Amazon’s Availability Zones (AZs) failed on Friday evening due to a power outage that was caused by a severe storm.  Power was restored 20 minutes later. However, the Elastic Load Balancing (ELB) service suffered from capacity problems and an API backlog, which slowed recovery.

Our own root-cause analysis uncovered some interesting findings, including an edge-case in our internal mid-tier load-balancing service. This caused unhealthy instances to fail to deregister from the load-balancer which black-holed a large amount of traffic into the unavailable zone. In addition, the network calls to the instances in the unavailable zone were hanging, rather than returning no route to host.

As part of this outage we have identified a number of things that both we and Amazon can do better, and we are working with them on improvements.

Middle-tier Load Balancing

In our middle tier load-balancing, we had a cascading failure that was caused by a feature we had implemented to account for other types of failures. The service that keeps track of the state of the world has a fail-safe mode where it will not remove unhealthy instances in the event that a significant portion appears to fail simultaneously. This was done to deal with network partition events and was intended to be a short-term freeze until someone could investigate the large-scale issue. Unfortunately, getting out of this state proved both cumbersome and time consuming, causing services to continue to try and use servers that were no longer alive due to the power outage

Gridlock

Clients trying to connect to servers that were no longer available led to a second-order issue.  All of the client threads were taken up by attempted connections and there were very few threads that could process requests. This essentially caused gridlock inside most of our services as they tried to traverse our middle-tier. We are working to make our systems resilient to these kinds of edge cases. We continue to investigate why these connections were timing out during connect, rather than quickly determining that there was no route to the unavailable hosts and failing quickly.

Summary

Netflix made the decision to move from the data center to the cloud several years ago [1].  While it’s easy and common to blame the cloud for outages because it’s outside of our control, we found that our overall availability over the past several years has steadily improved. When we dig into the root-causes of our biggest outages, we find that we can typically put in resiliency patterns to mitigate service disruption.

There were aspects of our resiliency architecture that worked well:
  • Regional isolation contained the problem to users being served out of the US-EAST region.  Our European members were unaffected.
  • Cassandra, our distributed cloud persistence store which is distributed across all zones and regions, dealt with the loss of one third of its regional nodes without any loss of data or availability.
  • Chaos Gorilla, the Simian Army member tasked with simulating the loss of an availability zone, was built for exactly this purpose.  This outage highlighted the need for additional tools and use cases for both Chaos Gorilla and other parts of the Simian Army.
The state of the cloud will continue to mature and improve over time.  We’re working closely with Amazon on ways that they can improve their systems, focusing our efforts on eliminating single points of failure that can cause region-wide outages and isolating the failures of individual zones.

We take our availability very seriously and strive to provide an uninterrupted service to all our members. We’re still bullish on the cloud and continue to work hard to insulate our members from service disruptions in our infrastructure.

We’re continuing to build up our Cloud Operations and Reliability Engineering team, which works on exactly the types of problems identified above, as well as each service team to deal with resiliency.  Take a look at jobs.netflix.com for more details and apply directly or contact @atseitlin if you’re interested.


[1] http://techblog.netflix.com/2010/12/four-reasons-we-choose-amazons-cloud-as.html