Monday, December 31, 2012

A Closer Look At The Christmas Eve Outage

by Adrian Cockcroft

Netflix streaming was impacted on Christmas Eve 2012 by problems in the Amazon Web Services (AWS) Elastic Load Balancer (ELB) 
service that routes network traffic to the Netflix services supporting streaming. The postmortem report by AWS can be read here.

We apologize for the inconvenience and loss of service. We’d like to explain what happened and how we continue to invest in higher availability solutions.

Partial Outage

The problems at AWS caused a partial Netflix streaming outage that started at around 12:30 PM Pacific Time on December 24 and grew in scope later that afternoon. The outage primarily affected playback on TV connected devices in the US, Canada and Latin America. Our service in the UK, Ireland and Nordic countries was not impacted.

Netflix uses hundreds of ELBs. Each one supports a distinct service or a different version of a service and provides a network address that your Web browser or streaming device calls. Netflix streaming has been implemented on over a thousand different streaming devices over the last few years, and groups of similar devices tend to depend on specific ELBs. Requests from devices are passed by the ELB to the individual servers that run the many parts of the Netflix application. Out of hundreds of ELBs in use by Netflix, a handful failed, losing their ability to pass requests to the servers behind them. None of the other AWS services failed, so our applications continued to respond normally whenever the requests were able to get through.

The Netflix Web site remained up throughout the incident, supporting sign up of new customers and streaming to Macs and PCs, although at times with higher latency and a likelihood of needing to retry. Over-all streaming playback via Macs and PCs was only slightly reduced from normal levels. A few devices also saw no impact at all as those devices have an ELB configuration that kept running throughout the incident, providing normal playback levels.

At 12:24 PM Pacific Time on December 24 network traffic stopped on a few ELBs used by a limited number of streaming devices. At around 3:30 PM on December 24, network traffic stopped on additional ELBs used by game consoles, mobile and various other devices to start up and load lists of TV shows and movies. These ELBs were patched back into service by AWS at around 10:30 PM on Christmas Eve, so game consoles etc. were impacted for about seven hours. Most customers were fully able to use the service again at this point. Some additional ELB cleanup work continued until around 8 am on December 25th, when AWS finished restoring service to all the ELBs in use by Netflix, and all devices were streaming again.

Even though Netflix streaming for many devices was impacted, this wasn't an immediate blackout. Those devices that were already running Netflix when the ELB problems started were in many cases able to continue playing additional content.

Christmas Eve is traditionally a slow Netflix night as many members celebrate with families or spend Christmas Eve in other ways than watching TV shows or movies. We see significantly higher usage on Christmas Day and increased streaming rates continue until customers go back to work or school.  While we truly regret the inconvenience this outage caused our customers on Christmas Eve, we were also fortunate to have Netflix streaming fully restored before a much higher number of our customers would have been affected.

What Broke And What Should We Do About It

In its postmortem on the outage, AWS reports that ...data was deleted by a maintenance process that was inadvertently run against the production ELB state data”. This caused data to be lost in the ELB service back end, which in turn caused the outage of a number of ELBs in the US-East region across all availability zones starting at 12:24 PM on December 24.

The problem spread gradually, causing broader impact until At 5:02 PM PST, the team disabled several of the ELB control plane workflows”.

The AWS team had to restore the missing state data from backups, which took all night. By 5:40 AM PST ... the new ELB state data had been verified.”. AWS has put safeguards in place against this particular failure, and also says We are confident that we could recover ELB state data in a similar event significantly faster”.

Netflix is designed to handle failure of all or part of a single availability zone in a region as we run across three zones and operate with no loss of functionality on two.  We are working on ways of extending our resiliency to handle partial or complete regional outages.

Previous AWS outages have mostly been at the availability zone level, and we’re proud of our track record in terms of up time, including our ability to keep Netflix streaming running while other AWS hosted services are down.

Our strategy so far has been to isolate regions, so that outages in the US or Europe do not impact each other.

It is still early days for cloud innovation and there is certainly more to do in terms of building resiliency in the cloud.
In 2012 we started to investigate running Netflix in more than one AWS region and got a better gauge on the complexity and investment needed to make these changes.

We have plans to work on this in 2013. It is an interesting and hard problem to solve, since there is a lot more data that will need to be replicated over a wide area and the systems involved in switching traffic between regions must be extremely reliable and capable of avoiding cascading overload failures. Naive approaches could have the downside of being more expensive, more complex and cause new problems that might make the service less reliable. Look for upcoming blog posts as we make progress in implementing regional resiliency.

As always, we are hiring the best engineers we can find to work on these problems, and are open sourcing the solutions we develop as part of our platform.

Happy New Year and best wishes for 2013.