By Jeremy Edberg & Ariel Tseitlin
On Monday, October 22nd, Amazon experienced a service degradation. It was highly publicized due to the fact that it took down many popular websites for multiple hours. Netflix however, while not completely unscathed, handled the outage with very little customer impact. We did some things well and could have done some things better, and we'd like to share with you the timeline of the outage from our perspective and some of the best practices we used to minimize customer impact.
On Monday, just after 8:30am, we noticed that a couple of large websites that are hosted on Amazon were having problems and displaying errors. We took a survey of our own monitoring system and found no impact to our systems. At 10:40am, Amazon updated their status board showing degradation in the EBS service. Since Netflix focuses on making sure services can handle individual instance failure and since we avoid using EBS for data persistence, we still did not see any impact to our service.
At around 11am, some Netflix customers started to have intermittent problems. Since much of our client software is designed for resilience against intermittent server problems, most customers did not notice. At 11:15am, the problem became significant enough that we opened an internal alert and began investigating the cause of the problem. At the time, the issue was exhibiting itself as a network issue, not an EBS issue, which caused some initial confusion. We should have opened an alert earlier, which would have helped us narrow down the issue faster and let us remediate sooner.
When we were able to narrow down the network issue to a single zone, Amazon was also able to confirm that the degradation was limited to a single Availability Zone. Once we learned the impact was isolated to one AZ, we began evacuating the affected zone.
Due to previous single zone outages, one of the drills we run is a zone evacuation drill. Between the zone evacuation drill and our learnings from previous outages, the decision to evacuate the troubled zone was an easy one -- we expected it to be as quick and painless as it was during past drills. So that is what we did.
In the past we identified zone evacuations as a good way of solving problems isolated to a single zone and as such have made it easy in Asgard to do this with a few clicks per application. That preparation came in handy on Monday when we were able to evacuate the troubled zone in just 20 minutes and completely restore service to all customers.
Building for High Availability
We’ve developed a few patterns for improving the availability of our service.
Past outages and a mindset for designing in resiliency at the start have taught us a few best-practices about building high availability systems.
One of the most important things that we do is we build all of our software to operate in three Availability Zones. Right along with that is making each app resilient to a single instance failing. These two things together are what made zone evacuation easier for us. We stopped sending traffic to the affected zone and everything kept running. In some cases we needed to actually remove the instances from the zone, but this too was done with a just a few clicks to reconfigure the auto scaling group.
We apply the same three Availability Zone redundancy model to our Cassandra clusters. We configure all our clusters to use a replication factor of three, with each replica located in a different Availability Zone. This allowed Cassandra to handle the outage remarkably well. When a single zone became unavailable, we didn't need to do anything. Cassandra routed requests around the unavailable zone and when it recovered, the ring was repaired.
Everyone has the best intentions when building software. Good developers and architects think about error handling, corner cases, and building resilient systems. However, thinking about them isn’t enough. To ensure resiliency on an ongoing basis, you need to alway test your system’s capabilities and its ability to handle rare events. That’s why we built the Simian Army: Chaos Monkey to test resilience to instance failure, Latency Monkey to test resilience to network and service degradation, and Chaos Gorilla to test resilience to zone outage. A future improvement we want to make is expanding the Chaos Gorilla to make zone evacuation a one-click operation, making the decision even easier. Once we build up our muscles further, we want to introduce Chaos Kong to test resilience to a complete regional outage.
Easy tooling & Automation
The last thing that made zone evacuation an easy decision is our cloud management tool, known as Asgard . With just a couple of clicks, service owners are able to stop the traffic to the instances or delete the instances as necessary.
Every time we have an outage, we make sure that we have an incident review. The purpose of these reviews is not to place blame, but to learn what we can do better. After each incident we put together a detailed timeline and then ask ourselves, “What could we have done better? How could we lessen the impact next time? How could we have detected the problem sooner?” We then take those answers and try to solve classes of problems instead of just the most recent problem. This is how we develop our best practices.
We weathered this last AWS outage quite well and learned a few more lessons to improve on. With each outage, we look for opportunities to improve both the way our system is built and the way we detect and react to failure. While we feel we’ve built a highly available and reliable service, there’s always room to grow and improve.
If you like thinking about high availability and how to build more resilient systems, we have many openings throughout the company, including a few Site Reliability Engineering positions.