Several years ago we introduced a tool called Chaos Monkey. This service pseudo-randomly plucks a server from our production deployment on AWS and kills it. At the time we were met with incredulity and skepticism. Are we crazy? In production?!?
Our reasoning was sound, and the results bore that out. Since we knew that server failures are guaranteed to happen, we wanted those failures to happen during business hours when we were on hand to fix any fallout. We knew that we could rely on engineers to build resilient solutions if we gave them the context to *expect* servers to fail. If we could align our engineers to build services that survive a server failure as a matter of course, then when it accidentally happened it wouldn’t be a big deal. In fact, our members wouldn’t even notice. This proved to be the case.
Building on the success of Chaos Monkey, we looked at an extreme case of infrastructure failure. We built Chaos Kong, which doesn’t just kill a server. It kills an entire AWS Region1.
It is very rare that an AWS Region becomes unavailable, but it does happen. This past Sunday (September 20th, 2015) Amazon’s DynamoDB service experienced an availability issue in their US-EAST-1 Region. That instability caused more than 20 additional AWS services that are dependent on DynamoDB to fail. Some of the Internet’s biggest sites and applications were intermittently unavailable during a six- to eight-hour window that day.
Below is a chart of our video play metrics during a Chaos Kong exercise. These are three views of the same eight hour window. The top view shows the aggregate metric, while the bottom two show the same metric for the west region and the east region, respectively.
|Chaos Kong exercise in progress|
In the bottom row, you can clearly see traffic evacuate from the west region. The east region gets a corresponding bump in traffic as it steps up to play the role of savior. During the exercise, most of our attention stays focused on the top row. As long as the aggregate metric follows that relatively smooth trend, we know that our system is resilient to the failover. At the end of the exercise, you see traffic revert to the west region, and the aggregate view shows that our members did not experience an adverse effects. We run Chaos Kong exercises like this on a regular basis, and it gives us confidence that even if an entire region goes down, we can still serve our customers.
ADVANCING THE MODEL
We looked around to see what other engineering practices could benefit from these types of exercises, and we noticed that Chaos meant different things to different people. In order to carry the practice forward, we need a best-practice definition, a model that we can apply across different projects and different departments to make our services more resilient.
We want to capture the value of these exercises in a methodology that we can use to improve our systems and push the state of the art forward. At Netflix we have an extremely complex distributed system (microservice architecture) with hundreds of deploys every day. We don’t want to remove the complexity of the system; we want to thrive on it. We want to continue to accelerate flexibility and rapid development. And with that complexity, flexibility, and rapidity, we still need to have confidence in the resiliency of our system.
To have our cake and eat it too, we set out to develop a new discipline around Chaos. We developed an empirical, systems-based approach which addresses the chaos inherent in distributed systems at scale. This approach specifically builds confidence in the ability of those systems to withstand realistic conditions. We learn about the behavior of a distributed system by observing it in a controlled experiment, and we use those learnings to fortify our systems before any systemic effect can disrupt the quality service that we provide our customers. We call this new discipline Chaos Engineering.
We have published the Principles of Chaos Engineering as a living document, so that other organizations can contribute to the concepts that we outline here.
We put these principles into practice. At Netflix we have a microservice architecture. One of our services is called Subscriber, which handles certain user management activities and authentication. It is possible that under some rare or even unknown situation Subscriber will be crippled. This might be due to network errors, under-provisioning of resources, or even by events in downstream services upon which Subscriber depends. When you have a distributed system at scale, sometimes bad things just happen that are outside any person’s control. We want confidence that our service is resilient to situations like this.
We have a steady-state definition: Our metric of interest is customer engagement, which we measure as the number of video plays that start each second. In some experiments we also look at load average and error rate on an upstream service (API). The lines that those metrics draw over time are predictable, and provide a good proxy for the steady-state of the system. We have a hypothesis: We will see no significant impact on our customer engagement over short periods of time on the order of an hour, even when Subscriber is in a degraded state. We have variables: We add latency of 30ms first to 20% then to 50% of traffic from Subscriber to its primary cache. This simulates a situation in which the Subscriber cache is over-stressed and performing poorly. Cache misses increase, which in turn increases load on other parts of the Subscriber service. Then we look for a statistically significant deviation between the variable group and the control group with respect to the system’s steady-state level of customer engagement.
If we find a deviation from steady-state in our variable group, then we have disproved our hypothesis. That would cause us to revisit the fallbacks and dependency configuration for Subscriber. We would undertake a concerted effort to improve the resiliency story around Subscriber and the services that it touches, so that customers can count on our service even when Subscriber is in a degraded state.
If we don’t find any deviation in our variable group, then we feel more confident in our hypothesis. That translates to having more confidence in our service as a whole.
In this specific case, we did see a deviation from steady-state when 30ms latency was added to 50% of the traffic going to this service. We identified a number of steps that we could take, such as decreasing the thread pool count in an upstream service, and subsequent experiments have confirmed the bolstered resiliency of Subscriber.
We started Chaos Monkey to build confidence in our highly complex system. We don’t have to simplify or even understand the system to see that over time Chaos Monkey makes the system more resilient. By purposefully introducing realistic production conditions into a controlled run, we can uncover weaknesses before they cause bigger problems. Chaos Engineering makes our system stronger, and gives us the confidence to move quickly in a very complex system.
This blog post is part of a series. In the next post on Chaos Engineering, we will take a deeper dive into the Principles of Chaos Engineering and hypothesis building with additional examples from our production experiments. If you have thoughts on Chaos Engineering or how to advance the state of the art in this field, we’d love to hear from you. Feel free to reach out to firstname.lastname@example.org.
-Chaos Team at Netflix
Ali Basiri, Lorin Hochstein, Abhijit Thosar, Casey Rosenthal
1. Technically, it only simulates killing an AWS Region. For our purposes, simulating this giant infrastructure failure is sufficient, and AWS doesn’t yet provide us with a way of turning off an entire region. ;-) ↩