Years ago, we decided to improve the resiliency of our microservice architecture. At our scale it is guaranteed that servers on our cloud platform will sometimes suddenly fail or disappear without warning. If we don’t have proper redundancy and automation, these disappearing servers could cause service problems.
The Freedom and Responsibility culture at Netflix doesn’t have a mechanism to force engineers to architect their code in any specific way. Instead, we found that we could build strong alignment around resiliency by taking the pain of disappearing servers and bringing that pain forward. We created Chaos Monkey to randomly choose servers in our production environment and turn them off during business hours. Some people thought this was crazy, but we couldn’t depend on infrequent, naturally occurring failures to change engineering behavior. Knowing that terminations would happen on a frequent basis created strong alignment among our engineers to build in the redundancy and automation to survive this type of incident without any impact to the millions of Netflix members around the world.
We value Chaos Monkey as a highly effective tool for improving the quality of our service. Now Chaos Monkey has evolved. We rewrote the service for improved maintainability and added some great new features. The evolution of Chaos Monkey is part of our commitment to keep our open source software up to date with our current environment and needs.
Chaos Monkey 2.0 is fully integrated with Spinnaker, our continuous delivery platform.
Service owners set their Chaos Monkey configs through the Spinnaker apps, Chaos Monkey gets information about how services are deployed from Spinnaker, and Chaos Monkey terminates instances through Spinnaker.
Since Spinnaker works with multiple cloud backends, Chaos Monkey does as well. In the Netflix environment, Chaos Monkey terminates virtual machine instances running on AWS and Docker containers running on Titus, our container cloud.
Integration with Spinnaker gave us the opportunity to improve the UX as well. We interviewed our internal customers and came up with a more intuitive method of scheduling terminations. Service owners can now express a schedule in terms of the mean time between terminations, rather than a probability over an arbitrary period of time. We also added grouping by app, stack, or cluster, so that applications with different redundancy architectures can schedule Chaos Monkey appropriately for their configuration. Chaos Monkey now also supports exceptions, so users can opt specific clusters out. Some engineers at Netflix use this feature to opt out small clusters that are used for testing.
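To make the scheduling model concrete, here is a minimal sketch (not Chaos Monkey’s actual code; the function names are ours) of how a mean-time-between-terminations setting can translate into a per-workday termination decision for a group:

```python
import random

def daily_termination_probability(mean_workdays_between_kills: float) -> float:
    """If terminations happen on average once every N workdays,
    each workday carries roughly a 1/N chance of a termination."""
    if mean_workdays_between_kills < 1:
        raise ValueError("mean time between kills must be at least 1 workday")
    return 1.0 / mean_workdays_between_kills

def should_terminate_today(mean_workdays_between_kills: float, rng=random) -> bool:
    """Flip a biased coin once per workday for each group (app, stack, or cluster)."""
    return rng.random() < daily_termination_probability(mean_workdays_between_kills)
```

With a mean of five workdays, each group sees about one termination per business week — a far more intuitive knob than a raw probability over an arbitrary window.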
|Chaos Monkey Spinnaker UI|
Chaos Monkey can now be configured to notify trackers: external services that receive a notification whenever Chaos Monkey terminates an instance. Internally, we use this feature to report metrics into Atlas, our telemetry platform, and Chronos, our event tracking system. The graph below, taken from the Atlas UI, shows the number of Chaos Monkey terminations for a segment of our service. We can see chaos in action. Chaos Monkey even periodically terminates itself.
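The tracker hook can be pictured as a simple fan-out: every configured tracker gets the same termination event. The sketch below is illustrative only — the names `Termination`, `Tracker`, and `notify_trackers` are our own, not Chaos Monkey’s real API:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List, Protocol

@dataclass
class Termination:
    app: str
    instance_id: str
    region: str
    killed_at: datetime

class Tracker(Protocol):
    def track(self, event: Termination) -> None: ...

class InMemoryTracker:
    """Stand-in for a real sink such as a metrics or event-tracking system."""
    def __init__(self) -> None:
        self.events: List[Termination] = []

    def track(self, event: Termination) -> None:
        self.events.append(event)

def notify_trackers(trackers: Iterable[Tracker], event: Termination) -> None:
    # Fan the termination event out to every configured tracker.
    # One misbehaving tracker should not block the others.
    for tracker in trackers:
        try:
            tracker.track(event)
        except Exception:
            pass  # a real implementation would log the failure
```

In our setup, one tracker would feed a telemetry system like Atlas and another an event store like Chronos; decoupling them this way is what lets a single termination show up in both.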
|Chaos Monkey termination metrics in Atlas|
Netflix only uses Chaos Monkey to terminate instances. Previous versions of Chaos Monkey allowed the service to ssh into a box and perform other actions like burning up CPU, taking disks offline, etc. If you currently use one of the prior versions of Chaos Monkey to run an experiment that involves anything other than turning off an instance, you may not want to upgrade since you would lose that functionality.
We also used this opportunity to introduce many small features such as automatic opt-out for canaries, cross-account terminations, and automatic disabling during an outage. Find the code on the Netflix github account and embrace the chaos!
-Chaos Engineering Team at Netflix
Lorin Hochstein, Casey Rosenthal