Monday, July 30, 2012

Chaos Monkey Released Into The Wild

We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient. We are excited to make a long-awaited announcement today that will help others who embrace this approach.
We have written about our Simian Army in the past and we are now proud to announce that the source code for the founding member of the Simian Army, Chaos Monkey, is available to the community.
Do you think your applications can handle a troop of mischievous monkeys loose in your infrastructure? Now you can find out.

What is Chaos Monkey?

Chaos Monkey is a service that runs in Amazon Web Services (AWS), seeks out Auto Scaling Groups (ASGs), and terminates instances (virtual machines) per group. The software design is flexible enough to work with other cloud providers or instance groupings and can be enhanced to add that support. The service has a configurable schedule that, by default, runs on non-holiday weekdays between 9am and 3pm. In most cases, we have designed our applications to continue working when an instance goes offline, but in those special cases where they don't, we want to make sure there are people around to resolve and learn from any problems. With this in mind, Chaos Monkey only runs within a limited set of hours, with the intent that engineers will be alert and able to respond.
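To make the default schedule concrete, here is a minimal sketch in Python of a "non-holiday weekdays between 9am and 3pm" check. It is an illustration only, not Chaos Monkey's actual code: the function name, the holiday set, and the choice to treat 3pm as an exclusive end are all assumptions.

```python
from datetime import date, datetime, time
from typing import AbstractSet

# Default window described in the text: 9am to 3pm.
# Treating the 3pm boundary as exclusive is an assumption of this sketch.
WINDOW_OPEN = time(9, 0)
WINDOW_CLOSE = time(15, 0)


def is_monkey_time(now: datetime, holidays: AbstractSet[date] = frozenset()) -> bool:
    """Return True if `now` is a non-holiday weekday inside the window.

    `holidays` is a hypothetical set of dates supplied by the caller.
    """
    if now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    if now.date() in holidays:
        return False
    return WINDOW_OPEN <= now.time() < WINDOW_CLOSE
```

A scheduler could call a check like this before each run and skip the run whenever it returns False, so terminations only happen while engineers are around to respond.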

Why Run Chaos Monkey?

Failures happen, and they inevitably happen when least desired or expected. If your application can't tolerate an instance failure, would you rather find out by being paged at 3am or when you're in the office and have had your morning coffee? Even if you are confident that your architecture can tolerate an instance failure, are you sure it will still be able to next week? How about next month? Software is complex and dynamic, and that "simple fix" you put in place last week could have undesired consequences. Do your traffic load balancers correctly detect and route requests around instances that go offline? Can you reliably rebuild your instances? Perhaps an engineer "quick patched" an instance last week and forgot to commit the changes to your source repository?
There are many failure scenarios that Chaos Monkey helps us detect. Over the last year Chaos Monkey has terminated over 65,000 instances running in our production and testing environments. Most of the time nobody notices, but Chaos Monkey continues to surface surprises, which lets us isolate and resolve the underlying problems so they don't happen again.

Auto Scaling Groups

The default instance grouping that Chaos Monkey uses for selection is Amazon's Auto Scaling Group (ASG). Within an ASG, Chaos Monkey will select an instance at random and terminate it. The ASG should detect the instance termination and automatically bring up a new, identically configured instance. If you are not using Auto Scaling Groups, adopting them should be the first step toward making your application handle these isolated instance failure scenarios. Check out Asgard to make deploying and managing ASGs easy. ASGs offer many great features beyond replacing terminated instances, such as enabling the use of Amazon's Elastic Load Balancers (ELBs) to distribute traffic to all instances in your application. Netflix has a best practice that all instances should run within an ASG, and we have Janitor Monkey to remind us by terminating all instances that do not follow it.
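The selection step described above can be sketched in a few lines of Python. This is a hedged illustration rather than the real implementation: the `groups` mapping, the instance IDs, and the function name are hypothetical, and the actual termination call to the cloud provider is deliberately left out.

```python
import random
from typing import Dict, Mapping, Sequence


def pick_victims(groups: Mapping[str, Sequence[str]], rng: random.Random) -> Dict[str, str]:
    """Pick at most one instance per group, uniformly at random.

    `groups` maps a (hypothetical) group name to its instance IDs.
    Empty groups are skipped because there is nothing to terminate.
    """
    victims = {}
    for name, instances in groups.items():
        if instances:
            victims[name] = rng.choice(list(instances))
    return victims


# Example with made-up group names and instance IDs.
example_groups = {
    "api-v1": ["i-aaa", "i-bbb", "i-ccc"],
    "batch-v7": ["i-ddd"],
    "empty-v0": [],
}
chosen = pick_victims(example_groups, random.Random())
```

Passing the random number generator in as an argument keeps the selection easy to test with a seeded generator, while production use could rely on an unseeded one.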

Configuration

Chaos Monkey allows for an Opt-In or an Opt-Out model. At Netflix, we use the Opt-Out model, so if an application owner does nothing, Chaos Monkey will act on their application. Your organization can choose whichever model is right for it. The Opt-In model allows you to "test the water" and try out Chaos Monkey on a specific application to see how it reacts.
Not every application can trivially handle an instance going offline. Sometimes it takes a human to manually recover instances, perhaps exercising backups to bring them back. Ideally, engineers work toward making that process easier and faster, and eventually automatic. For those applications, there is the ability to Opt-Out of Chaos Monkey.
There is also a tunable "probability" that Chaos Monkey uses to control the chance of a termination. A probability of 1 (or 100%) will terminate one instance per day per ASG. If instance recovery is difficult and you only want a termination weekly, you can reduce the probability to 0.2 or 20% (daily is 100%, it runs 5 work days per week, so weekly is 20%). Note that this is still a probability and is only meaningful when sampled multiple times. With a 20% probability, Chaos Monkey would terminate one instance a week on average. In practice, it might be 2 days in a row followed by 2 weeks of no terminations, but given a large enough sample it will terminate weekly on average.
For an environment as large as Netflix, the configuration can get a bit tricky to manage, so we have developed a dashboard to help, which we hope to open source soon. You can read more about how to configure Chaos Monkey on the documentation wiki.
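The probability arithmetic above can be checked with a short Python sketch. It assumes one independent termination decision per work day per group, which is a simplification for illustration; the function names are hypothetical and are not part of Chaos Monkey's configuration.

```python
import random
from typing import List

WORK_DAYS_PER_WEEK = 5  # the service runs on weekdays only


def expected_terminations_per_week(probability: float) -> float:
    """Average terminations per week for one group at a daily probability."""
    return probability * WORK_DAYS_PER_WEEK


def simulate_weeks(probability: float, weeks: int, rng: random.Random) -> List[int]:
    """Simulate daily decisions and return the termination count per week."""
    return [
        sum(rng.random() < probability for _ in range(WORK_DAYS_PER_WEEK))
        for _ in range(weeks)
    ]


# With probability 0.2 the long-run average is about one per week,
# even though individual weeks may see zero or several terminations.
counts = simulate_weeks(0.2, 10_000, random.Random(42))
average = sum(counts) / len(counts)
```

Individual entries in `counts` vary from week to week, which matches the point that the setting only behaves like "weekly" when averaged over a large enough sample.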

REST

Currently, there is a simple REST interface that allows you to query Chaos Monkey termination events. We keep records of what was terminated and when, so if something disappears, you can see if Chaos Monkey was responsible. You could use this API to get notifications of terminations, but we encourage you to use a more general application monitoring solution like servo to discover what is happening to your applications at runtime.

Costs

The termination events are stored in an Amazon SimpleDB table by default. There could be costs associated with Amazon SimpleDB, but the activity of Chaos Monkey should be small enough to fall within Amazon's Free Usage Tier. Ultimately, the costs associated with running Chaos Monkey are your responsibility.

More Monkey Business

We have a long line of simians waiting to be released. The next likely candidate will be Janitor Monkey, which helps keep your environment tidy and your costs down. Stay tuned for more announcements.
If building tools to automate the operations and improve the reliability of the cloud sounds exciting, we're always looking for new members to join the team.  Take a look at jobs.netflix.com for current openings or contact @atseitlin.