Wednesday, September 10, 2014

Introducing Chaos Engineering

Chaos Monkey was launched in 2010 with our move to Amazon Web Services, and thus the Netflix Simian Army was born. Our ecosystem has evolved as we’ve introduced thousands of devices, many new countries, a Netflix-optimized CDN often referred to as OpenConnect, a growing catalog of Netflix Originals, and new and exciting UI advancements. Not only has complexity grown, but our infrastructure itself has grown to support our rapidly growing customer base. As growth and evolution continue, we will experience and find new failure modes.

Our philosophy around injecting failure into production to ensure our systems are fault-tolerant remains unchanged. We are constantly testing our ability to survive “once in a blue moon” failures. In a sign of our commitment to this philosophy, we want to double down on chaos, aka failure injection. We strive to mirror the failure modes that are possible in our production environment and simulate them under controlled circumstances. Our engineers are expected to write services that can withstand failures and degrade gracefully whenever necessary. By continuing to run these simulations, we are able to find and fix vulnerabilities in our ecosystem.

A great example of a new failure mode was the regional ELB outage we experienced on Christmas Eve 2012. At the time, the Simian Army only injected failures that we had understood and experienced up to that point. In response, we invested in a multi-region Active-Active infrastructure to be resilient to such events. It’s not enough to simply build a system that is fault-tolerant to region outages; we must regularly exercise our ability to withstand them.

Each outage reinforces our commitment to chaos to ensure the most reliable experience possible for our users. While much of the Simian Army is designed and built around maintaining our environments, Chaos Engineering is entirely focused on controlled failure injection.

The Plan for Chaos Engineering:

Establish Virtuous Chaos Cycles
A common industry practice around outages is the blameless post-mortem, a discipline we practice along with action items to prevent recurrence. In parallel with resilience patches and work to prevent recurrence, we also want to build new chaos tools that regularly and systematically test resilience to detect regressions or new conditions.

Regression testing is a well-understood discipline in software testing, but chaos testing for regressions in distributed systems at scale presents a unique challenge. We aspire to make chaos testing in production systems as well understood as other disciplines in software development.

Increase use of Reliability Design Patterns
In distributed environments, there’s a challenge in both creating reliability design patterns and integrating them in a consistent manner to handle failure. When an outage or new failure mode surfaces, it may start in a single service, but all services may be susceptible to the same failure mode. Post-mortems will lead to immediate action items for the particular service involved but do not always lead to improvements in other loosely coupled services. Eventually, other susceptible services are impacted by a failure condition that may have previously surfaced. Hystrix is a fantastic example of a reliability design pattern that helps to create consistency in our micro-services ecosystem.
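
To make the idea of a reliability design pattern concrete, here is a minimal sketch of a circuit breaker with a fallback, one of the patterns Hystrix is known for. This is not Hystrix's API or implementation (Hystrix is a Java library); the CircuitBreaker class, its parameters, and its thresholds are hypothetical names chosen only for illustration.

```python
import time


class CircuitBreaker:
    """Illustrative circuit breaker (hypothetical; not the Hystrix API).

    After `failure_threshold` consecutive failures the breaker opens and
    calls are answered by the fallback without touching the dependency.
    Once `reset_timeout` seconds have passed, one trial call is let through.
    """

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests can control time
        self.failures = 0           # consecutive failures seen so far
        self.opened_at = None       # time the breaker opened, or None if closed

    def call(self, func, fallback):
        # While open and still inside the timeout window, short-circuit.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()
            # Timeout elapsed: fall through and allow a single trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # open (or re-open) the breaker
            return fallback()
        # Success closes the breaker and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote dependency in its own breaker, for example breaker.call(fetch_recommendations, lambda: []), where fetch_recommendations is a stand-in for any remote call. Because every service uses the same wrapper, a failure mode discovered in one service is handled the same way in all of them.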

Anticipate Future Failure Modes
Ideally, distributed systems are designed to be so robust and fault-tolerant that they never fail. In practice, we must anticipate failure modes, determine ways to inject these conditions in a controlled manner, and evolve our reliability design patterns. Anticipating such events requires creativity and a deep understanding of distributed systems, two of the most critical characteristics of Chaos Engineers.
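
As a rough illustration of what injecting a failure condition "in a controlled manner" can mean, the sketch below wraps a call so that it fails at a configurable rate, only for explicitly targeted operations, and can be switched off instantly. It does not describe any Simian Army tool; FailureInjector, InjectedFailure, and the operation names are all hypothetical.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of the real call when a failure is injected."""


class FailureInjector:
    """Illustrative failure injector (hypothetical; not a Netflix tool).

    Failures are limited to named targets, happen at a bounded rate, and
    stop as soon as `enabled` is set to False.
    """

    def __init__(self, failure_rate, targets, rng=None):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self.failure_rate = failure_rate
        self.targets = set(targets)            # only these operations may fail
        self.rng = rng or random.Random()      # injectable for repeatable runs
        self.enabled = True                    # kill switch for the experiment

    def wrap(self, name, func):
        """Return a version of `func` that may raise InjectedFailure."""
        def wrapper(*args, **kwargs):
            should_fail = (
                self.enabled
                and name in self.targets
                and self.rng.random() < self.failure_rate
            )
            if should_fail:
                raise InjectedFailure(f"injected failure in {name}")
            return func(*args, **kwargs)
        return wrapper
```

The three controls (target list, failure rate, kill switch) are what keep the experiment bounded: the scope of the failure is decided up front, and the injection can be stopped the moment users are affected more than intended.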

New forms of chaos and reliability design patterns are two areas we are researching in Chaos Engineering. As we get deeper into our research, we will continue to post our findings.

For those interested in this challenging research, we’re hiring additional Chaos Engineers.  Check out the jobs for Chaos Engineering at our jobs site.

-Bruce Wong, Engineering Manager of Chaos Engineering at Netflix (sometimes referred to as Chaos Commander)