Introducing Chaos to C*

Netflix Technology Blog, published in Netflix TechBlog, Oct 10, 2013

by Christos Kalantzis, Minh Do, Homajeet Cheema, and Roman Vasilyev

One of the practices that sets Netflix apart from most companies is the belief that you can only know how good your software stack is by trying to make it fail. We’ve blogged about Chaos Monkey and how it helps identify deficiencies in your software stack. Netflix has implemented Chaos Monkey on our mid-tier stateless systems to great success.

We are pleased to announce that the Cloud Database Engineering (CDE) team has turned on Chaos Monkey on our Production C* Clusters.

How did we do it?

At the heart of being able to introduce Chaos Monkey to C* are 3 things:

  1. Apache Cassandra’s Highly Available (HA) Architecture
  2. Reliable Monitoring
  3. Automatic Remediation

Apache Cassandra’s HA Architecture

Within the CAP theorem, C*’s shared-nothing architecture and data replication make it excel at AP (availability and partition tolerance). Because every piece of data lives on more than one node, we can “lose” C* nodes without affecting the overall usability of the C* Cluster.
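To make the replication argument concrete, here is a minimal sketch of the arithmetic behind "losing" a node. The replication factor of 3 and the use of quorum reads and writes are assumptions chosen for illustration; this post does not state which settings our clusters use, and the function names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a quorum read or write to succeed."""
    return replication_factor // 2 + 1


def tolerable_replica_losses(replication_factor: int, required_acks: int) -> int:
    """How many replicas of a given row can be down while requests still succeed."""
    if not 1 <= required_acks <= replication_factor:
        raise ValueError("required_acks must be between 1 and replication_factor")
    return replication_factor - required_acks


if __name__ == "__main__":
    rf = 3  # assumed replication factor, for illustration only
    needed = quorum(rf)
    print(f"RF={rf}: quorum needs {needed} replicas, "
          f"so {tolerable_replica_losses(rf, needed)} replica(s) can be down")
```

Under these assumed settings, one replica of any row can disappear and quorum requests still succeed, which is the property that makes it reasonable to let Chaos Monkey terminate a node.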

Reliable Monitoring

The CDE team has gone to great lengths to understand the inner workings of C*, how to expose its metrics, and how to detect the state of our clusters. We’ve used this knowledge to build reliable monitoring that can determine the real-time state of our C* Clusters. It can also distinguish between a transient AWS network issue and a genuinely lost node that needs to be handled.
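One simple way to separate a transient network blip from a lost node is to require that a node stay unreachable for a sustained period before declaring it lost. The sketch below illustrates that idea only; it is not our monitoring system, and the class names, states, and the 300-second threshold are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class NodeState(Enum):
    UP = "up"
    SUSPECT = "suspect"  # unreachable, but possibly a transient network issue
    LOST = "lost"        # unreachable long enough to need remediation


@dataclass
class NodeHealth:
    # Hypothetical threshold: how long a node must stay unreachable
    # before it is treated as lost rather than temporarily cut off.
    lost_after_seconds: float = 300.0
    first_failure_at: Optional[float] = None

    def observe(self, reachable: bool, now: float) -> NodeState:
        """Record one health-check result taken at time `now` (in seconds)."""
        if reachable:
            # Any successful check clears the failure streak.
            self.first_failure_at = None
            return NodeState.UP
        if self.first_failure_at is None:
            self.first_failure_at = now
        if now - self.first_failure_at >= self.lost_after_seconds:
            return NodeState.LOST
        return NodeState.SUSPECT
```

A node that fails a check and then recovers goes from SUSPECT back to UP without triggering anything, while a node that stays unreachable past the threshold is reported as LOST.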

Automatic Remediation

One of the core values of Netflix is that all developers are responsible for operating their code. Since my developers and I are by nature lazy and like to sleep at night, we’ve developed automation to handle some of the most common states our C* Clusters face. One of those states is a node being down. When that happens, our automatic remediation system initiates a node replacement. Once the replacement node has finished bootstrapping its data, the C* Cluster is once again at full strength.
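The remediation step described above can be sketched as a small function: given the monitored state of each node, trigger a replacement for every node reported as lost. This is an illustrative outline under stated assumptions, not our actual tooling; the state strings and the `replace_node` callback are hypothetical stand-ins for whatever the monitoring and provisioning systems provide.

```python
from typing import Callable, Dict, List


def remediate(node_states: Dict[str, str],
              replace_node: Callable[[str], None]) -> List[str]:
    """Initiate a replacement for each node whose state is 'lost'.

    `replace_node` is a hypothetical callback that starts a replacement
    instance, which then bootstraps its data from the remaining replicas.
    Returns the IDs of the nodes for which a replacement was requested.
    """
    replaced = []
    for node_id, state in sorted(node_states.items()):
        if state == "lost":
            replace_node(node_id)
            replaced.append(node_id)
    return replaced


if __name__ == "__main__":
    states = {"node-a": "up", "node-b": "lost", "node-c": "suspect"}
    requested = remediate(states, lambda node_id: print(f"replacing {node_id}"))
    print("replacement requested for:", requested)
```

Nodes that are merely suspect are left alone, so a transient network issue does not cause an unnecessary replacement.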

Workflow

Here is a representation of our workflow:

Netflix strongly believes that testing failure scenarios in production is the most reliable way to gain confidence in your software stack. If creating such automation and monitoring is right up your alley, visit jobs.netflix.com to join the CDE team.

Originally published at techblog.netflix.com on October 10, 2013.

