On Sept 25th, 2014 AWS notified users about an EC2 Maintenance where “a timely security and operational update” needed to be performed that required rebooting a large number of instances. (around 10%) On Oct 1st, 2014 AWS sent an updated about the status of the reboot and XSA-108.
While we’d love to claim that we weren’t concerned at all given our resilience strategy, the reality was that we were on high alert given the potential of impact to our services. We discussed different options, weighed the risks and monitored our services closely. We observed that our systems handled the reboots extremely well with the resilience measures we had in place. These types of unforeseen events reinforce regular, controlled chaos and continued to invest in chaos engineering is necessary. In fact, Chaos Monkey was mentioned as a best practice in the latest EC2 Maintenance update.
Our commitment to induced chaos testing helps drive resilience, but it definitely isn’t trivial or easy; especially in the case of stateful systems like Cassandra. The Cloud Database Engineering team at Netflix rose to the challenge to embrace chaos and runs chaos monkey live in production last year. The number of nodes rebooted served as true battle testing for the resilience design measures created to operate cassandra.
Monkeying with the Database
Databases have long been the pampered and spoiled princes of the application world. They received the best hardware, copious amounts of personalized attention and no one would ever dream of purposely mucking around with them. In the world of democratized Public Clouds, this is no longer possible. Node failures are not just probable, they are expected. This requires database technology that can withstand failure and continue to perform.
Cassandra, Netflix’s database of choice, straddles the AP (Availability, Partition Tolerance) side of the CAP theorem. By trading away C (Consistency), we’ve made a conscious decision to design our applications with eventual consistency in mind. Our expectation is that Cassandra would live up to its side of the bargain and provide strong availability and partition tolerance. Over the years, it had demonstrated fairly good resilience to failure. However, it required lots of human intervention.
Last year we decided to invest in automating the recovery of failed Cassandra nodes. We were able to detect and determine a failed node. With the cloud APIs afforded to us by AWS, we can identify the location of the failed node and programmatically initiate the replacement and bootstrap of a new Cassandra node. This gave us the confidence to have Cassandra participate in our Chaos Monkey exercises.
It wasn’t perfect at first, but then again, what is? In true Netflix fashion, we failed fast and fixed forward. Over the next few months, our automation got better. There were less false positives, and our remediation scripts required almost no more human intervention.
“When we got the news about the emergency EC2 reboots, our jaws dropped. When we got the list of how many Cassandra nodes would be affected, I felt ill. Then I remembered all the Chaos Monkey exercises we’ve gone through. My reaction was, “Bring it on!”.” - Christos Kalantzis - Engineering Manager, Cloud Database Engineering
That weekend our on-call staff was exceptionally vigilant. The whole Cloud Database Engineering team was on high alert. We have confidence in our automation but a prudent team prepares for the worst and hopes for the best.
Out of our 2700+ production Cassandra nodes, 218 were rebooted. 22 Cassandra nodes were on hardware that did not reboot successfully. This led to those Cassandra nodes not coming back online. Our automation detected the failed nodes and replaced them all, with minimal human intervention. Netflix experienced 0 downtime that weekend.
Repeatedly and regularly exercising failure, even in the persistence layer, should be part of every company’s resilience planning. If it wasn’t for Cassandra’s participation in Chaos Monkey, this story would have ended much differently.