By Ruslan Meshenberg, Naresh Gopalani, Luke Kosewski
In June, we talked about Isthmus - our approach to achieve resiliency against region-wide ELB outage. After completing the Isthmus project we embarked on the next step in our quest for higher resiliency and availability - a full multi-regional Active-Active solution. This project is now complete, and Netflix is running Active-Active across the USA, so this post aims to highlight some of the interesting challenges and learnings we found along the way.
Failure - function of scale and speed.
In general, failure rates an organization is dealing with depend largely on 2 factors: scale of operational deployment and velocity of change. If both scale and speed are small, then most of the time things just work. Once scale starts to grow, even with slow velocity, the chance of hardware failure will increase. Conversely, even at small scale, if velocity is fast enough, chance of software failure will increase. Ultimately, if you’re running at scale and still pursuing high velocity - things will break all the time.
Of course, not all failures are created equal. The types of failures that we’ll focus on in this post are the most important and difficult to deal with - complete and prolonged service outages with unhappy customers flooding customer service lines, going to twitter to express their frustration, articles popping up across multiple publications announcing “service X is down!”.
At Netflix, our internal availability goals are 99.99% - which does not leave much time for our services to be down. So in addition to deploying our services across multiple instances and Availability Zones, we decided to deploy them across multiple AWS Regions as well. Complete regional infrastructure outage is extremely unlikely, but our pace of change sometimes breaks critical services in a region, and we wanted to make Netflix resilient to any of the underlying dependencies. In doing so, we’re leveraging the principles of Isolation and Redundancy: a failure of any kind in one Region should not affect services running in another, a networking partitioning event should not affect quality of service in either Region.
In a nutshell, Active-Active solution gets all the services on the user call path deployed across multiple AWS Regions - in this case US-East-1 in Virginia and US-West-2 in Oregon. In order to do so, several requirements must be satisfied
- Services must be stateless - all data / state replication needs to handled in data tier.
- They must access any resource locally in-Region. This includes resources like S3, SQS, etc. This means several applications that are publishing data into an S3 bucket, now have to publish the same data into multiple regional S3 buckets.
- there should not be any cross-regional calls on user’s call path. Data replication should be asynchronous.
In a normal state of operation, users would be geo-DNS routed to the closest AWS Region, with a rough split of 50/50%. In the event of any significant region-wide outage, we have tools to override geo-DNS and direct all of users traffic to a healthy Region.
There are several technical challenges in order to achieve such a setup:
- Effective tooling for directing traffic to the correct Region
- Traffic shaping and load shedding, to handle thundering herd events
- State / Data asynchronous cross-regional replication
DNS - Controlling user traffic with Denominator
We direct a user’s traffic to our services via a combination of UltraDNS and Route53 entries, our Denominator project provides a single client library and command line that controls multiple DNS providers. There are several reasons why we ended up using a combination of two:
- UltraDNS provides us ability to directionally route customers from different parts of North America to different regional endpoints. This feature is supported by other vendors including Dyn, but is not supported in Route53. We didn’t want to use a latency based routing mechanism because it could cause unpredictable traffic migration effects.
- By using Route53 layer between UltraDNS and ELBs, we have an additional ability to switch user traffic, and the Route53 API provides reliable and fast configuration changes that are not a strong point for other DNS vendor APIs.
- Switching traffic using a separate Route53 layer makes such change much more straightforward. Instead of moving territories with directional groups, we just move Route53 CNAMEs.
Zuul - traffic shaping
We recently talked about Zuul in great detail, as we opened this component to the community in June 2013. All of Netflix Edge services are fronted with the Zuul layer. Zuul provides us with resilient and robust ways to direct traffic to appropriate service clusters, ability to change such routing at runtime, and an ability to decide whether we should shed some of the load to protect downstream services from being over-run.
We had to enhance Zuul beyond it’s original set of capabilities in order to enable Active-Active use cases and operational needs. The enhancements were in several areas:
- Ability to identify and handle mis-routed requests. User request is defined as mis-routed if it does not conform to our geo directional records. This ensures a single user device session does not span multiple regions. We also have controls for whether to use “isthmus” mode to send mis-routed requests to the correct AWS Region, or to return a redirect response that will direct clients to the correct Region.
- Ability to declare a region in a “failover” mode - this means it will no longer attempt to route any mis-routed requests to another region, and instead will handle them locally
- Ability to define a maximum traffic level at any point in time, so that any additional requests will be automatically shed (response == error), in order to protect downstream services against a thundering herd of requests. Such ability is absolutely necessary in order to protect services that are still scaling up in order to meet growing demands, or when regional caches are cold, so that the underlying persistence layer does not get overloaded with requests.
All of these capabilities provide us with a powerful and flexible toolset to manage how we handle user’s traffic in both stable state as well as failover situations.
Replicating the data - Cassandra and EvCache
One of the more interesting challenges in implementing Active-Active, was the replication of users' data/state. Netflix has embraced Apache Cassandra as our scalable and resilient NoSQL persistence solution. One of inherent capabilities of Apache Cassandra is the product's multi-directional and multi-datacenter (multi-region) asynchronous replication. For this reason all data read and written to fulfil users' requests is stored in Apache Cassandra.
Netflix has operated multi-region Apache Cassandra clusters in US-EAST-1 and EU-WEST-1 before Active-Active. However, most of the data stored in those clusters, although eventually replicated to the other region, was mostly consumed in the region it was written in using consistency levels of CL_LOCAL_QUORUM and CL_ONE. Latency was not such a big issue in that use case. Active-Active changes that paradigm. With requests possibly coming in from either US region, we need to make sure that the replication of data happens within an acceptable time threshold. This lead us to perform an experiment where we wrote 1 million records in one region of a multi-region cluster. We then initiated a read, 500ms later, in the other region, of the records that were just written in the initial region, while keeping a production level of load on the cluster. All records were successfully read. Although the test wasn't exhaustive, with negative tests and comprehensive corner case failure scenarios, it gave us enough confidence in the consistency/timeliness level of Apache Cassandra, for our use cases.
Since many of our applications that serve user requests need to do so in a timely manner, we need to guarantee that data tier reads are fast - generally in a single-millisecond range. In some cases, we also front our Cassandra clusters with a Memcached layer, in other cases we have ephemeral calculated data that only exists in Memcached. Managing memcached in a multi-regional Active-Active setup leads to a challenge of keeping the caches consistent with the source of truth. Rather than re-implementing multi-master replication for Memcached, we added remote cache invalidation capability to our EvCache client - a memcached client library that we open sourced earlier in 2013. Whenever there is a write in one region, EvCache client will send a message to another region (via SQS) to invalidate the corresponding entry. Thus a subsequent read in another region will recalculate or fall through to Cassandra and update the local cache accordingly.
Automating deployment across multiple environments and regions
When we launched our services in EU in 2012, we doubled our deployment environments from two to four: Test and Production, US-East and EU-West. Similarly, our decision to deploy our North American services in Active-Active mode increased deployment environments to six - adding Test and Production in the US-West region. While for any individual deployment we utilize Asgard - an extremely powerful and flexible deployment and configuration console, we quickly realized that our developers should not have to go through a sequence of at least 6 (some applications have more “flavors” that they support) manual deployment steps in order to keep their services consistent through all the regions.
To make the multi-regional deployment process more automated, our Tools team developed a workflow tool called Mimir, based on our open source Glisten workflow language, that allows developers to define multi-regional deployment targets and specifies rules of how and when to deploy. This, combined with automated canary analysis and automated rollback procedures allows applications to be automatically deployed in several places as a staged sequence of operations. Typically we wait for many hours between regional updates, so we can catch any problems before we deploy them world-wide.
Monkeys - Gorilla, Kong, Split-brain
We’ve talked a lot about our Simian Army - a collection of various monkeys we utilize to break our system - so that we can validate that our many services are resilient to different types of failures, and learn how we can make our system anti-fragile. Chaos Monkey - probably the most well known member of Simian Army runs in both Test and Production environments, and most recently it now includes Cassandra clusters in its hit list.
To validate our architecture was resilient to larger types of outages, we unleashed bigger Simians:
- Chaos Gorilla. It takes out a whole Availability Zone, in order to verify the services in remaining Zones are able to continue serving user requests without the loss of quality of service. While we were running Gorilla before Active-Active, continued regular exercises proved that our changed architecture was still resilient to Zone-wide outages.
- Split-brain. This is a new type of outage simulation where we severed the connectivity between Regions. We were looking to demonstrate that services in each Region continued to function normally, even though some of the data replication was getting queued up. Over the course of the Active-Active project we ran Split-brain exercise many times, and found and fixed many issues.
- Chaos Kong. This is the biggest outage simulation we’ve done so far. In a real world scenario this would have been triggered by a severe outage that would prompt us to rapidly shift user traffic to another region, but would inevitably result in some users experiencing total loss or lower quality of service. For the outage simulation we did not want to degrade user experience. We augmented what normally would have been an emergency traffic shifting exercise with extra steps so that users that were still routed to a “failed” region would still be redirected to the healthy region. Instead of getting errors, such users would still get appropriate responses. Also, we shifted traffic a bit more gradually than we would normally do under emergency circumstances in order to allow services to scale up appropriately and for caches to warm up gradually. We didn’t switch every single traffic source, but it was a majority and enough to prove we could take the full load of Netflix in US-West-2. We kept traffic in the west region for over 24 hours, and then gradually shifted it back to stable 50/50 state. Below you can see what this exercise looks like. Most traffic shifted from US-East to US-West, while EU-West remains unaffected:
Even before we fully validated multi-regional isolation and redundancy via Chaos Kong, we got a chance to exercise regional failover in real life. One of our middle tier systems in one of the regions experienced a severe degradation that eventually lead to the majority of the cluster becoming unresponsive. Under normal circumstances, this would have resulted in a severe outage with many users affected for some time. This time we had additional tool at our disposal - we decided to exercise the failover and route user requests to the healthy region. Within a short time, quality of service was restored to all the users. We could then spend time triaging the root cause of the problem, deploying the fix, and subsequently routing traffic back to the now healthy region. Here is the timeline of the failover, the black line is a guide from a week before:
Next steps in availability
All the work described above for Active-Active project is just a beginning. The project itself still has an upcoming Phase 2 - where we’ll focus on operational aspects of all of our multi-regional tooling, and automate as many of current manual steps as we can. We’ll focus on minimizing the time that we need to make a decision to execute a failover, and the time that it takes to fail over all of the traffic.
We’re also continuing to tackle some of the more difficult problems. For example, how do you deal with some of your dependencies responding slowly, or returning errors, but only some of the time? Arguably, this is harder than dealing with Chaos type of scenarios - when something is not there, or consistently failing, it’s much easier to decide what to do. To help us learn how our systems deal with such scenarios, we have a Latency Monkey - it can inject latencies and errors (both client and server-side) at a given frequency / distribution.
With large scale and velocity there is increased chance of failure. By leveraging the principles of Isolation and Redundancy we’ve made our architecture more resilient to widespread outages. The basic building blocks that we use to compose our services and make them resilient are available from our OSS Github site - with most of the changes for Active-Active already available, and some going through code reviews before we update the source code.
This post described the technical challenges and solutions for Active-Active. The non-technical level of complexity is even more difficult , especially given that many other important projects were being worked on at the same time. The secret sauce for success were the amazing teams and incredible engineers that work at Netflix, and the ability of AWS to rapidly provision additional capacity in US-West-2. It’s all of the teams working together that made this happen in only a few months from beginning to end.
If similar challenges and chance of working with amazing teams excite you, check out our jobs site. We’re looking for great engineers to join our teams and help us make Netflix services even more resilient and available!