Friday, June 14, 2013

Isthmus - Resiliency against ELB outages

On Christmas Eve, 2012, the Netflix streaming service experienced an outage.  For full details, see “A Closer Look at the Christmas Eve Outage” by Adrian Cockcroft.  This outage was particularly painful, both because of the timing and because the root cause - the ELB control plane - was outside of our ability to correct.  While our applications were running and healthy, no traffic was getting to them.  AWS teams worked diligently to correct the problem, though it took several hours to completely resolve the outage.

Following the outage, our teams had many discussions focusing on lessons and takeaways.  We wanted to understand how to strengthen our architecture so that we could withstand issues like a region-wide ELB outage without degrading service quality for Netflix users.  To survive such an outage, we needed a set of ELBs hosted in another region that we could use to route traffic to our backend services.  That was the starting point.

At the end of 2012 we were already experimenting with a setup internally referred to as “Isthmus” (definition here), for a different purpose - we wanted to see whether setting up a thin layer of ELBs plus a routing layer in a remote AWS region, with persistent long-distance connections between the routing layer and the backend services, would improve latency for our users.  We realized we could use a similar setup to achieve multi-regional ELB resiliency.  Under normal operation, traffic would flow through both regions.  If one of the regions experienced ELB issues, we would use DNS to route all the traffic through the other region.
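
The failover rule described above can be sketched in a few lines.  This is only an illustration of the decision, not our production tooling: the region labels, the health map and the function name are all assumptions made for the example.

```python
from typing import Dict

# Hypothetical region labels, used only for this illustration.
REGIONS = ("us-east", "us-west")


def plan_traffic_split(elb_healthy: Dict[str, bool]) -> Dict[str, float]:
    """Return the fraction of user traffic DNS should send to each region's ELBs."""
    healthy = [r for r in REGIONS if elb_healthy.get(r, False)]
    if not healthy:
        # No healthy ELB layer anywhere: DNS cannot help, so keep the even split.
        healthy = list(REGIONS)
    share = 1.0 / len(healthy)
    return {r: (share if r in healthy else 0.0) for r in REGIONS}


# Normal operation: traffic flows through both regions.
print(plan_traffic_split({"us-east": True, "us-west": True}))
# ELB issues in one region: all traffic goes through the other one.
print(plan_traffic_split({"us-east": False, "us-west": True}))
```

In practice the output of a function like this would be turned into DNS changes, which is a separate step from deciding the split.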

The routing layer that we used, Zuul, was developed by our API team.  It’s a powerful and flexible layer that can maintain a pool of connections, allows smart filtering, and much more.  You can find full details at our NetflixOSS GitHub site.  Zuul is at the core of the Isthmus setup - it forwards all user traffic and establishes the bridge (or isthmus) between the two AWS regions.

We had to make some more changes to our internal infrastructure to support this effort.  Eureka, our service discovery solution, normally operates within a single AWS region.  In this particular setup, we needed Eureka in the US-West-2 region to be aware of Netflix services in US-East.  In addition, our middle-tier IPC layer, Ribbon, needed to understand whether to route requests to a service local to the region or to one in a remote region.
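
To make the local-versus-remote decision concrete, here is a minimal sketch of region-aware instance selection.  It does not use the Eureka or Ribbon APIs; the `Instance` record, the `choose_instances` function and the host names are hypothetical, and the real logic may well weigh more factors than region and health.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instance:
    """One registered service instance (hypothetical shape, not Eureka's)."""
    service: str
    region: str
    host: str
    up: bool = True


def choose_instances(registry: List[Instance], service: str, local_region: str) -> List[Instance]:
    """Prefer healthy instances in the caller's region; otherwise fall back to remote ones."""
    candidates = [i for i in registry if i.service == service and i.up]
    local = [i for i in candidates if i.region == local_region]
    return local if local else candidates


registry = [
    Instance("api", "us-east", "api-1.example"),
    Instance("api", "us-east", "api-2.example"),
    Instance("edge", "us-west-2", "edge-1.example"),
]

# A caller in us-west-2 finds no local "api" instances, so it gets the remote ones.
print([i.host for i in choose_instances(registry, "api", "us-west-2")])
# The same caller asking for "edge" stays within its own region.
print([i.host for i in choose_instances(registry, "edge", "us-west-2")])
```

The fallback only works if the registry in one region knows about instances in the other, which is why the discovery layer had to become aware of remote services.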

We route user traffic to a particular set of ELBs via DNS.  These changes were typically made by one of our engineers through the DNS provider's UI console - one endpoint at a time.  This method is manual and does not work well during an outage.  Thus, Denominator was born - an open source library that works with DNS providers and allows such changes to be made programmatically.  Now we could automate directional DNS changes and execute them repeatedly.

Putting it all together: changing user traffic in production
In the weeks following the outage, we stood up the infrastructure necessary to support Isthmus and were ready to test it out.  After some internal tests, and stress tests that simulated production-level traffic in our test environment, we deployed Isthmus in production, though it was taking no traffic yet.  Since the whole system was brand-new, we proceeded very carefully.  We started with a single endpoint, though a rather important one - our API services.  Gradually, we increased the percentage of production traffic that it was taking: 1%, 5% and so on, until we verified that we could actually route 100% of user traffic through an Isthmus without any detrimental effects on user experience.  Traffic routing was done with DNS geo-directional changes - specifying which states to route to which endpoint.
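
One way to picture a geo-directional ramp-up is as choosing a set of states whose combined traffic stays within the target percentage.  The sketch below is an assumption about how such a selection could be made, not a description of what we ran: the state names and traffic shares are made up, and the greedy choice is just one simple option.

```python
from typing import Dict, List, Tuple


def pick_states(traffic_share: Dict[str, float], target: float) -> Tuple[List[str], float]:
    """Greedily pick states whose combined share of traffic stays at or below `target`."""
    chosen: List[str] = []
    total = 0.0
    # Consider the largest states first; smaller ones then fill the remaining gap.
    for state, share in sorted(traffic_share.items(), key=lambda kv: (-kv[1], kv[0])):
        if total + share <= target + 1e-9:  # small tolerance for float rounding
            chosen.append(state)
            total += share
    return chosen, total


# Made-up shares for five placeholder states (they sum to 1.0).
shares = {"state_a": 0.40, "state_b": 0.30, "state_c": 0.15, "state_d": 0.10, "state_e": 0.05}

for target in (0.05, 0.50, 1.00):
    states, routed = pick_states(shares, target)
    print(f"target {target:.0%}: route {states} ({routed:.0%} of traffic)")
```

Each step of the ramp-up then becomes a single DNS change that points the chosen states at the new endpoint, and stepping back is the same change in reverse.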

After success with the API service working in Isthmus mode, we proceeded to repeat the same setup with other services that enable Netflix streaming.  Not taking any chances, we repeated the same gradual ramp-up and validation as we did with API.  A similar sequence, though at faster ramp-up speeds, was followed for the remaining services that ensure users' ability to browse and stream movies.

Over the last 2 months we’ve been shifting production user traffic between AWS regions to reach the desired stable state - where traffic is split approximately 50/50 between the US-East and US-West regions.

The best way we can prove that this setup solves the problem we set out to resolve is by actually simulating an ELB outage - and verifying that we can shift all the traffic to another AWS region.  We’re currently planning such a "Chaos" exercise and will be executing it shortly.

First step towards the goal of Multi-Regional Resiliency
The work that we’ve done so far improved our architecture to better handle region-wide ELB outages.  ELB is just one service dependency though - and we have many more.  Our goal is to be able to survive any region-wide issue - either a complete AWS Region failure, or a self-inflicted problem - with minimal or no service quality degradation to Netflix users.  For example, the solution we’re working on should mitigate outages like the one we had on June 5th.  We’re starting to replicate all the data between the AWS regions, and eventually will stand up a full complement of services as well.  We’re working on such efforts now, and are looking for a few great engineers to join our Infrastructure teams.  If these are the types of challenges you enjoy - check out the Netflix Jobs site for more details.  To learn more about Zuul, Eureka, Ribbon and other NetflixOSS components, join us for the upcoming NetflixOSS Meetup on July 17, 2013.