Friday, April 29, 2011

Lessons Netflix Learned from the AWS Outage

On Thursday, April 21st, Amazon experienced a large outage in AWS US-East which they describe here. This outage was highly publicized because it took down or severely hampered a number of popular websites that depend on AWS for hosting. Our post below describes our experience at Netflix with the outage, and what we've learned from it.

Some Background
Why were some websites impacted while others were not? For Netflix, the short answer is that our systems are designed explicitly for these sorts of failures. When we re-designed for the cloud this Amazon failure was exactly the sort of issue that we wanted to be resilient to. Our architecture avoids using EBS as our main data storage service, and the SimpleDB, S3 and Cassandra services that we do depend upon were not affected by the outage.

What Went Well...
The Netflix service ran without intervention but with a higher than usual error rate and higher latency than normal through the morning, which is the low traffic time of day for Netflix streaming. We didn't see a noticeable increase in customer service calls, or our customers' ability to find and start movies.

In some ways this Amazon issue was the first major test to see if our new cloud architecture put us in a better place than we were in 2008. It wasn't perfect, but we didn't suffer any big external outages and only had to do a small amount of scrambling internally. Some of the decisions we made along the way that contributed to our success include:

Stateless Services
One of the major design goals of the Netflix re-architecture was to move to stateless services. These services are designed such that any service instance can serve any request in a timely fashion and so if a server fails it’s not a big deal. In the failure case requests can be routed to another service instance and we can automatically spin up a new node to replace it.

Data Stored Across Zones

In cases where it was impractical to re-architect in a stateless fashion we ensure that there are multiple redundant hot copies of the data spread across zones. In the case of a failure we retry in another zone, or switch over to the hot standby.

Graceful Degradation
Our systems are designed for failure. With that in mind we have put a lot of thought into what we do when (not if) a component fails. The general principles are:

Fail Fast: Set aggressive timeouts such that failing components don’t make the entire system crawl to a halt.
Fallbacks: Each feature is designed to degrade or fall back to a lower quality representation. For example if we cannot generate personalized rows of movies for a user we will fall back to cached (stale) or un-personalized results.
Feature Removal: If a feature is non-critical then if it’s slow we may remove the feature from any given page to prevent it from impacting the member experience.

"N+1" Redundancy
Our cloud architecture is is designed with N+1 redundancy in mind. In other words we allocate more capacity than we actually need at any point in time. This capacity gives us the ability to cope with large spikes in load caused by member activity or the ripple effects of transient failures; as well as the failure of up to one complete AWS zone. All zones are active, so we don't have one hot zone and one idle zone as used by simple master/slave redundancy. The term N+1 indicates that one extra is needed, and the larger N is, the less overhead is needed for redundancy. We spread our systems and services out as evenly as we can across three of the four zones. When provisioning capacity, we use AWS reserved instances in every zone, and reserve more than we actually use, so that we will have guaranteed capacity to allocate if any one zone fails. As a result we have higher confidence that the other zones are able to grow to pick up the excess load from a zone that is not functioning properly. This does cost money (reservations are for one to three years with an advance payment) however, this is money well spent since it makes our systems more resilient to failures.

Cloud Solutions for the Cloud
We could have chosen the simplest path into the cloud, fork-lifting our existing applications from our data centers to Amazon's and simply using EC2 as if it was nothing more than another set of data centers. However, that wouldn't have given us the same level of scalability and resiliency that we needed to run our business. Instead, we fully embraced the cloud paradigm. This meant leveraging NoSQL solutions wherever possible to take advantage of the added availability and durability that they provide, even though it meant sacrificing some consistency guarantees. In addition, we also use S3 heavily as a durable storage layer and pretend that all other resources are effectively transient. This does mean that we can suffer if there are S3 related issues, however this is a system that Amazon has architected to be resilient across individual system and zone failures and has proven to be highly reliable, degrading rather than failing.

What Didn't Go So Well...

While we did weather the storm, it wasn't totally painless. Things appeared pretty calm from a customer perspective, but we did have to do some scrambling internally. As it became clear that AWS was unlikely to resolve the issues before Netflix reached peak traffic in the early evening, we decided to manually re-assign our traffic to avoid the problematic zone. Thankfully, Netflix engineering teams were able to quickly coordinate to get this done. At our current level of scale this is a bit painful, and it is clear that we need more work on automation and tooling. As we grow from a single Amazon region and 3 availability zones servicing the USA and Canada to being a worldwide service with dozens of availability zones, even with top engineers we simply won't be able to scale our responses manually.

Manual Steps
When Amazon's Availability Zone (AZ) started failing we decided to get out of the zone all together. This meant making significant changes to our AWS configuration. While we have tools to change individual aspects of our AWS deployment and configuration they are not currently designed to enact wholesale changes, such as moving sets of services out of a zone completely. This meant that we had to engage with each of the service teams to make the manual (and potentially error prone) changes. In the future we will be working to automate this process, so it will scale for a company of our size and growth rate.

Load Balancing
Netflix uses Amazons Elastic Load Balance (ELB) service to route traffic to our front end services. We utilize ELB for almost all our web services. There is one architectural limitation with the service: losing a large number of servers in a zone can create a service interruption.

ELB's are setup such that load is balanced across zones first, then instances. This is because the ELB is a two tier load balancing scheme. The first tier consists of basic DNS based round robin load balancing. This gets a client to an ELB endpoint in the cloud that is in one of the zones that your ELB is configured to use. The second tier of the ELB service is an array of load balancer instances (provisioned directly by AWS), which does round robin load balancing over our own instances that are behind it in the same zone.

In the case where you are in 3 zones and and you have many service instances go down then the rest of the nodes in that AZ have to pick up the slack and handle the extra load. Eventually, if you couldn't launch more nodes to bring capacity up to previous levels you would likely suffer a cascading failure where all the nodes in the zone go down. If this happened then a third of your traffic would essentially go into a "black hole" and fail.

The net effect of this is that we have to be careful to make sure that our zones stay evenly balanced with frontend servers so that we don't serve degraded traffic from any one zone. This meant that when the outage happened last week we had to manually update all of our ELB endpoints to completely avoid the failed zone, and change the autoscaling groups behind them to do the same so that the servers would all be in the zones that were getting traffic.

It also appears that the instances used by AWS to provide ELB services were dependent on EBS backed boot disks. Several people have commented that ELB itself had a high failure rate in the affected zone.

For middle tier load balancing Netflix uses its own software load balancing service that does balance across instances evenly, independent of which zone they are in. Services using middle tier load balancing are able to handle uneven zone capacity with no intervention.

Lessons Learned
This outage gave us some valuable experience and helped us to identify and prioritize several ways that we should improve our systems to make it more resilient to failures.

Create More Failures
Currently, Netflix uses a service called "Chaos Monkey" to simulate service failure. Basically, Chaos Monkey is a service that kills other services. We run this service because we want engineering teams to be used to a constant level of failure in the cloud. Services should automatically recover without any manual intervention. We don't however, simulate what happens when an entire AZ goes down and therefore we haven't engineered our systems to automatically deal with those sorts of failures. Internally we are having discussions about doing that and people are already starting to call this service "Chaos Gorilla".

Automate Zone Fail Over and Recovery
Relying on multiple teams using manual intervention to fail over from a failing AZ simply doesn't scale. We need to automate this process, making it a "one click" operation, that can be invoked if required.

Multiple Region Support
We are currently re-engineering our systems to work across multiple AWS regions as part of the Netflix drive to support streaming in global markets. As AWS launches more regions around the world, we want to be able to migrate the support for a country to a newly launched AWS region that is closer to our customers. This is essentially the same operation as a disaster recovery migration from one region to another, so we would have the ability to completely vacate a region if a large scale outage occurred.

Avoid EBS Dependencies
We had already decided that EBS performance was an issue, so (with one exception) we have avoided it as a primary data store. With two outages so far this year we are taking steps to further reduce dependencies by switching our stateful instances to S3 backed AMIs. These take a lot longer to create in our build process, but will make our Cassandra storage services more resilient, by depending purely on internal disks. We already have plans to migrate our one large MySQL EBS based service to Cassandra.

Conclusion
We set out to build a highly available Netflix service on AWS and this outage was a major test of that decision. While we do have some lessons learned and some improvements to make, we continue to be confident that this is the right strategy for us.

Authors:
Adrian Cockroft - Director - Architecture, ECommerce and System Engineering
Cory Hicks - Sr. Manager, Applied Personalization Algorithms
Greg Orzell - Sr. Manager, Streaming Service and Insight Engineering

Monday, April 18, 2011

“More Like This…” Building a network of similarity

Ever wondered what makes “Inception” similar to “12 Monkeys”?

Looking at similarity between movies and TV shows is a useful way to find great titles to watch. At Netflix, we calculate levels of similarity between each movie and TV show and use these “similars”, as we call them, for a variety of customer-facing and internal systems.

Hi, there. Hans Granqvist, senior algorithm engineer, here to tell you more about how we build similars. I want to share some insights into how we lifted and improved this build as we moved it from our datacenter to the cloud.

The similars build process

First a little background.

To create a network of similars, we look at more than thousands facets associated with each title. A future post will go into more detail of the similarity build, but it follows this somewhat simplified process:

First we discover sets of titles that may be similar to the source title, based on the algorithm used and facets associated with each title. We then refine these sets of titles and filter them to remove unwanted matches, depending on algorithm deployment types and audiences.

After this, we dynamically weigh each facet and score it. The sum is the measure of similarity: the higher the sum, the more similar the titles.

Until last year, we built similars in our data center. As internal dependencies transferred to the cloud, its scaling capabilities became real. We wanted to use the cloud’s more flexible deployment structure, and its inherent parallelism would let us change the build to scale linearly with number of machines. We could increase the numbers of algorithms by an order of a magnitude.

Shortcomings of old Datacenter build. Opportunities and challenges of the cloud.

While the old build worked well, its shortcomings were several:
  • Algorithms were defined in the code making them hard to change.
  • The datacenter was limited to small set of machines leading to long recalculation times (several days).
  • Longer push cycles due to code linkage and runtime dependencies.
  • Building directly on production DB structure with varying resource availability.
Moving to the cloud presented new opportunities:
  • A new architecture lets us define algorithms outside of code, using distributed stores to properly isolate and share newer versions of algorithms.
  • The cloud’s unlimited capacity (within reason) could be exercised to build massively in parallel.
  • Netflix components are now all re-architected as services. We can push new code much faster, almost instantaneous. Internal dependencies are just an API call away. 
Of course, with this come challenges:
  • Remote service calls have latency.  Going from nanoseconds to milliseconds makes a huge difference when you repeat it millions of times.
  • The cloud persistence layers (SimpleDB and S3) have wildly varying performance characteristics. For some searches via SimpleDB, for example, there are surprisingly no SLAs.
  • With hundreds of machines building simultaneously, the need to partition and properly synchronize work becomes paramount.
  • The distributed nature by the cloud environment increases the risk of failures around data store, message bus systems, and caching layers.
Solutions

We based our new cloud architecture on series of tasks distributed by a controller to a set of builder nodes that communicate through a set of message queues.

Each task contains information including source title and algorithm, with optional versioning. As a build node picks up these tasks off the queue, it collects the definition of the algorithm from persistent storage, converts it into a sequence of executional steps, and starts executing.

Technologies used
  • AWS Simple Queue System for communication between controller and nodes.
  • AWS SimpleDB, Amazon’s row database, to store the definitions of algorithms.
  • AWS S3, Amazon’s key/value store.
  • EV Cache, a Netflix-developed version of memcache to increase throughput.
  • A Netflix-developed persistent store mechanism that transparently chains various types of caching (local near-cache LRU cache and service-shared EV cache, for example) to S3.
Build process

The following figure shows the various components in the build process. The Controller sends tasks on an SQS instruction queue. These tasks are read by a set of Build Nodes, which read the algorithms from SimpleDB and S3, and use various data sources to calculate the set of similars. When done, the node writes the result to persistent store and signals build status back to the controller via an SQS feedback queue.


Architecture of the Similars Build process (click to enlarge). A ‘wrapped component’ indicates the component needs to be instrumented to handle network hiccups, failures and AWS API rate limitations.
Based on their availability to process new tasks, build nodes periodically read from the instruction queue. When a message has been seen and read, SQS guarantees other nodes will not read the same message until the message visibility window expires. 

The build process spins off independent task threads for each task parsed. The first time an algorithm is seen, a builder node reads the algorithm definition and decides whether it can process the task. Newer versions of build nodes with knowledge of newer sets of data sources can co-exist with older ones using versioned messages.

If a node cannot process the task, it drops the message on the floor and relies on the SQS time out window to expire so the message becomes visible for other nodes.  The time-out window has been tuned to give a node reasonable time to process the message.

SQS guarantees only that messages arrive, not that they arrive in the order they were put on the queue. Care has to be given to define each message as independent and idempotent.

The final step is to persist the now calculated list of score-ordered similar to S3.

Once the task has been performed, the node puts a feedback message on a feedback queue. The controller uses this feedback to measure build task progress and also to collect statistics on each node’s performance. Based on this statistic, the controller may change the number of builder threads for a node, how often it should read from the queue, or various other timeout and retry values for SQS, S3, and SimpleDB.

Error situations and solutions

Building the system made us realize that we’re in a different reality in the cloud.

Some of the added complexity comes from writing a distributed system, where anything can fail at any given time. But some of the complexity was unexpected and we had to learn how to handle the following on a much larger scale than we initially envisioned.
  • Timeouts and slowness reading algorithms and weights from persistent store systems, each of which can rate-limit a client if it believes the client abuses the service. Once in such a restricted state, your code needs to quickly ease off. The only way to try to prevent AWS API rate limitation is to start out slow and gradually increase your activity. Restriction normally applies to the entire domain, so all clients on that domain will be restricted, not just the one client currently misusing it. We handled these issues via multiple levels of caching (using both a near cache on the builder node + application level cache to store partial results) with exponential fallback retries.
  • Timeouts and AWS API rate limitation writing to SQS. Putting messages on the queue can fail. We handle this via exponential fallback retries.
  • Inability of a node to read from SQS. Also handled via exponential fallback retries.
  • Inability of nodes to process all tasks in a message. We batch messages on SQS for both cost and performance reasons. When a node cannot process all tasks in a message, we drop message on floor and rely on SQS to resend it.
  • Inability of a node to process a task inside a batch message. A node may have occasional glitches or find it impossible to finish its tasks (e.g., data sources may have gone offline). We collate all failed tasks and retry on each node until set is empty or fail the batch message after a number of tries.
  • Timeouts and AWS API rate limitation writing to persistency layers. We handle with exponential retry.
All exponential retries typically wait in the order of 500ms, 2000ms, 8000ms, and so on, with some randomness added to avoid nodes retrying at fixed intervals. Sometimes operations have to be retried up to dozens of times.

Conclusion

By moving our build to the cloud, we managed to cut the time it takes to calculate a network of similars from up to two weeks down to mere hours. This means we can now experiment and A/B test new algorithms much more easily.

We also now have combinatorial algorithms (algorithms defined in terms of other algorithms) and the build nodes use this fact to execute builds in dependency order. Subsequent builds pick up cached results and we have seen exponential speed increases.

While network builds such as these many times can be embarrassingly parallel, it is worth noticing that the error situations come courtesy by a distributed environment where there in many cases are ill-defined (or none at all) SLAs.

One key insight is that the speed with which we can build new networks is now gated by how fast we can pump the result data to the receiving permanent store. CPU and RAM is a cheap and predictable cloud commodity. I/O is not.

The lessons of building this system will be invaluable as we progress into more complex processes where we take in more factors, many of which are highly temporal and real-time driven or limited to specific countries, regions, languages, and cultures.

Let us know what you think… and thanks for reading!

PS. Want to work in our group? Send us your resume and let us know why you think, or don’t think, “Inception” is similar to “12 Monkeys.”