Thursday, December 8, 2011

Making the Netflix API More Resilient

by Ben Schmaus

The API brokers catalog and subscriber metadata between internal services and Netflix applications on hundreds of device types. If any of these internal services fail there is a risk that the failure could propagate to the API and break the user experience for members.
To provide the best possible streaming experience for our members, it is critical for us to keep the API online and serving traffic at all times. Maintaining high availability and resiliency for a system that handles a billion requests a day is one of the goals of the API team, and we have made great progress toward achieving this goal over the last few months.

Principles of Resiliency

Here are some of the key principles that informed our thinking as we set out to make the API more resilient.
  1. A failure in a service dependency should not break the user experience for members
  2. The API should automatically take corrective action when one of its service dependencies fails
  3. The API should be able to show us what’s happening right now, in addition to what was happening 15-30 minutes ago, yesterday, last week, etc.

Keep the Streams Flowing

As stated in the first principle above, we want members to be able to continue instantly watching movies and TV shows streaming from Netflix when server failures occur, even if the experience is slightly degraded and less personalized. To accomplish this we’ve restructured the API to enable graceful fallback mechanisms to kick in when a service dependency fails. We decorate calls to service dependencies with code that tracks the result of each call. When we detect that a service is failing too often we stop calling it and serve fallback responses while giving the failing service time to recover. We then periodically let some calls to the service go through and if they succeed then we open traffic for all calls.
If this pattern sounds familiar to you, you're probably thinking of the CircuitBreaker pattern from Michael Nygard’s book "Release It! Design and Deploy Production-Ready Software", which influenced the implementation of our service dependency decorator code. Our implementation goes a little further than the basic CircuitBreaker pattern in that fallbacks can be triggered in a few ways:
  1. A request to the remote service times out
  2. The thread pool and bounded task queue used to interact with a service dependency are at 100% capacity
  3. The client library used to interact with a service dependency throws an exception
These buckets of failures factor into a service's overall error rate and when the error rate exceeds a defined threshold then we "trip" the circuit for that service and immediately serve fallbacks without even attempting to communicate with the remote service.
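The trip-and-probe behavior described above can be sketched as follows. This is a simplified, hypothetical illustration (invented class and method names, and a simple cumulative error rate rather than our actual rolling-window stats), not our production code:

```java
// Sketch of the trip logic: timeouts, saturated thread pools, and client
// exceptions all count toward an error rate; when the rate crosses a
// threshold the circuit opens and requests short-circuit to fallbacks,
// with a periodic probe request to test for recovery.
public class CircuitSketch {
    private final double errorThreshold;  // e.g. 0.50 = trip above 50% errors
    private final long sleepWindowMillis; // wait before letting a probe through
    private int successes, failures;
    private boolean open;
    private long openedAtMillis;

    public CircuitSketch(double errorThreshold, long sleepWindowMillis) {
        this.errorThreshold = errorThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
    }

    public synchronized void recordSuccess() { successes++; }

    public synchronized void recordFailure(long nowMillis) {
        failures++;
        int total = successes + failures;
        if (!open && (double) failures / total > errorThreshold) {
            open = true;                 // trip the circuit
            openedAtMillis = nowMillis;
        }
    }

    public synchronized boolean allowRequest(long nowMillis) {
        if (!open) return true;
        if (nowMillis - openedAtMillis >= sleepWindowMillis) {
            openedAtMillis = nowMillis;  // one probe; next probe a window later
            return true;
        }
        return false;                    // short-circuit: serve the fallback
    }

    public synchronized boolean isOpen() { return open; }

    // Called when a probe succeeds: close the circuit and reset the stats.
    public synchronized void reset() { open = false; successes = 0; failures = 0; }
}
```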
Each service that’s wrapped by a circuit breaker implements a fallback using one of the following three approaches:
  1. Custom fallback - in some cases a service’s client library provides a fallback method we can invoke, or in other cases we can use locally available data on an API server (e.g., a cookie or local JVM cache) to generate a fallback response

  2. Fail silent - in this case the fallback method simply returns a null value, which is useful if the data provided by the service being invoked is optional for the response that will be sent back to the requesting client
  3. Fail fast - used in cases where the data is required or there’s no good fallback and results in a client getting a 5xx response. This can negatively affect the device UX, which is not ideal, but it keeps API servers healthy and allows the system to recover quickly when the failing service becomes available again.
Ideally, all service dependencies would have custom fallbacks as they provide the best possible user experience (given the circumstances). Although that is our goal, it’s also very challenging to maintain complete fallback coverage for many service dependencies. So the fail silent and fail fast approaches are reasonable alternatives.
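The three fallback modes might look roughly like this sketch (hypothetical helper names; the real decorator layer is considerably more involved):

```java
import java.util.function.Supplier;

// Sketch of the three fallback modes. "FailFastException" stands in for
// whatever ultimately surfaces to the client as a 5xx response.
public class FallbackSketch {
    public static class FailFastException extends RuntimeException {
        public FailFastException(Throwable cause) { super(cause); }
    }

    // Custom fallback: on failure, serve locally available data instead.
    public static <T> T withCustomFallback(Supplier<T> primary, Supplier<T> fallback) {
        try { return primary.get(); } catch (RuntimeException e) { return fallback.get(); }
    }

    // Fail silent: treat the data as optional and return null on failure.
    public static <T> T failSilent(Supplier<T> primary) {
        try { return primary.get(); } catch (RuntimeException e) { return null; }
    }

    // Fail fast: no good fallback exists, so propagate as a server error.
    public static <T> T failFast(Supplier<T> primary) {
        try { return primary.get(); } catch (RuntimeException e) { throw new FailFastException(e); }
    }
}
```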

Real-time Stats Drive Software and Diagnostics

I mentioned that our circuit breaker/fallback code tracks and acts on requests to service dependencies. This code counts requests to each service dependency over a 10 second rolling window. The window is rolling in the sense that request stats that are older than 10 seconds are discarded; only the results of requests over the last 10 seconds matter to the code. We also have a dashboard that’s wired up to these same stats that shows us the state of our service dependencies for the last 10 seconds, which comes in really handy for diagnostics.
You might ask, "Do you really need a dashboard that shows you the state of your service dependencies for the last 10 seconds?" The Netflix API receives around 20,000 requests per second at peak traffic. At that rate, 10 seconds translates to 200,000 requests from client devices, which can easily translate to 1,000,000+ requests from the API into upstream services. A lot can happen in 10 seconds, and we want our software to base its decision making on what just happened, not what was happening 10 or 15 minutes ago. These real-time insights can also help us identify and react to issues before they become member-facing problems. (Of course, we have charts for identifying trends beyond the 10 second window, too.)
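A rolling window of per-second buckets, where stats older than the window are discarded, could be sketched like this (an illustrative data structure, not our actual implementation):

```java
// Sketch of a rolling 10-second counter built from per-second buckets.
// Buckets older than the window are discarded as time advances, so only
// the last windowSeconds of activity contribute to the sum.
public class RollingWindowSketch {
    private final int windowSeconds;
    private final long[] counts;       // one bucket per second
    private final long[] bucketEpoch;  // which second each bucket last held

    public RollingWindowSketch(int windowSeconds) {
        this.windowSeconds = windowSeconds;
        this.counts = new long[windowSeconds];
        this.bucketEpoch = new long[windowSeconds];
    }

    public synchronized void increment(long epochSecond) {
        int i = (int) (epochSecond % windowSeconds);
        if (bucketEpoch[i] != epochSecond) { // bucket is stale: reuse it
            bucketEpoch[i] = epochSecond;
            counts[i] = 0;
        }
        counts[i]++;
    }

    public synchronized long sum(long nowEpochSecond) {
        long total = 0;
        for (int i = 0; i < windowSeconds; i++) {
            if (nowEpochSecond - bucketEpoch[i] < windowSeconds) total += counts[i];
        }
        return total;
    }
}
```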

Circuit Breaker in Action

Now that I've described the basics of the service dependency decorator layer that we've built, here's a real-world example that demonstrates the value it can provide. The data that the API uses to respond to certain requests is stored in a database, but it's also cached as a Java object in a shared cache. Upon receiving one of these requests, the API looks in the shared cache and, if the object isn't there, queries the database. One day we discovered a bug where the API was occasionally loading a Java object into the shared cache that wasn't fully populated, which had the effect of intermittently causing problems on certain devices.
Once we discovered the problem, we decided to bypass the shared cache and go directly to the database while we worked on a patch. The following chart shows cache hits, and confirms that disabling the cache had the expected effect of dropping hits to zero.
What we weren’t counting on was getting throttled by our database for sending it too much traffic. Fortunately, we implemented custom fallbacks for database selects and so our service dependency layer started automatically tripping the corresponding circuit and invoking our fallback method, which checked a local query cache on database failure. This next chart shows the spike in fallback responses.
The fallback query cache had most of our active data set and so the overall impact to member experience was very low as can be seen by the following chart, which shows a minimal effect on overall video views. (The red line is video views per second this week and the black line is the same metric last week.)

Show Me, Don't Tell Me

While this was happening, we were able to see exactly what the system was doing by looking at our dashboard, which processes a data stream that includes the same stats used by the circuit breaker code. Here’s an excerpt that shows what the dashboard looked like during the incident.
The red 80% in the upper right shows the overall error rate for our database select circuit, and the “Open” and “Closed” counts show that the majority of server instances (157 of 200) were serving fallback responses. The blue count is the number of short-circuited requests that were never sent to the database server.
The dashboard is based on the classic green, yellow, red traffic light status page pattern and is designed to be quickly scannable. Each circuit (we have ~60 total at this point) has a circle to the left that encodes call volume (size of the circle - bigger means more traffic) and health (color of the circle - green is healthy and red indicates a service that’s having problems). The sparkline indicates call volume over a 2 minute rolling window (though the stats outside of the 10 second window are just used for display and don’t factor into the circuit breaker logic).
Here’s an example of what a healthy circuit looks like.
And here's a short video showing the dashboard in action.

The Future of API Resiliency

Our service dependency layer is still very new and there are a number of improvements we want to make, but it’s been great to see the thinking translate into higher availability and a better experience for members. That said, a ton of work remains to bolster the system, increase fallback coverage, refine visualizations and insights, etc. If these kinds of challenges excite you, especially at large scale, in the cloud, and on a small, fast-moving team, we’re actively looking for DevOps engineers.

Tuesday, November 29, 2011

Introducing Curator - The Netflix ZooKeeper Library

By Jordan Zimmerman

Open Source at Netflix
We are committed to Open Source Software at Netflix. We've blogged about it in the past. Today we are announcing a portal for open source projects from Netflix. The portal is currently hosted on Github. There are several projects in the pipeline (including Curator, which we're announcing today):
  • Curator - The Netflix ZooKeeper Library
  • Astyanax - The Netflix Cassandra Client
  • Priam - Co-Process for backup/recovery, Token Management, and Centralized Configuration management for Cassandra
  • CassJMeter - JMeter plugin to run cassandra tests

ZooKeeper
ZooKeeper is a high-performance coordination service for distributed applications. It exposes common services - such as naming, configuration management, synchronization, and group services - in a simple interface. For full details on ZooKeeper, refer to the Apache ZooKeeper project documentation.

Difficult to Use Correctly
While ZooKeeper comes bundled with a Java client, using the client is non-trivial and error prone. Users of the client are expected to do a great deal of manual housekeeping.

Connection Issues:
  • Initial connection: the ZooKeeper client does a handshake with the server that takes some time. Any methods executed synchronously against the server (e.g. create(), getData(), etc.) will throw an exception if this handshake hasn't completed.
  • Failover: if the ZooKeeper client loses its connection to the server, it will failover to another server in the cluster. However, this process puts the client back into "initial connection" mode.
  • Session expiration: there are edge cases that can cause the ZooKeeper session to expire. Clients are expected to watch for this state and close and re-create the ZooKeeper instance.

Recoverable Errors:
  • When creating a sequential ZNode on the server, there is the possibility that the server will successfully create the ZNode but crash prior to returning the node name to the client.
  • There are several recoverable exceptions thrown by the ZooKeeper client. Users are expected to catch these exceptions and retry the operation.

Recipes:
  • The standard ZooKeeper "recipes" (locks, leaders, etc.) are only minimally described and subtly difficult to write correctly.
  • Some important edge cases are not mentioned in the recipes. For example, the lock recipe does not describe how to deal with a server that successfully creates the Sequential/Ephemeral node but crashes before returning the node name to the client. If not dealt with properly, deadlocks can result.
  • Certain use cases must be conscious of connection issues. For example, Leader Election must watch for connection instability. If the connected server crashes, the leader cannot assume it is safe to continue as the leader until failover to another server is successful.

The above issues (and others like it) must be addressed by every user of ZooKeeper. Solutions are neither easy to write nor obvious and can take considerable time. Curator deals with all of them.
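As one example of the kind of housekeeping involved, the crash-on-create edge case for sequential nodes can be handled by embedding a unique ID in the node name, so that a retry can find its own node among the existing children instead of creating a duplicate. Here is a sketch of that idea, simulated against an in-memory child list (hypothetical names; a real implementation works against a live ZooKeeper ensemble):

```java
import java.util.List;

// Sketch of "protected creation": embed a unique per-attempt ID in the node
// name so that, after a crash-on-create, a retry can scan existing children
// and recognize its own node rather than create a duplicate (which could
// deadlock a lock recipe). The ZooKeeper namespace is simulated as a list.
public class ProtectedCreateSketch {
    public static String createIfAbsent(List<String> children, String uniqueId, int nextSeq) {
        // On retry, first look for a child we already created.
        for (String child : children) {
            if (child.contains(uniqueId)) return child;
        }
        // Not found: create it (the server assigns the sequence number).
        String name = "lock-" + uniqueId + "-" + String.format("%010d", nextSeq);
        children.add(name);
        return name;
    }
}
```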

What is Curator?
Curator n kyoor͝ˌātər: a keeper or custodian of a museum or other collection - A ZooKeeper Keeper. It consists of three related projects:
  • curator-client - A replacement for the bundled ZooKeeper class that takes care of some low-level housekeeping and provides some useful utilities
  • curator-framework - The Curator Framework is a high-level API that greatly simplifies using ZooKeeper. It adds many features that build on ZooKeeper and handles the complexity of managing connections to the ZooKeeper cluster and retrying operations.
  • curator-recipes - Implementations of some of the common ZooKeeper "recipes". The implementations are built on top of the Curator Framework.

Curator is focused on the recipes: locks, leaders, etc. Most people interested in ZooKeeper don't need to be concerned with the details of connection management, etc. What they want is a simple way to use the recipes. Curator is directed at this goal.

Curator deals with ZooKeeper complexity in the following ways:
  • Retry Mechanism: Curator supports a pluggable retry mechanism. All ZooKeeper operations that generate a recoverable error get retried per the configured retry policy. Curator comes bundled with several standard retry policies (e.g. exponential backoff).
  • Connection State Monitoring: Curator constantly monitors the ZooKeeper connection. Curator users can listen for state changes in the connection and respond accordingly.
  • ZooKeeper Instance Management: Curator manages the actual connection to the ZooKeeper cluster using the standard ZooKeeper class. However, the instance is managed internally (though you can access it if needed) and recreated as needed. Thus, Curator provides a reliable handle to the ZooKeeper cluster (unlike the built-in implementation).
  • Correct, Reliable Recipes: Curator comes bundled with implementations of most of the important ZooKeeper recipes (and some additional recipes as well). The implementations are written using ZooKeeper best practices and take account of all known edge cases (as mentioned above).
  • Curator's focus on recipes makes your code more resilient as you can focus strictly on the ZooKeeper feature you're interested in without worrying about correctly implementing ZooKeeper housekeeping requirements.
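To illustrate the retry mechanism, here is a sketch of a pluggable retry policy with exponential backoff (invented names; see Curator's own bundled retry policies for the real API):

```java
import java.util.concurrent.Callable;

// Sketch of a pluggable retry mechanism: recoverable errors are retried
// per the configured policy, here exponential backoff.
public class RetrySketch {
    public interface RetryPolicy {
        // Returns millis to sleep before the next attempt, or -1 to give up.
        long delayBeforeRetry(int retryCount);
    }

    public static class ExponentialBackoff implements RetryPolicy {
        private final long baseSleepMillis;
        private final int maxRetries;
        public ExponentialBackoff(long baseSleepMillis, int maxRetries) {
            this.baseSleepMillis = baseSleepMillis;
            this.maxRetries = maxRetries;
        }
        public long delayBeforeRetry(int retryCount) {
            if (retryCount >= maxRetries) return -1;
            return baseSleepMillis * (1L << retryCount); // base, 2x, 4x, ...
        }
    }

    // Runs an operation, retrying recoverable errors per the policy.
    public static <T> T callWithRetry(Callable<T> op, RetryPolicy policy) throws Exception {
        int retryCount = 0;
        while (true) {
            try {
                return op.call();
            } catch (Exception recoverable) {
                long delay = policy.delayBeforeRetry(retryCount++);
                if (delay < 0) throw recoverable; // retries exhausted
                Thread.sleep(delay);
            }
        }
    }
}
```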

ZooKeeper at Netflix
ZooKeeper/Curator is being used extensively at Netflix. Some of the uses are:
  • InterProcessMutex used for ensuring unique values in various sequence ID generators
  • Cassandra Backups
  • TrackID Service
  • Our Chukwa collector uses the LeaderSelector for various housekeeping tasks
  • We make use of some third-party services that allow only a limited number of concurrent users. The InterProcessSemaphore is used to manage this.
  • Various Caches

Accessing Curator
Curator is available from the Netflix open source portal on Github.

Like what you see? We're hiring!

 

Wednesday, November 2, 2011

Benchmarking Cassandra Scalability on AWS - Over a million writes per second

by Adrian Cockcroft and Denis Sheahan

Netflix has been rolling out the Apache Cassandra NoSQL data store for production use over the last six months. As part of our benchmarking we recently decided to run a test designed to validate our tooling and automation scalability as well as the performance characteristics of Cassandra. Adrian presented these results at the High Performance Transaction Systems workshop last week.

We picked a write oriented benchmark using the standard Cassandra "stress" tool that is part of the product, and Denis ran and analyzed the tests on Amazon EC2 instances. Writes stress a data store all the way to the disks, while read benchmarks may only exercise the in-memory cache. The benchmark results should be reproducible by anyone, but the Netflix cloud platform automation for AWS makes it quick and easy to do this kind of test.

The automated tooling that Netflix has developed lets us quickly deploy large scale Cassandra clusters, in this case a few clicks on a web page and about an hour to go from nothing to a very large Cassandra cluster consisting of 288 medium sized instances, with 96 instances in each of three EC2 availability zones in the US-East region. Using an additional 60 instances as clients running the stress program we ran a workload of 1.1 million client writes per second. Data was automatically replicated across all three zones making a total of 3.3 million writes per second across the cluster. The entire test was able to complete within two hours with a total cost of a few hundred dollars, and these EC2 instances were only in existence for the duration of the test. There was no setup time, no discussions with IT operations about datacenter space and no more cost once the test was over.

To measure scalability, the same test was run with 48, 96, 144 and 288 instances, with 10, 20, 30 and 60 clients respectively. The load on each instance was very similar in all cases, and the throughput scaled linearly as we increased the number of instances. Our previous benchmarks and production roll-out had resulted in many application specific Cassandra clusters from 6 to 48 instances, so we were very happy to see linear scale to six times the size of our current largest deployment. This benchmark went from concept to final result in five days as a spare time activity alongside other work, using our standard production configuration of Cassandra 0.8.6, running in our test account. The time taken by EC2 to create 288 new instances was about 15 minutes out of our total of 66 minutes. The rest of the time was taken to boot Linux, start the Apache Tomcat JVM that runs our automation tooling, start the Cassandra JVM and join the "ring" that makes up the Cassandra data store. For a more typical 12 instance Cassandra cluster the same sequence takes 8 minutes.

The Netflix cloud systems group recently created a Cloud Performance Team to focus on characterizing the performance of components such as Cassandra, and helping other teams make their code and AWS usage more efficient to reduce latency for customers and costs for Netflix. This team is currently looking for an additional engineer.

TL;DR

The rest of this post is the details of what we ran and how it worked, so that other performance teams working with Cassandra on AWS can leverage our work and replicate and extend our results.

EC2 Configuration

The EC2 instances used to run Cassandra in this test are known as M1 Extra Large (m1.xl); they have four medium-speed CPUs, 15GB RAM and four disks of 400GB each. The total CPU power is rated by Amazon as 8 units. The other instance type we commonly use for Cassandra is an M2 Quadruple Extra Large (m2.4xl), which has eight (faster) CPUs, 68GB RAM and two disks of 800GB each, for a total of 26 units of CPU power, so about three times the capacity. We use these for read-intensive workloads to cache more data in memory. Both these instance types have a one Gbit/s network. There is also a Cluster Compute Quadruple Extra Large (cc1.4xl) option with eight even faster CPUs (total 33.5 units), 23GB RAM, two 800GB disks and a low-latency 10Gbit network. We haven't tried that option yet. In this case we were particularly interested in pushing the instance count to a high level to validate our tooling, so we picked a smaller instance option. All four disks were striped together using CentOS 5.6 md and XFS. The Cassandra commit log and data files were all stored on this filesystem.

The 60 instances that ran stress as clients were m2.4xl instances running in a single availability zone, so two-thirds of the client traffic was cross-zone, which slightly increased the mean client latency.

All instances are created using the EC2 auto-scaler feature. A single request is made to set the desired size of an auto-scale group (ASG), specifying an Amazon Machine Image (AMI) that contains the standard Cassandra 0.8.6 distribution plus Netflix-specific tooling and instrumentation. Three ASGs are created, one in each availability zone; the zones are separate datacenters, separated by about one millisecond of network latency. EC2 automatically creates the instances in each availability zone and maintains them at the set level. If an instance dies for any reason, the ASG automatically creates a replacement instance and the Netflix tooling manages bootstrap replacement of that node in the Cassandra cluster. Since all data is replicated three ways, effectively in three different datacenter buildings, this configuration is very highly available.

Here's an example screenshot of the configuration page for a small Cassandra ASG, editing the size fields and saving the change is all that is needed to create a cluster at any size.



Netflix Automation for Cassandra - Priam

Netflix has also implemented automation for taking backups and archiving them in the Simple Storage Service (S3), and we can perform a rolling upgrade to a new version of Cassandra without any downtime. It is also possible to efficiently double the size of a Cassandra cluster while it is running: each new node buddies up with and splits the data and load of one of the existing nodes, so that data doesn't have to be reshuffled too much. If a node fails, its replacement has a different IP address, but we want it to have the same token, and the original Cassandra replacement mechanisms had to be extended to handle this case cleanly. We call the automation "Priam", after Cassandra's father in Greek mythology. It runs in a separate Apache Tomcat JVM, and we are in the process of removing Netflix-specific code from Priam so that we can release it as an open source project later this year. We have already released an Apache ZooKeeper interface called Curator on Github, and we also plan to release a Java client library called Astyanax (the son of Hector, who was Cassandra's brother; Hector is also the name of a commonly used Java client library for Cassandra that we have improved upon). We are adding Greek mythology to our rota of interview questions :-)

Scale-Up Linearity

The scalability is linear as shown in the chart below. Each client system generates about 17,500 write requests per second, and there are no bottlenecks as we scale up the traffic. Each client ran 200 threads to generate traffic across the cluster.
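As a quick sanity check on the linearity claim, per-client throughput multiplied by client count reproduces the headline numbers (approximate figures taken from the text, not additional measurements):

```java
// Rough arithmetic behind the scaling chart: ~17,500 writes/sec per client,
// with 10/20/30/60 clients driving the 48/96/144/288-instance clusters,
// and every client write replicated three ways across availability zones.
public class ScalingMathSketch {
    public static long clusterWritesPerSec(int clients, long perClientRate) {
        return clients * perClientRate;
    }
    public static long replicatedWritesPerSec(long clientWrites, int replicationFactor) {
        return clientWrites * replicationFactor;
    }
}
```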



Per-Instance Activity

The next step is to look at the average activity level on each instance for each of these tests to look for bottlenecks. A summary is tabulated below.



The writes per server are similar, as we would expect, and the mean latency measured at the server remains low as the scale increases. The response time measured at the client was about 11ms, with about 1.2ms due to network latency and the rest from the Thrift client library overhead and scheduling delays as the threads pick up responses. The write latency measured at each Cassandra server is a small fraction of a millisecond (explained in detail later). Average server-side latency of around one millisecond is what we typically see on our production Cassandra clusters with a more complex mixture of read and write queries. The CPU load is a little higher for the largest cluster. This could be due to a random fluctuation in the test (which we only ran once), variations in the detailed specification of the m1.xl instance type, or an increase in gossip or connection overhead for the larger cluster. Disk writes are caused by commit log writes and large sequential SSTable writes. Disk reads are due to the compaction processing that combines Cassandra SSTables in the background. Network traffic is dominated by the Cassandra inter-node replication messages.

Costs of Running This Benchmark

Benchmarking can take a lot of time and money; there are many permutations of factors to test, so the cost of each test in terms of setup time and compute resources used can be a real limitation on how many tests are performed. With the Netflix cloud platform automation for AWS, setup time and cost drop dramatically, which means we can easily run more and bigger tests. The table below shows the test duration and AWS cost at the normal list price. This cost could be further reduced by using spot pricing, or by sharing unused reservations with the production site.



The usable storage capacity for a Cassandra 0.8.6 instance is half the available filesystem space, because Cassandra's current compaction algorithm needs space to compact into. This changes with Cassandra 1.0, which has an improved compaction algorithm and on-disk compression of the SSTables. The m1.xl instances cost $0.68 per hour and the m2.4xl test driver instances cost $2.00 per hour. We ran this test in our default configuration, which is highly available because replicas are located in three availability zones; there is a cost for this, since AWS charges $0.01 per gigabyte for cross-zone traffic. We estimated cross-zone traffic at two-thirds of the total, and for this network-intense test it actually cost more per hour than the instances. The test itself was run for about ten minutes, which was long enough to show a clear steady-state load level. Taking the setup time into account, the smaller tests can be completed within an hour; the largest test needed a second hour for the nodes only.

Unlike conventional datacenter testing, we didn't need to ask permission, wait for systems to be configured for us, or put up with a small number of dedicated test systems. We could also run as many tests as we like at the same time, since they don't use the same resources. Denis has developed scripting to create, monitor, analyze and plot the results of these tests. For example here's the client side response time plot from a single client generating about 20,000 requests/sec.



Detailed Cassandra Configuration

The client requests use consistency level "ONE", which means that they are acknowledged when one node has acknowledged the data. For consistent read-after-write data access our alternative pattern is to use "LOCAL_QUORUM". In that case the client acknowledgement waits for two out of the three nodes to acknowledge the data and the write response time increases slightly, but the work done by the Cassandra cluster is essentially the same. As an aside, in our multi-region testing we have found that network latency and Cassandra response times are lower in the AWS Europe region than in US East, which could be due to a smaller scale deployment or newer networking hardware. We don't think network contention from other AWS tenants was a significant factor in this test.
Stress command line
java -jar stress.jar -d "144 node ids" -e ONE -n 27000000 -l 3 -i 1 -t 200 -p 7102 -o INSERT -c 10 -r
The client writes 10 columns per row key, with the row key randomly chosen from 27 million ids; each column has a key and 10 bytes of data. The total on-disk size for each write, including all overhead, is about 400 bytes.

Thirty clients talk to the first 144 nodes and thirty talk to the second 144. For the Insert we write three replicas, which is specified in the keyspace.

Cassandra Keyspace Configuration
Keyspace: Keyspace1:
Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
Durable Writes: true
Options: [us-east:3]
Column Families:
ColumnFamily: Standard1
Key Validation Class: org.apache.cassandra.db.marshal.BytesType
Default column value validator: org.apache.cassandra.db.marshal.BytesType
Columns sorted by: org.apache.cassandra.db.marshal.BytesType
Row cache size / save period in seconds: 0.0/0
Key cache size / save period in seconds: 200000.0/14400
Memtable thresholds: 1.7671875/1440/128 (millions of ops/minutes/MB)
GC grace seconds: 864000
Compaction min/max thresholds: 4/32
Read repair chance: 0.0
Replicate on write: true

Data Flows, Latency and Durability

To understand latency and durability we need to look at the data flows for writes in different Cassandra configurations. Cassandra clients are not normally aware of which node should store their data, so they pick a node at random, which then acts as a coordinator and sends replicas of the data to the correct nodes (picked using a consistent hash of the row key). For the fastest writes a single node has to acknowledge the write. This is useful when replacing an ephemeral memcached-oriented data store with Cassandra, where we want to avoid the cold cache issues associated with failed memcached instances, but speed and availability are more important than consistency. The additional overhead compared to memcached is an extra network hop. However, an immediate read after write may get the old data; the data is only eventually consistent.

To get immediately consistent writes with Cassandra we use a quorum write. Two out of three nodes must acknowledge the write before the client gets its ack so the writes are durable. In addition, if a read after a quorum write also uses quorum, it will always see the latest data, since the combination of two out of three in both cases must include an overlap. Since there is no concept of a master node for data in Cassandra, it is always possible to read and write, even when a node has failed.
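The overlap argument can be checked exhaustively for the three-replica case. This small, self-contained verification (not Netflix code) enumerates all write and read quorums as bitmasks:

```java
// Exhaustively checks that any write quorum (w of n replicas) intersects
// any read quorum (r of n replicas), so a quorum read always sees the
// latest quorum write. This is the r + w > n rule; here n=3, r=w=2.
public class QuorumOverlapSketch {
    public static boolean quorumsAlwaysOverlap(int n, int r, int w) {
        // Iterate over all replica subsets encoded as bitmasks.
        for (int write = 0; write < (1 << n); write++) {
            if (Integer.bitCount(write) != w) continue;
            for (int read = 0; read < (1 << n); read++) {
                if (Integer.bitCount(read) != r) continue;
                if ((write & read) == 0) return false; // disjoint quorums exist
            }
        }
        return true;
    }
}
```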

The Cassandra commit log flushes to disk with an fsync call every 10 seconds by default. This means that there is a window of up to ten seconds during which the committed data is not actually on disk, but it has been written to memory on three different instances in three different availability zones (i.e. datacenter buildings). The chance of losing all three copies in the same time window is small enough that this provides excellent durability, along with high availability and low latency. The latency for each Cassandra server that receives a write is just the few microseconds it takes to queue the data to the commit log writer.
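The fsync behavior described above corresponds to the periodic commit log sync settings in cassandra.yaml (default values shown; deployments can tune or switch to batch sync):

```yaml
# cassandra.yaml: periodic commit log sync - an fsync every 10 seconds,
# acknowledging writes before they are guaranteed on disk.
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
```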

Cassandra implements a gossip protocol that lets every node know the state of all the others. If the target of a write is down, the coordinator node remembers that it didn't get the data; this is known as "hinted handoff". When gossip tells the coordinator that the node has recovered, it delivers the missing data.



Netflix is currently testing and setting up global Cassandra clusters to support expansion into the UK and Ireland using the AWS Europe region located in Ireland. For use cases that need a global view of the data, an extra set of Cassandra nodes are configured to provide an asynchronously updated replica of all the data written on each side. There is no master copy, and both regions continue to work if the connections between them fail. In that case we use a local quorum for reads and writes, which sends the data remotely, but doesn't wait for it, so latency is not impacted by the remote region. This gives consistent access in the local region with eventually consistent access in the remote region along with high availability.



Next Steps

There are a lot more tests that could be run. We are currently working to characterize the performance of Cassandra 1.0 and multi-region performance for both reads and writes, with more complex query combinations and client libraries.

Takeaway

Netflix is using Cassandra on AWS as a key infrastructure component of its globally distributed streaming product. Cassandra scales linearly far beyond our current capacity requirements, and very rapid deployment automation makes it easy to manage. In particular, benchmarking in the cloud is fast, cheap and scalable; once you try it, you won't go back.

If you are the kind of performance geek that has read this far and wishes your current employer would let you spin up huge tests in minutes and open source the tools you build, perhaps you should give us a call...

Wednesday, October 12, 2011

Netflix Performance on Top Networks



Hi, it’s Ken Florance, Director of Content Delivery for Netflix, with an updated “Netflix Performance on Top Networks” chart.

As promised, we’ve been able to separate AT&T and Verizon FTTx offerings from their DSL offerings. The chart now gives a fairly complete look at performance on top networks, with additional insight into how different technologies (DSL, Cable, FTTx) impact potential throughput.

As always, we appreciate your feedback.

Thursday, September 8, 2011

Netflix WebKit-Based UI for TV Devices

This is Matt McCarthy and Kim Trott, device UI engineering managers at Netflix. Our teams use WebKit, JavaScript, HTML5, and CSS3 to build user interfaces that are delivered to millions of Netflix customers on game consoles, Blu-ray players, and Internet-connected TVs.

We recently spoke at OSCON 2011 about how Netflix develops these user interfaces. We talked about:
  • How we use web technologies to support rapid innovation
  • How we architect our UIs to enable vast variation
  • Strategies to improve performance and take advantage of game consoles’ muscle while still running well on limited hardware
We wanted to share our learnings with a broader audience, so we've posted our slide deck. Hopefully, you'll find it interesting, or (better yet) useful.

Netflix Webkit-Based UI for TV Devices: PowerPoint 2010 | PDF | Slideshare

Saturday, August 13, 2011

Building with Legos

In the six years that I have been involved in building and releasing software here at Netflix, the process has evolved and improved significantly. When I started, we would build a WAR, get it set up and tested on a production host, and then run a script that would stop tomcat on the host being pushed to, rsync the directory structure, and then start tomcat again. Each host would be manually pushed to using this process, and even with very few hosts this took quite some time and a lot of human interaction (potential for mistakes).

Our next iteration was an improvement in automation, but not really in architecture. We created a web based tool that would handle the process of stopping and starting things as well as copying into place and extracting the new code. This meant that people could push to a number of servers at once just by selecting check boxes. The tests to make sure that the servers were back up before proceeding could also be automated and have failsafes in the tool.

When we started migrating our systems to the cloud we took the opportunity to revisit our complete build pipeline, looking both at how we could leverage the cloud paradigm as well as the current landscape for build tools. What resulted was essentially a complete re-write of how the pipeline functioned, leveraging a suite of tools that were rapidly maturing (Ivy, Artifactory, Jenkins, AWS).

The key advance was using our continuous build system to build not only the artifact from source code, but the complete software stack, all the way up to a deployable image in the form of an AMI (Amazon Machine Image for AWS EC2). The "classic" part of the build job does the following: build the artifact, publish it to Artifactory, build the package, publish the package to the repo. Then there is a follow-on job that mounts a base OS image, installs the packages and then creates the final AMI. Another important point is that we do all of this in our test environment only. When we need to move a built AMI into production we simply change the permissions on the AMI to allow it to be booted in production*.
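A hypothetical, heavily simplified model of those two jobs might look like the following; the stage names mirror the prose, and none of this is real Netflix tooling:

```python
# Toy model of the two-stage pipeline: the "classic" build publishes an
# artifact and a package, and a follow-on job bakes the package into an AMI.
# All names and formats here are invented for illustration.

artifactory, package_repo = [], []

def classic_build(source: str) -> str:
    artifact = f"{source}.jar"          # build the artifact from source
    artifactory.append(artifact)        # publish it to Artifactory
    package = f"{artifact}.deb"         # build the OS package
    package_repo.append(package)        # publish the package to the repo
    return package

def bake_ami(base_image: str, packages: list) -> str:
    contents = [base_image] + packages  # mount base image, install packages
    return "ami-" + "-".join(contents)  # register the final, bootable image

pkg = classic_build("api-server")
ami = bake_ami("base-os", [pkg])
print(ami)  # "ami-base-os-api-server.jar.deb"
```

The point of the model is the dependency order: the AMI is a pure function of the base image plus published packages, which is what makes the image reproducible across test and production.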

Some of you might wonder why we chose not to use Chef/Puppet to manage our infrastructure and deployment, and there are a couple of good reasons we have not adopted this approach. One is that baking complete images eliminates a number of dependencies in the production environment: a master control server, a package repository, client scripts on the servers, and network permissions to talk to all of these. Another is that it guarantees that what we test in the test environment is the EXACT same thing that is deployed in production; there is very little chance of configuration or other creep/bit rot. Finally, it means that there is no way for people to change or install things in the production environment (this may seem like a really harsh restriction, but if you can build a new AMI fast enough it doesn't really make a difference).

In the cloud, we know exactly what we want a server to be, and if we want to change that we simply terminate it and launch a new server with a new AMI. This is enabled by a change in how you think about managing your resources in the cloud or a virtualized environment. Also it allows us to fail as early in the process as possible and by doing so mitigate the inherent risk in making changes.

Greg Orzell - Sr. Manager, Streaming Insight Engineering

* The reason this works is that we pass in a small set of variables, including environment, using user data. This does mean that we can find behavior differences between test and prod, and our deployment process and testing take this into account.

Tuesday, July 19, 2011

The Netflix Simian Army

We’ve talked a bit in the past about our move to the cloud and John shared some of our lessons learned in going through that transition in a previous post. Recently, we’ve been focusing on ways to improve availability and reliability and wanted to share some of our progress and thinking.

The cloud is all about redundancy and fault-tolerance. Since no single component can guarantee 100% uptime (and even the most expensive hardware eventually fails), we have to design a cloud architecture where individual components can fail without affecting the availability of the entire system. In effect, we have to be stronger than our weakest link. We can use techniques like graceful degradation on dependency failures, as well as node-, rack-, datacenter/availability-zone and even regionally-redundant deployments. But just designing a fault tolerant architecture is not enough. We have to constantly test our ability to actually survive these "once in a blue moon" failures.

Imagine getting a flat tire. Even if you have a spare tire in your trunk, do you know if it is inflated? Do you have the tools to change it? And, most importantly, do you remember how to do it right? One way to make sure you can deal with a flat tire on the freeway, in the rain, in the middle of the night is to poke a hole in your tire once a week in your driveway on a Sunday afternoon and go through the drill of replacing it. This is expensive and time-consuming in the real world, but can be (almost) free and automated in the cloud.

This was our philosophy when we built Chaos Monkey, a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact. The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables -- all the while we continue serving our customers without interruption. By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won't even notice.
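In spirit, the core of such a tool is tiny. The following is a toy illustration, not the real Chaos Monkey, of picking one running production instance at random to terminate:

```python
# Toy chaos-monkey core (illustrative only): randomly pick a running
# instance and terminate it, so recovery paths get exercised constantly.
import random

def pick_victim(instances, rng=random):
    """Choose one running instance at random, or None if none are running."""
    running = [i for i in instances if i["state"] == "running"]
    return rng.choice(running) if running else None

instances = [{"id": f"i-{n}", "state": "running"} for n in range(4)]
victim = pick_victim(instances, rng=random.Random(0))
victim["state"] = "terminated"  # the fleet must absorb this without impact
```

The interesting engineering is everything around this loop: scheduling it during business hours, scoping it to monitored environments, and building the automatic recovery that makes the termination a non-event.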

Inspired by the success of the Chaos Monkey, we’ve started creating new simians that induce various kinds of failures, or detect abnormal conditions, and test our ability to survive them; a virtual Simian Army to keep our cloud safe, secure, and highly available.

Latency Monkey induces artificial delays in our RESTful client-server communication layer to simulate service degradation and measures if upstream services respond appropriately. In addition, by making very large delays, we can simulate a node or even an entire service downtime (and test our ability to survive it) without physically bringing these instances down. This can be particularly useful when testing the fault-tolerance of a new service by simulating the failure of its dependencies, without making these dependencies unavailable to the rest of the system.
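Conceptually, latency injection is just a wrapper around a client call. The sketch below is a hypothetical illustration, not Latency Monkey's code; the sleep function is injectable so the behavior can be tested without real delays:

```python
# Hypothetical latency-injection wrapper (not Latency Monkey's real code).
# A very large delay effectively simulates the dependency being down.
import time

def with_latency(fn, delay_seconds, sleep=time.sleep):
    """Return fn wrapped so every call is preceded by an artificial delay."""
    def wrapped(*args, **kwargs):
        sleep(delay_seconds)          # simulate a degraded dependency
        return fn(*args, **kwargs)
    return wrapped

slow_lookup = with_latency(lambda key: f"value:{key}", 0.0)
print(slow_lookup("movie-42"))  # value:movie-42
```

Because only the wrapper is modified, the dependency itself stays healthy and available to every other caller, which is the property the post highlights.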

Conformity Monkey finds instances that don’t adhere to best-practices and shuts them down. For example, we know that if we find instances that don’t belong to an auto-scaling group, that’s trouble waiting to happen. We shut them down to give the service owner the opportunity to re-launch them properly.

Doctor Monkey taps into health checks that run on each instance as well as monitors other external signs of health (e.g. CPU load) to detect unhealthy instances. Once unhealthy instances are detected, they are removed from service and, after the service owners have had time to root-cause the problem, eventually terminated.

Janitor Monkey ensures that our cloud environment is running free of clutter and waste. It searches for unused resources and disposes of them.

Security Monkey is an extension of Conformity Monkey. It finds security violations or vulnerabilities, such as improperly configured AWS security groups, and terminates the offending instances. It also ensures that all our SSL and DRM certificates are valid and are not coming up for renewal.

10-18 Monkey (short for Localization-Internationalization, or l10n-i18n) detects configuration and run time problems in instances serving customers in multiple geographic regions, using different languages and character sets.

Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone. We want to verify that our services automatically re-balance to the functional availability zones without user-visible impact or manual intervention.

With the ever-growing Netflix Simian Army by our side, constantly testing our resilience to all sorts of failures, we feel much more confident about our ability to deal with the inevitable failures that we'll encounter in production and to minimize or eliminate their impact to our subscribers. The cloud model is quite new for us (and the rest of the industry); fault-tolerance is a work in progress and we have a ways to go to fully realize its benefits. Parts of the Simian Army have already been built, but much remains an aspiration -- waiting for talented engineers to join the effort and make it a reality.

Ideas for new simians are coming in faster than we can keep up and if you have ideas, we'd love to hear them! The Simian Army is one of many initiatives we've launched to put the spotlight on increasing the reliability of our service and delivering to our customers an uninterrupted stream of entertainment. If you're interested in joining the fun, check out our jobs page.

- Yury Izrailevsky, Director of Cloud & Systems Infrastructure
- Ariel Tseitlin, Director of Cloud Solutions

Friday, June 17, 2011

Upcoming Changes to the Open API Program

This is Daniel Jacobson, director of engineering, with a message to everyone in the Netflix Open API community.

We’re making some changes to the Open API program to support the Netflix focus on international streaming. Later this year, we will discontinue support for DVD-related features, degrading them gracefully through redirects or other means whenever possible. These changes will only affect the Open APIs, so your DVDs will continue to ship!

This change clears the path for us to add new features to the API to support international catalogs and languages. Eventually, we plan to expand our public developer community to other regions, allowing developers from around the world to build even more amazing apps and sites powered by the Netflix API.

During this transition, we will continue to work with our developer community to make the change as smooth as possible.

Tuesday, May 31, 2011

Netflix Performance on Top ISP Networks




Hello all, it's Ken Florance, Director of Content Delivery for Netflix. As you can see, we've updated our "Netflix Performance on Top Networks" chart showing performance on Top ISPs over the last several months.

We first published similar data in January and we were pleased with the feedback we received. Some of you said the chart was difficult to read so we’ve produced a chart that makes the distinctions between the data series more apparent.

As you can see, the familiar pattern persists, dividing cable networks on the high end from DSL networks in the lower bitrates. We're still showing the AT&T and Verizon networks’ performance as an average across their DSL and FTTx (Fiber) offerings. That's due to a limitation in how we collect data, which we will resolve soon, so you can expect the DSL and Fiber offerings of these ISPs to be represented separately in future updates.

Also of note, we've rolled Qwest under CenturyLink following the merger of those companies.

We're only publishing U.S. data this time. This data has become less significant for Canada in the wake of Netflix reducing default bitrates in Canada to help our Canadian members who are subject to low bandwidth caps.

Going forward, we will update this data once a quarter, so expect the next chart sometime in August.

As always, we appreciate your feedback.

Friday, April 29, 2011

Lessons Netflix Learned from the AWS Outage

On Thursday, April 21st, Amazon experienced a large outage in AWS US-East which they describe here. This outage was highly publicized because it took down or severely hampered a number of popular websites that depend on AWS for hosting. Our post below describes our experience at Netflix with the outage, and what we've learned from it.

Some Background
Why were some websites impacted while others were not? For Netflix, the short answer is that our systems are designed explicitly for these sorts of failures. When we re-designed for the cloud, this Amazon failure was exactly the sort of issue we wanted to be resilient to. Our architecture avoids using EBS as our main data storage service, and the SimpleDB, S3 and Cassandra services that we do depend upon were not affected by the outage.

What Went Well...
The Netflix service ran without intervention but with a higher than usual error rate and higher latency than normal through the morning, which is the low traffic time of day for Netflix streaming. We didn't see a noticeable increase in customer service calls, nor a noticeable drop in our customers' ability to find and start movies.

In some ways this Amazon issue was the first major test to see if our new cloud architecture put us in a better place than we were in 2008. It wasn't perfect, but we didn't suffer any big external outages and only had to do a small amount of scrambling internally. Some of the decisions we made along the way that contributed to our success include:

Stateless Services
One of the major design goals of the Netflix re-architecture was to move to stateless services. These services are designed such that any service instance can serve any request in a timely fashion and so if a server fails it’s not a big deal. In the failure case requests can be routed to another service instance and we can automatically spin up a new node to replace it.

Data Stored Across Zones

In cases where it was impractical to re-architect in a stateless fashion we ensure that there are multiple redundant hot copies of the data spread across zones. In the case of a failure we retry in another zone, or switch over to the hot standby.

Graceful Degradation
Our systems are designed for failure. With that in mind we have put a lot of thought into what we do when (not if) a component fails. The general principles are:

Fail Fast: Set aggressive timeouts such that failing components don’t make the entire system crawl to a halt.
Fallbacks: Each feature is designed to degrade or fall back to a lower quality representation. For example if we cannot generate personalized rows of movies for a user we will fall back to cached (stale) or un-personalized results.
Feature Removal: If a feature is non-critical then if it’s slow we may remove the feature from any given page to prevent it from impacting the member experience.
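As an illustration only (the function names and the stale cache are invented, not Netflix's implementation), the fail-fast and fallback principles above can be sketched like this:

```python
# Illustrative fail-fast + fallback sketch; names and data are invented.
from concurrent.futures import ThreadPoolExecutor

STALE_CACHE = {"rows": ["popular-now"]}  # cached, un-personalized fallback

def personalized_rows_with_fallback(fetch, timeout_seconds=0.2):
    """Fail fast on a slow dependency, then fall back to stale results."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch)
        try:
            return future.result(timeout=timeout_seconds)  # aggressive timeout
        except Exception:                                  # timeout or error
            return STALE_CACHE["rows"]                     # graceful fallback

print(personalized_rows_with_fallback(lambda: ["for-you"]))  # ['for-you']
print(personalized_rows_with_fallback(lambda: 1 / 0))        # ['popular-now']
```

The aggressive timeout is the key design choice: a slow dependency is treated exactly like a failed one, so it cannot drag the whole page down with it.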

"N+1" Redundancy
Our cloud architecture is designed with N+1 redundancy in mind. In other words, we allocate more capacity than we actually need at any point in time. This capacity gives us the ability to cope with large spikes in load caused by member activity or the ripple effects of transient failures, as well as the failure of up to one complete AWS zone. All zones are active, so we don't have one hot zone and one idle zone as used by simple master/slave redundancy. The term N+1 indicates that one extra is needed, and the larger N is, the less overhead is needed for redundancy. We spread our systems and services out as evenly as we can across three of the four zones. When provisioning capacity, we use AWS reserved instances in every zone, and reserve more than we actually use, so that we will have guaranteed capacity to allocate if any one zone fails. As a result we have higher confidence that the other zones are able to grow to pick up the excess load from a zone that is not functioning properly. This does cost money (reservations are for one to three years with an advance payment); however, this is money well spent since it makes our systems more resilient to failures.

Cloud Solutions for the Cloud
We could have chosen the simplest path into the cloud, fork-lifting our existing applications from our data centers to Amazon's and simply using EC2 as if it were nothing more than another set of data centers. However, that wouldn't have given us the same level of scalability and resiliency that we needed to run our business. Instead, we fully embraced the cloud paradigm. This meant leveraging NoSQL solutions wherever possible to take advantage of the added availability and durability that they provide, even though it meant sacrificing some consistency guarantees. In addition, we also use S3 heavily as a durable storage layer and treat all other resources as effectively transient. This does mean that we can suffer if there are S3-related issues; however, this is a system that Amazon has architected to be resilient across individual system and zone failures and it has proven to be highly reliable, degrading rather than failing.

What Didn't Go So Well...

While we did weather the storm, it wasn't totally painless. Things appeared pretty calm from a customer perspective, but we did have to do some scrambling internally. As it became clear that AWS was unlikely to resolve the issues before Netflix reached peak traffic in the early evening, we decided to manually re-assign our traffic to avoid the problematic zone. Thankfully, Netflix engineering teams were able to quickly coordinate to get this done. At our current level of scale this is a bit painful, and it is clear that we need more work on automation and tooling. As we grow from a single Amazon region and 3 availability zones servicing the USA and Canada to being a worldwide service with dozens of availability zones, even with top engineers we simply won't be able to scale our responses manually.

Manual Steps
When Amazon's Availability Zone (AZ) started failing we decided to get out of the zone all together. This meant making significant changes to our AWS configuration. While we have tools to change individual aspects of our AWS deployment and configuration they are not currently designed to enact wholesale changes, such as moving sets of services out of a zone completely. This meant that we had to engage with each of the service teams to make the manual (and potentially error prone) changes. In the future we will be working to automate this process, so it will scale for a company of our size and growth rate.

Load Balancing
Netflix uses Amazon's Elastic Load Balancing (ELB) service to route traffic to our front end services. We utilize ELB for almost all our web services. There is one architectural limitation with the service: losing a large number of servers in a zone can create a service interruption.

ELBs are set up such that load is balanced across zones first, then instances. This is because ELB is a two-tier load balancing scheme. The first tier consists of basic DNS-based round-robin load balancing. This gets a client to an ELB endpoint in the cloud that is in one of the zones that your ELB is configured to use. The second tier of the ELB service is an array of load balancer instances (provisioned directly by AWS), which does round-robin load balancing over our own instances that are behind it in the same zone.
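To make the zone-first behavior concrete, here is a toy model of two-tier round-robin routing (not ELB's actual implementation). It shows how a zone that has lost instances still receives a full third of the traffic, concentrating load on the survivors there:

```python
# Toy two-tier router: DNS round-robins across zones first, then each
# zone's balancer round-robins across whatever instances remain in it.
from itertools import cycle

zones = {"a": ["a1", "a2"], "b": ["b1", "b2"], "c": ["c1"]}  # zone c lost c2

zone_rr = cycle(sorted(zones))                     # tier 1: DNS over zones
instance_rr = {z: cycle(zones[z]) for z in zones}  # tier 2: per-zone balancer

def route():
    zone = next(zone_rr)             # traffic is split evenly by zone...
    return next(instance_rr[zone])   # ...regardless of instances left in it

hits = [route() for _ in range(12)]
# Each zone gets 4 of the 12 requests; c1 alone absorbs all of zone c's share.
print(hits.count("c1"))  # 4
```

In this model c1 handles twice the per-instance load of a1 or a2, which is exactly the cascading-failure risk described in the next paragraph.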

In the case where you are in three zones and many service instances go down, the rest of the nodes in that AZ have to pick up the slack and handle the extra load. Eventually, if you couldn't launch more nodes to bring capacity up to previous levels, you would likely suffer a cascading failure where all the nodes in the zone go down. If this happened then a third of your traffic would essentially go into a "black hole" and fail.

The net effect of this is that we have to be careful to make sure that our zones stay evenly balanced with frontend servers so that we don't serve degraded traffic from any one zone. This meant that when the outage happened last week we had to manually update all of our ELB endpoints to completely avoid the failed zone, and change the autoscaling groups behind them to do the same so that the servers would all be in the zones that were getting traffic.

It also appears that the instances used by AWS to provide ELB services were dependent on EBS backed boot disks. Several people have commented that ELB itself had a high failure rate in the affected zone.

For middle tier load balancing Netflix uses its own software load balancing service that does balance across instances evenly, independent of which zone they are in. Services using middle tier load balancing are able to handle uneven zone capacity with no intervention.

Lessons Learned
This outage gave us some valuable experience and helped us to identify and prioritize several ways to improve our systems and make them more resilient to failures.

Create More Failures
Currently, Netflix uses a service called "Chaos Monkey" to simulate service failure. Basically, Chaos Monkey is a service that kills other services. We run this service because we want engineering teams to be used to a constant level of failure in the cloud. Services should automatically recover without any manual intervention. We don't, however, simulate what happens when an entire AZ goes down, and therefore we haven't engineered our systems to automatically deal with those sorts of failures. Internally we are having discussions about doing that, and people are already starting to call this service "Chaos Gorilla".

Automate Zone Fail Over and Recovery
Relying on multiple teams using manual intervention to fail over from a failing AZ simply doesn't scale. We need to automate this process, making it a "one click" operation, that can be invoked if required.

Multiple Region Support
We are currently re-engineering our systems to work across multiple AWS regions as part of the Netflix drive to support streaming in global markets. As AWS launches more regions around the world, we want to be able to migrate the support for a country to a newly launched AWS region that is closer to our customers. This is essentially the same operation as a disaster recovery migration from one region to another, so we would have the ability to completely vacate a region if a large scale outage occurred.

Avoid EBS Dependencies
We had already decided that EBS performance was an issue, so (with one exception) we have avoided it as a primary data store. With two outages so far this year we are taking steps to further reduce dependencies by switching our stateful instances to S3 backed AMIs. These take a lot longer to create in our build process, but will make our Cassandra storage services more resilient, by depending purely on internal disks. We already have plans to migrate our one large MySQL EBS based service to Cassandra.

Conclusion
We set out to build a highly available Netflix service on AWS and this outage was a major test of that decision. While we do have some lessons learned and some improvements to make, we continue to be confident that this is the right strategy for us.

Authors:
Adrian Cockroft - Director - Architecture, ECommerce and System Engineering
Cory Hicks - Sr. Manager, Applied Personalization Algorithms
Greg Orzell - Sr. Manager, Streaming Service and Insight Engineering

Monday, April 18, 2011

“More Like This…” Building a network of similarity

Ever wondered what makes “Inception” similar to “12 Monkeys”?

Looking at similarity between movies and TV shows is a useful way to find great titles to watch. At Netflix, we calculate levels of similarity between each movie and TV show and use these “similars”, as we call them, for a variety of customer-facing and internal systems.

Hi, there. Hans Granqvist, senior algorithm engineer, here to tell you more about how we build similars. I want to share some insights into how we lifted and improved this build as we moved it from our datacenter to the cloud.

The similars build process

First a little background.

To create a network of similars, we look at thousands of facets associated with each title. A future post will go into more detail of the similarity build, but it follows this somewhat simplified process:

First we discover sets of titles that may be similar to the source title, based on the algorithm used and facets associated with each title. We then refine these sets of titles and filter them to remove unwanted matches, depending on algorithm deployment types and audiences.

After this, we dynamically weigh each facet and score it. The sum is the measure of similarity: the higher the sum, the more similar the titles.
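A minimal sketch of that scoring step, with invented facet names and weights (the real weighting is dynamic and far richer):

```python
# Illustrative facet scoring: weigh each shared facet and sum the weights.
# The higher the sum, the more similar the two titles.

FACET_WEIGHTS = {"genre": 3.0, "director": 2.0, "decade": 0.5}

def similarity(source_facets: set, candidate_facets: set) -> float:
    shared = source_facets & candidate_facets
    return sum(FACET_WEIGHTS.get(f.split(":")[0], 1.0) for f in shared)

inception = {"genre:sci-fi", "genre:thriller", "decade:2010s"}
twelve_monkeys = {"genre:sci-fi", "genre:thriller", "decade:1990s"}
print(similarity(inception, twelve_monkeys))  # 6.0 (two shared genre facets)
```

Ranking every candidate title by this sum, then truncating, yields the score-ordered list of similars that the build persists.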

Until last year, we built similars in our data center. As internal dependencies moved to the cloud, its scaling capabilities became a reality for us. We wanted to use the cloud’s more flexible deployment structure, and its inherent parallelism would let us change the build to scale linearly with the number of machines. We could increase the number of algorithms by an order of magnitude.

Shortcomings of old Datacenter build. Opportunities and challenges of the cloud.

While the old build worked well, its shortcomings were several:
  • Algorithms were defined in the code making them hard to change.
  • The datacenter was limited to a small set of machines, leading to long recalculation times (several days).
  • Longer push cycles due to code linkage and runtime dependencies.
  • Building directly on production DB structure with varying resource availability.
Moving to the cloud presented new opportunities:
  • A new architecture lets us define algorithms outside of code, using distributed stores to properly isolate and share newer versions of algorithms.
  • The cloud’s unlimited capacity (within reason) could be exercised to build massively in parallel.
  • Netflix components are now all re-architected as services. We can push new code much faster, almost instantaneously. Internal dependencies are just an API call away.
Of course, with this come challenges:
  • Remote service calls have latency. Going from nanoseconds to milliseconds makes a huge difference when you repeat it millions of times.
  • The cloud persistence layers (SimpleDB and S3) have wildly varying performance characteristics. For some searches via SimpleDB, for example, there are, surprisingly, no SLAs.
  • With hundreds of machines building simultaneously, the need to partition and properly synchronize work becomes paramount.
  • The distributed nature of the cloud environment increases the risk of failures around data stores, message bus systems, and caching layers.
Solutions

We based our new cloud architecture on a series of tasks distributed by a controller to a set of builder nodes that communicate through a set of message queues.

Each task contains information including source title and algorithm, with optional versioning. As a build node picks up these tasks off the queue, it collects the definition of the algorithm from persistent storage, converts it into a sequence of executional steps, and starts executing.

Technologies used
  • AWS Simple Queue System for communication between controller and nodes.
  • AWS SimpleDB, Amazon’s row database, to store the definitions of algorithms.
  • AWS S3, Amazon’s key/value store.
  • EV Cache, a Netflix-developed version of memcache to increase throughput.
  • A Netflix-developed persistent store mechanism that transparently chains various types of caching (local near-cache LRU cache and service-shared EV cache, for example) to S3.
Build process

The following figure shows the various components in the build process. The Controller sends tasks on an SQS instruction queue. These tasks are read by a set of Build Nodes, which read the algorithms from SimpleDB and S3, and use various data sources to calculate the set of similars. When done, the node writes the result to persistent store and signals build status back to the controller via an SQS feedback queue.


Architecture of the Similars Build process. A ‘wrapped component’ indicates the component needs to be instrumented to handle network hiccups, failures and AWS API rate limitations.
Based on their availability to process new tasks, build nodes periodically read from the instruction queue. When a message has been seen and read, SQS guarantees other nodes will not read the same message until the message visibility window expires. 

The build process spins off independent task threads for each task parsed. The first time an algorithm is seen, a builder node reads the algorithm definition and decides whether it can process the task. Newer versions of build nodes with knowledge of newer sets of data sources can co-exist with older ones using versioned messages.

If a node cannot process the task, it drops the message on the floor and relies on the SQS time out window to expire so the message becomes visible for other nodes.  The time-out window has been tuned to give a node reasonable time to process the message.

SQS guarantees only that messages arrive, not that they arrive in the order they were put on the queue. Care has to be taken to define each message as independent and idempotent.

The final step is to persist the now-calculated, score-ordered list of similars to S3.

Once the task has been performed, the node puts a feedback message on a feedback queue. The controller uses this feedback to measure build task progress and also to collect statistics on each node’s performance. Based on these statistics, the controller may change the number of builder threads for a node, how often it should read from the queue, or various other timeout and retry values for SQS, S3, and SimpleDB.
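The controller/build-node flow described above can be modeled end to end with in-process queues standing in for SQS; everything here is an illustrative simplification, not the production code:

```python
# Toy controller / build-node loop with in-process queues in place of SQS.
import queue

instruction_q, feedback_q = queue.Queue(), queue.Queue()

# Controller: enqueue one task per (title, algorithm) pair.
for title in ["Inception", "12 Monkeys"]:
    instruction_q.put({"title": title, "algorithm": "v2"})

# Build node: drain tasks, "compute" similars, report status back.
while not instruction_q.empty():
    task = instruction_q.get()
    result = {"title": task["title"], "similars": ["..."]}  # compute + persist
    feedback_q.put({"title": task["title"], "status": "done"})

print(feedback_q.qsize())  # 2 completed tasks reported
```

The real system differs in the ways the post describes: messages are batched, nodes may drop messages they cannot handle and rely on the SQS visibility timeout, and the controller tunes each node from the feedback statistics.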

Error situations and solutions

Building the system made us realize that we’re in a different reality in the cloud.

Some of the added complexity comes from writing a distributed system, where anything can fail at any given time. But some of the complexity was unexpected and we had to learn how to handle the following on a much larger scale than we initially envisioned.
  • Timeouts and slowness reading algorithms and weights from persistent store systems, each of which can rate-limit a client if it believes the client abuses the service. Once in such a restricted state, your code needs to quickly ease off. The only way to try to prevent AWS API rate limitation is to start out slow and gradually increase your activity. Restriction normally applies to the entire domain, so all clients on that domain will be restricted, not just the one client currently misusing it. We handled these issues via multiple levels of caching (using both a near cache on the builder node + application level cache to store partial results) with exponential fallback retries.
  • Timeouts and AWS API rate limitation writing to SQS. Putting messages on the queue can fail. We handle this via exponential fallback retries.
  • Inability of a node to read from SQS. Also handled via exponential fallback retries.
  • Inability of nodes to process all tasks in a message. We batch messages on SQS for both cost and performance reasons. When a node cannot process all tasks in a message, we drop the message on the floor and rely on SQS to resend it.
  • Inability of a node to process a task inside a batch message. A node may have occasional glitches or find it impossible to finish its tasks (e.g., data sources may have gone offline). We collate all failed tasks and retry on each node until the set is empty, or fail the batch message after a number of tries.
  • Timeouts and AWS API rate limitation writing to persistency layers. We handle with exponential retry.
All exponential retries typically wait in the order of 500ms, 2000ms, 8000ms, and so on, with some randomness added to avoid nodes retrying at fixed intervals. Sometimes operations have to be retried up to dozens of times.
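The retry schedule and the failed-task collation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the ±25% jitter fraction, and the round limits are ours, not the production values.

```python
import random
import time

def backoff_delays(base=0.5, factor=4, max_tries=8, jitter=0.25):
    """Yield the retry schedule (~500ms, 2000ms, 8000ms, ...) with
    random jitter so that nodes do not all retry in lockstep."""
    for attempt in range(max_tries):
        delay = base * (factor ** attempt)
        yield delay * random.uniform(1 - jitter, 1 + jitter)

def retry_with_backoff(operation, max_tries=8):
    """Run `operation` until it succeeds, sleeping with exponential
    fallback between failures; re-raise the last error if every
    attempt fails (e.g. a persistent throttling condition)."""
    delays = backoff_delays(max_tries=max_tries - 1)
    for attempt in range(max_tries):
        try:
            return operation()
        except Exception:
            if attempt == max_tries - 1:
                raise
            time.sleep(next(delays))

def process_batch(tasks, run_task, max_rounds=3):
    """Run every task in a batch message, collate the failures, and
    retry just the failed set until it is empty or the rounds run
    out. Returning False lets the caller drop the message so SQS
    redelivers it."""
    remaining = list(tasks)
    for _ in range(max_rounds):
        failed = []
        for task in remaining:
            try:
                run_task(task)
            except Exception:
                failed.append(task)
        if not failed:
            return True
        remaining = failed
    return False
```

A typical use would wrap each SQS or SimpleDB call in `retry_with_backoff`, and wrap the per-message task loop in `process_batch`.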

Conclusion

By moving our build to the cloud, we managed to cut the time it takes to calculate a network of similars from up to two weeks down to mere hours. This means we can now experiment and A/B test new algorithms much more easily.

We also now have combinatorial algorithms (algorithms defined in terms of other algorithms), and the build nodes use this fact to execute builds in dependency order. Subsequent builds pick up cached results, and we have seen dramatic speed increases.
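One way to picture that dependency-ordered, cache-reusing build is the sketch below. It is a simplified toy with hypothetical names: the real builds run distributed across nodes, and this version neither detects cycles nor persists the cache.

```python
def build_in_dependency_order(algorithms, deps, build_one, cache=None):
    """Build each requested algorithm after its dependencies,
    reusing cached results so shared sub-algorithms are built once.

    `deps` maps an algorithm name to the names it is defined in
    terms of; `build_one(name, inputs)` computes one algorithm
    given the already-built results of its dependencies.
    """
    cache = {} if cache is None else cache

    def build(name):
        if name not in cache:
            # Recursively build (or fetch from cache) every dependency first.
            inputs = {d: build(d) for d in deps.get(name, [])}
            cache[name] = build_one(name, inputs)
        return cache[name]

    return {name: build(name) for name in algorithms}
```

Because the cache is keyed by algorithm name, a combinatorial algorithm that reuses a previously built sub-network pays nothing for it on subsequent builds.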

While network builds such as these are often embarrassingly parallel, it is worth noting that the error situations come courtesy of a distributed environment where SLAs are in many cases ill-defined or absent altogether.

One key insight is that the speed with which we can build new networks is now gated by how fast we can pump the result data into the receiving permanent store. CPU and RAM are cheap and predictable cloud commodities. I/O is not.

The lessons of building this system will be invaluable as we progress into more complex processes where we take in more factors, many of which are highly temporal and real-time driven or limited to specific countries, regions, languages, and cultures.

Let us know what you think… and thanks for reading!

PS. Want to work in our group? Send us your resume and let us know why you think, or don’t think, “Inception” is similar to “12 Monkeys.”

Tuesday, March 29, 2011

NoSQL @ Netflix Talk (Part 1)

A few weeks ago, I gave the first in a series of planned talks on the topic of NoSQL @ Netflix. By now, it is widely known that Netflix has achieved something remarkable over the past 2 years – accelerated subscriber growth with an ever-improving streaming experience. In addition to streaming more titles to more devices in both the US and Canada, Netflix has moved its infrastructure, data, and applications to the AWS cloud.

In the spirit of helping others with similar needs, we are sharing our experiences with AWS and NoSQL technologies via this tech blog and several speaking appearances at conferences. Via these efforts, we hope to foster both improvements in cloud and NoSQL offerings and collaboration with open-source communities.

The NoSQL @ Netflix series specifically aims to share our recommendations on the best use of NoSQL technologies in high-traffic websites. What makes our experience unique is that we are using publicly available NoSQL and cloud technology to serve high-traffic customer-driven read-write workloads. Once again:


Netflix’s NoSQL Use-cases = public NoSQL +
public cloud +
customer traffic +
R/W workload +
high traffic conditions

The video below was loosely based on the following whitepaper. Some of the key questions addressed by the video and whitepaper are as follows:

  • What sort of data can you move to NoSQL?
  • Which NoSQL technologies are we working with?
  • How did we translate RDBMS concepts to NoSQL?
Driven by a culture that prizes curiosity and continuous improvement, Netflix is already pushing NoSQL technology and adoption further. If you would like to work with us on these technologies, have a look at our open positions.

The slides are available here.

Caption: The first 10 minutes are from sponsors and the last 30 minutes are Q & A.

Siddharth "Sid" Anand, Cloud Systems

    Tuesday, March 8, 2011

    Cloud Connect Keynote : Complexity and Freedom



    On March 8th, 2011, I was fortunate to be able to deliver 10 minutes of the keynote address for the Cloud Connect conference in Santa Clara, California. Here are some of the points I made during the talk.

    Availability

    We started this cloud re-architecture effort in 2008 in the aftermath of an outage of our DVD shipping software in August of that year. An unfortunate confluence of events caused our systems to go down. We had singleton vertically scaled databases for both our website and the nascent Netflix streaming functionality. We knew those two systems were equally vulnerable. We had to re-architect for high availability and move to a service oriented architecture spread across redundant data centers.

    Why Cloud?

    In August of 2008, there were already web-based startups that were not building data centers because they were building in the cloud. Some of those startups would grow to be as big as Netflix, and therefore Netflix gave serious consideration to building for the cloud during this re-architecture effort.

    Why AWS (Amazon Web Services)?

    Our definition of cloud is a public, shared, and multi-tenant cloud. AWS is the market leader and has been able to create a continuous and virtuous cycle. Large AWS customers demand (and receive) continuous improvements from AWS. Those improvements, in turn, attract more large customers and the cycle then repeats itself. Netflix has benefited nicely from jumping on and riding that virtuous cycle.

    Agility

    We went to the cloud looking for high availability. We found that availability, and we are happy to have found a lot of new agility as well. Our software developers and our business gained that agility by eliminating a lot of complexity.

    Essential vs. Accidental Complexity : No Silver Bullet

    In 1986, Dr. Fred Brooks of the University of North Carolina, Chapel Hill wrote his famous paper entitled 'No Silver Bullet'. The paper touches on a lot of things, but the part most relevant to this post is the contrast Brooks paints between essential complexity and accidental complexity. Essential complexity is caused by the problem to be solved, and nothing can remove it. A vital example of essential complexity at Netflix is our personalized movie recommendation system. Accidental complexity relates to problems that we create on our own and which can be fixed. One example Brooks wrote about was coding large-scale systems in assembly language in the days when adequate high-level languages were not yet viable; that accidental complexity had largely been retired by 1986, when Brooks wrote the paper.

    Accidental complexity is generational. Every new application domain repeats the cycle of early phases of accidental complexity that are eventually retired. In the mid-1990s I was writing code that parsed raw HTTP request headers. Everyone had to do that in order to write the early dynamic web applications that many of us worked on in those days.

    Building and running data centers is the accidental complexity of the 2011 generation. If you are building a data center that hosts fewer than tens of thousands of machines, then you are inviting complexity, centralized control, and process that you don't need for your business. At Netflix, recurring issues of data center space, equipment upgrades, power and cooling fire drills, and data center moves were all accidental complexities that distracted software development from our essential complexities.

    Running data centers also requires an accurate capacity forecast so the equipment needed to add capacity is racked, stacked, and tested before it is needed. For Netflix, an accurate capacity forecast requires an accurate business forecast. Netflix's good fortune has made this difficult. We started 2010 with just over 12 million subscribers and finished the year with over 20 million subscribers, far above what we predicted at the beginning of 2010. The newly added load put us at risk of running out of data center capacity. At the same time we were re-architecting for the cloud. We moved over 80% of our customer transactions, mostly for movie discovery and streaming, to the AWS cloud. The elasticity of the cloud enabled us to absorb that growth with little pain. The move to the cloud also allowed us to eliminate a lot of the centralized process required to run data centers.

    Killing Process : Freedom and Responsibility

    You may want to take a look at the Netflix Culture Deck, found at jobs.netflix.com. It talks about how we love killing process, and a lot about our value of Freedom and Responsibility. Here are two relevant sentences from the culture deck:

    1. Our model is to increase employee freedom as we grow, rather than limit it.

    2. Responsible people thrive on freedom and are worthy of freedom.

    Implementing Freedom and Responsibility in our service oriented cloud architecture means the following things:

    1. Each engineering team owns their own deployment. They push changes and re-architect when they need to without seeking widespread alignment and without a sign-off process.

    2. Software developers own capacity procurement. In the cloud, adding CPU and storage is a simple API call.

    3. We don't have a single point of control over cloud spending. We've had a few bugs that consumed extra resources, but we also had those when we had a more centralized process for adding capacity to our data center.

    Centralized process and control were needed in the past to help manage the complexity of operating our own data centers. We eliminated a lot of that complexity by moving to the cloud, and these three facets of operating in the cloud at Netflix have delivered a tremendous new agility as our business and engineering teams continue to grow.

    Availability and Agility

    We moved to the cloud looking for availability. We have also found a tremendous agility by eliminating complexity, process, and control. There was a steep learning curve, and there were moments of doubt along the way, but the end result is that Netflix software developers now have a lot more freedom to innovate and to evolve our architectures rapidly as the business continues its rapid growth. We continue to seek great talent to add to our engineering teams. I hope you'll take a look at our open positions at jobs.netflix.com.


    Thanks,

    Kevin McEntee
    VP Engineering, Systems & ECommerce