Wednesday, August 14, 2013

Deploying the Netflix API

by Ben Schmaus

As described in previous posts (“Embracing the Differences” and “Optimizing the Netflix API”), the Netflix API serves as an integration hub that connects our device UIs to a distributed network of data services.  In supporting this ecosystem, the API needs to integrate an ever-evolving set of features from these services and expose them to devices.  The faster these features can be delivered through the API, the faster they can get in front of customers and improve the user experience.

Along with the number of backend services and device types (we’re now topping 1,000 different device types), the rate of change in the system is increasing, resulting in the need for faster API development cycles. Furthermore, as Netflix has expanded into international markets, the infrastructure and team supporting the API have grown.  To meet product demands, scale the team’s development, and better manage our cloud infrastructure, we've had to adapt our process and tools for testing and deploying the API.

With the context above in mind, this post presents our approach to software delivery and some of the techniques we've developed to help us get features into production faster while minimizing risk to quality of service.

Moving Toward Continuous Delivery

Before moving on, it’s useful to draw a quick distinction between continuous deployment and continuous delivery. Per the book, if you’re practicing continuous deployment then you’re necessarily also practicing continuous delivery, but the reverse doesn’t hold true.  Continuous deployment extends continuous delivery and results in every build that passes automated test gates being deployed to production.  Continuous delivery requires an automated deployment infrastructure, but the decision to deploy is made based on business need rather than simply deploying every commit to prod.  We may pursue continuous deployment as an optimization to continuous delivery, but our current focus is to enable the latter such that any release candidate can be deployed to prod quickly, safely, and in an automated way.

To meet demand for new features and to make a growing infrastructure easier to manage, we’ve been overhauling our dev, build, test, and deploy pipeline with an eye toward continuous delivery.  Being able to deploy features as they’re developed gets them in front of Netflix subscribers as quickly as possible rather than having them “sit on the shelf.”  Deploying smaller sets of features more frequently also reduces the number of changes per deployment. This is an inherent benefit of continuous delivery and helps mitigate risk by making it easier to identify and triage problems if things go south during a deployment.

The foundational concepts underlying our delivery system are simple:  automation and insight.  By applying these ideas to our deployment pipeline we can strike an effective balance between velocity and stability.

Automation - Any process requiring people to execute manual steps repetitively will get you into trouble on a long enough timeline.  Any manual step that can be done by a human can be automated by a computer; automation provides consistency and repeatability.  It’s easy for manual steps to creep into a process over time and so constant evaluation is required to make sure sufficient automation is in place.

Insight - You can't support, understand, and improve what you can't see.  Insight applies both to the tools we use to develop and deploy the API as well as the monitoring systems we use to track the health of our running applications.  For example, being able to trace code as it flows from our SCM systems through various environments (test, stage, prod, etc.) and quality gates (unit tests, regression tests, canary, etc.) on its way to production helps us distribute deployment and ops responsibilities across the team in a scalable way.  Tools that surface feedback about the state of our pipeline and running apps give us the confidence to move fast and help us quickly identify and fix issues when things (inevitably) break.

Development & Deployment Flow

The following diagram illustrates the logical flow of code from feature inception to global deployment to production clusters across all of our AWS regions.  Each phase in the flow provides feedback about the “goodness” of the code, with each successive step providing more insight into and confidence about feature correctness and system stability.



Taking a closer look at our continuous integration and deploy flow, we have the diagram below, which pretty closely outlines the flow we follow today.  Most of the pipeline is automated, and tooling gives us insight into code as it moves from one state to another.




Branches

Currently we maintain 3 long-lived branches (though we’re exploring approaches to cut down the number of branches, with single master being a likely longer term goal) that serve different purposes and get deployed to different environments.  The pipeline is fully automated with the exception of weekly pushes from the release branch, which require an engineer to kick off the global prod deployment.

Test branch - used to develop features that may take several dev/deploy/test cycles and require integration testing and coordination of work across several teams for an extended period of time (e.g., more than a week).  The test branch gets auto-deployed to a test environment, which varies in stability over time as new features undergo development and early stage integration testing.  When a developer has a feature that’s a candidate for prod, they manually merge it to the release branch.

Release branch - serves as the basis for weekly releases.  Commits to the release branch get auto-deployed to an integration environment in our test infrastructure and a staging environment in our prod infrastructure.  The release branch is generally in a deployable state but sometimes goes through a short cycle of instability for a few days at a time while features and libraries go through integration testing.  Prod deployments from the release branch are kicked off by someone on our delivery team and are fully automated after the initial action to start the deployment.

Prod branch - when a global deployment of the release branch (see above) finishes, the release branch is merged into the prod branch, which serves as the basis for patch/daily pushes.  If a developer has a feature that's ready for prod and they don't need it to go through the weekly flow, they can commit it directly to the prod branch, which is kept in a deployable state.  Commits to the prod branch are auto-merged back to release and are auto-deployed to a canary cluster taking a small portion of live traffic.  If the result of the canary analysis phase is a “go” then the code is auto-deployed globally.
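The branch behavior described above can be summarized as a small routing table.  This is an illustrative sketch only: the names `BRANCH_TARGETS` and `on_commit` and the action strings are hypothetical and are not taken from our actual tooling.

```python
from typing import List

# Hypothetical summary of the automated actions triggered by a commit
# to each long-lived branch, as described in the text above.
BRANCH_TARGETS = {
    "test": ["deploy:test-environment"],
    "release": ["deploy:integration-environment", "deploy:staging-environment"],
    "prod": ["merge-back:release", "deploy:canary-cluster"],
}


def on_commit(branch: str, canary_go: bool = False) -> List[str]:
    """Return the automated actions triggered by a commit to `branch`."""
    if branch not in BRANCH_TARGETS:
        raise ValueError(f"unknown branch: {branch}")
    actions = list(BRANCH_TARGETS[branch])
    # Only prod-branch commits continue to a global deploy automatically,
    # and only when canary analysis returns a "go". Weekly pushes from the
    # release branch are started manually, so they are not modeled here.
    if branch == "prod" and canary_go:
        actions.append("deploy:global")
    return actions
```

For example, `on_commit("prod", canary_go=True)` ends with `"deploy:global"`, while a commit to the release branch never triggers a global deploy on its own.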

Confidence in the Canary

The basic idea of a canary is that you run new code on a small subset of your production infrastructure, for example, 1% of prod traffic, and you see how the new code (the canary) compares to the old code (the baseline).

Canary analysis used to be a manual process for us where someone on the team would look at graphs and logs on our baseline and canary servers to see how closely the metrics (HTTP status codes, response times, exception counts, load avg, etc.) matched.

Needless to say this approach doesn't scale when you're deploying several times a week to clusters in multiple AWS regions.  So we developed an automated process that compares 1000+ metrics between our baseline and canary code and generates a confidence score that gives us a sense for how likely the canary is to be successful in production.  The canary analysis process also includes an automated squeeze test for each canary Amazon Machine Image (AMI) that determines the throughput “sweet spot” for that AMI in requests per second.  The throughput number, along with server start time (instance launch to taking traffic), is used to configure auto scaling policies.

The canary analyzer generates a report for each AMI that includes the score and displays the total metric space in a scannable grid.  For commits to the prod branch (described above), canaries that get a high-enough confidence score after 8 hours are automatically deployed globally across all AWS regions.

The screenshots below show excerpts from a canary report.  
If the score is too low (< 95 generally means a "no go", as is the case with the canary below), the report helps guide troubleshooting efforts by providing a starting point for deeper investigation. This is where the metrics grid, shown below, helps out. The grid puts more important metrics in the upper left and less important metrics in the lower right.  Green means the metric correlated between baseline and canary.  Blue means the canary has a lower value for a metric ("cold") and red means the canary has a higher value than the baseline for a metric ("hot").
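To make the idea of a score and a hot/cold grid concrete, here is a minimal sketch.  It is not the canary analyzer's actual algorithm, which this post doesn't detail: the 10% relative tolerance, the unweighted score, and the names `classify` and `canary_score` are all assumptions made for illustration.

```python
from typing import Dict, Tuple


def classify(baseline: float, canary: float, tolerance: float = 0.10) -> str:
    """Classify one metric as 'ok', 'hot' (canary higher) or 'cold' (canary lower)."""
    if baseline == 0:
        if canary == 0:
            return "ok"
        return "hot" if canary > 0 else "cold"
    # Relative deviation of the canary from the baseline (assumed measure).
    deviation = (canary - baseline) / abs(baseline)
    if deviation > tolerance:
        return "hot"
    if deviation < -tolerance:
        return "cold"
    return "ok"


def canary_score(baseline: Dict[str, float], canary: Dict[str, float],
                 tolerance: float = 0.10) -> Tuple[float, Dict[str, str]]:
    """Return (score out of 100, per-metric classification) for shared metrics."""
    shared = sorted(set(baseline) & set(canary))
    if not shared:
        raise ValueError("no metrics in common")
    grid = {name: classify(baseline[name], canary[name], tolerance)
            for name in shared}
    # Assumption: every metric counts equally. A real analyzer would likely
    # weight important metrics more heavily.
    matching = sum(1 for status in grid.values() if status == "ok")
    return 100.0 * matching / len(shared), grid


baseline = {"http_5xx": 10, "latency_ms": 100, "load_avg": 2.0, "exceptions": 5}
canary = {"http_5xx": 10, "latency_ms": 105, "load_avg": 2.0, "exceptions": 9}
score, grid = canary_score(baseline, canary)
decision = "go" if score >= 95 else "no go"
```

In this toy run, three of the four metrics match, so the score is 75.0 and the decision is a "no go"; the `grid` entry for `exceptions` is `"hot"`, which is where an investigation would start.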




Along with the canary analysis report we automatically generate a source diff report, cross-linked with Jira (if commit messages contain Jira IDs), of code changes in the AMI, and a report showing library and config changes between the baseline and canary.  These artifacts increase our visibility into what’s changing between deployments.

Multi-region Deployment Automation

Over the past two years we've expanded our deployment footprint from 1 to 3 AWS regions supporting different markets around the world.  Running clusters in geographically disparate regions has driven the need for more comprehensive automation.  We use Asgard to deploy the API (with some traffic routing help from Zuul), and we’ve switched from manual deploys using the Asgard GUI to driving deployments programmatically via Asgard’s API.

The basic technique we use to deploy new code into production is the "red/black push."  Here's a summary of how it works.

1) Go to the cluster - which is a set of auto-scaling groups (ASGs) - running the application you want to update, like the API, for example.

2) Find the AMI you want to deploy, look at the number of instances running in the baseline ASG, and launch a new ASG running the selected AMI with enough instances to handle traffic levels at that time (for the new ASG we typically use the number of instances in the baseline ASG plus 10%).  When the instances in the new ASG are up and taking traffic, the new and baseline code are running side by side (i.e., "red/red").

3) Disable traffic to the baseline ASG (i.e., make it "black"), but keep the instances online in case a rollback is needed.  At this point you'll have your cluster in a "red/black" state with the baseline code being "black" and the new code being "red." If a rollback is needed, the "black" ASG still has all its instances online (just not taking traffic), so you can easily enable traffic to it and then disable the new ASG to quickly get back to your starting state.  Of course, depending on when the rollback happens you may need to adjust server capacity of the baseline ASG.

4) If the new code looks good, delete the baseline ASG and its instances altogether.  The new AMI is now your baseline.
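The four steps above can be sketched as a small in-memory model.  This does not call Asgard or AWS; the `ASG` and `Cluster` classes and the function names are hypothetical stand-ins, and the sketch skips the wait for new instances to come up and start taking traffic.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ASG:
    ami: str
    instances: int
    taking_traffic: bool = True  # True = "red", False = "black"


@dataclass
class Cluster:
    name: str
    asgs: List[ASG] = field(default_factory=list)

    def active(self) -> List[ASG]:
        """ASGs currently receiving traffic ("red")."""
        return [asg for asg in self.asgs if asg.taking_traffic]


def red_black_push(cluster: Cluster, new_ami: str,
                   headroom_percent: int = 10) -> Tuple[ASG, ASG]:
    """Steps 1-3: launch a new ASG beside the baseline, then disable the baseline."""
    active = cluster.active()
    if len(active) != 1:
        raise RuntimeError("expected exactly one ASG taking traffic")
    baseline = active[0]
    # Baseline instance count plus 10%, rounded up, using integer math.
    extra = (baseline.instances * headroom_percent + 99) // 100
    new_asg = ASG(ami=new_ami, instances=baseline.instances + extra)
    cluster.asgs.append(new_asg)      # "red/red": both ASGs take traffic
    baseline.taking_traffic = False   # "red/black": baseline stays online, idle
    return baseline, new_asg


def rollback(baseline: ASG, new_asg: ASG) -> None:
    """Re-enable traffic to the baseline first, then disable the new ASG."""
    baseline.taking_traffic = True
    new_asg.taking_traffic = False


def finalize(cluster: Cluster, baseline: ASG) -> None:
    """Step 4: delete the old baseline; the new AMI becomes the baseline."""
    cluster.asgs = [asg for asg in cluster.asgs if asg is not baseline]
```

With a baseline of 100 instances, `red_black_push` creates a new ASG of 110 instances.  Until `finalize` is called, `rollback` only flips traffic flags, which is what makes the rollback fast.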

The following picture illustrates the basic flow.



Going through the steps above manually for many clusters and regions is painful and error prone.  To make deploying new code into production easier and more reliable, we've built additional automation on top of Asgard to push code to all of our regions in a standardized, repeatable way.  Our deployment automation tooling is coded to be aware of peak traffic times in different markets and to execute deployments outside of peak times.  Rollbacks, if needed, are also automated.
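As a rough illustration of peak-time awareness, the check below decides whether a region is currently outside its peak window.  The region names, UTC offsets, and peak hours are made-up example values, not our real configuration, and the sketch ignores daylight saving time and windows that cross midnight.

```python
from datetime import datetime, time, timedelta, timezone
from typing import Dict, NamedTuple


class PeakWindow(NamedTuple):
    utc_offset_hours: int  # fixed offset of the market's local time (assumed)
    start: time            # local start of peak traffic
    end: time              # local end of peak traffic


# Hypothetical peak windows, keyed by example AWS region names.
PEAK_WINDOWS: Dict[str, PeakWindow] = {
    "us-east-1": PeakWindow(-5, time(18, 0), time(23, 0)),
    "eu-west-1": PeakWindow(0, time(18, 0), time(23, 0)),
}


def can_deploy(region: str, now_utc: datetime) -> bool:
    """Return True if `now_utc` (timezone-aware) is outside the region's peak window."""
    if now_utc.tzinfo is None:
        raise ValueError("now_utc must be timezone-aware")
    window = PEAK_WINDOWS[region]
    local_tz = timezone(timedelta(hours=window.utc_offset_hours))
    local_time = now_utc.astimezone(local_tz).time()
    in_peak = window.start <= local_time < window.end
    return not in_peak


now = datetime(2013, 8, 14, 20, 0, tzinfo=timezone.utc)
deployable = [region for region in PEAK_WINDOWS if can_deploy(region, now)]
```

At 20:00 UTC in this example, `eu-west-1` is inside its assumed evening peak and is skipped, while `us-east-1` (15:00 local) is eligible for deployment.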

Keep the Team Informed

With all this deployment activity going on, it’s important to keep the team informed about what’s happening in production.   We want to make it easy for anyone on the team to know the state of the pipeline and what’s running in prod, but we also don’t want to spam people with messages that are filtered out of sight.  To complement our dashboard app, we run an XMPP bot that sends a message to our team chatroom when new code is pushed to a production cluster.  The bot sends a message when a deployment starts and when it finishes.  The topic of the chatroom has info about the most recent push event and the bot maintains a history of pushes that can be accessed by talking to the bot.
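The bot's behavior can be sketched independently of any chat library.  In the sketch below, `send` and `set_topic` are hypothetical callables standing in for whatever the XMPP client provides, and the message format is invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, List


@dataclass
class PushEvent:
    cluster: str
    ami: str
    status: str  # "started" or "finished"
    at: datetime


class PushNotifier:
    """Announces push events to a chatroom and keeps a queryable history."""

    def __init__(self, send: Callable[[str], None],
                 set_topic: Callable[[str], None]) -> None:
        self._send = send            # posts a message to the team chatroom
        self._set_topic = set_topic  # updates the chatroom topic
        self.history: List[PushEvent] = []

    def record(self, cluster: str, ami: str, status: str, at: datetime) -> None:
        if status not in ("started", "finished"):
            raise ValueError(f"unexpected status: {status}")
        self.history.append(PushEvent(cluster, ami, status, at))
        text = f"Push {status}: {ami} -> {cluster}"
        self._send(text)
        # The topic always reflects the most recent push event.
        self._set_topic(text)

    def recent(self, limit: int = 5) -> List[PushEvent]:
        """Return up to `limit` of the most recent events, oldest first."""
        return self.history[-limit:]
```

Passing plain callables keeps the notifier easy to exercise without a chat server; for example, `PushNotifier(messages.append, topics.append)` collects output into two lists.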



Move Fast, Fail Fast (and Small)

Our goal is to provide the ability for any engineer on the team to easily get new functionality running in production while keeping the larger team informed about what’s happening, and without adversely affecting system stability.  By developing comprehensive deployment automation and exposing feedback about the pipeline and code flowing through it we’ve been able to deploy more easily and address problems earlier in the deploy cycle.  Failing on a few canary machines is far superior to having a systemic failure across an entire fleet of servers.  

Even with the best tools, building software is hard work.  We're constantly looking at what's hard to do and experimenting with ways to make it easier.  Ultimately, great software is built by great engineering teams.  If you're interested in helping us build the Netflix API, take a look at some of our open roles.