Saturday, August 13, 2011

Building with Legos

In the six years that I have been involved in building and releasing software here at Netflix, the process has evolved and improved significantly. When I started, we would build a WAR, get it set up and tested on a production host, and then run a script that stopped Tomcat on the target host, rsynced the directory structure into place, and started Tomcat again. Each host was pushed to manually using this process, and even with very few hosts this took quite some time and a lot of human interaction, every step of which was a chance for a mistake.

Our next iteration was an improvement in automation, but not really in architecture. We created a web-based tool that handled stopping and starting things as well as copying the new code into place and extracting it. This meant that people could push to a number of servers at once just by selecting checkboxes. The checks that servers were back up before proceeding could also be automated, with failsafes built into the tool.
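
To make the failsafe idea concrete, here is a minimal sketch of a push loop that halts when a host does not come back. It is not the actual tool: `push_to_hosts`, `deploy`, `is_healthy` and `max_failures` are all hypothetical names, and the sketch pushes hosts one after another rather than in parallel.

```python
def push_to_hosts(hosts, deploy, is_healthy, max_failures=0):
    """Push to each selected host in turn.

    `deploy` and `is_healthy` are hypothetical callables supplied by the
    caller: one performs the stop/copy/start cycle on a host, the other
    reports whether the host came back up afterwards.
    """
    pushed, failed = [], []
    for host in hosts:
        deploy(host)
        if is_healthy(host):
            pushed.append(host)
        else:
            failed.append(host)
            if len(failed) > max_failures:
                # Failsafe: stop here and leave the remaining hosts untouched.
                break
    return pushed, failed
```

With the default `max_failures=0`, a single unhealthy host stops the push, so a bad build takes down at most one server instead of every box that was checked.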

When we started migrating our systems to the cloud, we took the opportunity to revisit our complete build pipeline, looking both at how we could leverage the cloud paradigm and at the current landscape of build tools. The result was essentially a complete rewrite of how the pipeline functioned, built on a suite of tools that were rapidly maturing (Ivy, Artifactory, Jenkins, AWS).

The key advance was using our continuous build system to build not only the artifact from source code, but the complete software stack, all the way up to a deployable image in the form of an AMI (Amazon Machine Image, the image format for AWS EC2). The "classic" part of the build job does the following: build the artifact, publish it to Artifactory, build the package, and publish the package to the repo. A follow-on job then mounts a base OS image, installs the packages, and creates the final AMI. Another important point is that we do all of this in our test environment only. When we need to move a built AMI into production, we simply change the permissions on the AMI to allow it to be booted in production*.
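
The stages above can be sketched as an ordered list that stops at the first failure, followed by the permission change that promotes an image. This is an illustration only: `run_pipeline`, `run_step` and `launch_permission_request` are hypothetical names. The sketch also assumes that test and production are separate AWS accounts and that the permission change is made through boto3's EC2 `modify_image_attribute` call. Neither assumption is confirmed by anything in this post.

```python
# Stage names follow the post; how each stage runs is left to the caller.
CLASSIC_JOB = [
    "build artifact",
    "publish artifact to Artifactory",
    "build package",
    "publish package to repo",
]
BAKE_JOB = [
    "mount base OS image",
    "install packages",
    "create AMI",
]


def run_pipeline(steps, run_step):
    """Run steps in order and stop at the first one that fails.

    `run_step` is a hypothetical callable returning True on success.
    Returns the list of steps that completed.
    """
    done = []
    for step in steps:
        if not run_step(step):
            break
        done.append(step)
    return done


def launch_permission_request(ami_id, prod_account_id):
    """Build the arguments for letting another account boot an AMI.

    The dict has the shape accepted by boto3's EC2 client method
    modify_image_attribute; using boto3 at all is an assumption here.
    """
    return {
        "ImageId": ami_id,
        "LaunchPermission": {"Add": [{"UserId": prod_account_id}]},
    }
```

Under those assumptions, promotion would be a single call such as `ec2.modify_image_attribute(**launch_permission_request(ami_id, account_id))`. Nothing is rebuilt or copied, so the image that boots in production is the same one that was tested.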

Some of you might wonder why we chose not to use Chef/Puppet to manage our infrastructure and deployment. There are a few good reasons we bake everything into the AMI instead. One is that baking eliminates a number of dependencies in the production environment: a master control server, a package repository, client scripts on the servers, and the network permissions needed to talk to all of these. Another is that it guarantees that what we test in the test environment is the EXACT same thing that is deployed in production; there is very little chance of configuration drift or other creep/bit rot. Finally, it means that there is no way for people to change or install things in the production environment (this may seem like a really harsh restriction, but if you can build a new AMI fast enough it doesn't really make a difference).

In the cloud, we know exactly what we want a server to be, and if we want to change that we simply terminate it and launch a new server with a new AMI. This is made possible by a change in how you think about managing your resources in a cloud or virtualized environment. It also allows us to fail as early in the process as possible, which mitigates the inherent risk in making changes.
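
As a rough illustration of replacing servers rather than changing them, the sketch below models a fleet in memory. `Server`, `launch` and `roll_fleet` are hypothetical names, and no real cloud API is called. The frozen dataclass stands in for the rule that a running server is never modified in place.

```python
from dataclasses import dataclass
from itertools import count

_next_id = count(1)  # hypothetical ID source for the in-memory model


@dataclass(frozen=True)  # frozen: a server's AMI cannot be changed after launch
class Server:
    server_id: str
    ami_id: str


def launch(ami_id):
    """Stand-in for launching a new server from an AMI."""
    return Server(f"i-{next(_next_id):04d}", ami_id)


def roll_fleet(fleet, new_ami):
    """Replace every server that is not already running `new_ami`.

    Returns the new fleet and the IDs of the servers to terminate.
    """
    new_fleet, terminated = [], []
    for server in fleet:
        if server.ami_id == new_ami:
            new_fleet.append(server)
        else:
            new_fleet.append(launch(new_ami))
            terminated.append(server.server_id)
    return new_fleet, terminated
```

Rolling back has the same shape: call `roll_fleet` again with the previous AMI.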

Greg Orzell - Sr. Manager, Streaming Insight Engineering

* The reason this works is that we pass in a small set of variables, including the environment, using user data. This does mean that behavior can differ between test and prod, and our deployment process and testing take this into account.
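
To show how one image can behave differently per environment, here is a minimal sketch that reads settings from user data. The `KEY=VALUE` format, the `ENVIRONMENT` key, and the endpoint table are all assumptions made for illustration; the post does not say how the variables are encoded. On EC2, an instance can read its own user data from `http://169.254.169.254/latest/user-data`, so the sketch takes the text as an argument rather than fetching it.

```python
def parse_user_data(text):
    """Parse KEY=VALUE lines; blank lines and '#' comments are ignored."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected KEY=VALUE, got {line!r}")
        values[key.strip()] = value.strip()
    return values


# Hypothetical per-environment settings baked into the image.
ENDPOINTS = {
    "test": "https://db.test.example.invalid",
    "prod": "https://db.prod.example.invalid",
}


def database_url(user_data_text):
    """Pick the endpoint for whichever environment the instance booted in."""
    env = parse_user_data(user_data_text)["ENVIRONMENT"]
    return ENDPOINTS[env]
```

Because the only thing that varies is this small set of values, any code path that branches on them is where test and prod can diverge, and those branches are what need targeted testing.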