Friday, June 15, 2012

Netflix Operations: Part I, Going Distributed

Running the Netflix Cloud

Moving to the cloud presented new challenges for us[1] and forced us to develop new design patterns for running a reliable and resilient distributed system[2].  We’ve focused many of our past posts on the technical hurdles we overcame to run successfully in the cloud.  However, we had to make operational and organizational transformations as well.  We want to share the way we think about operations at Netflix to help others going through a similar journey.  In putting this post together, we realized there’s so much to share that we decided to make this the first in a series of posts on operations at Netflix.

The old guard

When we were running out of our data center, Netflix was a large monolithic Java application running inside of a Tomcat container.  Every two weeks, the deployment train left at exactly the same time, and anyone wanting to deploy a production change needed to have their code checked in and tested before departure time.  This also meant that anyone could check in bad code and bring the entire train to a halt while the issue was diagnosed and resolved.  Deployments were heavy and risk-laden and, because of all the moving parts going into each one, they were handled by a centralized team that was part of ITOps.

Production support was similarly centralized within ITOps.  We had a traditional NOC that monitored charts and graphs and was called when a service interruption occurred.  They were organizationally separate from the development team.  More importantly, there was a large cultural divide between the operations and development teams because of the mismatched goals of site uptime versus features and velocity of innovation.

Built for scale in the cloud

In moving to the cloud, we saw an opportunity to recast the mold for how we build and deploy our software.  We used the cloud migration to re-architect our system into a service-oriented architecture with hundreds of individual services.  Each service could be revved on its own deployment schedule, often weekly, empowering each team to deliver innovation at its own desired pace.  We unwound the centralized deployment team and distributed the function into the teams that owned each service.

Post-deployment support was similarly distributed as part of the cloud migration.  The NOC was disbanded and a new Site Reliability Engineering team was created within the development organization not to operate the system, but to provide system-wide analysis and development around reliability and resiliency.

The road to a distributed future

As the scale of web applications has grown over time due to the addition of features and growth of usage, application architecture has changed radically.  A number of things exemplify this: service-oriented architecture, eventually consistent data stores, map-reduce, etc.  What they all share is a distributed architecture that involves numerous applications, servers and interconnections.  For Netflix this meant moving from a few teams checking code into a large monolithic application running on tens of servers to having tens of engineering teams developing hundreds of component services that run on thousands of servers.

The Netflix distributed system

As all of these changes occurred on the engineering side, we had to modify the way that we think about and organize for operations as well.  Our approach has been to make operations itself a distributed system.  Each engineering team is responsible for coding, testing and operating its systems in the production environment.  The result is that each team develops the expertise in the operational areas it most needs and then we leverage that knowledge across the organization.  Some argue that requiring developers to understand how to operate, monitor and improve the resiliency of their applications in production distracts them from their “real” work.  However, our experience has been that the added ownership actually leads to more robust applications and greater agility than centralizing these efforts.  For example, we've found that making the developers responsible for fixing their own code at 4am has encouraged them to write more robust code that handles failure gracefully, as a way of avoiding getting another 4am call.  Our developers get more work done more quickly than before.

As we grew, it quickly became clear that centralized operations was not well suited for our new use case.  Our production environment is too complex for any one team or organization to understand well, which meant that a central operations team was forced either to make ill-informed decisions based on its perceptions or to get caught in a game of telephone tag with different development teams.  We also didn’t want our engineering teams to be tightly coupled to each other when they were making changes to their applications.

In addition to distributing the operations experience throughout development, we also heavily invested in tools and automation.  We created a number of engineering teams to focus on high volume monitoring and event correlation, end-to-end continuous integration and builds, and automated deployment tools[3][4].  These tools are critical to limiting the amount of extra work developers must do in order to manage the environment while also providing them with the information that they need to make smart decisions about when and how they deploy, vet and diagnose issues with each new code deployment.

Our architecture and code base evolved and adapted to our new cloud-based environment.  Our operations have evolved as well.  Both aim to be distributed and scalable.  Once a centralized function, operations is now distributed throughout the development organization.  Successful operations at Netflix is a distributed system, much like our software, that relies on algorithms, tools, and automation to scale to meet the demands of our ever-growing user base and product.

In future posts, we’ll explore the various aspects of operations at Netflix in more depth.  If you want to see any area explored in more detail, comment below or tweet us.

If you’re passionate about building and running massive-scale web applications, we’re always looking for amazing developers and site reliability engineers.  See all our open positions at

- Ariel Tseitlin (@atseitlin), Director of Cloud Solutions
- Greg Orzell (@chaossimia), Cloud & Platform Engineering Architect