Tuesday, April 18, 2017

The Evolution of Container Usage at Netflix

Containers are already adding value to our proven, globally available cloud platform based on Amazon EC2 virtual machines.  We’ve shared pieces of Netflix’s container story in the past (video, slides), but this blog post will discuss containers at Netflix in depth.  As part of this story, we will cover Titus: Netflix’s infrastructural foundation for container-based applications.  Titus provides Netflix-scale cluster and resource management, as well as container execution with deep Amazon EC2 integration and support for common Netflix infrastructure.


This month marks two major milestones for containers at Netflix.  First, we have achieved a new level of scale, crossing one million containers launched per week.  Second, Titus now supports services that are part of our streaming service customer experience.  We will dive deeper into what we have done with Docker containers as well as what makes our container runtime unique.

History of Container Growth

Amazon’s virtual machine based infrastructure (EC2) has been a powerful enabler of innovation at Netflix.  In addition to virtual machines, we’ve also chosen to invest in container-based workloads for a few unique values they provide.  The benefits, the excitement and the explosive growth in container usage among our developers have surprised even us.


While EC2 supported advanced scheduling for services, this didn’t help our batch users.  At Netflix, a significant set of users run jobs on time- or event-based triggers to analyze data, perform computations and then emit results to Netflix services, users and reports.  We run workloads such as machine learning model training, media encoding, continuous integration testing, big data notebooks and CDN deployment analysis jobs many times each day.  We wanted to provide a common resource scheduler for container-based applications, independent of workload type, that could be controlled by higher level workflow schedulers.  Titus serves as a combination of a common deployment unit (Docker image) and a generic batch job scheduling system.  The introduction of Titus has helped Netflix expand to support the growing batch use cases.


With Titus, our batch users are able to put together sophisticated infrastructure quickly because they only have to specify resource requirements.  Users no longer have to deal with choosing and maintaining AWS EC2 instance sizes that don’t always perfectly fit their workload.  Users trust Titus to pack larger instances efficiently across many workloads.  Batch users develop code locally and then immediately schedule it for scaled execution on Titus.  Using containers, Titus runs any batch application, letting the user specify exactly what application code and dependencies are needed.  For example, in machine learning training we have users running a mix of Python, R, Java and bash script applications.


Beyond batch, we saw an opportunity to bring the benefits of simpler resource management and a local development experience to other workloads.  In working with our Edge, UI and device engineering teams, we realized that service users were the next audience.  Today, we are in the process of rebuilding how we deploy device-specific server-side logic to our API tier, leveraging single-core optimized NodeJS servers.  Our UI and device engineers wanted a better development experience, including a simpler local test environment that was consistent with the production deployment.


In addition to a consistent environment, containers let developers push new application versions faster than before by leveraging Docker layered images and pre-provisioned virtual machines ready for container deployments.  Deployments using Titus can now be done in one to two minutes, versus the tens of minutes we grew accustomed to with virtual machines.


The theme that underlies all these improvements is developer innovation velocity.  Both batch and service users can now experiment locally and test more quickly.  They can also deploy to production with greater confidence than before.  This velocity drives how fast features can be delivered to Netflix customers and therefore is a key reason why containers are so important to our business.

Titus Details

We have already covered what led us to build Titus.  Now, let’s dig into the details of how Titus provides these values.  We will provide a brief overview of how Titus scheduling and container execution support the service and batch job requirements, as shown in the diagram below.


[Diagram: Titus scheduling and container execution for service and batch jobs]


Titus handles the scheduling of applications by matching required resources against available compute resources.  Titus supports both service jobs that run “forever” and batch jobs that run “until done”.  Service jobs restart failed instances and are autoscaled to keep up with changing levels of load.  Batch jobs are retried according to policy and run to completion.
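To make the difference between the two job types concrete, here is a minimal sketch of the decision a scheduler could make when a container exits.  This is not Titus code: the names `JobType`, `Job`, `max_retries` and `next_action` are hypothetical, and the retry policy (a simple retry limit) is an assumption chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class JobType(Enum):
    SERVICE = "service"  # runs "forever"
    BATCH = "batch"      # runs "until done"


@dataclass
class Job:
    job_type: JobType
    max_retries: int = 0  # hypothetical batch retry policy: a plain retry limit


def next_action(job: Job, exit_ok: bool, attempts: int) -> str:
    """Decide what to do after a container exits.

    `attempts` counts the runs so far, including the one that just ended.
    """
    if job.job_type is JobType.SERVICE:
        # Assumption: a service instance is replaced whenever it exits.
        return "restart"
    if exit_ok:
        return "complete"
    if attempts <= job.max_retries:
        return "retry"
    return "fail"


if __name__ == "__main__":
    batch = Job(JobType.BATCH, max_retries=2)
    print(next_action(batch, exit_ok=False, attempts=1))
    print(next_action(Job(JobType.SERVICE), exit_ok=False, attempts=7))
```

A real policy would likely add backoff between retries and distinguish failure causes; the sketch only shows where the two job types diverge.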


Titus offers multiple SLAs for resource scheduling.  Titus offers on-demand capacity for ad hoc batch and non-critical internal services by autoscaling capacity in EC2 based on current needs.  Titus also offers pre-provisioned guaranteed capacity for user facing workloads and more critical batch jobs.  The scheduler does both bin packing for efficiency across larger virtual machines and anti-affinity for reliability spanning virtual machines and availability zones.  The foundation of this scheduling is a Netflix open source library called Fenzo.
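The interplay between bin packing and anti-affinity can be illustrated with a small sketch.  It does not use Fenzo (a Java library) or its API; `Host`, `place` and the `spread` flag are hypothetical names, and the scoring rules are simplifying assumptions: pack onto the fullest host that still fits, unless the job asks to be spread, in which case prefer the zone and host with the fewest tasks of that job.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    zone: str
    cpus: float
    mem_gb: float
    used_cpus: float = 0.0
    used_mem_gb: float = 0.0
    jobs: list = field(default_factory=list)  # job ids of tasks placed here

    def fits(self, cpus: float, mem_gb: float) -> bool:
        return (self.used_cpus + cpus <= self.cpus
                and self.used_mem_gb + mem_gb <= self.mem_gb)

    def utilization(self) -> float:
        # Simplification: score by the more constrained of CPU and memory.
        return max(self.used_cpus / self.cpus, self.used_mem_gb / self.mem_gb)


def place(hosts, job_id, cpus, mem_gb, spread=False):
    """Pick a host for one task and record the placement; None if nothing fits."""
    candidates = [h for h in hosts if h.fits(cpus, mem_gb)]
    if not candidates:
        return None

    def tasks_in_zone(zone):
        return sum(h.jobs.count(job_id) for h in hosts if h.zone == zone)

    def score(h):
        packing = -h.utilization()  # fuller hosts sort first (bin packing)
        if spread:
            # Anti-affinity: fewest same-job tasks per zone, then per host.
            return (tasks_in_zone(h.zone), h.jobs.count(job_id), packing)
        return (packing,)

    best = min(candidates, key=score)
    best.used_cpus += cpus
    best.used_mem_gb += mem_gb
    best.jobs.append(job_id)
    return best


if __name__ == "__main__":
    hosts = [Host("a", "zone-1", 32, 244), Host("b", "zone-1", 32, 244),
             Host("c", "zone-2", 32, 244)]
    for _ in range(2):
        print("batch ->", place(hosts, "batch-job", 4, 16).name)
    for _ in range(3):
        print("service ->", place(hosts, "svc-job", 2, 8, spread=True).name)
```

In this sketch anti-affinity is a soft preference that outranks packing; a production scheduler would typically also support hard constraints and weigh several fitness criteria against each other.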


Titus’s container execution, which runs on top of EC2 VMs, integrates with both AWS and Netflix infrastructure.  We expect users to use both virtual machines and containers for a long time to come, so we decided that we wanted the cloud platform and operational experiences to be as similar as possible.  In using AWS, we chose to deeply leverage existing EC2 services.  We used Virtual Private Cloud (VPC) for routable IPs rather than a separate network overlay.  We leveraged Elastic Network Interfaces (ENIs) to ensure that all containers had application specific security groups.  Titus provides a metadata proxy that enables containers to get a container specific view of their environment as well as IAM credentials.  Containers do not see the host’s metadata (e.g., IP, hostname, instance-id).  We implemented multi-tenant isolation (CPU, memory, disk, networking and security) using a combination of Linux, Docker and our own isolation technology.


For containers to be successful at Netflix, we needed to integrate them seamlessly into our existing developer tools and operational infrastructure.  For example, Netflix already had a solution for continuous delivery: Spinnaker.  While it might have been possible to implement rolling updates and other CI/CD concepts in our scheduler, delegating this feature set to Spinnaker allowed our users to have a consistent deployment tool across both virtual machines and containers.  Another example is service to service communication.  We avoided reimplementing service discovery and service load balancing.  Instead, we provided a full IP stack enabling containers to work with existing Netflix service discovery and DNS (Route 53) based load balancing.  In each of these examples, a key to the success of Titus was deciding what Titus would not do, leveraging the full value other infrastructure teams provide.


Using existing systems comes at the cost of augmenting these systems to work with containers in addition to virtual machines.  Beyond the examples above, we had to augment our telemetry, performance autotuning, healthcheck systems, chaos automation, traffic control, regional failover support, secret management and interactive system access.  An additional cost is that tying into each of these Netflix systems has also made it difficult to leverage other open source container solutions that provide more than the container runtime platform.


Running a container platform at our level of scale (with this diversity of workloads) requires a significant focus on reliability.  It also uncovers challenges in all layers of the system.  We’ve dealt with scalability and reliability issues in the Titus specific software as well as the open source we depend on (Docker Engine, Docker Distribution, Apache Mesos, Snap and Linux).  We design for failure at all levels of our system, including reconciliation that drives consistency between the distributed state held by our resource management layer and by the container runtime.  By measuring clear service level objectives (container launch start latency, percentage of containers that crash due to issues in Titus, and overall system API availability) we have learned to balance our investment between reliability and functionality.
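As a rough illustration of what such reconciliation involves, the sketch below compares the tasks the resource management layer believes are running with the tasks the container runtime actually reports.  It is a simplified assumption of the idea, not Titus’s implementation; the function `reconcile`, the action names and the task and host identifiers are all hypothetical.

```python
def reconcile(expected: dict, observed: dict) -> dict:
    """Compare two views of running tasks, each a mapping of task id -> host.

    `expected` is what the resource management layer believes;
    `observed` is what the container runtime reports.
    """
    # Tasks the scheduler expects but no host reports: treat them as lost.
    lost = sorted(t for t in expected if t not in observed)
    # Containers running without a matching scheduler record: clean them up.
    orphaned = sorted(t for t in observed if t not in expected)
    # Tasks known to both sides but reported on a different host.
    misplaced = sorted(t for t in expected
                       if t in observed and expected[t] != observed[t])
    return {"mark_lost": lost, "kill": orphaned, "investigate": misplaced}


if __name__ == "__main__":
    expected = {"task-1": "host-a", "task-2": "host-b", "task-3": "host-c"}
    observed = {"task-1": "host-a", "task-3": "host-d", "task-9": "host-b"}
    print(reconcile(expected, observed))
```

Running a comparison like this periodically lets each side recover from missed or reordered status updates instead of assuming every message is delivered.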


A key part of how containers help engineers become more productive is through developer tools.  The developer productivity tools team built a local development tool called Newt (Netflix Workflow Toolkit).  Newt helps simplify container development, both for local iteration and for onboarding to Titus.  Having a consistent container environment between Newt and Titus helps developers deploy with confidence.

Current Titus Usage

We run several Titus stacks across multiple test and production accounts across the three Amazon regions that power the Netflix service.


When we started Titus in December of 2015, we launched a few thousand containers per week across a handful of workloads.  Last week, we launched over one million containers.  These containers represented hundreds of workloads.  This 1000X increase in container usage happened in a little over a year, and growth doesn’t look to be slowing down.


We run a peak of 500 r3.8xl instances in support of our batch users.  That represents 16,000 cores of compute with 120 TB of memory.  We also added support for GPUs as a resource type using p2.8xl instances to power deep learning with neural nets and mini-batch.
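The totals above follow from the per-instance size.  The short calculation below assumes the published EC2 specification for r3.8xlarge (32 vCPUs and 244 GiB of memory per instance), which the post itself does not state; the constant names are illustrative.

```python
# Assumed per-instance figures for r3.8xlarge (from EC2's published specs).
INSTANCES = 500
VCPUS_PER_INSTANCE = 32
MEM_GIB_PER_INSTANCE = 244

total_vcpus = INSTANCES * VCPUS_PER_INSTANCE      # cores across the batch fleet
total_mem_gib = INSTANCES * MEM_GIB_PER_INSTANCE  # memory in GiB
total_mem_tib = total_mem_gib / 1024              # memory in TiB

if __name__ == "__main__":
    print(f"{total_vcpus} vCPUs, {total_mem_gib} GiB (about {total_mem_tib:.0f} TiB)")
```

That works out to 16,000 vCPUs and roughly 119 TiB of memory, in line with the rounded figure of 120 TB.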


In the early part of 2017, our stream-processing-as-a-service team decided to leverage Titus to enable simpler and faster cluster management for their Flink-based system.  This usage has resulted in over 10,000 service job containers that are long running and re-deployed as stream processing jobs are changed.  These and other services use thousands of m4.4xl instances.


While the above use cases are critical to our business, issues with these containers do not impact Netflix customers immediately.  That has changed as Titus containers recently started running services that satisfy Netflix customer requests.


Supporting customer facing services is not a challenge to be taken lightly.  We’ve spent the last six months duplicating live traffic between virtual machines and containers.  We used this duplicated traffic to learn how to operate the containers and validate our production readiness checklists.  This diligence gave us the confidence to move forward with such a large change in our infrastructure.

The Titus Team

One key to the success of Titus at Netflix has been the experience and growth of the Titus development team.  Our container users trust the team to keep Titus operational and evolving with their needs.


We are not done growing the team yet.  We are looking to expand the container runtime as well as our developer experience.  If working on container-focused infrastructure excites you and you’d like to be part of the future of Titus, check out our jobs page.



On behalf of the entire Titus development team