Tuesday, April 18, 2017

The Evolution of Container Usage at Netflix

Containers are already adding value to our proven, globally available cloud platform based on Amazon EC2 virtual machines.  We’ve shared pieces of Netflix’s container story in the past (video, slides), but this blog post will discuss containers at Netflix in depth.  As part of this story, we will cover Titus: Netflix’s infrastructural foundation for container-based applications.  Titus provides Netflix-scale cluster and resource management as well as container execution with deep Amazon EC2 integration and common Netflix infrastructure enablement.


This month marks two major milestones for containers at Netflix.  First, we have achieved a new level of scale, crossing one million containers launched per week.  Second, Titus now supports services that are part of our streaming service customer experience.  We will dive deeper into what we have done with Docker containers as well as what makes our container runtime unique.

History of Container Growth

Amazon’s virtual machine based infrastructure (EC2) has been a powerful enabler of innovation at Netflix.  In addition to virtual machines, we’ve also chosen to invest in container-based workloads for a few unique values they provide.  The benefits, excitement and explosive usage growth of containers among our developers have surprised even us.


While EC2 supported advanced scheduling for services, this didn’t help our batch users.  At Netflix there is a significant set of users that run jobs on time- or event-based triggers to analyze data, perform computations and then emit results to Netflix services, users and reports.  We run workloads such as machine learning model training, media encoding, continuous integration testing, big data notebooks and CDN deployment analysis jobs many times each day.  We wanted to provide a common resource scheduler for container-based applications, independent of workload type, that could be controlled by higher level workflow schedulers.  Titus serves as a combination of a common deployment unit (Docker image) and a generic batch job scheduling system.  The introduction of Titus has helped Netflix support these growing batch use cases.


With Titus, our batch users are able to put together sophisticated infrastructure quickly because they only have to specify their resource requirements.  Users no longer have to deal with choosing and maintaining AWS EC2 instance sizes that don’t always perfectly fit their workload.  Instead, they trust Titus to pack larger instances efficiently across many workloads.  Batch users develop code locally and then immediately schedule it for scaled execution on Titus.  Because it uses containers, Titus can run any batch application, letting the user specify exactly what application code and dependencies are needed.  For example, in machine learning training we have users running a mix of Python, R, Java and bash script applications.


Beyond batch, we saw an opportunity to bring the benefits of simpler resource management and a local development experience for other workloads.  In working with our Edge, UI and device engineering teams, we realized that service users were the next audience.  Today, we are in the process of rebuilding how we deploy device-specific server-side logic to our API tier leveraging single core optimized NodeJS servers.  Our UI and device engineers wanted a better development experience, including a simpler local test environment that was consistent with the production deployment.


In addition to a consistent environment, containers let developers push new application versions faster than before by leveraging Docker layered images and pre-provisioned virtual machines that are ready for container deployments.  Deployments using Titus can now be done in one to two minutes, versus the tens of minutes we grew accustomed to with virtual machines.


The theme that underlies all these improvements is developer innovation velocity.  Both batch and service users can now experiment locally and test more quickly.  They can also deploy to production with greater confidence than before.  This velocity drives how fast features can be delivered to Netflix customers and therefore is a key reason why containers are so important to our business.

Titus Details

We have already covered what led us to build Titus.  Now, let’s dig into the details of how Titus provides these values.  We will provide a brief overview of how Titus scheduling and container execution support the service and batch job requirements, as shown in the diagram below.


[Diagram: Titus scheduling and container execution for service and batch jobs]


Titus handles the scheduling of applications by matching required resources and available compute resources.  Titus supports both service jobs that run “forever” and batch jobs that run “until done”.  Service jobs restart failed instances and are autoscaled to maintain a changing level of load.  Batch jobs are retried according to policy and run to completion.  
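
To make the distinction concrete, here is a purely illustrative sketch of how the two job types might be specified; the field names below are hypothetical and are not the actual Titus API:

# Purely illustrative job definitions; field names are hypothetical,
# not the actual Titus API.
service_job = {
    "type": "service",                                     # runs "forever"
    "instances": {"min": 10, "desired": 20, "max": 50},    # autoscaled with load
    "restartOnFailure": True,                              # failed instances are replaced
    "resources": {"cpu": 1, "memoryMB": 4096, "networkMbps": 128},
}

batch_job = {
    "type": "batch",                                       # runs "until done"
    "instances": 100,                                      # a fixed amount of parallel work
    "retryPolicy": {"retries": 3, "backoffSeconds": 60},   # retried according to policy
    "resources": {"cpu": 4, "memoryMB": 16384, "gpu": 0},
}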


Titus offers multiple SLAs for resource scheduling.  It offers on-demand capacity for ad hoc batch jobs and non-critical internal services by autoscaling capacity in EC2 based on current needs.  It also offers pre-provisioned, guaranteed capacity for user facing workloads and more critical batch jobs.  The scheduler does both bin packing for efficiency across larger virtual machines and anti-affinity for reliability, spanning virtual machines and availability zones.  The foundation of this scheduling is a Netflix open source library called Fenzo.
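
As a rough illustration of those two goals (this is a conceptual sketch, not Fenzo’s actual API), a scheduler can fill partially used machines first while spreading replicas of the same job across availability zones:

# Conceptual sketch of bin packing plus zone anti-affinity; not Fenzo's API.
def schedule(tasks, machines):
    """tasks: dicts with 'id', 'job', 'cpu'; machines: dicts with 'name', 'zone', 'free_cpu'."""
    placements, zones_used = {}, {}
    for task in tasks:
        fits = sorted((m for m in machines if m["free_cpu"] >= task["cpu"]),
                      key=lambda m: m["free_cpu"])          # fill the tightest machine first
        if not fits:
            continue  # a real scheduler would ask EC2 to scale up here
        # Prefer a zone that is not already running a replica of this job.
        spread = [m for m in fits if m["zone"] not in zones_used.get(task["job"], set())]
        target = (spread or fits)[0]
        target["free_cpu"] -= task["cpu"]
        placements[task["id"]] = target["name"]
        zones_used.setdefault(task["job"], set()).add(target["zone"])
    return placements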


Titus’s container execution, which runs on top of EC2 VMs, integrates with both AWS and Netflix infrastructure.  We expect users to use both virtual machines and containers for a long time to come, so we decided that we wanted the cloud platform and operational experiences to be as similar as possible.  In using AWS we chose to deeply leverage existing EC2 services.  We used Virtual Private Cloud (VPC) for routable IPs rather than a separate network overlay.  We leveraged Elastic Network Interfaces (ENIs) to ensure that all containers had application-specific security groups.  Titus provides a metadata proxy that enables containers to get a container-specific view of their environment as well as IAM credentials.  Containers do not see the host’s metadata (e.g., IP, hostname, instance-id).  We implemented multi-tenant isolation (CPU, memory, disk, networking and security) using a combination of Linux, Docker and our own isolation technology.
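
As an illustration, a process inside a container queries the usual EC2 metadata endpoint and the proxy answers with container-scoped values; the paths below are the standard EC2 metadata paths, and exactly which values Titus rewrites per container is our assumption for the example:

# Sketch of a containerized process reading its environment through the metadata
# proxy. The paths are the standard EC2 metadata paths; which values the Titus
# proxy rewrites per container is assumed here for illustration.
import requests

METADATA = "http://169.254.169.254/latest/meta-data"

# The container sees its own routable VPC address (from its ENI), not the host's.
container_ip = requests.get(METADATA + "/local-ipv4", timeout=1).text

# IAM credentials are scoped to the application's role rather than the host instance.
role = requests.get(METADATA + "/iam/security-credentials/", timeout=1).text.strip()
creds = requests.get(METADATA + "/iam/security-credentials/" + role, timeout=1).json()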


For containers to be successful at Netflix, we needed to integrate them seamlessly into our existing developer tools and operational infrastructure.  For example, Netflix already had a solution for continuous delivery – Spinnaker.  While it might have been possible to implement rolling updates and other CI/CD concepts in our scheduler, delegating this feature set to Spinnaker allowed our users to have a consistent deployment tool across both virtual machines and containers.  Another example is service to service communication.  We avoided reimplementing service discovery and service load balancing.  Instead, we provided a full IP stack, enabling containers to work with existing Netflix service discovery and DNS (Route 53) based load balancing.  In each of these examples, a key to the success of Titus was deciding what Titus would not do, leveraging the full value other infrastructure teams provide.


Using existing systems comes at the cost of augmenting these systems to work with containers in addition to virtual machines.  Beyond the examples above, we had to augment our telemetry, performance autotuning, healthcheck systems, chaos automation, traffic control, regional failover support, secret management and interactive system access.  An additional cost is that tying into each of these Netflix systems has also made it difficult to leverage other open source container solutions that provide more than the container runtime platform.


Running a container platform at our level of scale (with this diversity of workloads) requires a significant focus on reliability.  It also uncovers challenges in all layers of the system.  We’ve dealt with scalability and reliability issues in the Titus-specific software as well as in the open source software we depend on (Docker Engine, Docker Distribution, Apache Mesos, Snap and Linux).  We design for failure at all levels of our system, including reconciliation to drive consistency of the distributed state that exists between our resource management layer and the container runtime.  By measuring clear service level objectives (container launch start latency, percentage of containers that crash due to issues in Titus, and overall system API availability), we have learned to balance our investment between reliability and functionality.


A key part of how containers help engineers become more productive is through developer tools.  The developer productivity tools team built a local development tool called Newt (Netflix Workflow Toolkit).  Newt helps simplify container development, both for local iteration and for onboarding onto Titus.  Having a consistent container environment between Newt and Titus helps developers deploy with confidence.

Current Titus Usage

We run several Titus stacks in multiple test and production accounts across the three Amazon regions that power the Netflix service.


When we started Titus in December of 2015, we launched a few thousand containers per week across a handful of workloads.  Last week, we launched over one million containers.  These containers represented hundreds of workloads.  This 1000X increase in container usage happened over about a year’s time, and growth doesn’t look to be slowing down.


We run a peak of 500 r3.8xl instances in support of our batch users.  That represents 16,000 cores of compute with 120 TB of memory.  We also added support for GPUs as a resource type using p2.8xl instances to power deep learning with neural nets and mini-batch.


In the early part of 2017, our stream-processing-as-a-service team decided to leverage Titus to enable simpler and faster cluster management for their Flink based system.  This usage has resulted in over 10,000 service job containers that are long running and re-deployed as stream processing jobs are changed.  These and other services use thousands of m4.4xl instances.


While the above use cases are critical to our business, issues with these containers do not impact Netflix customers immediately.  That has changed as Titus containers recently started running services that satisfy Netflix customer requests.


Supporting customer facing services is not a challenge to be taken lightly.  We’ve spent the last six months duplicating live traffic between virtual machines and containers.  We used this duplicated traffic to learn how to operate the containers and validate our production readiness checklists.  This diligence gave us the confidence to move forward making such a large change in our infrastructure.

The Titus Team

One of the key aspects of the success of Titus at Netflix has been the experience and growth of the Titus development team.  Our container users trust the team to keep Titus operational and innovating to meet their needs.


We are not done growing the team yet.  We are looking to expand the container runtime as well as our developer experience.  If working on container-focused infrastructure excites you and you’d like to be part of the future of Titus, check out our jobs page.



On behalf of the entire Titus development team

Monday, April 17, 2017

Introducing Bolt: On Instance Diagnostic and Remediation Platform

Last August we introduced Winston, our event driven diagnostic and remediation platform. Winston helps orchestrate diagnostic and remediation actions from the outside. As part of that orchestration, there are multiple actions that need to be performed at the AWS instance (VM) level to collect data or take mitigation steps. We would like to discuss a supporting service called Bolt that helps with instance level action executions. By `action`, we refer to a runbook, script or other automation code.

Problem space

Netflix does not run its own data centers. We use AWS services for all our infrastructure needs. While we do utilize value-added services from AWS (SQS, S3, etc.), much of our AWS usage sits on top of Amazon’s core compute service, EC2.

As part of operationalizing our fleet of EC2 instances, we needed a simple way of automating common diagnostic and remediation tasks on these instances. This solution could integrate with our orchestration services like Winston to run these actions across multiple instances, making it easy to collect and aggregate data at scale as needed. This solution became especially important for our instances hosting persistent services like Cassandra, which are longer lived than our non-persistent services. For example, when a Cassandra node is running low on disk space, a Bolt action would be executed to analyze whether any stale snapshots are lying around and, if so, reclaim disk space without user intervention.

Bolt to the rescue

Bolt is an instance level daemon that runs on every instance and exposes custom actions as REST commands for execution. Developers provide their own custom scripts as actions to Bolt; these are version controlled and then seamlessly deployed across all relevant EC2 instances dynamically. Once deployed, these actions are available for execution via Bolt’s REST endpoints.
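
As a rough sketch of what that looks like from a client’s point of view (the port, endpoint paths and payload fields below are assumptions for illustration, not Bolt’s actual API):

# Hypothetical sketch of invoking a Bolt action on a single instance; the port,
# paths and fields are assumptions, not Bolt's real API.
import requests

instance_ip = "10.0.12.34"                     # obtained from Spinnaker, not a Bolt catalog
action = "cassandra/cleanup_snapshots"

resp = requests.post(
    "https://%s:7777/actions/%s/executions" % (instance_ip, action),
    json={"parameters": {"dry_run": False}},
    cert=("client.crt", "client.key"),         # mutual TLS: the caller must present a cert
    verify="internal-ca.pem",
    timeout=5,
)
execution_id = resp.json()["execution_id"]     # the execution continues asynchronously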


Some interesting design decisions when building Bolt were:
  • There is no central instance catalog in Bolt. The cloud is ever changing and ephemeral, and maintaining this dynamic list is not our core value add. Instead, the Bolt client gets the list on demand from Spinnaker. This allows a clear separation of concerns between the systems that provide the infrastructure map and Bolt, which uses that map to act.
  • No middle tier service: Bolt client libraries and CLI tools talk directly to the instances. This allows Bolt to scale to thousands of instances and reduces operational complexity.
  • History of executions stays on the instance vs. in a centralized external store. By decentralizing the history store and making it instance scoped and tied to the life of that instance, we reduce operational cost and simplify scalability, at the cost of losing the history when the instance is lost.
  • Lean resource usage: the agent only uses 50 MB of memory and is niced to prevent it from taking over the CPU.
Below, we go over some of the more interesting features of Bolt.

Language support

The main languages used for writing Bolt actions are Python and Bash/Shell. In fact, any scripting language that is installed on the instance can be used (Groovy, Perl, …), as long as a proper shebang is specified.
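
For example, a Bolt action can be an ordinary executable script; the one below is purely illustrative, and any interpreter present on the instance would work as long as the shebang points to it:

#!/usr/bin/env python
# Illustrative Bolt action: report disk usage of /data as JSON on stdout.
import json
import os
import subprocess

output = subprocess.check_output(["du", "-sk", "/data"]).decode()
size_kb = int(output.split()[0])

# Bolt captures stdout/stderr, so emitting JSON keeps results easy to aggregate.
print(json.dumps({"path": "/data", "size_kb": size_kb, "host": os.uname()[1]}))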


The advantage of using Python is that we provide per-pack virtual environment support, which gives dependency isolation to the automation. We also provide, through Winston Studio, self-serve dependency management using the standard requirements.txt approach.

Self serve interface

We chose to extend Winston Studio to support CRUD and Execute operations for both Winston and Bolt. By providing a simple and intuitive interface to upload Bolt actions, look at execution history, and experiment and execute on demand, we ensured that the Fault Detection Engineering team is not the bottleneck for operations associated with Bolt actions.


Here is a screenshot of what a single Bolt action looks like. Users (Netflix engineers) can iterate, introspect and manage all aspects of their action via this studio. The studio also implements and exposes the paved path for action deployment, making it easy for engineers to do the right thing and mitigate risks.




Users can also look at the previous executions and individual execution details through Winston Studio as shown in the following snapshots.




Here is an example requirements.txt file which specifies the pack dependencies (for Python Bolt actions):
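
The original file is not reproduced here, but an illustrative stand-in using standard pip requirement syntax would look like this:

# Illustrative stand-in for a pack's requirements.txt; package names and
# versions are placeholders.
requests==2.12.4
boto3>=1.4.0
PyYAML==3.12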

Action lifecycle management

Similar to Winston, we help ensure that these actions are version controlled correctly and we enforce staged deployment. Here are the 3 stages:
  1. Dev: At this stage, a Bolt action can only be executed from the Studio UI. It is only deployed at runtime on the instance where it will be executed. This stage is used for development purposes, as the name suggests.
  2. Test: When development is completed, the user promotes the action to the Test environment. At this stage, within ~5 minutes, the action will be deployed on all relevant Test EC2 instances. The action then sits at this stage for a couple of hours, days or weeks (at the discretion of the user) to catch edge cases and unexpected issues at scale.
  3. Prod: When the user is comfortable with the stability of the action, the next step is to promote it to Prod. Again, within ~5 minutes, the action will be deployed on all relevant Prod EC2 instances.

Security

Even though this is an internal administrative service, security was a big aspect of building software that installs on all the EC2 instances we run. Here are some key security features and decisions we made to harden the service against bad actors:
  • Whitelisting actions: We do not allow running arbitrary commands on instances. Developers explicitly choose to expose a command/script on their service instances via Bolt.
  • Auditing: All CRUD changes to actions are authenticated and audited in the studio. Netflix engineers have to use their credentials to make any change to the whitelisted actions.
  • Mutual TLS authenticated REST endpoints: Arbitrary systems cannot invoke executions via Bolt on instances.

Async execution

Choosing an async API to execute actions allows long running diagnostic or remediation actions to run without blocking the client. It also allows clients to scale to managing thousands of instances in a short interval of time through a fire-and-check-back-later interface.
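
A minimal sketch of that fire-and-check-back-later pattern, continuing the hypothetical endpoints from the earlier example (mutual TLS options omitted for brevity):

# Sketch of the fire-and-check-back-later pattern; endpoints and fields are
# hypothetical, and the mutual TLS options from the earlier sketch are omitted.
import time
import requests

base = "https://10.0.12.34:7777"
resp = requests.post(base + "/actions/kafka/restart_broker/executions")
execution_id = resp.json()["execution_id"]     # the call returns immediately

# The client can fire off many executions, then come back and poll for status.
while True:
    state = requests.get(base + "/executions/" + execution_id).json()["state"]
    if state in ("succeeded", "failed", "timed_out"):
        break
    time.sleep(10)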


Here is a simple sequence diagram to illustrate the flow of an action being executed on a single EC2 instance:

Client libraries

The Bolt ecosystem includes both Python and Java client libraries for integrations.  These client libraries make TLS authenticated calls available out of the box and implement common client side orchestration patterns.


What do we mean by orchestration? Let’s say that you want to restart Tomcat on 100 instances. You probably don’t want to do it on all instances at the same time, as your service would experience downtime. This is where the serial/parallel nature of an execution comes into play. We support running an action one instance at a time, one EC2 Availability Zone at a time, one EC2 Region at a time, or on all instances in parallel (if it is a read-only action, for example). The user decides which strategy applies to their action (the default is one instance at a time, just to be safe).
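
A sketch of that client-side logic for the one-Availability-Zone-at-a-time strategy (start_action and wait_for are hypothetical helpers wrapping the per-instance calls shown earlier):

# Client-side orchestration sketch: run an action one Availability Zone at a time.
from collections import defaultdict

def run_one_az_at_a_time(instances, start_action, wait_for):
    """start_action(instance) kicks off an execution on one instance;
    wait_for(execution) blocks until that execution completes. Both are
    hypothetical helpers wrapping the per-instance REST calls."""
    by_zone = defaultdict(list)
    for inst in instances:                     # the instance list comes from Spinnaker
        by_zone[inst["zone"]].append(inst)

    for zone in sorted(by_zone):
        executions = [start_action(inst) for inst in by_zone[zone]]
        # Finish one zone before touching the next, so a misbehaving action never
        # affects more than one Availability Zone's worth of capacity at a time.
        for execution in executions:
            wait_for(execution)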

Resiliency features

  • Bolt engine and all action executions are run at a nice level of 1 to prevent them from taking over the host
  • Health check: Bolt exposes a health check API and we have logic that periodically checks the health of the Bolt service across all instances. We also use it for our deployments and regression testing.
  • Metrics monitoring: CPU/memory usage of the engine and running actions
  • Staged/controlled Deployments of the engine and the actions
  • No disruption during upgrades: Running actions won’t be interrupted by Bolt upgrades
  • No middle tier: better scalability and reliability

Other features

Other notable features include support for action timeouts and the ability to kill running actions. The user can also see the output (stdout/stderr) while the action is running, without having to wait for the action to complete.

Use cases

While Bolt is very flexible as to what actions it can perform, the majority of use cases fall into these patterns:

Diagnostic

Bolt is used as a diagnostic tool to help engineers when their system is not behaving as expected. Some teams use it to gather metadata about their instances (disk usage, versions of packages installed, JDK version, …). Others use Bolt to get detailed disk usage information when a low disk space alert gets triggered.

Remediation

Remediation is an approach to fix an underlying condition that is rendering a system non-optimal. For example, Bolt is used by our Data Pipeline team to restart the Kafka broker when needed. It is also used by our Cassandra team to free up disk space.

Proactive maintenance

In the case of our Cassandra instances, Bolt is used for proactive maintenance (repairs, compactions, Datastax & Priam binary upgrades, …). It is also used for binary upgrades on our Dynomite instances (for Florida and Redis).

Some usage numbers

Here are some usage stats:

  • Thousands of action executions per day
  • Bolt is installed on tens of thousands of instances

Architecture

Below is a simplified diagram of how Bolt actions are deployed on the instances:
  1. When the user modifies an action from Winston Studio, it is committed to a Git version control system (Stash in our case, for auditing purposes) and synced to a blob store (an AWS S3 bucket in our case)
  2. Every couple of minutes, all the Bolt-enabled Netflix instances check if there is a new version available for each installed pack and proceed with its upgrade if needed.


It also explains how actions are triggered (either manually from Winston Studio or in response to an event from Winston).


The diagram also summarizes how Bolt is stored on each instance:
  • The Bolt engine binary, the Bolt packs and the Python virtual environments are installed on the root device
  • Bolt log files, action stdout/stderr output and the Sqlite3 database that contains the execution history are stored on the ephemeral device. We keep about 180 days of execution history and compress the stdout/stderr a day after the execution is completed, to save space.


It also shows that we send metrics to Atlas to track usage as well as pack upgrade failures. This is critical to ensure that we get notified of any issues.



Build vs. Buy

Here are some alternatives we looked at before building Bolt.

SSH

Before Bolt, we were using plain and simple SSH to connect to instances and execute actions as needed. While built into every VM and very flexible, this model had some critical issues:


  1. Security: Winston, or any other orchestrator, needs to be a bastion holding the keys to our kingdom, something we were not comfortable with
  2. Async vs. sync: Running SSH commands meant we could only run them synchronously, blocking a process on the client side to wait for the action to finish. This sync approach has scalability concerns as well as reliability issues (long running actions were sometimes interrupted by network issues)
  3. History: Persisting what was run for future diagnostics.
  4. Admin interface


We discussed the idea of using separate sudo rules, a dedicated set of keys and some form of binary whitelisting to improve the security aspect. This would have alleviated our security concerns, but would have limited our agility to add custom scripts (which could contain non-whitelisted binaries).


In the end, we decided to invest in building technology that supports the instance level use case with the flexibility to add value-added features, while addressing the security, reliability and performance challenges we had.

AWS EC2 Run

AWS EC2 Run is a great option if you need to run remote scripts on AWS EC2 instances. But in October 2014 (when Bolt was created), EC2 Run either did not exist yet or was not public (its introduction post was published in October 2015). Since then, we sync quarterly with the EC2 Run team from AWS to see how we can leverage their agent to replace the Bolt agent. But, at the time of writing, given the Netflix-specific features of Bolt and its integration with our Atlas monitoring system, we have decided not to migrate yet.

Chef/Puppet/Ansible/Salt/RunDeck ...

These are great tools for configuration management/orchestration. Sadly, some of them (Ansible/RunDeck) depend on SSH, which was an issue for the reasons listed above. For Chef and Puppet, we assessed that they were not quite the right tools for what we were trying to achieve (being mostly configuration management tools). As for Salt, adding a middle tier (Salt Master) meant an increased risk of downtime, even with a multi-master approach. Bolt doesn’t have any middle tier or master, which significantly reduces the risk of downtime and allows better scalability.

Future work

Below is a glimpse of some future work we plan to do in this space.

Resources Capping

As an engineer, one of the biggest concerns when running an extra agent/process is: “Could it take over my system’s memory/CPU?”. In this regard, we plan to invest in memory/CPU capping as an optional configuration in Bolt. We already reduce the processing priority of Bolt and the action processes with ‘nice’, but memory is currently unconstrained. The current plan is to use cgroups.
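
A rough sketch of what such capping could look like with the cgroup v1 memory controller (the group name and the 256 MB limit below are placeholders):

# Illustrative sketch of capping the agent's memory with the cgroup v1 memory
# controller; the group name and 256 MB limit are placeholders. Requires root.
import os

CGROUP = "/sys/fs/cgroup/memory/bolt"
os.makedirs(CGROUP, exist_ok=True)

with open(os.path.join(CGROUP, "memory.limit_in_bytes"), "w") as f:
    f.write(str(256 * 1024 * 1024))            # hard cap at 256 MB

# Move the agent process into the group; the kernel then enforces the cap.
with open(os.path.join(CGROUP, "cgroup.procs"), "w") as f:
    f.write(str(os.getpid()))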

Integrate with Spinnaker

The Spinnaker team is planning on using Bolt to gather metadata about running instances as well as for other general use cases that apply to all Netflix microservices (restarting Tomcat, …). For this reason, we are planning to add Bolt to our BaseAMI, which will automatically make it available on all Netflix microservices.

Summary

From its creation to solve specific needs for our Cloud Database Engineering group, through its broader adoption, and soon its inclusion on all Netflix EC2 instances, Bolt has evolved into a reliable and effective tool.


By: Jean-Sebastien Jeannotte and Vinay Shah
on behalf of the Fault Detection Engineering team

Monday, April 10, 2017

BetterTLS - A Name Constraints test suite for HTTPS clients


Written by Ian Haken

At Netflix we run a microservices architecture that has hundreds of independent applications running throughout our ecosystem. One of our goals, in the interest of implementing security in depth, is to have end-to-end encrypted, authenticated communication between all of our services wherever possible, regardless of whether or not it travels over the public internet. Most of the time, this means using TLS, an industry standard implemented in dozens of languages. However, this means that every application in our environment needs a TLS certificate.

Bootstrapping the identity of our applications is a problem we have solved, but most of our applications are resolved using internal names or are directly referenced by their IP (which lives in a private IP space). Public Certificate Authorities (CAs) are specifically restricted from issuing certificates of this type (see section 7.1.4.2.1 of the CA/B baseline requirements), so it made sense to use an internal CA for this purpose. As we converted applications to use TLS (e.g., by using HTTPS instead of HTTP), it was reasonably straightforward to configure them to use a truststore which includes this internal CA. However, the question remained of what to do about users accessing these services using a browser. Our internal CA isn’t trusted by browsers out-of-the-box, so what should we do?

The most obvious answer is straightforward: “add the CA to browsers’ truststores.” But we were hesitant about this solution. By forcing our users to trust a private CA, they must take on faith that this CA is only used to mint certificates for internal services and is not being used to man-in-the-middle traffic to external services (such as banks, social media sites, etc.). Even if our users do take our good behavior on faith, the impact of a compromise of our infrastructure becomes significant; not only could an attacker compromise our internal traffic channels, but all of our employees would suddenly be at risk, even when they’re at home.
Fortunately, the often underutilized Name Constraints extension provides us a solution to both of these concerns.

The Name Constraints Extension

One powerful (but often neglected) feature of the X.509 specification is the Name Constraints extension. This extension can be put on CA certificates to whitelist and/or blacklist the domains and IPs for which that CA or any sub-CA is allowed to create certificates. For example, suppose you trust the Acme Corp Root CA, which delegates to various other sub-CAs that ultimately sign certificates for websites. They may have a certificate hierarchy that looks like this:

Now suppose that Beta Corp and Acme Corp become partners and need to start trusting each other’s services. Similar to Acme Corp, Beta Corp has a root CA that has signed certificates for all of its services. Therefore, services inside Acme Corp need to trust the Beta Corp root CA. Rather than update every service in Acme Corp to include the new root CA in its truststore, a simpler solution is for Acme Corp to cross-certify with Beta Corp so that the Beta Corp root CA has a certificate signed by the Acme Corp Root CA. For users inside Acme Corp, their trust hierarchy now looks like this.

However, this has the undesirable side effect of exposing users inside of Acme Corp to the risk of a security incident inside Beta Corp. If a Beta Corp CA is misused or compromised, it could issue certificates for any domain, including those of Acme Corp.

This is where the Name Constraints extension can play a role. When Acme Corp signs the Beta Corp root CA certificate, it can include an extension in the certificate which declares that it should only be trusted to issue certificates under the “betacorp.com” domain. This way Acme Corp users would not trust mis-issued certificates for the “acmecorp.com” domain from CAs under the Beta Corp root CA.
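
For illustration, here is a minimal sketch using the Python cryptography library of how such a cross-certificate could carry a Name Constraints extension limiting the Beta Corp root to betacorp.com; the names, key size and validity period are placeholders:

# Minimal sketch of a cross-certificate carrying a Name Constraints extension,
# using the Python "cryptography" library; names, key size and validity are
# placeholders, not a recommended CA configuration.
import datetime
from cryptography import x509
from cryptography.x509.oid import NameOID
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa

acme_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
beta_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

cross_cert = (
    x509.CertificateBuilder()
    .issuer_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, u"Acme Corp Root CA")]))
    .subject_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, u"Beta Corp Root CA")]))
    .public_key(beta_key.public_key())
    .serial_number(x509.random_serial_number())
    .not_valid_before(datetime.datetime.utcnow())
    .not_valid_after(datetime.datetime.utcnow() + datetime.timedelta(days=3650))
    .add_extension(x509.BasicConstraints(ca=True, path_length=None), critical=True)
    # Only names under betacorp.com are acceptable in certificates chaining
    # through this CA; clients must enforce this when verifying a chain.
    .add_extension(
        x509.NameConstraints(permitted_subtrees=[x509.DNSName(u"betacorp.com")],
                             excluded_subtrees=None),
        critical=True,
    )
    .sign(private_key=acme_key, algorithm=hashes.SHA256())
)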

This example demonstrates how Name Constraints can be useful in the context of CA cross-certification, but it also applies to our original problem of inserting an internal CA into browsers’ trust stores. By minting the root CA with Name Constraints, we can limit what websites could be verified using that trust root, even if the CA or any of its intermediaries were misused.

At least, that’s how Name Constraints should work.

The Trouble with Name Constraints

The Name Constraints extension lives on the certificate of a CA but can’t actually constrain what a bad actor does with that CA’s private key (much less control what a subordinate CA issues), so even with the extension present there is nothing to stop the bad actor from signing a certificate which violates the constraint. Therefore, it is up to the TLS client to verify that all constraints are satisfied whenever the client verifies a certificate chain.

This means that for the Name Constraints extension to be useful, HTTPS clients (and browsers in particular) must enforce the constraints properly.

Before relying on this solution to protect our users, we wanted to make sure browsers were really implementing Name Constraints verification and doing so correctly.  The initial results were promising: each of the browsers we tested (Chrome, Firefox, Edge, and Safari) gave verification exceptions when browsing to a site where a CA signed a certificate in violation of the constraints.

However, as we extended our test suite beyond basic tests we rapidly began to lose confidence. We created a battery of test certificates which moved the subject name between the certificate’s subject common name and Subject Alternative Name extension, which mixed the use of Name Constraint whitelisting and blacklisting, and which used both DNS names and IP addresses in the constraints. The result was that every browser (except for Firefox, which showed a 100% pass rate) and every HTTPS client we tested (such as Java, Node.js, and Python) allowed some sort of Name Constraints bypass.

Introducing BetterTLS

In order to raise awareness of the issues we discovered, encourage TLS implementers to correct them, and allow them to include some of these tests in their own test suites, we are open sourcing the test suite we created and making it available online. Inspired by badssl.com, we created bettertls.com with the hope that the tests we add to this site can help improve the resiliency of TLS implementations.

Before we made bettertls.com public, we reached out to many of the affected vendors and are happy to say that we received a number of positive responses. We’d particularly like to thank Ryan Sleevi and Adam Langley from Google who were extremely responsive and immediately took actions to remediate some of the discovered issues and incorporate some of these test certificates into their own test suite. We have also received confirmation from Oracle that they will be addressing the results of this test suite in Java in an upcoming security release.

The source for bettertls.com is available on GitHub, and we welcome suggestions, improvements, corrections, and additional tests!