Thursday, April 27, 2017

Towards true continuous integration: distributed repositories and dependencies

For the past 8 years, Netflix has been building and evolving a robust microservice architecture in AWS. Throughout this evolution, we learned how to build reliable, performant services in AWS. Our microservice architecture decouples engineering teams from each other, allowing them to build, test and deploy their services as often as they want. This flexibility enables teams to maximize their delivery velocity. Velocity and reliability are paramount design considerations for any solution at Netflix.
As supported by our architecture, microservices provide their consumers with a client library that handles all of the IPC logic. This provides a number of benefits to both the service owner and the consumers. In addition to the consumption of client libraries, the majority of microservices are built on top of our runtime platform framework, which is composed of internal and open source libraries.
While service teams do have the flexibility to release as they please, their velocity can often be hampered by updates to any of the libraries they depend on. An upcoming product feature may require a number of microservices to pick up the latest version of a shared library or client library. Updating dependency versions carries risk.
Or put simply, managing dependencies is hard.
Updating your project’s dependencies could mean a number of potential issues, including:
  • Breaking API changes - This is the best case scenario. A compilation failure that breaks your build. Semantic versioning combined with dependency locking and dynamic version selectors should be sufficient for most teams to prevent this from happening, assuming cultural rigor around semver. However, locking yourself into a major version makes it that much harder to upgrade the company’s codebase, leading to prolonged maintenance of older libraries and configuration drift.
  • Transitive dependency updates - Due to the JVM’s flat classpath, only a single version of a class can exist within an application. Build tools like Gradle and Maven handle version conflict resolution, preventing multiple versions of the same library from being included. This also means that code within your application may now be running against a transitive dependency version it was never tested with.
  • Breaking functional changes - Welcome to the world of software development! Ideally this is mitigated by proper testing. Ideally, library owners are able to run their consumer’s contract tests to understand the functionality that is expected of them.
To address the challenges of managing dependencies at scale, we have observed companies moving towards two approaches: Share little and monorepos.
  • The share little approach (or don’t use shared libraries) has recently been popularized by the broader microservice movement. The share little approach states that no code should be shared between microservices; services should only be coupled via their HTTP APIs. Some recommendations even go as far as to say that copy and paste is preferable to shared libraries. This is the most extreme approach to decoupling.
  • The monorepo approach dictates that all source code for the organization live in a single source repository. Any code change should be compiled/tested against everything in the repository before being pushed to HEAD. There are no versions of internal libraries, just what is on HEAD. Commits are gated before they make it to HEAD. Third-party library versions are generally limited to one or two “approved” versions.
While both approaches address the problems of managing dependencies at scale, they also impose certain challenges. The share little approach favors decoupling and engineering velocity, while sacrificing code reuse and consistency. The monorepo approach favors consistency and risk reduction, while sacrificing freedom by requiring gates to deploying changes. Adopting either approach would entail significant changes to our development infrastructure and runtime architecture. Additionally, both solutions would challenge our culture of Freedom and Responsibility.
The challenge we’ve posed to ourselves is this:
Can we provide engineers at Netflix the benefits of a monorepo while still maintaining the flexibility of distributed repositories?
Using the monorepo as our requirements specification, we began exploring alternative approaches to achieving the same benefits. What are the core problems that a monorepo approach strives to solve? Can we develop a solution that works within the confines of a traditional binary integration world, where code is shared?
Our approach, while still experimental, can be distilled into three key features:
  • Publisher feedback - provide the owner of shared code fast feedback as to which of their consumers they just broke, both direct and transitive (a conceptual sketch follows this list). Also, allow teams to block releases based on downstream breakages. Currently, our engineering culture puts sole responsibility on consumers to resolve these issues. By giving library owners feedback on the impact they have on the rest of Netflix, we expect them to take on additional responsibility.
  • Managed source - provide consumers with a means to safely and automatically increment library versions as new versions are released. Since we are already testing each new library release against all downstreams, why not bump consumer versions and accelerate version adoption, safely?
  • Distributed refactoring - provide owners of shared code a means to quickly find and globally refactor consumers of their API. We have started by issuing pull requests en masse to all Git repositories containing a consumer of a particular Java API. We’ve run some early experiments and expect to invest more in this area going forward.
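To make the publisher feedback idea more concrete, here is a minimal, conceptual sketch (not the actual Netflix implementation) of how the direct and transitive consumers of a newly published library could be discovered from a hypothetical reverse dependency graph and targeted for CI runs:

from collections import deque

# Hypothetical reverse dependency graph: library -> set of repos that depend on it.
# In practice this would be derived from build metadata, not hard-coded.
REVERSE_DEPS = {
    "platform-core": {"service-a", "shared-client"},
    "shared-client": {"service-b", "service-c"},
}

def affected_consumers(library):
    """Return all direct and transitive consumers of `library` (breadth-first)."""
    seen, queue = set(), deque([library])
    while queue:
        current = queue.popleft()
        for consumer in REVERSE_DEPS.get(current, set()):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen

# A publisher feedback check could then build/test each consumer against the
# release candidate and report (or block the release) on failures.
for consumer in sorted(affected_consumers("platform-core")):
    print(f"would trigger CI for {consumer} against the candidate version")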
We are just starting our journey. Our publisher feedback service is currently being alpha tested by a number of service teams and we plan to broaden adoption soon, with managed source not far behind. Our initial experiments with distributed refactoring have helped us understand how best to rapidly change code globally. We also see an opportunity to reduce the size of the overall dependency graph by leveraging tools we build in this space. We believe that expanding and cultivating this capability will allow teams at Netflix to achieve true organization-wide continuous integration and reduce, if not eliminate, the pain of managing dependencies.
If this challenge is of interest to you, we are actively hiring for this team. You can apply using one of the links below:

- Mike McGarr, Dianne Marsh and the Developer Productivity team

Tuesday, April 18, 2017

The Evolution of Container Usage at Netflix

Containers are already adding value to our proven globally available cloud platform based on Amazon EC2 virtual machines.  We’ve shared pieces of Netflix’s container story in the past (video, slides), but this blog post will discuss containers at Netflix in depth.  As part of this story, we will cover Titus: Netflix’s infrastructural foundation for container-based applications.  Titus provides Netflix-scale cluster and resource management as well as container execution, with deep Amazon EC2 integration and enablement of common Netflix infrastructure.


This month marks two major milestones for containers at Netflix.  First, we have achieved a new level of scale, crossing one million containers launched per week.  Second, Titus now supports services that are part of our streaming service customer experience.  We will dive deeper into what we have done with Docker containers as well as what makes our container runtime unique.

History of Container Growth

Amazon’s virtual machine based infrastructure (EC2) has been a powerful enabler of innovation at Netflix.  In addition to virtual machines, we’ve also chosen to invest in container-based workloads for a few unique values they provide.  The benefits, excitement and explosive usage growth of containers from our developers have surprised even us.


While EC2 supported advanced scheduling for services, this didn’t help our batch users.  At Netflix there is a significant set of users who run jobs on time- or event-based triggers to analyze data, perform computations and then emit results to Netflix services, users and reports.  We run workloads such as machine learning model training, media encoding, continuous integration testing, big data notebooks and CDN deployment analysis jobs many times each day.  We wanted to provide a common resource scheduler for container-based applications, independent of workload type, that could be controlled by higher level workflow schedulers.  Titus serves as a combination of a common deployment unit (Docker image) and a generic batch job scheduling system. The introduction of Titus has helped Netflix expand to support the growing batch use cases.


With Titus, our batch users are able to put together sophisticated infrastructure quickly because they only have to specify their resource requirements.  Users no longer have to deal with choosing and maintaining AWS EC2 instance sizes that don’t always perfectly fit their workload.  Users trust Titus to pack larger instances efficiently across many workloads.  Batch users develop code locally and then immediately schedule it for scaled execution on Titus.  Using containers, Titus runs any batch application, letting the user specify exactly what application code and dependencies are needed.  For example, in machine learning training we have users running a mix of Python, R, Java and bash script applications.


Beyond batch, we saw an opportunity to bring the benefits of simpler resource management and a local development experience for other workloads.  In working with our Edge, UI and device engineering teams, we realized that service users were the next audience.  Today, we are in the process of rebuilding how we deploy device-specific server-side logic to our API tier leveraging single core optimized NodeJS servers.  Our UI and device engineers wanted a better development experience, including a simpler local test environment that was consistent with the production deployment.


In addition to a consistent environment, with containers developers can push new application versions faster than before by leveraging Docker layered images and pre-provisioned virtual machines ready for container deployments.  Deployments using Titus can now be done in one to two minutes, versus the tens of minutes we grew accustomed to with virtual machines.


The theme that underlies all these improvements is developer innovation velocity.  Both batch and service users can now experiment locally and test more quickly.  They can also deploy to production with greater confidence than before.  This velocity drives how fast features can be delivered to Netflix customers and therefore is a key reason why containers are so important to our business.

Titus Details

We have already covered what led us to build Titus.  Now, let’s dig into the details of how Titus provides these values.  We will provide a brief overview of how Titus scheduling and container execution support the service and batch job requirements shown in the diagram below.


[Diagram: Titus scheduling and container execution supporting service and batch jobs]


Titus handles the scheduling of applications by matching required resources and available compute resources.  Titus supports both service jobs that run “forever” and batch jobs that run “until done”.  Service jobs restart failed instances and are autoscaled to maintain a changing level of load.  Batch jobs are retried according to policy and run to completion.  


Titus offers multiple SLAs for resource scheduling.  Titus offers on-demand capacity for ad hoc batch and non-critical internal services by autoscaling capacity in EC2 based on current needs.  Titus also offers pre-provisioned, guaranteed capacity for user facing workloads and more critical batch.  The scheduler does both bin packing for efficiency across larger virtual machines and anti-affinity for reliability, spanning virtual machines and availability zones.  The foundation of this scheduling is a Netflix open source library called Fenzo.
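To illustrate the two scheduling concerns at a high level, here is a minimal, conceptual sketch in Python (not Fenzo’s actual Java API): hosts are scored so that fuller machines are preferred (bin packing), while hosts in availability zones that already run a replica of the same job are rejected (anti-affinity). The data shapes and field names are assumptions for illustration only.

def score_host(host, task, placed_zones):
    """Score a candidate host for a task: prefer fuller hosts (bin packing),
    reject hosts that don't fit or that violate zone anti-affinity."""
    free_cpu = host["cpu"] - host["used_cpu"]
    free_mem = host["mem"] - host["used_mem"]
    if free_cpu < task["cpu"] or free_mem < task["mem"]:
        return None  # does not fit
    if host["zone"] in placed_zones:
        return None  # anti-affinity: spread replicas across availability zones
    # Higher utilization after placement means better packing.
    return (host["used_cpu"] + task["cpu"]) / host["cpu"]

def place(task, hosts, placed_zones):
    scored = [(score_host(h, task, placed_zones), h) for h in hosts]
    scored = [(s, h) for s, h in scored if s is not None]
    if not scored:
        return None  # no feasible host; a real scheduler might scale up EC2 here
    return max(scored, key=lambda sh: sh[0])[1]

hosts = [
    {"name": "i-1", "zone": "us-east-1a", "cpu": 32, "mem": 244, "used_cpu": 20, "used_mem": 100},
    {"name": "i-2", "zone": "us-east-1b", "cpu": 32, "mem": 244, "used_cpu": 4, "used_mem": 16},
]
chosen = place({"cpu": 4, "mem": 8}, hosts, placed_zones={"us-east-1a"})
print(chosen["name"] if chosen else "no capacity")  # prints i-2: 1a is excluded by anti-affinity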


Titus’s container execution, which runs on top of EC2 VMs, integrates with both AWS and Netflix infrastructure. We expect users to use both virtual machines and containers for a long time to come, so we decided that we wanted the cloud platform and operational experiences to be as similar as possible.  In using AWS we chose to deeply leverage existing EC2 services.  We used Virtual Private Cloud (VPC) for routable IPs rather than a separate network overlay.  We leveraged Elastic Network Interfaces (ENIs) to ensure that all containers had application specific security groups.  Titus provides a metadata proxy that enables containers to get a container-specific view of their environment as well as IAM credentials.  Containers do not see the host’s metadata (e.g., IP, hostname, instance-id).  We implemented multi-tenant isolation (CPU, memory, disk, networking and security) using a combination of Linux, Docker and our own isolation technology.
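As a rough sketch of the metadata-proxy idea (purely illustrative and heavily simplified; Titus’s real proxy also handles IAM credentials and more), a small HTTP service could answer container-scoped values for metadata paths so the host’s identity is never exposed:

from http.server import BaseHTTPRequestHandler, HTTPServer

# Container-scoped values injected at container startup (hypothetical).
CONTAINER_METADATA = {
    "/latest/meta-data/local-ipv4": "100.66.12.34",   # the container's ENI IP
    "/latest/meta-data/instance-id": "titus-task-abc123",
    "/latest/meta-data/hostname": "titus-task-abc123.example.internal",
}

class MetadataProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve a container-specific view; never fall through to the host's
        # metadata values, so the host identity stays hidden.
        value = CONTAINER_METADATA.get(self.path)
        if value is None:
            self.send_error(404)
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(value.encode())

if __name__ == "__main__":
    # In a real setup, traffic to 169.254.169.254 inside the container would be
    # redirected (e.g., via iptables) to a proxy like this one.
    HTTPServer(("127.0.0.1", 8099), MetadataProxy).serve_forever()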


For containers to be successful at Netflix, we needed to integrate them seamlessly into our existing developer tools and operational infrastructure.  For example, Netflix already had a solution for continuous delivery – Spinnaker.  While it might have been possible to implement rolling updates and other CI/CD concepts in our scheduler, delegating this feature set to Spinnaker allowed our users to have a consistent deployment tool across both virtual machines and containers.  Another example is service-to-service communication.  We avoided reimplementing service discovery and service load balancing.  Instead we provided a full IP stack enabling containers to work with existing Netflix service discovery and DNS (Route 53) based load balancing.  In each of these examples, a key to the success of Titus was deciding what Titus would not do, leveraging the full value other infrastructure teams provide.


Using existing systems comes at the cost of augmenting these systems to work with containers in addition to virtual machines.  Beyond the examples above, we had to augment our telemetry, performance autotuning, healthcheck systems, chaos automation, traffic control, regional failover support, secret management and interactive system access.  An additional cost is that tying into each of these Netflix systems has also made it difficult to leverage other open source container solutions that provide more than the container runtime platform.


Running a container platform at our level of scale (with this diversity of workloads) requires a significant focus on reliability.  It also uncovers challenges in all layers of the system.  We’ve dealt with scalability and reliability issues in the Titus specific software as well as the open source we depend on (Docker Engine, Docker Distribution, Apache Mesos, Snap and Linux).  We design for failure at all levels of our system including reconciliation to drive consistency between distributed state that exists between our resource management layer and the container runtime.  By measuring clear service level objectives (container launch start latency, percentage of containers that crash due to issues in Titus, and overall system API availability) we have learned to balance our investment between reliability and functionality.


A key part of how containers help engineers become more productive is through developer tools.  The developer productivity tools team built a local development tool called Newt (Netflix Workflow Toolkit).  Newt helps simplify container development, both for local iteration and for Titus onboarding.  Having a consistent container environment between Newt and Titus helps developers deploy with confidence.

Current Titus Usage

We run several Titus stacks in multiple test and production accounts across the three Amazon regions that power the Netflix service.


When we started Titus in December of 2015, we launched a few thousand containers per week across a handful of workloads.  Last week, we launched over one million containers.  These containers represented hundreds of workloads.  This 1000X increase in container usage happened over a one year timeframe, and growth doesn’t look to be slowing down.


We run a peak of 500 r3.8xl instances in support of our batch users.  That represents 16,000 cores of compute with 120 TB of memory.  We also added support for GPUs as a resource type, using p2.8xl instances to power deep learning with neural nets and mini-batch training.


In the early part of 2017, our stream-processing-as-a-service team decided to leverage Titus to enable simpler and faster cluster management for their Flink based system.  This usage has resulted in over 10,000 service job containers that are long running and re-deployed as stream processing jobs are changed.  These and other services use thousands of m4.4xl instances.


While the above use cases are critical to our business, issues with these containers do not impact Netflix customers immediately.  That has changed as Titus containers recently started running services that satisfy Netflix customer requests.


Supporting customer facing services is not a challenge to be taken lightly.  We’ve spent the last six months duplicating live traffic between virtual machines and containers.  We used this duplicated traffic to learn how to operate the containers and validate our production readiness checklists.  This diligence gave us the confidence to move forward making such a large change in our infrastructure.

The Titus Team

One of the key aspects of the success of Titus at Netflix has been the experience and growth of the Titus development team.  Our container users trust the team to keep Titus operational and innovating to meet their needs.


We are not done growing the team yet.  We are looking to expand the container runtime as well as our developer experience.  If working on container focused infrastructure excites you and you’d like to be part of the future of Titus, check out our jobs page.



On behalf of the entire Titus development team

Monday, April 17, 2017

Introducing Bolt: On Instance Diagnostic and Remediation Platform

Last August we introduced Winston, our event driven diagnostic and remediation platform. Winston helps orchestrate diagnostic and remediation actions from the outside. As part of that orchestration, there are multiple actions that need to be performed at the AWS instance (VM) level to collect data or take mitigation steps. We would like to discuss a supporting service called Bolt that helps with instance level action executions. By `action`, we refer to a runbook, script or other automation code.

Problem space

Netflix does not run its own data centers. We use AWS services for all our infrastructure needs. While we do utilize value added services from AWS (SQS, S3, etc.), much of our AWS usage is built on top of EC2, Amazon’s core compute service.

As part of operationalizing our fleet of EC2 instances, we needed a simple way of automating common diagnostic and remediation tasks on these instances. This solution could integrate with our orchestration services like Winston to run these actions across multiple instances, making it easy to collect data and aggregate at scale as needed. This solution became especially important for our instances hosting persistent services like Cassandra, which are longer lived than our non-persistent services. For example, when a Cassandra node is running low on disk space, a Bolt action would be executed to analyze whether any stale snapshots are lying around and, if so, reclaim disk space without user intervention.

Bolt to the rescue

Bolt is an instance level daemon that runs on every instance and exposes custom actions as REST commands for execution. Developers provide their own custom scripts as actions to Bolt, which are version controlled and then seamlessly deployed across all relevant EC2 instances dynamically. Once deployed, these actions are available for execution via REST endpoints through Bolt.


Some interesting design decisions made when building Bolt were:
  • There is no central instance catalog in Bolt. The cloud is ever changing and ephemeral, and maintaining this dynamic list is not our core value add. Instead, the Bolt client gets this list on demand from Spinnaker. This keeps a clear separation of concerns between the systems that provide the infrastructure map and Bolt, which uses that map to act.
  • No middle tier service: Bolt client libraries and CLI tools talk directly to the instances. This allows Bolt to scale to thousands of instances and reduces operational complexity.
  • History of executions stays on the instance vs. in a centralized external store. By decentralizing the history store and making it instance scoped and tied to the life of that instance, we reduce operational cost and simplify scalability, at the cost of losing the history when the instance is lost.
  • Lean resource usage: the agent only uses 50 MB of memory and is niced to prevent it from taking over the CPU.
Below, we go over some of the more interesting features of Bolt.

Language support

The main languages used for writing Bolt actions are Python and Bash/Shell. In fact, any scripting language that is installed on the instance can be used (Groovy, Perl, …), as long as a proper shebang is specified.


The advantage of using Python is that we provide per-pack virtual environment support, which gives dependency isolation to the automation. We also provide, through Winston Studio, self-serve dependency management using the standard requirements.txt approach.
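For illustration, a Python Bolt action can be an ordinary script with a shebang. The example below is hypothetical; the action’s name, argument handling and output format are assumptions rather than Bolt’s actual conventions:

#!/usr/bin/env python
"""Hypothetical Bolt action: report the largest files under a directory."""
import json
import os
import sys

def largest_files(root, limit=10):
    sizes = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # file disappeared or is unreadable; skip it
    return sorted(sizes, reverse=True)[:limit]

if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "/data"
    # stdout is captured by Bolt and visible in Winston Studio while the action runs.
    print(json.dumps([{"bytes": s, "path": p} for s, p in largest_files(root)], indent=2))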

Self serve interface

We chose to extend Winston Studio to support CRUD and Execute operations for both Winston and Bolt. By providing a simple and intuitive interface for uploading Bolt actions, looking at execution history, and experimenting and executing on demand, we ensured that the Fault Detection Engineering team is not the bottleneck for operations associated with Bolt actions.


Here is a screenshot of what a single Bolt action looks like. Users (Netflix engineers) can iterate, introspect and manage all aspects of their action via this studio. This studio also implements and exposes the paved path for action deployment to make it easy for engineers to do the right thing to mitigate risks.




Users can also look at the previous executions and individual execution details through Winston Studio as shown in the following snapshots.




Pack dependencies for Python Bolt actions are specified in a standard requirements.txt file. An illustrative example might look like the following (the packages and versions are hypothetical, not the actual dependencies of any Bolt pack):
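# illustrative packages only, not an actual Bolt pack's dependencies
requests==2.13.0
boto3==1.4.4
PyYAML==3.12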

Action lifecycle management

Similar to Winston, we help ensure that these actions are version controlled correctly and we enforce staged deployment. Here are the 3 stages:
  1. Dev: At this stage, a Bolt action can only be executed from the Studio UI. Actions are only deployed at runtime on the instance where they will be executed. This stage is used for development purposes, as the name suggests.
  2. Test: When development is completed, the user promotes the action to the Test environment. At this stage, within ~5 minutes, the action will be deployed on all relevant Test EC2 instances. The action then sits at this stage for a couple of hours/days/weeks (at the discretion of the user) to catch edge cases and unexpected issues at scale.
  3. Prod: When the user is comfortable with the stability of the action, the next step is to promote it to Prod. Again, within ~5 minutes, the action will be deployed on all relevant Prod EC2 instances.

Security

Even though this is an internal administrative service, security was a big aspect of building software that installs on all EC2 instances we run. Here are some key security features and decisions we made to harden the service against bad actors:
  • Whitelisting actions - We do not allow running arbitrary commands on instances. Developers explicitly choose to expose a command/script on their service instances via Bolt.
  • Auditing - All CRUD changes to actions are authenticated and audited in the studio. Netflix engineers have to use their credentials to be able to make any change to the whitelisted actions.
  • Mutual TLS authenticated REST endpoints: Arbitrary systems cannot invoke executions via Bolt on instances.

Async execution

Choosing an async API to execute actions allows long running diagnostic or remediation actions to run without blocking the client. It also allows clients to scale to managing thousands of instances in a short interval of time through a fire-and-check-back-later interface.
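A minimal sketch of this fire-and-check-back-later pattern from a client’s perspective is shown below. The port, endpoint paths and response fields are hypothetical and do not reflect Bolt’s actual API; mutual TLS is represented with the standard client certificate options of the Python requests library:

import time
import requests

INSTANCE = "https://10.0.0.12:7777"           # hypothetical Bolt port on the instance
TLS = {"cert": ("client.crt", "client.key"),  # mutual TLS client certificate
       "verify": "internal-ca.pem"}

# Fire: start the action asynchronously and get back an execution id.
resp = requests.post(f"{INSTANCE}/actions/disk_usage_report/executions",
                     json={"args": {"path": "/data"}}, **TLS)
execution_id = resp.json()["execution_id"]

# Check back later: poll until the action reaches a terminal state.
while True:
    status = requests.get(f"{INSTANCE}/executions/{execution_id}", **TLS).json()
    if status["state"] in ("SUCCEEDED", "FAILED", "TIMED_OUT"):
        break
    time.sleep(5)

print(status["state"], status.get("stdout", ""))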


Here is a simple sequence diagram to illustrate the flow of an action being executed on a single EC2 instance:

Client libraries

The Bolt ecosystem includes both Python and Java client libraries for integrations. These client libraries make TLS authenticated calls available out of the box and implement common client side orchestration patterns.


What do we mean by orchestration? Let’s say that you want to restart Tomcat on 100 instances. You probably don’t want to do it on all instances at the same time, as your service would experience downtime. This is where the serial/parallel nature of an execution comes into play. We support running an action one instance at a time, one EC2 Availability Zone at a time, one EC2 Region at a time, or on all instances in parallel (if this is a read-only action, for example). The user decides which strategy applies to their action (the default is one instance at a time, just to be safe).
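Conceptually, the one-availability-zone-at-a-time strategy is a small orchestration loop on the client side. The sketch below is illustrative only; run_action and wait_for_success stand in for hypothetical helpers that fire an async Bolt execution and poll it to completion:

from collections import defaultdict

def run_one_az_at_a_time(instances, run_action, wait_for_success):
    """Group instances by availability zone and roll through one zone at a time,
    stopping the rollout if any execution in a zone fails."""
    by_zone = defaultdict(list)
    for instance in instances:
        by_zone[instance["zone"]].append(instance)
    for zone, zone_instances in sorted(by_zone.items()):
        # Within the zone, fire executions in parallel (Bolt calls are async).
        executions = [run_action(i) for i in zone_instances]
        if not all(wait_for_success(e) for e in executions):
            raise RuntimeError(f"stopping rollout: failures in {zone}")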

Resiliency features

  • Bolt engine and all action executions are run at a nice level of 1 to prevent them from taking over the host
  • Health check: Bolt exposes a health check API and we have logic that periodically checks the health of the Bolt service across all instances. We also use it for our deployments and regression testing.
  • Metrics monitoring: CPU/memory usage of the engine and running actions
  • Staged/controlled Deployments of the engine and the actions
  • No disruption during upgrades: Running actions won’t be interrupted by Bolt upgrades
  • No middle tier: better scalability and reliability

Other features

Other notable features include support for action timeouts and support for killing running actions. Also, the user can see the output (stdout/stderr) while the action is running (they don’t have to wait for the action to complete).

Use cases

While Bolt is very flexible as to what action it can perform, the majority of use cases fall into these patterns:

Diagnostic

Bolt is used as a diagnostic tool to help engineers when their system is not behaving as expected. Some teams use it to gather metadata about their instances (disk usage, versions of packages installed, JDK version, …). Others use Bolt to get detailed disk usage information when a low disk space alert gets triggered.

Remediation

Remediation is an approach to fix an underlying condition that is rendering a system non-optimal. For example, Bolt is used by our Data Pipeline team to restart the Kafka broker when needed. It is also used by our Cassandra team to free up disk space.

Proactive maintenance

In the case of our Cassandra instances, Bolt is used for proactive maintenance (repairs, compactions, Datastax & Priam binary upgrades, …). It is also used for binary upgrades on our Dynomite instances (for Florida and Redis).

Some usage numbers

Here are some usage stats:

  • Thousands of action executions per day
  • Bolt is installed on tens of thousands of instances

Architecture

Below is a simplified diagram of how Bolt actions are deployed on the instances:
  1. When the user modifies an action from Winston Studio, it is committed to a Git version control system for auditing purposes (Stash in our case) and synced to a blob store (an AWS S3 bucket in our case)
  2. Every couple of minutes, all the Bolt-enabled Netflix instances check if there is a new version available for each installed pack and proceed with an upgrade if needed (a sketch of this sync loop appears after the storage details below).


It also explains how they are being triggered (either manually from Winston Studio or in response to an event from Winston).


The diagram also summarizes how Bolt is stored on each instance:
  • The Bolt engine binary, the Bolt packs and the Python virtual environments are installed on the root device
  • Bolt log files, action stdout/stderr output and the Sqlite3 database that contains the execution history are stored on the ephemeral device. We keep about 180 days of execution history and compress the stdout/stderr a day after the execution is completed, to save space.


It also shows that we send metrics to Atlas to track usage as well as pack upgrade failures. This is critical to ensure that we get notified of any issues.
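As a rough illustration of the instance-side sync loop described in step 2 above (the bucket name, key layout and upgrade helper are assumptions, not Bolt’s actual implementation):

import time
import boto3

s3 = boto3.client("s3")
BUCKET = "bolt-packs"          # hypothetical bucket name
SYNC_INTERVAL_SECONDS = 120    # "every couple of minutes"

def latest_version(pack, stage):
    """Read the published version marker for a pack in a given stage (test/prod)."""
    obj = s3.get_object(Bucket=BUCKET, Key=f"{stage}/{pack}/LATEST")
    return obj["Body"].read().decode().strip()

def sync_packs(installed, stage, upgrade):
    """installed: dict of pack -> installed version; upgrade: a callable that
    downloads and installs a pack version (hypothetical helper)."""
    for pack, current in installed.items():
        remote = latest_version(pack, stage)
        if remote != current:
            upgrade(pack, remote)
            installed[pack] = remote

# The Bolt engine would run something like this loop in the background:
# while True:
#     sync_packs(installed_packs, stage="prod", upgrade=install_pack)
#     time.sleep(SYNC_INTERVAL_SECONDS)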



Build vs. Buy

Here are some alternatives we looked at before building Bolt.

SSH

Before Bolt, we were using plain and simple SSH to connect to instances and execute actions as needed. While built into every VM and very flexible, this model had some critical issues:


  1. Security: Winston, or any other orchestrator, needs to act as a bastion holding the keys to our kingdom, something we were not comfortable with
  2. Async vs. sync: Running SSH commands meant we could only run them in a synchronous manner, blocking a process on the client side to wait for the action to finish. This sync approach has scalability concerns as well as reliability issues (long running actions were sometimes interrupted by network issues)
  3. History: Persisting what was run for future diagnostics.
  4. Admin interface


We discussed the idea of using separate sudo rules, a dedicated set of keys and some form of binary whitelisting to improve the security aspect. This would alleviate our security concerns, but would have limited our agility to add custom scripts (which could contain non-whitelisted binaries).


In the end, we decided to invest in building technology that allows us to support the instance level use case with the flexibility to add value added features, while addressing the security, reliability and performance challenges we had.

AWS EC2 Run

AWS EC2 Run is a great option if you need to run remote scripts on AWS EC2 instances. But in October 2014 (when Bolt was created), EC2 Run was either not created yet, or not public (its introduction post was published in October 2015). Since then, we sync quarterly with the EC2 Run team from AWS to see how we can leverage their agent to replace the Bolt agent. But, at the time of writing, given the Netflix specific features of Bolt and its integration with our Atlas monitoring system, we have decided not to migrate yet.

Chef/Puppet/Ansible/Salt/RunDeck ...

These are great tools for configuration management/orchestration. Sadly, some of them (Ansible/RunDeck) had dependencies on SSH, which was an issue for the reasons listed above. For Chef and Puppet, we assessed that they were not quite the right tools for what we were trying to achieve (being mostly configuration management tools). As for Salt, adding a middle tier (Salt Master) meant an increased risk of downtime, even with a multi-master approach. Bolt doesn’t have any middle tier or master, which significantly reduces the risk of downtime and allows better scalability.

Future work

Below is a glimpse of some future work we plan to do in this space.

Resources Capping

As an engineer, one of the biggest concerns when running an extra agent/process is: “Could it take over my system memory/CPU?” In this regard, we plan to invest in memory/CPU capping as an optional configuration in Bolt. We already reduce the processing priority of Bolt and the action processes with ‘nice’, but memory is currently unconstrained. The current plan is to use cgroups.
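A minimal sketch of what cgroup-based memory capping could look like, assuming cgroup v1 and a hypothetical group name (this is not how Bolt works today, and it requires root privileges):

import os

CGROUP_ROOT = "/sys/fs/cgroup/memory/bolt"   # hypothetical cgroup for the Bolt engine
MEMORY_LIMIT_BYTES = 50 * 1024 * 1024        # cap the agent at ~50 MB

# Creating the directory creates the cgroup (cgroup v1, memory controller mounted).
os.makedirs(CGROUP_ROOT, exist_ok=True)
with open(os.path.join(CGROUP_ROOT, "memory.limit_in_bytes"), "w") as f:
    f.write(str(MEMORY_LIMIT_BYTES))

# Placing the engine (and each action process) into the group caps its memory.
with open(os.path.join(CGROUP_ROOT, "cgroup.procs"), "w") as f:
    f.write(str(os.getpid()))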

Integrate with Spinnaker

The Spinnaker team is planning on using Bolt to gather metadata about running instances as well as for other general use cases that apply to all Netflix microservices (restart Tomcat, …). For this reason, we are planning to add Bolt to our BaseAMI, which will automatically make it available on all Netflix microservices.

Summary

From its creation to solve specific needs for our Cloud Database Engineering group, to its broader adoption, and soon its inclusion on all Netflix EC2 instances, Bolt has evolved into a reliable and effective tool.


By: Jean-Sebastien Jeannotte and Vinay Shah
on behalf of the Fault Detection Engineering team