Wednesday, March 30, 2016

Global Cloud - Active-Active and Beyond

This is a continuing post on the Netflix architecture for Global Availability.  In the past we talked about efforts like Isthmus and Active-Active.  We continue the story from where we left off at the end of the Active-Active project in 2013.  We had achieved multi-regional resiliency for our members in the Americas, where the vast majority of Netflix members were located at the time.  Our European members, however, were still at risk from a single point of failure.
Global Cloud - Q3 2014.png
Our expansion around the world since then, has resulted in a growing percentage of international members who were exposed to this single point of failure, so we set out to make our cloud deployment even more resilient.

Creating a Global Cloud

We decided to create a global cloud where we would be able to serve requests from any member in any AWS region where we are deployed.  The diagram below shows the logical structure of our multi-region deployment and the default routing of member traffic to AWS region.
Global Cloud - Q1 2016.png

Getting There

Getting to the end state, while not disrupting our ongoing operations and the development of new features, required breaking the project down into a number of stages.  From an availability perspective, removing AWS EU-West-1 as a single point of failure was the most important goal, so we started in the Summer of 2014 by identifying the tasks that we needed to execute in order to be able to serve our European members from US-East-1.

Data Replication

When we initially launched service in Europe in 2012, we made an explicit decision to build regional data islands for most, but not all, of the member related data.  In particular, while a member’s subscription allowed them to stream anywhere that we offered service, information about what they watched while in Europe would not be merged with the information about what they watched while in the Americas.  Since we figured we would have relatively few members travelling across the Atlantic, we felt that the isolation that these data islands created was a win as it would mitigate the impact of a region specific outage.


In order to serve our EU members a normal experience from US-East-1, we needed to replicate the data in the EU Cassandra island data sets to the Cassandra clusters in US-East-1 and US-West-2.  We considered replicating this data into separate keyspaces in US clusters or merging the data with our Americas data.  While using separate keyspaces would have been more cost efficient, merging the datasets was more in line with our longer term goal of being able to serve any member from any region as the Americas data would be replicated to the Cassandra clusters in EU-West-1.

Merging the EU and Americas data was more complicated than the replication work that was part of the 2013 Active-Active project as we needed to examine each component data set to understand how to merge the data.  Some data sets were appropriately keyed such that the result was the union of the two island data sets.  To simplify the migration of such data sets, the Netflix Cloud Database Engineering (CDE) team enhanced the Astyanax Cassandra client to support writing to two keyspaces in parallel.  This dual write functionality was sometimes used in combination with another tool built by the CDE that could be used to forklift data from one cluster or keyspace to another.  For other data sets, such as member viewing history, custom tools were needed to handle combining the data associated with each key.  We also discovered one or two data sets in which there were unexpected inconsistencies in the data that required deeper analysis to determine which particular values to keep.


As described in the blog post on the Active-Active project, we built a mechanism to allow updates to EVCache clusters in one region to invalidate the entry in the corresponding cluster in the other US region using an SQS message.  EVCache now supports both full replication and invalidation of data in other regions, which allows application teams to select the strategy that is most appropriate to their particular data set.  Additional details about the current EVCache architecture are available in a recent Tech Blog post.

Personalization Data

Historically the personalization data for any given member has been pre-computed in only one of our AWS regions and then replicated to whatever other regions might service requests for that member.  When a member interacted with the Netflix service in a way that was supposed to trigger an update of the recommendations, this would only happen if the interaction was serviced in the member’s “home” region, or its active-active replica, if any.

This meant that when a member was serviced from a different region during a traffic migration, their personalized information would not be updated.  Since there are regular, clock driven, updates to the precomputed data sets, this was considered acceptable for the first phase of the Global Cloud project.  In the longer term, however, the precomputation system was enhanced to allow the events that triggered recomputation to be delivered across all three regions.  This change also allowed us to redistribute the precomputation workload based on resource availability.

Handling Misrouted Traffic

In the past, Netflix has used a variety of application level mechanisms to redirect device traffic that has landed in the “wrong” AWS region, due to DNS anomalies, back to the member’s “home” region.  While these mechanisms generally worked, they were often a source of confusion due the differences in their implementations.  As we started moving towards the Global Cloud, we decided that, rather than redirecting the misrouted traffic, we would use the same Zuul-to-Zuul routing mechanism that we use when failing over traffic to another region to transparently proxy traffic from the “wrong” region to the “home” region.

As each region became capable of serving all members, we could then update the Zuul configuration to stop proxying the “misrouted” traffic to the member’s home region and simply serve it locally.  While this potentially added some latency versus sticky redirects, it allowed several teams to simplify their applications by removing the often crufty redirect code.  Application teams were given the guidance that they should no longer worry about whether a member was in the “correct” region and instead serve them the best response that they could give the locally available information.

Evolving Chaos Kong

With the Active-Active deployment model, our Chaos Kong exercises involved failing over a single region into another region.  This is also the way we did our first few Global Cloud failovers.  The following graph shows our traffic steering during a production issue in US-East-1.  We steered traffic first from US-East-1 to US-West-2 and then later in the day to EU-West-1.  The upper graph shows that the aggregate, global, stream starts tracked closely to the previous week’s pattern, despite the shifts in the amount of traffic being served by each region.  The thin light blue line shows SPS traffic for each region the previous week and allows you to see the amount of traffic we are shifting.
Cleaned up version of traffic steering during INC-1453 in mid-October.
By enhancing our traffic steering tools, we are now able to steer traffic from one region to both remaining regions to make use of available capacity.  The graphs below show a situation where we evacuated all traffic from US-East-1, sending most of the traffic to EU-West-1 and a smaller portion to US-West-2.

SPS During 2016-01-14 Failover.png
We have done similar evacuations for the other two regions, each of them involving rerouted traffic being split between both remaining regions based on available capacity and minimizing member impact.  For more details on the evolution of the Kong exercises and our Chaos philosophy behind them, see our earlier post.

Are We Done?

Not even close.  We will continue to explore new ways in which to efficiently and reliably deliver service to our millions of global members.  We will report on those experiments in future updates here.

-Peter Stout on behalf of all the teams that contributed to the Global Cloud Project

Wednesday, March 23, 2016

Performance without Compromise

Last week we hosted our latest Netflix JavaScript Talks event at our headquarters in Los Gatos, CA. We gave two talks about our unflinching stance on performance. In our first talk, Steve McGuire shared how we achieved a completely declarative, React-based architecture that’s fast on the devices in your living room. He talked about our architecture principles (no refs, no observation, no mixins or inheritance, immutable state, and top-down rendering) and the techniques we used to hit our tough performance targets. In our second talk, Ben Lesh explained what RxJS is, and why we use and love it. He shared the motivations behind a new version of RxJS and how we built it from the ground up with an eye on performance and debugging.

React.js for TV UIs

RxJS Version 5

Videos from our past talks can always be found on our Netflix UI Engineering channel on YouTube. If you’re interested in being notified of future events, just sign up on our notification list.

By Kim Trott

Monday, March 21, 2016

Extracting image metadata at scale

We have a collection of nearly two million images that play very prominent roles in helping members pick what to watch. This blog describes how we use computer vision algorithms to address the challenges of focal point, text placement and image clustering at a large scale.

Focal point
All images have a region that is the most interesting (e.g. a character’s face, sharpest region, etc.) part of the image. In order to effectively render an image on a variety of canvases like a phone screen or TV, it is often required to display only the interesting region of the image and dynamically crop the rest of an image depending on the available real-estate and desired user experience. The goal of the focal point algorithm is to use a series of signals to identify the most interesting region of an image, then use that information to dynamically display it.
70177057_StoryArt_1536x864.jpg80004288_StoryArt_1536x864 (2).jpg
[Examples of face and full-body features to determine the focal point of the image]

We first try to identify all the people and their body positioning using Haar-cascade like features. We also built haar based features to also identify if it is close-up, upper-body or a full-body shot of the person(s). With this information, we were able to build an algorithm that auto-selects what is considered the "best' or "most interesting" person and then focuses in on that specific location.

However, not all images have humans in them. So, to identify interesting regions in those cases, we created a different signal - edges. We heuristically identify the focus of an image based on first applying gaussian blur and then calculating edges for a given image.

Here is one example of applying such a transformation:


///Remove noise by blurring with a Gaussian filter
GaussianBlur( src, src, Size(n,n ), 0, 0, BORDER_DEFAULT );
/// Convert the image to grayscale
cvtColor( src, src_gray, CV_BGR2GRAY );

/// Apply Laplace function
Laplacian( src_gray, dst, ddepth, kernel_size, scale, delta, BORDER_CONSTANT );
convertScaleAbs( dst, abs_dst );

Below are a few examples of dynamically cropped images based on focal point for different canvases:


Text Placement
Another interesting challenge is determining what would be the best place to put text on an image. Examples of this are the ‘New Episode’ Badge and placement of subtitles in a video frame.

[Example of “New Episode” badge hiding the title of the show]

In both cases, we’d like to avoid placing new text on top of existing text on these images.

Using a text detection algorithm allows us to automatically detect and correct such cases. However, text detection algorithms have many false positives. We apply several transformations like watershed and thresholding before applying text detection. With such transformations, we can get fairly accurate probability of text in a region of interest for image in large corpus of images.

[Results of text detection on some of the transformations of the same image]

Image Clustering
Images play an important role in a member’s decision to watch a particular video. We constantly test various flavors of artwork for different titles to decide which one performs the best. In order to learn which image is more effective globally, we would like to see how an image performs in a given region. To get an overall global view of how well a particular set of visually similar images performed globally, it is required to group them together based on their visual similarity.

We have several derivatives of the same image to display for different users. Although visually similar, not all of these images come from the same source. These images have varying degrees of image cropping, resizing, color correction and title treatment to serve a global audience.

As a global company that is constantly testing and experimenting with imagery, we have a collection of millions of images that we are continuously shifting and evolving. Manually grouping these images and maintaining those images can be expensive and time consuming, so we wanted to create a process that was smarter and more efficient.

[An example of two images with slight color correction, cropping and localized title treatment]

These images are often transformed and color corrected so a traditional color histogram based comparison does not always work for such automated grouping. Therefore, we came up with an algorithm that uses the following combination of parameters to determine a similarity index - measurement of visual similarity among group of images.

We calculate similarity index based on following 4 parameters:
  1. Histogram based distance
  2. Structural similarity between two images
  3. Feature matching between two images
  4. Earth mover’s distance algorithm to measure overall color similarity

Using all 4 methods, we can get a numerical value of similarity between two images in a relatively fast comparison.

Below is example of images grouped based on a similarity index that is invariant to color correction, title treatment, cropping and other transformations:
[Final result with similarity index values for group of images]

Images play a crucial role in first impression of a large collection of videos, and we are just scratching the surface on what we can learn from media and we have many more ambitious and interesting problems to tackle in the road ahead.

If you are excited and passionate about solving big problems, we are hiring. Contact us

Monday, March 14, 2016

Stream-processing with Mantis

Back in January of 2014 we wrote about the need for better visibility into our complex operational environments.  The core of the message in that post was about the need for fine-grained, contextual and scalable insights into the experiences of our customers and behaviors of our services.  While our execution has evolved somewhat differently from our original vision, the underlying principles behind that vision are as relevant today as they were then.  In this post we’ll share what we’ve learned building Mantis, a stream-processing service platform that’s processing event streams of up to 8 million events per second and running hundreds of stream-processing jobs around the clock.  We’ll describe the architecture of the platform and how we’re using it to solve real-world operational problems.

Why Mantis?

There are more than 75 million Netflix members watching 125 million hours of content every day in over 190 countries around the world.  To provide an incredible experience for our members, it’s critical for us to understand our systems at both the coarse-grained service level and fine-grained device level.  We’re good at detecting, mitigating, and resolving issues at the application service level - and we’ve got some excellent tools for service-level monitoring - but when you get down to the level of individual devices, titles, and users, identifying and diagnosing issues gets more challenging.

We created Mantis to make it easy for teams to get access to realtime events and build applications on top of them.  We named it after the Mantis shrimp, a freakish yet awesome creature that is both incredibly powerful and fast.  The Mantis shrimp has sixteen photoreceptors in its eyes compared to humans’ three.  It has one of the most unique visual systems of any creature on the planet.  Like the shrimp, the Mantis stream-processing platform is all about speed, power, and incredible visibility.  

So Mantis is a platform for building low-latency, high throughput stream-processing apps but why do we need it?  It’s been said that the Netflix microservices architecture is a metrics generator that occasionally streams movies.  It’s a joke, of course, but there’s an element of truth to it; our systems do produce billions of events and metrics on a daily basis.  Paradoxically, we often experience the problem of having both too much data and too little at the same time.  Situations invariably arise in which you have thousands of metrics at your disposal but none are quite what you need to understand what’s really happening.  There are some cases where you do have access to relevant metrics, but the granularity isn’t quite good enough for you to understand and diagnose the problem you’re trying to solve.  And there are still other scenarios where you have all the metrics you need, but the signal-to-noise ratio is so high that the problem is virtually impossible to diagnose.  Mantis enables us to build highly granular, realtime insights applications that give us deep visibility into the interactions between Netflix devices and our AWS services.  It helps us better understand the long tail of problems where some users, on some devices, in some countries are having problems using Netflix.

By making it easier to get visibility into interactions at the device level, Mantis helps us “see” details that other metrics systems can’t.  It’s the difference between 3 photoreceptors and 16.

A Deeper Dive

With Mantis, we wanted to abstract developers away from the operational overhead associated with managing their own cluster of machines.  Mantis was built from ground up to be cloud native.  It manages a cluster of EC2 servers that is used to run stream-processing jobs.  Apache Mesos is used to abstract the cluster into a shared pool of computing resources.  We built, and open-sourced, a custom scheduling library called Fenzo to intelligently allocate these resources among jobs.

Architecture Overview

The Mantis platform comprises a master and an agent cluster.  Users submit stream-processing applications as jobs that run as one or more workers on the agent cluster.  The master consists of a Resource Manager that uses Fenzo to optimally assign resources to a jobs’ workers.  A Job Manager embodies the operational behavior of a job including metadata, SLAs, artifact locations, job topology and life cycle.

The following image illustrates the high-level architecture of the system.

Mantis Jobs

Mantis provides a flexible model for defining a stream-processing job. A mantis job can be defined as single-stage for basic transformation/aggregation use cases or multi-stage for sharding and processing high-volume, high-cardinality event streams.

There are three main parts to a Mantis job. 
  • The source is responsible for fetching data from an external source
  • One or more processing stages which are responsible for processing incoming event streams using high order RxJava functions
  • The sink to collect and output the processed data
RxNetty provides non-blocking access to the event stream for a job and is used to move data between its stages.

To give you a better idea of how a job is structured, let's take a look at a typical ‘aggregate by group’ example.

Imagine that we are trying to process logs sent by devices to calculate error rates per device type.  The job is composed of three stages. The first stage is responsible for fetching events from a device log source job and grouping them based on device ID. The grouped events are then routed to workers in stage 2 such that all events for the same group (i.e., device ID) will get routed to the same worker.  Stage 2 is where stateful computations like windowing and reducing - e.g., calculating error rate over a 30 second rolling window - are performed.  Finally the aggregated results for each device ID are collected by Stage 3 and made available for dashboards or other applications to consume.

Job Chaining

One of the unique features of Mantis is the ability to chain jobs together.  Job chaining allows for efficient data and code reuse.  The image below shows an example of an anomaly detector application composed of several jobs chained together.  The anomaly detector streams data from a job that serves Zuul request/response events (filtered using a simple SQL-like query) along with output from a “Top N” job that aggregates data from several other source jobs.

Scaling in Action

At Netflix the amount of data that needs to be processed varies widely based on the time of the day.  Running with peak capacity all the time is expensive and unnecessary. Mantis autoscales both the cluster size and the individual jobs as needed.

The following chart shows how Fenzo autoscales the Mesos worker cluster by adding and removing EC2 instances in response to demand over the course of a week.

And the chart below shows an individual job’s autoscaling in action, with additional workers being added or removed based on demand over a week.

UI for Self-service, API for Integration

Mantis sports a dedicated UI and API for configuring and managing jobs across AWS regions.  Having both a UI and API improves the flexibility of the platform.  The UI gives users the ability to quickly and manually interact with jobs and platform functionality while the API enables easy programmatic integration with automated workflows.

The jobs view in the UI, shown below, lets users quickly see which jobs are running across AWS regions along with how many resources the jobs are consuming.

Each job instance is launched as part of a job cluster, which you can think of as a class definition or template for a Mantis job.  The job cluster view shown in the image below provides access to configuration data along with a view of running jobs launched from the cluster config. From this view, users are able to update cluster configurations and submit new job instances to run.

How Mantis Helps Us

Now that we’ve taken a quick look at the overall architecture for Mantis, let’s turn our attention to how we’re using it to improve our production operations.  Mantis jobs currently process events from about 20 different data sources including services like Zuul, API, Personalization, Playback, and Device Logging to name a few.

Of the growing set of applications built on these data sources, one of the most exciting use cases we’ve explored involves alerting on individual video titles across countries and devices.

One of the challenges of running a large-scale, global Internet service is finding anomalies in high-volume, high-cardinality data in realtime.  For example, we may need access to fine-grained insights to figure out if there are playback issues with House of Cards, Season 4, Episode 1 on iPads in Brazil.  To do this we have to track millions of unique combinations of data (what we call assets) all the time, a use case right in Mantis’ wheelhouse.

Let’s consider this use case in more detail.  The rate of events for a title asset (title * devices * country) shows a lot of variation.  So a popular title on a popular device can have orders of magnitude more events than lower usage title and device combinations.  Additionally for each asset, there is high variability in event rate based on the time of the day.  To detect anomalies, we track rolling windows of unique events per asset.  The size of the window and alert thresholds vary dynamically based on the rate of events.  When the percentage of anomalous events exceeds the threshold, we generate an alert for our playback and content platform engineering teams.  This approach has allowed us to quickly identify and correct problems that would previously go unnoticed or, best case, would be caught by manual testing or be reported via customer service.

Below is a screen from an application for viewing playback stats and alerts on video titles. It surfaces data that helps engineers find the root cause for errors.

In addition to alerting at the individual title level, we also can do realtime alerting on our key performance indicator: SPS.  The advantage of Mantis alerting for SPS is that it gives us the ability to ratchet down our time to detect (TTD) from around 8 minutes to less than 1 minute.  Faster TTD gives us a chance to resolve issues faster (time to recover, or TTR), which helps us win more moments of truth as members use Netflix around the world.

Where are we going?

We’re just scratching the surface of what’s possible with realtime applications, and we’re exploring ways to help more teams harness the power of stream-processing.  For example, we’re working on improving our outlier detection system by integrating Mantis data sources, and we’re working on usability improvements to get teams up and running more quickly using self-service tools provided in the UI.

Mantis has opened up insights capabilities that we couldn’t easily achieve with other technologies and we’re excited to see stream-processing evolve as an important and complementary tool in our operational and insights toolset at Netflix.  

If the work described here sounds exciting to you, head over to our jobs page; we’re looking for great engineers to join us on our quest to reinvent TV! 

by Ben Schmaus, Chris Carey, Neeraj Joshi, Nick Mahilani, and Sharma Podila

Wednesday, March 9, 2016

How We Build Code at Netflix

How does Netflix build code before it’s deployed to the cloud? While pieces of this story have been told in the past, we decided it was time we shared more details. In this post, we describe the tools and techniques used to go from source code to a deployed service serving movies and TV shows to more than 75 million global Netflix members.
The above diagram expands on a previous post announcing Spinnaker, our global continuous delivery platform. There are a number of steps that need to happen before a line of code makes it way into Spinnaker:
  • Code is built and tested locally using Nebula
  • Changes are committed to a central git repository
  • A Jenkins job executes Nebula, which builds, tests, and packages the application for deployment
  • Builds are “baked” into Amazon Machine Images
  • Spinnaker pipelines are used to deploy and promote the code change
The rest of this post will explore the tools and processes used at each of these stages, as well as why we took this approach. We will close by sharing some of the challenges we are actively addressing. You can expect this to be the first of many posts detailing the tools and challenges of building and deploying code at Netflix.

Culture, Cloud, and Microservices

Before we dive into how we build code at Netflix, it’s important to highlight a few key elements that drive and shape the solutions we use: our culture, the cloud, and microservices.
The Netflix culture of freedom and responsibility empowers engineers to craft solutions using whatever tools they feel are best suited to the task. In our experience, for a tool to be widely accepted, it must be compelling, add tremendous value, and reduce the overall cognitive load for the majority of Netflix engineers. Teams have the freedom to implement alternative solutions, but they also take on additional responsibility for maintaining these solutions. Tools offered by centralized teams at Netflix are considered to be part of a “paved road”. Our focus today is solely on the paved road supported by Engineering Tools.
In addition, in 2008 Netflix began migrating our streaming service to AWS and converting our monolithic, datacenter-based Java application to cloud-based Java microservices. Our microservice architecture allows teams at Netflix to be loosely coupled, building and pushing changes at a speed they are comfortable with.


Naturally, the first step to deploying an application or service is building. We created Nebula, an opinionated set of plugins for the Gradle build system, to help with the heavy lifting around building applications. Gradle provides first-class support for building, testing, and packaging Java applications, which covers the majority of our code. Gradle was chosen because it was easy to write testable plugins, while reducing the size of a project's build file. Nebula extends the robust build automation functionality provided by Gradle with a suite of open source plugins for dependency management, release management, packaging, and much more.
A simple Java application build.gradle file.
The above ‘build.gradle’ file represents the build definition for a simple Java application at Netflix. This project’s build declares a few Java dependencies as well as applying 4 Gradle plugins, 3 of which are either a part of Nebula or are internal configurations applied to Nebula plugins. The ‘nebula’ plugin is an internal-only Gradle plugin that provides convention and configuration necessary for integration with our infrastructure. The ‘nebula.dependency-lock’ plugin allows the project to generate a .lock file of the resolved dependency graph that can be versioned, enabling build repeatability. The ‘netflix.ospackage-tomcat’ plugin and the ospackage block will be touched on below.
With Nebula, we provide reusable and consistent build functionality, with the goal of reducing boilerplate in each application’s build file. A future techblog post will dive deeper into Nebula and the various features we’ve open sourced. For now, you can check out the Nebula website.


Once a line of code has been built and tested locally using Nebula, it is ready for continuous integration and deployment. The first step is to push the updated source code to a git repository. Teams are free to find a git workflow that works for them.
Once the change is committed, a Jenkins job is triggered. Our use of Jenkins for continuous integration has evolved over the years. We started with a single massive Jenkins master in our datacenter and have evolved to running 25 Jenkins masters in AWS. Jenkins is used throughout Netflix for a variety of automation tasks above just simple continuous integration.
A Jenkins job is configured to invoke Nebula to build, test and package the application code. If the repository being built is a library, Nebula will publish the .jar to our artifact repository. If the repository is an application, then the Nebula ospackage plugin will be executed. Using the Nebula ospackage (short for “operating system package”) plugin, an application’s build artifact will be bundled into either a Debian or RPM package, whose contents are defined via a simple Gradle-based DSL. Nebula will then publish the Debian file to a package repository where it will be available for the next stage of the process, “baking”.


Our deployment strategy is centered around the Immutable Server pattern. Live modification of instances is strongly discouraged in order to reduce configuration drift and ensure deployments are repeatable from source. Every deployment at Netflix begins with the creation of a new Amazon Machine Image, or AMI. To generate AMIs from source, we created “the Bakery”.
The Bakery exposes an API that facilitates the creation of AMIs globally. The Bakery API service then schedules the actual bake job on worker nodes that use Aminator to create the image.  To trigger a bake, the user declares the package to be installed, as well the foundation image onto which the package is installed. That foundation image, or Base AMI, provides a Linux environment customized with the common conventions, tools, and services required for seamless integration with the greater Netflix ecosystem.
When a Jenkins job is successful, it typically triggers a Spinnaker pipeline. Spinnaker pipelines can be triggered by a Jenkins job or by a git commit. Spinnaker will read the operating system package generated by Nebula, and call the Bakery API to trigger a bake.


Once a bake is complete, Spinnaker makes the resultant AMI available for deployment to tens, hundreds, or thousands of instances. The same AMI is usable across multiple environments as Spinnaker exposes a runtime context to the instance which allows applications to self-configure at runtime.  A successful bake will trigger the next stage of the Spinnaker pipeline, a deploy to the test environment. From here, teams will typically exercise the deployment using a battery of automated integration tests. The specifics of an application’s deployment pipeline becomes fairly custom from this point on. Teams will use Spinnaker to manage multi-region deployments, canary releases, red/black deployments and much more. Suffice to say that Spinnaker pipelines provide teams with immense flexibility to control how they deploy code.

The Road Ahead

Taken together, these tools enable a high degree of efficiency and automation. For example, it takes just 16 minutes to move our cloud resiliency and maintenance service, Janitor Monkey, from code check-in to a multi-region deployment.
A Spinnaker bake and deploy pipeline triggered from Jenkins.
That said, we are always looking to improve the developer experience and are constantly challenging ourselves to do it better, faster, and while making it easier.
One challenge we are actively addressing is how we manage binary dependencies at Netflix. Nebula provides tools focused on making Java dependency management easier. For instance, the Nebula dependency-lock plugin allows applications to resolve their complete binary dependency graph and produce a .lock file which can be versioned. The Nebula resolution rules plugin allows us to publish organization-wide dependency rules that impact all Nebula builds. These tools help make binary dependency management easier, but still fall short of reducing the pain to an acceptable level.
Another challenge we are working to address is bake time. It wasn’t long ago that 16-minutes from commit to deployment was a dream, but as other parts of the system have gotten faster, this now feels like an impediment to rapid innovation. From the Simian Army example deployment above, the bake process took 7 minutes or 44% of the total bake and deploy time. We have found the biggest drivers of bake time to be installing packages (including dependency resolution) and the AWS snapshot process itself.
As Netflix grows and evolves, there is an increasing demand for our build and deploy toolset to provide first-class support for non-JVM languages, like JavaScript/Node.js, Python, Ruby and Go. Our current recommendation for non-JVM applications is to use the Nebula ospackage plugin to produce a Debian package for baking, leaving the build and test pieces to the engineers and the platform’s preferred tooling. While this solves the needs of teams today, we are expanding our tools to be language agnostic.
Containers provide an interesting potential solution to the last two challenges and we are exploring how containers can help improve our current build, bake, and deploy experience. If we can provide a local container-based environment that closely mimics that of our cloud environments, we potentially reduce the amount of baking required during the development and test cycles, improving developer productivity and accelerating the overall development process. A container that can be deployed locally just as it would be in production without modification reduces cognitive load and allows our engineers to focus on solving problems and innovating rather than trying to determine if a bug is due to environmental differences.
You can expect future posts providing updates on how we are addressing these challenges. If these challenges sound exciting to you, come join the Engineering Tools team. You can check out our open jobs and apply today!