Friday, October 2, 2015

Netflix at AWS re:Invent 2015

Ever since AWS started the re:Invent conference, Netflix has actively participated every year. This year is no exception: we're planning to present at 8 different sessions. The topics span the domains of availability, engineering velocity, security, real-time analytics, big data, operations, cost management, and efficiency, all at web scale.

In the past, our sessions have received a lot of interest, so we wanted to share the schedule in advance and provide a summary of the topics and how they might be relevant to you and your company. Please join us at re:Invent if you're attending. After the conference, we will add links to the slides and videos to this post.

ISM301 - Engineering Global Operations in the Cloud
Wednesday, Oct 7, 11:00AM - Palazzo N
Josh Evans, Director of Operations Engineering

Abstract: Operating a massively scalable, constantly changing, distributed global service is a daunting task. We innovate at breakneck speed to attract new customers and stay ahead of the competition. This means more features, more experiments, more deployments, more engineers making changes in production environments, and ever increasing complexity. Simultaneously improving service availability and accelerating rate of change seems impossible on the surface. At Netflix, Operations Engineering is both a technical and organizational construct designed to accomplish just that by integrating disciplines like continuous delivery, fault-injection, regional traffic management, crisis response, best practice automation, and real-time analytics. In this talk, designed for technical leaders seeking a path to operational excellence, we'll explore these disciplines in depth and how they integrate and create competitive advantages.

ISM309 - Efficient Innovation - High Velocity Cost Management at Netflix
Wednesday, Oct 7, 2:45PM - Palazzo C
Andrew Park, Manager, FP&A

Abstract: At many high-growth companies, staying at the bleeding edge of innovation and maintaining the highest level of availability often sideline financial efficiency goals. This problem is exacerbated in a microservice environment where decentralized engineering teams can spin up thousands of instances at a moment's notice, with no governing body tracking financial or operational budgets. But instead of allowing costs to spin out of control, causing senior leaders to have a "knee-jerk" reaction to rein them in, there are proactive and reactive initiatives one can pursue to replace high-velocity cost with efficient innovation. Primarily, these initiatives revolve around developing a positive cost-conscious culture and assigning the responsibility for efficiency to the appropriate business owners.

At Netflix, our Finance and Operations Engineering teams bear that responsibility to ensure the rate of innovation is not only fast, but also efficient. In the following presentation, we’ll cover the building blocks of AWS cost management and discuss the best practices used at Netflix.

BDT318 - Netflix Keystone - How Netflix Handles Data Streams of up to 8 Million Events per Second
Wednesday, Oct 7, 2:45PM - San Polo 3501B
Peter Bakas, Director of Event and Data Pipelines

Abstract: In this talk, we will provide an overview of Keystone, Netflix's new data pipeline. We will cover our migration from Suro to Keystone, including the reasons behind the transition and the challenge of achieving zero loss for the over 400 billion events we process daily. We will discuss in detail how we deploy, operate, and scale Kafka, Samza, Docker, and Apache Mesos in AWS to manage 8 million events and 17 GB per second at peak.

DVO203 - A Day in the Life of a Netflix Engineer using 37% of the Internet
Wednesday, Oct 7, 4:15PM - Venetian H
Dave Hahn, Senior Systems Engineer & AWS Liaison

Abstract: Netflix is a large and ever-changing ecosystem made up of:
* hundreds of production changes every hour
* thousands of micro services
* tens of thousands of instances
* millions of concurrent customers
* billions of metrics every minute

And I'm the guy with the pager.

An in-the-trenches look at what operating at Netflix scale in the cloud is really like. How Netflix views the velocity of innovation, expected failures, high availability, engineer responsibility, and obsessing over the quality of the customer experience. Why Freedom & Responsibility is key, why trust is required, and why chaos is your friend.

SPOT302 -  Availability: The New Kind of Innovator’s Dilemma
Wednesday, Oct 7, 4:15PM - Marcello 4501B
Coburn Watson, Director of Reliability and Performance Engineering

Abstract: Successful companies, while focusing on their current customers' needs, often fail to embrace disruptive technologies and business models. This phenomenon, known as the "Innovator's Dilemma," eventually leads to many companies' downfall and is especially relevant in the fast-paced world of online services. In order to protect its leading position and grow its share of the highly competitive global digital streaming market, Netflix has to continuously increase the pace of innovation by constantly refining recommendation algorithms and adding new product features, while maintaining a high level of service uptime. The Netflix streaming platform consists of hundreds of microservices that are constantly evolving, and even the smallest production change may cause a cascading failure that can bring the entire service down. We face a new kind of Innovator's Dilemma, where product changes may not only disrupt the business model but also cause production outages that deny customers service access. This talk will describe various architectural, operational and organizational changes adopted by Netflix in order to reconcile rapid innovation with service availability.

BDT207 -  Real-Time Analytics In Service of Self-Healing Ecosystems
Wednesday, Oct 7, 4:15PM - Lido 3001B
Roy Rappoport, Manager of Insight Engineering
Chris Sanden, Senior Analytics Engineer

Abstract: Netflix strives to provide an amazing experience to each member.  To accomplish this, we need to maintain very high availability across our systems.  However, at a certain scale, humans can no longer effectively monitor the status of all of our systems, making it critical for us to build tools and platforms that can automatically monitor our production environments and make intelligent real-time operational decisions to remedy the problems they identify.

In this talk, we'll discuss how Netflix uses data mining and machine learning techniques to automate decisions in real time with the goal of supporting operational availability, reliability, and consistency.  We'll review how we got to the current state, the lessons we learned, and the future of real-time analytics at Netflix.

While Netflix operates at a larger scale than most other companies, we believe the approaches and technologies we intend to discuss are highly relevant to other production environments, and audience members will come away with actionable ideas that should be implementable in, and beneficial to, most other environments.

BDT303 - Running Spark and Presto in Netflix Big Data Platform
Thursday, Oct 8, 11:00AM - Palazzo F
Eva Tse, Director of Engineering - Big Data Platform
Daniel Weeks, Engineering Manager - Big Data Platform

Abstract: In this talk, we will discuss how Spark and Presto complement our big data platform stack, which started with Hadoop, and the use cases they address. We will also discuss how we run Spark and Presto on top of the EMR infrastructure; specifically, how we use S3 as our data warehouse and how we leverage EMR as a generic cluster-management framework for data processing.

SEC310 - Splitting the Check on Compliance and Security: Keeping Developers and Auditors Happy in the Cloud
Thursday, Oct 8, 11:00AM - Marcello 4501B
Jason Chan, Director of Cloud Security

Abstract: Oftentimes, developers and auditors can be at odds. The agile, fast-moving environments that developers enjoy will typically give auditors heartburn. The more controlled and stable environments that auditors prefer for demonstrating and maintaining compliance are traditionally not friendly to developers or innovation. We'll walk through how Netflix moved its PCI and SOX environments to the cloud and how we were able to leverage the benefits of the cloud and agile development to satisfy both auditors and developers. Topics covered will include shared responsibility, using compartmentalization and microservices for scope control, immutable infrastructure, and continuous security testing.

We also have a booth on the show floor where the speakers and other Netflix engineers will hold office hours.  We hope you join us for these talks and stop by our booth and say hello!

Thursday, October 1, 2015

Flux: A New Approach to System Intuition

First level of Flux

On the Traffic and Chaos Teams at Netflix, our mission requires that we have a holistic understanding of our complex microservice architecture. At any given time, we may be called upon to move the request traffic of many millions of customers from one side of the planet to the other. More frequently, we want to understand in real time what effect a variable is having on a subset of request traffic during a Chaos Experiment. We require a tool that can give us this holistic understanding of traffic as it flows through our complex, distributed system.

The two use cases have some common requirements. We need:
  • Realtime data.
  • Data on the volume, latency, and health of requests.
  • Insight into traffic at the network edge.
  • The ability to drill into IPC traffic.
  • Dependency information about the microservices as requests travel through the system.

So far, these requirements are rather standard fare for a network monitoring dashboard. Aside from the actual amount of traffic that Netflix handles, you might find a tool that accomplishes the above at any undifferentiated online service.

Here’s where it gets interesting.

In general, we assume that if anything is best represented numerically, then we don’t need to visualize it. If the best representation is a numerical one, then a visualization could only obscure a quantifiable piece of information that can be measured, compared, and acted upon. Anything that we can wrap in alerts or some threshold boundary should kick off some automated process. No point in ruining a perfectly good system by introducing a human into the mix.

Instead of numerical information, we want a tool that surfaces relevant information to a human for situations where creating a heuristic would be too onerous. These situations require an intuition that we can't codify.

If we want to be able to intuit decisions about the holistic state of the system, then we are going to need a tool that gives us an intuitive understanding of the system. The network monitoring dashboards that we are familiar with won't suffice. The current industry tools present data and charts, but we want something that will let us feel the traffic and the state of the system.

In trying to explain this requirement for a visceral, gut-level understanding of the system, we came up with a metaphor that helps illustrate the point. It’s absurd, but explanatory.

Let's call it the "Pain Suit."
Imagine a suit that is wired with tens of thousands of electrodes. Electrode bundles correspond to microservices within Netflix. When a Site Reliability Engineer is on call, they have to wear this suit. As a microservice experiences failures, the corresponding electrodes cause a painful sensation. We call this the “Pain Suit.”

Now imagine that you are wearing the Pain Suit for a few short days. You wake up one morning and feel a pain in your shoulder. “Of course,” you think. “Microservice X is misbehaving again.” It would not take you long to get a visceral sense of the holistic state of the system. Very quickly, you would have an intuitive understanding of the entire service, without having any numerical facts about any events or explicit alerts.

It is our contention that this kind of understanding, this mechanical proprioception, is not only the most efficient way for us to instantly have a holistic understanding, it is also the best way to surface relevant information in a vast amount of data to a human decision maker. Furthermore, we contend that even brief exposure to this type of interaction with the system leads to insights that are not easily attained in any other way.

Of course, we haven’t built a pain suit. [Not yet. ;-)]

Instead, we decided to take advantage of the brain's ability to process massive amounts of visual information in multiple dimensions, in parallel. We call this tool Flux.

In the home screen of Flux, we get a representation of all traffic coming into Netflix from the Internet, and being directed to one of our three AWS Regions. Below is a video capture of this first screen in Flux during a simulation of a Regional failover:

The circle in the center represents the Internet. The moving dots represent requests coming in to our service from the Internet. The three Regions are represented by the three peripheral circles. Requests are normally represented in the bluish-white color, but errors and fallbacks are indicated by other colors such as red.

In this simulation, you can see request errors building up in the region in the upper left [victim region] for the first twenty seconds or so. The cause of the errors could be anything, but the relevant effect is that we can quickly see that bad things are happening in the victim region.

Around twenty seconds into the video, we decide to initiate a traffic failover. For the following 20 seconds, the requests going to the victim region are redirected to the upper right region [savior region] via an internal proxy layer. We take this step so that we can programmatically control how much traffic is redirected to the savior region while we scale it up. In this situation we don’t have enough extra capacity running hot to instantly fail over, so scaling up takes some time.
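For illustration, here is a minimal sketch of what that programmatic control could look like (the class, method names, and numbers below are hypothetical; the real proxy and scaling APIs are internal): redirect only as much victim-region traffic as the savior region currently has headroom to absorb, and let the fraction ramp toward 1.0 as the savior region scales up.

```java
/** Hypothetical sketch of the proxy-layer ramp described above: redirect only as much
 *  victim-region traffic as the savior region currently has spare capacity to absorb. */
final class FailoverRamp {

    /** Fraction of victim-region traffic (0.0..1.0) to redirect through the proxy right now. */
    static double redirectFraction(double victimTrafficRps,
                                   double saviorCapacityRps,
                                   double saviorCurrentTrafficRps) {
        double headroomRps = Math.max(saviorCapacityRps - saviorCurrentTrafficRps, 0.0);
        if (victimTrafficRps <= 0.0) {
            return 1.0; // nothing left to protect; send everything
        }
        return Math.min(headroomRps / victimTrafficRps, 1.0);
    }

    public static void main(String[] args) {
        // Early in the failover the savior region has little headroom...
        System.out.println(redirectFraction(300_000, 350_000, 320_000)); // 0.1
        // ...and as it scales up, the ramp reaches 1.0 and DNS can be switched over.
        System.out.println(redirectFraction(300_000, 700_000, 320_000)); // 1.0
    }
}
```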

The inter-region traffic from victim to savior increases while the savior region scales up. At that point, we switch DNS to point to the savior region. For about 10 seconds you see traffic to the victim region die down as DNS propagates. At this point, about 56 seconds in, nearly all of the victim region’s traffic is now pointing to the savior region. We hold the traffic there for about 10 seconds while we ‘fix’ the victim region, and then we revert the process.

The victim region has been fixed, and we end the demo with traffic more-or-less evenly distributed. You may have noticed that in this demonstration we only performed a 1:1 mapping of victim to savior region traffic. We will speak to more sophisticated failover strategies in future posts.


Even before Flux v1.0 was up and running, when it was still in Alpha on a laptop, it found an issue in our production system. As we were testing real data, Justin noticed a stream that was discolored in one region. “Hey, what’s that?” led to a short investigation which revealed that our proxy layer had not scaled to a proper size on the most recent push in that region and was rejecting SSO requests. Flux in action!

Even a split-second glance at the Flux interface is enough to show us the health of the system. Without reading any numbers or searching for any particular signal, we instantly know by the color and motion of the elements on the screen whether the service is running properly. Of course, if something is really wrong with the service, it will be highly visible. More interesting to us, we start to get a feeling when things aren't right in the system even before the disturbance is quantifiable.


This blog post is part of a series. In the next post on Flux, we will look at two layers that are deeper than the regional view, and talk specifically about the implementation. If you have thoughts on experiential tools like this or how to advance the state of the art in this field, we’d love to hear your feedback. Feel free to reach out to

-Traffic Team at Netflix
Luke Kosewski, Jeremy Tatelman, Justin Reynolds, Casey Rosenthal

Wednesday, September 30, 2015

Moving from Asgard to Spinnaker

Six years ago, Netflix successfully jumped headfirst into the AWS Cloud, and along the way we ended up writing quite a lot of software to help us out. One particular project proved instrumental in allowing us to efficiently automate AWS deployments: Asgard.

Asgard created an intuitive model for cloud-based applications that has made deployment and ongoing management of AWS resources easy for hundreds of engineers at Netflix. Introducing the notion of clusters, applications, specific naming conventions, and deployment options like rolling push and red/black has ultimately yielded more productive teams who can spend more time coding business logic rather than becoming AWS experts.  What’s more, Asgard has been a successful OSS project adopted by various companies. Indeed, the utility of Asgard’s battle-hardened AWS deployment and management features is undoubtedly due to the hard work and innovation of its contributors both within Netflix and the community.

Netflix, nevertheless, has evolved since first embracing the cloud. Our footprint within AWS has expanded to meet the demand of an increasingly global audience; moreover, the number of applications required to service our customers has swelled. Our rate of innovation, which maintains our global competitive edge, has also grown. Consequently, our desire to move code rapidly, with a high degree of confidence and overall visibility, has also increased. In this regard Asgard has fallen short.

Asgard never addressed everything required to produce a deployment artifact, in this case an AMI. Consequently, many teams at Netflix constructed their own Continuous Delivery workflows. These workflows were typically sets of related Jenkins jobs that tied together code check-ins with building and testing, then AMI creation and, finally, deployments via Asgard. This final step involved automation against Asgard's REST API, which was never intended to be leveraged as a first-class integration point.

Roughly a year ago a new project, dubbed Spinnaker, kicked off to enable end-to-end global Continuous Delivery at Netflix. The goals of this project were to create a Continuous Delivery platform that would:
  • enable repeatable automated deployments captured as flexible pipelines and configurable pipeline stages
  • provide a global view across all the environments that an application passes through in its deployment pipeline
  • offer programmatic configuration and execution via a consistent and reliable API
  • be easy to configure, maintain, and extend  
  • be operationally resilient  
  • provide the existing benefits of Asgard without a migration

What’s more, we wanted to leverage a few lessons learned from Asgard. One particular goal of this new platform is to facilitate innovation within its umbrella. The original Asgard model was difficult to extend so the community forked Asgard to provide alternative implementations. Since these changes weren’t merged back into Asgard, those innovations were lost to the wider community. Spinnaker aims to make it easier to extend and enhance cloud deployment models in a way that doesn't require forking. Whether the community desires additional cloud providers, different deployment artifacts or new stages in a Continuous Delivery pipeline, extensions to Spinnaker will be available to everyone in the community without the need to fork. 

We additionally wanted to create a platform that, while replacing Asgard, doesn’t exclude it. A big-bang migration process off Asgard would be out of the question for Netflix and for the community. Consequently, changes to cloud assets via Asgard are completely compatible with changes to those same assets via our new platform. And vice versa!

Finally, we deliberately chose not to reimplement everything in Asgard. Ultimately, Asgard took on too much undifferentiated heavy lifting from the AWS console. Consequently, for those features that are not directly related to cluster management, such as SNS, SQS, and RDS Management, Netflix users and the community are encouraged to use the AWS Console.

Our new platform only implements those Asgard-like features related to cluster management from the point of view of an application (and even a group of related applications: a project). This application context allows you to work with a particular application’s related clusters, ASGs, instances, Security Groups, and ELBs, in all the AWS accounts in which the application is deployed.

Today, we have both systems running side by side, with the vast majority of deployments leveraging our new platform. Nevertheless, we have not yet reached the feature parity we desire with Asgard. That gap is closing rapidly, and in the near future we will be sunsetting the various Asgard instances running in our infrastructure. At this point, Netflix engineers aren't committing code to Asgard's GitHub repository; nevertheless, we happily encourage the OSS community's active participation in Asgard going forward.

Asgard served Netflix well for quite a long time. We learned numerous lessons along our journey and are ready to focus on the future with a new platform that makes Continuous Delivery a first-class citizen at Netflix and elsewhere. We plan to share this platform, Spinnaker, with the Open Source Community in the coming months.

-Delivery Engineering Team at Netflix 

Monday, September 28, 2015

Creating Your Own EC2 Spot Market

by: Andrew Park, Darrell Denlinger, & Coburn Watson

Netflix prioritizes innovation and reliability above efficiency, and as we continue to scale globally, finding opportunities that balance these three variables becomes increasingly difficult. However, every so often there is a process or application that can shift the curve out on all three factors; for Netflix this process was incorporating hybrid autoscaling engines for our services via Scryer & Amazon Auto Scaling.

Currently over 15% of our EC2 footprint autoscales, and the majority of this usage is covered by reserved instances, as we value the pricing and capacity benefits. The combination of these two factors has created an "internal spot market" with a daily peak of over 12,000 unused instances. We have been steadily working on building an automated system that allows us to effectively utilize these troughs.

Creating the internal spot capacity is straightforward: implement auto scaling and purchase reserved instances. In this post we’ll focus on how to leverage this trough given the complexities that stem from our large scale and decentralized microservice architecture. In the upcoming posts, we will discuss the technical details in automating Netflix’s internal spot market and highlight some of the lessons learned.

How the internal spot began

The initial foray into large-scale borrowing started in the spring of 2015. A new algorithm for one of our personalization services ballooned its video ranking precompute cluster, expanding it by 5x overnight. The precompute cluster had an SLA to complete its daily jobs between midnight and 11am, leaving over 1,500 r3.4xlarge instances unused during the afternoon and evening.

Motivated by the inefficiencies, we actively searched for another service that had relatively interruptible jobs that could run during the off-hours. The Encoding team, who is responsible for converting the raw master video files into consumable formats for our device ecosystem, was the perfect candidate. The initial approach applied was a borrowing schedule based on historical availability, with scale-downs manually communicated between the Personalization, Encoding, and Cloud Capacity teams.

Preliminary Manual Borrowing

As the Encoding team continued to reap the benefits of the extra capacity, they became interested in borrowing from the various sizable troughs in other instance types. Because of a lack of real time data exposing the unused capacity between our accounts, we embarked on a multi-team effort to create the necessary tooling and processes to allow borrowing to occur on a larger, more automated scale.

Current Automated Borrowing

Borrowing considerations

The first requirement for automated borrowing is building out telemetry that exposes unused reservation counts. Given that our autoscaling engines operate at minute granularity, we could not leverage AWS's billing file as our data source. Instead, the Engineering Tools team built an API inside our deployment platform that exposes real-time unused reservations at the minute level. This unused-reservation calculation combines input data from our deployment tool, our monitoring system, and AWS's reservation system.
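As a rough sketch of that calculation (not the actual Engineering Tools API; the class name and data sources below are hypothetical), unused reservations per instance type can be derived each minute by subtracting running instance counts from reserved counts:

```java
import java.util.HashMap;
import java.util.Map;

/** A minimal sketch, with hypothetical data sources, of the unused-reservation calculation:
 *  reserved capacity minus running instances, keyed by instance type, refreshed each minute. */
final class UnusedReservationCalculator {

    /** Reserved instance counts per instance type, e.g. from AWS reservation data. */
    private final Map<String, Integer> reservedByType;
    /** Running instance counts per instance type, e.g. from the deployment and monitoring tools. */
    private final Map<String, Integer> runningByType;

    UnusedReservationCalculator(Map<String, Integer> reservedByType,
                                Map<String, Integer> runningByType) {
        this.reservedByType = reservedByType;
        this.runningByType = runningByType;
    }

    /** Unused reservations = max(reserved - running, 0) for each instance type. */
    Map<String, Integer> unusedByType() {
        Map<String, Integer> unused = new HashMap<>();
        for (Map.Entry<String, Integer> e : reservedByType.entrySet()) {
            int running = runningByType.getOrDefault(e.getKey(), 0);
            unused.put(e.getKey(), Math.max(e.getValue() - running, 0));
        }
        return unused;
    }

    public static void main(String[] args) {
        Map<String, Integer> reserved = Map.of("r3.4xlarge", 2000, "m3.2xlarge", 5000);
        Map<String, Integer> running  = Map.of("r3.4xlarge", 500,  "m3.2xlarge", 5200);
        // r3.4xlarge shows 1,500 unused reservations available to borrow; m3.2xlarge shows 0.
        System.out.println(new UnusedReservationCalculator(reserved, running).unusedByType());
    }
}
```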

The second requirement is finding batch jobs that are short in duration or interruptible in nature. Our batch Encoding jobs had duration SLAs of between five minutes and an hour, making them a perfect fit for our initial twelve-hour borrowing window. An additional benefit is having jobs that are resource agnostic, allowing for more borrowing opportunities as our usage landscape creates troughs across instance types.

The last requirement is for teams to absorb the telemetry data and to set appropriate rules for when to borrow instances. The main concern was whether or not this borrowing would jeopardize capacity for services in the critical path. We alleviated this issue by placing all of our borrowing into a separate account from our production account and leveraging the financial advantages of consolidated billing. Theoretically, a perfectly automated borrowing system would have the same operational and financial results regardless of account structure, but leveraging consolidated billing creates a capacity safety net.


In the ideal state, the internal spot market can be the most efficient platform for running short duration or interruptible jobs through instance level bin-packing. A series of small steps moved us in the right direction, such as:
  • Identifying preliminary test candidates for resource sharing
  • Creating shorter run-time jobs or modifying jobs to be more interruptible
  • Communicating broader messaging about resource sharing
In our next post in this series, the Encoding team will talk through their use cases of the internal spot market, depicting the nuances of real time borrowing at such scale. Their team is actively working through this exciting efficiency problem and many others at Netflix; please check our Jobs site if you want to help us solve these challenges!

Friday, September 25, 2015

Chaos Engineering Upgraded

Several years ago we introduced a tool called Chaos Monkey. This service pseudo-randomly plucks a server from our production deployment on AWS and kills it. At the time we were met with incredulity and skepticism. Are we crazy? In production?!?

Our reasoning was sound, and the results bore that out. Since we knew that server failures are guaranteed to happen, we wanted those failures to happen during business hours when we were on hand to fix any fallout. We knew that we could rely on engineers to build resilient solutions if we gave them the context to *expect* servers to fail. If we could align our engineers to build services that survive a server failure as a matter of course, then when it accidentally happened it wouldn’t be a big deal. In fact, our members wouldn’t even notice. This proved to be the case.

Chaos Kong

Building on the success of Chaos Monkey, we looked at an extreme case of infrastructure failure. We built Chaos Kong, which doesn't just kill a server. It kills an entire AWS Region [1].

It is very rare that an AWS Region becomes unavailable, but it does happen. This past Sunday (September 20th, 2015) Amazon’s DynamoDB service experienced an availability issue in their US-EAST-1 Region. That instability caused more than 20 additional AWS services that are dependent on DynamoDB to fail. Some of the Internet’s biggest sites and applications were intermittently unavailable during a six- to eight-hour window that day.

Netflix did experience a brief availability blip in the affected Region, but we sidestepped any significant impact because Chaos Kong exercises prepare us for incidents like this. By running experiments on a regular basis that simulate a Regional outage, we were able to identify any systemic weaknesses early and fix them. When US-EAST-1 actually became unavailable, our system was already strong enough to handle a traffic failover.

Below is a chart of our video play metrics during a Chaos Kong exercise. These are three views of the same eight hour window. The top view shows the aggregate metric, while the bottom two show the same metric for the west region and the east region, respectively.

Chaos Kong exercise in progress

In the bottom row, you can clearly see traffic evacuate from the west region. The east region gets a corresponding bump in traffic as it steps up to play the role of savior. During the exercise, most of our attention stays focused on the top row. As long as the aggregate metric follows that relatively smooth trend, we know that our system is resilient to the failover. At the end of the exercise, you see traffic revert to the west region, and the aggregate view shows that our members did not experience any adverse effects. We run Chaos Kong exercises like this on a regular basis, and it gives us confidence that even if an entire region goes down, we can still serve our customers.


We looked around to see what other engineering practices could benefit from these types of exercises, and we noticed that Chaos meant different things to different people. In order to carry the practice forward, we need a best-practice definition, a model that we can apply across different projects and different departments to make our services more resilient.

We want to capture the value of these exercises in a methodology that we can use to improve our systems and push the state of the art forward. At Netflix we have an extremely complex distributed system (microservice architecture) with hundreds of deploys every day. We don’t want to remove the complexity of the system; we want to thrive on it. We want to continue to accelerate flexibility and rapid development. And with that complexity, flexibility, and rapidity, we still need to have confidence in the resiliency of our system.

To have our cake and eat it too, we set out to develop a new discipline around Chaos. We developed an empirical, systems-based approach which addresses the chaos inherent in distributed systems at scale. This approach specifically builds confidence in the ability of those systems to withstand realistic conditions. We learn about the behavior of a distributed system by observing it in a controlled experiment, and we use those learnings to fortify our systems before any systemic effect can disrupt the quality service that we provide our customers. We call this new discipline Chaos Engineering.

We have published the Principles of Chaos Engineering as a living document, so that other organizations can contribute to the concepts that we outline here.


We put these principles into practice. At Netflix we have a microservice architecture. One of our services is called Subscriber, which handles certain user management activities and authentication. It is possible that under some rare or even unknown situation Subscriber will be crippled. This might be due to network errors, under-provisioning of resources, or even by events in downstream services upon which Subscriber depends. When you have a distributed system at scale, sometimes bad things just happen that are outside any person’s control. We want confidence that our service is resilient to situations like this.

We have a steady-state definition: Our metric of interest is customer engagement, which we measure as the number of video plays that start each second. In some experiments we also look at load average and error rate on an upstream service (API). The lines that those metrics draw over time are predictable, and provide a good proxy for the steady state of the system. We have a hypothesis: We will see no significant impact on our customer engagement over short periods of time, on the order of an hour, even when Subscriber is in a degraded state. We have variables: We add 30ms of latency, first to 20% and then to 50% of the traffic from Subscriber to its primary cache. This simulates a situation in which the Subscriber cache is over-stressed and performing poorly. Cache misses increase, which in turn increases load on other parts of the Subscriber service. Then we look for a statistically significant deviation between the variable group and the control group with respect to the system's steady-state level of customer engagement.
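As a simple illustration of that final step (a sketch only, not Netflix's actual analysis tooling; the numbers are toy data), the deviation check can be framed as a comparison of per-second play-start counts between the control and variable groups:

```java
import java.util.Arrays;

/** Sketch of a steady-state deviation check: compare per-second play-start counts
 *  for the control and variable groups and flag a significant difference in means. */
final class SteadyStateCheck {

    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(0.0);
    }

    static double variance(double[] xs, double mean) {
        return Arrays.stream(xs).map(x -> (x - mean) * (x - mean)).sum() / (xs.length - 1);
    }

    /** Welch's t-statistic for the difference in means between the two groups. */
    static double welchT(double[] control, double[] variable) {
        double mc = mean(control), mv = mean(variable);
        double vc = variance(control, mc), vv = variance(variable, mv);
        return (mv - mc) / Math.sqrt(vc / control.length + vv / variable.length);
    }

    public static void main(String[] args) {
        // Toy per-second play-start counts for a short window (illustrative only).
        double[] control  = {101, 99, 100, 102, 98, 100, 101, 99};
        double[] variable = {97, 95, 96, 98, 94, 96, 97, 95};

        double t = welchT(control, variable);
        // A real test would also account for degrees of freedom and sample size;
        // here a coarse threshold stands in for the significance test.
        boolean deviation = Math.abs(t) > 3.0;
        System.out.printf("t = %.2f, significant deviation: %b%n", t, deviation);
    }
}
```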

If we find a deviation from steady-state in our variable group, then we have disproved our hypothesis. That would cause us to revisit the fallbacks and dependency configuration for Subscriber. We would undertake a concerted effort to improve the resiliency story around Subscriber and the services that it touches, so that customers can count on our service even when Subscriber is in a degraded state.

If we don’t find any deviation in our variable group, then we feel more confident in our hypothesis. That translates to having more confidence in our service as a whole.

In this specific case, we did see a deviation from steady-state when 30ms latency was added to 50% of the traffic going to this service. We identified a number of steps that we could take, such as decreasing the thread pool count in an upstream service, and subsequent experiments have confirmed the bolstered resiliency of Subscriber.


We started Chaos Monkey to build confidence in our highly complex system. We don’t have to simplify or even understand the system to see that over time Chaos Monkey makes the system more resilient. By purposefully introducing realistic production conditions into a controlled run, we can uncover weaknesses before they cause bigger problems. Chaos Engineering makes our system stronger, and gives us the confidence to move quickly in a very complex system.


This blog post is part of a series. In the next post on Chaos Engineering, we will take a deeper dive into the Principles of Chaos Engineering and hypothesis building with additional examples from our production experiments. If you have thoughts on Chaos Engineering or how to advance the state of the art in this field, we’d love to hear from you. Feel free to reach out to

-Chaos Team at Netflix
Ali Basiri, Lorin Hochstein, Abhijit Thosar, Casey Rosenthal

[1] Technically, it only simulates killing an AWS Region. For our purposes, simulating this giant infrastructure failure is sufficient, and AWS doesn't yet provide us with a way of turning off an entire region. ;-)

Thursday, September 24, 2015

John Carmack on Developing the Netflix App for Oculus

Hi, this is Anthony Park, VP of Engineering at Netflix. We've been working with Oculus to develop a Netflix app for Samsung Gear VR. The app includes a Netflix Living Room, allowing members to get the Netflix experience from the comfort of a virtual couch, wherever they bring their Gear VR headset. It's available to Oculus users today. We've been working closely with John Carmack, CTO of Oculus and programmer extraordinaire, to bring our TV user interface to the Gear VR headset. Well, honestly, John did most of the development himself(!), so I've asked him to be a guest blogger today and share his experience with implementing the new app. Here's a sneak peek at the experience, and I'll let John take it from here...

Netflix Living Room on Gear VR

The Netflix Living Room

Despite all the talk of hardcore gamers and abstract metaverses, a lot of people want to watch movies and shows in virtual reality. In fact, during the development of Gear VR, Samsung internally referred to it as the HMT, for "Head Mounted Theater." Current VR headsets can't match a high end real world home theater, but in many conditions the "best seat in the house" may be in the Gear VR that you pull out of your backpack.

Some of us from Oculus had a meeting at Netflix HQ last month, and when things seemed to be going well, I blurted out "Grab an engineer, let's do this tomorrow!"

That was a little bit optimistic, but when Vijay Gondi and Anthony Park came down from Netflix to Dallas the following week, we did get the UI running in VR on the second day, and video playing shortly thereafter.

The plan of attack was to take the Netflix TV codebase and present it on a virtual TV screen in VR. Ideally, the Netflix code would be getting events and drawing surfaces, not even really aware that it wasn't showing up on a normal 2D screen.

I wrote a "VR 2D Shell" application that functioned like a very simplified version of our Oculus Cinema application; the big screen is rendered with our peak-quality TimeWarp layer support, and the environment gets a neat dynamic lighting effect based on the screen contents. Anything we could get into a texture could be put on the screen.

The core Netflix application uses two Android Surfaces – one for the user interface layer, and one for the decoded video layer. To present these in VR I needed to be able to reference them as OpenGL textures, so the process was: create an OpenGL texture ID, use that to initialize a SurfaceTexture object, then use that to initialize a Surface object that could be passed to Netflix.
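For reference, that texture-to-Surface chain looks roughly like the sketch below on Android (the class and field names are mine, and it assumes a current OpenGL ES context on the calling thread; the real VR shell also sets texture filtering parameters and wires up frame-available callbacks):

```java
import android.graphics.SurfaceTexture;
import android.opengl.GLES11Ext;
import android.opengl.GLES20;
import android.view.Surface;

/** Sketch of creating a Surface backed by a GL external texture, as described above. */
final class VrSurfaceFactory {

    static final class VrSurface {
        final int textureId;                  // sampled by the VR renderer as an external texture
        final SurfaceTexture surfaceTexture;  // latches queued images into the texture
        final Surface surface;                // handed to the Netflix code as a normal Surface
        VrSurface(int textureId, SurfaceTexture surfaceTexture, Surface surface) {
            this.textureId = textureId;
            this.surfaceTexture = surfaceTexture;
            this.surface = surface;
        }
    }

    /** Must be called on a thread with a current OpenGL ES context. */
    static VrSurface create() {
        int[] ids = new int[1];
        GLES20.glGenTextures(1, ids, 0);
        GLES20.glBindTexture(GLES11Ext.GL_TEXTURE_EXTERNAL_OES, ids[0]);

        SurfaceTexture surfaceTexture = new SurfaceTexture(ids[0]);
        Surface surface = new Surface(surfaceTexture);
        return new VrSurface(ids[0], surfaceTexture, surface);
    }
}
// Each render frame, the VR code calls surfaceTexture.updateTexImage() to latch the most
// recently queued image into the external texture before drawing the virtual screen.
```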

For the UI surface, this worked great -- when the Netflix code does a swapbuffers, the VR code can have the SurfaceTexture do an update, which will latch the latest image into an EGL external image, which can then be texture mapped onto geometry by the GPU.

The video surface was a little more problematic. To provide smooth playback, the video frames are queued a half second ahead, tagged with a "release time" that the Android window compositor will use to pick the best frame each update. The SurfaceTexture interface that I could access as a normal user program only had an "Update" method that always returned the very latest frame submitted. This meant that the video came out a half second ahead of the audio, and stuttered a lot.

To fix this, I had to make a small change in the Netflix video decoding system so it would call out to my VR code right after it submitted each frame, letting me know that it had submitted something with a particular release time. I could then immediately update the surface texture and copy it out to my own frame queue, storing the release time with it. This is an unfortunate waste of memory, since I am duplicating over a dozen video frames that are also being buffered on the surface, but it gives me the timing control I need.
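A minimal sketch of that frame queue (hypothetical names, not the actual VR shell code, which also manages the copied textures) might look like this:

```java
import java.util.ArrayDeque;

/** Sketch of a release-time-tagged frame queue for keeping video in sync with audio. */
final class TimedFrameQueue {

    static final class Frame {
        final int textureId;       // our private copy of the decoded frame
        final long releaseTimeNs;  // when the compositor was meant to display it
        Frame(int textureId, long releaseTimeNs) {
            this.textureId = textureId;
            this.releaseTimeNs = releaseTimeNs;
        }
    }

    private final ArrayDeque<Frame> queue = new ArrayDeque<>();

    /** Called from the decoder hook right after each frame is submitted to the video surface. */
    synchronized void onFrameSubmitted(int copiedTextureId, long releaseTimeNs) {
        queue.addLast(new Frame(copiedTextureId, releaseTimeNs));
    }

    /** Called once per VR render frame: returns the newest frame whose release time has passed,
     *  or null, meaning "keep showing whatever was drawn last frame". */
    synchronized Frame pickFrame(long nowNs) {
        Frame best = null;
        while (!queue.isEmpty() && queue.peekFirst().releaseTimeNs <= nowNs) {
            best = queue.pollFirst();   // discard frames whose display window has already passed
        }
        return best;
    }
}
```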

Initially, input was handled with a Bluetooth joypad emulating the LRUD / OK buttons of a remote control, but it was important to be able to control it using just the touchpad on the side of Gear VR. Our preferred VR interface is "gaze and tap", where a cursor floats in front of you in VR, and tapping is like clicking a mouse. For most things, this is better than gamepad control, but not as good as a real mouse, especially if you have to move your head significant amounts. Netflix has support for cursors, but it assumes that you can turn the cursor on and off, which we can't really do.

We wound up with some heuristics driving the behavior. I auto-hide the cursor when the movie starts playing, inhibit cursor updates briefly after swipes, and send actions on touch up instead of touch down so you can perform swipes without also triggering touches. It isn't perfect, but it works pretty well.

Layering of the Android Surfaces within the Netflix Living Room


The screens on the Gear VR supported phones are all 2560x1440 resolution, which is split in half to give each eye a 1280x1440 view that covers approximately 90 degrees of your field of view. If you have tried previous Oculus headsets, that is more than twice the pixel density of DK2, and four times the pixel density of DK1. That sounds like a pretty good resolution for videos until you consider that very few people want a TV screen to occupy a 90 degree field of view. Even quite large screens are usually placed far enough away to be about half of that in real life.

The optics in the headset that magnify the image and allow your eyes to focus on it introduce both a significant spatial distortion and chromatic aberration that needs to be corrected. The distortion compresses the pixels together in the center and stretches them out towards the outside, which has the positive effect of giving a somewhat higher effective resolution in the middle where you tend to be looking, but it also means that there is no perfect resolution for content to be presented in. If you size it for the middle, it will need mip maps and waste pixels on the outside. If you size it for the outside, it will be stretched over multiple pixels in the center.

For synthetic environments on mobile, we usually size our 3D renderings close to the outer range, about 1024x1024 pixels per eye, and let it be a little blurrier in the middle, because we care a lot about performance. On high end PC systems, even though the actual headset displays are lower resolution than Gear VR, sometimes higher resolution scenes are rendered to extract the maximum value from the display in the middle, even if the majority of the pixels wind up being blended together in a mip map for display.

The Netflix UI is built around a 1280x720 resolution image. If that was rendered to a giant virtual TV covering 60 degrees of your field of view in the 1024x1024 eye buffer, you would have a very poor quality image as you would only be seeing a quarter of the pixels. If you had mip maps it would be a blurry mess, otherwise all the text would be aliased fizzing in and out as your head made tiny movements each frame.
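The back-of-the-envelope arithmetic behind "a quarter of the pixels" goes roughly like this (approximate values, ignoring distortion):

```java
/** Rough arithmetic for the claim above (approximate, distortion ignored). */
final class ScreenPixelBudget {
    public static void main(String[] args) {
        double eyeBufferPx  = 1024;  // eye buffer width in pixels
        double eyeBufferFov = 90;    // eye buffer field of view, degrees
        double screenFov    = 60;    // virtual TV field of view, degrees
        double uiWidthPx    = 1280;  // Netflix UI width in pixels

        double screenPx   = eyeBufferPx * (screenFov / eyeBufferFov);  // ~683 eye-buffer pixels
        double horizontal = screenPx / uiWidthPx;                      // ~0.53 of UI pixels shown
        double total      = horizontal * horizontal;                   // ~0.28, about a quarter

        System.out.printf("screen width: %.0f px, coverage: %.2f horizontal, %.2f total%n",
                screenPx, horizontal, total);
    }
}
```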

The technique we use to get around this is to have special code for just the screen part of the view that can directly sample a single textured rectangle after the necessary distortion calculations have been done, and blend that with the conventional eye buffers. These are our "Time Warp Layers". This has limited flexibility, but it gives us the best possible quality for virtual screens (and also the panoramic cube maps in Oculus 360 Photos). If you have a joypad bound to the phone, you can toggle this feature on and off by pressing the start button. It makes an enormous difference for the UI, and is a solid improvement for the video content.

Still, it is drawing a 1280 pixel wide UI over maybe 900 pixels on the screen, so something has to give. Because of the nature of the distortion, the middle of the screen winds up stretching the image slightly, and you can discern every single pixel in the UI. As you get towards the outer edges, and especially the corners, more and more of the UI pixels get blended together. Some of the Netflix UI layout is a little unfortunate for this; small text in the corners is definitely harder to read.

So forget 4K, or even full-HD. 720p HD is the highest resolution video you should even consider playing in a VR headset today.

This is where content protection comes into the picture. Most studios insist that HD content only be played in a secure execution environment to reduce opportunities for piracy. Modern Android systems' video CODECs can decode into special memory buffers that literally can't be read by anything other than the video screen scanning hardware; untrusted software running on the CPU and GPU have no ability to snoop into the buffer and steal the images. This happens at the hardware level, and is much more difficult to circumvent than software protections.

The problem for us is that to draw a virtual TV screen in VR, the GPU fundamentally needs to be able to read the movie surface as a texture. On some of the more recent phone models we have extensions to allow us to move the entire GPU framebuffer into protected memory and then get the ability to read a protected texture, but because we can't write anywhere else, we can't generate mip maps for it. We could get the higher resolution for the center of the screen, but then the periphery would be aliasing, and we lose the dynamic environment lighting effect, which is based on building a mip map of the screen down to 1x1. To top it all off, the frame-timing queue used to keep the audio synced up wouldn't be possible.

The reasonable thing to do was just limit the streams to SD resolution – 720x480. That is slightly lower than I would have chosen if the need for a secure execution environment weren't an issue, but not too much. Even at that resolution, the extreme corners are doing a little bit of pixel blending.

Flow diagram for SD video frames to allow composition with VR

In an ideal world, the bitrate / resolution tradeoff would be made slightly differently for VR. On a retina class display, many compression artifacts aren't really visible, but the highly magnified pixels in VR put them much more in your face. There is a hard limit to how much resolution is useful, but every visible compression artifact is correctable with more bitrate.

Power Consumption

For a movie viewing application, power consumption is a much bigger factor than for a short action game. My target was to be able to watch a two hour movie in VR starting at 70% battery. We hit this after quite a bit of optimization, but the VR app still draws over twice as much power as the standard Netflix Android app.

When a modern Android system is playing video, the application is only shuffling the highly compressed video data from the network to the hardware video CODEC, which decompresses it to private buffers, which are then read by the hardware composer block that performs YUV conversion and scaling directly as it feeds it to the display, without ever writing intermediate values to a framebuffer. The GPU may even be completely powered off. This is pretty marvelous – it wasn't too long ago when a PC might use 100x the power to do it all in software.

For VR, in addition to all the work that the standard application is doing, we are rendering stereo 3D scenes with tens of thousands of triangles and many megabytes of textures in each one, and then doing an additional rendering pass to correct for the distortion of the optics.

When I first brought up the system in the most straightforward way with the UI and video layers composited together every frame, the phone overheated to the thermal limit in less than 20 minutes. It was then a process of finding out what work could be avoided with minimal loss in quality.

The bulk of a viewing experience should be pure video. In that case, we only need to mip-map and display a 720x480 image, instead of composing it with the 1280x720 UI. There were no convenient hooks in the Netflix codebase to say when the UI surface was completely transparent, so I read back the bottom 1x1 pixel mip map from the previous frame's UI composition and look at the alpha channel: 0 means the UI was completely transparent, and the movie surface can be drawn by itself. 255 means the UI is solid, and the movie can be ignored. Anything in between means they need to be composited together. This gives the somewhat surprising result that subtitles cause a noticeable increase in power consumption.
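The resulting per-frame decision is simple; here is a sketch with hypothetical method names (the real hooks live inside the VR shell's renderer):

```java
/** A minimal sketch of the per-frame composition decision described above.
 *  All method names are hypothetical stand-ins for the VR shell's internal hooks. */
abstract class CompositionPolicy {
    /** Alpha of the 1x1 mip level of the previous frame's UI composition, 0..255. */
    abstract int readBottom1x1MipAlpha();
    abstract void drawVideoLayerOnly();   // cheapest path: pure video playback
    abstract void drawUiLayerOnly();      // UI is fully opaque, so the movie layer can be ignored
    abstract void compositeUiOverVideo(); // partially transparent UI (e.g. subtitles): blend both

    void renderFrame() {
        int uiAlpha = readBottom1x1MipAlpha();
        if (uiAlpha == 0) {
            drawVideoLayerOnly();
        } else if (uiAlpha == 255) {
            drawUiLayerOnly();
        } else {
            compositeUiOverVideo();
        }
    }
}
```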

I had initially implemented the VR gaze cursor by drawing it into the UI composition surface, which was a useful check on my intersection calculations, but it meant that the UI composition had to happen every single frame, even when the UI was completely static. Moving the gaze cursor back to its own 3D geometry allowed the screen to continue reusing the previous composition when nothing was changing, which is usually more than half of the frames when browsing content.

One of the big features of our VR system is the "Asynchronous Time Warp", where redrawing the screen and distortion correcting in response to your head movement is decoupled from the application's drawing of the 3D world. Ideally, the app draws 60 stereo eye views a second in sync with Time Warp, but if the app fails to deliver a new set of frames then Time Warp will reuse the most recent one it has, re-projecting it based on the latest head tracking information. For looking around in a static environment, this works remarkably well, but it starts to show the limitations when you have smoothly animating objects in view, or your viewpoint moves sideways in front of a surface.

Because the video content is 30 or 24 fps and there is no VR viewpoint movement, I cut the scene update rate to 30 fps during movie playback for a substantial power savings. The screen is still redrawn at 60 fps, so it doesn't feel any choppier when you look around. I go back to 60 fps when the lights come up, because the gaze cursor and UI scrolling animations look significantly worse at 30 fps.

If you really don't care about the VR environment, you can go into a "void theater", where everything is black except the video screen, which obviously saves additional power. You could even go all the way to a face-locked screen with no distortion correction, which would be essentially the same power draw as the normal Netflix application, but it would be ugly and uncomfortable.

A year ago, I had a short list of the top things that I felt Gear VR needed to be successful. One of them was Netflix. It was very rewarding to be able to do this work right before Oculus Connect and make it available to all of our users in such a short timeframe. Plus, I got to watch the entire season of Daredevil from the comfort of my virtual couch. Because testing, of course.