Monday, November 18, 2013

Building the New Netflix Experience for TV

by Joubert Nel

We just launched a new Netflix experience for TV and game consoles. The new design is based on our premise that each show or movie has a tone and a narrative that should be conveyed by the UI. To tell a richer story we provide relevant evidence and cinematic art that better explain why we think you should watch a show or movie.

The new user interface required us to question our paradigms about what can be delivered on TV – not only is this UI more demanding of game consoles than any of our previous UIs, but we also wanted budget devices to deliver a richer experience than what was previously possible.

For the first time we needed a single UI that could accept navigation using a TV remote or game controller, as well as voice commands and remotes that direct a mouse cursor on screen.

Before we get into how we developed for performance and built for different input methods, let’s take a look at our UI stack.

UI Stack

My team builds Netflix UIs for the devices in your living room: PlayStation 3, PlayStation 4, Xbox 360, Roku 3, and recent Smart TVs and Blu-ray players.

We deploy UI updates with new A/B tests, support for new locales like the Netherlands, and new features like Profiles. While remaining flexible, we also want to take advantage of as much of the underlying hardware as possible in a cross-platform way.

So, a few years ago we broke our device client code into two parts: an SDK that runs on the metal, and a UI written in JavaScript. The SDK provides a rendering engine, JavaScript runtime, networking, security, video playback, and other platform hooks. Depending on the device, SDK updates range from quarterly to annually to never. The UI, in contrast, can be updated at any time and is downloaded (or retrieved from disk cache) when the user fires up Netflix.

Key, Voice, Pointer

The traditional way for users to control our UI on a game console or TV is via an LRUD input (left/right/up/down) such as a TV remote control or game controller. Additionally, Xbox 360 users should be able to navigate with voice commands, and folks with an LG Smart TV with Magic Remote must be able to navigate by pointing their remote control at elements on screen. Our new UI is our first to incorporate all three input methods in a single design.

We wanted to build our view components in such a way that their interaction behaviors are encapsulated. This code proximity makes code more maintainable and reusable and the class hierarchy more robust. We needed a consistent way to dispatch the three kinds of user input events to the view hierarchy.

We created a new JavaScript event dispatcher that routes key, pointer, and voice input in a uniform way to views. We needed an incremental solution that didn’t require refactoring the whole codebase, so we designed it to coexist with our legacy key handling and provide a migration path.
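As a sketch of the idea (the names and event shapes here are hypothetical, not Netflix's actual API): raw key, pointer, and voice inputs can be normalized into a single event shape and bubbled up the view hierarchy until some view handles them.

```javascript
// Hypothetical sketch of a unified input dispatcher; all names and event
// shapes are illustrative, not Netflix's actual API.
function createDispatcher(rootView) {
  // Normalize each raw input into one event shape so a view implements a
  // single handler regardless of whether the input came from a key press,
  // a pointer gesture, or a voice command.
  function normalize(type, raw) {
    switch (type) {
      case 'key':     return { action: raw.keyName, source: 'key' };
      case 'pointer': return { action: raw.gesture, source: 'pointer', x: raw.x, y: raw.y };
      case 'voice':   return { action: raw.command, source: 'voice' };
    }
  }

  return {
    dispatch: function (type, raw) {
      var event = normalize(type, raw);
      // Bubble the event up the view hierarchy, starting at the focused
      // view, until some view reports that it handled the event.
      for (var view = rootView.focusedView || rootView; view; view = view.parent) {
        if (view.handleInput && view.handleInput(event)) {
          return true;
        }
      }
      return false; // unhandled; legacy key handling could pick it up here
    }
  };
}
```

Because the dispatcher reports unhandled events, legacy key handling can sit behind it during a gradual migration.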

We must produce JavaScript builds that contain only the code for input methods supported by the target device, because reduced code size yields faster code parsing and, in turn, faster startup.

To produce lean builds, we use a text preprocessor to strip out input handling code that is irrelevant to a target platform. The advantage of using a text preprocessor instead of, for example, using mixins to layer in additional appearances and interactions, is that we get much higher levels of code proximity and simplicity.
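For illustration only (the directive syntax below is made up, not the preprocessor Netflix uses), a text preprocessor of this kind strips guarded regions before the build ships:

```javascript
// Minimal sketch of build-time stripping. Lines between "// #if FLAG" and
// "// #endif" are removed when the flag is off for the target platform.
function preprocess(source, flags) {
  var out = [];
  var keepStack = [true];
  source.split('\n').forEach(function (line) {
    var open = line.match(/^\s*\/\/\s*#if\s+(\w+)/);
    if (open) {
      // Keep the region only if all enclosing regions are kept too.
      keepStack.push(keepStack[keepStack.length - 1] && !!flags[open[1]]);
    } else if (/^\s*\/\/\s*#endif/.test(line)) {
      keepStack.pop();
    } else if (keepStack[keepStack.length - 1]) {
      out.push(line);
    }
  });
  return out.join('\n');
}
```

A build targeting an LRUD-only device would run with `{ POINTER: false, VOICE: false }`, so pointer and voice handlers never reach the device at all.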


Devices in the living room use DirectFB or OpenGL for graphics (or something OpenGL-like) and can use hardware acceleration for animating elements of the UI. Leveraging the GPU is key in creating a smooth experience that is responsive to user input – we’ve done it on WebKit using accelerated compositing (see WebKit in Your Living Room and Building the Netflix UI for Wii U).

The typical implementation of hardware accelerated animation of a rectangle requires width x height x bytes per pixel of memory. In our UI we animate entire scenes when transitioning between them; animating one scene at 1080p would require close to 8MB of memory (1920 x 1080 x 4) but at 720p requires 3.5MB (1280 x 720 x 4). We see devices with as little as 20MB memory allocated to a hardware-accelerated rendering cache. Moreover, other system resources such as main memory, disk cache, and CPU may also be severely constrained as compared to a mobile phone, laptop, or game console. 
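The buffer sizes quoted above come straight from width × height × bytes per pixel:

```javascript
// Size in bytes of a hardware-accelerated surface: width × height × bytes
// per pixel (4 for a 32-bit ARGB pixel format).
function surfaceBytes(width, height, bytesPerPixel) {
  return width * height * bytesPerPixel;
}

surfaceBytes(1920, 1080, 4); // 8,294,400 bytes, close to 8MB
surfaceBytes(1280, 720, 4);  // 3,686,400 bytes, about 3.5MB
```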

How can we squeeze as much performance as possible out of budget devices and add more cinematic animations on game consoles?

We think JavaScript, HTML and CSS are great technologies to build compelling experiences with, such as our HTML 5 player UI. But we wanted more fine-grained control of the graphics layer and wanted optimizations for apps that do not need reflowable content. Our SDK team built a new rendering engine with which we can deliver animations on very resource constrained devices, making it possible to give customers our best UI. We can also enrich the experience with cinematic animations & effects on game consoles.

Our second strategy is to group devices into performance classes. These give us entry points to turn different knobs – pool sizes, prefetch ranges, effects, animations, and caching – to take advantage of fewer or more resources while maintaining the integrity of the UI design & interaction.
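A performance class might boil down to a small bag of settings; the class names and values below are invented for illustration, not Netflix's actual configuration:

```javascript
// Hypothetical performance-class table; class names and knob values are
// illustrative only.
var performanceClasses = {
  budget:  { imagePoolSize: 24, prefetchRange: 1, transitions: 'fade',      textureCacheMB: 20 },
  mid:     { imagePoolSize: 48, prefetchRange: 2, transitions: 'slide',     textureCacheMB: 64 },
  console: { imagePoolSize: 96, prefetchRange: 4, transitions: 'cinematic', textureCacheMB: 256 }
};

function knobsFor(deviceClass) {
  // Unknown devices fall back to the most conservative settings so the UI
  // design and interaction stay intact everywhere.
  return performanceClasses[deviceClass] || performanceClasses.budget;
}
```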

Delivering great experiences

In the coming weeks we will be diving into more details of our JavaScript code base on this blog.

Building the new Netflix experience for TV was a lot of work, but it gave us a chance to be a PlayStation 4 launch partner, productize our biggest A/B test successes of 2013, and delight tens of millions of Netflix customers.

If this excites you and you want to help build the future UIs for discovering and watching shows and movies, join our team!

Preparing the Netflix API for Deployment

At Netflix, we are committed to bringing new features and product enhancements to customers rapidly and frequently. Consequently, dozens of engineering teams are constantly innovating on their services, resulting in a rate of change to the overall service that is vast and unending. Because of our appetite for providing such improvements to our members, it is critical for us to maintain a Software Delivery pipeline that allows us to deploy changes to our production environments in an easy, seamless, and quick way, with minimal risk.

Thursday, November 14, 2013

Netflix Open Source Software Cloud Prize Winners

We launched the Netflix Open Source Software Cloud Prize in March 2013 and it got a lot of attention in the press and blogosphere. Six months later we closed the contest, took a good look at the entrants, picked the best as nominees, and our panel of distinguished judges decided the winners in each category. The final winners were announced during Werner Vogels' keynote at AWS Re:Invent on November 14th, 2013.

The ten winners all put in a lot of work to earn their prizes. They each won a trip to Las Vegas and a ticket for AWS Re:Invent, a Cloud Monkey trophy, $10,000 prize money from Netflix and $5,000 in AWS credits from Amazon. After the keynote was over we all went on stage to get our photo taken with Werner Vogels.

Peter Sankauskas (@pas256) is a software engineer living in Silicon Valley, and the founder of Answers for AWS (@Answers4AWS). He specializes in automation, scaling and Amazon Web Services. Peter contributed Ansible playbooks, CloudFormation templates and many pre-built AMIs to make it easier for everyone else to get started with NetflixOSS, and put Asgard and Edda in the AWS Marketplace. He recently started his own company, AnsWerS, to help people who want to move to AWS. Getting started with NetflixOSS can be daunting because there are 35 different projects to figure out, and Peter has created an extremely useful and simple on-ramp for new NetflixOSS users.

Chris Grzegorczyk (eucaflix, grze, of Goleta California) is Chief Architect and Co-Founder, and Vic Iglesias is Quality and Release Manager at Eucalyptus Systems. Eucalyptus have been using NetflixOSS to provide a proof point for portability of applications from AWS to private clouds based on Eucalyptus. Eucalyptus is open source software for building private and hybrid clouds that are compatible with AWS APIs. Their submission enables NetflixOSS projects to treat Eucalyptus as an additional AWS region and to deploy applications to AWS regions and Eucalyptus datacenters from the same Asgard console.

In June 2013 they shipped a major update to Eucalyptus that included advanced AWS features such as Autoscale Groups that NetflixOSS depends on. Eucalyptus have demonstrated working code at several Netflix meetups and have really helped promote the NetflixOSS ecosystem.

IBM had previously created a demonstration application called Acme Air for their WebSphere tools running on IBM SmartCloud. It was a fairly conventional enterprise architecture application, with a Java front end and a database back end. For their winning prize entry, Andrew Spyker (aspyker, of Raleigh North Carolina) figured out how to re-implement Acme Air as a cloud native example application using NetflixOSS libraries and component services, running on AWS. He then ran some benchmark stress tests to demonstrate scalability. This was demonstrated at a Netflix Meetup last summer. The Acme Air example application combines several NetflixOSS projects: the Eureka service registry, the Hystrix circuit breaker pattern, the Karyon base server framework, the Ribbon HTTP client, and the Asgard provisioning portal. IBM used NetflixOSS to get a deeper understanding of Cloud Native architecture and tools, which it can apply to helping enterprise customers make the transition to cloud.

The Reactive Extensions (Rx) pattern is one of the most advanced and powerful concepts for structuring code to come out in recent years. The original work on Rx at Microsoft by Erik Meijer inspired Netflix to create the RxJava project. We started with a subset of Rx functionality and left a lot of “to do” areas. As the project matured we began to extend RxJava to include other JVM based languages, and Joachim Hofer (jmhofer, Möhrendorf, Germany) has made major contributions to type safety and Scala support, with over thirty pull requests.

Joachim works at Imbus AG in Möhrendorf, Germany, where he is the lead developer of an agile product team and a Scala enthusiast working on moving their stack from J2EE to Scala/Play/Akka/Spray/RxJava.

Anyone familiar with Hadoop tools and the big data ecosystem knows about the Pig language. It provides a way to specify a high level dataflow for processing but the Pig scripts can get complex and hard to debug. Netflix built and open sourced a visualization and monitoring tool called Lipstick, and it was adopted by Mark Roddy at a vendor called Mortar (mortardata, of New York, NY) who worked with us to generalize some of the interfaces and integrate it with their own Pig based Hadoop platform. We saved Mortar from having to create their own tool to do this, and Netflix now has an enthusiastic partner to help to improve and extend Lipstick so everyone who uses it benefits.

Jakub Narloch (jmnarloch, of Szczecin, Poland) created a test suite for NetflixOSS Karyon based on JBoss Arquillian. The extension integrates with Karyon's Google Guice dependency injection, allowing developers to write tests that directly access the application's auto-scanned components. The tests are executed in the application container, and Arquillian brings wide support for different containers including Tomcat, Jetty and JBoss AS. Karyon is the base server that underpins NetflixOSS services and acts as the starting point for developing new services. Since Genie is based on Karyon, we were able to leverage this integration to use Arquillian to test Genie, and the changes have been merged into the code that Netflix uses internally.

Jakub Narloch is a software engineer working at Samsung Electronics. He received the JBoss Community Recognition Award this year for his open source contributions. In the past year he has been actively helping to develop the JBoss Arquillian project, authoring four completely new extensions and helping to shape many others. His adventure with the open source world began a couple of years earlier, and he has also contributed code to projects like Spring Framework, Castor XML and NetflixOSS. Last year he graduated with honors from Warsaw University of Technology with an MSc in Computer Science. In the past he took part in two editions of Google Summer of Code, and in his free time he likes to solve software development contests held by TopCoder Inc.

In the real world complex web service APIs are hard to manage, and NetflixOSS includes the Zuul API gateway, which is used to authenticate, process, and route HTTP requests. The next winner is Neil Beveridge (neilbeveridge, of Kent, United Kingdom). He was interested in porting the Zuul container from Tomcat to Netty, which also provides non-blocking outbound requests, and benchmarking the difference. Neil ran the benchmarks with help from Raamnath Mani, Fanta Gizaw and Will Tomlin. They ran into an interesting problem with Netty consuming excess CPU and running slower than the original Tomcat version, and then ran into the contest deadline, but have since continued work to debug and tune the Netty code and come up with higher performance for Netty and some comparisons of cloud and bare metal performance for Zuul. Since Netflix is also looking at moving some of our services from Tomcat to Netty, this is a useful and timely contribution. It's also helpful to other people considering using Zuul to have some published benchmarks to show the throughput on common AWS instance types.

Although the primary storage used by Netflix is based on Cassandra, we also use AWS RDS to create many small MySQL databases for specific purposes. Other AWS customers use RDS much more heavily. Jiaqi Guo (jiaqi, Chicago, Illinois) has built Datamung to automate backup of RDS to S3 and replication of backups across regions for disaster recovery. Datamung is a web-based, Simple Workflow driven application that backs up RDS MySQL databases into S3 objects by launching an EC2 instance and running the mysqldump command. It makes it possible to replicate RDS across regions, VPC, accounts or outside the AWS network.

When we started to build the Denominator library for portable DNS management we contacted Neustar to discuss their UltraDNS product, and made contact with Jeff Damick (jdamick, of South Riding, Virginia). His input as we structured the early versions of Denominator was extremely useful, and provides a great example of the power of developing code in public. We were able to tap into his years of experience with DNS management, and he was able to contribute code, tests and fixes to the Denominator code and fixes to the UltraDNS API itself.

Justin Santa Barbara (justinsb of San Francisco, California) decided to make the Chaos Monkey far more evil, and created fourteen new variants, a “barrel of chaos monkeys”. They interfere with the network, causing routing failure, packet loss, network data corruption and extra network latency. They block access to DNS, S3, DynamoDB and the EC2 control plane. They interfere with storage, by disconnecting EBS volumes, filling up the root disk, and saturating the disks with IO requests. They interfere with the CPU by consuming all the spare cycles, or killing off all the processes written in Python or Java. When run, a random selection is made, and the victims suffer the consequences. This is an excellent but scary workout for our monitoring and repair/replacement automation.

We are pleased to have such a wide variety of winners, from individuals around the world, small and large companies, vendors and end users. Many thanks to all of them for the work they have put into helping grow the NetflixOSS ecosystem, and thanks to everyone else who just uses NetflixOSS or entered the contest but didn’t make the final cut.

The winners, judges and support team got "Cloud Monkey" trophies custom made by Bleep Labs.

Tuesday, November 5, 2013

Scryer: Netflix’s Predictive Auto Scaling Engine

To deliver the best possible experience to Netflix customers around the world, it is critical for us to maintain a robust, scalable, and resilient system. That is why we have built (and open sourced) applications ranging from Hystrix to Chaos Monkey. All of these tools better enable us to prevent or minimize outages, respond effectively to outages, and/or anticipate the kinds of operational gaps that may eventually result in outages. Recently we have built another such tool that has been helping us in this ongoing challenge: Scryer.

Scryer is a new system that allows us to provision the right number of AWS instances needed to handle the traffic of our customers. It differs from Amazon Auto Scaling (AAS), which reacts to real-time metrics and adjusts instance counts accordingly; instead, Scryer predicts what the needs will be prior to the time of need and provisions the instances based on those predictions.

This post is the first in a series that will provide greater details on what Scryer is, how it works, how it differs from Amazon Auto Scaling, and how we employ it at Netflix.

Amazon Auto Scaling and the Netflix Use Case

At its core, AAS is a reactive auto scaling model. That is, AAS dynamically adjusts server counts based on a cluster's current workload (most often the metric of choice is something like load average). When a metric spikes or drops beyond a certain point, AAS policies trigger the addition or removal of instances. For Netflix, this has proven to be quite effective at improving system availability, optimizing costs, and in some cases reducing latencies. Overall, AAS is a big win, and companies with any kind of scale in AWS should be employing this service.
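A reactive policy of this kind reduces to threshold rules. The thresholds and step sizes below are invented for illustration, not actual AAS policy values:

```javascript
// Sketch of a reactive scaling decision: compare the current metric to
// fixed thresholds and scale by a proportional step. Values are invented.
function reactiveScalingDecision(loadAverage, instanceCount) {
  if (loadAverage > 0.8) {
    // Busy: add roughly 10% more instances.
    return { action: 'scale-up', delta: Math.ceil(instanceCount * 0.1) };
  }
  if (loadAverage < 0.3) {
    // Idle: shed roughly 5% of the instances.
    return { action: 'scale-down', delta: Math.ceil(instanceCount * 0.05) };
  }
  return { action: 'none', delta: 0 };
}
```

The weaknesses listed next all follow from this shape: the policy only sees the current metric, never the workload that is about to arrive.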
For Netflix, however, there are a range of use cases that are not fully addressed by AAS. The following are some examples:
  • Rapid spike in demand: Instance startup times range from 10 to 45 minutes. During that time our existing servers are vulnerable, especially if the workload continues to increase.
  • Outages: A sudden drop in incoming traffic from an outage is sometimes followed by a retry storm (after the underlying issue has been resolved). A reactive system is vulnerable in such conditions because a drop in workload usually triggers a down scale event, leaving the system under provisioned to handle the ensuing retry storm.
  • Variable traffic patterns: Different times of the day have different workload characteristics and fleet sizes. Some periods show a rapid increase in workload with a relatively small fleet size (20% of maximum), while other periods show a modest increase with a fleet size 80% of the maximum, making it difficult to handle such variations in optimal ways.
Some of these issues can be mitigated by scaling up aggressively, but this is often undesirable as it may lead to scale-up/scale-down oscillations. Another option is to always run more servers than required, which is clearly not optimal from a cost perspective.

Scryer: Our Predictive Auto Scaling Engine

Scryer was inspired in part by these unaddressed use cases, but its genesis was triggered more by our relatively predictable traffic patterns. The following is an example of five days' worth of traffic:

In this chart, there are very clearly spikes and troughs that sync up with consistent patterns and times of day. There are definitely going to be spikes and valleys that we cannot predict and the traffic does evolve over longer periods of time.  That said, over any given week or month, we have a very good idea of what the traffic will look like as the basic curves are the same. Moreover, these same five days of the week are likely to have the same patterns the week before and the week after (assuming no outages or special events).  
Because of these trends, we believed we could generate a set of algorithms that predict our capacity needs ahead of actual demand, rather than simply relying on the reactive model of AAS. The following chart shows the result of that effort: the output from our prediction algorithms aligns very closely with our actual metrics.
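The core intuition can be sketched very simply: if this Tuesday at 8pm tends to look like recent Tuesdays at 8pm, an average over those prior weeks is already a serviceable predictor. This is only an illustration of the idea; Scryer's actual algorithms are more sophisticated and are not shown here.

```javascript
// Toy predictor: estimate the load for a time slot from the loads observed
// at the same time-of-week over recent weeks.
function predictLoad(priorWeeksAtSameSlot) {
  var sum = priorWeeksAtSameSlot.reduce(function (a, b) { return a + b; }, 0);
  return sum / priorWeeksAtSameSlot.length;
}

// Convert a predicted load into an instance count, adding a safety margin
// (headroom) on top of the prediction. Parameters are illustrative.
function instancesNeeded(predictedLoad, requestsPerInstance, headroom) {
  return Math.ceil((predictedLoad * (1 + headroom)) / requestsPerInstance);
}
```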

Once these predictions were optimized, we attached them to AWS APIs to trigger changes in capacity needs. The following chart shows that our scheduled scaling action plan closely matches our actual traffic, with each step minimized to achieve best performance.

We have been running Scryer in production for a few months. The following is a list of the key benefits that we have seen with it:
  • Improved cluster performance
  • Better service availability
  • Reduced EC2 costs

Predictive-Reactive Auto Scaling - A Hybrid Approach

As effective as Scryer has been in predicting and managing our instance counts, the real strength of Scryer is in how it operates in tandem with AAS’s reactive model.  

If we are able to predict the workload of a cluster in advance, then we can proactively scale the cluster ahead of time to accurately meet workload needs. But there will certainly be cases where Scryer cannot predict our needs, such as an unexpected surge in workload.  In these cases, AAS serves as an excellent safety net for us, adding instances based on those unanticipated, unpredicted needs.
The two auto scaling systems combined provide a much more robust and efficient solution as they complement each other.
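One way to picture the combination (a sketch of the idea, not Scryer's implementation): the predictive plan sets a scheduled floor, the reactive policy proposes its own count, and the fleet takes whichever is higher, capped by a hard maximum.

```javascript
// Hybrid capacity decision: never dip below the predictive floor, let the
// reactive policy add capacity on top, and respect the cluster maximum.
function desiredCapacity(predictedFloor, reactiveDesired, maxInstances) {
  return Math.min(Math.max(predictedFloor, reactiveDesired), maxInstances);
}
```

The floor protects against the outage/retry-storm case above (traffic drops never trigger a down-scale below the prediction), while the reactive term covers surges the predictor never saw.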


Overall, Scryer has been incredibly effective at predicting our metrics and traffic patterns, allowing us to better manage our instance counts and stabilize our systems. We are still rolling it out to the breadth of services within Netflix and will continue to explore its use cases and optimize the algorithms. So far, though, we are excited about the results and are eager to see how it behaves in different environments and conditions.
In the coming weeks, we plan to publish several more posts discussing Scryer in greater detail, digging deeper into its features, design, technology and algorithms. We are exploring the possibility of open sourcing Scryer in the future as well. 

Finally, we work on these kinds of exciting challenges all the time at Netflix.  If you would like to join us in tackling such problems, check out our Jobs site.