Monday, March 31, 2014

Going Reactive - Asynchronous JavaScript at Netflix

We held the first in a series of JavaScript talks at our Los Gatos, Calif. headquarters on March 19th. Our own Jafar Husain shared how we’re using the Reactive Extensions (Rx) library to build responsive UIs across our device experiences.

The talk goes into detail on how Netflix uses Rx to:
  • Declaratively build complex events out of simple events (ex. drag and drop)
  • Coordinate and sequence multiple Ajax requests
  • Reactively update the UI in response to data changes
  • Eliminate memory leaks caused by neglecting to unsubscribe from events
  • Gracefully propagate and handle asynchronous exceptions

There was a great turnout for the event and the audience asked some really good questions. We hope to see you at our next event!

You can watch the video from the talk and videos of our past JavaScript Talks events on our YouTube Channel at:

Slides from the presentation are also available.

Thursday, March 13, 2014

NetflixOSS Season 2, Episode 1

Wondering what this headline means? It means that NetflixOSS continues to grow, both in the number of projects that are now available and in the use by others.

We held another NetflixOSS Meetup in our Los Gatos, Calif., headquarters last night. Four companies came specifically to share what they’re doing with the NetflixOSS Platform:
  • Matt Bookman from Google shared how to leverage NetlixOSS Lipstick on top of Google Compute Engine.   Lipstick combines a graphical depiction of a Pig workflow with information about the job as it executes, giving developers insight that previously required a lot of sifting through logs (or a Pig expert) to piece together.
  • Andrew Spyker from IBM shared how they’re leveraging many of NetflixOSS components on top of the SoftLayer infrastructure to build real-world applications - beyond the AcmeAir app that won last year’s Cloud Prize.
  • Peter Sankauskas from Answers4AWS talked about the motivation behind his work on NetflixOSS components setup automation, and his work towards 0-click setup for many of the components.
  • Daniel Chia from Coursera shared how they utilize NetflixOSS Priam and Aegithus to work with Cassandra.

Since our previous NetflixOSS Meetup we have open sourced several new projects in many areas: Big Data tools and solutions, Scalable Data Pipelines, language agnostic storage solutions and more.   At the yesterday’s Meetup Netflix engineers talking about recent projects and gave previews of the projects that may be soon released.

  • Zeno - in memory data serialization and distribution platform
  • Suro - a distributed data pipeline which enables services to move, aggregate, route and store data
  • STAASH - a language-agnostic as well as storage-agnostic web interface for storing data into persistent storage systems
  • A preview of Dynomite - a thin Dynamo-based replication for cached data
  • Aegithus - a bulk data pipeline out of Cassandra
  • PigPen - Map-Reduce for Clojure
  • S3mper - a library that provides an additional layer of consistency checking on top of Amazon's S3 index through use of a consistent, secondary index.
  • A preview of Inviso - a performance focused Big Data tool

All the slides are available on Slideshare:

In preparation for the event, we spruced up our Github OSS site - all components now feature new cool icons:

The event itself was a full house - people at the demo stations were busy all evening answering many questions about the components they wrote and opened.

It was great to see how many of our Open Source components are being used outside of Netflix.  We hear of many more companies that are using and contributing to NetflixOSS components.  If you’re one of them, and would like to have your logo featured on our Github site “Powered by NetflixOSS” page - contact us at

If you’re interested to hear more about upcoming NetflixOSS projects and events, follow @NetflixOSS on Twitter, and join our Meetup group.  The slides for this event are available on Slideshare, videos will be uploaded shortly.

Monday, March 3, 2014

The Netflix Signup Flow - Our Journey to a Responsive Design

by Joel Sass

In the Spring of 2013, the User Experience team was gearing up for the impending Netherlands country launch scheduled for September. To reduce barriers to adoption, we wanted to launch with a smooth signup experience on smartphones and tablets. Additionally, we were planning to implement the iDEAL online payment method, which is commonly used in the Netherlands but new to us both technically and from a user experience perspective.

At that time, we had two very different technology stacks that served our signup experience: one for desktop browsers and one for mobile and tablet browsers. There were substantial differences in the way these two websites worked, and they shared no common platform. Each website had unique capabilities, but the desktop site provided a much larger superset of features compared to the mobile optimized site.

To move forward with enabling iDEAL and other new payment methods across multiple platforms, we quickly came to the conclusion that the best way forward was a unified approach to supporting multiple platforms using a single UI. This ultimately started us on our path towards responsive web design (RWD). Responsive design is a technique for delivering a consistent set of functionality across a wide range of screen sizes, from a single website. For our effort, we focused on the following goals:

  • Enable access to all features and capabilities, regardless of device
  • Deliver a consistent user experience that is optimized for device capabilities, including screen size and input method

Cross-functional alignment and prototyping

In order to successfully tackle a responsive design project, we first needed to answer the following question for our team: what is responsive design? In initial brainstorming meetings, we aligned around a common definition that emphasizes the use of CSS and JS to adapt a common user experience to varying screen sizes and input methods.

At Netflix, we like to move quickly and not let unnecessary process slow down our ability to innovate and move fast. Rapid design, rapid development. To kickstart this project, we assembled a core group of designers and user interface engineers, and held a weeklong workshop. The end goal was to produce a working prototype of a responsively designed signup flow.

This workshop approach allowed us to streamline the entire design and development process. Developers and designers working side by side to immediately tackle issues as they came up. In effect, we were pair programming. This allowed us to minimize the need for comps and wireframes, and develop straight in the browser. It provided the freedom to easily experiment with different design and engineering techniques, and identify common patterns that could be used across the entire signup flow. By the end of that week we had created a prototype of a fully functional, looks-good-on-whatever-device-you’re-using, Netflix signup experience.

A/B testing and rollout

With a functional responsive flow now built, we came to the final stretch. Being highly data-driven at Netflix, we wanted to measure that our changes were having a positive impact on how customers interact with the signup flow. So we cleaned up our prototype, turned it into a production quality experience and tested it globally against the current split stacks. We saw no impact to desktop signups, but we did a see an increase in conversion rates on phones and tablets due to the additional payment types we enabled for those devices. The results of our A/B test gave us the confidence to roll out this new signup flow as the default experience in all markets, and retire our separate phone/tablet stack.

App integration

However, we didn’t stop there. We had also supported signup from within our Android app prior to this project. Following our global rollout for browser-based signups, we quickly integrated our newly responsive signup experience into our Android app as an embedded web view. This enabled us to get much more leverage out of our responsive design investment. More about this unique approach in a future post.

Many devices, one platform

Whether on a 30” screen or a 4” one, our customers are now provided with an experience that works well and looks great across a wide range of devices. Development of the signup experience has been streamlined, increasing developer productivity. We now have a single platform that serves as foundation for all signup A/B testing and innovation and our customers are afforded the same options regardless of device.

Our journey towards responsive design has not ended, however. As device platforms evolve, and as user expectations change, our designers and engineers are constantly working towards enhancing cross-platform experiences for our customers.

What’s next?

In upcoming posts, we will further explore how we built our responsive signup flow. Some of the topics we will cover are: a deeper dive into the client and server-side techniques we used to integrate our browser experience into our Android app, a more in-depth look at the decisions we made in regards to mobile-first vs desktop-first approaches, and a review of the challenges in dealing with responsive images.

Can’t wait until the next post? Interested in learning more about responsive design? Come see Chris Saint-Amant, Netflix UI Engineering Manager, discuss the next generation of responsive design at SXSW Interactive in Austin, TX on March 8th.

Are you interested in working on cross-platform design and engineering challenges? We’re hiring front-end innovators to helps us realize our vision. Check out our jobs.

The Netflix Dynamic Scripting Platform

The Engine that powers the Netflix API
Over the past couple of years, we have optimized the Netflix API with a view towards improving performance and increasing agility. In doing so, the API has evolved from a provider of RESTful web services to a platform that distributes development and deployment of new functionality across various teams within Netflix.

At the core of the redesign is a Dynamic Scripting Platform which provides us the ability to inject code into a running Java application at any time. This means we can alter the behavior of the application without a full scale deployment. As you can imagine, this powerful capability is useful in many scenarios. The API Server is one use case, and in this post, we describe how we use this platform to support a distributed development model at Netflix.

As a reminder, devices make HTTP requests to the API to access content from a distributed network of services. Device Engineering teams use the Dynamic Scripting Platform to deploy new or updated functionality to the API Server based on their own release cycles. They do so by uploading adapter code in the form of Groovy scripts to a running server and defining custom endpoints to front those scripts. The scripts are responsible for calling into the Java API layer and constructing responses that the client application expects. Client applications can access the new functionality within minutes of the upload by requesting the new or updated endpoints. The platform can support any JVM-compatible language, although at the moment we primarily use Groovy.

Architecting for scalability is a key goal for any system we build at Netflix and the Scripting Platform is no exception. In addition, as our platform has gained adoption, we are developing supporting infrastructure that spans the entire application development lifecycle for our users. In some ways, the API is now like an internal PaaS system that needs to provide a highly performant, scalable runtime; tools to address development tasks like revision control, testing and deployment; and operational insight into system health. The sections that follow explore these areas further.


Here is a view into the API Server under the covers.

[1] Endpoint Controller routes endpoint requests to the corresponding groovy scripts. It consults a registry to identify the mapping between an endpoint URI and its backing scripts.

[2] Script Manager handles requests for script management operations. The management API is exposed as a RESTful interface.

[3] Script source, compiled bytecode and related metadata are stored in a Cassandra cluster, replicated across all AWS regions.

[4] Script Cache is an in memory cache that holds compiled bytecode fetched from Cassandra. This eliminates the Cassandra lookup during endpoint request processing. Scripts are compiled by the first few servers running a new API build and the compiled bytecode is persisted in Cassandra. At startup, a server looks for persisted bytecode for a script before attempting to compile it in real time. Because deploying a set of canary instances is a standard step in our delivery workflow, the canary servers are the ones to incur the one-time penalty for script compilation. The cache is refreshed periodically to pick up new scripts.

[5] Admin Console and Deployment Tools are built on top of the script management API.

Script Development and Operations

Our experience in building a Delivery Pipeline for the API Server has influenced our thinking around the workflows for script management. Now that a part of the client code resides on the server in the form of scripts, we want to simplify the ways in which Device teams integrate script management activities into their workflows. Because technologies and release processes vary across teams, our goal is to provide a set of offerings from which they can pick and choose the tools that best suit their requirements.

The diagram below illustrates a sample script workflow and calls out the tools that can be used to support it. It is worth noting that such a workflow would represent just a part of a more comprehensive build, test and deploy process used for a client application.


Distributed Development

To get script developers started, we provide them with canned recipes to help with IDE setup and dependency management for the Java API and related libraries.  In order to facilitate testing of the script code, we have built a Script Test Framework based on JUnit and Mockito. The Test Framework looks for test methods within a script that have standard JUnit annotations, executes them and generates a standard JUnit result report. Tests can also be run against a live server to validate functionality in the scripts.

Additionally, we have built a REPL tool to facilitate ad-hoc testing of small scripts or snippets of groovy code that can be shared as samples, for debugging etc.

Distributed Deployment

As mentioned earlier, the release cycles of the Device teams are decoupled from those of the API Team. Device teams have the ability to dynamically create new endpoints, update the scripts backing existing endpoints or delete endpoints as part of their releases. We provide command line tools built on top of our Endpoint Management API that can be used for all deployment related operations. Device teams use these tools to integrate script deployment activities with their automated build processes and manage the lifecycle of their endpoints. The tools also integrate with our internal auditing system to track production deployments.

Admin Operations & Insight

Just as the operation of the system is distributed across several teams, so is the responsibility of monitoring and maintaining system health. Our role as the platform provider is to equip our users with the appropriate level of insight and tooling so they can assume full ownership of their endpoints. The Admin Console provides users full visibility into real time health metrics, deployment activity, server resource consumption and traffic levels for their endpoints.

Engineers on the API team can get an aggregate view of the same data, as well as other metrics like script compilation activity that are crucial to track from a server health perspective. Relevant metrics are also integrated into our real time monitoring and alerting systems.

The screenshot to the left is from a top-level dashboard view that tracks script deployment activity.

Experiences and Lessons Learnt

Operating this platform with an increasing number of teams has taught us a lot and we have had several great wins!
  • Client application developers are able to tailor the number of network calls and the size of the payload to their applications. This results in more efficient client-side development and overall, an improved user experience for Netflix customers.
  • Distributed, decoupled API development has allowed us to increase our rate of innovation.
  • Response and recovery times for certain classes of bugs have gone from days or hours down to minutes. This is even more powerful in the case of devices that cannot easily be updated or require us to go through an update process with external partners.
  • Script deployments are inherently less risky than server deployments because the impact of the former is isolated to a certain class or subset of devices. This opens the door for increased nimbleness.
  • We are building an internal developer community around the API. This provides us an opportunity to promote collaboration and sharing of  resources, best practices and code across the various Device teams.

As expected, we have had our share of learnings as well.
  • The flexibility of our platform permits users to utilize the system in ways that might be different from those envisioned at design time. The strategy that several teams employed to manage their endpoints put undue stress on the API server in terms of increased system resources, and in one instance, caused a service outage. We had to react quickly with measures in the form of usage quotas and additional self-protect switches while we identified design changes to allow us to handle such use cases.
  • When we chose Cassandra as the repository for script code for the server, our goal was to have teams use their SCM tool of choice for managing script source code. Over time, we are finding ourselves building SCM-like features into our Management API and tools, as we work to address the needs of various teams. It has become clear to us that we need to offer a unified set of tools that cover deployment and SCM functionality.
  • The distributed development model combined with the dynamic nature of the scripts makes it challenging to understand system behavior and debug issues. RxJava introduces another layer of complexity in this regard because of its asynchronous nature. All of this highlights the need for detailed insights into scripts’ usage of the API.

We are actively working on solutions for the above and will follow up with another post when we are ready to share details.


The evolution of the Netflix API from a web services provider to a dynamic platform has allowed us to increase our rate of innovation while providing a high degree of flexibility and control to client application developers. We are investing in infrastructure to simplify the adoption and operation of this platform. We are also continuing to evolve the platform itself as we find new use cases for it. If this type of work excites you, reach out to us - we are always looking to add talented engineers to our team!