Tuesday, April 1, 2014

Women in Technology meetup at Netflix!

Last week, Netflix welcomed about 150 women to its campus for a set of lightning talks followed by demos and networking. Organized in coordination with CloudNOW, the event was high on the fun and high on the tech.


Lightning talks included:

Devika Chawla, Director of Engineering, Netflix. Devika’s talk was titled “In Pursuit of Rapid Messaging Innovation”. The Messaging Platform has probably sent you a message via email, push or in-app messaging channels. We learned how her team is building a platform for rapid innovation by de-coupling from senders and moving to a system driven by dynamic metadata.

Alolita Sharma from Wikipedia talked about “Rich Media Content Delivery and Open Source”. The challenges that her team faces regarding the sheer number of languages that they support are interesting -- and potentially foreshadow what many of us will face in the future.

Tracy Wright traveled up from the Netflix LA office to talk about how her team managed technology migration from a data center based high touch workflow to a the cloud based exception based approach in “Cloud Migration for Large Scale Content Operations”. The success of this transition depended on a team mindset migration as well as the toolset migration to the cloud which underscored the value of effective change management.

Seema Jethani, from Enstratius talked about the challenges that we all face in “Approach to Tool and Technology Choices”.

Wondering how (and why) Netflix has implemented a “Flexible Billing Integration Architecture”? Nirmal Varadarajan described this, and tied it back to the earlier talk given by Devika. The presentation focused on ways to build components that can be reused for large scale incremental migration and a flexible events pipeline to facilitate an easy way to share data between closely aligned but loosely coupled systems.

Evelyn De Souza joined us from Cisco Systems. She presented from her work on “Cloud Data Protection Certification”. You can learn more about Evelyn and her work at the Cloud Security Alliance.

Continuing with the theme of “Agility at Scale”, Sangeeta Narayanan explained how the Netflix Edge Engineering team tackles the challenge of moving fast at scale. She described the importance of building agility into system architecture as well as investing in Continuous Delivery. She showed us some screen shots of their Continuous Delivery dashboard, which was presented at the Edge demo station as well.

Eva Tse wrapped up the lightning talks with her discussion of “Running a Big Data Platform in the Cloud”. She showcased how they leverage the Hadoop ecosystem and AWS S3 service to architect the Netflix’s cloud native big data platform. Eva’s team is very active in open source, and many of the tools/services that built by the team are available on netflix.github.com.

There were several demo stations set up, and attendees lingered with food and drink, enjoying the demos and networking. The CodeChix featured a Pytheas demo, based on Netflix OSS!

If you missed the event, you can watch the recording here.

It was great to see so many people at this meetup!

Monday, March 31, 2014

Going Reactive - Asynchronous JavaScript at Netflix

We held the first in a series of JavaScript talks at our Los Gatos, Calif. headquarters on March 19th. Our own Jafar Husain shared how we’re using the Reactive Extensions (Rx) library to build responsive UIs across our device experiences.

The talk goes into detail on how Netflix uses Rx to:
  • Declaratively build complex events out of simple events (ex. drag and drop)
  • Coordinate and sequence multiple Ajax requests
  • Reactively update the UI in response to data changes
  • Eliminate memory leaks caused by neglecting to unsubscribe from events
  • Gracefully propagate and handle asynchronous exceptions

There was a great turnout for the event and the audience asked some really good questions. We hope to see you at our next event!

You can watch the video of the presentation at: https://www.youtube.com/watch?v=XRYN2xt11Ek

Thursday, March 13, 2014

NetflixOSS Season 2, Episode 1

Wondering what this headline means? It means that NetflixOSS continues to grow, both in the number of projects that are now available and in the use by others.

We held another NetflixOSS Meetup in our Los Gatos, Calif., headquarters last night. Four companies came specifically to share what they’re doing with the NetflixOSS Platform:
  • Matt Bookman from Google shared how to leverage NetlixOSS Lipstick on top of Google Compute Engine.   Lipstick combines a graphical depiction of a Pig workflow with information about the job as it executes, giving developers insight that previously required a lot of sifting through logs (or a Pig expert) to piece together.
  • Andrew Spyker from IBM shared how they’re leveraging many of NetflixOSS components on top of the SoftLayer infrastructure to build real-world applications - beyond the AcmeAir app that won last year’s Cloud Prize.
  • Peter Sankauskas from Answers4AWS talked about the motivation behind his work on NetflixOSS components setup automation, and his work towards 0-click setup for many of the components.
  • Daniel Chia from Coursera shared how they utilize NetflixOSS Priam and Aegithus to work with Cassandra.

Since our previous NetflixOSS Meetup we have open sourced several new projects in many areas: Big Data tools and solutions, Scalable Data Pipelines, language agnostic storage solutions and more.   At the yesterday’s Meetup Netflix engineers talking about recent projects and gave previews of the projects that may be soon released.

  • Zeno - in memory data serialization and distribution platform
  • Suro - a distributed data pipeline which enables services to move, aggregate, route and store data
  • STAASH - a language-agnostic as well as storage-agnostic web interface for storing data into persistent storage systems
  • A preview of Dynomite - a thin Dynamo-based replication for cached data
  • Aegithus - a bulk data pipeline out of Cassandra
  • PigPen - Map-Reduce for Clojure
  • S3mper - a library that provides an additional layer of consistency checking on top of Amazon's S3 index through use of a consistent, secondary index.
  • A preview of Inviso - a performance focused Big Data tool

All the slides are available on Slideshare:

In preparation for the event, we spruced up our Github OSS site - all components now feature new cool icons:

The event itself was a full house - people at the demo stations were busy all evening answering many questions about the components they wrote and opened.

It was great to see how many of our Open Source components are being used outside of Netflix.  We hear of many more companies that are using and contributing to NetflixOSS components.  If you’re one of them, and would like to have your logo featured on our Github site “Powered by NetflixOSS” page - contact us at netflixoss@netflix.com

If you’re interested to hear more about upcoming NetflixOSS projects and events, follow @NetflixOSS on Twitter, and join our Meetup group.  The slides for this event are available on Slideshare, videos will be uploaded shortly.

Monday, March 3, 2014

The Netflix Signup Flow - Our Journey to a Responsive Design

by Joel Sass

In the Spring of 2013, the User Experience team was gearing up for the impending Netherlands country launch scheduled for September. To reduce barriers to adoption, we wanted to launch with a smooth signup experience on smartphones and tablets. Additionally, we were planning to implement the iDEAL online payment method, which is commonly used in the Netherlands but new to us both technically and from a user experience perspective.

At that time, we had two very different technology stacks that served our signup experience: one for desktop browsers and one for mobile and tablet browsers. There were substantial differences in the way these two websites worked, and they shared no common platform. Each website had unique capabilities, but the desktop site provided a much larger superset of features compared to the mobile optimized site.

To move forward with enabling iDEAL and other new payment methods across multiple platforms, we quickly came to the conclusion that the best way forward was a unified approach to supporting multiple platforms using a single UI. This ultimately started us on our path towards responsive web design (RWD). Responsive design is a technique for delivering a consistent set of functionality across a wide range of screen sizes, from a single website. For our effort, we focused on the following goals:

  • Enable access to all features and capabilities, regardless of device
  • Deliver a consistent user experience that is optimized for device capabilities, including screen size and input method

Cross-functional alignment and prototyping

In order to successfully tackle a responsive design project, we first needed to answer the following question for our team: what is responsive design? In initial brainstorming meetings, we aligned around a common definition that emphasizes the use of CSS and JS to adapt a common user experience to varying screen sizes and input methods.

At Netflix, we like to move quickly and not let unnecessary process slow down our ability to innovate and move fast. Rapid design, rapid development. To kickstart this project, we assembled a core group of designers and user interface engineers, and held a weeklong workshop. The end goal was to produce a working prototype of a responsively designed signup flow.

This workshop approach allowed us to streamline the entire design and development process. Developers and designers working side by side to immediately tackle issues as they came up. In effect, we were pair programming. This allowed us to minimize the need for comps and wireframes, and develop straight in the browser. It provided the freedom to easily experiment with different design and engineering techniques, and identify common patterns that could be used across the entire signup flow. By the end of that week we had created a prototype of a fully functional, looks-good-on-whatever-device-you’re-using, Netflix signup experience.

A/B testing and rollout

With a functional responsive flow now built, we came to the final stretch. Being highly data-driven at Netflix, we wanted to measure that our changes were having a positive impact on how customers interact with the signup flow. So we cleaned up our prototype, turned it into a production quality experience and tested it globally against the current split stacks. We saw no impact to desktop signups, but we did a see an increase in conversion rates on phones and tablets due to the additional payment types we enabled for those devices. The results of our A/B test gave us the confidence to roll out this new signup flow as the default experience in all markets, and retire our separate phone/tablet stack.

App integration

However, we didn’t stop there. We had also supported signup from within our Android app prior to this project. Following our global rollout for browser-based signups, we quickly integrated our newly responsive signup experience into our Android app as an embedded web view. This enabled us to get much more leverage out of our responsive design investment. More about this unique approach in a future post.

Many devices, one platform

Whether on a 30” screen or a 4” one, our customers are now provided with an experience that works well and looks great across a wide range of devices. Development of the signup experience has been streamlined, increasing developer productivity. We now have a single platform that serves as foundation for all signup A/B testing and innovation and our customers are afforded the same options regardless of device.

Our journey towards responsive design has not ended, however. As device platforms evolve, and as user expectations change, our designers and engineers are constantly working towards enhancing cross-platform experiences for our customers.

What’s next?

In upcoming posts, we will further explore how we built our responsive signup flow. Some of the topics we will cover are: a deeper dive into the client and server-side techniques we used to integrate our browser experience into our Android app, a more in-depth look at the decisions we made in regards to mobile-first vs desktop-first approaches, and a review of the challenges in dealing with responsive images.

Can’t wait until the next post? Interested in learning more about responsive design? Come see Chris Saint-Amant, Netflix UI Engineering Manager, discuss the next generation of responsive design at SXSW Interactive in Austin, TX on March 8th.

Are you interested in working on cross-platform design and engineering challenges? We’re hiring front-end innovators to helps us realize our vision. Check out our jobs.

The Netflix Dynamic Scripting Platform

The Engine that powers the Netflix API
Over the past couple of years, we have optimized the Netflix API with a view towards improving performance and increasing agility. In doing so, the API has evolved from a provider of RESTful web services to a platform that distributes development and deployment of new functionality across various teams within Netflix.

At the core of the redesign is a Dynamic Scripting Platform which provides us the ability to inject code into a running Java application at any time. This means we can alter the behavior of the application without a full scale deployment. As you can imagine, this powerful capability is useful in many scenarios. The API Server is one use case, and in this post, we describe how we use this platform to support a distributed development model at Netflix.

As a reminder, devices make HTTP requests to the API to access content from a distributed network of services. Device Engineering teams use the Dynamic Scripting Platform to deploy new or updated functionality to the API Server based on their own release cycles. They do so by uploading adapter code in the form of Groovy scripts to a running server and defining custom endpoints to front those scripts. The scripts are responsible for calling into the Java API layer and constructing responses that the client application expects. Client applications can access the new functionality within minutes of the upload by requesting the new or updated endpoints. The platform can support any JVM-compatible language, although at the moment we primarily use Groovy.

Architecting for scalability is a key goal for any system we build at Netflix and the Scripting Platform is no exception. In addition, as our platform has gained adoption, we are developing supporting infrastructure that spans the entire application development lifecycle for our users. In some ways, the API is now like an internal PaaS system that needs to provide a highly performant, scalable runtime; tools to address development tasks like revision control, testing and deployment; and operational insight into system health. The sections that follow explore these areas further.


Here is a view into the API Server under the covers.

[1] Endpoint Controller routes endpoint requests to the corresponding groovy scripts. It consults a registry to identify the mapping between an endpoint URI and its backing scripts.

[2] Script Manager handles requests for script management operations. The management API is exposed as a RESTful interface.

[3] Script source, compiled bytecode and related metadata are stored in a Cassandra cluster, replicated across all AWS regions.

[4] Script Cache is an in memory cache that holds compiled bytecode fetched from Cassandra. This eliminates the Cassandra lookup during endpoint request processing. Scripts are compiled by the first few servers running a new API build and the compiled bytecode is persisted in Cassandra. At startup, a server looks for persisted bytecode for a script before attempting to compile it in real time. Because deploying a set of canary instances is a standard step in our delivery workflow, the canary servers are the ones to incur the one-time penalty for script compilation. The cache is refreshed periodically to pick up new scripts.

[5] Admin Console and Deployment Tools are built on top of the script management API.

Script Development and Operations

Our experience in building a Delivery Pipeline for the API Server has influenced our thinking around the workflows for script management. Now that a part of the client code resides on the server in the form of scripts, we want to simplify the ways in which Device teams integrate script management activities into their workflows. Because technologies and release processes vary across teams, our goal is to provide a set of offerings from which they can pick and choose the tools that best suit their requirements.

The diagram below illustrates a sample script workflow and calls out the tools that can be used to support it. It is worth noting that such a workflow would represent just a part of a more comprehensive build, test and deploy process used for a client application.


Distributed Development

To get script developers started, we provide them with canned recipes to help with IDE setup and dependency management for the Java API and related libraries.  In order to facilitate testing of the script code, we have built a Script Test Framework based on JUnit and Mockito. The Test Framework looks for test methods within a script that have standard JUnit annotations, executes them and generates a standard JUnit result report. Tests can also be run against a live server to validate functionality in the scripts.

Additionally, we have built a REPL tool to facilitate ad-hoc testing of small scripts or snippets of groovy code that can be shared as samples, for debugging etc.

Distributed Deployment

As mentioned earlier, the release cycles of the Device teams are decoupled from those of the API Team. Device teams have the ability to dynamically create new endpoints, update the scripts backing existing endpoints or delete endpoints as part of their releases. We provide command line tools built on top of our Endpoint Management API that can be used for all deployment related operations. Device teams use these tools to integrate script deployment activities with their automated build processes and manage the lifecycle of their endpoints. The tools also integrate with our internal auditing system to track production deployments.

Admin Operations & Insight

Just as the operation of the system is distributed across several teams, so is the responsibility of monitoring and maintaining system health. Our role as the platform provider is to equip our users with the appropriate level of insight and tooling so they can assume full ownership of their endpoints. The Admin Console provides users full visibility into real time health metrics, deployment activity, server resource consumption and traffic levels for their endpoints.

Engineers on the API team can get an aggregate view of the same data, as well as other metrics like script compilation activity that are crucial to track from a server health perspective. Relevant metrics are also integrated into our real time monitoring and alerting systems.

The screenshot to the left is from a top-level dashboard view that tracks script deployment activity.

Experiences and Lessons Learnt

Operating this platform with an increasing number of teams has taught us a lot and we have had several great wins!
  • Client application developers are able to tailor the number of network calls and the size of the payload to their applications. This results in more efficient client-side development and overall, an improved user experience for Netflix customers.
  • Distributed, decoupled API development has allowed us to increase our rate of innovation.
  • Response and recovery times for certain classes of bugs have gone from days or hours down to minutes. This is even more powerful in the case of devices that cannot easily be updated or require us to go through an update process with external partners.
  • Script deployments are inherently less risky than server deployments because the impact of the former is isolated to a certain class or subset of devices. This opens the door for increased nimbleness.
  • We are building an internal developer community around the API. This provides us an opportunity to promote collaboration and sharing of  resources, best practices and code across the various Device teams.

As expected, we have had our share of learnings as well.
  • The flexibility of our platform permits users to utilize the system in ways that might be different from those envisioned at design time. The strategy that several teams employed to manage their endpoints put undue stress on the API server in terms of increased system resources, and in one instance, caused a service outage. We had to react quickly with measures in the form of usage quotas and additional self-protect switches while we identified design changes to allow us to handle such use cases.
  • When we chose Cassandra as the repository for script code for the server, our goal was to have teams use their SCM tool of choice for managing script source code. Over time, we are finding ourselves building SCM-like features into our Management API and tools, as we work to address the needs of various teams. It has become clear to us that we need to offer a unified set of tools that cover deployment and SCM functionality.
  • The distributed development model combined with the dynamic nature of the scripts makes it challenging to understand system behavior and debug issues. RxJava introduces another layer of complexity in this regard because of its asynchronous nature. All of this highlights the need for detailed insights into scripts’ usage of the API.

We are actively working on solutions for the above and will follow up with another post when we are ready to share details.


The evolution of the Netflix API from a web services provider to a dynamic platform has allowed us to increase our rate of innovation while providing a high degree of flexibility and control to client application developers. We are investing in infrastructure to simplify the adoption and operation of this platform. We are also continuing to evolve the platform itself as we find new use cases for it. If this type of work excites you, reach out to us - we are always looking to add talented engineers to our team!

Thursday, February 27, 2014

Netflix Hack Day

by Daniel Jacobson, Ruslan Meshenberg, Matt McCarthy and Leslie Posada

At Netflix, we pride ourselves in creating a culture of innovation and experimentation. We are constantly running A/B tests on virtually every enhancement to the Netflix experience. There are other ways in which we instill this culture within Netflix, including internal events such as Netflix Hack Day, which was held last week. For Hack Day, our primary goal is to provide a fun, experimental, and creative outlet for our engineers. If something interesting and potentially useful comes from it, that is fine, but the real motivation is fun. With that spirit in mind, most teams started hacking on Thursday morning, hacked through the night, and they wrapped up by Friday morning to present a demo to their peers.

It is not unusual for us to see a lot of really good ideas come from Hack Day, but last week we saw some really spectacular work. The hackers generated a wide range of ideas on just about anything, including ideas to improve developer productivity, ways to help troubleshooting, funky data visualizations, and of course a diversity of product feature ideas. These ideas get categorized, then to determine the winner for each category the audience of Netflix employees rated each hack, in true Netflix fashion, on a 5-star scale.

The following are some examples of our favorite hacks and/or videos to give you a taste. Most of these hacks and videos were conceived of and produced in about 24 hours. We should also note that, while we think these hacks are very cool and fun, they may never become part of the Netflix product, internal infrastructure, or be used beyond Hack Day. We are surfacing them here publicly to share the spirit of the Netflix Hack Day.

Netflix Beam
by Sassda Chap

by Jia Pu, Aaron Tull, George Campbell

Custom Playlists
by Marco Vinicius Caldiera, Ian Kirk, Adam Excelsior Butterworth, Glenn Cho

Sleep Tracking with Fitbit
by Sam Horner, Rachel Nordman, Arlene Aficial, Sam Park, Bogdan Ciuca

Pin Protected Profiles
by Mike Kim, Dianne Marsh, Nick Ryabov

Here are some images from the event:

Thanks to all of the hackers and we look forward to the next one. If you are interested in being a part of our next Hack Day, let us know!

Also, we will be hosting our next Open Source meetup at Netflix HQ in Los Gatos on March 12th at 6:30pm. If you are interested, please sign up while there are still slots.

Monday, February 10, 2014

Distributed Neural Networks with GPUs in the AWS Cloud

by Alex Chen, Justin Basilico, and Xavier Amatriain

As we have described previously on this blog, at Netflix we are constantly innovating by looking for better ways to find the best movies and TV shows for our members. When a new algorithmic technique such as Deep Learning shows promising results in other domains (e.g. Image Recognition, Neuro-imaging, Language Models, and Speech Recognition), it should not come as a surprise that we would try to figure out how to apply such techniques to improve our product. In this post, we will focus on what we have learned while building infrastructure for experimenting with these approaches at Netflix. We hope that this will be useful for others working on similar algorithms, especially if they are also leveraging the Amazon Web Services (AWS) infrastructure. However, we will not detail how we are using variants of Artificial Neural Networks for personalization, since it is an active area of research.

Many researchers have pointed out that most of the algorithmic techniques used in the trendy Deep Learning approaches have been known and available for some time. Much of the more recent innovation in this area has been around making these techniques feasible for real-world applications. This involves designing and implementing architectures that can execute these techniques using a reasonable amount of resources in a reasonable amount of time. The first successful instance of large-scale Deep Learning made use of 16000 CPU cores in 1000 machines in order to train an Artificial Neural Network in a matter of days. While that was a remarkable milestone, the required infrastructure, cost, and computation time are still not practical.

Andrew Ng and his team addressed this issue in follow up work . Their implementation used GPUs as a powerful yet cheap alternative to large clusters of CPUs. Using this architecture, they were able to train a model 6.5 times larger in a few days using only 3 machines. In another study, Schwenk et al. showed that training these models on GPUs can improve performance dramatically, even when comparing to high-end multicore CPUs.

Given our well-known approach and leadership in cloud computing, we sought out to implement a large-scale Neural Network training system that leveraged both the advantages of GPUs and the AWS cloud. We wanted to use a reasonable number of machines to implement a powerful machine learning solution using a Neural Network approach. We also wanted to avoid needing special machines in a dedicated data center and instead leverage the full, on-demand computing power we can obtain from AWS.

In architecting our approach for leveraging computing power in the cloud, we sought to strike a balance that would make it fast and easy to train Neural Networks by looking at the entire training process. For computing resources, we have the capacity to use many GPU cores, CPU cores, and AWS instances, which we would like to use efficiently. For an application such as this, we typically need to train not one, but multiple models either from different datasets or configurations (e.g. different international regions). For each configuration we need to perform hyperparameter tuning, where each combination of parameters requires training a separate Neural Network. In our solution, we take the approach of using GPU-based parallelism for training and using distributed computation for handling hyperparameter tuning and different configurations.

Distributing Machine Learning: At what level?

Some of you might be thinking that the scenario described above is not what people think of as a distributed Machine Learning in the traditional sense.  For instance, in the work by Ng et al. cited above, they distribute the learning algorithm itself between different machines. While that approach might make sense in some cases, we have found that to be not always the norm, especially when a dataset can be stored on a single instance. To understand why, we first need to explain the different levels at which a model training process can be distributed.

In a standard scenario, we will have a particular model with multiple instances. Those instances might correspond to different partitions in your problem space. A typical situation is to have different models trained for different countries or regions since the feature distribution and even the item space might be very different from one region to the other. This represents the first initial level at which we can decide to distribute our learning process. We could have, for example, a separate machine train each of the 41 countries where Netflix operates, since each region can be trained entirely independently.

However, as explained above, training a single instance actually implies training and testing several models, each corresponding to a different combinations of hyperparameters. This represents the second level at which the process can be distributed. This level is particularly interesting if there are many parameters to optimize and you have a good strategy to optimize them, like Bayesian optimization with Gaussian Processes. The only communication between runs are hyperparameter settings and test evaluation metrics.

Finally, the algorithm training itself can be distributed. While this is also interesting, it comes at a cost. For example, training ANN is a comparatively communication-intensive process. Given that you are likely to have thousands of cores available in a single GPU instance, it is very convenient if you can squeeze the most out of that GPU and avoid getting into costly across-machine communication scenarios. This is because communication within a machine using memory is usually much faster than communication over a network.

The following pseudo code below illustrates the three levels at which an algorithm training process like us can be distributed.

for each region -> level 1 distribution
for each hyperparameter combination -> level 2 distribution
train model -> level 3 distribution
end for
end for

In this post we will explain how we addressed level 1 and 2 distribution in our use case. Note that one of the reasons we did not need to address level 3 distribution is because our model has millions of parameters (compared to the billions in the original paper by Ng).

Optimizing the CUDA Kernel

Before we addressed distribution problem though, we had to make sure the GPU-based parallel training was efficient. We approached this by first getting a proof-of-concept to work on our own development machines and then addressing the issue of how to scale and use the cloud as a second stage. We started by using a Lenovo S20 workstation with a Nvidia Quadro 600 GPU. This GPU has 98 cores and provides a useful baseline for our experiments; especially considering that we planned on using a more powerful machine and GPU in the AWS cloud. Our first attempt to train our Neural Network model took 7 hours.

We then ran the same code to train the model in on a EC2’s cg1.4xlarge instance, which has a more powerful Tesla M2050 with 448 cores. However, the training time jumped from 7 to over 20 hours.  Profiling showed that most of the time was spent on the function calls to Nvidia Performance Primitive library, e.g. nppsMulC_32f_I, nppsExp_32f_I.  Calling the npps functions repeatedly took 10x more system time on the cg1 instance than in the Lenovo S20.

While we tried to uncover the root cause, we worked our way around the issue by reimplementing the npps functions using the customized cuda kernel, e.g. replace nppsMulC_32f_I function with:

void KernelMulC(float c, float *data, int n)
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n) {
          data[i] = c * data[i];

Replacing all npps functions in this way for the Neural Network code reduced the total training time on the cg1 instance from over 20 hours to just 47 minutes when training on 4 million samples. Training 1 million samples took 96 seconds of GPU time. Using the same approach on the Lenovo S20 the total training time also reduced from 7 hours to 2 hours.  This makes us believe that the implementation of these functions is suboptimal regardless of the card specifics.

PCI configuration space and virtualized environments

While we were implementing this “hack”, we also worked with the AWS team to find a principled solution that would not require a kernel patch. In doing so, we found that the performance degradation was related to the NVreg_CheckPCIConfigSpace parameter of the kernel. According to RedHat, setting this parameter to 0 disables very slow accesses to the PCI configuration space. In a virtualized environment such as the AWS cloud, these accesses cause a trap in the hypervisor that results in even slower access.

NVreg_CheckPCIConfigSpace is a parameter of kernel module nvidia-current, that can be set using:

sudo modprobe nvidia-current NVreg_CheckPCIConfigSpace=0

We tested the effect of changing this parameter using a benchmark that calls MulC repeatedly (128x1000 times). Below are the results (runtime in sec) on our cg1.4xlarge instances:

KernelMulC npps_MulC
CheckPCI=1 3.37 103.04
CheckPCI=0 2.56 6.17

As you can see, disabling accesses to PCI space had a spectacular effect in the original npps functions, decreasing the runtime by 95%. The effect was significant even in our optimized Kernel functions saving almost 25% in runtime. However, it is important to note that even when the PCI access is disabled, our customized functions performed almost 60% better than the default ones.

We should also point out that there are other options, which we have not explored so far but could be useful for others. First, we could look at optimizing our code by applying a kernel fusion trick that combines several computation steps into one kernel to reduce the memory access. Finally, we could think about using Theano, the GPU Match compiler in Python, which is supposed to also improve performance in these cases.

G2 Instances

While our initial work was done using cg1.4xlarge EC2 instances, we were interested in moving to the new EC2 GPU g2.2xlarge instance type, which has a GRID K520 GPU (GK104 chip) with 1536 cores. Currently our application is also bounded by GPU memory bandwidth and the GRID K520‘s memory bandwidth is 198 GB/sec, which is an improvement over the Tesla M2050’s at 148 GB/sec. Of course, using a GPU with faster memory would also help (e.g. TITAN’s memory bandwidth is 288 GB/sec).

We repeated the same comparison between the default npps functions and our customized ones (with and without PCI space access) on the g2.2xlarge instances.

KernelMulC npps_MulC
CheckPCI=1 2.01 299.23
CheckPCI=0 0.97 3.48

One initial surprise was that we measured worse performance for npps on the g2 instances than the cg1 when PCI space access was enabled. However, disabling it improved performance between 45% and 65% compared to the cg1 instances. Again, our KernelMulC customized functions are over 70% better, with benchmark times under a second. Thus, switching to G2 with the right configuration allowed us to run our experiments faster, or alternatively larger experiments in the same amount of time.

Distributed Bayesian Hyperparameter Optimization

Once we had optimized the single-node training and testing operations, we were ready to tackle the issue of hyperparameter optimization. If you are not familiar with this concept, here is a simple explanation: Most machine learning algorithms have parameters to tune, which are called often called hyperparameters to distinguish them from model parameters that are produced as a result of the learning algorithm. For example, in the case of a Neural Network, we can think about optimizing the number of hidden units, the learning rate, or the regularization weight. In order to tune these, you need to train and test several different combinations of hyperparameters and pick the best one for your final model. A naive approach is to simply perform an exhaustive grid search over the different possible combinations of reasonable hyperparameters. However, when faced with a complex model where training each one is time consuming and there are many hyperparameters to tune, it can be prohibitively costly to perform such exhaustive grid searches. Luckily, you can do better than this by thinking of parameter tuning as an optimization problem in itself.

One way to do this is to use a Bayesian Optimization approach where an algorithm’s performance with respect to a set of hyperparameters is modeled as a sample from a Gaussian Process. Gaussian Processes are a very effective way to perform regression and while they can have trouble scaling to large problems, they work well when there is a limited amount of data, like what we encounter when performing hyperparameter optimization. We use package spearmint to perform Bayesian Optimization and find the best hyperparameters for the Neural Network training algorithm. We hook up spearmint with our training algorithm by having it choose the set of hyperparameters and then training a Neural Network with those parameters using our GPU-optimized code. This model is then tested and the test metric results used to update the next hyperparameter choices made by spearmint.

We’ve squeezed high performance from our GPU but we only have 1-2 GPU cards per machine, so we would like to make use of the distributed computing power of the AWS cloud to perform the hyperparameter tuning for all configurations, such as different models per international region. To do this, we use the distributed task queue Celery to send work to each of the GPUs. Each worker process listens to the task queue and runs the training on one GPU. This allows us, for example, to tune, train, and update several models daily for all international regions.

Although the Spearmint + Celery system is working, we are currently evaluating more complete and flexible solutions using HTCondor or StarCluster. HTCondor  can be used to manage the workflow of any Directed Acyclic Graph (DAG). It handles input/output file transfer and resource management. In order to use Condor, we need each compute node register into the manager with a given ClassAd (e.g. SLOT1_HAS_GPU=TRUE; STARD_ATTRS=HAS_GPU). Then the user can submit a job with a configuration "Requirements=HAS_GPU" so that the job only runs on AWS instances that have an available GPU. The main advantage of using Condor is that it also manages the distribution of the data needed for the training of the different models. Condor also allows us to run the Spearmint Bayesian optimization on the Manager instead of having to run it on each of the workers.

Another alternative is to use StarCluster , which is an open source cluster computing framework for AWS EC2 developed at MIT. StarCluster runs on the Oracle Grid Engine (formerly Sun Grid Engine) in a fault-tolerant way and is fully supported by Spearmint.

Finally, we are also looking into integrating Spearmint with Jobman in order to better manage the hyperparameter search workflow.
Figure below illustrates the generalized setup using Spearmint plus Celery, Condor, or StarCluster:


Implementing bleeding edge solutions such as using GPUs to train large-scale Neural Networks can be a daunting endeavour. If you need to do it in your own custom infrastructure, the cost and the complexity might be overwhelming. Levering the public AWS cloud can have obvious benefits, provided care is taken in the customization and use of the instance resources. By sharing our experience we hope to make it much easier and straightforward for others to develop similar applications.

We are always looking for talented researchers and engineers to join our team. So if you are interested in solving these types of problems, please take a look at some of our open positions on the Netflix jobs page .