Wednesday, May 25, 2016

Application data caching using SSDs

The Moneta project: Next generation EVCache for better cost optimization

With the global expansion of Netflix earlier this year came the global expansion of data. After the Active-Active project and now with the N+1 architecture, the latest personalization data needs to be everywhere at all times to serve any member from any region. Caching plays a critical role in the persistence story for member personalization as detailed in this earlier blog post.

There are two primary components to the Netflix architecture. The first is the control plane that runs on the AWS cloud for generic, scalable computing for member signup, browsing and playback experiences. The second is the data plane, called Open Connect, which is our global video delivery network. This blog is about how we are bringing the power and economy of SSDs to EVCache - the primary caching system in use at Netflix for applications running in the control plane on AWS.

One of the main use cases of EVCache is to act as globally replicated storage for personalized data for each of our more than 81 million members. EVCache plays a variety of roles inside Netflix besides holding this data, including acting as a standard working-set cache for things like subscriber information. But its largest role is for personalization. Serving anyone from anywhere means that we must hold all of the personalized data for every member in each of the three regions that we operate in. This enables a consistent experience in all AWS regions and allows us to easily shift traffic during regional outages or during regular traffic shaping exercises to balance load. We have spoken at length about the replication system used to make this happen in a previous blog post.

During steady state, our regions tend to see the same members over and over again. Switching between regions is not a very common phenomenon for our members. Even though their data is in RAM in all three regions, only one region is being used regularly per member. Extrapolating from this, we can see that each region has a different working set for these types of caches. A small subset is hot data and the rest is cold.

Besides the hot/cold data separation, the cost of holding all of this data in memory is growing along with our member base, and different A/B tests and other internal changes can add even more data. For our working set of members, we already have billions of keys, and that number will only grow. We face the challenge of continuing to support Netflix use cases while balancing cost. To meet this challenge, we are introducing a multi-level caching scheme using both RAM and SSDs.

The EVCache project to take advantage of this global request distribution and cost optimization is called Moneta, named for the Roman goddess of memory, and for Juno Moneta, the protectress of funds.

Current Architecture

We will first describe the current architecture of the EVCache servers and then explain how it is evolving to enable SSD support.

The picture below shows a typical deployment for EVCache and the relationship between a single client instance and the servers. A client of EVCache will connect to several clusters of EVCache servers. In a region we have multiple copies of the whole dataset, separated by AWS Availability Zone. The dashed boxes delineate the in-region replicas, each of which has a full copy of the data and acts as a unit. We manage these copies as separate AWS Auto Scaling groups. Some caches have 2 copies per region, and some have many. This high-level architecture is still valid for us for the foreseeable future and is not changing. Each client connects to all of the servers in all zones in its own region. Writes are sent to all copies, while reads prefer topologically close servers. To see more detail about the EVCache architecture, see our original announcement blog post.

Client - Server architecture.png
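The fan-out behavior described above can be sketched in Go. This is a minimal illustration, not the real EVCache client API; the function names and zone strings are invented:

```go
package main

import "fmt"

// readOrder returns the zones to try for a read: the client's own
// (topologically closest) zone first, then the remaining replicas
// as fallbacks.
func readOrder(zones []string, localZone string) []string {
	ordered := make([]string, 0, len(zones))
	for _, z := range zones {
		if z == localZone {
			ordered = append([]string{z}, ordered...)
		} else {
			ordered = append(ordered, z)
		}
	}
	return ordered
}

// writeTargets returns every replica: writes fan out to all copies
// so each zone holds the full dataset.
func writeTargets(zones []string) []string {
	return zones
}

func main() {
	zones := []string{"us-east-1a", "us-east-1c", "us-east-1d"}
	fmt.Println(readOrder(zones, "us-east-1c")) // local zone first
	fmt.Println(writeTargets(zones))            // all zones
}
```

The key property is that reads stay cheap and local during steady state, while writes keep every in-region copy complete so traffic can shift between zones or regions at any time.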

The server as it has evolved over the past few years is a collection of a few processes, with two main ones: stock Memcached, a popular and battle-tested in-memory key-value store, and Prana, the Netflix sidecar process. Prana is the server's hook into the rest of Netflix's ecosystem, which is still primarily Java-based. Clients connect directly to the Memcached process running on each server. The servers are independent and do not communicate with one another.

Old Server Closeup.png


As one of the largest subsystems of the Netflix cloud, we're in a unique position to apply optimizations across a significant percentage of our cloud footprint. The cost of holding all of the cached data in memory is growing along with our member base. The output of a single stage of a single day's personalization batch process can load more than 5 terabytes of data into its dedicated EVCache cluster. The cost of storing this data is multiplied by the number of global copies of data that we store. As mentioned earlier, different A/B tests and other internal changes can add even more data. For just our working set of members, we have many billions of keys today, and that number will only grow.

To take advantage of the different data access patterns that we observe in different regions, we built a system to store the hot data in RAM and the cold data on disk. This is a classic two-level caching architecture (where L1 is RAM and L2 is disk); however, engineers within Netflix have come to rely on the consistent, low-latency performance of EVCache. Our requirements were to keep latency as low as possible, use a more balanced amount of (expensive) RAM, and take advantage of lower-cost SSD storage while still delivering the responsiveness our clients expect.

In-memory EVCache clusters run on the AWS r3 family of instance types, which are optimized for large memory footprints. By moving to the i2 family, we gain access to ten times as much fast SSD storage as on the r3 family (80GB → 800GB, from r3.xlarge to i2.xlarge) with equivalent RAM and CPU. We also downsized instances to ones with less memory. Combining these two changes, we have the potential for substantial cost optimization across our many thousands of servers.

Moneta architecture

The Moneta project introduces two new processes to the EVCache server: Rend and Mnemonic. Rend is a high-performance proxy written in Go, with Netflix use cases as the primary driver for development. Mnemonic is a disk-backed key-value store based on RocksDB. Mnemonic reuses the Rend server components that handle protocol parsing (for speaking the Memcached protocols), connection management, and parallel locking (for correctness). All three processes speak the Memcached text and binary protocols, so client interactions with any of them have the same semantics. We use this to our advantage when debugging or doing consistency checking.

New Server Closeup.png

Where clients previously connected to Memcached directly, they now connect to Rend. From there, Rend takes care of the L1/L2 interactions between Memcached and Mnemonic. Even on servers that do not use Mnemonic, Rend provides valuable server-side metrics that we could not previously get from Memcached, such as server-side request latencies. When backed by Memcached alone, the latency Rend introduces averages only a few dozen microseconds.
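The L1/L2 read path can be sketched roughly as follows. The `Store` interface and `mapStore` type here are invented for illustration and are not Rend's actual backend API:

```go
package main

import (
	"errors"
	"fmt"
)

var errMiss = errors.New("cache miss")

// Store is a minimal stand-in for a backing store
// (Memcached or Mnemonic in the text).
type Store interface {
	Get(key string) ([]byte, error)
	Set(key string, val []byte) error
}

// mapStore is an in-memory Store used for the demonstration.
type mapStore map[string][]byte

func (m mapStore) Get(k string) ([]byte, error) {
	if v, ok := m[k]; ok {
		return v, nil
	}
	return nil, errMiss
}

func (m mapStore) Set(k string, v []byte) error { m[k] = v; return nil }

// get tries the in-memory L1 first; on a miss it falls back to the
// disk-backed L2 and promotes the value back into L1 so the hot set
// stays in RAM.
func get(l1, l2 Store, key string) ([]byte, error) {
	if v, err := l1.Get(key); err == nil {
		return v, nil
	}
	v, err := l2.Get(key)
	if err != nil {
		return nil, err
	}
	_ = l1.Set(key, v) // best effort: promote to the hot set
	return v, nil
}

func main() {
	l1 := mapStore{}
	l2 := mapStore{"user:1": []byte("row-data")}
	v, _ := get(l1, l2, "user:1")
	fmt.Println(string(v)) // served from L2, now promoted into L1
}
```

The promotion step is what makes the L1 contents converge toward each region's actual working set over time.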

As a part of this redesign, we could have integrated the three processes into one. Instead, we chose to keep three independent processes on each server to maintain separation of concerns. This setup also affords better data durability on the server. If Rend crashes, the data is still intact in Memcached and Mnemonic, and the server can serve requests again as soon as clients reconnect to a restarted Rend process. If Memcached crashes, we lose the working set, but the data in L2 (Mnemonic) is still available; once a piece of data is requested again, it is pulled back into the hot set and served as before. If Mnemonic crashes, we lose at most a small set of very recently written data, and even then the hot data is still in RAM for the members actively using the service. This crash resiliency is on top of the resiliency measures in the EVCache client.

Rend

Rend, as mentioned above, acts as a proxy in front of the two other processes on the server that actually store the data. It is a high-performance server that speaks the binary and text Memcached protocols. It is written in Go and relies heavily on goroutines and other language primitives to handle concurrency. The project is fully open source and available on GitHub. The decision to use Go was deliberate: we needed something with lower latency than Java (where garbage collection pauses are an issue) that is more productive for developers than C, while also handling tens of thousands of client connections. Go fits this space well.

Rend has the responsibility of managing the relationship between the L1 and L2 caches on the box. It has a couple of different policies internally that apply to different use cases. It also has a feature to cut data into fixed size chunks as the data is being inserted into Memcached to avoid pathological behavior of the memory allocation scheme inside Memcached. This server-side chunking is replacing our client-side version, and is already showing promise. So far, it's twice as fast for reads and up to 30 times faster for writes. Fortunately, Memcached, as of 1.4.25, has become much more resilient to the bad client behavior that caused problems before. We may drop the chunking feature in the future as we can depend on L2 to have the data if it is evicted from L1.
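A simplified sketch of fixed-size chunking follows. The chunk size and key-suffix scheme are made up for illustration; Rend's real chunking also tracks metadata and handles partial reads and writes:

```go
package main

import "fmt"

// chunk splits a value into fixed-size pieces before insertion into
// Memcached, so every stored item lands in a uniform slab class and
// avoids pathological behavior in Memcached's memory allocator.
func chunk(key string, value []byte, size int) map[string][]byte {
	out := make(map[string][]byte)
	for i, n := 0, 0; i < len(value); i, n = i+size, n+1 {
		end := i + size
		if end > len(value) {
			end = len(value)
		}
		// Each piece gets a derived key; a real implementation also
		// needs a metadata entry to reassemble and validate pieces.
		out[fmt.Sprintf("%s:%d", key, n)] = value[i:end]
	}
	return out
}

func main() {
	parts := chunk("user:42", make([]byte, 1000), 300)
	fmt.Println(len(parts)) // 4 chunks: 300 + 300 + 300 + 100 bytes
}
```

Doing this on the server rather than in each client keeps the policy in one place, which is part of why the server-side version could be made so much faster.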


The design of Rend is modular to allow for configurable functionality. Internally, there are a few layers: Connection management, a server loop, protocol-specific code, request orchestration, and backend handlers. To the side is a custom metrics package that enables Prana, our sidecar, to poll for metrics information while not being too intrusive. Rend also comes with a testing client library that has a separate code base. This has helped immensely in finding protocol bugs or other errors such as misalignment, unflushed buffers, and unfinished responses.

Rend Internals.png

Rend's design allows different backends to be plugged in with the fulfillment of an interface and a constructor function. To prove this design out, an engineer familiar with the code base took less than a day to learn LMDB and integrate it as a storage backend. The code for this experiment can be found at
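The interface-plus-constructor idea can be sketched like this. The `Handler` interface and registry below are illustrative stand-ins, not Rend's exact types; the real interface covers the full Memcached command set:

```go
package main

import "fmt"

// Handler is a stand-in for the backend interface a storage engine
// must fulfill to plug into the server.
type Handler interface {
	Get(key string) ([]byte, error)
	Set(key string, val []byte) error
}

// HandlerConst mirrors the "constructor function" idea: the server
// is configured with a function that builds a backend, so swapping
// in LMDB (or anything else) means supplying a new constructor.
type HandlerConst func() (Handler, error)

var backends = map[string]HandlerConst{}

func register(name string, c HandlerConst) { backends[name] = c }

func newBackend(name string) (Handler, error) {
	c, ok := backends[name]
	if !ok {
		return nil, fmt.Errorf("unknown backend %q", name)
	}
	return c()
}

// memHandler is a trivial in-memory backend for the demonstration.
type memHandler map[string][]byte

func (m memHandler) Get(k string) ([]byte, error)  { return m[k], nil }
func (m memHandler) Set(k string, v []byte) error { m[k] = v; return nil }

func main() {
	register("memory", func() (Handler, error) { return memHandler{}, nil })
	b, _ := newBackend("memory")
	b.Set("k", []byte("v"))
	v, _ := b.Get("k")
	fmt.Println(string(v))
}
```

Because the protocol layer only ever sees the interface, a new storage engine needs no changes to connection handling or protocol parsing, which is what made the one-day LMDB experiment feasible.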

Usage in Production

For the caches that Moneta serves best, there are a couple of different classes of clients that a single server serves. One class is online traffic in the hot path, requesting personalization data for a visiting member. The other is traffic from the offline and nearline systems that produce personalization data. These typically run in large batches overnight and continually write for hours on end.

The modularity allows our default implementation to optimize for our nightly batch compute by inserting data into L2 directly and smartly replacing hot data in L1, rather than letting those writes blow away our L1 cache during the nightly precompute. Replicated data coming from other regions can also be inserted directly into L2, since data replicated from another region is unlikely to be "hot" in its destination region. The diagram below shows the multiple open ports in one Rend process, each connecting to the backing stores. With the modularity of Rend, it was easy to introduce another server on a different port for batch and replication traffic with only a couple more lines of code.

Rend Ports.png
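A rough sketch of the per-port write policy follows. Port numbers and names are invented for illustration: the batch/replication port writes straight to L2 and only touches L1 when the key is already hot there:

```go
package main

import "fmt"

// policy describes how writes arriving on a given port are handled.
type policy struct {
	name      string
	writeToL1 bool
}

// Two listeners, two policies: online traffic writes through both
// levels; batch and replication traffic goes straight to L2.
var ports = map[int]policy{
	11211: {name: "online", writeToL1: true},
	11212: {name: "batch/replication", writeToL1: false},
}

// routeSet returns which cache levels a set should touch. On the
// batch path, L1 is updated only when the key is already resident
// there, replacing stale hot data without polluting RAM.
func routeSet(port int, alreadyHotInL1 bool) []string {
	if ports[port].writeToL1 || alreadyHotInL1 {
		return []string{"L1", "L2"}
	}
	return []string{"L2"}
}

func main() {
	fmt.Println(routeSet(11211, false)) // online: both levels
	fmt.Println(routeSet(11212, false)) // batch, cold key: L2 only
	fmt.Println(routeSet(11212, true))  // batch, hot key: refresh L1 too
}
```

This keeps hours of overnight batch writes from evicting the data that online requests are actively reading.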


Rend itself is very high throughput. While testing Rend separately, we consistently hit network bandwidth or packet processing limits before maxing CPU. A single server, for requests that do not need to hit the backing store, has been driven to 2.86 million requests per second. This is a raw, but unrealistic, number. With Memcached as the only backing storage, Rend can sustain 225k inserts per second and 200k reads per second simultaneously on the largest instance we tested. An i2.xlarge instance configured to use both L1 and L2 (memory and disk) and data chunking, which is used as a standard instance for our production clusters, can perform 22k inserts per second (with sets only), 21k reads per second (with gets only), and roughly 10k sets and 10k gets per second if both are done simultaneously. These are lower bounds for our production traffic, because the test load consisted of random keys thus affording no data locality benefits during access. Real traffic will hit the L1 cache much more frequently than random keys do.

As a server-side application, Rend unlocks all kinds of future possibilities for intelligence on the EVCache server. Also, the underlying storage is completely disconnected from the protocol used to communicate. Depending on Netflix needs, we could move L2 storage off-box, replace the L1 Memcached with another store, or change the server logic to add global locking or consistency. These aren't planned projects, but they are possible now that we have custom code running on the server.

Mnemonic

Mnemonic is our RocksDB-based L2 solution. It stores data on disk. The protocol parsing, connection management, and concurrency control of Mnemonic are all managed by the same libraries that power Rend. Mnemonic is another backend that is plugged into a Rend server. The native libraries in the Mnemonic project expose a custom C API that is consumed by a Rend handler.

Mnemonic stack.png

The interesting parts of Mnemonic are in the C++ core layer that wraps RocksDB. Mnemonic handles the Memcached-style requests, implementing each of the needed operations to conform to Memcached behavior, including TTL support. It includes one more important feature: it shards requests across multiple RocksDB databases on a local system to reduce the work for each individual instance of RocksDB. The reasons why will be explored in the next section.
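Sharding across the local RocksDB databases can be as simple as a stable hash of the key. FNV-1a below is an arbitrary choice for illustration; any stable hash works, and Mnemonic's actual scheme may differ:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor picks which local RocksDB instance handles a key, so
// compaction and other per-database background work is spread over
// several smaller databases instead of one large one.
func shardFor(key string, numShards int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % numShards
}

func main() {
	for _, k := range []string{"user:1", "user:2", "user:3"} {
		fmt.Printf("%s -> rocksdb shard %d\n", k, shardFor(k, 8))
	}
}
```

The hash must be stable across restarts, since each key has to land in the same database that holds its existing data.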


After looking at some options for efficiently accessing SSDs, we picked RocksDB, an embedded key-value store built on a Log-Structured Merge-tree (LSM) design. Write operations are first inserted into an in-memory data structure (a memtable) that is flushed to disk when full; on flush, the memtable becomes an immutable SST file. This makes most writes sequential on the SSD, which reduces the amount of internal garbage collection the SSD must perform, improving latency on long-running instances while also reducing wear.

Each separate instance of RocksDB performs background work, including compaction. We initially used the Level style compaction configuration, which was the main reason to shard the requests across multiple databases. However, while evaluating this configuration with production data and production-like traffic, we found that compaction was causing a great deal of extra read/write traffic to the SSD, increasing latencies past what we found acceptable. The SSD read traffic surpassed 200MB/sec at times. Our evaluation traffic included a prolonged period with a high number of write operations, simulating daily batch compute processes. During that period, RocksDB was constantly moving new L0 records into the higher levels, causing very high write amplification.

To avoid this overhead, we switched to FIFO style compaction. In this configuration, no real compaction operation is done. Old SST files are deleted based on the maximum size of the database. Records stay on disk in level 0, so they are ordered only by time across the multiple SST files. The downside of this configuration is that a read operation must check each SST file in reverse chronological order before a key can be determined to be missing. This check does not usually require a disk read, as the RocksDB bloom filters prevent a high percentage of the queries from requiring a disk access on each SST. However, the number of SST files makes the overall effectiveness of the set of bloom filters lower than with the normal Level style compaction. The initial sharding of incoming read and write requests across the multiple RocksDB instances helps lessen the negative impact of scanning so many files.
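The reverse-chronological read path can be sketched as follows, with a plain set standing in for the bloom filter. Real bloom filters are probabilistic: they can yield false positives (a wasted disk read) but never false negatives, so skipping a file on a negative answer is always safe:

```go
package main

import "fmt"

// sstFile is a stand-in for one level-0 SST file under FIFO
// compaction: a membership filter plus the file's keys.
type sstFile struct {
	mayContain map[string]bool
	data       map[string]string
}

func newSST(data map[string]string) *sstFile {
	f := &sstFile{mayContain: map[string]bool{}, data: data}
	for k := range data {
		f.mayContain[k] = true
	}
	return f
}

// lookup scans SST files newest-first, consulting each file's
// filter before touching its data. With no compaction, the newest
// file holding the key wins; a key is missing only after every
// file has been ruled out.
func lookup(newestFirst []*sstFile, key string) (string, bool) {
	for _, f := range newestFirst {
		if !f.mayContain[key] {
			continue // filter says "definitely absent": skip the disk read
		}
		if v, ok := f.data[key]; ok {
			return v, true
		}
	}
	return "", false
}

func main() {
	older := newSST(map[string]string{"user:1": "old-row"})
	newer := newSST(map[string]string{"user:1": "new-row"})
	v, _ := lookup([]*sstFile{newer, older}, "user:1")
	fmt.Println(v)
}
```

The cost of a miss grows with the number of SST files, which is exactly why sharding across several smaller databases helps.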


Re-running our evaluation with the final compaction configuration, we were able to achieve a 99th percentile latency of ~9ms for read queries during our precompute load. After the precompute load completed, the 99th percentile read latency dropped to ~600μs at the same level of read traffic. All of these tests were run without Memcached and without RocksDB block caching.

To allow this solution to work for more varied uses, we will need to reduce the number of SST files that need to be checked per query. We are exploring options like RocksDB's Universal style compaction, or a custom compaction scheme in which we could better control the compaction rate, thereby lowering the amount of data transferred to and from the SSD.


We are rolling out our solution in phases in production. Rend is currently in production serving some of our most important personalization data sets. Early numbers show faster operations with increased reliability, as we are less prone to transient network problems. We are in the process of deploying the Mnemonic (L2) backend to our early adopters. While we are still tuning the system, the results look promising, with the potential for substantial cost savings while preserving the ease of use and speed that EVCache has always afforded its users.

It has been quite a journey to production deployment, and there's still much to do: deploy widely, monitor, optimize, rinse and repeat. The new architecture for EVCache Server is allowing us to continue to innovate in ways that matter. If you want to help solve this or similar big problems in cloud architecture, join us.

Tuesday, May 24, 2016

Netflix Hack Day - Spring 2016

Keeping with our roughly six-month cadence, Netflix recently hosted another fantastic Hack Day event.  As was the case with our past installments, Hack Day is a way for our product development staff to take a break from everyday work to have fun, experiment with new technologies, collaborate with new people, and to generally be creative.

This time, we were lucky to have a team of Netflixers emerge to hack together a documentary video of the event.  The following provides a peek into how the event operates:

We had an excellent turnout, with over 200 engineers and designers working on more than 80 hacks. The hacks ranged from product ideas to internal tools to improvements in our recruiting process. We've embedded videos below, produced by the hackers, showcasing some of our favorites. You can also see more hacks from several of our past events: November 2015, March 2015, Feb. 2014 & Aug. 2014.

As always, while we think these hacks are very cool and fun, they may never become part of the Netflix product, internal infrastructure, or otherwise be used beyond Hack Day.  We are posting them here publicly to simply share the spirit of the event and our culture of innovation.

Thanks to all of the hackers for putting together some incredible work!

Netflix Zone
An alternate Netflix viewing experience created for the HTC Vive (VR). You just crossed over into...the Netflix Zone.

Integration of our open-source continuous delivery platform, Spinnaker, with Minecraft. Enough said.

Wish your recently watched row did not fall down from the top of your row list? Want to add other rows that we are not displaying to you yet? Presenting a desktop Netflix experience with drag and drop repositionable rows, pinned rows, and add/remove rows.
Family Catch-up Viewing
See what your other profiles are watching in the player so you can know exactly where they left off in the episode. Now there’s no excuse to (accidentally) watch ahead!
By Lyle Troxell

You want to watch Marvel's Daredevil on your Cast TV or Chromecast, but you just put your kid to bed in the next room. No problem: you plug headphones into your Android tablet and cast the show. The audio plays from your tablet while the cast target plays the video in silence.
By Kevin Morris, Baskar Odayarkoil and Shaomei Chen

Wednesday, May 11, 2016

Building High-Performance Mobile Applications at Netflix

On April 21, we hosted our first Mobile event at Netflix with two talks focusing on the challenges we’ve faced building performant mobile experiences, and how we addressed them.

Francois Goldfain, Engineering Manager for the Android Platform team, kicked off the event with his talk on optimizing the Netflix Android app.  Francois shared insights into how we’ve made the most of the limited memory, network, and battery resources available on Android devices, and shared some of the mobile engineering best practices we built into our application and why we did them.

For the second talk, Shiva Garlapati, Senior Software Engineer on the Android Innovation team, and Lawrence Jones, Senior UI Engineer on the Originals UI team, gave a joint talk sharing their experiences building smooth animations on Android and iOS, respectively. They spoke about the challenges they faced as well as the best practices they used to overcome them across the wide range of devices we support.

Optimizing Netflix for Android

Smooth Animations on Android

Smooth Animations on iOS

These talks and many others from past Netflix Engineering events can be found on our YouTube channel.    

By Maxine Cheung

Thursday, May 5, 2016

Highlights from PRS2016 workshop

Personalized recommendations and search are the primary ways Netflix members find great content to watch. We’ve written much about how we build them and some of the open challenges. Recently we organized a full-day workshop at Netflix on Personalization, Recommendation and Search (PRS2016), bringing together researchers and practitioners in these three domains. It was a forum to exchange information, challenges and practices, as well as strengthen bridges between these communities. Seven invited speakers from the industry and academia covered a broad range of topics, highlighted below. We look forward to hearing more and continuing the fascinating conversations at PRS2017!

Semantic Entity Search, Maarten de Rijke, University of Amsterdam

Entities, such as people, products, and organizations, are the ingredients around which most conversations are built. A very large fraction of the queries submitted to search engines revolve around entities, so it is no wonder that the information retrieval community continues to devote a lot of attention to entity search. In this talk, Maarten discussed recent advances in entity retrieval. Most of the talk focused on unsupervised semantic matching methods for entities that can learn from the raw textual evidence associated with entities alone. Maarten then pointed out challenges (and partial solutions) in learning such representations in a dynamic setting and in improving them using interaction data.

Combining matrix factorization and LDA topic modeling for rating prediction and learning user interest profiles, Deborah Donato, StumbleUpon

Matrix Factorization through Latent Dirichlet Allocation (fLDA) is a generative model for concurrent rating prediction and topic/persona extraction. It learns the topic structure of URLs and topic affinity vectors for users, and predicts ratings as well. The fLDA model achieves several goals for StumbleUpon in a single framework: it allows for unsupervised inference of latent topics in the URLs served to users, and for users to be represented as mixtures over the same topics learned from the URLs (in the form of affinity vectors generated by the model). Deborah presented an ongoing effort, inspired by the fLDA framework, to extend the original approach to an industrial environment. The current implementation uses a (much faster) expectation maximization method for parameter estimation instead of Gibbs sampling as in the original work, and implements a modified version of the model in which topic distributions are learned independently using LDA prior to training the main model. Although the effort is ongoing, Deborah presented very interesting results.

Exploiting User Relationships to Accurately Predict Preferences in Large Scale Networks, Jennifer Neville, Purdue University

The popularity of social networks and social media has increased the amount of information available about users' behavior online, including current activities and interactions among friends and family. This rich relational information can be used to predict user interests and preferences even when individual data is sparse, since the characteristics of friends are often correlated. Although relational data offer several opportunities to improve predictions about users, the characteristics of online social network data also present a number of challenges to accurately incorporating the network information into machine learning systems. This talk outlined some of the algorithmic and statistical challenges that arise from partially observed, large-scale networks, and described methods for semi-supervised learning and active exploration that address them.

Why would you recommend me THAT!?, Aish Fenton, Netflix

With so many advances in machine learning recently, it’s not unreasonable to ask: why aren’t my recommendations perfect by now? Aish provided a walkthrough of the open problems in the area of recommender systems, especially as they apply to Netflix’s personalization and recommendation algorithms. He also provided a brief overview of recommender systems, and sketched out some tentative solutions for the problems he presented.

Diversity in Radio, David Ross, Google

Many services offer streaming radio stations seeded by an artist or song, but what does that mean? To get specific, what fraction of the songs in “Taylor Swift Radio” should be by Taylor Swift? David provided a short introduction to the YouTube Radio project, and dived into the diversity problem, sharing some insights Google has learned from live experiments and human evaluations.

Immersive Recommendation Using Personal Digital Traces, Deborah Estrin and Andy Hsieh, Cornell Tech

From topics referred to in Twitter or email, to web browser histories, to videos watched and products purchased online, our digital traces (small data) reflect who we are, what we do, and what we are interested in. In this talk, Deborah and Andy presented a new user-centric recommendation model, called Immersive Recommendation, that incorporates diverse, cross-platform personal digital traces into recommendations. They discussed techniques that infer users' interests from personal digital traces while suppressing the context-specific noise replete in these traces, and proposed a hybrid collaborative filtering algorithm that fuses the user interests with content and rating information to achieve superior recommendation performance throughout a user's lifetime, including in cold-start situations. They illustrated this idea with personalized news and local event recommendations. Finally, they discussed future research directions and applications that incorporate richer multimodal user-generated data into recommendations, and the potential benefits of turning such systems into tools for awareness and aspiration.

Response prediction for display advertising, Olivier Chapelle, Criteo (paper)

Click-through and conversion rate estimation are two core prediction tasks in display advertising. Olivier presented a machine learning framework based on logistic regression that is specifically designed to tackle the specifics of display advertising. The resulting system has the following characteristics: it is easy to implement and deploy; it is highly scalable (they have trained it on terabytes of data); and it provides models with state-of-the-art accuracy. Olivier described how the system uses explore/exploit machinery to constantly vary and evolve its predictive model on live streaming data.

We hope you find these presentations stimulating. We certainly did, and we look forward to organizing similar events in the future! If you'd like to help us tackle challenges in these areas and help our members find great stories to enjoy, check out our job postings.