The Moneta project: Next generation EVCache for better cost optimization
With the global expansion of Netflix earlier this year came the global expansion of data. After the Active-Active project and now with the N+1 architecture, the latest personalization data needs to be everywhere at all times to serve any member from any region. Caching plays a critical role in the persistence story for member personalization as detailed in this earlier blog post.
There are two primary components to the Netflix architecture. The first is the control plane that runs on the AWS cloud for generic, scalable computing for member signup, browsing and playback experiences. The second is the data plane, called Open Connect, which is our global video delivery network. This blog is about how we are bringing the power and economy of SSDs to EVCache - the primary caching system in use at Netflix for applications running in the control plane on AWS.
One of the main use cases of EVCache is to act as globally replicated storage for personalized data for each of our more than 81 million members. EVCache plays a variety of roles inside Netflix besides holding this data, including acting as a standard working-set cache for things like subscriber information. But its largest role is for personalization. Serving anyone from anywhere means that we must hold all of the personalized data for every member in each of the three regions that we operate in. This enables a consistent experience in all AWS regions and allows us to easily shift traffic during regional outages or during regular traffic shaping exercises to balance load. We have spoken at length about the replication system used to make this happen in a previous blog post.
During steady state, our regions tend to see the same members over and over again. Switching between regions is not a very common phenomenon for our members. Even though their data is in RAM in all three regions, only one region is being used regularly per member. Extrapolating from this, we can see that each region has a different working set for these types of caches. A small subset is hot data and the rest is cold.
Besides the hot/cold data separation, the cost of holding all of this data in memory is growing along with our member base. As well, different A/B tests and other internal changes can add even more data. For our working set of members, we have billions of keys already and that number will only grow. We have the challenge of continuing to support Netflix use cases while balancing cost. To meet this challenge, we are introducing a multi-level caching scheme using both RAM and SSDs.
The EVCache project to take advantage of this global request distribution and cost optimization is called Moneta, named for the Latin Goddess of Memory, and Juno Moneta, the Protectress of Funds for Juno.
We will talk about the current architecture of the EVCache servers and then talk about how this is evolving to enable SSD support.
The picture below shows a typical deployment for EVCache and the relationship between a single client instance and the servers. A client of EVCache will connect to several clusters of EVCache servers. In a region we have multiple copies of the whole dataset, separated by AWS Availability Zone. The dashed boxes delineate the in-region replicas, each of which has a full copy of the data and acts as a unit. We manage these copies as separate AWS Auto Scaling groups. Some caches have 2 copies per region, and some have many. This high level architecture is still valid for us for the foreseeable future and is not changing. Each client connects to all of the servers in all zones in their own region. Writes are sent to all copies and reads prefer topologically close servers for read requests. To see more detail about the EVCache architecture, see our original announcement blog post.
The server as it has evolved over the past few years is a collection of a few processes, with two main ones: stock Memcached, a popular and battle tested in-memory key-value store, and Prana, the Netflix sidecar process. Prana is the server's hook into the rest of Netflix's ecosystem, which is still primarily Java-based. Clients connect directly to the Memcached process running on each server. The servers are independent and do not communicate with one another.
As one of the largest subsystems of the Netflix cloud, we're in a unique position to apply optimizations across a significant percentage of our cloud footprint. The cost of holding all of the cached data in memory is growing along with our member base. The output of a single stage of a single day's personalization batch process can load more than 5 terabytes of data into its dedicated EVCache cluster. The cost of storing this data is multiplied by the number of global copies of data that we store. As mentioned earlier, different A/B tests and other internal changes can add even more data. For just our working set of members, we have many billions of keys today, and that number will only grow.
To take advantage of the different data access patterns that we observe in different regions, we built a system to store the hot data in RAM and cold data on disk. This is a classic two-level caching architecture (where L1 is RAM and L2 is disk), however engineers within Netflix have come to rely on the consistent, low-latency performance of EVCache. Our requirements were to be as low latency as possible, use a more balanced amount of (expensive) RAM, and take advantage of lower-cost SSD storage while still delivering the low latency our clients expect.
In-memory EVCache clusters run on the AWS r3 family of instance types, which are optimized for large memory footprints. By moving to the i2 family of instances, we gain access to 10 times the amount of fast SSD storage as we had on the r3 family (80 → 800GB from r3.xlarge to i2.xlarge) with the equivalent RAM and CPU. We also downgraded instance sizes to a smaller amount of memory. Combining these two, we have a potential of substantial cost optimization across our many thousands of servers.
The Moneta project introduces two new processes to the EVCache server: Rend and Mnemonic. Rend is a high-performance proxy written in Go with Netflix use cases as the primary driver for development. Mnemonic is a disk-backed key-value store based on RocksDB. Mnemonic reuses the Rend server components that handle protocol parsing (for speaking the Memcached protocols), connection management, and parallel locking (for correctness). All three servers actually speak the Memcached text and binary protocols, so client interactions between any of the three have the same semantics. We use this to our advantage when debugging or doing consistency checking.
Where clients previously connected to Memcached directly, they now connect to Rend. From there, Rend will take care of the L1/L2 interactions between Memcached and Mnemonic. Even on servers that do not use Mnemonic, Rend still provides valuable server-side metrics that we could not previously get from Memcached, such as server-side request latencies. The latency introduced by Rend, in conjunction with Memcached only, averages only a few dozen microseconds.
As a part of this redesign, we could have integrated the three processes together. We chose to have three independent processes running on each server to maintain separation of concerns. This setup affords better data durability on the server. If Rend crashes, the data is still intact in Memcached and Mnemonic. The server is able to serve customer requests once they reconnect to a resurrected Rend process. If Memcached crashes, we lose the working set but the data in L2 (Mnemonic) is still available. Once the data is requested again, it will be back in the hot set and served as it was before. If Mnemonic crashes, it wouldn't lose all the data, but only possibly a small set that was written very recently. Even if it did lose the data, at least we have the hot data still in RAM and available for those users who are actually using the service. This resiliency to crashes is on top of the resiliency measures in the EVCache client.
Rend, as mentioned above, acts as a proxy in front of the two other processes on the server that actually store the data. It is a high-performance server that speaks the binary and text Memcached protocols. It is written in Go and relies heavily on goroutines and other language primitives to handle concurrency. The project is fully open source and available on Github. The decision to use Go was deliberate, because we needed something that had lower latency than Java (where garbage collection pauses are an issue) and is more productive for developers than C, while also handling tens of thousands of client connections. Go fits this space well.
Rend has the responsibility of managing the relationship between the L1 and L2 caches on the box. It has a couple of different policies internally that apply to different use cases. It also has a feature to cut data into fixed size chunks as the data is being inserted into Memcached to avoid pathological behavior of the memory allocation scheme inside Memcached. This server-side chunking is replacing our client-side version, and is already showing promise. So far, it's twice as fast for reads and up to 30 times faster for writes. Fortunately, Memcached, as of 1.4.25, has become much more resilient to the bad client behavior that caused problems before. We may drop the chunking feature in the future as we can depend on L2 to have the data if it is evicted from L1.
The design of Rend is modular to allow for configurable functionality. Internally, there are a few layers: Connection management, a server loop, protocol-specific code, request orchestration, and backend handlers. To the side is a custom metrics package that enables Prana, our sidecar, to poll for metrics information while not being too intrusive. Rend also comes with a testing client library that has a separate code base. This has helped immensely in finding protocol bugs or other errors such as misalignment, unflushed buffers, and unfinished responses.
Rend's design allows different backends to be plugged in with the fulfillment of an interface and a constructor function. To prove this design out, an engineer familiar with the code base took less than a day to learn LMDB and integrate it as a storage backend. The code for this experiment can be found at https://github.com/Netflix/rend-lmdb.
Usage in Production
For the caches that Moneta serves best, there are a couple of different classes of clients that a single server serves. One class is online traffic in the hot path, requesting personalization data for a visiting member. The other is traffic from the offline and nearline systems that produce personalization data. These typically run in large batches overnight and continually write for hours on end.
The modularity allows our default implementation to optimize for our nightly batch compute by inserting data into L2 directly and smartly replacing hot data in L1, rather than letting those writes blow away our L1 cache during the nightly precompute. The replicated data coming from other regions can also be inserted directly into L2, since data replicated from another region is unlikely to be “hot” in its destination region. The diagram below shows the multiple open ports in one Rend process that both connect to the backing stores. With the modularity of Rend, it was easy to introduce another server on a different port for batch and replication traffic with only a couple more lines of code.
Rend itself is very high throughput. While testing Rend separately, we consistently hit network bandwidth or packet processing limits before maxing CPU. A single server, for requests that do not need to hit the backing store, has been driven to 2.86 million requests per second. This is a raw, but unrealistic, number. With Memcached as the only backing storage, Rend can sustain 225k inserts per second and 200k reads per second simultaneously on the largest instance we tested. An i2.xlarge instance configured to use both L1 and L2 (memory and disk) and data chunking, which is used as a standard instance for our production clusters, can perform 22k inserts per second (with sets only), 21k reads per second (with gets only), and roughly 10k sets and 10k gets per second if both are done simultaneously. These are lower bounds for our production traffic, because the test load consisted of random keys thus affording no data locality benefits during access. Real traffic will hit the L1 cache much more frequently than random keys do.
As a server-side application, Rend unlocks all kinds of future possibilities for intelligence on the EVCache server. Also, the underlying storage is completely disconnected from the protocol used to communicate. Depending on Netflix needs, we could move L2 storage off-box, replace the L1 Memcached with another store, or change the server logic to add global locking or consistency. These aren't planned projects, but they are possible now that we have custom code running on the server.
Mnemonic is our RocksDB-based L2 solution. It stores data on disk. The protocol parsing, connection management, and concurrency control of Mnemonic are all managed by the same libraries that power Rend. Mnemonic is another backend that is plugged into a Rend server. The native libraries in the Mnemonic project expose a custom C API that is consumed by a Rend handler.
The interesting parts of Mnemonic are in the C++ core layer that wraps RocksDB. Mnemonic handles the Memcached-style requests, implementing each of the needed operations to conform to Memcached behavior, including TTL support. It includes one more important feature: it shards requests across multiple RocksDB databases on a local system to reduce the work for each individual instance of RocksDB. The reasons why will be explored in the next section.
After looking at some options for efficiently accessing SSDs, we picked RocksDB, an embedded key-value store which uses a Log Structured Merge Tree data structure design. Write operations are first inserted into a in-memory data structure (a memtable) that is flushed to disk when full. When flushed to disk, the memtable becomes an immutable SST file. This makes most writes sequential to the SSD, which reduces the amount of internal garbage collection that the SSD must perform and thus improve latency on long running instances while also reducing wear.
One type of work that is done in the background by each separate instance of RocksDB includes compaction. We initially used the Level style compaction configuration, which was the main reason to shard the requests across multiple databases. However, while we were evaluating this compaction configuration with production data and production-like traffic, we found that compaction was causing a great deal of extra read/write traffic to the SSD, increasing latencies past what we found acceptable. The SSD read traffic surpassed 200MB/sec at times. Our evaluation traffic included a prolonged period where the number of write operations was high, simulating daily batch compute processes. During that period, RocksDB was constantly moving new L0 records into the higher levels, causing a very high write amplification.
To avoid this overhead, we switched to FIFO style compaction. In this configuration, no real compaction operation is done. Old SST files are deleted based on the maximum size of the database. Records stay on disk in level 0, so the records are only ordered by time across the multiple SST files. The downside of this configuration is that a read operation must check each SST file in reverse chronological order before a key is determined to be missing. This check does not usually require a disk read, as the RocksDB bloom filters prevent a high percentage of the queries from requiring a disk access on each SST. However, the number of SST files causes the overall effectiveness of the set of bloom filters to be less than the normal Level style compaction. The initial sharding of the incoming read and write requests across the multiple RocksDB instances helps lessen the negative impact of scanning so many files.
Re-running our evaluation test again with the final compaction configuration, we are able to achieve a 99th percentile latency of ~9ms for read queries during our precompute load. After the precompute load completed, the 99th percentile read latency reduced to ~600μs on the same level of read traffic. All of these tests were run without Memcached and without RocksDB block caching.
To allow this solution to work with more varied uses, we will need to reduce the number of SST files that needs to be checked per query. We are exploring options like RocksDB’s Universal style compaction or our own custom compaction where we could better control the compaction rate thereby lowering the amount of data transferred to and from the SSD.
We are rolling out our solution in phases in production. Rend is currently in production serving some of our most important personalization data sets. Early numbers show faster operations with increased reliability, as we are less prone to temporary network problems. We are in the process of deploying the Mnemonic (L2) backend to our early adopters. While we're still in the process of tuning the system, the results look promising with the potential for substantial cost savings while still allowing the ease of use and speed that EVCache has always afforded its users.
It has been quite a journey to production deployment, and there's still much to do: deploy widely, monitor, optimize, rinse and repeat. The new architecture for EVCache Server is allowing us to continue to innovate in ways that matter. If you want to help solve this or similar big problems in cloud architecture, join us.
- Scott Mansfield (@sgmansfield), Vu Tuan Nguyen , Sridhar Enugula, Shashi Madappa on behalf of the EVCache Team