Monday, November 3, 2014

Introducing Dynomite - Making Non-Distributed Databases, Distributed

Introduction & Overview

Netflix has long been a proponent of the microservices model. This model offers higher-availability, resiliency to failure and loose coupling. The downside to such an architecture is the potential for a latent user experience. Every time a customer loads up a homepage or starts to stream a movie, there are a number of microservices involved to complete that request. Most of these microservices use some kind of stateful system to store and serve data. A few milliseconds here and there can add up quickly and result in a multi-second response time.
The Cloud Database Engineering team at Netflix is always looking for ways to shave off milliseconds from an application’s database response time, while maintaining our goal of local high-availability and multi-datacenter high-availability. With that goal in mind, we created Dynomite.
Inspired by the Dynamo White Paper as well as our experience with Apache Cassandra, Dynomite is a sharding and replication layer. Dynomite can make existing non distributed datastores, such as Redis or Memcached, into a fully distributed & multi-datacenter replicating datastore.

Server Architecture


In the open source world, there are various single-server datastore solutions, e.g. Memcached, Redis, BerkeleyDb, LevelDb, Mysql (datastore).  The availability story for these single-server datastores usually ends up being a master-slave setup. Once traffic demands overrun this setup, the next logical progression is to introduce sharding.  Most would agree that it is non trivial to operate this kind of a setup. Furthermore, managing data from different shards is also a challenge for application developers.
In the age of high scalability and big data, Dynomite’s design goal is to turn those single-server datastore solutions into peer-to-peer, linearly scalable, clustered systems while still preserving the native client/server protocols of the datastores, e.g., Redis protocol.
Now we will introduce a few high level concepts that are core to the Dynomite server architecture design.

Dynomite Topology

A Dynomite cluster consists of multiple data centers (dc). A datacenter is a group of racks, and a rack is a group of nodes. Each rack consists of the entire dataset, which is partitioned across multiple nodes in that rack. Hence, multiple racks enable higher availability for data. Each node in a rack has a unique token, which helps to identify the dataset it owns.

Each Dynomite node (e.g., a1 or b1 or c1)  has a Dynomite process co-located with the datastore server, which acts as a proxy, traffic router, coordinator and gossiper. In the context of the Dynamo paper, Dynomite is the Dynamo layer with additional support for pluggable datastore proxy, with an effort to preserve the native datastore protocol as much as possible.  
A datastore can be either a volatile datastore such as Memcached or Redis, or persistent datastore such as Mysql, BerkeleyDb or LevelDb.  Our current open sourced Dynomite offering supports Redis and Memcached.


A client can connect to any node on a Dynomite cluster when sending write traffic.  If the Dynomite node happens to own the data based on its token, then the data is written to the local datastore server process and asynchronously replicated to other racks in the cluster across all data centers.  If the node does not own the data, it acts as a coordinator and sends the write to the node owning the data in the same rack. It also replicates the writes to the corresponding nodes in other racks and DCs.  
In the current implementation, a coordinator returns an Ok back to client if a node in the local rack successfully stores the write and all other remote replications will happen asynchronously.
The pic below shows an example for the latter case where client sends a write request to non-owning node. It belongs on nodes a2,b2,c2 and d2 as per the partitioning scheme. The request is sent to a1 which acts as the coordinator and sends the request to the appropriate nodes.

Highly available reads

Multiple racks and multiple data centers provide high availability. A client can connect to any node to read the data. Similar to writes, a node serves the read request if it owns the data, otherwise it forwards the read request to the data owning node in the same rack.  Dynomite clients can fail over to replicas in remote racks and/or data centers in case of node, rack, or data center failures.

Pluggable Datastores

Dynomite currently supports Redis and Memcached, thanks to the TwitterOSS Twemproxy project.  For each of the data stores, based on our usage experience, a pragmatic subset of the most useful Redis/Memcached APIs are supported. Support for additional APIs will be added as needed in the near future.

Standard open source Memcached/Redis ASCII protocol support

Any client that can talk to Memcached or Redis can talk to Dynomite - no change needed. However, there will be a few things missing, including failover strategy, request throttling, connection pooling, etc., unless our Dyno client is used (more details to in the Client Architecture section).

Scalable I/O event notification server

All incoming/outgoing data traffic is processed by a single threaded I/O event loop.  There are additional threads for background or administrative tasks.  All thread communications are based on lock-free circular queue message passing, and asynchronous message processing.  This style of implementation enables each Dynomite node to handle a very large number of client connections while still processing many non-client facing tasks in parallel.

Peer-to-peer,  and linearly scalable

Every Dynomite node in a cluster has the same role and responsibility. Hence, there is no single point of failure in a cluster.  With this advantage, one can simply add more nodes to a Dynomite cluster to meet traffic demands or loads.
Cold cache warm-up
Currently, this feature is available for Dynomite with the Redis datastore.  Dynomite can help to reduce the performance impact by filling up an empty node or nodes with data from its peers.  
Asymmetric multi-datacenter replications
As seen earlier, a write can be replicated over to multiple datacenters. In different datacenters, Dynomite can be configured with different number of racks with different number of nodes.  This helps greatly when there are unbalanced traffic into different datacenters.
Internode communication and Gossip
Dynomite with built-in gossip helps to maintain cluster membership as well as failure detection and recovery.  This simplifies the maintenance operations on Dynomite clusters.   
Functional in AWS and physical datacenter
In AWS environment, a datacenter is equivalent an AWS’ region and a rack is the same as an AWS’ availability zone.  At Netflix, we have more tools to support running Dynomite clusters within AWS but in general, both deployments in these two environments should be similar.

Client Architecture

Dynomite server implements the underlying datastore protocol and presents that as its public interface. Hence, one can use popular java clients like Jedis, Redisson and SpyMemcached to directly speak to Dynomite.
At Netflix, we see the benefit in encapsulating client side complexity and best practices in one place instead of having every application repeat the same engineering effort, e.g., topology-aware routing, effective failover, load shedding with exponential backoff, etc.
Dynomite ships with a Netflix homegrown client called Dyno. Dyno implements patterns inspired by Astyanax (the Cassandra client at Netflix), on top of popular clients like Jedis, Redisson and SpyMemcached, to ease the migration to Dyno and Dynomite.
Dyno Client Features
  • Connection pooling of persistent connections - this helps reduce connection churn on the Dynomite server with client connection reuse.
  • Topology aware load balancing (Token Aware) for avoiding any intermediate hops to a Dynomite coordinator node that is not the owner of the specified data.
  • Application specific local rack affinity based request routing to Dynomite nodes.
  • Application resilience by intelligently failing over to remote racks when local Dynomite rack nodes fail.
  • Application resilience against network glitches by constantly monitoring connection health and recycling unhealthy connections.   
  • Capability of surgically routing traffic away from any nodes that need to be taken offline for maintenance.
  • Flexible retry policies such as exponential backoff etc
  • Insight into connection pool metrics
  • Highly configurable and pluggable connection pool components for implementing your advanced features.
Here is an example of how Dyno does failover to improve app resilience against individual node problems.

Fun facts

Dyno client strives to maintain compatibility with client interfaces like Jedis, which greatly reduces the barrier for apps that are already using Jedis when performing a switch to Dynomite.  
Also, since Dynomite implements both Redis and Memcached protocols, one can use Dyno to  directly connect to Redis/Memcached itself and bypass Dynomite (if needed). Just switch the connection port from Dynomite server port to the redis server port.
Having a layer of indirection with our own homegrown client gives Netflix the flexibility to do other cool things such as
  1. Request interception - you should be able to plug in your own interceptor to do things such as
    1. Implement query trace or slow query logging.
    2. Implement fault injection for testing application resilience when things go south server side.
  2. Micro batching - submitting a batch or requests to a distributed db gets tricky since different keys map to different servers as per the sharding/hashing strategy. Dyno has the capability to take a user submitted batch, split it into shard aware micro-batches under the covers, execute them individually and then stitch the results back together before getting back to the user. Obviously one has to deal with partial failure here, and Dyno has the intelligence to retry just the failed micro-batch against the remote rack replica responsible for that hash partition.
  3. Load shedding - Dyno’s interceptor model for every request will give it the ability to do quota management and rate limiting in order to protect the backend Dynomite servers.

Linear scale test

We wanted to ensure that Dynomite could scale horizontally to meet traffic demands from hundreds of micro-services at Netflix as the company expands its global footprint.

We conducted a simple test with a static Dynomite cluster of size 6 and a load test harness that uses dyno client. The cluster was configured to have replication factor of 3 i.e it was a single data center with 3 racks.

We ramped up requests against the cluster while ensuring that 99 percentile latencies were still in the single digit ms range.

We then scaled up both server fleet and client fleet proportionally and repeated the test. We went through a few of cycles of scaling i.e  6 -> 12 -> 24 and at each stage we recorded the sustained throughput where the avg and 99 percentile latencies were within acceptable range.
i.e  < 1ms for avg latency and 3-6 ms for 99 percentile latency.  

We saw that Dynomite scales linearly as we add more nodes to the cluster. This is critical for a datastore at Netflix where we want surgical control on throughput and latency with a predictable cost model. Dynomite enables just that.

Long Term Vision & Roadmap

Dynomite has the potential to offer server-based sharding and replication for any datastore, as long as a proxy is created to intercept the desired API calls.
This initial version of Dynomite, supports Redis and Memcahed sharding and replication in clear text, backups and restore. In the next few weeks, we will be implementing encrypted inter-datacenter communication. We also have plans to implement reconciliation (repair) of the cluster’s data and support different read/write consistency setting, making this an eventually consistent datastore.
On the Dyno client side we plan on adding other cool features such as load shedding, distributed pipelining and micro-batching. We are also looking at integrating with RxJava to provide a reactive API to Redis/Memcached which will enable apps to observe sequences of data and events.
by: Minh Do, Puneet Oberai, Monal Daxini & Christos Kalantzis