Thursday, January 14, 2016

Dynomite with Redis on AWS - Benchmarks

About a year ago the Cloud Database Engineering (CDE) team published a post Introducing Dynomite. Dynomite is a proxy layer that provides sharding and replication and can turn existing non-distributed datastores into a fully distributed system with multi-region replication. One of its core features is the ability to scale a data store linearly to meet rapidly increasing traffic demands. Dynomite also provides high availability, and was designed and built to support Active-Active Multi-Regional Resiliency.
Dynomite, with Redis is now utilized as a production system within Netflix. This post is the first part of a series that pragmatically examines Dynomite's use cases and features. In this post, we will show performance results using Amazon Web Services (AWS) with and without the recently added consistency feature.

Dynomite Consistency

Dynomite extends eventual consistency to tunable consistency in the local region. The consistency level specifies how many replicas must respond to a write or read request before returning data to the client application. Read and write consistency can be configured to manage availability versus data accuracy. Consistency can be configured for read or write operations separately (cluster-wide). There are two configurations:
  • DC_ONE: Reads and writes are propagated synchronously only to the token owner in the local Availability Zone (AZ) and asynchronously replicated to other AZs and regions.
  • DC_QUORUM: Reads and writes are propagated synchronously to quorum number of nodes in the local region and asynchronously to the rest. The DC_QUORUM configuration writes to the number of nodes that make up a quorum. A quorum number is calculated by the formula ceiling((n+1)/2) where n is the number of nodes in a region. The operation succeeds if the read/write succeeded on a quorum number of nodes.

Test Setup

Client (workload generator) Cluster

For the workload generator, we used an internal Netflix tool called Pappy. Pappy is well integrated with other Netflix OSS services such as (Archaius for fast properties, Servo for metrics, and Eureka for discovery). However, any other other distributed load generator with Redis client plugin can be used to replicate the results. Pappy has support for modules, and one of them is Dyno Java client.
Dyno client uses topology aware load balancing (Token Aware) to directly connect to a Dynomite coordinator node that is the owner of the specified data. Dyno also uses zone awareness to send traffic to Dynomite nodes in the local ASG. To get full benefit of a Dynomite cluster a) the Dyno client cluster should be deployed across all ASGs, so all nodes can receive client traffic, and b) the number of client application nodes per ASG must be larger than the corresponding number of Dynomite nodes in the respective ASG so that the cumulative network capacity of the client cluster is at least equal to the corresponding one at the Dynomite layer.
Dyno also uses connection pooling for persistent connections to reduce the connection churn to the Dynomite nodes. However, in performance benchmarks tuning Dyno can tricky as the workload generator make become the bottleneck due to thread contention. In our benchmark, we observed the delay metrics to pick up a connection from the connection pool that Dyno exposes.
  • Client: Dyno Java client, using default configuration (token aware + zone aware)
  • Number of nodes: Equal to the number of Dynomite nodes in each experiment.
  • Region: us-west-2 (us-west-2a, us-west-2b and us-west-2c)
  • EC2 instance type: m3.2xlarge (30GB RAM, 8 CPU cores, Network throughput: high)
  • Platform: EC2-Classic
  • Data size: 1024 Bytes
  • Number of Keys: 1M random keys
  • Demo application used a simple workload of just key value pairs for read and writes i.e the Redis GET and SET api.
  • Read/Write ratio: 80:20 (the OPS was variable per test, but the ratio was kept 80:20)
  • Number of readers/writers: 80:20 ratio of reader to writer threads. 32 readers/8 writers  per Dynomite Node. We performed some experiments varying the number of readers and writers and found that in the context of our experiments, 32 readers and 8 writes per dynomite node gave the best throughput latency tradeoff.

Dynomite Cluster

  • Dynomite: Dynomite 0.5.6
  • Data store: Redis 2.8.9
  • Operating system: Ubuntu 14.04.2 LTS
  • Number of nodes: 3-48 (doubling every time)
  • Region: us-west-2 (us-west-2a, us-west-2b and us-west-2c)
  • EC2 instance type: r3.2xlarge (61GB RAM, 8 CPU cores, Network throughput: high)
  • Platform: EC2-Classic
The test was performed in a single region with 3 availability zones. Note that replicating to two other availability zones is not a requirement for Dynomite, but rather a deployment choice for high availability at Netflix. A Dynomite cluster of 3 nodes means that there was 1 node per availability zone. However all three nodes take client traffic as well as replication traffic from peer nodes. Our results were captured using Atlas. Each experiment was run 3 times, and our results are averaged over based on average of these times. Each run lasted 3h.
For the our benchmarks, we refer to the Dynomite node, as the node that contains the Dynomite layer and Redis. Hence we do not distinguish on whether Dynomite layer or Redis contributes to the average latency.

Linear Scale Test with DC_ONE

The graphs indicate that Dynomite can scale horizontally in terms of throughput. Therefore, it can handle more traffic by increasing the number of nodes per region. On a per node basis with r3.2xlarge nodes, Dynomite can fully process the traffic generated by the client workload generator (i.e. 32K reads OPS and 8K OPS on a per node basis). For a 1KB payload, as the above graphs show, the main bottleneck is the network (1Gbps EC2 instances). Therefore, Dynomite can potentially provide even faster throughput, if r3.4xlarge (2Gbps) or r3.8xlarge (10Gbps) EC2 instances are used. However, we need to note that 10Gbps optimizations will only be effective when the instances are launched in Amazon's VPC with instance types that can support Enhanced Networking using single root I/O virtualization (SR-IOV).
The average and median latency values show that Dynomite can provide sub-millisecond average latency to the client application. More specifically, Dynomite does not add extra latency as it scales to higher number of nodes, and therefore higher throughput. Overall, the Dynomite node contributes around 20% of the average latency, and the rest of it is a result of the network latency and client processing latency.
At the 95th percentile Dynomite's latency is 0.4ms and does not increase as we scale the cluster up/down. More specifically, the network and client is the major reason for the 95th percentile latency, as Dynomite's node effect is <10%.
It is evident from the 99th percentile graph that the latency for Dynomite pretty much remains the same while the client side increases indicating the variable nature of the network between the clusters.

Linear Scale Test With DC_QUORUM

The test setup was similar to what we used for DC_ONE tests above. Consistency was set to DC_QUORUM for both reads and writes on all Dynomite nodes. In DC_QUORUM, our expectations are that throughput will reduce and latency will increase because Dynomite waits for quorum number of responses.
Looking at the above graph it is clear that dynomite still scales well as the cluster nodes are increasing. Moreover Dynomite node achieves 18K OPS per node in our setup, when the cluster spans a single region. In comparison, Dynomite can achieve 40K OPS per node in DC_ONE.
The average and median latency remains <2.5ms even when DC_QUORUM consistency is enabled in Dynomite nodes. The average and median latency are slightly higher than the corresponding experiments with DC_ONE. In DC_ONE, the dynomite co-ordinator only waits for the local zone node to respond. In DC_QUORUM, the coordinator waits for quorum nodes to respond. Hence, in the overall read and write latency formula, the latency of the network hop to the other ASG, and the latency of performing the corresponding operation on those nodes in other ASGs must be included.
The 95th percentile at the Dynomite level is less than 2ms even after increasing the traffic on each Dyno client node (linear scale). At the client side it remains below 3ms.
At the 99th Percentile with DC_QUORUM enabled, Dynomite produces less than 3ms of latency. When considering the network from the cluster to the client, the latency remains well below 5ms opening the door for a number of applications that require consistency with low latency. Dynomite can report 90th percentile and 99.9th percentile through the statistics port. For brevity, we have decided to present the 99th percentile only.  

Pipelining

Redis Pipelining is client side batching that is also supported by Dynomite; the client application sends requests without waiting for a response from a previous request and later reads a single response for the whole batch. Pipelining makes it possible to increase the overall throughput at the expense of additional latency for individual operations. In the following experiments, the Dyno client randomly selected between 3 to 10 operations in one pipeline request. We believe that this configuration might be close to how a client application would use the Redis Pipelining. The experiments were performed for both DC_ONE and DC_QUORUM.
For comparison reason, we showcase both the non-pipelining and pipelining results. In our tests, pipelining increased the throughput upto 50%. For a small Dynomite cluster the improvement is larger, but as Dynomite horizontally scales the benefit of pipelining decreases.
Latency is a factor of how many requests are combined into one pipeline request so it will vary and will be higher than non pipelined requests.

Conclusion

We performed the tests to get some more insights about Dynomite using Redis at the data store layer, and how to size our clusters. We could have achieved better results with better instance types both at the client and Dyomite server cluster. For example, adding Dynomite nodes with better network capacity (especially the ones supporting enhanced Networking on Linux Instances in a VPC) could further increase the performance of our clusters.
Another way to improve the performance is by using fewer availability zones. In that case,  Dynomite would replicate the data in one more availability zone instead of two more, hence more bandwidth would have been available to client connections. In our experiment we used 3 availability zones in us-west-2, which is a common deployment in most production clusters at Netflix.
In summary, our benchmarks were based on instance types and deployments that are common at Netflix and Dynomite. We presented results that indicate that DC_QUORUM provides better read and write guarantees to the client but with higher latencies and lower throughput. We also showcased how a client can configure Redis Pipeline and benefit from request batching.
We briefly mentioned the availability of higher consistency in this article. In the next article we'll dive deeper into how we implemented higher consistency and how we handle anti-entropy.
by: Shailesh Birari, Jason Cacciatore, Minh Do, Ioannis Papapanagiotou, Christos Kalantzis