Thursday, September 1, 2016

Netflix Data Benchmark: Benchmarking Cloud Data Stores

The Netflix member experience is offered to 83+ million global members, and delivered using thousands of microservices. These services are owned by multiple teams, each having their own build and release lifecycles, generating a variety of data that is stored in different types of data store systems. The Cloud Database Engineering (CDE) team manages those data store systems, so we run benchmarks to validate updates to these systems, perform capacity planning, and test our cloud instances with multiple workloads and under different failure scenarios. We were also interested in a tool that could evaluate and compare new data store systems as they appear in the market or in the open source domain, determine their performance characteristics and limitations, and gauge whether they could be used in production for relevant use cases. For these purposes, we wrote Netflix Data Benchmark (NDBench), a pluggable cloud-enabled benchmarking tool that can be used across any data store system. NDBench provides plugin support for the major data store systems that we use -- Cassandra (Thrift and CQL), Dynomite (Redis), and Elasticsearch. It can also be extended to other client APIs.

Introduction

As Netflix runs thousands of microservices, we are not always aware of the traffic that bundled microservices may generate on our backend systems. Understanding the performance implications of new microservices on our backend systems was also a difficult task. We needed a framework that could assist us in determining the behavior of our data store systems under various workloads, maintenance operations and instance types. We wanted to be mindful of provisioning our clusters, scaling them either horizontally (by adding nodes) or vertically (by upgrading the instance types), and operating under different workloads and conditions, such as node failures, network partitions, etc.


As new data store systems appear in the market, they tend to report performance numbers for the “sweet spot”, and are usually based on optimized hardware and benchmark configurations. Being a cloud-native database team, we want to make sure that our systems can provide high availability under multiple failure scenarios, and that we are utilizing our instance resources optimally. There are many other factors that affect the performance of a database deployed in the cloud, such as instance types, workload patterns, and types of deployments (island vs global). NDBench aids in simulating the performance benchmark by mimicking several production use cases.


There were also some additional requirements; for example, as we upgrade our data store systems (such as Cassandra upgrades) we wanted to test the systems prior to deploying them in production. For systems that we develop in-house, such as Dynomite, we wanted to automate the functional test pipelines, understand the performance of Dynomite under various conditions, and under different storage engines. Hence, we wanted a workload generator that could be integrated into our pipelines prior to promoting an AWS AMI to a production-ready AMI.


We looked into various benchmark tools as well as REST-based performance tools. While some tools covered a subset of our requirements, we were interested in a tool that could achieve the following:
  • Dynamically change the benchmark configurations while the test is running, hence perform tests along with our production microservices.
  • Be able to integrate with platform cloud services such as dynamic configurations, discovery, metrics, etc.
  • Run for an infinite duration in order to introduce failure scenarios and test long running maintenances such as database repairs.
  • Provide pluggable patterns and loads.
  • Support different client APIs.
  • Deploy, manage and monitor multiple instances from a single entry point.
For these reasons, we created Netflix Data Benchmark (NDBench). We incorporated NDBench into the Netflix Open Source ecosystem by integrating it with components such as Archaius for configuration, Spectator for metrics, and Eureka for discovery service. However, we designed NDBench so that these libraries are injected, allowing the tool to be ported to other cloud environments, run locally, and at the same time satisfy our Netflix OSS ecosystem users.

NDBench Architecture

The following diagram shows the architecture of NDBench. The framework consists of three components:
  • Core: The workload generator
  • API: Allowing multiple plugins to be developed against NDBench
  • Web: The UI and the servlet context listener
We currently provide the following client plugins -- Datastax Java Driver (CQL), C* Astyanax (Thrift), Elasticsearch API, and Dyno (Jedis support). Additional plugins can be added, or a user can use dynamic scripts in Groovy to add new workloads. Each driver is just an implementation of the Driver plugin interface.

NDBench-core is the core component of NDBench, where one can further tune workload settings.


Fig. 1: NDBench Architecture


NDBench can be used from either the command line (using REST calls), or from a web-based user interface (UI).

NDBench Runner UI

ndbench-ui-2.jpg
Fig.2: NDBench Runner UI


A screenshot of the NDBench Runner (Web UI) is shown in Figure 2. Through this UI, a user can select a cluster, connect a driver, modify settings, set a load testing pattern (random or sliding window), and finally run the load tests. Selecting an instance while a load test is running also enables the user to view live-updating statistics, such as read/write latencies, requests per second, cache hits vs. misses, and more.

Load Properties

NDBench provides a variety of input parameters that are loaded dynamically and can dynamically change during the workload test. The following parameters can be configured on a per node basis:
  • numKeys: the sample space for the randomly generated keys
  • numValues: the sample space for the generated values
  • dataSize: the size of each value
  • numWriters/numReaders: the number of threads per NDBench node for writes/reads
  • writeEnabled/readEnabled: boolean to enable or disable writes or reads
  • writeRateLimit/readRateLimit: the number of writes per second and reads per seconds
  • userVariableDataSize: boolean to enable or disable the ability of the payload to be randomly generated.

Types of Workload

NDBench offers pluggable load tests. Currently it offers two modes -- random traffic and sliding window traffic. The sliding window test is a more sophisticated test that can concurrently exercise data that is repetitive inside the window, thereby providing a combination of temporally local data and spatially local data. This test is important as we want to exercise both the caching layer provided by the data store system, as well as the disk’s IOPS (Input/Output Operations Per Second).

Load Generation

Load can be generated individually for each node on the application side, or all nodes can generate reads and writes simultaneously. Moreover, NDBench provides the ability to use the “backfill” feature in order to start the workload with hot data. This helps in reducing the ramp up time of the benchmark.

NDBench at Netflix

NDBench has been widely used inside Netflix. In the following sections, we talk about some use cases in which NDBench has proven to be a useful tool.

Benchmarking Tool

A couple of months ago, we finished the Cassandra migration from version 2.0 to 2.1. Prior to starting the process, it was imperative for us to understand the performance gains that we would achieve, as well as the performance hit we would incur during the rolling upgrade of our Cassandra instances. Figures 3 and 4 below illustrate  the p99 and p95 read latency differences using NDBench. In Fig. 3, we highlight the differences between Cassandra 2.0 (blue line) vs 2.1 (brown line).















Fig.3: Capturing OPS and latency percentiles of Cassandra


Last year, we also migrated all our Cassandra instances from the older Red Hat 5.10 OS to Ubuntu 14.04 (Trusty Tahr). We used NDBench to measure performance under the newer operating system. In Figure 4, we showcase the three phases of the migration process by using NDBench’s long-running benchmark capability. We used rolling terminations of the Cassandra instances to update the AMIs with the new OS, and NDBench to verify that there would be no client-side impact during the migration. NDBench also allowed us to validate that the performance of the new OS was better after the migration.


Fig.4: Performance improvement from our upgrade from Red Hat 5.10 to Ubuntu 14.04

AMI Certification Process

NDBench is also part of our AMI certification process, which consists of integration tests and deployment validation. We designed pipelines in Spinnaker and integrated NDBench into them. The following figure shows the bakery-to-release lifecycle. We initially bake an AMI with Cassandra, create a Cassandra cluster, create an NDBench cluster, configure it, and run a performance test. We finally review the results, and make the decision on whether to promote an “Experimental” AMI to a “Candidate”. We use similar pipelines for Dynomite, testing out the replication functionalities with different client-side APIs. Passing the NDBench performance tests means that the AMI is ready to be used in the production environment. Similar pipelines are used across the board for other data store systems at Netflix.
Fig.5 NDBench integrated with Spinnaker pipelines


In the past, we’ve published benchmarks of Dynomite with Redis as a storage engine leveraging NDBench. In Fig. 6 we show some of the higher percentile latencies we derived from Dynomite leveraging NDBench.
Fig.6: P99 latencies for Dynomite with consistency set to DC_QUORUM with NDBench


NDBench allows us to run infinite horizon tests to identify potential memory leaks from long running processes that we develop or use in-house. At the same time, in our integration tests we introduce failure conditions, change the underlying variables of our systems, introduce CPU intensive operations (like repair/reconciliation), and determine the optimal performance based on the application requirements. Finally, our sidecars such as Priam, Dynomite-manager and Raigad perform various activities, such as multi-threaded backups to object storage systems. We want to make sure, through integration tests, that the performance of our data store systems is not affected.

Conclusion

For the last few years, NDBench has been a widely-used tool for functional, integration, and performance testing, as well as AMI validation. The ability to change the workload patterns during a test, support for different client APIs, and integration with our cloud deployments has greatly helped us in validating our data store systems. There are a number of improvements we would like to make to NDBench, both for increased usability and supporting additional features. Some of the features that we would like to work on include:
  • Performance profile management
  • Automated canary analysis
  • Dynamic load generation based on destination schemas
NDBench has proven to be extremely useful for us on the Cloud Database Engineering team at Netflix, and we are happy to have the opportunity to share that value. Therefore, we are releasing NDBench as an open source project, and are looking forward to receiving feedback, ideas, and contributions from the open source community. You can find NDBench on Github at: https://github.com/Netflix/ndbench


If you enjoy the challenges of building distributed systems and are interested in working with the Cloud Database Engineering team in solving next-generation data store problems, check out our job openings.


Authors: Vinay Chella, Ioannis Papapanagiotou, and Kunal Kundaje