Netflix has been rolling out the Apache Cassandra NoSQL data store for production use over the last six months. As part of our benchmarking we recently decided to run a test designed to validate our tooling and automation scalability as well as the performance characteristics of Cassandra. Adrian presented these results at the High Performance Transaction Systems workshop last week.
We picked a write oriented benchmark using the standard Cassandra "stress" tool that is part of the product, and Denis ran and analyzed the tests on Amazon EC2 instances. Writes stress a data store all the way to the disks, while read benchmarks may only exercise the in-memory cache. The benchmark results should be reproducible by anyone, but the Netflix cloud platform automation for AWS makes it quick and easy to do this kind of test.
The automated tooling that Netflix has developed lets us quickly deploy large scale Cassandra clusters, in this case a few clicks on a web page and about an hour to go from nothing to a very large Cassandra cluster consisting of 288 medium sized instances, with 96 instances in each of three EC2 availability zones in the US-East region. Using an additional 60 instances as clients running the stress program we ran a workload of 1.1 million client writes per second. Data was automatically replicated across all three zones making a total of 3.3 million writes per second across the cluster. The entire test was able to complete within two hours with a total cost of a few hundred dollars, and these EC2 instances were only in existence for the duration of the test. There was no setup time, no discussions with IT operations about datacenter space and no more cost once the test was over.
To measure scalability, the same test was run with 48, 96, 144 and 288 instances, with 10, 20, 30 and 60 clients respectively. The load on each instance was very similar in all cases, and the throughput scaled linearly as we increased the number of instances. Our previous benchmarks and production roll-out had resulted in many application specific Cassandra clusters from 6 to 48 instances, so we were very happy to see linear scale to six times the size of our current largest deployment. This benchmark went from concept to final result in five days as a spare time activity alongside other work, using our standard production configuration of Cassandra 0.8.6, running in our test account. The time taken by EC2 to create 288 new instances was about 15 minutes out of our total of 66 minutes. The rest of the time was taken to boot Linux, start the Apache Tomcat JVM that runs our automation tooling, start the Cassandra JVM and join the "ring" that makes up the Cassandra data store. For a more typical 12 instance Cassandra cluster the same sequence takes 8 minutes.
The Netflix cloud systems group recently created a Cloud Performance Team to focus on characterizing the performance of components such as Cassandra, and helping other teams make their code and AWS usage more efficient to reduce latency for customers and costs for Netflix. This team is currently looking for an additional engineer.
TL;DRThe rest of this post is the details of what we ran and how it worked, so that other performance teams working with Cassandra on AWS can leverage our work and replicate and extend our results.
EC2 ConfigurationThe EC2 instances used to run Cassandra in this test are known as M1 Extra Large (m1.xl), they have four medium speed CPUs, 15GB RAM and four disks of 400GB each. The total CPU power is rated by Amazon as 8 units. The other instance type we commonly use for Cassandra is an M2 Quadruple Extra Large (m2.4xl) which has eight (faster) CPUs, 68GB RAM and two disks of 800GB each, total 26 units of CPU power, so about three times the capacity. We use these for read intensive workloads to cache more data in memory. Both these instance types have a one Gbit/s network. There is also a Cluster Compute Quadruple Extra Large (cc.4xl) option with eight even faster CPUs (total 33.5 units), 23GB RAM, two 800GB disks and a low latency 10Gbit network. We haven't tried that option yet. In this case we were particularly interested in pushing the instance count to a high level to validate our tooling, so picked a smaller instance option. All four disks were striped together using CentOS 5.6 md and XFS. The Cassandra commit log and data files were all stored on this filesystem.
The 60 instances that ran stress as clients were m2.4xl instances running in a single availability zone, so two thirds of the client traffic was cross-zone, this slightly increased the mean client latency.
All instances are created using the EC2 auto-scaler feature. A single request is made to set the desired size of an auto-scale group (ASG), specifying an Amazon Machine Image (AMI) that contains the standard Cassandra 0.8.6 distribution plus Netflix specific tooling and instrumentation. Three ASGs are created, one in each availability zone, which are separate data-centers separated by about one millisecond of network latency. EC2 automatically creates the instances in each availability zone and maintains them at the set level. If an instance dies for any reason, the ASG automatically creates a replacement instance and the Netflix tooling manages bootstrap replacement of that node in the Cassandra cluster. Since all data is replicated three ways, effectively in three different datacenter buildings, this configuration is very highly available.
Here's an example screenshot of the configuration page for a small Cassandra ASG, editing the size fields and saving the change is all that is needed to create a cluster at any size.
Netflix Automation for Cassandra - PriamNetflix has also implemented automation for taking backups and archiving them in the Simple Storage Service (S3), and we can perform a rolling upgrade to a new version of Cassandra without any downtime. It is also possible to efficiently double the size of a Cassandra cluster while it is running. Each new node buddies up and splits the data and load of one of the existing nodes so that data doesn't have to be reshuffled too much. If a node fails, it's replacement has a different IP address, but we want it to have the same token, and the original Cassandra replacement mechanisms had to be extended to handle this case cleanly. We call the automation "Priam", after Cassandra's father in Greek mythology, it runs in a separate Apache Tomcat JVM and we are in the process of removing Netflix specific code from Priam so that we can release it as an open source project later this year. We have already released an Apache Zookeeper interface called Curator at Github and also plan to release a Java client library called Astyanax (the son of Hector, who was Cassandra's brother, and Hector is also the name of a commonly used Java client library for Cassandra that we have improved upon). We are adding Greek mythology to our rota of interview questions :-)
Scale-Up LinearityThe scalability is linear as shown in the chart below. Each client system generates about 17,500 write requests per second, and there are no bottlenecks as we scale up the traffic. Each client ran 200 threads to generate traffic across the cluster.
Per-Instance ActivityThe next step is to look at the average activity level on each instance for each of these tests to look for bottlenecks. A summary is tabulated below.
The writes per server are similar as we would expect, and the mean latency measured at the server remains low as the scale increases. The response time measured at the client was about 11ms, with about 1.2ms due to network latency and the rest from the Thrift client library overhead and scheduling delays as the threads pick up responses. The write latency measured at each Cassandra server is a small fraction of a millisecond (explained in detail later). Average server side latency of around one millisecond is what we typically see on our production Cassandra clusters with a more complex mixture of read and write queries. The CPU load is a little higher for the largest cluster. This could be due to a random fluctuation in the test, which we only ran once, variations in the detailed specification of the m1.xl instance type, or an increase in gossip or connection overhead for the larger cluster. Disk writes are caused by commit log writes and large sequential SSTable writes. Disk reads are due to the compaction processing that combines Cassandra SSTables in the background. Network traffic is dominated by the Cassandra inter-node replication messages.
Costs of Running This BenchmarkBenchmarking can take a lot of time and money, there are many permutations of factors to test so the cost of each test in terms of setup time and compute resources used can be a real limitation on how many tests are performed. Using the Netflix cloud platform automation for AWS a dramatic reduction in setup time and cost means that we can easily run more and bigger tests. The table below shows the test duration and AWS cost at the normal list price. This cost could be further reduced by using spot pricing, or by sharing unused reservations with the production site.
The usable storage capacity for a Cassandra 0.8.6 instance is half the available filesystem space, because Cassandra's current compaction algorithm needs space to compact into. This changes with Cassandra 1.0, which has an improved compaction algorithm and on-disk compression of the SSTables. The m1.xl instances cost $0.68 per hour and the m2.4xl test driver instances cost $2.00 per hour. We ran this test in our default configuration which is highly available by locating replicas in three availability zones, there is a cost for this, since AWS charges $0.01 per gigabyte for cross zone traffic. An estimation of cross zone traffic was made as two thirds of the total traffic and for this network intense test it actually cost more per hour than the instances. The test itself was run for about ten minutes, which was long enough to show a clear steady state load level. Taking the setup time into account, the smaller tests can be completed within an hour, the largest test needed a second hour for the nodes only.
Unlike conventional datacenter testing, we didn't need to ask permission, wait for systems to be configured for us, or put up with a small number of dedicated test systems. We could also run as many tests as we like at the same time, since they don't use the same resources. Denis has developed scripting to create, monitor, analyze and plot the results of these tests. For example here's the client side response time plot from a single client generating about 20,000 requests/sec.
Detailed Cassandra ConfigurationThe client requests use consistency level "ONE" which means that they are acknowledged when one node has acknowledged the data. For consistent read after write data access our alternative pattern is to use "LOCAL QUORUM". In that case the client acknowledgement waits for two out of the three nodes to acknowledge the data and the write response time increases slightly, but the work done by the Cassandra cluster is essentially the same. As an aside, in our multi-region testing we have found that network latency and Cassandra response times are lower in the AWS Europe region than in US East, which could be due to a smaller scale deployment or newer networking hardware. We don't think network contention from other AWS tenants was a significant factor in this test.
Stress command lineThe client is writing 10 columns per row key, row key randomly chosen from 27 million ids, each column has a key and 10 bytes of data. The total on disk size for each write including all overhead is about 400 bytes.
java -jar stress.jar -d "144 node ids" -e ONE -n 27000000 -l 3 -i 1 -t 200 -p 7102 -o INSERT -c 10 -r
Thirty clients talk to the first 144 nodes and 30 talk to the second 144. For the Insert we write three replicas which is specified in the keyspace.
Cassandra Keyspace Configuration
Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
Durable Writes: true
Key Validation Class: org.apache.cassandra.db.marshal.BytesType
Default column value validator: org.apache.cassandra.db.marshal.BytesType
Columns sorted by: org.apache.cassandra.db.marshal.BytesType
Row cache size / save period in seconds: 0.0/0
Key cache size / save period in seconds: 200000.0/14400
Memtable thresholds: 1.7671875/1440/128 (millions of ops/minutes/MB)
GC grace seconds: 864000
Compaction min/max thresholds: 4/32
Read repair chance: 0.0
Replicate on write: true
Data Flows, Latency and DurabilityTo understand latency and durability we need to look at the data flows for writes for different Cassandra configurations. Cassandra client's are not normally aware of which node should store their data, so they pick a node at random which then acts as a coordinator to send replicas of the data to the correct nodes (which are picked using a consistent hash of the row key). For the fastest writes a single node has to acknowledge the write. This is useful when replacing an ephemeral memcached oriented data store with Cassandra, where we want to avoid the cold cache issues associated with failed memcached instances, but speed and availability is more important than consistency. The additional overhead compared to memcached is an extra network hop. However an immediate read after write may get the old data, which will be eventually consistent.
To get immediately consistent writes with Cassandra we use a quorum write. Two out of three nodes must acknowledge the write before the client gets its ack so the writes are durable. In addition, if a read after a quorum write also uses quorum, it will always see the latest data, since the combination of two out of three in both cases must include an overlap. Since there is no concept of a master node for data in Cassandra, it is always possible to read and write, even when a node has failed.
The Cassandra commit log flushes to disk with an fsync call every 10 seconds by default. This means that there is up to ten seconds where the committed data is not actually on disk, but it has been written to memory in three different instances in three different availability zones (i.e. datacenter buildings). The chance of losing all three copies in the same time window is small enough that this provides excellent durability, along with high availability and low latency. The latency for each Cassandra server that receives a write is just the few microseconds it takes to queue the data to the commit log writer.
Cassandra implements a gossip protocol that lets every node know the state of all the others, if the target of a write is down the coordinator node remembers that it didn't get the data, which is known as "hinted handoff". When gossip tells the coordinator that the node has recovered, it delivers the missing data.
Netflix is currently testing and setting up global Cassandra clusters to support expansion into the UK and Ireland using the AWS Europe region located in Ireland. For use cases that need a global view of the data, an extra set of Cassandra nodes are configured to provide an asynchronously updated replica of all the data written on each side. There is no master copy, and both regions continue to work if the connections between them fail. In that case we use a local quorum for reads and writes, which sends the data remotely, but doesn't wait for it, so latency is not impacted by the remote region. This gives consistent access in the local region with eventually consistent access in the remote region along with high availability.
Next StepsThere are a lot more tests that could be run, we are currently working to characterize the performance of Cassandra 1.0 and multi-region performance for both reads and writes, with more complex query combinations and client libraries.
TakeawayNetflix is using Cassandra on AWS as a key infrastructure component of its globally distributed streaming product. Cassandra scales linearly far beyond our current capacity requirements, and very rapid deployment automation makes it easy to manage. In particular, benchmarking in the cloud is fast, cheap and scalable, once you try it, you won't go back.
If you are the kind of performance geek that has read this far and wishes your current employer would let you spin up huge tests in minutes and open source the tools you build, perhaps you should give us a call...