Thursday, July 19, 2012

Benchmarking High Performance I/O with SSD for Cassandra on AWS


by Adrian Cockcroft

Today AWS has launched a new Solid State Disk (SSD) based instance that addresses the need for high performance I/O, and we have run a few initial benchmarks to see how it shapes up. With this announcement AWS makes it easy to provision extremely high I/O capacity with consistently low latency. AWS has been competitive in instance memory capacity for a long time and is leading the industry in CPU performance along with 10GBit networks. Now that extremely IO intensive applications can be deployed, a commonly cited obstacle to running in the cloud has been removed.

Benchmarking
Last year we published an Apache Cassandra performance benchmark that achieved over a million client writes per second using hundreds of fairly small EC2 instances. We were testing the scalability of the Priam tooling that we used to create and manage Cassandra, and proved that large scale Cassandra clusters scale up linearly, so ten times the number of instances gets you ten times the throughput. Today we are publishing some benchmark results that include a comparison of Cassandra running on an existing instance type to the new SSD based instance type.

Summary of AWS Instance I/O Options
There are several existing storage options based on internal disks. These are ephemeral: they go away when the instance terminates. The three options that we have previously tested for Cassandra are found in the m1.xlarge, m2.4xlarge, and cc2.8xlarge instances, and these are now joined by the new SSD based hi1.4xlarge. AWS specifies relative total CPU performance for each instance type using EC2 Compute Units (ECU).



Instance Type             | CPU                    | Memory   | Internal Storage            | Network
m1.xlarge                 | 4 CPU threads, 8 ECU   | 15GB RAM | 4 x 420GB Disk              | 1 Gbit
m2.4xlarge                | 8 CPU threads, 26 ECU  | 68GB RAM | 2 x 840GB Disk              | 1 Gbit
hi1.4xlarge (Westmere)    | 16 CPU threads, 35 ECU | 60GB RAM | 2 x 1000GB Solid State Disk | 10 Gbit
cc2.8xlarge (Sandybridge) | 32 CPU threads, 88 ECU | 60GB RAM | 4 x 840GB Disk              | 10 Gbit


We primarily use m2.4xlarge to run Cassandra at Netflix today as it has the best balance of CPU, IO and RAM capacity for most of our workloads, although we have had to be careful not to overload the IO with maintenance operations by scheduling compactions and repairs in sequence across the nodes.

The hi1.4xlarge SSD Based Instance
This new instance type provides high performance internal ephemeral SSD based storage. The CPU reported by /proc/cpuinfo is an Intel Westmere E5620 at 2.4GHz with 8 cores and hyper threading, so it appears as 16 CPU threads. This falls between the m2.4xlarge and cc2.8xlarge in CPU performance, with similar RAM capacity, and a 10Gbit network interface like the cc2.8xlarge.

The disk configuration appears as two large SSD volumes of around a terabyte each, and the instance is capable of around 100,000 very low latency IOPS and a gigabyte per second of throughput. This provides hundreds of times higher throughput than can be achieved with other storage options, and has extremely low latency and variance, since the hi1.4xlarge instance has local access to the SSD, and there is no network traffic in the storage path.

Benchmark Results
The first thing to do with a new storage subsystem is basic filesystem level performance testing. We used the iozone benchmark to verify that we could get over 100,000 IOPS and 1 GByte/s of throughput at the disk level, at a very low service time of 20 to 60 microseconds per request.


iozone with 60 threads | I/O per second     | KBytes per second | Service Time
Sequential Writes      | 16,500 64KB writes | 1,050,000         | 0.06 ms
Random Reads           | 100,000 4KB reads  | 400,000           | 0.02 ms
Mapped Random Reads    | 56,000 19KB reads  | 1,018,000         | 0.04 ms

The second benchmark was to use the standard Cassandra stress test to run simple data access patterns against a small dataset, similar to the benchmark we published last year. We found that our tests were mostly CPU bound, but we could get close to a gigabyte per second of throughput at the disk for a short while during startup, as the data loaded into memory. The increased CPU performance of 35 ECUs for the hi1.4xlarge over the m2.4xlarge at 26 ECUs gave a useful speedup, but the test wasn't generating enough IOPS.

The third benchmark was more complex. We took our biggest Cassandra data store and restored two copies of it from backups, one on m2.4xlarge and one on hi1.4xlarge, so that we could evaluate a real-world workload and figure out how best to configure the SSD instances as a replacement for the existing configuration. We concentrate on the application level benchmark next as it's the most interesting comparison.

Netflix Application Benchmark
Our architecture is very fine grained, with each development team owning a set of services and data stores. As a result, we have tens of distinct Cassandra clusters in production, each serving up a different data source. The one we picked stores 8.5TB of data and has a REST based data provider application that currently uses a memcached tier to cache results for the read workload as well as Cassandra for persistent writes. Our goal was to see if we could use a smaller number of SSD based Cassandra instances, and do without the memcached tier, without impacting response times. Our memcached tier is wrapped up in a service we call EVcache that we described in a previous techblog post. The two configurations compared were:

  • Existing system: 48 Cassandra on m2.4xlarge. 36 EVcache on m2.xlarge.
  • SSD based system: 12 Cassandra on hi1.4xlarge.

This application is one of the most complex and demanding workloads we run. It requires tens of thousands of reads and thousands of writes per second. The queries and column family layout are far more complex than the simple stress benchmark. The EVcache tier absorbs most of the reads in the existing system, and the Cassandra instances aren’t using all the available CPU. We use a lot of memory to reduce the IO workload to a sustainable level.

The SSD based system running the same workload had plenty of IOPS left over and could also run compaction operations under full load without affecting response times. The overall throughput of the 12-instance SSD based system was CPU limited to about 20% less than the existing system, but with much lower mean and 99th percentile latency. This sizing exercise indicated that we could replace the 48 m2.4xlarge and 36 m2.xlarge with 15 hi1.4xlarge to get the same throughput, but with much lower latency.
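The step from 12 to 15 instances can be reproduced with a short calculation. This is a minimal sketch, assuming throughput scales linearly with instance count once the cluster is CPU limited (the linear scaling result from last year's benchmark) and reading "about 20% less" as 80% of the target. The function name is illustrative only.

```python
import math


def instances_for_target(current_instances: int, relative_throughput: float) -> int:
    """Scale a CPU-limited cluster linearly to reach a target throughput.

    relative_throughput is the measured throughput as a fraction of the
    target, so 0.80 means the cluster delivered 80% of what is needed.
    """
    # Round away floating point noise before rounding up to whole instances.
    return math.ceil(round(current_instances / relative_throughput, 6))


# 12 hi1.4xlarge instances delivered about 80% of the existing throughput.
print(instances_for_target(12, 0.80))
```

Under these assumptions the result is 15, which matches the sizing quoted above.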


Cost Comparison
We have already found that running Cassandra on EC2 using ephemeral disks and triple replicated instances is a very scalable, reliable and cost effective storage mechanism, despite having to over-configure RAM and CPU capacity to compensate for a relative lack of IOPS in each m2.4xlarge instance. With the hi1.4xlarge SSD instance, the bottleneck moves from IOPS to CPU and we will be able to reduce the instance count substantially.

The relative cost of the two configurations shows that overall there are cost savings using the SSD instances. There are no per-instance software licensing costs for using Apache Cassandra, but users of commercial data stores could also see a licensing cost saving by reducing instance count.

Benefits of moving Cassandra Workloads to SSD

  • The hi1.4xlarge configuration is about half the system cost for the same throughput.
  • The mean read request latency was reduced from 10ms to 2.2ms.
  • The 99th percentile request latency was reduced from 65ms to 10ms.

Summary
We were able to validate the claimed raw performance numbers for the hi1.4xlarge and in a real world benchmark it gives us a simpler and lower cost solution for running our Cassandra workloads.

TL;DR

What follows is a more detailed explanation of the benchmark configuration and results. TL;DR is short for "too long; didn't read". If you get all the way to the end and understand it, you get a prize...

SSD hi1.4xlarge Filesystem Tests with iozone

The Cassandra disk access workload consists of large sequential writes from the SSTable flushes, and small random reads as all the stored versions of keys are checked for a get operation. As more files are written, the number of reads increases, then a compaction replaces a few smaller files with one large one. The iozone benchmark was used to create a similar workload on one hi1.4xlarge instance. The standard data size recommendation for iozone is twice the memory capacity, so in this case 120GB is needed.

Using sixty threads to write sixty 2GB files concurrently with 64KB writes results in 1099MBytes/s at 0.06ms service time.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          0.19    0.00   41.90   49.77    5.64    2.50

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda1             19.60    52.90 28.00 104.40     0.66     0.61    19.64     0.54    4.11   1.87  24.80
sdb               0.00 52068.10  0.20 15645.50   0.00   549.45    71.92    85.93    5.50   0.06  98.98
sdc               0.00 52708.00  0.40 15027.10   0.00   549.65    74.91   139.66    9.30   0.07  99.31
md0               0.00     0.00  0.60 135509.40  0.00  1099.17    16.61     0.00    0.00   0.00   0.00


Reading back from the sixty files with 4KB random requests gets about 100,000 reads/sec and 400MBytes/s.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          0.98    0.00   19.33   54.64    8.89   16.16

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda1              0.00     0.40  0.00  0.20     0.00     0.00    24.00     0.00    2.50   2.50   0.05
sdb               0.00     0.00 50558.70  0.00 197.53    0.00     8.00    25.84    0.52   0.02  99.96
sdc               0.00     0.00 50483.80  0.00 197.23    0.00     8.00    21.15    0.43   0.02  99.95
md0               0.00     0.00 101041.70 0.00 394.76    0.00     8.00     0.00    0.00   0.00   0.00


Telling iozone to memory map the file that it is reading (as Cassandra does) makes the reads more efficient. The test started off with over a gigabyte per second of 4KB mapped read requests across 60 threads, with requests being merged and extra data being fetched on each read. As memory filled up, the request rate increased and the data rate dropped.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          0.38    0.00    4.78   27.00    0.75   67.09

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda1              0.00     2.00  0.00  2.00     0.00     0.02    16.00     0.00    2.25   1.00   0.20
sdb            1680.40     0.00 28292.00 0.20 509.26     0.00    36.86    49.88    1.76   0.04 100.01
sdc            1872.20     0.00 28041.10 0.20 508.86     0.00    37.16    84.62    3.02   0.04  99.99
md0               0.00     0.00 59885.50 0.40 1018.09    0.00    34.82     0.00    0.00   0.00   0.00
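The iostat columns above can be cross-checked against each other. The sketch below assumes the usual iostat conventions: avgrq-sz is reported in 512-byte sectors and rMB/s and wMB/s use megabytes of 2**20 bytes. It also applies Little's law (requests in flight = request rate x time in system) to compare await with avgqu-sz. The helper names are illustrative only.

```python
SECTOR_BYTES = 512  # assumed unit of the avgrq-sz column


def mb_per_sec(requests_per_sec: float, avgrq_sz_sectors: float) -> float:
    """Throughput implied by request rate and average request size, in MB/s."""
    return requests_per_sec * avgrq_sz_sectors * SECTOR_BYTES / 2**20


def queue_depth(requests_per_sec: float, await_ms: float) -> float:
    """Little's law: average requests in flight = rate x time in system."""
    return requests_per_sec * await_ms / 1000.0


# md0 rows from the three runs: (label, r/s + w/s, avgrq-sz, reported MB/s)
md0_rows = [
    ("sequential write", 0.60 + 135509.40, 16.61, 1099.17),
    ("random read", 101041.70, 8.00, 394.76),
    ("mapped random read", 59885.50 + 0.40, 34.82, 1018.09),
]
for label, rate, size, reported in md0_rows:
    computed = mb_per_sec(rate, size)
    print(f"{label}: computed {computed:.1f} MB/s, reported {reported} MB/s")

# sdb during the write test: 0.20 r/s + 15645.50 w/s with await of 5.50 ms
depth = queue_depth(0.20 + 15645.50, 5.50)
print(f"sdb queue depth: {depth:.1f} (avgqu-sz reported 85.93)")
```

The computed values should land within a fraction of a percent of the reported columns, which is a quick way to confirm that the request rate, request size and throughput figures describe the same workload.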


Versions and Automation
Cassandra itself has moved on significantly from the 0.8.3 build that we used last year, to the 1.0.9 build that we are currently running. We have also built and published automation around the Jmeter workload generation tool, which makes it even easier to run sophisticated performance regression tests.

We currently run CentOS 5 Linux and use mdadm to stripe together our disk volumes with default options and the XFS filesystem. No tuning was performed on the Linux or disk configuration for these tests.

For an extensive explanation of how Cassandra works please see the previous Netflix Tech Blog Cassandra Benchmark post, and the more recent post on the Priam and Jmeter code used to manage the instances and run the benchmark. All this code is Apache 2.0 licensed and hosted at netflix.github.com.

We used Java7 and the following Cassandra configuration tuning in this benchmark:
conf/cassandra-env.sh
MAX_HEAP_SIZE="10G"
HEAP_NEWSIZE="2G"

JVM_OPTS="$JVM_OPTS -XX:+UseCondCardMark"

conf/cassandra.yaml
concurrent_reads: 128
concurrent_writes: 128

rpc_server_type: hsha
rpc_min_threads: 32
rpc_max_threads: 1024

rpc_timeout_in_ms: 5000

dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 60000
dynamic_snitch_badness_threshold: 0.2


Cost Comparison Details for the Netflix Application Benchmark
The configurations were both loaded with the same 8.5TB dataset, so the 48 m2.4xlarge systems had 177 GB per node, and the 12 hi1.4xlarge based systems had 708 GB per node. As usual, we triple-replicate all our data across three AWS Availability Zones, so this is 2.8TB of unique data per zone. We used our test environment and a series of application level stress tests.
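The per-node figures follow from dividing the dataset evenly across the nodes. This is a minimal sketch, assuming the 8.5TB figure is the total stored across all replicas, that data is spread evenly, and that 1TB = 1000GB. The 15-node line is an extrapolation to the sizing result, not a measured configuration.

```python
DATASET_TB = 8.5          # total data stored across the cluster
REPLICATION_FACTOR = 3    # one copy in each of three availability zones


def gb_per_node(dataset_tb: float, nodes: int) -> float:
    """Data held by each node if the dataset is spread evenly."""
    return dataset_tb * 1000 / nodes


print(round(gb_per_node(DATASET_TB, 48)))   # m2.4xlarge cluster
print(round(gb_per_node(DATASET_TB, 12)))   # hi1.4xlarge cluster as tested
print(round(gb_per_node(DATASET_TB, 15)))   # hypothetical 15-node cluster
print(round(DATASET_TB / REPLICATION_FACTOR, 1))  # unique data per zone, TB
```

With these assumptions the first two lines give 177 and 708, and the last gives 2.8, in line with the numbers quoted above.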

Per-hour pricing is appropriate for running benchmarks, but for production use a long lived data store will have instances in use all the time, so the 3-year heavy use reservation provides the best price comparison against the total cost of ownership of on-premise alternatives. Both options are shown below based on US-East pricing (EU-West prices are a little higher).


Instance Type | On-Demand Hourly Cost | 3 Year Heavy Use Reservation | 3 Year Heavy Use Hourly Cost | Total 3 Year Heavy Use Cost
m2.xlarge     | $0.45/hour            | $1550                        | $0.070/hour                  | $3360
m2.4xlarge    | $1.80/hour            | $6200                        | $0.280/hour                  | $13558
hi1.4xlarge   | $3.10/hour            | $10960                       | $0.482/hour                  | $23627
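The three-year totals appear to be the up-front reservation fee plus the reserved hourly rate for every hour of the term. This is a sketch of that arithmetic, assuming three 365-day years (26,280 hours). The formula is inferred from the figures and is not stated in the post.

```python
HOURS_IN_3_YEARS = 3 * 365 * 24  # 26,280 hours, assuming continuous use


def three_year_cost(upfront: float, reserved_hourly: float) -> float:
    """Up-front reservation fee plus hourly charges for the full term."""
    return upfront + reserved_hourly * HOURS_IN_3_YEARS


# (reservation fee, reserved hourly rate) taken from the pricing table
pricing = {
    "m2.xlarge": (1550, 0.070),
    "m2.4xlarge": (6200, 0.280),
    "hi1.4xlarge": (10960, 0.482),
}
for name, (upfront, hourly) in pricing.items():
    print(f"{name}: ${three_year_cost(upfront, hourly):,.0f}")
```

This reproduces the $13558 and $23627 totals. For m2.xlarge the same arithmetic gives roughly $3390 rather than the $3360 listed, so that entry may reflect a rounded hourly rate or a typo.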


With the instance counts balanced to get the same throughput:


System Configuration             | On-Demand Hourly Cost                      | Total 3 Year Heavy Use Cost
36 x m2.xlarge + 48 x m2.4xlarge | 36 x $0.45 + 48 x $1.80 = $102.60/hour     | $772806
15 x hi1.4xlarge                 | 15 x $3.10 = $46.50/hour                   | $354405
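The relative cost of the two configurations can be checked directly from the instance counts and prices. This is a minimal sketch using the on-demand prices and the published three-year totals. The dictionary layout and function name are illustrative only.

```python
# (instance count, on-demand price per hour) for each configuration
existing = {"m2.xlarge": (36, 0.45), "m2.4xlarge": (48, 1.80)}
ssd = {"hi1.4xlarge": (15, 3.10)}


def hourly_cost(fleet: dict) -> float:
    """Total on-demand cost per hour for a fleet of instances."""
    return sum(count * price for count, price in fleet.values())


existing_hourly = hourly_cost(existing)
ssd_hourly = hourly_cost(ssd)
print(f"existing: ${existing_hourly:.2f}/hour, SSD: ${ssd_hourly:.2f}/hour")
print(f"on-demand cost ratio: {ssd_hourly / existing_hourly:.2f}")

# Ratio of the published three-year heavy use totals
print(f"3 year cost ratio: {354405 / 772806:.2f}")
```

Both ratios come out a little under one half, which is the basis for describing the SSD configuration as about half the system cost.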

The usable capacity of the system is reduced by the replication factor of three from the raw capacity. This provides very high availability for the service and very high durability for the data, even if individual instances or entire availability zones are lost. We already established that we get linear scalability for Cassandra with automated deployments up to hundreds of instances, so extremely high performance clusters can easily be built. For the cost shown above, the usable, durable and available capacity of each availability zone containing five instances is as follows:


  • 80 CPU threads, 175 ECU
  • 300 GB RAM
  • 10 TB of durable storage.
  • 500,000 low latency IOPS
  • 5 Gigabytes/s of disk throughput
  • 50 Gbits of network capacity
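The per-zone figures are five times the per-instance figures for hi1.4xlarge. This is a small sketch of that multiplication, using the instance specification from the table earlier in the post and the roughly 100,000 IOPS and 1 GByte/s measured with iozone. The class and field names are illustrative only.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class InstanceSpec:
    cpu_threads: int
    ecu: int
    ram_gb: int
    storage_tb: float
    iops: int
    disk_gbytes_per_sec: float
    network_gbit: int


# hi1.4xlarge: 16 threads, 35 ECU, 60GB RAM, 2 x 1000GB SSD, 10 Gbit network
HI1_4XLARGE = InstanceSpec(16, 35, 60, 2.0, 100_000, 1.0, 10)


def zone_capacity(spec: InstanceSpec, instances: int) -> dict:
    """Aggregate capacity of one availability zone holding identical instances."""
    return {field: value * instances for field, value in asdict(spec).items()}


for field, total in zone_capacity(HI1_4XLARGE, 5).items():
    print(f"{field}: {total}")
```

Because each zone holds a full copy of the data, the storage of one zone is the usable durable capacity of the whole 15-instance cluster.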


The Prize
If you read this far and made sense of the iostat metrics and Cassandra tuning options, the prize is that we'd like to talk to you. We're hiring in Los Gatos CA for our Cassandra development and operations teams and our performance team. Contact me @adrianco or see http://jobs.netflix.com