Monday, July 30, 2012

Chaos Monkey Released Into The Wild

We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient. We are excited to make a long-awaited announcement today that will help others who embrace this approach.
We have written about our Simian Army in the past and we are now proud to announce that the source code for the founding member of the Simian Army, Chaos Monkey, is available to the community.
Do you think your applications can handle a troop of mischievous monkeys loose in your infrastructure? Now you can find out.

What is Chaos Monkey?

Chaos Monkey is a service which runs in the Amazon Web Services (AWS) that seeks out Auto Scaling Groups (ASGs) and terminates instances (virtual machines) per group. The software design is flexible enough to work with other cloud providers or instance groupings and can be enhanced to add that support. The service has a configurable schedule that, by default, runs on non-holiday weekdays between 9am and 3pm. In most cases, we have designed our applications to continue working when an instance goes offline, but in those special cases that they don't, we want to make sure there are people around to resolve and learn from any problems. With this in mind, Chaos Monkey only runs within a limited set of hours with the intent that engineers will be alert and able to respond.

Why Run Chaos Monkey?

Failures happen and they inevitably happen when least desired or expected. If your application can't tolerate an instance failure would you rather find out by being paged at 3am or when you're in the office and have had your morning coffee? Even if you are confident that your architecture can tolerate an instance failure, are you sure it will still be able to next week? How about next month? Software is complex and dynamic and that "simple fix" you put in place last week could have undesired consequences. Do your traffic load balancers correctly detect and route requests around instances that go offline? Can you reliably rebuild your instances? Perhaps an engineer "quick patched" an instance last week and forgot to commit the changes to your source repository?
There are many failure scenarios that Chaos Monkey helps us detect. Over the last year Chaos Monkey has terminated over 65,000 instances running in our production and testing environments. Most of the time nobody notices, but we continue to find surprises caused by Chaos Monkey which allows us to isolate and resolve them so they don't happen again.

Auto Scaling Groups

The default instance groupings that Chaos uses for selection is Amazon's Auto Scaling Group (ASG). Within an ASG, Chaos Monkey will select an instance at random and terminate it. The ASG should detect the instance termination and automatically bring up a new, identically configured, instance. If you are not using Auto Scaling Groups that should be the first step to making your application handle these isolated instance failure scenarios. Check out Asgard to make deploying and managing ASGs easy. There are many great features for ASGs beyond replacing terminated instances, like enabling the use of Amazon's Elastic Load Balancers (ELBs) to distribute traffic to all instances in your application. Netflix has a best-practice where all instances should be run within an ASG and we have Janitor Monkey to remind us by terminating all instances not following this best-practice.

Configuration

Chaos Monkey allows for an Opt-In or an Opt-Out model. At Netflix, we use the Opt-Out model, so if an application owner does nothing, Chaos Monkey will be acting on their application. For your organization, you have the option to choose what is right for you. This allows you to "test the water" and try out Chaos Monkey on a specific application to see how it reacts. Not every application can trivially handle an instance going offline.  Sometimes it takes a human to manually recover instances, perhaps exercising backups to bring them back. Ideally, engineers work towards making that process easier and faster and eventually automatic. For those applications, there is the ability to Opt-Out of Chaos Monkey. There is also a tunable "probability" that Chaos Monkey uses to control the chance of a termination.  A probability of 1 (or 100%) will terminate one instance per day per ASG.  If instance recovery is difficult and you only want a termination weekly, you can reduce the probability to 0.2 or 20% (daily is 100%, it runs 5 work days per week, so weekly is 20%). Note that this is still a probability and only meaningful when sampled multiple times. With a 20% probability, Chaos Monkey would terminate one instance a week on average. In practice, it might be 2 days in a row followed by 2 weeks of no terminations, but given a large enough sample it will terminate weekly on average. For an environment as large as Netflix, the configuration can get a bit tricky to manage and for this we have developed a dashboard to help that we hope to open source soon. You can read more about how to configure Chaos Monkey on the documentation wiki.

REST

Currently, there is a simple REST interface that allows you to query Chaos Monkey termination events. We keep records of what was terminated and when, so if something disappears, you can see if Chaos Monkey was responsible. You could use this API to get notifications of terminations, but we encourage you to use a more general application monitoring solution like servo to discover what is happening to your applications at runtime.

Costs

The termination events are stored in an Amazon SimpleDB table by default. There could be associated costs with Amazon SimpleDB but the activity of Chaos Monkey should be small enough to fall within Amazon's Free Usage Tier. Ultimately the costs associated with running Chaos Monkey are your responsibility.

More Monkey Business

We have a long line of simians waiting to be released.  The next likely candidate will be Janitor Monkey which helps keep your environment tidy and your costs down.  Stay tuned for more announcements.
If building tools to automate the operations and improve the reliability of the cloud sounds exciting, we're always looking for new members to join the team.  Take a look at jobs.netflix.com for current openings or contact @atseitlin.






Thursday, July 19, 2012

Benchmarking High Performance I/O with SSD for Cassandra on AWS


by Adrian Cockcroft

Today AWS has launched a new Solid State Disk (SSD) based instance that addresses the need for high performance I/O, and we have run a few initial benchmarks to see how it shapes up. With this announcement AWS makes it easy to provision extremely high I/O capacity with consistently low latency. AWS has been competitive in instance memory capacity for a long time and is leading the industry in CPU performance along with 10GBit networks. Now that extremely IO intensive applications can be deployed, a commonly cited obstacle to running in the cloud has been removed.

Benchmarking
Last year we published an Apache Cassandra performance benchmark that achieved over a million client writes per second using hundreds of fairly small EC2 instances. We were testing the scalability of the Priam tooling that we used to create and manage Cassandra, and proved that large scale Cassandra clusters scale up linearly, so ten times the number of instances gets you ten times the throughput. Today we are publishing some benchmark results that include a comparison of Cassandra running on an existing instance type to the new SSD based instance type.

Summary of AWS Instance I/O Options
There are several existing storage options based on internal disks, these are ephemeral - they go away when the instance terminates. The three options that we have previously tested for Cassandra are found in the m1.xlarge, m2.4xlarge, and cc2.8xlarge instances, and this is now joined by the new SSD based hi1.4xlarge. AWS specifies relative total CPU performance for each instance type using EC2 Compute Units (ECU).



Instance TypeCPUMemoryInternal StorageNetwork
m1.xlarge4 CPU threads
8 ECU
15GB RAM4 x 420GB
Disk
1 Gbit
m2.4xlarge8 CPU threads
26 ECU
68GB RAM2 x 840GB
Disk
1 Gbit
hi1.4xlarge
(Westmere)
16 CPU threads
35 ECU
60GB RAM2 x 1000GB
Solid State Disk
10 Gbit
cc2.8xlarge
(Sandybridge)
32 CPU threads
88 ECU
60GB RAM4 x 840GB
Disk
10 Gbit


We primarily use m2.4xlarge to run Cassandra at Netflix today as it has the best balance of CPU, IO and RAM capacity for most of our workloads, although we have had to be careful not to overload the IO with maintenance operations by scheduling compactions and repairs in sequence across the nodes.

The hi1.4xlarge SSD Based Instance
This new instance type provides high performance internal ephemeral SSD based storage. The CPU reported by /proc/cpuinfo is an Intel Westmere E5620 at 2.4GHz with 8 cores and hyper threading, so it appears as 16 CPU threads. This falls between the m2.4xlarge and cc2.8xlarge in CPU performance, with similar RAM capacity, and a 10Gbit network interface like the cc2.8xlarge.

The disk configuration appears as two large SSD volumes of around a terabyte each, and the instance is capable of around 100,000 very low latency IOPS and a gigabyte per second of throughput. This provides hundreds of times higher throughput than can be achieved with other storage options, and has extremely low latency and variance, since the hi1.4xlarge instance has local access to the SSD, and there is no network traffic in the storage path.

Benchmark Results
The first thing to do with a new storage subsystem is basic filesystem level performance testing, we used the iozone benchmark to verify that we could get over 100,000 IOPS and 1 GByte/s of throughput at the disk level, at a very low service time per request, 20 to 60 microseconds.


iozone with 60 threadsI/O per secondKBytes per secondService Time
Sequential Writes16,500 64KB writes1,050,0000.06 ms
Random Reads100,000 4KB reads400,0000.02 ms
Mapped Random Reads56,000 19KB reads1,018,0000.04 ms

The second benchmark was to use the standard Cassandra stress test to run simple data access patterns against a small dataset, similar to the benchmark we published last year. We found that our tests were mostly CPU bound, but we could get close to a gigabyte per second of throughput at the disk for a short while during startup, as the data loaded into memory. The increased CPU performance of 35 ECUs for the hi1.4xlarge over the m2.4xlarge at 26 ECUs gave a useful speedup, but the test wasn't generating enough IOPS.

The third was more complex, we took our biggest Cassandra data store and restored two copies of it from backups, one on m2.4xlarge, and one on hi1.4xlarge, so that we could evaluate a real-world workload and figure out how best to configure the SSD instances as a replacement for the existing configuration. We concentrate on the application level benchmark next as it's the most interesting comparison.

Netflix Application Benchmark
Our architecture is very fine grain, with each development team owning a set of services and data stores. As a result, we have tens of distinct Cassandra clusters in production, each serving up a different data source. The one we picked is storing 8.5TB of data and has a rest based data provider application that currently uses a memcached tier to cache results for the read workload as well as Cassandra for persistent writes. Our goal was to see if we could use a smaller number of SSD based Cassandra instances, and do without the memcached tier, without impacting response times. Our memcached tier is wrapped up in a service we call EVcache that we described in a previous techblog post. The two configurations compared were:

  • Existing system: 48 Cassandra on m2.4xlarge. 36 EVcache on m2.xlarge.
  • SSD based system: 12 Cassandra on hi1.4xlarge.

This application is one of the most complex and demanding workloads we run. It requires tens of thousands of reads and thousands of writes per second. The queries and column family layout are far more complex than the simple stress benchmark. The EVcache tier absorbs most of the reads in the existing system, and the Cassandra instances aren’t using all the available CPU. We use a lot of memory to reduce the IO workload to a sustainable level.

The SSD based system running the same workload had plenty of IOPS left over and could also run compaction operations under full load without affecting response times. The overall throughput of the 12-instance SSD based system was CPU limited to about 20% less than the existing system, but with much lower mean and 99th percentile latency. This sizing exercise indicated that we could replace the 48 m2.4xlarge and 36 m2.xlarge with 15 hi1.4xlarge to get the same throughput, but with much lower latency.


Cost Comparison
We have already found that running Cassandra on EC2 using ephemeral disks and triple replicated instances is a very scalable, reliable and cost effective storage mechanism, despite having to over-configure RAM and CPU capacity to compensate for a relative lack of IOPS in each m2.4xlarge instance. With the hi1.4xlarge SSD instance, the bottleneck moves from IOPS to CPU and we will be able to reduce the instance count substantially.

The relative cost of the two configurations shows that over-all there are cost savings using the SSD instances. There are no per-instance software licensing costs for using Apache Cassandra, but users of commercial data stores could also see a licensing cost saving by reducing instance count.

Benefits of moving Cassandra Workloads to SSD

  • The hi1.4xlarge configuration is about half the system cost for the same throughput.
  • The mean read request latency was reduced from 10ms to 2.2ms.
  • The 99th percentile request latency was reduced from 65ms to 10ms.

Summary
We were able to validate the claimed raw performance numbers for the hi1.4xlarge and in a real world benchmark it gives us a simpler and lower cost solution for running our Cassandra workloads.

TL;DR

What follows is a more detailed explanation of the benchmark configuration and results. TL;DR is short for "too long; don't read". If you get all the way to the end and understand it, you get a prize...

SSD hi1.4xlarge Filesystem Tests with iozone

The Cassandra disk access workload consists of large sequential writes from the SSTable flushes, and small random reads as all the stored versions of keys are checked for a get operation. As more files are written, the number of reads increases, then a compaction replaces a few smaller files with one large one. The iozone benchmark was used to create a similar workload on one hi4.4xlarge instance. The standard data size recommendation for iozone is twice the memory capacity, in this case 120GB is needed.

Using sixty threads to write 2GB files at once using 64KB writes, results in 1099MBytes/s at 0.06ms service time.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          0.19    0.00   41.90   49.77    5.64    2.50

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda1             19.60    52.90 28.00 104.40     0.66     0.61    19.64     0.54    4.11   1.87  24.80
sdb               0.00 52068.10  0.20 15645.50   0.00   549.45    71.92    85.93    5.50   0.06  98.98
sdc               0.00 52708.00  0.40 15027.10   0.00   549.65    74.91   139.66    9.30   0.07  99.31
md0               0.00     0.00  0.60 135509.40  0.00  1099.17    16.61     0.00    0.00   0.00   0.00


Reading back from the sixty files with 4KB random requests gets about 100,000 reads/sec and 400MBytes/s.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          0.98    0.00   19.33   54.64    8.89   16.16

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda1              0.00     0.40  0.00  0.20     0.00     0.00    24.00     0.00    2.50   2.50   0.05
sdb               0.00     0.00 50558.70  0.00 197.53    0.00     8.00    25.84    0.52   0.02  99.96
sdc               0.00     0.00 50483.80  0.00 197.23    0.00     8.00    21.15    0.43   0.02  99.95
md0               0.00     0.00 101041.70 0.00 394.76    0.00     8.00     0.00    0.00   0.00   0.00


Telling iozone to memory map the file that it is reading (as Cassandra does) makes the reads more efficient and started off with over a gigabyte per second of 4KB mapped read requests across 60 threads, with requests being merged and extra data being fetched on each read. As memory filled up the request rate sped up and the data rate dropped.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          0.38    0.00    4.78   27.00    0.75   67.09

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda1              0.00     2.00  0.00  2.00     0.00     0.02    16.00     0.00    2.25   1.00   0.20
sdb            1680.40     0.00 28292.00 0.20 509.26     0.00    36.86    49.88    1.76   0.04 100.01
sdc            1872.20     0.00 28041.10 0.20 508.86     0.00    37.16    84.62    3.02   0.04  99.99
md0               0.00     0.00 59885.50 0.40 1018.09    0.00    34.82     0.00    0.00   0.00   0.00


Versions and Automation
Cassandra itself has moved on significantly from the 0.8.3 build that we used last year, to the 1.0.9 build that we are currently running. We have also built and published automation around the Jmeter workload generation tool, which makes it even easier to run sophisticated performance regression tests.

We currently run Centos 5 Linux and use mdadm to stripe together our disk volumes with default options and the XFS filesystem. No tuning was performed on the Linux or disk configuration for these tests.

For extensive explanation of how Cassandra works please see the previous Netflix Tech Blog Cassandra Benchmark post, and more recent post on the Priam and Jmeter code used to manage the instances and run the benchmark. All this code is Apache 2.0 licensed and hosted at netflix.github.com.

We used Java7 and the following Cassandra configuration tuning in this benchmark:
conf/cassandra-env.sh
MAX_HEAP_SIZE="10G"
HEAP_NEWSIZE="2G"

JVM_OPTS="$JVM_OPTS -XX:+UseCondCardMark"

conf/cassandra.yaml
concurrent_reads: 128
concurrent_writes: 128

rpc_server_type: hsha
rpc_min_threads: 32
rpc_max_threads: 1024

rpc_timeout_in_ms: 5000

dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 60000
dynamic_snitch_badness_threshold: 0.2


Cost Comparison Details for the Netflix Application Benchmark
The configurations were both loaded with the same 8.5TB dataset, so the 48 m2.4xlarge systems had 177 GB per node, and the 12 hi1.4xlarge based systems had 708 GB per node. As usual, we triple-replicate all our data across three AWS Availability Zones, so this is 2.8TB of unique data per zone. We used our test environment and a series of application level stress tests.

Per-hour pricing is appropriate for running benchmarks, but for production use a long lived data store will have instances in use all the time, so the 3-year heavy use reservation provides the best price comparison against the total cost of ownership of on-premise alternatives. Both options are shown below based on US-East pricing (EU-West prices are a little higher).


Instance Type
On-Demand Hourly Cost
3 Year Heavy Use Reservation
3 Year Heavy Use Hourly Cost
Total 3 Year Heavy Use Cost
m2.xlarge
$0.45/hour
$1550
$0.070/hour
$3360
m2.4xlarge
$1.80/hour
$6200
$0.280/hour
$13558
hi1.4xlarge
$3.10/hour
$10960
$0.482/hour
$23627


With the instance counts balanced to get the same throughput:


System Configuration
On-Demand Hourly Cost
Total 3 Year Heavy Use Cost
36 x m2.xlarge + 48 x m2.4xlarge
36 x $0.45 + 48 x $1.80 = $102/hour
$772806
15 x hi1.4xlarge
15 x $3.10 = $46.5/hour
$354405

The usable capacity of the system is reduced by the replication factor of three from the raw capacity. This provides very high availability for the service and very high durability for the data, even if individual instances or entire availability zones are lost. We already established that we get linear scalability for Cassandra with automated deployments up to hundreds of instances, so extremely high performance clusters can easily be built. For the cost shown above the usable durable and available capacity is as follows, for each availability zone containing five instances:


  • 80 CPU threads, 175 ECU
  • 300 GB RAM
  • 10 TB of durable storage.
  • 500,000 low latency IOPS
  • 5 Gigabytes/s of disk throughput
  • 50 Gbits of network capacity


The Prize
If you read this far and made sense of the iostat metrics and Cassandra tuning options, the prize is that we'd like to talk to you, we're hiring in Los Gatos CA for our Cassandra development and operations teams and our performance team. Contact me @adrianco or see http://jobs.netflix.com