Tuesday, March 13, 2012

JMeter Plugin for Cassandra


By Vijay Parthasarathy and Denis Sheahan

A number of previous blog posts have discussed our adoption of Cassandra as a NoSQL solution in the cloud. We now have over 55 Cassandra clusters in the cloud and are moving our source of truth from our datacenter to these clusters. As part of this move we have not only contributed to Cassandra itself but also developed software to ease its deployment and use. We plan to open source as much of this software as possible.

We recently announced the open sourcing of Priam, a co-process that runs alongside Cassandra on every node to provide backup and recovery, bootstrapping, token assignment, configuration management and a RESTful interface to monitoring and metrics. In January we also announced our Cassandra Java client, Astyanax, which is built on top of Thrift and provides lower latency, reduced latency variance, and better error handling.

At Netflix we have recently started to standardize our load testing across the fleet using Apache JMeter. As Cassandra is a key part of our infrastructure that needs to be tested, we developed a JMeter plugin for Cassandra. In this post we discuss the plugin and present performance data for Astyanax vs Thrift collected using this plugin.

Cassandra JMeter Plugin

JMeter allows us to customize our test cases based on our application logic and data model. The Cassandra JMeter plugin we are releasing today is described on the GitHub wiki here. It consists of a jar file that is placed in JMeter's lib/ext directory. The instructions to build and install the jar file are here.

An example screenshot is shown below.


Benchmark Setup

We set up a simple 6-node Cassandra cluster using EC2 m2.4xlarge instances and the following schema:

create keyspace MemberKeySp
with placement_strategy = 'NetworkTopologyStrategy'
and strategy_options = [{us-east : 3}]
and durable_writes = true;
use MemberKeySp;
create column family Customer
with column_type = 'Standard'
and comparator = 'UTF8Type'
and default_validation_class = 'BytesType'
and key_validation_class = 'UTF8Type'
and rows_cached = 0.0
and keys_cached = 100000.0
and read_repair_chance = 0.0
and comment = 'Customer Records';

Six million rows were then inserted into the cluster with a replication factor of 3. Each row has 19 columns of simple ASCII data. The total data set is 2.9GB per node, so it is easily cacheable on our instances, which have 68GB of memory. We wanted to test the latency of the client implementation using a single Get Range Slice operation, i.e. a 100% read-only workload. Each test was run twice to ensure the data was indeed cached, which we confirmed with iostat. One hundred JMeter threads were used to apply the load, with 100 connections from JMeter to each node of Cassandra. Each JMeter thread therefore has at least 6 connections to choose from when sending its request to Cassandra.
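As a rough sanity check, the setup figures above can be worked through in a few lines of Python. This is only back-of-the-envelope arithmetic: it assumes replicas are spread evenly across the six nodes and treats 1GB as 10^9 bytes, neither of which is stated in the setup.

```python
# Back-of-the-envelope figures for the benchmark setup described above.
ROWS = 6_000_000              # rows inserted
REPLICATION_FACTOR = 3
NODES = 6
DATA_PER_NODE_BYTES = 2.9e9   # 2.9GB per node (assumes 1GB = 10**9 bytes)
CONNECTIONS_PER_NODE = 100
JMETER_THREADS = 100

# Assumes replicas are evenly distributed around the ring.
row_replicas_per_node = ROWS * REPLICATION_FACTOR // NODES
bytes_per_row = DATA_PER_NODE_BYTES / row_replicas_per_node
total_connections = CONNECTIONS_PER_NODE * NODES
connections_per_thread = total_connections // JMETER_THREADS

print(row_replicas_per_node)    # 3000000
print(round(bytes_per_row))     # 967
print(total_connections)        # 600
print(connections_per_thread)   # 6
```

Under these assumptions each node holds about 3 million row replicas at roughly 1KB each, and the 600 connections shared by 100 threads give the 6 connections per thread mentioned above.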

Every Cassandra JMeter Thread Group has a Config Element called CassandraProperties, which contains clientType among other properties. For Astyanax, clientType is set to com.netflix.jmeter.connections.a6x.AstyanaxConnection; for Thrift, it is set to com.netflix.jmeter.connections.thrift.ThriftConnection.

Token Aware is the default JMeter setting. If you wish to experiment with other settings, create a properties file, cassandra.properties, in the JMeter home directory with properties from the list below.

astyanax.connection.discovery=
astyanax.connection.pool=
astyanax.connection.latency.stategy=
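
To make "Token Aware" more concrete: a token-aware client hashes the row key, works out which nodes own that token, and sends the request straight to one of them instead of to an arbitrary coordinator. The sketch below is a simplified toy ring in Python, not Astyanax's implementation. The node names, the evenly spaced tokens, the MD5-based hash, and the function names are all assumptions made for illustration, and it ignores the rack and datacenter awareness of NetworkTopologyStrategy.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # token space used by this toy ring (assumption)


def token_for(key: str) -> int:
    """Map a row key to a token by hashing it (MD5 here, for illustration)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def build_ring(nodes):
    """Assign evenly spaced tokens to the given nodes, sorted by token."""
    step = RING_SIZE // len(nodes)
    return [(i * step, node) for i, node in enumerate(nodes)]


def replicas_for(key, ring, replication_factor=3):
    """Return the nodes holding the key: the first node whose token is
    >= the key's token, plus the next nodes clockwise around the ring."""
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token_for(key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(replication_factor)]


ring = build_ring([f"node{i}" for i in range(6)])  # hypothetical node names
print(replicas_for("customer-42", ring))
```

A client that is not token aware would pick any node, and that node would then forward the request to a replica, adding a network hop that the token-aware client avoids.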

Results

Transaction throughput

This graph shows the throughput at 5-second intervals for the Token Aware client vs the Thrift client. Token Aware throughput is consistently higher than Thrift, and on average it is 3% better.

Average Latency

JMeter reports response times to millisecond granularity. The Token Aware implementation responds in 2ms the majority of the time, with occasional 3ms periods; the average is 2.29ms. The Thrift implementation is consistently at 3ms. So Astyanax has about a 30% better response time than the raw Thrift implementation without a token-aware connection pool.
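
The latency figures above can be related with a little arithmetic. The Python sketch below assumes every Token Aware sample is reported as exactly 2ms or 3ms (the text only says "majority" and "occasional"), and it takes the Thrift average as exactly 3ms even though JMeter rounds to the millisecond.

```python
token_aware_avg_ms = 2.29   # reported average for the Token Aware client
thrift_avg_ms = 3.0         # Thrift is reported as consistently 3ms

# If every sample is exactly 2ms or 3ms (an assumption), the average pins
# down the share of 3ms samples: 2 * (1 - p) + 3 * p = avg  =>  p = avg - 2
share_at_3ms = token_aware_avg_ms - 2.0

# Two ways to express the gap between the two clients.
latency_reduction = (thrift_avg_ms - token_aware_avg_ms) / thrift_avg_ms
speed_ratio = thrift_avg_ms / token_aware_avg_ms

print(f"{share_at_3ms:.0%} of samples at 3ms")               # 29%
print(f"{latency_reduction:.1%} lower latency than Thrift")  # 23.7%
print(f"{speed_ratio:.2f}x as fast as Thrift")               # 1.31x
```

The "about 30%" figure matches the ratio 3 / 2.29, which is roughly 1.31. Expressed as a reduction in latency relative to Thrift, the same gap is closer to 24%.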

The plugin provides a wide range of samplers for Put, Composite Put, Batch Put, Get, Composite Get, Range Get and Delete. The GitHub wiki has examples for all these scenarios, including jmx files to try. Usually we develop the test scenario using the GUI on our laptops and then deploy to the cloud for load testing using the non-GUI version. We often deploy on a number of drivers in order to apply the required level of load.

The data for the above benchmark was also collected using a tool called casstat, which we are making available in the repository as well. Casstat is a bash script that calls other tools at regular intervals, compares the data with its previous sample, normalizes it on a per-second basis and displays the pertinent data on a single line. Under the covers casstat uses:

  • Cassandra nodetool cfstats to get Column Family performance data
  • nodetool tpstats to get internal state changes
  • nodetool cfhistograms to get 95th and 99th percentile response times
  • nodetool compactionstats to get details on number and type of compactions
  • iostat to get disk and cpu performance data
  • ifconfig to calculate network bandwidth
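
The compare-and-normalize step described above is the core of this kind of tool. The sketch below shows the idea in Python rather than bash, and it is not casstat's actual code. It assumes the underlying tools report cumulative counters, and the counter names and sample values are made up for illustration.

```python
def per_second_rates(previous, current):
    """Turn two cumulative counter samples into per-second rates.

    Each sample is a dict with an "epoch" timestamp (seconds) plus
    cumulative counters. Counter names here are hypothetical.
    """
    elapsed = current["epoch"] - previous["epoch"]
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return {
        name: (current[name] - previous[name]) / elapsed
        for name in current
        if name != "epoch"
    }


# Made-up samples taken 5 seconds apart.
before = {"epoch": 1000, "read_count": 200_000, "net_rx_kb": 50_000}
after = {"epoch": 1005, "read_count": 228_000, "net_rx_kb": 95_000}

rates = per_second_rates(before, after)
print(rates)  # {'read_count': 5600.0, 'net_rx_kb': 9000.0}
```

Dividing by the measured elapsed time, rather than the nominal interval, keeps the rates accurate even when a sample is collected late.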

An example output is below (note that some fields have been removed and abbreviated to reduce the width):

Epoch   Rds/s  RdLat  ...  %user  %sys   %idle  ....  md0r/s  w/s   rMB/s  wMB/s  NetRxK  NetTxK  Percentile  Read                        Write                      Compacts
133...  5657   0.085  ...  7.74   10.09  81.73  ...   0.00    2.00  0.00   0.05   9083    63414               99th 0.179 ms 95th 0.14 ms  99th 0.00 ms 95th 0.00 ms  Pen/0
133...  5635   0.083  ...  7.65   10.12  81.79  ...   0.00    0.30  0.00   0.00   9014    62777               99th 0.179 ms 95th 0.14 ms  99th 0.00 ms 95th 0.00 ms  Pen/0
133...  5615   0.085  ...  7.81   10.19  81.54  ...   0.00    0.60  0.00   0.00   9003    62974               99th 0.179 ms 95th 0.14 ms  99th 0.00 ms 95th 0.00 ms  Pen/0

We merge the casstat data from each Cassandra node and then use gnuplot to plot throughput etc.
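
One way such a merge could look is sketched below in Python. It sums reads per second across nodes for each epoch and renders the result as whitespace-separated columns, which gnuplot can plot directly. This is an illustration only, not our actual merge script: it assumes all nodes sample on the same epochs, and the node names and values are made up.

```python
from collections import defaultdict


def merge_throughput(per_node_samples):
    """Sum reads/sec across nodes for each epoch.

    per_node_samples maps a node name to a list of (epoch, reads_per_sec)
    tuples. Assumes all nodes sample on the same epochs.
    """
    totals = defaultdict(float)
    for samples in per_node_samples.values():
        for epoch, reads_per_sec in samples:
            totals[epoch] += reads_per_sec
    return sorted(totals.items())


def to_gnuplot_data(rows):
    """Render (epoch, value) rows as whitespace-separated lines."""
    return "\n".join(f"{epoch} {value:.0f}" for epoch, value in rows)


# Made-up per-node samples; node names are hypothetical.
samples = {
    "node1": [(1000, 5657), (1005, 5635)],
    "node2": [(1000, 5600), (1005, 5615)],
}
merged = merge_throughput(samples)
print(to_gnuplot_data(merged))
# 1000 11257
# 1005 11250
```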

The Cassandra JMeter plugin has become a key part of our load testing environment. We hope the wider community also finds it useful.