Wednesday, March 14, 2012

Testing Netflix on Android


When Netflix decided to enter the Android ecosystem, we faced a daunting set of challenges: a) we wanted to release rapidly, every 6-8 weeks; b) there were hundreds of Android devices of different shapes, versions, capacities and specifications that needed to play back audio and video; and c) we wanted to keep the team small and happy.

Of course, the seasoned tester in you has to admit that these are the sort of problems you like to wake up to every day and solve. Doing it with a group of other software engineers who are passionate about quality is what made overcoming those challenges even more fun.

Release rapidly

You probably guessed that automation had to play a role in this solution. However, automating scenarios on a phone or tablet is complicated when the core functionality of your application is to play back videos natively but the user interface is HTML5 that lives in the application’s web view.

Verifying an app that uses an embedded web view to serve as its presentation platform was challenging in part due to the dearth of tools available. We considered Selenium, AndroidNativeDriver and the Android Instrumentation Framework. Unfortunately, we could not use Selenium or AndroidNativeDriver, because the bulk of our user interactions occur on the HTML5 front end. As a result, we decided to build a slightly modified solution.

Our modified test framework heavily leverages a piece of our product code which bridges JavaScript and native code through a proxy interface. Though we were able to drive some behavior by sending commands through the bridge, we needed an automation hook in order to report state back to the automation framework. Since the HTML document doesn’t expose its title, we decided to use the title element as our hook. We rely on the onReceivedTitle notification as a way to communicate back to our Java code when some JavaScript is executed in the HTML5 UI. Through this approach, we were able to execute a variety of tasks by injecting JavaScript into the web view, performing the appropriate DOM inspection task, and then reporting the result through the title property.
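
As a rough illustration of the title hook, the sketch below builds a javascript: URL that evaluates an expression and writes the result into document.title, and decodes the title string on the Java side. It is not the framework described in this post: the class name, the method names, the request id and the "AUTOMATION|" prefix are all assumptions made for the example. On a device, the URL would be passed to WebView.loadUrl() and the title would arrive in WebChromeClient.onReceivedTitle(); the sketch leaves those Android calls out so that it runs on a plain JVM.

```java
// Hypothetical sketch: class, method names and the "AUTOMATION|" prefix are
// assumptions for illustration, not the framework described in this post.
final class TitleHook {
    static final String PREFIX = "AUTOMATION|";

    /** A decoded report: which request it answers and the value the page computed. */
    static final class Report {
        final int requestId;
        final String value;

        Report(int requestId, String value) {
            this.requestId = requestId;
            this.value = value;
        }
    }

    /**
     * Wraps a JavaScript expression in a javascript: URL that evaluates it and
     * writes the result into document.title, tagged with the request id.
     */
    static String buildInjection(int requestId, String jsExpression) {
        return "javascript:(function(){"
                + "var r;"
                + "try{r=String(" + jsExpression + ");}catch(e){r='ERROR:'+e;}"
                + "document.title='" + PREFIX + requestId + "|'+r;"
                + "})()";
    }

    /** Decodes a title string; returns null for ordinary page titles. */
    static Report parse(String title) {
        if (title == null || !title.startsWith(PREFIX)) {
            return null;
        }
        String rest = title.substring(PREFIX.length());
        int sep = rest.indexOf('|');
        if (sep < 0) {
            return null;
        }
        try {
            int id = Integer.parseInt(rest.substring(0, sep));
            return new Report(id, rest.substring(sep + 1));
        } catch (NumberFormatException e) {
            return null; // prefix matched but the request id was malformed
        }
    }
}
```

A test could, for example, inject an expression that counts elements in the catalog view, wait for a title carrying the matching request id, and assert on the decoded value. Tagging each report with an id is one way to keep a stale title from being mistaken for the answer to a newer request.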

With this solution in place, we are able to automate all our key scenarios such as login, browsing the movie catalog, searching and controlling movie playback.

While we automate the testing of playback, the subjective analysis of quality is still left to the tester. Using automation we can catch buffering and other streaming issues by adding testability to our software, but at the end of the day we need testers to verify things such as seamless resolution switching or HD quality, which are hard to check with automation today and would be cost prohibitive to automate.

We have a continuous build integration system that allows us to run our automated smoke tests on each submit on a bank of devices.  With the framework in place, we are able to quickly ascertain build stability across the vast array of makes and models that are part of the Android ecosystem.  This quick and inexpensive feedback loop enables a very quick release cycle as the testing overhead in each release is low given the stakes.

Device Diversity
To put device diversity in context, we see around 1000 different devices streaming Netflix on Android every day. We had to figure out how to categorize these devices into buckets so that we can be reasonably sure that we are releasing something that will work properly on all of them. The devices we choose to participate in our continuous integration system are based on the following criteria.
  • We have at least one device for each playback pipeline architecture we support (The app uses several approaches for video playback on Android such as hardware decoder, software decoder, OMX-AL, iOMX).
  • We choose devices with high and low end processors as well as devices with different memory capabilities.
  • We have representatives that support each major operating system by make in addition to supporting custom ROMs (most notably CM7, CM9).
  • We choose devices that are most heavily used by Netflix Subscribers.


With this information, we took stock of all the devices we have in house and classified them based on their specs. We then figured out the optimal combination of devices to give us maximum coverage. We were able to reduce our daily smoke automation devices to around 10 phones and 4 tablets and keep the rest for the longer release-wide test cycles.
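
One way to think about picking a small set of devices with maximum coverage is as a set cover problem. The sketch below uses a greedy heuristic, which is a stand-in chosen for illustration and not necessarily how the list in this post was produced. The class name, the device names and the trait labels are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of greedy set cover; not the selection process used in this post.
final class DeviceCoverage {
    /**
     * Repeatedly picks the device that covers the most still-uncovered traits.
     * Stops when everything is covered or no remaining device adds coverage.
     * Ties go to the device that comes first in the map's iteration order.
     */
    static List<String> pickDevices(Map<String, Set<String>> traitsByDevice,
                                    Set<String> requiredTraits) {
        Set<String> uncovered = new HashSet<>(requiredTraits);
        List<String> chosen = new ArrayList<>();
        while (!uncovered.isEmpty()) {
            String best = null;
            int bestGain = 0;
            for (Map.Entry<String, Set<String>> entry : traitsByDevice.entrySet()) {
                if (chosen.contains(entry.getKey())) {
                    continue;
                }
                Set<String> gain = new HashSet<>(entry.getValue());
                gain.retainAll(uncovered);
                if (gain.size() > bestGain) {
                    bestGain = gain.size();
                    best = entry.getKey();
                }
            }
            if (best == null) {
                break; // the remaining traits are not covered by any device
            }
            chosen.add(best);
            uncovered.removeAll(traitsByDevice.get(best));
        }
        return chosen;
    }
}
```

Here a trait could be a playback pipeline, a processor class, a memory tier or an OS version. The greedy choice does not always give the smallest possible set, and it ignores the usage-based criterion, which would need a weighting of its own.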

This list gets updated periodically to adjust to changing market conditions. Also note that this is only the phone list; we have a separate list for tablets. We have several other phones that we test using automation and a smaller set of high priority tests, while the phones on the list above go through the comprehensive suite of manual and automated testing.

To put it another way, when it comes to watching Netflix, any device other than those ten can be classified with one of the high priority devices based on its configuration. This in turn helps us quickly identify the class of problems associated with a given device.

Small Happy Team
We keep our team lean by focusing our full time employees on building solutions that scale and automation is a key part of this effort. When we do an international launch, we rely on crowd-sourcing test solutions like uTest to quickly verify network and latency performance.  This provides us real world insurance that all of our backend systems are working as expected. These approaches give our team time to watch their favorite movies to ensure that we have the best mobile streaming video solution in the industry.

In a future post, we will discuss our iOS test process which provides its own unique set of technical challenges.

Amol Kher is the Engineering Manager in Tools for the Android, iOS and AppleTV teams. If you are interested in joining Netflix or the Mobile team, apply at www.netflix.com/jobs.

Tuesday, March 13, 2012

JMeter Plugin for Cassandra


By Vijay Parthasarathy and Denis Sheahan

A number of previous blog posts have discussed our adoption of Cassandra as a NoSQL solution in the cloud. We now have over 55 Cassandra clusters in the cloud and are moving our source of truth from our datacenter to these Cassandra clusters. As part of this move we have not only contributed to Cassandra itself but also developed software to ease its deployment and use. It is our plan to open source as much of this software as possible.

We recently announced the open sourcing of Priam, which is a co-process that runs alongside Cassandra on every node to provide backup and recovery, bootstrapping, token assignment, configuration management and a RESTful interface to monitoring and metrics. In January we also announced our Cassandra Java client Astyanax which is built on top of Thrift and provides lower latency, reduced latency variance, and better error handling.

At Netflix we have recently started to standardize our load testing across the fleet using Apache JMeter. As Cassandra is a key part of our infrastructure that needs to be tested, we developed a JMeter plugin for Cassandra. In this post we discuss the plugin and present performance data for Astyanax vs Thrift collected using this plugin.

Cassandra JMeter Plugin

JMeter allows us to customize our test cases based on our application logic and data model. The Cassandra JMeter plugin we are releasing today is described on the GitHub wiki here. It consists of a jar file that is placed in JMeter's lib/ext directory. The instructions to build and install the jar file are here.

An example screenshot is shown below.


Benchmark Setup

We set up a simple 6-node Cassandra cluster using EC2 m2.4xlarge instances, and the following schema:

create keyspace MemberKeySp
with placement_strategy = 'NetworkTopologyStrategy'
and strategy_options = [{us-east : 3}]
and durable_writes = true;
use MemberKeySp;
create column family Customer
with column_type = 'Standard'
and comparator = 'UTF8Type'
and default_validation_class = 'BytesType'
and key_validation_class = 'UTF8Type'
and rows_cached = 0.0
and keys_cached = 100000.0
and read_repair_chance = 0.0
and comment = 'Customer Records';

Six million rows were then inserted into the cluster with a replication factor of 3. Each row has 19 columns of simple ASCII data. The total data set is 2.9GB per node, so it is easily cacheable in our instances, which have 68GB of memory. We wanted to test the latency of the client implementation using a single Get Range Slice operation, i.e. a 100% read-only workload. Each test was run twice to ensure the data was indeed cached, which we confirmed with iostat. One hundred JMeter threads were used to apply the load, with 100 connections from JMeter to each Cassandra node. Each JMeter thread therefore has at least 6 connections to choose from when sending its request to Cassandra.

Every Cassandra JMeter Thread Group has a Config Element called CassandraProperties which contains clientType amongst other properties. For Astyanax, clientType is set to com.netflix.jmeter.connections.a6x.AstyanaxConnection; for Thrift, it is set to com.netflix.jmeter.connections.thrift.ThriftConnection.

Token Aware is the default JMeter setting. If you wish to experiment with other settings, create a properties file, cassandra.properties, in the JMeter home directory with properties from the list below.

astyanax.connection.discovery=
astyanax.connection.pool=
astyanax.connection.latency.stategy=
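
Before a long run it can be handy to check which of these keys a cassandra.properties file actually sets. The sketch below reads the three key names exactly as listed above using java.util.Properties. It is a hypothetical helper, not part of the plugin, and it makes no assumption about which values the plugin accepts.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper, not part of the plugin: it only shows how the three
// keys listed above could be read from a cassandra.properties file.
final class CassandraOverrides {
    // Key names are copied verbatim from the list above.
    static final String[] KEYS = {
        "astyanax.connection.discovery",
        "astyanax.connection.pool",
        "astyanax.connection.latency.stategy"
    };

    /** Returns only the keys that are present with a non-blank value, in list order. */
    static Map<String, String> read(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> overrides = new LinkedHashMap<>();
        for (String key : KEYS) {
            String value = props.getProperty(key);
            if (value != null && !value.trim().isEmpty()) {
                overrides.put(key, value.trim());
            }
        }
        return overrides;
    }
}
```

Passing a java.io.FileReader opened on cassandra.properties would return a map of the overrides in effect. An empty map means every key is left at its default.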

Results

Transaction throughput

This graph shows the throughput at 5 second intervals for the Token Aware client vs the Thrift client. Token Aware is consistently higher than Thrift, and on average its throughput is 3% better.

Average Latency

JMeter reports response times to millisecond granularity. The Token Aware implementation responds in 2ms the majority of the time with occasional 3ms periods; its average is 2.29ms. The Thrift implementation is consistently at 3ms. So the raw Thrift implementation, without a token aware connection pool, takes about 30% longer to respond than Astyanax (3ms vs 2.29ms).
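
The arithmetic behind these figures can be checked with a few lines of Java. The helper names are made up for this example, and the first calculation assumes every reported interval is exactly 2ms or 3ms. Under that assumption, an average of 2.29ms means roughly 29% of intervals were at 3ms. Thrift's 3ms is about 31% longer than 2.29ms, or seen from the other side, 2.29ms is about 24% shorter than 3ms.

```java
// Hypothetical helpers for checking the latency comparison; names are illustrative.
final class LatencyComparison {
    /** Fraction of samples at the slower value, given a two-value mix and its average. */
    static double shareOfSlower(double fastMs, double slowMs, double averageMs) {
        return (averageMs - fastMs) / (slowMs - fastMs);
    }

    /** How much longer, in percent, otherMs is than baselineMs. */
    static double percentLonger(double baselineMs, double otherMs) {
        return (otherMs / baselineMs - 1.0) * 100.0;
    }

    /** How much shorter, in percent, improvedMs is than referenceMs. */
    static double percentShorter(double referenceMs, double improvedMs) {
        return (1.0 - improvedMs / referenceMs) * 100.0;
    }
}
```

Which of the two percentages to quote depends on the baseline, so it helps to state the baseline alongside the number.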

The plugin provides a wide range of samplers for Put, Composite Put, Batch Put, Get, Composite Get, Range Get and Delete. The GitHub wiki has examples for all these scenarios, including jmx files to try. Usually we develop the test scenario using the GUI on our laptops and then deploy to the cloud for load testing using the non-GUI version. We often deploy on a number of drivers in order to apply the required level of load.

The data for the above benchmark was also collected using a tool called casstat, which we are making available in the repository as well. Casstat is a bash script that calls other tools at regular intervals, compares the data with its previous sample, normalizes it on a per second basis and displays the pertinent data on a single line. Under the covers casstat uses:

  • Cassandra nodetool cfstats to get Column Family performance data
  • nodetool tpstats to get internal state changes
  • nodetool cfhistograms to get 95th and 99th percentile response times
  • nodetool compactionstats to get details on number and type of compactions
  • iostat to get disk and cpu performance data
  • ifconfig to calculate network bandwidth

An example of the output is below (note that some fields have been removed or abbreviated to reduce the width):

Epoch Rds/s RdLat ... %user %sys %idle .... md0r/s w/s rMB/s wMB/s NetRxK NetTxK Percentile Read Write Compacts
133... 5657 0.085 ... 7.74 10.09 81.73 ... 0.00 2.00 0.00 0.05 9083 63414 99th 0.179 ms 95th 0.14 ms 99th 0.00 ms 95th 0.00 ms Pen/0
133... 5635 0.083 ... 7.65 10.12 81.79 ... 0.00 0.30 0.00 0.00 9014 62777 99th 0.179 ms 95th 0.14 ms 99th 0.00 ms 95th 0.00 ms Pen/0
133... 5615 0.085 ... 7.81 10.19 81.54 ... 0.00 0.60 0.00 0.00 9003 62974 99th 0.179 ms 95th 0.14 ms 99th 0.00 ms 95th 0.00 ms Pen/0

We merge the casstat data from each Cassandra node and then use gnuplot to plot throughput etc.
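
The "compare with the previous sample and normalize per second" step can be sketched as follows. Casstat itself is a bash script, so this Java version only illustrates the idea. The class name is hypothetical, and the sketch assumes the underlying tool reports a cumulative counter, such as a running read count.

```java
// Hypothetical illustration of per-second normalization; casstat itself is a bash script.
final class RateSampler {
    private boolean hasPrevious = false;
    private long previousCount;
    private long previousMillis;

    /**
     * Records a cumulative counter reading and returns the rate per second
     * since the previous reading. Returns NaN for the first reading or when
     * no time has elapsed.
     */
    double sample(long cumulativeCount, long timestampMillis) {
        double rate = Double.NaN;
        if (hasPrevious && timestampMillis > previousMillis) {
            double elapsedSeconds = (timestampMillis - previousMillis) / 1000.0;
            rate = (cumulativeCount - previousCount) / elapsedSeconds;
        }
        hasPrevious = true;
        previousCount = cumulativeCount;
        previousMillis = timestampMillis;
        return rate;
    }
}
```

For instance, a counter that goes from 1000 to 29285 over 5 seconds corresponds to 5657 operations per second. Dividing by the measured elapsed time, not the nominal interval, keeps the rate accurate when a sample arrives late.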

The Cassandra JMeter plugin has become a key part of our load testing environment. We hope the wider community also finds it useful.