Wednesday, September 10, 2014

Introducing Chaos Engineering

Chaos Monkey was launched in 2010 with our move to Amazon Web Services, and thus the Netflix Simian Army was born. Our ecosystem has evolved as we’ve introduced thousands of devices, many new countries, a Netflix-optimized CDN often referred to as OpenConnect, a growing catalog of Netflix Originals, and new and exciting UI advancements. Not only has complexity grown, but our infrastructure itself has grown to support our rapidly growing customer base. As growth and evolution continue, we will experience and find new failure modes.

Our philosophy remains unchanged around injecting failure into production to ensure our systems are fault-tolerant. We are constantly testing our ability to survive “once in a blue moon” failures. In a sign of our commitment to this very philosophy, we want to double down on chaos, a.k.a. failure injection. We strive to mirror the failure modes that are possible in our production environment and simulate these under controlled circumstances. Our engineers are expected to write services that can withstand failures and gracefully degrade whenever necessary. By continuing to run these simulations, we are able to uncover and address vulnerabilities in our ecosystem.

A great example of a new failure mode was the regional ELB outage we experienced on Christmas Eve 2012. The Simian Army at the time only injected failures that we understood and had experienced up to that point. In response we invested in a multi-region Active-Active infrastructure to be resilient to such events. It’s not enough that we simply build a system that is fault-tolerant to region outages; we must regularly exercise our ability to withstand them.

Each outage reinforces our commitment to chaos to ensure the most reliable experience possible for our users. While much of the Simian Army is designed and built around maintaining our environments, Chaos Engineering is entirely focused on controlled failure injection.

The Plan for Chaos Engineering:

Establish Virtuous Chaos Cycles
A common industry practice around outages is the blameless post-mortem, a discipline we practice along with action items to prevent recurrence. In parallel with resilience patches and work to prevent recurrence, we also want to build new chaos tools that regularly and systematically test resilience, to detect regressions or new failure conditions.

Regression testing is a well-understood discipline in software testing; chaos testing for regressions in distributed systems at scale presents a unique challenge. We aspire to make chaos testing as well-understood a discipline in production systems as other disciplines in software development.

Increase use of Reliability Design Patterns
In distributed environments there’s a challenge in both creating reliability design patterns and integrating them in a consistent manner to handle failure.  When an outage or new failure mode surfaces it may start in a single service, but all services may be susceptible to the same failure mode.  Post-mortems will lead to immediate action items for a particular involved service but do not always lead to improvement for other loosely coupled services.  Eventually other susceptible services become impacted by a failure condition that may have previously surfaced.  Hystrix is a fantastic example of a reliability design pattern that helps to create consistency in our micro-services ecosystem.
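Hystrix itself is a Java library; purely as an illustration of the circuit-breaker pattern it popularized (all names below are hypothetical, not Hystrix’s API), a minimal sketch might look like:

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: open after N consecutive failures,
    short-circuit to a fallback, and allow a retry after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        # While open, serve the fallback until the cooldown expires.
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                return fallback()
            self.opened_at = None  # half-open: let one call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            return fallback()
        self.failures = 0
        return result
```

Wrapping every remote dependency in a construct like this is what makes graceful degradation consistent across services, rather than something each team reinvents per outage.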

Anticipate Future Failure Modes
Ideally distributed systems are designed to be so robust and fault-tolerant that they never fail. We must anticipate failure modes, determine ways to inject these conditions in a controlled manner and evolve our reliability design patterns.  Anticipating such events requires creativity and deep understanding of distributed systems; two of the most critical characteristics of Chaos Engineers.

New forms of chaos and reliability design patterns are two areas we are researching in Chaos Engineering. As we get deeper into our research we will continue to post our findings.

For those interested in this challenging research, we’re hiring additional Chaos Engineers.  Check out the jobs for Chaos Engineering at our jobs site.

-Bruce Wong, Engineering Manager of Chaos Engineering at Netflix (sometimes referred to as Chaos Commander)

Monday, August 25, 2014

Announcing Scumblr and Sketchy - Search, Screenshot, and Reclaim the Internet

Netflix is pleased to announce the open source release of two security-related web applications: Scumblr and Sketchy!

Scumbling The Web

Many security teams need to stay on the lookout for Internet-based discussions, posts, and other bits that may be of impact to the organizations they are protecting. These teams then take a variety of actions based on the nature of the findings discovered. Netflix’s security team has these same requirements, and today we’re releasing some of the tools that help us in these efforts.
Scumblr is a Ruby on Rails web application that allows searching the Internet for sites and content of interest. Scumblr includes a set of built-in libraries that allow creating searches for common sites like Google, Facebook, and Twitter. For other sites, it is easy to create plugins to perform targeted searches and return results. Once you have Scumblr set up, you can run the searches manually or automatically on a recurring basis.

Scumblr leverages a gem called Workflowable (which we are also open sourcing) that allows setting up flexible workflows that can be associated with search results. These workflows can be customized so that different types of results go through different workflow processes depending on how you want to act on them. Workflowable also has a plug-in architecture that allows triggering custom automated actions at each step of the process.
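Workflowable is a Ruby gem, so the snippet below is only a language-neutral sketch of the idea in Python, with made-up names: each result type maps to its own sequence of stages, and plug-in actions fire as a result moves between stages.

```python
class Workflow:
    """Toy workflow: an ordered list of stages plus optional
    automated actions triggered on entering each stage."""

    def __init__(self, stages, actions=None):
        self.stages = stages            # e.g. ["new", "review", "closed"]
        self.actions = actions or {}    # stage -> callable(result)

    def advance(self, result):
        i = self.stages.index(result["stage"])
        if i + 1 >= len(self.stages):
            return result               # already at the final stage
        result["stage"] = self.stages[i + 1]
        hook = self.actions.get(result["stage"])
        if hook:
            hook(result)                # e.g. notify a team, file a ticket
        return result

# Different result types can be routed through different workflows.
workflows = {
    "phishing": Workflow(
        ["new", "takedown_requested", "closed"],
        {"takedown_requested": lambda r: r.setdefault("notified", True)}),
    "mention": Workflow(["new", "triaged", "closed"]),
}

result = {"type": "phishing", "stage": "new"}
workflows[result["type"]].advance(result)
```

The real gem adds persistence, a UI, and configurable action plugins, but the routing concept is the same.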
Scumblr also integrates with Sketchy, which allows automatic screenshot generation of identified results to provide a snapshot-in-time of what a given page and result looked like when it was identified.


Scumblr makes use of the following components:
  • Ruby on Rails 4.0.9
  • Backend database for storing results
  • Redis + Sidekiq for background tasks
  • Workflowable for workflow creation and management
  • Sketchy for screenshot capture
We’re shipping Scumblr with built-in search libraries for seven common services including Google, Twitter, and Facebook.

Getting Started with Scumblr and Workflowable

Scumblr and Workflowable are available now on the Netflix Open Source site. Detailed instructions on setup and configuration are available in the projects’ wiki pages.


One of the features we wanted to see in Scumblr was the ability to collect screenshots and text content from potentially malicious sites - this allows security analysts to preview Scumblr results without the risk of visiting the site directly. We wanted this collection system to be isolated from Scumblr and also resilient to sites that may perform malicious actions. We also decided it would be nice to build an API that we could use in other applications outside of Scumblr.

Although a variety of tools and frameworks exist for taking screenshots, we discovered a number of edge cases that made taking reliable screenshots difficult - capturing screenshots from AJAX-heavy sites, cut-off images with virtual X drivers, and SSL and compression issues in the PhantomJS driver for Selenium, to name a few. To solve these challenges, we decided to leverage the best available tools and create an API framework that allows for reliable, scalable, and easy-to-use screenshot and text-scraping capabilities. Sketchy to the rescue!


At a high level, Sketchy contains the following components:
  • Python + Flask to serve Sketchy
  • PhantomJS to take lazy captures of AJAX heavy sites
  • Celery to manage jobs and Redis to schedule and store job results
  • Backend database to store capture records (by leveraging SQLAlchemy)

Sketchy Overview

Sketchy at its core provides a scalable task-based framework to capture screenshots, scrape page text, and save HTML through a simple-to-use API. These captures can be stored locally or in an AWS S3 bucket. Optionally, token auth can be configured and callbacks can be used if required. Sketchy uses PhantomJS with lazy rendering to ensure AJAX-heavy sites are captured correctly. Sketchy also uses the Celery task management system, allowing users to scale Sketchy accordingly and manage time-intensive captures for large sites.
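As a hedged illustration of what calling such an API looks like (the endpoint path and parameter names below are assumptions for the sketch, not Sketchy’s documented API - see the Sketchy wiki for the real one), a client might compose a capture request like this:

```python
from urllib.parse import urlencode

def capture_url(base, target, capture_type="screenshot", token=None):
    """Compose a request URL for a Sketchy-style capture API.
    Path and parameter names here are illustrative only."""
    params = {"url": target, "type": capture_type}
    if token:
        params["token"] = token  # optional token auth, as described above
    return base.rstrip("/") + "/api/v1.0/capture?" + urlencode(params)

req = capture_url("http://sketchy.example.com", "http://example.com")
```

Because captures run as asynchronous Celery tasks, a real client would poll for the result or register a callback rather than block on the request.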

Getting Started with Sketchy

Sketchy is available now on the Netflix Open Source site and setup is straightforward. In addition, we've also created a Docker image for Sketchy for interested users. Please visit the Sketchy wiki for documentation on how to get started.


Scumblr and Sketchy are helping the Netflix security team keep an eye on potential threats to our environment every day. We hope that the open source community can find new and interesting uses for the newest additions to the Netflix Open Source Software initiative. Scumblr, Sketchy, and the Workflowable gem are all available on our GitHub site now!

-Andy Hoernecke and Scott Behrens (Netflix Cloud Security Team)

Wednesday, August 20, 2014

Netflix Hack Day - Summer 2014

Hack Day is a tradition at Netflix, as it is for many Silicon Valley companies. It is a great way to get away from everyday work and to provide a fun, experimental, collaborative and creative outlet for our product development and technical teams.

Similar to our Hack Day in February, we saw some really incredible ideas and implementations in our latest iteration last week.  If something interesting and useful comes from Hack Day, that is great, but the real motivation is fun and collaboration. With that spirit in mind, we had over 150 people code from Thursday morning to Friday morning, yielding more than 50 “hacks.”

The teams produced hacks covering a wide array of areas, including development productivity, internal insights tools, quirky and fun mashups, and of course a breadth of product-related feature ideas.  All hackers then presented and the audience of Netflix employees rated each hack on a 5-star scale to determine our seven category winners and a “People’s Choice Award.”

Below are some examples of some of the hacks to give you a taste of what we saw this time around.  We should note that, while we think these hacks are very cool and fun, they may never become part of the Netflix product, internal infrastructure, or otherwise be used beyond Hack Day. We are surfacing them here publicly to share the spirit of the Netflix Hack Day.

And thanks to all of the hackers for putting together some incredible work in just 24 hours!  If you are interested in being a part of our next Hack Day, let us know by checking out our jobs site!

Netflix Hue
Use Philips 'smart' lightbulbs to make your room's ambient lighting match the titles that you are browsing and watching.

Text and console-based Netflix UI.

A 3D room version of our UI for the Oculus Rift, complete with gesture support.
By Ian McKay, Steve McGuire, Rachel Nordman, Kris Range, Alex Bustin, M. Frank Emanuel

Netflix Mini
Chrome extension that allows for multi-task watching in a mini screen.
By Adam Butterworth, Paul Anastasopoulos and Art Burov

Bringing actions that are currently only accessible via Display Pages into the Home & Browse UI.
By Ben Johnson, David Sutton

Circle of Life
Home page shows an alternate Netflix UI experience based on a network graph of titles.


And here are some pictures taken during the event.

Monday, August 18, 2014

Scaling A/B Testing on Netflix.com with Node.js

By Chris Saint-Amant

Last Wednesday night we held our third JavaScript Talks event at our Netflix headquarters in Los Gatos, Calif. Alex Liu and Micah Ransdell discussed some of the unique UI engineering challenges that go along with running hundreds of A/B tests each year.

We are currently migrating our website UI layer to Node.js, and have taken the opportunity to step back and evaluate the most effective way to build A/B tests. The talk explored some of the patterns we’ve built in Node.js to handle the complexity of running so many multivariate UI tests at scale. These solutions ultimately enable quick feature development and rapid test iteration.
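The talk covers the actual Node.js patterns; as a generic sketch only (in Python, and not Netflix’s implementation), one common building block of multivariate testing is deterministic cell assignment, so a member sees the same test experience on every request without any per-request allocation state:

```python
import hashlib

def assign_cell(member_id, test_id, cells):
    """Deterministically assign a member to one of a test's cells by
    hashing (member, test). Same inputs always yield the same cell,
    and different tests hash independently."""
    digest = hashlib.sha256(f"{member_id}:{test_id}".encode()).hexdigest()
    return cells[int(digest, 16) % len(cells)]

cell = assign_cell(12345, "new_row_ui", ["control", "variant_a", "variant_b"])
```

Keeping assignment a pure function of stable identifiers is what lets stateless UI servers render the right variant without coordinating with each other.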

Slides from the talk are now online. No video this time around, but you can come see us talk about UI approaches to A/B testing and our adoption of Node.js at several upcoming events. We hope to see you there!
Lastly, don’t forget to check out our Netflix UI Engineering channel on YouTube to watch videos from past JavaScript Talks.

Friday, July 25, 2014

Revisiting 1 Million Writes per second

In an article we posted in November 2011, Benchmarking Cassandra Scalability on AWS - Over a million writes per second, we showed how Cassandra (C*) scales linearly as you add more nodes to a cluster. With the advent of new EC2 instance types, we decided to revisit this test. Unlike the initial post, we were not interested in proving C*’s scalability. Instead, we were looking to quantify the performance these newer instance types provide.
What follows is a detailed description of our new test, as well as the throughput and latency results of those tests.

Node Count, Software Versions & Configuration

The C* Cluster

The Cassandra cluster ran Datastax Enterprise 3.2.5, which incorporates C*. The C* cluster had 285 nodes. The instance type used was i2.xlarge. We ran JVM 1.7.40_b43 and set the heap to 12GB. The OS was Ubuntu 12.04 LTS. Data and logs were on the same EXT3 mount point.
You will notice that in the previous test we used m1.xlarge instances. Although we could have achieved similar write throughput results with this less powerful instance type, in Production, for the majority of our clusters, we read more than we write. The choice of i2.xlarge (an SSD-backed instance type) is more realistic and better showcases read throughput and latencies.
The full schema follows:
create keyspace Keyspace1
 with placement_strategy = 'NetworkTopologyStrategy'
 and strategy_options = {us-east : 3}
 and durable_writes = true;

use Keyspace1;

create column family Standard1
 with column_type = 'Standard'
 and comparator = 'AsciiType'
 and default_validation_class = 'BytesType'
 and key_validation_class = 'BytesType'
 and read_repair_chance = 0.1
 and dclocal_read_repair_chance = 0.0
 and populate_io_cache_on_flush = false
 and gc_grace = 864000
 and min_compaction_threshold = 999999
 and max_compaction_threshold = 999999
 and replicate_on_write = true
 and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
 and caching = 'KEYS_ONLY'
 and column_metadata = [
   {column_name : 'C4',
   validation_class : BytesType},
   {column_name : 'C3',
   validation_class : BytesType},
   {column_name : 'C2',
   validation_class : BytesType},
   {column_name : 'C0',
   validation_class : BytesType},
   {column_name : 'C1',
   validation_class : BytesType}]
 and compression_options = {'sstable_compression' : ''};
You will notice that min_compaction_threshold and max_compaction_threshold were set high. Although we don’t set these parameters to exactly those values in Production, it does reflect the fact that we prefer to control when compactions take place and initiate a full compaction on our own schedule.

The Client

The client application used was Cassandra Stress. There were 60 client nodes. The instance type used was r3.xlarge. This instance type has half the cores of the m2.4xlarge instances we used in the previous test. However, the r3.xlarge instances were still able to push the load (while using 40% fewer threads) required to reach the same throughput at almost half the price. The client was running JVM 1.7.40_b43 on Ubuntu 12.04 LTS.

Network Topology

Netflix deploys Cassandra clusters with a Replication Factor of 3. We also spread our Cassandra rings across 3 Availability Zones. We equate a C* rack to an Amazon Availability Zone (AZ). This way, in the event of an Availability Zone outage, the Cassandra ring still has 2 copies of the data and will continue to serve requests.
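The arithmetic behind that guarantee is worth making explicit: with a Replication Factor of 3 spread evenly across 3 AZs, each zone holds one replica of every row, so losing a zone leaves 2 replicas - still enough to satisfy a quorum (2 of 3). A tiny sanity check:

```python
def quorum(rf):
    """Cassandra quorum: a strict majority of the replication factor."""
    return rf // 2 + 1

rf = 3
zones = 3
replicas_per_zone = rf // zones        # one replica of each row per AZ
surviving = rf - replicas_per_zone     # replicas left after losing one AZ
# Quorum reads/writes still succeed after a full zone outage: 2 >= 2.
assert surviving >= quorum(rf)
```

This is why the ring "still has 2 copies of the data and will continue to serve requests" through a zone outage, even at quorum consistency.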
In the previous post all clients were launched from the same AZ. This differs from our actual production deployment, where stateless applications are also deployed equally across three zones. Clients in one AZ always attempt to communicate with C* nodes in the same AZ. We call this zone-aware connections. This feature is built into Astyanax, Netflix’s C* Java client library. As a further speed enhancement, Astyanax also inspects the record’s key and sends requests to nodes that actually serve the token range of the record about to be written or read. Although any C* coordinator can fulfill any request, if the node is not part of the replica set, there will be an extra network hop. We call this making token-aware requests.
Since this test uses Cassandra Stress, I do not use token-aware requests. However, through some simple grep and awk-fu, this test is zone-aware. This is more representative of our actual production network topology.
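Astyanax’s implementation is in Java; as a simplified, hypothetical illustration of the zone-aware half of this (data shapes invented for the sketch), a client can simply prefer ring nodes in its own AZ and fall back to the full ring:

```python
def zone_aware_nodes(nodes, client_zone):
    """Prefer C* nodes in the caller's AZ; fall back to the whole ring
    if no node in the local zone is available."""
    local = [n for n in nodes if n["zone"] == client_zone]
    return local or nodes

ring = [
    {"host": "10.0.0.1", "zone": "us-east-1a"},
    {"host": "10.0.0.2", "zone": "us-east-1b"},
    {"host": "10.0.0.3", "zone": "us-east-1c"},
]
targets = zone_aware_nodes(ring, "us-east-1a")
```

Token awareness adds a second filter on top of this, selecting among the local candidates by which node owns the record’s token range.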

Latency & Throughput Measurements

We’ve documented our use of Priam as a sidecar to help with token assignment, backups & restores. Our internal version of Priam adds some extra functionality. We use the Priam sidecar to collect C* JMX telemetry and send it to our Insights platform, Atlas. We will be adding this functionality to the open source version of Priam in the coming weeks.
Below are the JMX properties we collect to measure latency and throughput.


  • AVG & 95%ile Coordinator Latencies
    • Read
      • StorageProxyMBean.getRecentReadLatencyHistogramMicros() provides a histogram array from which the AVG & 95%ile can then be calculated
    • Write
      • StorageProxyMBean.getRecentWriteLatencyHistogramMicros() provides a histogram array from which the AVG & 95%ile can then be calculated


  • Coordinator Operations Count
    • Read
      • StorageProxyMBean.getReadOperations()
    • Write
      • StorageProxyMBean.getWriteOperations()
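As a hedged sketch of the post-processing step (a simplification: it works on raw latency samples rather than the MBean’s bucketed histogram, and is not the actual Priam code), computing the average and 95th percentile might look like:

```python
import math

def avg_and_p95(samples):
    """Average and nearest-rank 95th percentile of latency samples
    (e.g. coordinator latencies in microseconds)."""
    s = sorted(samples)
    avg = sum(s) / len(s)
    rank = math.ceil(0.95 * len(s))  # nearest-rank percentile method
    return avg, s[rank - 1]
```

Reporting both numbers matters here: the averages below look healthy even in runs where the 95th percentile is an order of magnitude higher.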

The Test

I performed the following 4 tests:
  1. A full write test at CL One
  2. A full write test at CL Quorum
  3. A mixed test of writes and reads at CL One
  4. A mixed test of writes and reads at CL Quorum

100% Write

Unlike in the original post, this test shows a sustained >1 million writes/sec. Not many applications will only write data. However, a possible use case with this type of footprint could be a telemetry system or a backend to an Internet of Things (IoT) application, where the data is then fed into a BI system for analysis.

CL One

Like in the original post, this test runs at CL One. The average coordinator latency is a little over 5 milliseconds, with a 95th percentile of 10 milliseconds.
Every client node ran the following Cassandra Stress command:
cassandra-stress -d [list of C* IPs] -t 120 -r -p 7102 -n 1000000000  -k -f [path to log] -o INSERT


CL LOCAL_QUORUM

For the use case where a higher level of consistency in writes is desired, this test shows the throughput achieved when the million writes per second test runs at a CL of LOCAL_QUORUM.
The write throughput hugs the 1 million writes/sec mark, at an average coordinator latency of just under 6 milliseconds and a 95th percentile of 17 milliseconds.
Every client node ran the following Cassandra Stress command:
cassandra-stress -d [list of C* IPs] -t 120 -r -p 7102 -e LOCAL_QUORUM -n 1000000000  -k -f [path to log] -o INSERT

Mixed - 10% Write 90% Read

1 million writes/sec makes for an attractive headline. Most applications, however, have a mix of reads and writes. After investigating some of the key applications at Netflix, I noticed a mix of 10% writes and 90% reads. So this mixed test reserves 10% of the total threads for writes and 90% for reads. The test is unbounded. This is still not fully representative of the actual footprint an app might experience; however, it is a good indicator of how much throughput the cluster can handle and what the latencies might look like when pushed hard.
To avoid reading data from memory or from the file system cache, I let the write test run for a few days until a compacted data to memory ratio of 2:1 was reached.

CL One

C* achieves the highest throughput and highest level of availability when used in a CL One configuration. This does require developers to embrace eventual consistency and to design their applications around this paradigm. More info on this subject can be found here.
The Write throughput is >200K writes/sec with an average coordinator latency of about 1.25 milliseconds and a 95th percentile of 2.5 milliseconds.
The Read throughput is around 900K reads/sec with an average coordinator latency of 2 milliseconds and a 95th percentile of 7.5 milliseconds.
Every client node ran the following 2 Cassandra Stress commands:
cassandra-stress -d $cCassList -t 20 -r -p 7102 -n 1000000000  -k -f /data/stressor/stressor.log -o INSERT
cassandra-stress -d $cCassList -t 100 -r -p 7102 -n 1000000000  -k -f /data/stressor/stressor.log -o READ


CL Quorum

Most application developers starting off with C* will default to CL Quorum writes and reads. This gives them the opportunity to dip their toes into the distributed database world without also having to tackle the extra challenges of rethinking their applications for eventual consistency.
The Write throughput is just below 200K writes/sec with an average coordinator latency of 1.75 milliseconds and a 95th percentile of 20 milliseconds.
The Read throughput is around 600K reads/sec with an average coordinator latency of 3.4 milliseconds and a 95th percentile of 35 milliseconds.
Every client node ran the following 2 Cassandra Stress commands:
cassandra-stress -d $cCassList -t 20 -r -p 7102 -e LOCAL_QUORUM -n 1000000000  -k -f [path to log] -o INSERT
cassandra-stress -d $cCassList -t 100 -r -p 7102 -e LOCAL_QUORUM -n 1000000000  -k -f [path to log] -o READ


The total costs involved in running this test include the EC2 instance costs as well as the inter-zone network traffic costs. We use Boundary to monitor our C* network usage.
Boundary showed that we were transferring a total of about 30 Gbps between Availability Zones.
Here is the breakdown of the costs incurred to run the 1 million writes per second test. These are retail prices that can be referenced here.
Instance Type / Item | Cost per Minute | Total Price per Minute
Inter-zone traffic | $0.01 per GB | 3.75 GBps * 60 = 225 GB per minute

Total Cost per minute
Total Cost per half Hour
Total Cost per Hour
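The inter-zone traffic component of that bill can be sanity-checked directly from the figures above (30 Gbps observed, $0.01 per GB; instance costs are not recomputed here):

```python
gbps = 30                         # observed inter-zone traffic, gigabits/sec
gb_per_sec = gbps / 8             # 3.75 GB/s
gb_per_min = gb_per_sec * 60      # 225 GB transferred per minute
cost_per_min = gb_per_min * 0.01  # at $0.01 per GB of inter-zone transfer
cost_per_hour = cost_per_min * 60
```

So cross-zone replication traffic alone runs a couple of dollars per minute at this throughput, which is why zone-aware connections matter for cost as well as latency.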

Final Thoughts

Most companies probably don’t need to process this much data. For those that do, this is a good indication of what types of cost, latencies and throughput one could expect while using the newer i2 and r3 AWS instance types. Every application is different, and your mileage will certainly vary.
This test was performed over the course of a week during my free time. This isn’t an exhaustive performance study, nor did I get into any deep C*, system, or JVM tuning. I know you can probably do better. If working with distributed databases at scale and squeezing out every last drop of performance is what drives you, please join the Netflix CDE team.