Friday, July 25, 2014

Revisiting 1 Million Writes per second

In an article we posted in November 2011, Benchmarking Cassandra Scalability on AWS - Over a million writes per second, we showed how Cassandra (C*) scales linearly as you add more nodes to a cluster. With the advent of new EC2 instance types, we decided to revisit this test. Unlike the initial post, we were not interested in proving C*’s scalability. Instead, we were looking to quantify the performance these newer instance types provide.
What follows is a detailed description of our new test, as well as the throughput and latency results of those tests.

Node Count, Software Versions & Configuration

The C* Cluster

The Cassandra cluster ran DataStax Enterprise 3.2.5, which incorporates C* 1.2.15.1. The C* cluster had 285 nodes. The instance type used was i2.xlarge. We ran JVM 1.7.40_b43 with the heap set to 12GB. The OS was Ubuntu 12.04 LTS, and data and logs lived on the same EXT3 mount point.
You will notice that the previous test used m1.xlarge instances. Although we could have achieved similar write throughput with that less powerful instance type, in production the majority of our clusters read more than they write. The choice of i2.xlarge (an SSD-backed instance type) is more realistic and better showcases read throughput and latencies.
The full schema follows:
create keyspace Keyspace1
 with placement_strategy = 'NetworkTopologyStrategy'
 and strategy_options = {us-east : 3}
 and durable_writes = true;


use Keyspace1;


create column family Standard1
 with column_type = 'Standard'
 and comparator = 'AsciiType'
 and default_validation_class = 'BytesType'
 and key_validation_class = 'BytesType'
 and read_repair_chance = 0.1
 and dclocal_read_repair_chance = 0.0
 and populate_io_cache_on_flush = false
 and gc_grace = 864000
 and min_compaction_threshold = 999999
 and max_compaction_threshold = 999999
 and replicate_on_write = true
 and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
 and caching = 'KEYS_ONLY'
 and column_metadata = [
   {column_name : 'C4',
   validation_class : BytesType},
   {column_name : 'C3',
   validation_class : BytesType},
   {column_name : 'C2',
   validation_class : BytesType},
   {column_name : 'C0',
   validation_class : BytesType},
   {column_name : 'C1',
   validation_class : BytesType}]
 and compression_options = {'sstable_compression' : ''};
You will notice that min_compaction_threshold and max_compaction_threshold were set very high. Although we don't set these parameters to exactly these values in production, they reflect the fact that we prefer to control when compactions take place, initiating a full compaction on our own schedule.

The Client

The client application used was Cassandra Stress. There were 60 client nodes. The instance type used was r3.xlarge. This instance type has half the cores of the m2.4xlarge instances we used in the previous test, yet the r3.xlarge instances were still able to push the load required to reach the same throughput (while using 40% fewer threads) at almost half the price. The clients ran JVM 1.7.40_b43 on Ubuntu 12.04 LTS.

Network Topology

Netflix deploys Cassandra clusters with a Replication Factor of 3. We also spread our Cassandra rings across 3 Availability Zones. We equate a C* rack to an Amazon Availability Zone (AZ). This way, in the event of an Availability Zone outage, the Cassandra ring still has 2 copies of the data and will continue to serve requests.
In the previous post all clients were launched from the same AZ. This differs from our actual production deployment, where stateless applications are also deployed equally across three zones. Clients in one AZ always attempt to communicate with C* nodes in the same AZ; we call these zone-aware connections. This feature is built into Astyanax, Netflix's C* Java client library. As a further speed enhancement, Astyanax also inspects the record's key and sends requests to the nodes that actually serve the token range of the record about to be written or read. Although any C* coordinator can fulfill any request, if that node is not part of the replica set there will be an extra network hop. We call this making token-aware requests.
Since this test uses Cassandra Stress, I do not make token-aware requests. However, through some simple grep and awk-fu, the test is zone-aware (a sketch of this kind of filtering appears below). This is more representative of our actual production network topology.
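For illustration only, here is a minimal sketch of the kind of zone filtering that grep/awk step performs, written in Python for clarity. The hosts.txt file and its format are assumptions made for the example, not the actual tooling used in this test.

# zone_filter.py - hypothetical sketch of zone-aware host selection.
# Assumes each line of hosts.txt looks like: "10.0.1.12 us-east-1a"
import sys

def hosts_in_zone(path, zone):
    """Return the C* node IPs that live in the given Availability Zone."""
    ips = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2 and parts[1] == zone:
                ips.append(parts[0])
    return ips

if __name__ == "__main__":
    # e.g. python zone_filter.py hosts.txt us-east-1a
    path, zone = sys.argv[1], sys.argv[2]
    # cassandra-stress accepts a comma-separated host list via -d
    print(",".join(hosts_in_zone(path, zone)))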


Latency & Throughput Measurements

We’ve documented our use of Priam as a sidecar to help with token assignment, backups & restores. Our internal version of Priam adds some extra functionality. We use the Priam sidecar to collect C* JMX telemetry and send it to our Insights platform, Atlas. We will be adding this functionality to the open source version of Priam in the coming weeks.
Below are the JMX properties we collect to measure latency and throughput.

Latency

  • AVG & 95%ile Coordinator Latencies (a calculation sketch follows below)
    • Read
      • StorageProxyMBean.getRecentReadLatencyHistogramMicros() provides an array from which the AVG & 95%ile can then be calculated
    • Write
      • StorageProxyMBean.getRecentWriteLatencyHistogramMicros() provides an array from which the AVG & 95%ile can then be calculated
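To make this concrete, below is a minimal sketch of how an average and 95th percentile can be derived from such a latency histogram. It assumes you already have the per-bucket counts from JMX along with each bucket's upper bound in microseconds; the bucket boundaries and counts shown are illustrative, not Cassandra's actual histogram offsets.

def histogram_stats(bucket_bounds_us, counts):
    """Approximate the AVG and 95th percentile latency from per-bucket counts.
    bucket_bounds_us[i] is the upper bound (in microseconds) of bucket i."""
    total = sum(counts)
    if total == 0:
        return 0.0, 0.0
    # Weighted average, treating every sample as its bucket's upper bound.
    avg = sum(b * c for b, c in zip(bucket_bounds_us, counts)) / float(total)
    # 95th percentile: the first bucket where the cumulative count crosses 95%.
    target = 0.95 * total
    running = 0
    for bound, count in zip(bucket_bounds_us, counts):
        running += count
        if running >= target:
            return avg, bound
    return avg, bucket_bounds_us[-1]

# Example with made-up bucket bounds (microseconds) and counts.
bounds = [1000, 2000, 5000, 10000, 20000]
counts = [100, 400, 300, 150, 50]
print(histogram_stats(bounds, counts))  # -> (4900.0, 10000)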

Throughput

  • Coordinator Operations Count (cumulative counters, sampled over time to derive operations per second; see the sketch below)
    • Read
      • StorageProxyMBean.getReadOperations()
    • Write
      • StorageProxyMBean.getWriteOperations()
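The operation counts above are cumulative, so throughput is derived by sampling them periodically and dividing the delta by the sampling interval. Below is a minimal sketch of that calculation; the function and its sampling loop are illustrative, not our actual collector.

import time

def throughput_ops_per_sec(sample_fn, interval_sec=10):
    """Sample a cumulative counter twice and convert the delta to ops/sec.
    sample_fn is any callable returning the current counter value, e.g. a
    wrapper around a JMX read of StorageProxyMBean.getWriteOperations()."""
    start = sample_fn()
    time.sleep(interval_sec)
    end = sample_fn()
    return (end - start) / float(interval_sec)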

The Test

I performed the following 4 tests:
  1. A full write test at CL One
  2. A full write test at CL Quorum
  3. A mixed test of writes and reads at CL One
  4. A mixed test of writes and reads at CL Quorum

100% Write

Unlike in the original post, this test shows a sustained >1 million writes/sec. Not many applications will only write data; however, a footprint like this could belong to a telemetry system or the backend of an Internet of Things (IoT) application, with the data then fed into a BI system for analysis.

CL One

As in the original post, this test runs at CL One. The average coordinator latency is a little over 5 milliseconds, with a 95th percentile of 10 milliseconds.
[Figure: Write throughput and coordinator latency at CL One (WriteCLOne.png)]
Every client node ran the following Cassandra Stress command:
cassandra-stress -d [list of C* IPs] -t 120 -r -p 7102 -n 1000000000  -k -f [path to log] -o INSERT

CL LOCAL_QUORUM

For use cases where a higher level of write consistency is desired, this test shows the throughput achieved when the million-writes-per-second test runs at a CL of LOCAL_QUORUM (with a Replication Factor of 3, LOCAL_QUORUM requires 2 of the 3 replicas in the local region to acknowledge each write).
The write throughput is hugging the 1 million writes/sec mark at an average coordinator latency of just under 6 milliseconds and a 95th percentile of 17 milliseconds.
[Figure: Write throughput and coordinator latency at CL LOCAL_QUORUM (WriteCLQuorum.png)]
Every client node ran the following Cassandra Stress command:
cassandra-stress -d [list of C* IPs] -t 120 -r -p 7102 -e LOCAL_QUORUM -n 1000000000  -k -f [path to log] -o INSERT

Mixed - 10% Write 90% Read

1 million writes/sec makes for an attractive headline. Most applications, however, have a mix of reads and writes. After investigating some of the key applications at Netflix, I noticed a mix of roughly 10% writes and 90% reads. So this mixed test reserves 10% of the total threads for writes and 90% for reads. The test is unbounded, which is still not representative of the footprint an actual application would generate; however, it is a good indicator of how much throughput the cluster can handle and what the latencies might look like when it is pushed hard.
To avoid reading data from memory or from the file system cache, I let the write test run for a few days until a compacted data to memory ratio of 2:1 was reached.

CL One

C* achieves its highest throughput and highest level of availability when used in a CL One configuration. This does require developers to embrace eventual consistency and to design their applications around this paradigm. More info on this subject can be found here.
The write throughput is >200K writes/sec with an average coordinator latency of about 1.25 milliseconds and a 95th percentile of 2.5 milliseconds.
The read throughput is around 900K reads/sec with an average coordinator latency of 2 milliseconds and a 95th percentile of 7.5 milliseconds.
[Figure: Mixed workload write throughput at CL One (MixedWriteCLOne.png)]
[Figure: Mixed workload read throughput at CL One (MixedReadCLOne.png)]
Every client node ran the following 2 Cassandra Stress commands:
cassandra-stress -d $cCassList -t 30 -r -p 7102 -e ONE -n 1000000000 -k -f /data/stressor/stressor.log -o INSERT
cassandra-stress -d $cCassList -t 270 -r -p 7102 -e ONE -n 1000000000 -k -f /data/stressor/stressor.log -o READ


CL LOCAL_QUORUM

Most application developers starting off with C* will default to CL Quorum writes and reads. This gives them the opportunity to dip their toes into the distributed database world without also having to tackle the extra challenge of rethinking their applications for eventual consistency.
The write throughput is just below 200K writes/sec with an average coordinator latency of 1.75 milliseconds and a 95th percentile of 20 milliseconds.
The read throughput is around 600K reads/sec with an average coordinator latency of 3.4 milliseconds and a 95th percentile of 35 milliseconds.
[Figure: Mixed workload write throughput at CL LOCAL_QUORUM (MixedWriteCLQuorum.png)]
[Figure: Mixed workload read throughput at CL LOCAL_QUORUM (MixedReadCLQuorum.png)]
Every client node ran the following 2 Cassandra Stress commands:
cassandra-stress -d $cCassList -t 30 -r -p 7102 -e LOCAL_QUORUM -n 1000000000  -k -f [path to log] -o INSERT
cassandra-stress -d $cCassList -t 270 -r -p 7102 -e LOCAL_QUORUM -n 1000000000  -k -f [path to log] -o READ

Cost

The total costs involved in running this test include the EC2 instance costs as well as the inter-zone network traffic costs. We use Boundary to monitor our C* network usage.
[Figure: Inter-zone network traffic as reported by Boundary (boundary.png)]
The above shows that we were transferring a total of about 30Gbps between Availability Zones.
Here is the breakdown of the costs incurred to run the 1 million writes per second test. These are retail prices, which can be referenced here.
Instance Type / Item   Cost per Minute   Count                                    Total Price per Minute
i2.xlarge              $0.0142           285                                      $4.047
r3.xlarge              $0.0058           60                                       $0.348
Inter-zone traffic     $0.01 per GB      3.75 GB/s * 60 s = 225 GB per minute     $2.25

Total Cost per Minute                                                             $6.645
Total Cost per Half Hour                                                          $199.35
Total Cost per Hour                                                               $398.70
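To make the arithmetic explicit, the per-minute totals above can be reproduced with a few lines of code. The figures come straight from the table; the script itself is just a worked example, not part of the test tooling.

# Worked example reproducing the cost table above.
i2_xlarge  = 0.0142 * 285          # 285 Cassandra nodes
r3_xlarge  = 0.0058 * 60           # 60 client nodes
# ~30 Gbps of inter-zone traffic is ~3.75 GB/s, or 225 GB/minute at $0.01/GB
inter_zone = 3.75 * 60 * 0.01

per_minute = i2_xlarge + r3_xlarge + inter_zone
print(round(per_minute, 3))        # ~6.645 per minute
print(round(per_minute * 30, 2))   # ~199.35 per half hour
print(round(per_minute * 60, 2))   # ~398.70 per hour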


Final Thoughts

Most companies probably don’t need to process this much data. For those that do, this is a good indication of what types of cost, latencies and throughput one could expect while using the newer i2 and r3 AWS instance types. Every application is different, and your mileage will certainly vary.
This test was performed over the course of a week in my free time. It isn't an exhaustive performance study, nor did I get into any deep C*, system, or JVM tuning. I know you can probably do better. If working with distributed databases at scale and squeezing out every last drop of performance is what drives you, please join the Netflix CDE team.

Wednesday, July 9, 2014

Billing & Payments Engineering Meetup

On June 18th, we hosted our first Billing & Payments Engineering Meetup at Netflix.
We wanted to create a space for exchanging information and learning among professionals. That space would serve as a forum, or an agora, for a community of people sharing the same interests in the engineering aspects of billing & payment systems.
The billing and payments space is a dynamic and innovative environment that requires increased attention as it evolves. Many of the Bay Area's tech companies may have different core products, yet we all monetize in a fairly similar way. Most built their billing systems internally and had to overcome similar technical and business challenges as they grew. Moreover, as our companies expand internationally, the ability to process foreign payment methods is becoming a critical and potentially defining factor in maximizing the chances of success.

Several trend-setting companies responded to our invitation to speak to the large audience that came looking for tips and industry best practices.
Below is a recap of the agenda:
  • Mathieu Chauvin - Engineering Manager for Payments @ Netflix
  • Taylor Wicksell - Sr. Software Engineer for Billing @ Netflix
  • Jean-Denis Greze - Engineer @ Dropbox
  • Alec Holmes - Software Engineer @ Square
  • Emmanuel Cron - Software Engineer III, Google Wallet @ Google
  • Paul Huang - Engineering Manager @ Survey Monkey
  • Anthony Zacharakis - Lead Engineer @ Lumos Labs
  • Shengyong Li / Feifeng Yang - Dir. Engineering Commerce / Tech Lead Payment @ Electronic Arts
Below you can find the aggregate presentations. Thanks again to the presenters for sharing this material.


After the presentations, we held a networking session and engaged in very interesting conversations. It was a great event and another one will come up soon. Stay tuned on the meetup page to be notified!
http://www.meetup.com/Netflix-Billing-Payments-Engineering/

Netflix is always looking for talented people. If you share our passion for billing & payments innovation, check out our Careers page!
http://jobs.netflix.com/jobs.php?id=NFX00084


Sunday, July 6, 2014

Scale and Performance of a Large JavaScript Application



We recently held our second JavaScript Talks event at our Netflix headquarters in Los Gatos, Calif. Matt Seeley discussed the development approaches we use at Netflix to build the JavaScript applications which run on TV-connected devices, phones and tablets. These large, rich applications run across a wide range of devices and require carefully managing network resources, memory and rendering. This talk explores various approaches the team uses to build well-performing UIs, monitor application performance, write consistent code, and scale development across the team.

The video from the talk can be found below, along with the slides at: https://speakerdeck.com/mseeley/life-on-the-grid.

And don't forget to check out videos from our past JavaScript Talks events on the Netflix UI Engineering YouTube Channel.







Monday, June 30, 2014

Announcing Security Monkey - AWS Security Configuration Monitoring and Analysis

We are pleased to announce the open source availability of Security Monkey, our solution for monitoring and analyzing the security of our Amazon Web Services configurations.


At Netflix, responsibility for delivering the streaming service is distributed and the environment is constantly changing. Code is deployed thousands of times a day, and cloud configuration parameters are modified just as frequently. To understand and manage the risk associated with this velocity, the security team needs to understand how things are changing and how these changes impact our security posture.
Netflix delivers its service primarily out of Amazon Web Services’ (AWS) public cloud, and while AWS provides excellent visibility of systems and configurations, it has limited capabilities in terms of change tracking and evaluation. To address these limitations, we created Security Monkey - the member of the Simian Army responsible for tracking and evaluating security-related changes and configurations in our AWS environments.

Overview of Security Monkey

We envisioned and built the first version of Security Monkey in 2011. At that time, we used a few different AWS accounts and delivered the service from a single AWS region. We now use several dozen AWS accounts and leverage multiple AWS regions to deliver the Netflix service. Over its lifetime, Security Monkey has ‘evolved’ (no pun intended) to meet our changing and growing requirements.

Viewing IAM users in Security Monkey - highlighted users have active access keys.
There are a number of security-relevant AWS components and configuration items - for example, security groups, S3 bucket policies, and IAM users. Changes or misconfigurations in any of these items could create an unnecessary and dangerous security risk. We needed a way to understand how AWS configuration changes impacted our security posture. It was also critical to have access to an authoritative configuration history service for forensic and investigative purposes so that we could know how things have changed over time. We also needed these capabilities at scale across the many accounts we manage and many AWS services we use.
Security Monkey's filter interface allows you to quickly find the configurations and items you're looking for.
These needs are at the heart of what Security Monkey is - an AWS security configuration tracker and analyzer that scales for large and globally distributed cloud environments.

Architecture

At a high-level, Security Monkey consists of the following components:
  • Watcher - The component that monitors a given AWS account and technology (e.g. S3, IAM, EC2). The Watcher detects and records changes to configurations. So, if a new IAM user is created or if an S3 bucket policy changes, the Watcher will detect this and store the change in Security Monkey’s database.
  • Notifier - The component that lets a user or group of users know when a particular item has changed. This component also provides notification based on the triggering of audit rules.
  • Auditor - The component that executes a set of business rules against an AWS configuration to determine the level of risk associated with the configuration. For example, a rule may look for a security group with a rule allowing ingress from 0.0.0.0/0 (meaning the security group is open to the Internet). Or, a rule may look for an S3 policy that allows access from an unknown AWS account (meaning you may be unintentionally sharing the data stored in your S3 bucket). Security Monkey has a number of built-in rules included, and users are free to add their own rules (a simplified sketch of such a rule follows below).
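To make the Auditor idea concrete, here is a minimal, hypothetical sketch of a rule that flags security groups open to the Internet. The function name and the shape of the configuration dictionary are assumptions made for illustration; they are not Security Monkey's actual plugin API.

def audit_security_group(sg_config):
    """Return a list of issues for one security-group configuration.
    sg_config is assumed to resemble the JSON AWS returns for a security
    group, e.g. {"name": ..., "rules": [{"cidr_ip": ..., "from_port": ...}]}."""
    issues = []
    for rule in sg_config.get("rules", []):
        if rule.get("cidr_ip") == "0.0.0.0/0":
            issues.append(
                "Security group '%s' allows ingress from 0.0.0.0/0 on port %s"
                % (sg_config.get("name"), rule.get("from_port"))
            )
    return issues

# Example: an SSH rule open to the world gets flagged.
sg = {"name": "web-sg", "rules": [{"cidr_ip": "0.0.0.0/0", "from_port": 22}]}
print(audit_security_group(sg))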

In terms of technical components, we run Security Monkey in AWS on Ubuntu Linux, and storage is provided by a PostgreSQL RDS database. We currently run Security Monkey on a single m3.large instance - this instance type has been able to easily monitor our dozens of accounts and many hundreds of changes per day.

The application itself is written in Python using the Flask framework (including a number of Flask plugins). At Netflix, we use our standard single-sign on (SSO) provider for authentication, but for the OSS version we’ve implemented Flask-Login and Flask-Security for user management. The frontend for Security Monkey’s data presentation is written in Angular Dart, and JSON data is also available via a REST API.

General Features and Operations

Security Monkey is relatively straightforward from an operational perspective. Installation and AWS account setup are covered in the installation document, and Security Monkey does not rely on other Netflix OSS components to operate. Generally, operational use includes:
  • Initial Configuration
    • Setting up one or more Security Monkey users to use/administer the application itself.
    • Setting up one or more AWS accounts for Security Monkey to monitor.
    • Configuring user-specific notification preferences (to determine whether or not a given user should be notified for configuration changes and audit reports).
  • Typical Use Cases
    • Checking historical details for a given configuration item (e.g. the different states a security group has had over time).
    • Viewing reports to check what audit issues exist (e.g. all S3 policies that reference unknown accounts or all IAM users that have active access keys).
    • Justifying audit issues (providing background or context on why a particular issue exists and is acceptable even though it may violate an audit rule).

Note on AWS CloudTrail and AWS Trusted Advisor

CloudTrail is AWS’ service that records and logs API calls. Trusted Advisor is AWS’ premium support service that automatically evaluates your cloud deployment against a set of best practices (including security checks).

Security Monkey predates both of these services and meets a bit of each service's goals while having unique value of its own:
  • CloudTrail provides verbose data on API calls, but has no sense of state in terms of how a particular configuration item (e.g. a security group) has changed over time. Security Monkey provides exactly this capability (a sketch of this kind of state diffing follows below).
  • Trusted Advisor has some excellent checks, but it is a paid service and provides no means for the user to add custom security checks. For example, Netflix has a custom check to identify whether a given IAM user matches a Netflix employee user account, something that is impossible to do via Trusted Advisor. Trusted Advisor is also a per-account service, whereas Security Monkey scales to support and monitor an arbitrary number of AWS accounts from a single Security Monkey installation.
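As an illustration of that state tracking, here is a minimal sketch of diffing two configuration snapshots to produce a change record. The plain-dictionary snapshot format is an assumption for the example, not Security Monkey's actual data model.

def diff_config(previous, current):
    """Report which keys were added, removed, or changed between two
    snapshots of a configuration item (e.g. a security group)."""
    added = {k: current[k] for k in current if k not in previous}
    removed = {k: previous[k] for k in previous if k not in current}
    changed = {
        k: (previous[k], current[k])
        for k in current
        if k in previous and previous[k] != current[k]
    }
    return {"added": added, "removed": removed, "changed": changed}

# Example: a security group picks up a new ingress rule between snapshots.
before = {"name": "web-sg", "ingress": ["10.0.0.0/8:443"]}
after = {"name": "web-sg", "ingress": ["10.0.0.0/8:443", "0.0.0.0/0:22"]}
print(diff_config(before, after))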

Open Items and Future Plans

Security Monkey has been in production use at Netflix since 2011 and we will continue to add additional features. The following list documents some of our planned enhancements.
  • Integration with CloudTrail for change detail (including originating IP, instance, IAM account).
  • Ability to compare different configuration items across regions or accounts.
  • CSRF protections for form POSTs.
  • Content Security Policy headers (currently awaiting a Dart issue to be addressed).
  • Additional AWS technology and configuration tracking.
  • Test integration with moto.
  • SSL certificate expiry monitoring.
  • Simpler installation script and documentation.
  • Roles/authorization capabilities for admin vs. user roles.
  • More refined AWS permissions for Security Monkey operations (the current policy in the install docs is a broader read-only role).
  • Integration with edda, our general purpose AWS change tracker. On a related note, our friends at Prezi have open sourced reddalert, a security change detector that is itself integrated with edda.

Conclusion


Security Monkey has helped the security teams @ Netflix gain better awareness of changes and security risks in our AWS environment. Its approach fits well with the broader Simian Army philosophy of continuously monitoring for and detecting potential anomalies and risky configurations, and we look forward to seeing how other AWS users choose to extend and adapt its capabilities. Security Monkey is now available on our GitHub site.

If you’re in the San Francisco Bay Area and would like to hear more about Security Monkey (and see a demo), our August Netflix OSS meetup will be focused specifically on security. It’s scheduled for August 20th and will be held at Netflix HQ in Los Gatos.

-Patrick Kelley, Kevin Glisson, and Jason Chan (Netflix Cloud Security Team)