Thursday, September 26, 2013

NetflixOSS Meetup S1E4 - Cloud Prize Nominations

by Adrian Cockcroft

We launched the Netflix Open Source Software Cloud Prize in March 2013 and it got a lot of attention in the press and blogosphere. Six months later we closed the contest, took a good look at the entrants, picked the best as nominees and announced them at a Netflix Meetup in Los Gatos on September 25th. The next step is for the panel of distinguished judges to decide who wins in each category, and the final winners will be announced at AWS Re:Invent in November.

Starting the nominations with some “monkey business”, we were looking for additions to the Netflix Simian Army that automates operations and failure testing for NetflixOSS. We have three nominees who between them built sixteen new monkeys, and one portability extension.

Peter Sankauskas (pas256, of San Mateo California) who built Backup Monkey and Graffiti Monkey concentrated on automating management of attached Elastic Block Store volumes. Backup Monkey makes sure you always have snapshot backups, and Graffiti monkey tags EBS volumes with information about the host they are attached to, so you can always figure out what is on each EBS volume or snapshot. By using monkeys to do this, we can be sure that every EBS volume is tagged the same way, and that we always have snapshot backups, even though creation of volumes can be done “self service” by any developer. Peter also contributed Ansible playbooks and many pre-built AMIs to make it easier for everyone else to get started with NetflixOSS, and put Asgard and Edda in the AWS Marketplace. He recently started his own company AnsWerS to help people who want to move to AWS.

Justin Santa Barbara (justinsb of San Franciso, California) decided to make the Chaos Monkey far more evil, and created fourteen new variants, a “barrel of chaos monkeys”. They interfere with the network, causing routing failure, packet loss, network data corruption and extra network latency. They block access to DNS, S3, DynamoDB and the EC2 control plane. They interfere with storage, by disconnecting EBS volumes, filling up the root disk, and saturating the disks with IO requests. They interfere with the CPU by consuming all the spare cycles, or killing off all the processes written in Python or Java. When run, a random selection is made, and the victims suffer the consequences. This is an excellent but scary workout for our monitoring and repair/replacement automation.

Our original Chaos Monkey framework keeps a small amount of state and logs what it does in SimpleDB, which is an AWS specific service. To make it easier to run Chaos Monkey in other clouds, or datacenter environments such as Eucalyptus or OpenStack John Gardner (huxoll, of Austin Texas) generalized the interface and provided a sample implementation that he calls “Monkey Recorder” which writes to local disk.

Keeping with the animal theme, we have some “piggy business”. Anyone familiar with the Hadoop tools and the big data ecosystem knows about the Pig language. It provides a way to specify a high level dataflow for processing but the Pig scripts can get complex and hard to debug. Netflix built and open sourced a visualization and monitoring tool called Lipstick, and it was adopted by a vendor called Mortardata (mortardata, of New York, NY) who worked with us to generalize some of the interfaces and integrate it with their own Pig based Hadoop platform. We saved Mortardata from having to create their own tool to do this, and Netflix now has an enthusiastic partner to help improve and extend Lipstick so everyone who uses it benefits.

Business computing is what IBM is known for. They have put in a lot of work and produced two related entries. IBM had previously created a demonstration application called Acme Air for their Websphere tools running on IBM Smartcloud. It was a fairly conventional enterprise architecture application, with a Java front end and a database back end. For their first prize entry, Andrew Spyker (aspyker, of Raleigh North Carolina) figured out how to re-implement Acme Air as a cloud native example application using NetflixOSS libraries and component services, running on AWS. He then ran some benchmark stress tests to demonstrate scalability. This was demonstrated at a Netflix Meetup last summer. Following on, a team led by Richard Johnson (EmergingTechnologyInstitute, of Raleigh North Carolina) including Andrew Spyker and Jonathan Bond ported several components of NetflixOSS to the IBM Softlayer cloud using Rightscale to provide autoscaling functionality. This involved ports of the Eureka service registry, Hystrix circuit breaker pattern, Karyon base server framework, Ribbon http client and Asgard provisioning portal. They even made a video demo of the final product and put it up on YouTube. This was a lot of work, but IBM sees the value in getting a deep understanding of Cloud Native architecture and tools, which it can then figure out how to apply to helping enterprise customers make the transition to cloud.

Acme Air is a fairly simple application with a web based user interface, but in the real world complex web service APIs are hard to manage, and NetflixOSS includes the Zuul API gateway, which is used to authenticate process and route http requests. The next nomination is from Neil Beveridge (neilbeveridge, of Kent, United Kingdom). He was interested in porting the Zuul container from Tomcat to Netty, which also provides non-blocking output requests, and benchmarking the difference. He ran into an interesting problem with Netty consuming excess CPU and running slower than the original Tomcat version, and then ran into the contest deadline, but plans to continue work to debug and tune the Netty code. Since Netflix is also looking at moving some of our services from Tomcat to Netty, this is a useful and timely contribution. It’s also helpful to other people considering using Zuul to have some published benchmarks to show the throughput on a common AWS instance type.

Eucalyptus have been using NetflixOSS to provide a proof point for portability of applications from AWS to private clouds based on Eucalyptus. In June 2013 they shipped a major update that included advanced AWS features such as Autoscale Groups that NetflixOSS depends on. To support the extra capabilities of Eucalyptus and the ability to deploy applications to AWS regions and Eucalyptus datacenters from the same Asgard console, Chris Grzegorczyk and Greg Dekoenigsberg (eucaflix, grze and gregdek, of Goleta California) made a series of changes to NetflixOSS projects which they submitted as a prize entry. They have demonstrated working code at several Netflix meetups.

Turbine is a real time monitoring mechanism that provides a continuous stream of updates for tracking the Hystrix circuit breaker pattern that is used to protect API calls from broken dependencies. Improvements by Michael Rose (Xorlev, of Lakewood, Colorado) extended Turbine so that it can be used in environments that are using Zookeeper for service discovery rather than Eureka.

Cheng Ma and Joe Gardner (xiaoma318 & joehack3r, of Houston, Texas) built three related user interface tools MyEdda, MyJanitor and ASG Console to simplify operations monitoring. Edda collects a complete history of everything deployed in an AWS account, Janitor Monkey uses Edda to find entities such as unused empty Autoscaling Groups that it can remove. Fei Teng (veyronfei, of Sydney Australia) built Clouck, a user interface that keeps track of AWS resources across regions. These tools let you see what is going on more easily.

EC2box by Sean Kavanagh (skavanagh, of Louisville, Kentucky) is a web based ssh console that can replicate command line operations across large numbers of instances, and also acts as a control and audit point so that operations by many engineers can be coordinated and managed centrally.

When we started to build the Denominator library for portable DNS management we contacted Neustar to discuss their UltraDNS product, and made contact with Jeff Damick (jdamick, of South Riding, Virginia). His input as we structured the early versions of Denominator was extremely useful, and provides a great example of the power of developing code in public. We were able to tap into his years of experience with DNS management, and he was able to contribute code, tests and fixes to the Denominator code and fixes to the UltraDNS API itself.

Jakub Narloch (jmnarloch, of Szczecin, Poland) started out by helping to configure the JBoss Arquillian test framework to do unit tests for the Denominator DNS integration with UltraDNS. This work was extended to include Karyon, the base server that underpins NetflixOSS services and acts as the starting point for developing new services. Since Genie is based on Karyon, we were able to leverage this integration to use Arquilian to test Genie, and the changes have been merged into the code that Netflix uses internally.

Feign is a simple annotation based way to construct http clients for Java applications. It was developed as a side project of Denominator, and David Carr (davidmc24, of Clifton Park, New York) has emerged as a major contributor to Feign. As well as a series of pull requests that simplified common Feign use cases, he’s acted as a design reviewer for changes proposed by Netflix engineering, and we just proposed adding him as a committer on the project. This is another example of the value of developing code “in public”.

Netflix uses our Servo instrumentation library to annotate and collect information and post it into AWS Cloudwatch, but many people use a somewhat similar library developed by Coda Hale at Yammer, which is called Metrics. Maheedhar Gunturu (mailmahee, of Santa Clara, California) instrumented the Astyanax Cassandra client library to generate Yammer Metrics. Astyanax is one of the most widely used NetflixOSS projects, and it’s used in many non-AWS contexts so this is a useful generalization.

Priam is our Java Tomcat service that runs on each node in a Cassandra cluster to manage creation and backup of Cassandra. Sean McCully (seanmccully, of San Antonio, Texas) ported the functionality of Priam from Java to Python and called it Hector. This is useful in environments that don’t use Java or that run smaller Cassandra clusters on small instances where the reduced memory overhead of a Python implementation leaves more space for Cassandra itself.

Although the primary storage used by Netflix is based on Cassandra, we also use AWS RDS to create several small MySQL databases for specific purposes. Other AWS customers use RDS much more heavily. Jiaqui Guo (jiaqui, Chicago, Illinois) has built Datamung to automate backup of RDS to S3 and replication of backups across regions for disaster recovery.

Abdelmonaim Remani (PolymathicCoder, Capitola, California) has built a DynamoDB Framework that is similar to the way we use Astyanax for Cassandra. As well as providing a nice annotation based Java interface the framework adds extra functionality such as cross regional replication by managing access to multiple DynamoDB back ends.

The Reactive Extensions (Rx) pattern is one of the most advanced and powerful concepts for structuring code to come out in recent years. The original work on Rx at Microsoft by Eric Meijer inspired Netflix to create the RxJava project. We started with a subset of Rx functionality and left a lot of “to do” areas. This inspired Mairbek Khadikov (mairbek, Kharkiv, Ukraine) to help us fill in the missing features with over thirty pull requests for the RxJava project. As the project matured we began to extend RxJava to include other JVM based languages and Joachim Hofer (jmhofer, Möhrendorf, Germany) made major contribution to type safety and Scala support, again with over thirty pull requests.

When Paypal decided they wanted a developer oriented console for their OpenStack based private cloud they took a look at Asgard and realized that it was close enough to what they wanted, so they could use it as a starting point. Anand Palanisamy (paypal, San Jose, California) submitted Aurora, their fork of Asgard, as a prize entry, and demonstrated it running at one of the Netflix meetups.

Riot Games have adopted many NetflixOSS components and extended them to meet their own needs. They demonstrated their work at the summer NetflixOSS Meetup, and Asbjorn Kjaer (bunjiboys, Santa Monica, California) submitted Chef-Solo for AMInator, which integrates Chef recipes for building systems with the immutable AMI based instance model that Netflix uses for deployments.

We are pleased to have such a wide variety of nominations, from individuals around the world, small and large companies, vendors and end users. Many thanks to all of them for the work they have put into helping grow the NetflixOSS ecosystem, and thanks to everyone else who just uses NetflixOSS or entered the contest but didn’t make the cut. Next the judges will pick ten winners and we will be contacting them and secretly flying them to Las Vegas in November. There they will be announced, meet with the judges and the Netflix team and pick up their $10K prize money, $5K AWS credits and Cloud Monkey trophy.