Thursday, September 26, 2013

NetflixOSS Meetup S1E4 - Cloud Prize Nominations

by Adrian Cockcroft

We launched the Netflix Open Source Software Cloud Prize in March 2013 and it got a lot of attention in the press and blogosphere. Six months later we closed the contest, took a good look at the entrants, picked the best as nominees and announced them at a Netflix Meetup in Los Gatos on September 25th. The next step is for the panel of distinguished judges to decide who wins in each category, and the final winners will be announced at AWS Re:Invent in November.

Starting the nominations with some “monkey business”, we were looking for additions to the Netflix Simian Army that automates operations and failure testing for NetflixOSS. We have three nominees who between them built sixteen new monkeys, and one portability extension.

Peter Sankauskas (pas256, of San Mateo California) who built Backup Monkey and Graffiti Monkey concentrated on automating management of attached Elastic Block Store volumes. Backup Monkey makes sure you always have snapshot backups, and Graffiti monkey tags EBS volumes with information about the host they are attached to, so you can always figure out what is on each EBS volume or snapshot. By using monkeys to do this, we can be sure that every EBS volume is tagged the same way, and that we always have snapshot backups, even though creation of volumes can be done “self service” by any developer. Peter also contributed Ansible playbooks and many pre-built AMIs to make it easier for everyone else to get started with NetflixOSS, and put Asgard and Edda in the AWS Marketplace. He recently started his own company AnsWerS to help people who want to move to AWS.

Justin Santa Barbara (justinsb of San Franciso, California) decided to make the Chaos Monkey far more evil, and created fourteen new variants, a “barrel of chaos monkeys”. They interfere with the network, causing routing failure, packet loss, network data corruption and extra network latency. They block access to DNS, S3, DynamoDB and the EC2 control plane. They interfere with storage, by disconnecting EBS volumes, filling up the root disk, and saturating the disks with IO requests. They interfere with the CPU by consuming all the spare cycles, or killing off all the processes written in Python or Java. When run, a random selection is made, and the victims suffer the consequences. This is an excellent but scary workout for our monitoring and repair/replacement automation.

Our original Chaos Monkey framework keeps a small amount of state and logs what it does in SimpleDB, which is an AWS specific service. To make it easier to run Chaos Monkey in other clouds, or datacenter environments such as Eucalyptus or OpenStack John Gardner (huxoll, of Austin Texas) generalized the interface and provided a sample implementation that he calls “Monkey Recorder” which writes to local disk.

Keeping with the animal theme, we have some “piggy business”. Anyone familiar with the Hadoop tools and the big data ecosystem knows about the Pig language. It provides a way to specify a high level dataflow for processing but the Pig scripts can get complex and hard to debug. Netflix built and open sourced a visualization and monitoring tool called Lipstick, and it was adopted by a vendor called Mortardata (mortardata, of New York, NY) who worked with us to generalize some of the interfaces and integrate it with their own Pig based Hadoop platform. We saved Mortardata from having to create their own tool to do this, and Netflix now has an enthusiastic partner to help improve and extend Lipstick so everyone who uses it benefits.

Business computing is what IBM is known for. They have put in a lot of work and produced two related entries. IBM had previously created a demonstration application called Acme Air for their Websphere tools running on IBM Smartcloud. It was a fairly conventional enterprise architecture application, with a Java front end and a database back end. For their first prize entry, Andrew Spyker (aspyker, of Raleigh North Carolina) figured out how to re-implement Acme Air as a cloud native example application using NetflixOSS libraries and component services, running on AWS. He then ran some benchmark stress tests to demonstrate scalability. This was demonstrated at a Netflix Meetup last summer. Following on, a team led by Richard Johnson (EmergingTechnologyInstitute, of Raleigh North Carolina) including Andrew Spyker and Jonathan Bond ported several components of NetflixOSS to the IBM Softlayer cloud using Rightscale to provide autoscaling functionality. This involved ports of the Eureka service registry, Hystrix circuit breaker pattern, Karyon base server framework, Ribbon http client and Asgard provisioning portal. They even made a video demo of the final product and put it up on YouTube. This was a lot of work, but IBM sees the value in getting a deep understanding of Cloud Native architecture and tools, which it can then figure out how to apply to helping enterprise customers make the transition to cloud.

Acme Air is a fairly simple application with a web based user interface, but in the real world complex web service APIs are hard to manage, and NetflixOSS includes the Zuul API gateway, which is used to authenticate process and route http requests. The next nomination is from Neil Beveridge (neilbeveridge, of Kent, United Kingdom). He was interested in porting the Zuul container from Tomcat to Netty, which also provides non-blocking output requests, and benchmarking the difference. He ran into an interesting problem with Netty consuming excess CPU and running slower than the original Tomcat version, and then ran into the contest deadline, but plans to continue work to debug and tune the Netty code. Since Netflix is also looking at moving some of our services from Tomcat to Netty, this is a useful and timely contribution. It’s also helpful to other people considering using Zuul to have some published benchmarks to show the throughput on a common AWS instance type.

Eucalyptus have been using NetflixOSS to provide a proof point for portability of applications from AWS to private clouds based on Eucalyptus. In June 2013 they shipped a major update that included advanced AWS features such as Autoscale Groups that NetflixOSS depends on. To support the extra capabilities of Eucalyptus and the ability to deploy applications to AWS regions and Eucalyptus datacenters from the same Asgard console, Chris Grzegorczyk and Greg Dekoenigsberg (eucaflix, grze and gregdek, of Goleta California) made a series of changes to NetflixOSS projects which they submitted as a prize entry. They have demonstrated working code at several Netflix meetups.

Turbine is a real time monitoring mechanism that provides a continuous stream of updates for tracking the Hystrix circuit breaker pattern that is used to protect API calls from broken dependencies. Improvements by Michael Rose (Xorlev, of Lakewood, Colorado) extended Turbine so that it can be used in environments that are using Zookeeper for service discovery rather than Eureka.

Cheng Ma and Joe Gardner (xiaoma318 & joehack3r, of Houston, Texas) built three related user interface tools MyEdda, MyJanitor and ASG Console to simplify operations monitoring. Edda collects a complete history of everything deployed in an AWS account, Janitor Monkey uses Edda to find entities such as unused empty Autoscaling Groups that it can remove. Fei Teng (veyronfei, of Sydney Australia) built Clouck, a user interface that keeps track of AWS resources across regions. These tools let you see what is going on more easily.

EC2box by Sean Kavanagh (skavanagh, of Louisville, Kentucky) is a web based ssh console that can replicate command line operations across large numbers of instances, and also acts as a control and audit point so that operations by many engineers can be coordinated and managed centrally.

When we started to build the Denominator library for portable DNS management we contacted Neustar to discuss their UltraDNS product, and made contact with Jeff Damick (jdamick, of South Riding, Virginia). His input as we structured the early versions of Denominator was extremely useful, and provides a great example of the power of developing code in public. We were able to tap into his years of experience with DNS management, and he was able to contribute code, tests and fixes to the Denominator code and fixes to the UltraDNS API itself.

Jakub Narloch (jmnarloch, of Szczecin, Poland) started out by helping to configure the JBoss Arquillian test framework to do unit tests for the Denominator DNS integration with UltraDNS. This work was extended to include Karyon, the base server that underpins NetflixOSS services and acts as the starting point for developing new services. Since Genie is based on Karyon, we were able to leverage this integration to use Arquilian to test Genie, and the changes have been merged into the code that Netflix uses internally.

Feign is a simple annotation based way to construct http clients for Java applications. It was developed as a side project of Denominator, and David Carr (davidmc24, of Clifton Park, New York) has emerged as a major contributor to Feign. As well as a series of pull requests that simplified common Feign use cases, he’s acted as a design reviewer for changes proposed by Netflix engineering, and we just proposed adding him as a committer on the project. This is another example of the value of developing code “in public”.

Netflix uses our Servo instrumentation library to annotate and collect information and post it into AWS Cloudwatch, but many people use a somewhat similar library developed by Coda Hale at Yammer, which is called Metrics. Maheedhar Gunturu (mailmahee, of Santa Clara, California) instrumented the Astyanax Cassandra client library to generate Yammer Metrics. Astyanax is one of the most widely used NetflixOSS projects, and it’s used in many non-AWS contexts so this is a useful generalization.

Priam is our Java Tomcat service that runs on each node in a Cassandra cluster to manage creation and backup of Cassandra. Sean McCully (seanmccully, of San Antonio, Texas) ported the functionality of Priam from Java to Python and called it Hector. This is useful in environments that don’t use Java or that run smaller Cassandra clusters on small instances where the reduced memory overhead of a Python implementation leaves more space for Cassandra itself.

Although the primary storage used by Netflix is based on Cassandra, we also use AWS RDS to create several small MySQL databases for specific purposes. Other AWS customers use RDS much more heavily. Jiaqui Guo (jiaqui, Chicago, Illinois) has built Datamung to automate backup of RDS to S3 and replication of backups across regions for disaster recovery.

Abdelmonaim Remani (PolymathicCoder, Capitola, California) has built a DynamoDB Framework that is similar to the way we use Astyanax for Cassandra. As well as providing a nice annotation based Java interface the framework adds extra functionality such as cross regional replication by managing access to multiple DynamoDB back ends.

The Reactive Extensions (Rx) pattern is one of the most advanced and powerful concepts for structuring code to come out in recent years. The original work on Rx at Microsoft by Eric Meijer inspired Netflix to create the RxJava project. We started with a subset of Rx functionality and left a lot of “to do” areas. This inspired Mairbek Khadikov (mairbek, Kharkiv, Ukraine) to help us fill in the missing features with over thirty pull requests for the RxJava project. As the project matured we began to extend RxJava to include other JVM based languages and Joachim Hofer (jmhofer, Möhrendorf, Germany) made major contribution to type safety and Scala support, again with over thirty pull requests.

When Paypal decided they wanted a developer oriented console for their OpenStack based private cloud they took a look at Asgard and realized that it was close enough to what they wanted, so they could use it as a starting point. Anand Palanisamy (paypal, San Jose, California) submitted Aurora, their fork of Asgard, as a prize entry, and demonstrated it running at one of the Netflix meetups.

Riot Games have adopted many NetflixOSS components and extended them to meet their own needs. They demonstrated their work at the summer NetflixOSS Meetup, and Asbjorn Kjaer (bunjiboys, Santa Monica, California) submitted Chef-Solo for AMInator, which integrates Chef recipes for building systems with the immutable AMI based instance model that Netflix uses for deployments.

We are pleased to have such a wide variety of nominations, from individuals around the world, small and large companies, vendors and end users. Many thanks to all of them for the work they have put into helping grow the NetflixOSS ecosystem, and thanks to everyone else who just uses NetflixOSS or entered the contest but didn’t make the cut. Next the judges will pick ten winners and we will be contacting them and secretly flying them to Las Vegas in November. There they will be announced, meet with the judges and the Netflix team and pick up their $10K prize money, $5K AWS credits and Cloud Monkey trophy.

Wednesday, September 25, 2013

Glisten, a Groovy way to use Amazon's Simple Workflow Service

by Clay McCoy

While adding a new automated deployment feature to Asgard we realized that our current in-memory task system was not sufficient for the new demands. There would now be tasks that would be measured in hours or days rather than minutes and that work needed to be resilient to the failure of a single Asgard instance. We also wanted better asynchronous task coordination and the ability to distribute these tasks among a fleet of Asgard instances.

Amazon's Simple Workflow Service

Amazon's Simple Workflow Service (SWF) is a task based API for building highly scalable and resilient applications. With SWF the progress of your tasks is persisted by AWS while all the actual work is still done on your own servers. Your services poll for decision tasks and activity tasks. Decision tasks simply determine what to do next (start an activity, start a timer...) based on the workflow progress so far. This is high level logic that orchestrates your activities and should execute very quickly. Activity tasks are where real processing is performed (calculations, contacting remote services, I/O...). SWF was exactly what we were looking for in a distributed task system, but we quickly realized that it can be arduous writing a workflow against the base SWF API. It is up to you to do a lot of low level operations and your actual application logic can get lost in the mix.

Amazon's Flow Framework

Amazon anticipated our predicament and provided the Flow Framework which is a higher level API on top of SWF. It minimizes SWF based boilerplate code and makes your workflow look more like ordinary Java code. It also provides a lot of useful SWF infrastructure for registering SWF objects, polling for tasks, analyzing workflow history, and responding with decisions. Flow enforces a programming model where you implement your own interfaces for workflows and activities.

The interfaces contain special Flow annotations that identify their roles and allow specification of versions, timeouts, and more.
    defaultExecutionStartToCloseTimeoutSeconds = 60L)
interface TestWorkflow {
@Execute(version = '1.0')
    void doIt()
@Activities(version = '1.0')
    defaultTaskScheduleToStartTimeoutSeconds = -1L,
    defaultTaskStartToCloseTimeoutSeconds = 300L)
interface TestActivities {
    String doSomething()
    void consumeSomething(String thing)
Flow generates code to make your activities asynchronous. Promises will need to wrap your activity method return values and parameters. Rather than the TestActivities above, you will program against the generated TestActivitiesClient below.
interface TestActivitiesClient {
    Promise<String> doSomething()
    void consumeSomething(Promise<String> thing)
The workflow implementation is your decider logic which gets replayed repeatedly until your workflow is complete. In your workflow implementation you can reference the generated activities client that was just described. Flow uses AspectJ and @Asynchronous annotations on methods to ensure that promises are ready before executing the method body that uses their results. In this example, 'doIt' is the entry point to the workflow due to the @Execute annotation on the interface above. First we 'doSomething' and wait on the result before we send it to 'consumeSomething'.
class TestWorkflowImpl implements TestWorkflow {
    private final TestActivitiesClient client = new TestActivitiesClientImpl();

    void doIt() {
        Promise<String> result = client.doSomething()

    void waitForSomething(Promise<String> something) {
Flow clearly offers a lot of help in easing the use of SWF. Unfortunately its dependence on AspectJ and code generation kept us from using it as is. Asgard is a Groovy and Grails application that already has enough byte code manipulation and runtime magic. Since Groovy itself is well suited to the job of hiding boilerplate code we began to wonder if we could use it to get what we wanted from SWF.

Netflix OSS Glisten

Glisten is an ease of use SWF library developed at Netflix. It still uses core Flow objects but does not require AspectJ or code generation. Glisten provides WorkflowOperations and ActivitiesOperations interfaces that can be used by your WorkflowImplementation and ActivitiesImplementation classes respectively. All of the SWF specifics are hidden behind these operation interfaces in specific SWF implementations. There are also local implementations that allow for easy unit testing of workflows and activities.
Let's take a look at what a Glisten based workflow implementation looks like. Without code generation or AspectJ we no longer have the use of generated clients or the @Asynchronous annotation. Instead we use WorkflowOperations to provide 'activites' and 'waitFor' in addition to many other workflow concerns. Note that the Groovy annotation @Delegate is used here to allow the WorkflowOperations' public methods to appear on TestWorkflowImpl itself just to clean up the code. Like in the Flow example above, the 'doSomething' activity is scheduled and then we 'waitFor' its result to be ready. Once ready, the closure is executed where the 'consumeSomething' activity is provided with an 'it' parameter. In Groovy you can use 'it' to refer to an implicit parameter that is made available to the closure. Here 'it' is the result of the Promise passed into 'waitFor'. This is a pretty dense example of how we are using Groovy to handle some of the syntactic sugar that we lost from Flow by removing AspectJ and code generation.
class TestWorkflowImpl implements TestWorkflow {
    WorkflowOperations<TestActivities> workflowOperations = SwfWorkflowOperations.of(TestActivities)

    void doIt() {
        waitFor(activities.doSomething()) {
Glisten is a lightweight way to use SWF and only requires a dependency on Groovy. Most of your code can still be written in Java if you prefer. Glisten is currently used in Asgard to enable long lived deployment tasks. There is a comprehensive example workflow in the Glisten codebase and documented on the wiki. It demonstrates many SWF features (timers, parallel tasks, retries, error handling...) along with unit tests.

Glisten makes it easier for us to use Amazon's SWF, and maybe it can help you too. If you are interested in helping develop projects like this feel free to contribute or even join us at Netflix.