Monday, November 23, 2015

Creating Your Own EC2 Spot Market -- Part 2

In Part 1 of this series, Creating Your Own EC2 Spot Market, we explained how Netflix manages its EC2 footprint and how we take advantage of our daily peak of 12,000 unused instances, which we named the “internal spot market.”  This sizeable trough has significantly improved our encoding throughput, and we are pursuing other benefits from this large pool of unused resources.
The Encoding team went through two iterations of internal spot market implementations.  The initial approach was a simple schedule-based borrowing mechanism that was quickly deployed in June in the us-east region to reap immediate benefits.  We then applied the experience we gained to the next iteration of the design, which is based on real-time availability.
The main challenge of using spot instances effectively is handling the dynamic nature of our instance availability.  With correct timing, running spot instances is effectively free; when the timing is off, however, any EC2 usage is billed at the on-demand price.  In this post we will discuss how the real-time, availability-based internal spot market system works and how it efficiently uses the unused capacity.
Benefits of Extra Capacity
The encoding system at Netflix is responsible for encoding master media source files into many different output formats and bitrates for all Netflix supported devices.  A typical workload is triggered by source delivery, and sometimes the encoding system receives an entire season of a show within moments.  By leveraging the internal spot market, we have measured the equivalent of a 210% increase in encoding capacity.  With the extra boost of computing resources, we have improved our ability to handle a sudden influx of work and to quickly reduce our backlog.
In addition to the production environment, the encoding infrastructure maintains 40 “farms” for development and testing.  Each farm is a complete encoding system with 20+ micro-services that matches the capability and capacity of the production environment.  
Computing resources are continuously evaluated and redistributed based on workload.  With the boost of spot market instances, the total encoding throughput increases significantly.  On the R&D side, researchers leverage these extra resources to carry out experiments in a fraction of the time it used to take.  Our QA automation is able to broaden the coverage of our comprehensive continuous integration suite and run these jobs in less time.
Spot Market Borrowing in Action
We started the new spot market system in October, and we are encouraged by the improved performance compared to our borrowing in the first iteration.
For instance, in one of the research projects, we triggered 12,000 video encoding jobs over a weekend.  We had anticipated the work to finish in a few days, but we were pleasantly surprised to discover that the jobs were completed in only 18 hours.
The following graph captures that weekend’s activity.
The Y-axis denotes the number of video encoder jobs queued in the messaging system; the red line represents high-priority jobs, and the yellow area shows the medium- and low-priority jobs.
Important Considerations
  • By launching on-demand instances in the Encoding team AWS account, the Encoding team never impacts guaranteed capacity (reserved instances) from the main Netflix account.
  • The Encoding team competes for on-demand instances with other Netflix accounts.   
  • Spot instance availability fluctuates, and instances can become unavailable at any moment.  The encoding service needs to react to these changes swiftly.
  • It is possible to dip into unplanned on-demand usage due to a sudden surge of instance usage in other Netflix accounts while we have internal spot instances running.  The benefits of borrowing must significantly outweigh the cost of these on-demand charges.
  • Available spot capacity comes in different types and sizes.  We can make the most out of them by making our jobs instance type agnostic.
Design Goals
Cost Effectiveness: Use as many spot instances as are available.  Incur as little unplanned on-demand usage as possible.
Good Citizenship: We want to minimize contention that may cause a shortage in the on-demand pool.  We take a light-handed approach by yielding spot instances to other Netflix accounts when there is competition on resources.
Automation: The Encoding team invests heavily in automation.  The encoding system is responsible for encoding activities for the entire Netflix catalog 24x7, hands free.  Spot market borrowing needs to function continuously and autonomously.
Metrics: Collect Atlas metrics to measure effectiveness, pinpoint areas of inefficiency, and trend usage patterns in near real-time.
Key Borrowing Strategies
We spent a great deal of effort devising strategies to address the goals of Cost Effectiveness and Good Citizenship.  We started with a set of simple assumptions and then constantly iterated using our monitoring system, allowing us to validate and fine-tune the initial design into the set of strategies below:
Real-time Availability Based Borrowing: Closely align utilization with the fluctuating real-time spot instance availability using a Spinnaker API.  Spinnaker is a Continuous Delivery platform that manages Netflix reservations and deployment pipelines.  It is in the optimal position to know which instances are in use across all Netflix accounts.
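As a rough illustration of what availability-based borrowing can look like, the sketch below polls a Spinnaker-like service for the gap between reserved and in-use capacity.  The endpoint URL, response fields, and overall shape are assumptions for illustration only, not the actual Spinnaker API.

    # Hypothetical sketch: poll a Spinnaker-like service for unused reserved capacity.
    # The endpoint and response fields are assumptions, not the real Spinnaker API.
    import requests

    RESERVATION_REPORT_URL = "https://spinnaker.example.com/reservationReport"  # assumed

    def get_internal_spot_availability():
        """Return a map of (instance_type, az) -> count of unused reserved instances."""
        report = requests.get(RESERVATION_REPORT_URL, timeout=10).json()
        availability = {}
        for entry in report:  # assumed: one record per instance type per AZ
            surplus = entry["reserved"] - entry["inUse"]
            if surplus > 0:
                availability[(entry["instanceType"], entry["availabilityZone"])] = surplus
        return availability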
Negative Surplus Monitor: Sample spot market availability, and quickly terminate (yield) borrowed instances when we detect an overdraft of internal spot instances.  This ensures that our spot borrowing is treated as the lowest-priority usage in the company and reduces on-demand contention.
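A minimal sketch of such a monitor is shown below, assuming hypothetical helpers for sampling the surplus and terminating borrowed instances; the real implementation and its sampling interval are not described in this post.

    # Hypothetical negative-surplus monitor.  sample_surplus() and
    # terminate_borrowed() are assumed helpers, not real APIs.
    import time

    SAMPLE_INTERVAL_SECONDS = 60  # illustrative sampling interval

    def negative_surplus_monitor(sample_surplus, terminate_borrowed):
        """Yield borrowed instances as soon as the internal spot pool is overdrawn."""
        while True:
            surplus_by_pool = sample_surplus()  # (instance_type, az) -> surplus count
            for pool, surplus in surplus_by_pool.items():
                if surplus < 0:
                    # Borrowing has overdrafted this pool; release the excess
                    # immediately to minimize on-demand contention.
                    terminate_borrowed(pool, count=-surplus)
            time.sleep(SAMPLE_INTERVAL_SECONDS)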
Idle Instance Detection: Detect over-allocated spot instances.  Accelerate downscaling of spot instances to improve time to release, with the additional benefit of reducing our borrowing surface area.
Fair Distribution: When spot instances are abundant, distribute assignments evenly to avoid exhausting one EC2 instance type in a specific AZ.  This helps minimize on-demand shortage and contention while reducing involuntary churn due to negative surplus.
Smoothing Function: The resource scheduler evaluates assignments of EC2 instances based on a smoothed representation of the workload, filtering out jitter and spikes to prevent over-reaction.
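The post does not specify which smoothing function the scheduler uses; an exponential moving average over sampled queue depth is one simple possibility, sketched below with an arbitrary smoothing factor.

    # Illustrative exponential moving average over workload samples (e.g. queue depth).
    # The smoothing factor is an arbitrary choice for demonstration.
    def smooth(samples, alpha=0.2):
        """Return the smoothed series for a list of workload samples."""
        smoothed = []
        for sample in samples:
            if not smoothed:
                smoothed.append(float(sample))
            else:
                smoothed.append(alpha * sample + (1 - alpha) * smoothed[-1])
        return smoothed

    # A transient spike barely moves the smoothed value the scheduler reacts to.
    print(smooth([100, 100, 2000, 100, 100]))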
Incremental Stepping & Fast Evaluation Cycle: Acting in incremental steps avoids over-reaction and allows us to evaluate the workload frequently for rapid self-correction.  Incremental stepping also helps distribute instance usage across instance types and AZs more evenly.
Safety Margin: Reduce contention by leaving some of the available spot instances unused.  This helps reduce involuntary termination due to minor fluctuations in usage in other Netflix accounts.
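Putting the Safety Margin and Incremental Stepping strategies together, the per-cycle borrowing target might be derived along the following lines.  The margin, step size, and function names are illustrative assumptions, not values from the production system.

    # Illustrative per-cycle target calculation combining a safety margin with
    # incremental stepping.  The constants are placeholders, not production values.
    SAFETY_MARGIN = 50   # available spot instances intentionally left unused
    MAX_STEP = 100       # largest change applied in a single evaluation cycle

    def next_borrow_count(desired, available, current):
        """Move the borrowed-instance count toward demand, one bounded step at a time."""
        target = min(desired, max(available - SAFETY_MARGIN, 0))
        delta = target - current
        delta = max(-MAX_STEP, min(MAX_STEP, delta))  # clamp to one incremental step
        return current + delta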
Curfew: Preemptively reduce spot usage ahead of predictable, rapid drops in available surplus (e.g. the nightly Netflix personalized recommendation computation schedule).  These curfews help minimize preventable on-demand charges.
Evacuation Monitor: A system-wide toggle to immediately evacuate all borrowing usage in case of emergency (e.g. regional traffic failover), eliminating on-demand contention when it matters most.
The following graph depicts a five-day span of spot usage by instance type.
This graph illustrates a few interesting points:
  • The variation in color represents the different instance types in use, and in most cases the relatively even distribution of the color bands shows that instance type usage is reasonably balanced.
  • The sharp rise and drop of the peaks confirms that the encoding resource manager scales up and down relatively quickly in response to changes in workload.
  • The flat valleys show the frugality of instance usage. Spot instances are only used when there is work for them to do.
  • Not all color bands have the same height because the size of the reservation varies between instance types.  However, we are able to borrow from both large (orange) and small (green) pools, collectively satisfying the entire workload.
  • Finally, although this graph reports instance usage, it indirectly tracks the workload.  The overall shape of the graph shows no discernible pattern in the workload, reflecting the event-driven nature of the encoding activities.
Based on the AWS billing data from October, we summed up all the borrowing hours and adjusted them relative to the r3.4xlarge instance type that makes up the Encoding reserved capacity.  With the addition of spot market instances, the effective encoding capacity increased by 210%.
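The normalization itself is simple arithmetic: weight each instance type's borrowed hours by its capacity relative to an r3.4xlarge and sum the results.  The weights in the sketch below are placeholders for illustration; the actual conversion factors are not published in this post.

    # Illustrative normalization of borrowed hours into r3.4xlarge-equivalent hours.
    # The relative weights are placeholder assumptions.
    RELATIVE_WEIGHT = {          # capacity relative to one r3.4xlarge
        "r3.4xlarge": 1.0,
        "r3.2xlarge": 0.5,
        "m3.2xlarge": 0.4,       # assumed value for illustration
    }

    def r3_4xlarge_equivalent_hours(borrowed_hours):
        """borrowed_hours: map of instance_type -> hours borrowed during the month."""
        return sum(hours * RELATIVE_WEIGHT.get(itype, 0.0)
                   for itype, hours in borrowed_hours.items())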
Dark blue denotes spot market borrowing, and light blue represents on-demand usage.
On-demand pricing is multiple times more expensive than reserved instance pricing, and the multiple varies by instance type.  We took the October spot market usage, calculated what it would have cost at purely on-demand prices, and computed a 92% cost efficiency.
Lessons Learned
On-demand is Expensive: We already knew this fact, but the idea sank in once we observed on-demand charges resulting from sudden overdrafts of spot usage.  A number of the strategies listed in the section above (e.g. Safety Margin, Curfew) were devised specifically to mitigate this occurrence.
Versatility: Video encoding represents 70% of our computing needs.  We made some tweaks to the video encoder to run on a much wider selection of instance types.  As a result, we were able to leverage a vast number of spot market instances during different parts of the day.
Tolerance to Interruption: The encoding system is built to withstand interruptions. This attribute works well with the internal spot market since instances can be terminated at any time.
Next Steps
Although the current spot market borrowing system is a notable improvement over the previous attempt, we have only uncovered the tip of the iceberg.  In the future, we want to leverage spot market instances from different EC2 regions as they become available.  We are also heavily investing in the next generation of encoding architecture that scales more efficiently and responsively.  Here are some ideas we are exploring:
Cross Region Utilization: By borrowing from multiple EC2 regions, we triple the pool of unused reservations we can access compared to the current usable pool.  Using multiple regions also significantly reduces the concentration of on-demand usage in a single EC2 region.
Containerization: The current encoding system is based on ASG scaling.  We are actively investing in the next generation of our encoding infrastructure using container technology.  The container model will reduce overhead in ASG scaling, minimize overhead of churning, and increase performance and throughput as Netflix continues to grow its catalog.
Resource Broker: The current borrowing system is monopolistic in that it assumes the Encoding service is the sole borrower.  It is relatively easy to implement for one borrower.  We need to create a resource broker to better coordinate access to the spot surplus when sharing amongst multiple borrowers.
In the first month of deployment, we observed significant benefits in terms of performance and throughput.  We were successful in making use of Netflix idle capacity for production, research, and QA.  Our encoding velocity increased dramatically.  Experimental research turnaround time was drastically reduced.  A comprehensive full regression test finishes in half the time it used to take.  With a cost efficiency of 92%, the spot market is not completely free, but it is well worth the cost.
All of these benefits translate to faster research turnaround, improved playback quality, and ultimately a better member experience.

-- Media Cloud Engineering

Friday, November 20, 2015

Sleepy Puppy Extension for Burp Suite

Netflix recently open sourced Sleepy Puppy - a cross-site scripting (XSS) payload management framework for security assessments. One of the most frequently requested features for Sleepy Puppy has been an extension for Burp Suite, an integrated platform for web application security testing. Today, we are pleased to open source a Burp extension that allows security engineers to simplify the process of injecting payloads from Sleepy Puppy and then tracking the XSS propagation over longer periods of time and over multiple assessments.

Prerequisites and Configuration
First, you need to have a copy of Burp Suite running on your system. If you do not have a copy of Burp Suite, you can download/buy Burp Suite here. You also need a Sleepy Puppy instance running on a server. You can download Sleepy Puppy here. You can try out Sleepy Puppy using Docker. Detailed instructions on setup and configuration are available on the wiki page.

Once you have these prerequisites taken care of, please download the Burp extension here.

If the Sleepy Puppy server is running over HTTPS (which we would encourage), you need to inform the Burp JVM to trust the CA that signed your Sleepy Puppy server certificate. This can be done by importing the cert from Sleepy Puppy server into a keystore and then specifying the keystore location and passphrase while starting Burp Suite. Specific instructions include:

  • Visit your Sleepy Puppy server and export the certificate in PEM format using Firefox
  • Import the cert in PEM format into a keystore with the command below:
    keytool -import -file </path/to/cert.pem> -keystore sleepypuppy_truststore.jks -alias sleepypuppy
  • You can specify the truststore information for the plugin either as an environment variable or as a JVM option.
  • Set truststore info as environment variables and start Burp as shown below:
    export SLEEPYPUPPY_TRUSTSTORE_LOCATION=</path/to/sleepypuppy_truststore.jks>
    export SLEEPYPUPPY_TRUSTSTORE_PASSWORD=<passphrase provided while creating the truststore using the keytool command above>
    java -jar burp.jar
  • Set truststore info as part of the Burp startup command as shown below:
    java -DSLEEPYPUPPY_TRUSTSTORE_LOCATION=</path/to/sleepypuppy_truststore.jks> -DSLEEPYPUPPY_TRUSTSTORE_PASSWORD=<passphrase provided while creating the truststore using the keytool command above> -jar burp.jar
Now it is time to load the Sleepy Puppy extension and explore its functionality.

Using the Extension
Once you launch Burp and load up the Sleepy Puppy extension, you will be presented with the Sleepy Puppy tab.


This tab will allow you to leverage the capabilities of Burp Suite along with the Sleepy Puppy XSS Management framework to better manage XSS testing.

Some of the features provided by the extension include:

  • Create a new assessment or select an existing assessment
  • Add payloads to your assessment and the Sleepy Puppy server from the extension
  • When an Active Scan is conducted against a site or URL, the XSS payloads from the selected Sleepy Puppy Assessment will be executed after Burp's built-in XSS payloads
  • In Burp Intruder, the Sleepy Puppy Extension can be chosen as the payload generator for XSS testing
  • In Burp Repeater, you can replace any value in an existing request with a Sleepy Puppy payload using the context menu
  • The Sleepy Puppy tab provides statistics about any payloads that have been triggered for the selected assessment

You can watch the Sleepy Puppy extension in action on YouTube.

Interested in Contributing?
Feel free to reach out or submit pull requests if there’s anything else you’re looking for. We hope you’ll find Sleepy Puppy and the Burp extension as useful as we do!

by: Rudra Peram

Monday, November 16, 2015

Global Continuous Delivery with Spinnaker

After over a year of development and production use at Netflix, we’re excited to announce that our Continuous Delivery platform, Spinnaker, is available on GitHub. Spinnaker is an open source multi-cloud Continuous Delivery platform for releasing software changes with high velocity and confidence. Spinnaker is designed with pluggability in mind; the platform aims to make it easy to extend and enhance cloud deployment models. To create a truly extensible multi-cloud platform, the Spinnaker team partnered with Google, Microsoft and Pivotal to deliver out-of-the-box cluster management and deployment. As of today, Spinnaker can deploy to and manage clusters simultaneously across both AWS and Google Cloud Platform with full feature compatibility across both cloud providers. Spinnaker also supports deployments to Cloud Foundry; support for its newest addition, Microsoft Azure, is actively underway.

If you’re familiar with Netflix’s Asgard, you’ll be in good hands. Spinnaker is the replacement for Asgard and builds upon many of its concepts.  There is no need for a migration from Asgard to Spinnaker: changes to AWS assets via Asgard are completely compatible with changes to those same assets via Spinnaker, and vice versa.

Continuous Delivery with Spinnaker

Spinnaker facilitates the creation of pipelines that represent a delivery process, which can begin with the creation of some deployable asset (such as a machine image, jar file, or Docker image) and end with a deployment. We looked at the ways various Netflix teams implemented continuous delivery to the cloud and generalized the building blocks of their delivery pipelines into configurable Stages that are composable into Pipelines. Pipelines can be triggered by the completion of a Jenkins job, manually, via a cron expression, or even by other pipelines. Spinnaker comes with a number of stages, such as baking a machine image, deploying to a cloud provider, running a Jenkins job, or manual judgement, to name a few. Pipeline stages can be run in parallel or serially.
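As a rough sketch of how such a pipeline can be expressed as data, the example below defines a pipeline with a Jenkins trigger and bake, manual judgement, and deploy stages.  The field names are a simplified approximation of Spinnaker's pipeline configuration, not a verbatim schema, and the application and job names are hypothetical.

    # Simplified, illustrative pipeline definition: triggered by a Jenkins job,
    # it bakes an image, waits for manual judgement, then deploys.
    # Field names approximate Spinnaker's pipeline JSON and may differ in detail.
    pipeline = {
        "application": "myapp",                     # hypothetical application name
        "name": "Deploy to prod",
        "triggers": [{"type": "jenkins", "job": "myapp-build", "enabled": True}],
        "stages": [
            {"refId": "1", "type": "bake", "regions": ["us-east-1"]},
            {"refId": "2", "type": "manualJudgment", "requisiteStageRefIds": ["1"]},
            {"refId": "3", "type": "deploy", "requisiteStageRefIds": ["2"]},
        ],
    }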

Spinnaker Pipelines

Spinnaker also provides cluster management capabilities and provides deep visibility into an application’s cloud footprint. Via Spinnaker’s application view, you can resize, delete, disable, and even manually deploy new server groups using strategies like Blue-Green (or Red-Black as we call it at Netflix). You can create, edit, and destroy load balancers as well as security groups. 

Cluster Management in Spinnaker
Spinnaker is a collection of JVM-based services, fronted by a customizable AngularJS single-page application. The UI leverages a rich RESTful API exposed via a gateway service. 

You can find all the code for Spinnaker on GitHub. There are also installation instructions on how to set up and deploy Spinnaker from source, as well as instructions for deploying Spinnaker from pre-existing images that Kenzan and Google have created. We’ve set up a Slack channel, and we are committed to leveraging StackOverflow as a means for answering community questions. Issues, questions, and pull requests are welcome.

Monday, November 9, 2015

Netflix Hack Day - Autumn 2015

Last week, we hosted our latest installment of Netflix Hack Day. As always, Hack Day is a way for our product development staff to get away from everyday work, to have fun, experiment, collaborate, and be creative.

The following video is an inside look at what our Hack Day event looks like:

Video credit: Sean Williams

This time, we had 75 hacks that were produced by about 200 engineers and designers (and even some from the legal team!). We’ve embedded some of our favorites below.  You can also see some of our past hacks in our posts for March 2015, Feb. 2014 & Aug. 2014.

While we think these hacks are very cool and fun, they may never become part of the Netflix product, internal infrastructure, or otherwise be used beyond Hack Day.  We are posting them here publicly to simply share the spirit of the event.

Thanks to all of the hackers for putting together some incredible work in just 24 hours!

Netflix VHF
Watch Netflix on your Philco Predicta, the TV of tomorrow! We converted a 1950s-era TV into a smart TV that runs Netflix.

Narcos: Plata O Plomo Video Game
A fun game based on the Netflix original series, Narcos.

Stream Possible
The Netflix TV experience over a 3G cell connection, for parts of the world that are not broadband-rich.

Ok Netflix
Ok Netflix can find the exact scene in a movie or episode by listening to the dialog from that scene.  Speak the dialog into Ok Netflix and Ok Netflix will do the rest, starting the right title in the right location.

Smart Channels
A way to watch themed collections of content that are personalized and also autoplay like serialized content.

And here are some pictures taken during the event.