Monday, November 23, 2015

Creating Your Own EC2 Spot Market -- Part 2

In Part 1 of this series, Creating Your Own EC2 Spot Market, we explained how Netflix manages its EC2 footprint and how we take advantage of our daily peak of 12,000 unused instances, which we named the “internal spot market.”  This sizeable trough has significantly improved our encoding throughput, and we are pursuing other benefits from this large pool of unused resources.
The Encoding team went through two iterations of internal spot market implementations.  The initial approach was a simple schedule-based borrowing mechanism that was quickly deployed in June in the us-east region to reap immediate benefits.  We applied the experience we gained to the next iteration of the design, which is based on real-time availability.
The main challenge of using the spot instances effectively is handling the dynamic nature of our instance availability.  With correct timing, running spot instances is effectively free; when the timing is off, however, any EC2 usage is billed at the on-demand price.  In this post we will discuss how the real-time, availability-based internal spot market system works and how it efficiently uses the unused capacity.
Benefits of Extra Capacity
The encoding system at Netflix is responsible for encoding master media source files into many different output formats and bitrates for all Netflix supported devices.  A typical workload is triggered by source delivery, and sometimes the encoding system receives an entire season of a show within moments.  By leveraging the internal spot market, we have measured the equivalent of a 210% increase in encoding capacity.  With the extra boost of computing resources, we have improved our ability to handle a sudden influx of work and to quickly reduce our backlog.
In addition to the production environment, the encoding infrastructure maintains 40 “farms” for development and testing.  Each farm is a complete encoding system with 20+ micro-services that matches the capability and capacity of the production environment.  
Computing resources are continuously evaluated and redistributed based on workload.  With the boost of spot market instances, the total encoding throughput increases significantly.  On the R&D side, researchers leverage these extra resources to carry out experiments in a fraction of the time it used to take.  Our QA automation is able to broaden the coverage of our comprehensive suite of continuous integration and run these jobs in less time.
Spot Market Borrowing in Action
We started the new spot market system in October, and we are encouraged by the improved performance compared to our borrowing in the first iteration.
For instance, in one of the research projects, we triggered 12,000 video encoding jobs over a weekend.  We had anticipated the work to finish in a few days, but we were pleasantly surprised to discover that the jobs were completed in only 18 hours.
The following graph captures that weekend’s activity.
The Y-axis denotes the number of video encoder jobs queued in the messaging system.  The red line represents high priority jobs, and the yellow area graph shows the number of medium and low priority jobs.
Important Considerations
  • By launching on-demand instances in the Encoding team AWS account, the Encoding team never impacts guaranteed capacity (reserved instances) from the main Netflix account.
  • The Encoding team competes for on-demand instances with other Netflix accounts.   
  • Spot instance availability fluctuates, and instances can become unavailable at any moment.  The encoding service needs to react to these changes swiftly.
  • It is possible to dip into unplanned on-demand usage due to a sudden surge of instance usage in other Netflix accounts while we have internal spot instances running.  The benefits of borrowing must significantly outweigh the cost of these on-demand charges.
  • Available spot capacity comes in different types and sizes.  We can make the most out of them by making our jobs instance type agnostic.
Design Goals
Cost Effectiveness: Use as many spot instances as are available.  Incur as little unplanned on-demand usage as possible.
Good Citizenship: We want to minimize contention that may cause a shortage in the on-demand pool.  We take a light-handed approach by yielding spot instances to other Netflix accounts when there is competition on resources.
Automation: The Encoding team invests heavily in automation.  The encoding system is responsible for encoding activities for the entire Netflix catalog 24x7, hands free.  Spot market borrowing needs to function continuously and autonomously.
Metrics: Collect Atlas metrics to measure effectiveness, pinpoint areas of inefficiency, and trend usage patterns in near real-time.
Key Borrowing Strategies
We spent a great deal of effort devising strategies to address the goals of Cost Effectiveness and Good Citizenship.  We started with a set of simple assumptions and then iterated constantly, using our monitoring system to validate and fine-tune the initial design into the following set of strategies:
Real-time Availability Based Borrowing: Closely align utilization with the fluctuating real-time spot instance availability using a Spinnaker API.  Spinnaker is a Continuous Delivery Platform that manages Netflix reservations and deployment pipelines.  It is in the optimal position to know what instances are in use across all Netflix accounts.
Negative Surplus Monitor: Sample spot market availability, and quickly terminate (yield) borrowed instances when we detect overdraft of internal spot instances.  It enforces that our spot borrowing is treated as the lowest priority usage in the company and leads to reduced on-demand contention.
Idle Instance Detection: Detect over-allocated spot instances.  Accelerate down scaling of spot instances to improve time to release, with an additional benefit of reducing borrowing surface area.
Fair Distribution: When spot instances are abundant, distribute assignment evenly to avoid exhausting one EC2 instance type on a specific AZ.  This helps minimize on-demand shortage and contention while reducing involuntary churn due to negative surplus.
Smoothing Function: The resource scheduler evaluates assignments of EC2 instances based on a smoothed representation of workload, smoothing out jitters and spikes to prevent over-reaction.
Incremental Stepping & Fast Evaluation Cycle: Acting in incremental steps avoids over-reaction and allows us to evaluate the workload frequently for rapid self correction.  Incremental stepping also helps distribute instance usage across instance types and AZ more evenly.
Safety Margin: Reduce contention by leaving some amount of available spot instances unused.  It helps reduce involuntary termination due to minor fluctuations in usage in other Netflix accounts.
Curfew: Preemptively reduce spot usage ahead of predictable, rapid drops in surplus (e.g. the nightly Netflix personal recommendation computation schedule).  These curfews help minimize preventable on-demand charges.
Evacuation Monitor: A system-wide toggle to immediately evacuate all borrowing usage in case of emergency (e.g. regional traffic failover).  Eliminate on-demand contention in case of emergency.
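Several of the strategies above (Negative Surplus Monitor, Smoothing Function, Incremental Stepping, Safety Margin, and Evacuation Monitor) can be pictured as one decision made on every evaluation cycle.  The Python sketch below is illustrative only and is not the Encoding team's implementation: the names BorrowingPolicy, smooth, and next_borrow_count, the parameter values, and the choice of an exponential moving average for smoothing are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class BorrowingPolicy:
    # All values are hypothetical and chosen only for illustration.
    safety_margin: int = 50   # spot instances always left unused in the pool
    max_step: int = 20        # largest change allowed in one evaluation cycle
    smoothing: float = 0.3    # weight given to the newest workload sample


def smooth(previous: float, sample: float, alpha: float) -> float:
    """Exponential moving average that damps jitter and spikes in workload."""
    return alpha * sample + (1 - alpha) * previous


def next_borrow_count(
    borrowed: int,
    surplus: int,
    smoothed_demand: float,
    policy: BorrowingPolicy,
    evacuate: bool = False,
) -> int:
    """Return how many borrowed instances to hold after this cycle.

    borrowed:        instances currently borrowed from the internal spot pool
    surplus:         unused reserved instances remaining right now;
                     a negative value means the pool is overdrafted
    smoothed_demand: instances the smoothed workload could keep busy
    """
    if evacuate:
        # Evacuation Monitor: release everything at once.
        return 0
    if surplus < 0:
        # Negative Surplus Monitor: yield the overdraft immediately,
        # without incremental stepping.
        return max(0, borrowed + surplus)
    # Safety Margin: never plan to use the last few available instances.
    borrowable = max(0, borrowed + surplus - policy.safety_margin)
    target = min(round(smoothed_demand), borrowable)
    # Incremental Stepping: move toward the target in bounded steps.
    step = max(-policy.max_step, min(policy.max_step, target - borrowed))
    return borrowed + step


policy = BorrowingPolicy()
demand = smooth(previous=100.0, sample=200.0, alpha=policy.smoothing)
held = next_borrow_count(borrowed=0, surplus=500, smoothed_demand=demand, policy=policy)
# held == 20: the smoothed demand is about 130, but one cycle moves at most max_step.
```

In this sketch, scale-up and ordinary scale-down are both rate limited, while an overdraft or an evacuation bypasses the step limit, which matches the idea that borrowing is the lowest priority usage.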
The following graph depicts a five-day span of spot usage by instance type.
This graph illustrates a few interesting points:
  • The variance in color represents different instance types in use, and in most cases the relatively even distribution of bands of color shows that instance type usage is reasonably balanced.
  • The sharp rise and drop of the peaks confirms that the encoding resource manager scales up and down relatively quickly in response to changes in workload.
  • The flat valleys show the frugality of instance usage. Spot instances are only used when there is work for them to do.
  • Not all color bands have the same height because the size of the reservation varies between instance types.  However, we are able to borrow from both large (orange) and small (green) pools, collectively satisfying the entire workload.
  • Finally, although this graph reports instance usage, it indirectly tracks the workload.  The overall shape of the graph shows no discernible pattern in the workload, which reflects the event-driven nature of the encoding activities.
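The balanced bands in the graph are what the Fair Distribution strategy aims for.  As a rough sketch of one way to get that effect, the Python function below hands out instances one at a time across pools so that no single instance type and AZ combination is drained first.  The function name distribute_evenly, the pool keys, and the example capacities are hypothetical, and the sketch treats every instance as one unit regardless of size.

```python
def distribute_evenly(
    needed: int,
    available: dict[tuple[str, str], int],
) -> dict[tuple[str, str], int]:
    """Spread `needed` instances across (instance_type, az) pools.

    Instances are assigned round-robin, one per pool per pass, and a pool
    drops out once its available count is used up.
    """
    allocation = {pool: 0 for pool in available}
    remaining = needed
    while remaining > 0:
        # Pools that still have unused capacity at the start of this pass.
        open_pools = [p for p in available if allocation[p] < available[p]]
        if not open_pools:
            break  # every pool is exhausted; the request cannot be fully met
        for pool in open_pools:
            if remaining == 0:
                break
            allocation[pool] += 1
            remaining -= 1
    return allocation


# Hypothetical pools and capacities, for illustration only.
pools = {
    ("r3.4xlarge", "us-east-1a"): 10,
    ("m3.2xlarge", "us-east-1b"): 2,
    ("c3.8xlarge", "us-east-1c"): 10,
}
plan = distribute_evenly(8, pools)
```

A real scheduler would likely also weight pools by instance size and by how close each pool is to its safety margin; this sketch leaves both out.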
Based on the AWS billing data from October, we summed up all the borrowing hours and adjusted them relative to the r3.4xlarge instance type that makes up the Encoding reserved capacity.  With the addition of spot market instances, the effective encoding capacity increased by 210%.
Dark blue denotes spot market borrowing, and light blue represents on-demand usage.
On-demand pricing is several times more expensive than reserved pricing, and it varies depending on instance type.  We took the October spot market usage, calculated what it would have cost at purely on-demand pricing, and computed a 92% cost efficiency.
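The arithmetic behind these two figures can be sketched as follows.  This is one plausible reading, not the team's actual calculation: it assumes borrowed hours are converted to r3.4xlarge-equivalent hours with a per-type weight, and that cost efficiency is the share of the hypothetical all-on-demand bill that was not actually paid.  The function names, weights, hours, and dollar amounts are all made up for illustration and are not Netflix data.

```python
def normalized_hours(
    hours_by_type: dict[str, float],
    weight_by_type: dict[str, float],
) -> float:
    """Convert per-type instance hours into r3.4xlarge-equivalent hours."""
    return sum(hours * weight_by_type[t] for t, hours in hours_by_type.items())


def capacity_increase_pct(borrowed_equiv_hours: float, reserved_hours: float) -> float:
    """Borrowed capacity as a percentage of the reserved baseline."""
    return 100.0 * borrowed_equiv_hours / reserved_hours


def cost_efficiency_pct(actual_charges: float, all_on_demand_cost: float) -> float:
    """Share of the all-on-demand cost that was avoided, as a percentage."""
    return 100.0 * (all_on_demand_cost - actual_charges) / all_on_demand_cost


# Hypothetical weights relative to r3.4xlarge (assumed, e.g. by relative size).
weights = {"r3.4xlarge": 1.0, "r3.8xlarge": 2.0, "r3.2xlarge": 0.5}
# Hypothetical borrowed hours per instance type.
borrowed = {"r3.4xlarge": 300.0, "r3.8xlarge": 100.0, "r3.2xlarge": 200.0}

equiv = normalized_hours(borrowed, weights)                 # 600.0 equivalent hours
increase = capacity_increase_pct(equiv, reserved_hours=400.0)
efficiency = cost_efficiency_pct(actual_charges=50.0, all_on_demand_cost=1000.0)
```

With these made-up inputs the borrowed capacity is 150% of the reserved baseline and the cost efficiency is 95%; the 210% and 92% reported in this post come from the October billing data.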
Lessons Learned
On-demand is Expensive: We already knew this, but the lesson sank in once we observed on-demand charges resulting from sudden overdrafts of spot usage.  A number of the strategies listed in the section above (e.g. Safety Margin, Curfew) were devised specifically to mitigate this occurrence.
Versatility: Video encoding represents 70% of our computing needs.  We made some tweaks to the video encoder to run on a much wider selection of instance types.  As a result, we were able to leverage a vast number of spot market instances during different parts of the day.
Tolerance to Interruption: The encoding system is built to withstand interruptions. This attribute works well with the internal spot market since instances can be terminated at any time.
Next Steps
Although the current spot market borrowing system is a notable improvement over the previous attempt, we are only uncovering the tip of the iceberg.  In the future, we want to leverage spot market instances from different EC2 regions as they become available.  We are also heavily investing in the next generation of encoding architecture that scales more efficiently and responsively.  Here are some ideas we are exploring:
Cross Region Utilization: By borrowing from multiple EC2 regions, we would triple our access to unused reservations compared with the current usable pool.  Using multiple regions also significantly reduces the concentration of on-demand usage in a single EC2 region.
Containerization: The current encoding system is based on ASG scaling.  We are actively investing in the next generation of our encoding infrastructure using container technology.  The container model will reduce overhead in ASG scaling, minimize overhead of churning, and increase performance and throughput as Netflix continues to grow its catalog.
Resource Broker: The current borrowing system is monopolistic in that it assumes the Encoding service is the sole borrower.  It is relatively easy to implement for one borrower.  We need to create a resource broker to better coordinate access to the spot surplus when sharing amongst multiple borrowers.
In the first month of deployment, we observed significant benefits in terms of performance and throughput.  We were successful in making use of Netflix idle capacity for production, research, and QA.  Our encoding velocity increased dramatically.  Experimental research turn-around time was drastically reduced.  A comprehensive full regression test finishes in half the time it used to take.  With a cost efficiency of 92%, the spot market is not completely free but it is worth the cost.
All of these benefits translate to faster research turnaround, improved playback quality, and ultimately a better member experience.

-- Media Cloud Engineering