Monday, July 18, 2016

Chelsea: Encoding in the Fast Lane

Back in May, Netflix launched its first global talk show: Chelsea. Delivering this new format was a first for us and a fun challenge in many respects, which this blog post describes in more detail. Chelsea Handler's new Netflix talk show ushered in a Day-of-Broadcast (DOB) style of delivery that is demanding on multiple levels for our teams, with a very tight turnaround time. We looked at all the activities that take place in the Netflix Digital Supply Chain, from source delivery to live-on-site, and gave each activity a time budget, pushing all the teams to squeeze their times toward an aggressive overall goal. In this article we explain the enhancements and techniques that the encoding team used to process this show faster than ever.

Historically, there was not as much pressure on encode times. Our system was optimized for throughput and robustness, paying less attention to speed. In the last few years we had worked to reduce the ingest and encode time to about 2.5 hours. This met the demands of our most stringent use cases, like the Day-After-Broadcast delivery of Breaking Bad. Now, Chelsea was pushing us to reduce this time even further. The new, aggressive time budget called for us to ingest and encode a 30-minute title in under 30 minutes. Our solution ended up using about 5 minutes for source inspection and 25 minutes for encoding.
The Starting Point
Although Chelsea challenged us to encode with a significantly shorter turnaround time compared to other movies or shows in our catalog, our work over the last few years on developing a robust and scalable cloud-based system helped jumpstart our efforts to meet this challenge.
Parallel Encoding
In the early days of Netflix streaming, the entire video encode of a title would be generated on a single Windows machine. For some streams (for example, slower codecs or higher resolutions), generating a single encode would take more than 24 hours. We improved on our system a few years ago by rolling out a parallel encoding workflow, which breaks up a title into “chunks” that can be processed in parallel on different machines. This allowed for shorter latency, especially as the number of machines scales up, and for robustness to transient errors: if a machine is unexpectedly terminated, only a small amount of work is lost.
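A minimal sketch of the chunked workflow, not the production system: the title is split into fixed-length chunks, each chunk is encoded independently, and the results are collected in order. The function names are hypothetical, `encode_chunk` is a stand-in for a real encoder call, and a local thread pool stands in for a fleet of machines.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(duration_s, chunk_s):
    """Return (start, end) second offsets that cover the whole title."""
    return [(start, min(start + chunk_s, duration_s))
            for start in range(0, duration_s, chunk_s)]


def encode_chunk(chunk):
    """Hypothetical stand-in for running an encoder on one chunk."""
    start, end = chunk
    return f"encoded[{start}-{end}]"


def encode_title(duration_s, chunk_s, workers=4):
    """Encode every chunk in parallel and return the parts in title order."""
    chunks = split_into_chunks(duration_s, chunk_s)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so the parts can be stitched as-is.
        return list(pool.map(encode_chunk, chunks))
```

Because each chunk is an independent unit of work, a failed chunk can be retried on its own without redoing the rest of the title.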
Automated Parallel Inspections
To ensure that we deliver high quality video streams to our members, we have invested in developing automated quality checks throughout the encoding pipeline. We start with inspecting the source “mezzanine” file to make sure that a pristine source is ingested into the system. Types of inspections include detection of wrong metadata, picture corruption, insertion of extra content, frame rate conversion and interlacing artifacts. After generating a video stream, we verify the encodes by inspecting the metadata, comparing the output video to the mezzanine fingerprint and generating quality metrics. This enables us to detect issues caused by glitches on the cloud instances or software implementation bugs. Through automated inspections of the encodes we can detect output issues early on, without the video having to reach manual QC. Just as we encode in parallel by breaking the source into chunks, we can run our automated inspections in parallel by chunking the mezzanine file or encoded video.
Internal Spot Market
Since automated inspections and encoding are enabled to run in parallel in a chunked model, increasing the number of available instances can greatly reduce end-to-end latency. We recently worked on a system to dynamically leverage unused Netflix-reserved AWS servers during off-peak hours. The additional cloud instances, not used by other Netflix services, allowed us to expedite and prioritize encoding of Chelsea’s show.
Priority Scheduling
Encoding jobs come in varying priorities, from highly urgent (e.g. DOB titles, or interactive jobs submitted by humans) to low-priority background backfill. Within the same title, certain codecs and bitrates rank higher in priority than others so that the bitrates required to go live are always processed first. To handle the fine-grained and dynamic nature of job priority, the encoding team developed a custom priority messaging service. Priorities are broadly classified by a priority class modeled after the US Postal Service classes of mail, and fine-grained job priority is expressed by a due date. Chelsea belongs to the highest priority class, Express (sorry, no Sunday delivery). With the axiom that “what’s important is needed yesterday”, all Chelsea show jobs are due 30 years ago!
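The two-level ordering (priority class first, then due date) can be sketched with a heap. This is an illustration under stated assumptions, not the actual messaging service: only the Express class is named in this article, so the `PRIORITY` and `STANDARD` classes and the `JobQueue` API are hypothetical.

```python
import heapq
import itertools
from datetime import datetime, timedelta
from enum import IntEnum


class PriorityClass(IntEnum):
    EXPRESS = 0   # named in the article
    PRIORITY = 1  # hypothetical class for illustration
    STANDARD = 2  # hypothetical class for illustration


class JobQueue:
    """Orders jobs by priority class, then by due date (earliest first)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    def submit(self, name, priority_class, due):
        heapq.heappush(self._heap,
                       (priority_class, due, next(self._counter), name))

    def next_job(self):
        return heapq.heappop(self._heap)[3]


now = datetime(2016, 7, 18)
queue = JobQueue()
queue.submit("backfill", PriorityClass.STANDARD, now + timedelta(days=30))
queue.submit("interactive", PriorityClass.EXPRESS, now)
# A due date 30 years in the past sorts ahead of every other Express job.
queue.submit("chelsea", PriorityClass.EXPRESS, now - timedelta(days=365 * 30))
order = [queue.next_job() for _ in range(3)]
```

Expressing urgency as a due date means a job's rank within its class needs no separate numeric scale: an earlier date always wins.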
Innovations Motivated by Chelsea
As we analyzed our entire process looking for ways to make the process faster, it was apparent that DOB titles have different goals and characteristics than other titles. Some improvement techniques would only be practical on a DOB title, and others that might make sense on ordinary titles may only be practical on the smaller scale of DOB titles and not on the scale of the entire catalog. Low latency is often at odds with high throughput, and we still have to support an enormous throughput. So understand that the techniques described here are used selectively on the most urgent of titles.

When trying to make anything faster we consider these standard approaches:
  1. Use phased processing to postpone blocking operations
  2. Increase parallelism
  3. Make it plain faster
We will mention some improvements from each of these categories.
Phased Processing
Most sources for Netflix originals go through a rigorous set of inspections after delivery, both manual and automated. First, manual inspections happen on the source delivered to us to check if it adheres to the Netflix source guidelines. With Chelsea, this inspection begins early with the pre-taped segments being inspected well before the show itself is taped. Then, inspections are done during taping and again during the editorial process, right on set. By the time it is delivered, we are confident that it needs no further manual QC because exhaustive QC was performed at post.
We have control over the source production; it is our studio and our crew and our editing process. It is well-rehearsed and well-known. If we assume the source is good, we can bypass the automated inspections that focus on errors introduced by the production process. Examples of inspections typically done on all sources are detection of telecine, interlacing, audio hits, silence in audio and bad channel mapping. Bypassing the most expensive inspections, such as deep audio inspections, allowed us to bring the execution time down from 30 minutes to about 5 minutes on average. Aside from detecting problems, the inspection stage generates artifacts that are necessary for the encoding processing. We maintain all inspections that produce these artifacts.
Complexity Analysis
A previous article described how we use an encoding recipe uniquely tailored to each title. The first step in this per-title encode optimization is complexity analysis, an expensive examination of large numbers of frames to decide on a strategy that is optimal for the title.
For a DOB title, we are willing to release it with a standard set of recipes and bitrates, the same way Netflix had delivered titles for years. This standard treatment is designed to give a good experience for any show and does an adequate job for something like Chelsea.
We launch an asynchronous job to do the complexity analysis on Chelsea, which will trigger a re-encode and produce streams with ultimate efficiency and quality. We are not blocked on this. If it is not finished by the show start date, the show will still go live with the standard streams. Sometime later the new streams will replace the old.
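The non-blocking pattern described here might look like the following sketch. All names (`Catalog`, `release`, `encode_standard`, `analyze_and_reencode`) are hypothetical, and the encode functions are stubs: the standard streams are published right away, and the per-title streams replace them whenever the background job finishes.

```python
from concurrent.futures import ThreadPoolExecutor


def encode_standard(title):
    """Stub for the fast encode with the standard recipes."""
    return f"{title}:standard"


def analyze_and_reencode(title):
    """Stub for complexity analysis followed by a per-title re-encode."""
    return f"{title}:per-title"


class Catalog:
    """Hypothetical store of whichever streams are currently live."""

    def __init__(self):
        self.live = {}

    def publish(self, title, streams):
        self.live[title] = streams


def release(title, catalog, pool, reencode=analyze_and_reencode):
    # Phase 1: go live immediately with the standard streams.
    catalog.publish(title, encode_standard(title))
    # Phase 2: run the expensive analysis in the background; nothing waits on it.
    future = pool.submit(reencode, title)
    # When it completes, the optimized streams replace the standard ones.
    future.add_done_callback(lambda f: catalog.publish(title, f.result()))
    return future
```

Publishing the standard streams before submitting the background job avoids a race in which a fast re-encode would be overwritten by the standard result.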
Increase Parallelism
Encoding in Chunks
As mentioned earlier, breaking up a video into small chunks and encoding different chunks in parallel can effectively reduce the overall encoding time. At the time the DOB project started, we still had a few codecs that were processed as a single chunk, such as H.263. We took this opportunity to create a chunkable process for these remaining codecs.
Optimized Encoding Chunk Size
For DOB titles we went more extreme. After extensive testing with different chunk sizes, we discovered that by reducing the chunk size from our previous standard of 3 minutes to 30 seconds we can cut down the encoding time by 80% without noticeable video quality degradation.

More chunks mean more overhead, so for normal titles we stick to a 3-minute chunk size. For DOB titles we are willing to pay the increased overhead.
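The trade-off can be illustrated with a toy cost model. The two constants below are invented for illustration and are not Netflix measurements; the model simply assumes that every chunk gets its own machine and that each chunk pays a fixed setup cost.

```python
import math

ENCODE_COST = 5      # assumed: compute seconds per second of video
CHUNK_OVERHEAD = 30  # assumed: fixed setup seconds paid by every chunk


def chunk_count(title_s, chunk_s):
    """Number of chunks needed to cover the title."""
    return math.ceil(title_s / chunk_s)


def latency_s(chunk_s):
    """Wall-clock time when all chunks run at once: one chunk plus setup."""
    return chunk_s * ENCODE_COST + CHUNK_OVERHEAD


def total_overhead_s(title_s, chunk_s):
    """Setup time summed over all chunks; this is the price of parallelism."""
    return chunk_count(title_s, chunk_s) * CHUNK_OVERHEAD


title_s = 30 * 60
for chunk_s in (180, 30):
    print(chunk_s, chunk_count(title_s, chunk_s),
          latency_s(chunk_s), total_overhead_s(title_s, chunk_s))
```

With these made-up constants, going from 180-second to 30-second chunks cuts latency from 930 to 180 seconds (about 81%) while total overhead grows from 300 to 1,800 seconds, which is why the smaller size is reserved for DOB titles.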
Reduce Dependency in Steps
Some older codec formats (for example, VC-1), used by legacy TVs and Blu-ray players, were being encoded from lightly compressed intermediate files. This meant we could not begin encoding these streams (which were among the slower processes in our pipeline) until we had finished the intermediate encode. We changed our process to encode the legacy streams directly from the source so that we did not have to wait for the intermediate step.
Make It Faster
Infrastructure Enhancements
Once an AV source is entered into the encoding system, we encode it with a number of codecs, resolutions, and bitrates for all Netflix playback devices. To meet the SLA for DOB encoding and be as fast as possible, we need to run all DOB encoding jobs in parallel without waiting. The finer chunk size used for DOB adds an extra challenge: there are more jobs that need to run in parallel.
Right Sizing
The majority of the computing resources are spent on video encoding. A relatively small percentage of computing is spent on source inspection, audio, subtitles, and other assets. It is easy to pre-scale the production environment for these smaller activities. On the other hand, with a 30-second chunk size, we drastically increased the number of parallel video encoding activities. For a 30-minute Chelsea episode, we estimated a need for 1,000 video encoders to compute all codecs, resolutions, and bitrates at the same time. For the video encoders, we make use of the internal spot market, the unused Netflix reserved instances, to achieve this high instance count.
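A back-of-the-envelope version of this estimate: if every chunk of every output stream runs at once, the encoder count is chunks times streams. The stream count of 17 below is an assumption chosen for illustration; the article does not state how many codec, resolution, and bitrate combinations are produced.

```python
import math


def encoders_needed(title_s, chunk_s, num_streams):
    """Encoders required to run every chunk of every stream simultaneously."""
    chunks = math.ceil(title_s / chunk_s)
    return chunks * num_streams


ASSUMED_STREAMS = 17  # hypothetical count of codec/resolution/bitrate outputs

dob = encoders_needed(30 * 60, 30, ASSUMED_STREAMS)       # 60 chunks per stream
normal = encoders_needed(30 * 60, 180, ASSUMED_STREAMS)   # 10 chunks per stream
```

Under that assumption, 30-second chunks call for 1,020 encoders, in line with the estimate of roughly 1,000, versus 170 for 3-minute chunks.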
Warm Up
The resource scheduler normally samples the work queues and autoscales video encoders based on the workload at that moment. Scaling Amazon EC2 instances takes time. The amount of time to scale depends on many factors and could prevent us from meeting the SLA for encoding a DOB title. Pre-scaling 1,000 video encoders eliminates the scaling time penalty when a DOB title arrives. However, it is uneconomical to keep 1,000 video encoders running 24x7 regardless of workload.
To strike a balance, we introduced a warm-up mechanism. We pre-scale 1,000 video encoders at the earliest signal of an imminent DOB title arrival, and keep them around for an hour. The Netflix ingest pipeline sends a notification to the resource scheduler whenever we start to receive a DOB title from the set. Upon receiving the notification, the resource scheduler immediately procures 1,000 video encoders spread out over many instance types and zones (e.g. r3.2xlarge on us-east-1e), parallelizing instance acquisition to reduce the overall time. This strategy also mitigates the risk of running out of a specific instance type and availability zone combination.
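One simple way to spread a warm-up request is to divide it evenly across every instance type and zone combination. This sketch only computes the plan and makes no AWS calls; `plan_warm_up` is a hypothetical helper, and apart from r3.2xlarge and us-east-1e the types and zones listed are illustrative choices.

```python
from itertools import product


def plan_warm_up(total, instance_types, zones):
    """Split `total` encoders as evenly as possible across (type, zone) pools."""
    pools = list(product(instance_types, zones))
    base, extra = divmod(total, len(pools))
    # The first `extra` pools take one more instance so the counts sum to total.
    return {pool: base + (1 if i < extra else 0)
            for i, pool in enumerate(pools)}


plan = plan_warm_up(
    1000,
    ["r3.2xlarge", "m3.2xlarge", "c3.4xlarge"],  # only r3.2xlarge is from the article
    ["us-east-1a", "us-east-1c", "us-east-1e"],  # only us-east-1e is from the article
)
```

Each pool can then be requested concurrently, and a shortage in any one pool affects only about a ninth of the fleet in this example.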
Priority and Job Preemption
Since the warm-up comes in advance of having actual DOB jobs, the video encoders will busy themselves with existing encode jobs. By the time the DOB Express priority jobs arrive, it is possible that a video encoder already has a lower priority job in-flight. We can't afford to wait for these jobs to finish before beginning the Express priority jobs. To mitigate this scenario, we enhanced our custom-built priority messaging service with job preemption where a high priority job such as a Chelsea video encode interrupts a lower priority job.
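A toy model of preemption on a single encoder, assuming that a preempted job is simply put back in the queue and restarted later (the article does not say how interrupted work is resumed). The `Encoder` class and its methods are hypothetical; a lower number means a more urgent job.

```python
import heapq


class Encoder:
    """Runs one job at a time; a more urgent arrival interrupts the current job."""

    def __init__(self):
        self.current = None  # (priority, name) of the in-flight job
        self.waiting = []    # heap of jobs that are not running

    def submit(self, priority, name):
        job = (priority, name)
        if self.current is None:
            self.current = job
        elif priority < self.current[0]:
            # Preempt: requeue the in-flight job and start the urgent one now.
            heapq.heappush(self.waiting, self.current)
            self.current = job
        else:
            heapq.heappush(self.waiting, job)

    def finish(self):
        """Mark the current job done and pick up the most urgent waiting job."""
        self.current = heapq.heappop(self.waiting) if self.waiting else None


encoder = Encoder()
encoder.submit(2, "backfill")
encoder.submit(0, "chelsea")  # interrupts the backfill job immediately
```

With short chunks, the cost of restarting an interrupted low-priority job stays small, which makes this kind of preemption cheap.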
Empirical data shows that all DOB jobs are picked up within 60 seconds on average.
The Fast Lane
We examined all the interactions with other systems and teams and identified latencies that could be improved. Like the encoding system, other systems at Netflix were also designed for high throughput, while latency was a secondary concern. For all systems overall, throughput remains a top priority and we cannot sacrifice in this area. In many cases it was not practical to improve the latency of interactions for all titles while satisfying throughput demands so we developed special fast lane communications. Communications about DOB titles follow a different path, one that leads to lower latency.
We achieved our goal of reducing the time for ingest and encode to be approximately the runtime of the source, i.e. 30 minutes for a 30 minute source. We were pleased that the architecture we put in place in recent years that emphasizes flexibility and configurability provided a great foundation for building out the DOB process. This investment paid off by allowing us to quickly respond to business demands and effectively deliver our first talk show to members all around the world.

By Rick Wong, Zhan Chen, Anne Aaron, Megha Manohara, and Darrell Denlinger