Per-Title Encode Optimization

11 min readDec 14, 2015

We’ve spent years developing an approach, called per-title encoding, where we run analysis on an individual title to determine the optimal encoding recipe based on its complexity. Imagine having very involved action scenes that need more bits to encapsulate the information versus unchanging landscape scenes or animation that need less. This allows us to deliver the same or better experience while using less bandwidth, which will be particularly important in lower bandwidth countries and as we expand to places where video viewing often happens on mobile networks.

Background

In traditional terrestrial, cable or satellite TV, broadcasters have an allocated bandwidth and the program or set of programs are encoded such that the resulting video streams occupy the given fixed capacity. Statistical multiplexing is oftentimes employed by the broadcaster to efficiently distribute the bitrate among simultaneous programs. However, the total accumulated bitrate across the programs should still fit within the limited capacity. In many cases, padding is even added using null packets to guarantee strict constant bitrate for the fixed channel, thus wasting precious data rate. Furthermore, with pre-set channel allocations, less popular programs or genres may be allocated lower bitrates (and therefore, worse quality) than shows that are viewed by more people.

With the advantages of Internet streaming, Netflix is not bound to pre-allocated channel constraints. Instead, we can deliver the best video quality stream to a member, no matter what the program or genre, tailored to the member’s available bandwidth and viewing device capability. We pre-encode streams at various bitrates applying optimized encoding recipes. On the member’s device, the Netflix client runs adaptive streaming algorithms which instantaneously select the best encode to maximize video quality while avoiding playback interruptions due to rebuffers.

Encoding with the best recipe is not a simple problem. For example, assuming a 1 Mbps bandwidth, should we stream H.264/AVC at 480p, 720p or 1080p? With 480p, 1 Mbps will likely not exhibit encoding artifacts such blocking or ringing, but if the member is watching on an HD device, the upsampled video will not be sharp. On the other hand, if we encode at 1080p we send a higher resolution video, but the bitrate may be too low such that most scenes will contain annoying encoding artifacts.

The Best Recipe for All

When we first deployed our H.264/AVC encodes in late 2010, our video engineers developed encoding recipes that worked best across our video catalogue (at that time). They tested various codec configurations and performed side-by-side visual tests to settle on codec parameters that produced the best quality trade-offs across different types of content. A set of bitrate-resolution pairs (referred to as a bitrate ladder), listed below, were selected such that the bitrates were sufficient to encode the stream at that resolution without significant encoding artifacts.

This “one-size-fits-all” fixed bitrate ladder achieves, for most content, good quality encodes given the bitrate constraint. However, for some cases, such as scenes with high camera noise or film grain noise, the highest 5800 kbps stream would still exhibit blockiness in the noisy areas. On the other end, for simple content like cartoons, 5800 kbps is far more than needed to produce excellent 1080p encodes. In addition, a customer whose network bandwidth is constrained to 1750 kbps might be able to watch the cartoon at HD resolution, instead of the SD resolution specified by the ladder above.

The titles in Netflix’s video collection have very high diversity in signal characteristics. In the graph below we present a depiction of the diversity of 100 randomly sampled titles. We encoded 100 sources at 1080p resolution using x264 constant QP (Quantization Parameter) rate control. At each QP point, for every title, we calculate the resulting bitrate in kbps, shown on the x-axis, and PSNR (Peak Signal-To-Noise Ratio) in dB, shown on the y-axis, as a measure of video quality.

The plots show that some titles reach very high PSNR (45 dB or more) at bitrates of 2500 kbps or less. On the other extreme, some titles require bitrates of 8000 kbps or more to achieve an acceptable PSNR of 38 dB.

Given this diversity, a one-size-fits-all scheme obviously cannot provide the best video quality for a given title and member’s allowable bandwidth. It can also waste storage and transmission bits because, in some cases, the allocated bitrate goes beyond what is necessary to achieve a perceptible improvement in video quality.

Side Note on Quality Metrics: For the above figure, and many of the succeeding plots, we plot PSNR as the measure of quality. PSNR is the most commonly used metric in video compression. Although PSNR does not always reflect perceptual quality, it is a simple way to measure the fidelity to the source, gives good indication of quality at the high and low ends of the range (i.e. 45 dB is very good quality, 35 dB will show encoding artifacts), and is a good indication of quality trends within a single title.The analysis can also be applied using other quality measures such as the VMAF perceptual metric. VMAF (Video Multi-Method Assessment Fusion) is a perceptual quality metric developed by Netflix in collaboration with University of Southern California researchers. We will publish details of this quality metric in a future blog.

The Best Recipe for the Content

Why Per-Title?

Consider an animation title where the content is “simple”, that is, the video frames are composed mostly of flat regions with no camera or film grain noise and minimal motion between frames. We compare the quality curve for the fixed bitrate ladder with a bitrate ladder optimized for the specific title:

As shown in the figure above, encoding this video clip at 1920×1080, 2350 kbps (A) produces a high quality encode, and adding bits to reach 4300 kbps (B) or even 5800 kbps (C) will not deliver a noticeable improvement in visual quality (for encodes with PSNR 45 dB or above, the distortion is perceptually unnoticeable). In the fixed bitrate ladder, for 2350 kbps, we encode at 1280×720 resolution (D). Therefore members with bandwidth constraints around that point are limited to 720p video instead of the better quality 1080p video.

On the other hand, consider an action movie that has significantly more temporal motion and spatial texture than the animation title. It has scenes with fast-moving objects, quick scene changes, explosions and water splashes. The graph below shows the quality curve of an action movie.

Encoding these high complexity scenes at 1920×1080, 4300 kbps (A), would result in encoding artifacts such as blocking, ringing and contouring. A better quality trade-off would be to encode at a lower resolution 1280x720 (B), to eliminate the encoding artifacts at the expense of adding scaling. Encoding artifacts are typically more annoying and visible than blurring introduced by downscaling (before the encode) then upsampling at the member’s device. It is possible that for this title with high complexity scenes, it would even be beneficial to encode 1920×1080 at a bitrate beyond 5800 kbps, say 7500 kbps, to eliminate the encoding artifacts completely.

To deliver the best quality video to our members, each title should receive a unique bitrate ladder, tailored to its specific complexity characteristics. Over the last few years, the encoding team at Netflix invested significant research and engineering to investigate and answer the following questions:

Given a title, how many quality levels should be encoded such that each level produces a just-noticeable-difference (JND)?
Given a title, what is the best resolution-bitrate pair for each quality level?
Given a title, what is the highest bitrate required to achieve the best perceivable quality?
Given a video encode, what is the human perceived quality?
How do we design a production system that can answer the above questions in a robust and scalable way?

The Algorithm

To design the optimal per-title bitrate ladder, we select the total number of quality levels and the bitrate-resolution pair for each quality level according to several practical constraints. For example, we need backward-compatibility (streams are playable on all previously certified Netflix devices), so we limit the resolution selection to a finite set — 1920×1080, 1280×720, 720×480, 512×384, 384×288 and 320×240. In addition, the bitrate selection is also limited to a finite set, where the adjacent bitrates have an increment of roughly 5%.

We also have a number of optimality criteria that we consider.

The selected bitrate-resolution pair should be efficient, i.e. at a given bitrate, the produced encode should have as high quality as possible.
Adjacent bitrates should be perceptually spaced. Ideally, the perceptual difference between two adjacent bitrates should fall just below one JND. This ensures that the quality transitions can be smooth when switching between bitrates. It also ensures that the least number of quality levels are used, given a wide range of perceptual quality that the bitrate ladder has to span.

To build some intuition, consider the following example where we encode a source at three different resolutions with various bitrates.

Encoding at three resolutions and various bitrates. Blue marker depicts encoding point and the red curve indicates the PSNR-bitrate convex hull.

At each resolution, the quality of the encode monotonically increases with the bitrate, but the curve starts flattening out (A and B) when the bitrate goes above some threshold. This is because every resolution has an upper limit in the perceptual quality it can produce. When a video gets downsampled to a low resolution for encoding and later upsampled to full resolution for display, its high frequency components get lost in the process.

On the other hand, a high-resolution encode may produce a quality lower than the one produced by encoding at the same bitrate but at a lower resolution (see C and D). This is because encoding more pixels with lower precision can produce a worse picture than encoding less pixels at higher precision combined with upsampling and interpolation. Furthermore, at very low bitrates the encoding overhead associated with every fixed-size coding block starts to dominate in the bitrate consumption, leaving very few bits for encoding the actual signal. Encoding at high resolution at insufficient bitrate would produce artifacts such as blocking, ringing and contouring.

Based on the discussion above, we can draw a conceptual plot to depict the bitrate-quality relationship for any video source encoded at different resolutions, as shown below:

We can see that each resolution has a bitrate region in which it outperforms other resolutions. If we collect all these regions from all the resolutions available, they collectively form a boundary called convex hull. In an economic sense, the convex hull is where the encoding point achieves Pareto efficiency. Ideally, we want to operate exactly at the convex hull, but due to practical constraints (for example, we can only select from a finite number of resolutions), we would like to select bitrate-resolution pairs that are as close to the convex hull as possible.

It is practically infeasible to construct the full bitrate-quality graphs spanning the entire quality region for each title in our catalogue. To implement a practical solution in production, we perform trial encodings at different quantization parameters (QPs), over a finite set of resolutions. The QPs are chosen such that they are one JND apart. For each trial encode, we measure the bitrate and quality. By interpolating curves based on the sample points, we produce bitrate-quality curves at each candidate resolution. The final per-title bitrate ladder is then derived by selecting points closest to the convex hull.

Sample Results

BoJack Horseman is an example of an animation with simple content — flat regions and low motion from frame to frame. In the fixed bitrate ladder scheme, we use 1750 kbps for the 480p encode. For this particular episode, with the per-title recipe we start streaming 1080p video at 1540 kbps. Below we compare cropped screenshots (assuming a 1080p display) from the two versions (top: 1750 kbps, bottom: new 1540 kbps). The new encode is crisper and has better visual quality.

Orange is the New Black has video characteristics with more average complexity. At the low bitrate range, there is no significant quality improvement seen with the new scheme. At the high end, the new per-title encoding assigns 4640 kbps for the highest quality 1080p encode. This is 20% in bitrate savings compared to 5800 kbps for the fixed ladder scheme. For this title we avoid wasting bits but maintain the same excellent visual quality for our members. The images below show a screenshot at 5800 kbps (top) vs. 4640 kbps (bottom).

The Best Recipe for Your Device

In the description above where we select the optimized per-title bitrate ladder, there is an inherent assumption that the viewing device can receive and play any of the encoded resolutions. However, because of hardware constraints, some devices may be limited to resolutions lower than the original resolution of the source content. If we select the convex hull covering resolutions up to 1080p, this could lead to suboptimal viewing experiences for, say, a tablet limited to 720p decoding hardware. For example, given an animation title, we may switch to 1080p at 2000 kbps because it results in better quality than a 2000 kbps 720p stream. However the tablet will not be able to utilize the 1080p encode and would be constrained to a sub-2000 kbps stream even if the bandwidth allows for a better quality 720p encode.

To remedy this, we design additional per-title bitrate ladders corresponding to the maximum playable resolution on the device. More specifically, we design additional optimal per-title bitrate ladders tailored to 480p and 720p-capped devices. While these extra encodes reduce the overall storage efficiency for the title, adding them ensures that our customers have the best experience.

What does this mean for my Netflix shows?

Per-title encoding allows us to deliver higher quality video two ways: Under low-bandwidth conditions, per-title encoding will often give you better video quality as titles with “simple” content, such as BoJack Horseman, will now be streamed at a higher resolution for the same bitrate. When the available bandwidth is adequate for high bitrate encodes, per-title encoding will often give you even better video quality for complex titles, such as Marvel’s Daredevil, because we will encode at a higher maximum bitrate than our current recipe. Our continuous innovation on this front recognizes the importance of providing an optimal viewing experience for our members while simultaneously using less bandwidth and being better stewards of the Internet.

by Anne Aaron, Zhi Li, Megha Manohara, Jan De Cock and David Ronca

Originally published at techblog.netflix.com on December 14, 2015.

Netflix TechBlog