Monday, June 27, 2016

Netflix and the IMF Community


Photon OSS - IMF validation for the masses

When you’ve got something this good, it’s hard to keep it to yourself. As we develop our IMF validation tools internally, we are merging them into a public Git repository. The project, code-named Photon (to imply a torchbearer), was envisioned to aid wider adoption of the IMF standard and to simplify the development of IMF tools. It can be used in a few different ways: as a core library for building a complete end-to-end IMF content ingestion workflow, as a web-service backend providing quick turnaround in validating IMF assets, or even as a reference implementation of the IMF standard.

Photon in the Netflix Content Processing Pipeline

IMF presents new opportunities and significant challenges to the Digital Supply Chain ecosystem. Photon embodies all our knowledge and experience of building an automated, distributed cloud-based content ingestion workflow. At the time of writing, we have fully integrated Photon into the Netflix IMF workflow and continue to enhance it as our workflow requirements evolve. A simple 3-step diagram of the Netflix content processing workflow (as discussed in our last tech blog, Netflix IMF Workflow), along with the usage of Photon, is shown in the figure below.
[Figure: Netflix content processing workflow and the usage of Photon]
Photon has all the necessary logic for parsing, reading and validating IMF assets including AssetMap, Packing List (PKL), Composition Playlist (CPL) and Audio/Video track files. Some of the salient features of Photon that we have leveraged in building our IMF content ingestion workflow are:
  1. A modular architecture along with a set of thread-safe classes for validating IMF assets such as Composition Playlist, Packing List and AssetMap.
  2. A model to enforce IMF constraints on structural metadata of track files and Composition Playlists.
  3. Support for multiple namespaces for Composition Playlist, Packing List and AssetMap in order to remain compliant with the newer schemas published by SMPTE.
  4. A parser/reader to interpret metadata within IMF track files and serialize it as a SMPTE ST 2067-3 (the Composition Playlist specification) compliant XML document.
  5. Implementation of deep inspection of IMF assets including algorithms for Composition Playlist conformance and associativity (more on this aspect below).
  6. A stateless interface for IMF validation that could be used as a backend in a RESTful web service for validating IMF packages.

IMF Composition in the Real World

As IMF evolves, so will the tools used to produce IMF packages. While the standard and tools mature, we expect to receive assets in the interim that haven’t caught up with the latest specifications. To keep such malformed assets from making their way into our workflow, we are striving to implement algorithms for performing deep inspections. In the earlier section we introduced two algorithms implemented in Photon for deep inspections, namely Composition Playlist conformance and Composition Playlist associativity. We define and elaborate on these algorithms in this section.

Composition Playlist Conformance

We define a Composition Playlist to be conformant if the entire file descriptor metadata structure (including sub-descriptors) present in each track file that is a part of the composition is mapped to a single essence descriptor element in the Composition Playlist’s Essence Descriptor List. The algorithm to determine Composition Playlist conformance comprises the following steps:
  1. Parse and read all the essence descriptors along with their associated sub-descriptors from every track file that is a part of the composition.
  2. Parse and read all the essence descriptor elements along with their associated sub-descriptors in the Essence Descriptor List.
  3. Verify that every essence descriptor in the Essence Descriptor List is referenced by at least one track file in the composition. If not, the Composition Playlist is not conformant.
  4. Identify the essence descriptor in the Essence Descriptor List corresponding to the next track file. This is done by utilizing syntactical elements defined in the Composition Playlist, namely the Track File ID and SourceEncoding elements. If not present, the Composition Playlist is not conformant.
  5. Compare the identified essence descriptor and its sub-descriptors present in the Essence Descriptor List with the corresponding essence descriptor and its sub-descriptors present in the track file. At least one essence descriptor, together with its sub-descriptors, in the track file must match the corresponding essence descriptor and sub-descriptors in the Essence Descriptor List. If not, the Composition Playlist is not conformant.
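
The steps above can be sketched roughly as follows. This is a minimal Python illustration and not Photon's actual Java implementation: descriptors are modelled as plain dictionaries that nest their sub-descriptors, parsing (steps 1 and 2) is assumed to have already happened, and the names `check_conformance`, `edl`, `track_files`, `source_encoding` and `descriptors` are hypothetical.

```python
def check_conformance(edl, track_files):
    """Return (is_conformant, reason) for an already-parsed composition.

    edl: dict mapping a SourceEncoding id to an essence descriptor
         (hypothetical model: a dict that nests its sub-descriptors).
    track_files: list of dicts with the keys "track_file_id",
         "source_encoding" and "descriptors" (descriptors read from the file).
    """
    # Step 3: every EDL entry must be referenced by at least one track file.
    referenced = {tf["source_encoding"] for tf in track_files}
    for ed_id in edl:
        if ed_id not in referenced:
            return False, f"essence descriptor {ed_id} is not referenced"

    for tf in track_files:
        # Step 4: locate the EDL entry that this track file points to.
        ed_id = tf["source_encoding"]
        if ed_id not in edl:
            return False, f"no essence descriptor for track file {tf['track_file_id']}"
        # Step 5: at least one descriptor in the track file must match it,
        # including all nested sub-descriptors (dict equality is deep).
        if edl[ed_id] not in tf["descriptors"]:
            return False, f"descriptor mismatch for track file {tf['track_file_id']}"

    return True, "conformant"
```

Returning a reason alongside the boolean is a design choice made for this sketch; it makes it easier to report why a Composition Playlist was rejected.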

The algorithmic approach we have adopted to perform this check is depicted in the following flow chart:

Composition Associativity

The current definition of IMF allows for version management between one IMF publisher and one IMF consumer. In the real world, multiple parties (content partners, fulfillment partners, etc.) often work together to produce a finished title. This suggests the need for a multi-party version management system (along the lines of software version control systems). While the IMF standard does not preclude this, it is missing from existing IMF implementations and does not yet have industry mind-share. We have come up with the concept of Composition associativity as a way to identify and associate Composition Playlists of the same presentation that were not constructed incrementally. Such scenarios can occur when multiple Composition Playlist assets are received for a title, where each asset fulfills certain supplemental tracks of the original video presentation. As an example, let us say a content partner authors a Composition Playlist for a certain title with an original video track and an English audio track, whereas the fulfillment partner publishes a Composition Playlist with the same original video track and a Spanish audio track.

The current version of the algorithm for verifying composition associativity comprises the following checks:
  1. Verify that the constituent edit units of the original video track across all of the Composition Playlists are temporally aligned and represent the same video material. If not, the Composition Playlists are not associative.
  2. Verify that the constituent edit units of a particular audio language track, if present, in multiple Composition Playlists to be associated are temporally aligned and represent the same audio material. If not, the Composition Playlists are not associative.
  3. Repeat step 2 for the intersection set of all the audio language tracks in each of the Composition Playlist files.
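
The three checks above can be sketched as follows. This is a simplified Python illustration under stated assumptions, not Photon's implementation: each Composition Playlist is modelled as a dictionary whose tracks are lists of edit-unit identifiers, and two tracks are treated as "temporally aligned and representing the same material" when those lists are equal. The name `are_associative` and the data layout are hypothetical.

```python
def are_associative(cpls):
    """Return True if all Composition Playlists can be associated.

    cpls: list of dicts with
      "video": list of edit-unit identifiers of the original video track
      "audio": dict mapping a language code to a list of edit-unit identifiers
    """
    if len(cpls) < 2:
        return True  # a single playlist is trivially associative
    first = cpls[0]

    # Check 1: the original video track must line up in every playlist.
    if any(cpl["video"] != first["video"] for cpl in cpls[1:]):
        return False

    # Checks 2 and 3: compare every audio language that is present in all
    # playlists (the intersection set); languages unique to one playlist
    # are not compared.
    common_languages = set(first["audio"])
    for cpl in cpls[1:]:
        common_languages &= set(cpl["audio"])
    for language in common_languages:
        reference_track = first["audio"][language]
        if any(cpl["audio"][language] != reference_track for cpl in cpls[1:]):
            return False

    return True
```

With this model, the content partner's English playlist and the fulfillment partner's Spanish playlist from the example above are associative as long as their video tracks are identical, because they share no audio language that could conflict.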
Note that as of this writing, Photon does not yet support the IMF virtual marker track or data tracks such as timed text, so we do not yet include those track types in our associativity checks. A flow chart of the composition associativity algorithm follows:

How to get Photon, use it and contribute to it

Photon is hosted on the Netflix GitHub page and is licensed under the Apache License Version 2.0 terms, making it very easy to use. It can be built in a Gradle environment and is accompanied by a completely automated build and continuous integration system, Travis. All releases of Photon are published to Maven Central as a Java archive, and users can include it in their projects as a dependency using the relevant syntax for their build environment. The code base is structured in the form of packages that are intuitively named to represent the functionality they embody. We recommend starting with the classes in the “app” package, as they exercise almost all of the core implementation of the library and therefore offer valuable insight into the software structure. This can be very useful to anyone who would like to get involved in the project or simply wants to understand the implementation. A complete set of Javadocs and the necessary readme files are also maintained at the GitHub location and can be consulted for API reference and general information about the project.

In addition to driving adoption of IMF in the industry, our intention in open sourcing Photon has been to encourage and seek contributions from the Open Source community to help improve what we have built. Photon is still in an early stage of development, making it very attractive for new contributions. Areas in which we are seeking feedback and contributions include, but are not limited to, the following:
  1. Software design and architectural improvements.
  2. Well-designed APIs and accompanying Javadocs.
  3. Code quality and robustness improvements.
  4. More extensive tests and code coverage.

We have simplified the process of contributing to Photon by allowing contributors to submit pull requests for review and/or fork the repository and enhance it as necessary. Every commit is gated by a test suite, FindBugs and PMD checks before it is ready to be accepted for merge into the mainline.

It is our belief that significant breakthroughs with Photon can only be achieved by a high level of participation and collaboration within the Open Source community. Hence we require that all Photon submissions adhere to the Apache 2.0 license terms and conditions.

Netflix IMF Roadmap

As we continue to contribute to Photon to bridge the gap between its current feature set and the IMF standard, our strategic initiatives and plans around IMF include the following:
  1. We are participating in various standardization activities related to IMF. These include our support for standardization of TTML2 in W3C as well as HDR video and immersive audio in the SMPTE IMF group. We are very enthusiastic about ACES for IMF. Netflix has sponsored multiple IMF meetups and interop plugfests.   
  2. We are actively investing in open source software (OSS) activities around IMF and hoping to foster a collaborative and vibrant developer community. Examples of other OSS projects sponsored by Netflix include “imf-validation-tool” and “regxmllib”.
  3. We are engaged in the development of tools that make it easy to use as well as embed IMF in existing workflows. An example of an ongoing project is the ability to transcode from IMF to DPP (Digital Production Partnership - an initiative formed jointly by public service broadcasters in the UK to help producers and broadcasters maximize the potential benefits of digital television production) or IMF to iTunes ecosystems. Another example is the “IMF CPL Editor” project. This will be available on GitHub and will support lightweight editing of the CPL such as metadata and timeline changes.
  4. IMF at its heart is an asset distribution and archival standard. Effective industry-wide automation in the digital supply chain could be achieved by integration with content identification systems such as EIDR and MovieLabs Media Manifest and Avails protocols. Netflix is actively involved in these initiatives.
  5. We are committed to building and maintaining scalable web-services for validating and perhaps even authoring IMF packages. We believe that this would be a significant step towards addressing the needs of the IMF community at large, and further helping drive IMF adoption in the industry.

The Future of Photon

Photon will evolve over time as we continue to build our IMF content ingestion workflow. As indicated in the very first article in this series, “IMF: A Prescription for Versionitis”, Netflix realizes that IMF will provide a scalable solution to some of the most common challenges in the Digital Supply Chain ecosystem. However, some of the aspects that bring in those efficiencies, such as a model for interactive content, integration with content identification systems and scalable web services for IMF validation, are under development. Good authoring tools will drive adoption and we believe they are critical for the success of IMF. By participating in various standardization activities around IMF we have the opportunity to constantly review possible areas of future development that would not only enhance the usability of Photon but also aid in IMF adoption through easy-to-use open source tools.

Conclusions

IMF is an evolving standard - while it addresses many problems in the Digital Supply Chain ecosystem, challenges abound and there is opportunity for further work. The success of IMF will depend upon participation by content partners, fulfillment partners as well as content retailers. At this time, Netflix is 100% committed to the success of IMF.




Tuesday, June 21, 2016

Netflix Billing Migration to AWS


On January 4, 2016, right before Netflix expanded into 130 new countries, Netflix Billing infrastructure became 100% AWS cloud-native. Migration of Billing infrastructure from the Netflix Data Center (DC) to the AWS Cloud was part of a broader initiative. This prior blog post is a great read that summarizes our strategic goals and direction towards AWS migration.


For a company, its billing solution is its financial lifeline, while at the same time, it is a visible representation of a company’s attitude towards its customers. A great customer experience is one of Netflix’s core values. Considering the sensitive nature of Billing, with its direct impact on our monetary relationship with our members as well as on financial reporting, this migration needed to be handled as delicately as possible. Our primary goal was to define a secure, resilient and granular path for migration to the Cloud, without impacting the member experience.


This blog entry discusses our approach to migrating a complex Billing ecosystem from the Netflix Data Center (DC) into the AWS Cloud.


Components of our Billing architecture
Billing infrastructure is responsible for managing the billing state of Netflix members. This includes keeping track of open/paid billing periods, the amount of credit on the member’s account and the date through which the member has paid, as well as managing the member’s payment status and initiating charge requests. In addition, billing data feeds into financial systems for revenue and tax reporting for Netflix accounting. To accomplish the above, billing engineering encompasses:


  • Batch jobs to create recurring renewal orders for a global subscriber base, aggregated data feeds into our General Ledger (GL) for daily revenue from all payment methods including gift cards, and a Tax Service that reads from and posts into the Tax engine. Batch jobs also generate messaging events and streaming/DVD hold events based on the billing state of customers.
  • Billing APIs that provide billing and gift card details to the customer service platform and website. Billing APIs are also part of workflows initiated to process user actions like member signup, change plan, cancellation, update address, chargebacks, and refund requests.
  • Integrations with different services like member account service, payment processing, customer service, customer messaging, DVD website and shipping.


Billing systems had integrations in DC as well as in the cloud with the cloud-native systems. At a high level, our pre-migration architecture could be abstracted as shown below:
Considering how much code and data was interacting with Oracle, one of our objectives was to break up our giant Oracle-based solution into a services-based architecture. Some of our APIs needed to be multi-region and highly available, so we decided to split our data into multiple data stores. Subscriber data was migrated to a Cassandra data store. Our payment processing integration needed ACID transactions, hence all relevant data was migrated to MySQL. The following is a representation of our post-migration architecture.


Challenges
As we approached the mammoth task of migration, we were keenly aware of the many challenges in front of us:


  • Our migration should ideally not take any downtime for user facing flows.
  • Our new architecture in AWS would need to scale with the rapidly growing member base.
  • We had billions of rows of data, constantly changing and composed of all the historical data since Netflix’s inception in 1997. It was growing every single minute in our large shared database on Oracle. To move all this data over to AWS, we first needed to transport and synchronize the data, in real time, into a double-digit-terabyte RDBMS in the cloud.
  • Being a SOX system added another layer of complexity, since all the migration and tooling needed to adhere to our SOX processes.
  • Netflix was launching in many new countries and marching towards being global soon.
  • Billing migration needed to happen without adversely impacting other teams that were busy with their own migration and global launch milestones.


Approach
Our approach to migration was guided by simple principles that helped us in defining the way forward. We will cover the most important ones below:


  • Challenge complexity and simplify: It is much easier to simply accept complexity inherent in legacy systems than challenge it, though when you are floating in a lot of data and code, simplification becomes the key. It seemed very intimidating until we spent a few days opening up everything and asking ourselves repeatedly about how else we could simplify.


    • Cleaning up Code: We started breaking existing code into smaller, more efficient modules and first moved some critical dependencies to run from the Cloud. We moved our tax solution to the Cloud first.


Next, we retired serving member billing history from giant tables that were part of  many different code paths. We built a new application to capture billing events, migrated only necessary data into our new Cassandra data store and started serving billing history, globally, from the Cloud.


We spent a good amount of time writing a data migration tool that would transform member  billing attributes spread across many tables in Oracle  into a much simpler Cassandra data structure.


We worked with our DVD engineering counterparts to further simplify our integration and got rid of obsolete code.


    • Purging Data: We took a hard look at every single table to ensure that we were migrating only what we needed and leaving everything else behind. Historical billing data is valuable to legal and customer service teams. Our goal was to migrate only necessary data into the Cloud. So, we worked with impacted teams  to find out what parts of historical data they really needed. We identified alternative data stores that could serve old data for these teams. After that, we started purging data that was obsolete and was not needed for any function.


  • Build tooling to be resilient and compliant: Our goal was to migrate applications incrementally with zero downtime. To achieve this, we built proxies and redirectors to pipe data back into DC. This helped us keep our applications in DC unimpacted by the change until we were ready to migrate them.


We had to build tooling in order to support our Billing Cloud infrastructure which needed to be SOX compliant.  For SOX compliance we needed to ensure mitigation of unexpected developer actions and auditability of actions.   


Our Cloud deployment tool Spinnaker was enhanced to capture details of deployment and pipe events to Chronos and our Big Data Platform for auditability. We needed to enhance the Cassandra client for authentication and auditable actions. We wrote new alerts using Atlas to help us monitor our applications and data in the Cloud.


With the help of our Data analytics team, we built a comparator to reconcile subscriber data in the Cassandra datastore against data in Oracle by country and report mismatches. To achieve the above, we relied heavily on the Netflix Big Data Platform to capture deployment events and used Sqoop to transport data from our Oracle database and Cassandra clusters to Hive. We wrote Hive queries and MapReduce jobs for the needed reports and dashboards.
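
The idea behind such a comparator can be illustrated with a small in-memory sketch. The real reconciliation described above ran as Hive queries and MapReduce jobs; the Python below only shows the comparison logic, and the row layout `(country, subscriber_id, billing_state)` and the name `reconcile_by_country` are assumptions made for illustration.

```python
from collections import defaultdict

_MISSING = object()  # sentinel: the subscriber does not exist in that store


def reconcile_by_country(oracle_rows, cassandra_rows):
    """Return {country: [subscriber_id, ...]} for rows that do not match.

    Each input is an iterable of (country, subscriber_id, billing_state)
    tuples; this layout is hypothetical.
    """
    oracle = {(c, sid): state for c, sid, state in oracle_rows}
    cassandra = {(c, sid): state for c, sid, state in cassandra_rows}

    mismatches = defaultdict(list)
    # Walk the union of keys so rows missing on either side are reported too.
    for key in sorted(oracle.keys() | cassandra.keys()):
        if oracle.get(key, _MISSING) != cassandra.get(key, _MISSING):
            country, subscriber_id = key
            mismatches[country].append(subscriber_id)
    return dict(mismatches)
```

Grouping the result by country mirrors the per-country reporting mentioned above and fits a migration that proceeds one country at a time.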


  • Test with a clean and limited dataset first. How global expansion helped us: As Netflix was launching in new countries, it created a lot of challenges for us, though it also provided an opportunity to test our Cloud infrastructure with new, clean data, not weighed down by legacy. So, we created a new skinny billing infrastructure in the Cloud for all the user facing functionality, and a skinny version of our renewal batch process, with integration into DC applications, to complete the billing workflow. Once the data for new countries could be successfully processed in the Cloud, it gave us the confidence to extend the Cloud footprint for existing large legacy countries, especially the US, where we support not only streaming but DVD billing as well.


  • Decouple user facing flows to shield customer experience from downtimes or other migration impacts: As we were getting ready to migrate existing members’ data into Cassandra, we needed downtime to halt processing while we migrated subscription data from Oracle to Cassandra for our APIs and batch renewal in the Cloud. All our tooling was built around the ability to migrate a country at a time and tunnel traffic as needed.


We worked with ecommerce and membership services to change integration in user workflows to an asynchronous model. We built retry capabilities to rerun failed processing and repeat as needed. We added optimistic customer state management to ensure our members were not penalized while our processing was halted.
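
A retry capability of the kind mentioned above might look like the following minimal sketch. The post does not describe the actual retry policy, so the exponential backoff, the parameter names and the function name `run_with_retries` are all assumptions for illustration.

```python
import time


def run_with_retries(task, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call task() until it succeeds or max_attempts is reached.

    Waits base_delay, then 2 * base_delay, 4 * base_delay, ... seconds
    between attempts (an assumed policy). The last failure is re-raised
    so the caller can park the work item for a later rerun.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable so the delay can be replaced in tests; in an asynchronous workflow the same idea would typically be expressed by re-queueing the failed message rather than blocking.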


By doing all the above, we transformed and moved millions of rows from Oracle in DC  to Cassandra in AWS without any obvious user impact.


  • Moving a database needs its own strategic planning: Database movement needs to be planned out while keeping the end goal in sight, or else it can go very wrong. There are many decisions to be made: predicting storage needs and absorbing at least a year’s worth of data growth, which translates into the number of instances needed; licensing costs for both production and test environments; using RDS services vs. managing larger EC2 instances; ensuring that the database architecture can address scalability, availability and reliability of data; creating a disaster recovery plan; planning for the minimum possible migration downtime; and the list goes on. As part of this migration, we decided to migrate from licensed Oracle to an open source MySQL database running on Netflix-managed EC2 instances.


While our subscription processing was using data in our Cassandra datastore, our payment processor needed the ACID capabilities of an RDBMS to process charge transactions. We still had a multi-terabyte database that would not fit within the terabyte limits of AWS RDS. With the help of Netflix platform core and database engineering, we defined a multi-region, scalable architecture for our MySQL master with a DRBD copy and multiple read replicas available in different regions. We also moved all our ETL processing to replicas to avoid resource contention on the master. Database Cloud Engineering built tooling and alerts for MySQL instances to ensure monitoring and recovery as needed.


Our other big challenge was migrating constantly changing data to MySQL in AWS without taking any downtime. After exploring many options, we proceeded with Oracle GoldenGate, which could replicate our tables across heterogeneous databases, along with ongoing incremental changes. Of course, this was a very large movement of data that ran in parallel with our production operations and other migrations for a couple of months. We conducted iterative testing and issue-fixing cycles to run our applications against MySQL. Eventually, many weeks before flipping the switch, we started running our test database on MySQL and would fix and test all issues on the MySQL code branch before doing a final validation on Oracle and releasing in production. Running our test environment against MySQL continuously created a great feedback loop for us.


Finally, on January 4, with a flip of a switch, we were able to move our processing and data ETLs to run against MySQL.


Reflection
While our migration to the Cloud was relatively smooth, looking back, there are always a few things we could have done better. We underestimated testing automation needs. We did not have a good way to test end-to-end flows. Investing more effort in these areas up front would have given us better developer velocity.


Migrating something as critical as billing with scale and legacy that needed to be addressed was plenty of work, though the benefits from the migration and simplification are also numerous. Post migration, we are more efficient and lighter in our software footprint than before. We are able to fully utilize Cloud capabilities in tooling, alerting and monitoring provided by the Netflix platform services. Our applications are able to scale horizontally as needed, which has helped us in keeping up our processing with subscriber growth.


In conclusion, billing migration was a major cross functional engineering effort. Different engineering teams: core platform, security, database engineering, tooling, big data platform, business teams and other engineering teams supported us through this. We plan to cover focused topics on database migration and engineering perspectives as a continuing series of blog posts in the future.


Now that we are in the Cloud, we see numerous opportunities to further enhance our services by using innovations of AWS and the Netflix platform. Netflix being global is bringing many more interesting challenges to our path. We have started our next big effort to re-architect our billing platform to become even more efficient and distributed for a global subscriber scale. If you are interested in helping us solve these problems, we are hiring!


-By Stevan Vlaovic, Rahul Pilani, Subir Parulekar & Sangeeta Handa

Monday, June 6, 2016

Toward A Practical Perceptual Video Quality Metric


by Zhi Li, Anne Aaron, Ioannis Katsavounidis, Anush Moorthy and Megha Manohara


At Netflix we care about video quality, and we care about measuring video quality accurately at scale. Our method, Video Multimethod Assessment Fusion (VMAF), seeks to reflect the viewer’s perception of our streaming quality.  We are open-sourcing this tool and invite the research community to collaborate with us on this important project.

Our Quest for High Quality Video

We strive to provide our members with a great viewing experience: smooth video playback, free of annoying picture artifacts. A significant part of this endeavor is delivering video streams with the best perceptual quality possible, given the constraints of the network bandwidth and viewing device. We continuously work towards this goal through multiple efforts.

First, we innovate in the area of video encoding. Streaming video requires compression using standards, such as H.264/AVC, HEVC and VP9, in order to stream at reasonable bitrates. When videos are compressed too much or improperly, these techniques introduce quality impairments, known as compression artifacts. Experts refer to them as “blocking”, “ringing” or “mosquito noise”, but for the typical viewer, the video just doesn’t look right. For this reason, we regularly compare codec vendors on compression efficiency, stability and performance, and integrate the best solutions in the market. We evaluate the different video coding standards to ensure that we remain at the cutting-edge of compression technology. For example, we run comparisons among H.264/AVC, HEVC and VP9, and in the near future we will experiment on the next-generation codecs developed by the Alliance for Open Media (AOM) and the Joint Video Exploration Team (JVET). Even within established standards we continue to experiment on recipe decisions (see Per-Title Encoding Optimization project) and rate allocation algorithms to fully utilize existing toolsets.

We encode the Netflix video streams in a distributed cloud-based media pipeline, which allows us to scale to meet the needs of our business. To minimize the impact of bad source deliveries, software bugs and the unpredictability of cloud instances (transient errors), we automate quality monitoring at various points in our pipeline. Through this monitoring, we seek to detect video quality issues at ingest and at every transform point in our pipeline.

Finally, as we iterate in various areas of the Netflix ecosystem (such as the adaptive streaming or content delivery network algorithms) and run A/B tests, we work to ensure that video quality is maintained or improved by the system refinements. For example, an improvement in the adaptive streaming algorithm that is aimed to reduce playback start delay or re-buffers should not degrade overall video quality in a streaming session.

All of the challenging work described above hinges on one fundamental premise: that we can accurately and efficiently measure the perceptual quality of a video stream at scale. Traditionally, in video codec development and research, two methods have been extensively used to evaluate video quality: 1) Visual subjective testing and 2) Calculation of simple metrics such as PSNR, or more recently, SSIM [1].

Without doubt, manual visual inspection is operationally and economically infeasible for the throughput of our production, A/B test monitoring and encoding research experiments. Measuring image quality is an old problem, to which a number of simple and practical solutions have been proposed. Mean-squared-error (MSE), Peak-signal-to-noise-ratio (PSNR) and Structural Similarity Index (SSIM) are examples of metrics originally designed for images and later extended to video. These metrics are often used within codecs (“in-loop”) for optimizing coding decisions and for reporting the final quality of encoded video. Although researchers and engineers in the field are well-aware that PSNR does not consistently reflect human perception, it remains the de facto standard for codec comparisons and codec standardization work.
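
For readers unfamiliar with these metrics, here is a minimal sketch of MSE and PSNR using the standard definition PSNR = 10 * log10(MAX^2 / MSE). It treats a frame as a flat list of 8-bit sample values, which is a simplification chosen for this illustration; the function names are hypothetical and real tools operate on full frames, usually per color plane.

```python
import math


def mse(reference, distorted):
    """Mean squared error between two equally sized lists of samples."""
    if not reference or len(reference) != len(distorted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((r - d) ** 2 for r, d in zip(reference, distorted))
    return sum(squared_errors) / len(reference)


def psnr(reference, distorted, max_value=255):
    """Peak signal-to-noise ratio in decibels (infinite for identical input)."""
    error = mse(reference, distorted)
    if error == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / error)
```

Because both functions look only at per-sample differences, two distortions with the same MSE receive the same PSNR regardless of where in the picture the errors fall, which is one way to see why PSNR does not consistently track human perception.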

Building A Netflix-Relevant Dataset

To evaluate video quality assessment algorithms, we take a data-driven approach. The first step is to gather a dataset that is relevant to our use case. Although there are publicly available databases for designing and testing video quality metrics, they lack the diversity in content that is relevant to practical streaming services such as Netflix. Many of them are no longer state-of-the-art in terms of the quality of the source and encodes; for example, they contain standard definition (SD) content and cover older compression standards only. Furthermore, since the problem of assessing video quality is far more general than measuring compression artifacts, the existing databases seek to capture a wider range of impairments caused not only by compression, but also by transmission losses, random noise and geometric transformations. For example, real-time transmission of surveillance footage of typically black and white, low-resolution video (640x480) exhibits a markedly different viewing experience than that experienced when watching one’s favorite Netflix show in a living room.

Netflix's streaming service presents a unique set of challenges as well as opportunities for designing a perceptual metric that accurately reflects streaming video quality. For example:

Video source characteristics. Netflix carries a vast collection of movies and TV shows, which exhibit diversity in genre such as kids content, animation, fast-moving action movies, documentaries with raw footage, etc. Furthermore, they also exhibit diverse low-level source characteristics, such as film-grain, sensor noise, computer-generated textures, consistently dark scenes or very bright colors. Many of the quality metrics developed in the past have not been tuned to accommodate this huge variation in source content. For example, many of the existing databases lack animation content and most don’t take into account film grain, a signal characteristic that is very prevalent in professional entertainment content.

Source of artifacts. As Netflix video streams are delivered using the robust Transmission Control Protocol (TCP), packet losses and bit errors are never sources of visual impairments. That leaves two types of artifacts in the encoding process which will ultimately impact the viewer's quality of experience (QoE): compression artifacts (due to lossy compression) and scaling artifacts (for lower bitrates, video is downsampled before compression, and later upsampled on the viewer’s device). By tailoring a quality metric to only cover compression and scaling artifacts, trading generality for precision, its accuracy is expected to outperform a general-purpose one.

To build a dataset more tailored to the Netflix use case, we selected a sample of 34 source clips (also called reference videos), each 6 seconds long, from popular TV shows and movies from the Netflix catalog and combined them with a selection of publicly available clips. The source clips covered a wide range of high-level features (animation, indoor/outdoor, camera motion, face close-up, people, water, obvious salience, number of objects) and low level characteristics (film grain noise, brightness, contrast, texture, motion, color variance, color richness, sharpness). Using the source clips, we encoded H.264/AVC video streams at resolutions ranging from 384x288 to 1920x1080 and bitrates from 375 kbps to 20,000 kbps, resulting in about 300 distorted videos. This sweeps a broad range of video bitrates and resolutions to reflect the widely varying network conditions of Netflix members.

We then ran subjective tests to determine how non-expert observers would score the impairments of an encoded video with respect to the source clip. In standardized subjective testing, the methodology we used is referred to as the Double Stimulus Impairment Scale (DSIS) method. The reference and distorted videos were displayed sequentially on a consumer-grade TV, with controlled ambient lighting (as specified in recommendation ITU-R BT.500-13 [2]). If the distorted video was encoded at a smaller resolution than the reference, it was upscaled to the source resolution before it was displayed on the TV. The observer sat on a couch in a living room-like environment and was asked to rate the impairment on a scale of 1 (very annoying) to 5 (not noticeable). The scores from all observers were combined to generate a Differential Mean Opinion Score or DMOS for each distorted video and normalized in the range 0 to 100, with the score of 100 for the reference video. The set of reference videos, distorted videos and DMOS scores from observers will be referred to in this article as the NFLX Video Dataset.

Traditional Video Quality Metrics

How do the traditional, widely-used video quality metrics correlate to the “ground-truth” DMOS scores for the NFLX Video Dataset?

A Visual Example

[Figure: portions of still frames from four distorted videos of the “crowd” and “fox” clips]

Above, we see portions of still frames captured from 4 different distorted videos; the two videos on top reported a PSNR value of about 31 dB, while the bottom two reported a PSNR value of about 34 dB. Yet, one can barely notice the difference on the “crowd” videos, while the difference is much clearer on the two “fox” videos. Human observers confirm this by rating the two “crowd” videos as having a DMOS score of 82 (top) and 96 (bottom), while rating the two “fox” videos with DMOS scores of 27 and 58, respectively.

Detailed Results

The graphs below are scatter plots showing the observers’ DMOS on the x-axis and the predicted score from different quality metrics on the y-axis. These plots were obtained from a selected subset of the NFLX Video Dataset, which we label as NFLX-TEST (see next section for details). Each point represents one distorted video. We plot the results for four quality metrics:
  • PSNR for luminance component
  • SSIM [1]
  • Multiscale FastSSIM [3]
  • PSNR-HVS [4]
More details on SSIM, Multiscale FastSSIM and PSNR-HVS can be found in the publications listed in the Reference section. For these three metrics we used the implementation in the Daala code base [5] so the titles in subsequent graphs are prefixed with “Daala”.
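
As a point of reference for the simplest of these metrics, luminance PSNR can be sketched in a few lines. This is a minimal illustration, not the implementation used for the plots: the function name `psnr_luma`, the flat-list representation of a luma plane, the 8-bit peak value and the 60 dB ceiling for identical planes are all assumptions made for the example.

```python
import math

def psnr_luma(ref, dis, peak=255.0, cap=60.0):
    """PSNR in dB between two equally sized luma planes given as flat lists.

    `peak` assumes 8-bit samples; `cap` is an assumed ceiling returned when
    the planes are identical (the MSE is zero and the ratio is undefined).
    """
    if len(ref) != len(dis) or not ref:
        raise ValueError("planes must be non-empty and the same size")
    mse = sum((r - d) ** 2 for r, d in zip(ref, dis)) / len(ref)
    if mse == 0:
        return cap
    return min(cap, 10.0 * math.log10(peak ** 2 / mse))

# Hypothetical 2x2 luma planes, flattened row by row.
ref_plane = [16, 64, 128, 235]
dis_plane = [18, 60, 130, 230]
print(round(psnr_luma(ref_plane, dis_plane), 2))
```

Because PSNR depends only on the mean squared error, two encodes with the same PSNR can look very different to a viewer, which is the behavior the “crowd” and “fox” example illustrates.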

[Figure: scatter plots of PSNR, SSIM, Multiscale FastSSIM and PSNR-HVS scores versus DMOS on the NFLX-TEST dataset]
Note: The points with the same color correspond to distorted videos stemming from the same reference video. Due to subject variability and reference video normalization to 100, some DMOS scores can exceed 100.

It can be seen from the graphs that these metrics fail to provide scores that consistently predict the DMOS ratings from observers. For example, focusing on the PSNR graph in the upper left corner, for PSNR values around 35 dB, the “ground-truth” DMOS values range anywhere from 10 (impairments are annoying) to 100 (impairments are imperceptible). Similar conclusions can be drawn for the SSIM and Multiscale FastSSIM metrics, where a score close to 0.90 can correspond to DMOS values from 10 to 100. Above each plot, we report the Spearman’s rank correlation coefficient (SRCC), the Pearson product-moment correlation coefficient (PCC) and the root-mean-squared-error (RMSE) figures for each of the metrics, calculated after a non-linear logistic fitting, as outlined in Annex 3.1 of ITU-R BT.500-13 [2]. SRCC and PCC values closer to 1.0 and RMSE values closer to zero are desirable. Among the four metrics, PSNR-HVS demonstrates the best SRCC, PCC and RMSE values, but is still lacking in prediction accuracy.
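
The three figures of merit can be sketched in plain Python as follows. This is an illustrative sketch only: it omits the non-linear logistic fitting step and assumes the predicted scores have already been mapped onto the DMOS scale. The helper names and the sample numbers are hypothetical and are not taken from the NFLX Video Dataset.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation coefficient (PCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def rmse(pred, truth):
    """Root-mean-squared error between predictions and ground truth."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred))

# Hypothetical DMOS values and predictions already on the DMOS scale.
dmos = [20, 40, 60, 80, 100]
predicted = [25, 35, 65, 75, 95]
print(spearman(predicted, dmos), pearson(predicted, dmos), rmse(predicted, dmos))
```

SRCC depends only on rank order, so it is unaffected by any monotonic fitting function; PCC and RMSE are the figures that change when the logistic fit is applied.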

In order to achieve meaningful performance across a wide variety of content, a metric should exhibit good relative quality scores, i.e., a delta in the metric should provide information about the delta in perceptual quality. In the graphs below, we select three typical reference videos, a high-noise video, a CG animation and a TV drama, and plot the predicted score vs. DMOS of the different distorted videos for each. To be effective as a relative quality score, a constant slope across different clips within the same range of the quality curve is desirable. For example, referring to the PSNR plot below, in the range 34 dB to 36 dB, a change in PSNR of about 2 dB for TV drama corresponds to a DMOS change of about 50 (50 to 100) but a similar 2 dB change in the same range for the CG animation corresponds to less than 20 (40 to 60) change in DMOS. While SSIM and FastSSIM exhibit more consistent slopes for CG animation and TV drama clips, their performance is still lacking.
[Figure: predicted score vs. DMOS for the three selected reference videos (high-noise video, CG animation, TV drama)]
In conclusion, we see that the traditional metrics do not work well for our content. To address this issue we adopted a machine-learning based model to design a metric that seeks to reflect human perception of video quality. This metric is discussed in the following section.

Our Method: Video Multimethod Assessment Fusion (VMAF)

Building on our research collaboration with Prof. C.-C. J. Kuo and his group at the University of Southern California [6][7], we developed Video Multimethod Assessment Fusion, or VMAF, a metric that predicts subjective quality by combining multiple elementary quality metrics. The basic rationale is that each elementary metric may have its own strengths and weaknesses with respect to the source content characteristics, type of artifacts, and degree of distortion. By ‘fusing’ elementary metrics into a final metric using a machine-learning algorithm - in our case, a Support Vector Machine (SVM) regressor - which assigns weights to each elementary metric, the final metric could preserve all the strengths of the individual metrics, and deliver a more accurate final score. The machine-learning model is trained and tested using the opinion scores obtained through a subjective experiment (in our case, the NFLX Video Dataset).

The current version of the VMAF algorithm and model (denoted as VMAF 0.3.1), released as part of the VMAF Development Kit open source software, uses the following elementary metrics fused by Support Vector Machine (SVM) regression [8]:
  1. Visual Information Fidelity (VIF) [9]. VIF is a well-adopted image quality metric based on the premise that quality is complementary to the measure of information fidelity loss. In its original form, the VIF score is measured as a loss of fidelity combining four scales. In VMAF, we adopt a modified version of VIF where the loss of fidelity in each scale is included as an elementary metric.
  2. Detail Loss Metric (DLM) [10]. DLM is an image quality metric based on the rationale of separately measuring the loss of details which affects the content visibility, and the redundant impairment which distracts viewer attention. The original metric combines both DLM and additive impairment measure (AIM) to yield a final score. In VMAF, we only adopt the DLM as an elementary metric. Particular care was taken for special cases, such as black frames, where numerical calculations for the original formulation break down.
VIF and DLM are both image quality metrics. We further introduce the following simple feature to account for the temporal characteristics of video:
  1. Motion. This is a simple measure of the temporal difference between adjacent frames. This is accomplished by calculating the average absolute pixel difference for the luminance component.
These elementary metrics and features were chosen from amongst other candidates through iterations of testing and validation.
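
The motion feature described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the VDK code: frames are represented as flat lists of luma samples, the function name `motion_scores` is hypothetical, and assigning 0.0 to the first frame (which has no predecessor) is a choice made for the example. The actual implementation may apply additional processing.

```python
def motion_scores(frames):
    """Mean absolute luma difference between each frame and the previous one.

    `frames` is a list of equally sized luma planes (flat lists of samples).
    The first frame has no predecessor, so it is given 0.0 by assumption.
    """
    if not frames:
        return []
    scores = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        if len(prev) != len(cur):
            raise ValueError("all frames must have the same size")
        scores.append(sum(abs(c - p) for p, c in zip(prev, cur)) / len(cur))
    return scores

# Three hypothetical 2x2 frames: a change, then a static frame.
frames = [
    [10, 10, 10, 10],
    [10, 12, 10, 14],
    [10, 12, 10, 14],
]
print(motion_scores(frames))  # [0.0, 1.5, 0.0]
```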

We compare the accuracy of VMAF to the other quality metrics described above. To avoid unfairly overfitting VMAF to the dataset, we first divide the NFLX Video Dataset into two subsets, referred to as NFLX-TRAIN and NFLX-TEST. The two sets have non-overlapping reference clips. The SVM regressor is then trained with the NFLX-TRAIN dataset, and tested on NFLX-TEST. The plots below show the performance of the VMAF metric on the NFLX-TEST dataset and on the selected reference clips (high-noise video, a CG animation and TV drama). For ease of comparison, we repeat the plots for PSNR-HVS, the best performing metric from the earlier section. It is clear that VMAF performs appreciably better.
[Figure: VMAF and PSNR-HVS scores vs. DMOS on the NFLX-TEST dataset]
[Figure: VMAF and PSNR-HVS scores vs. DMOS for the three selected reference clips]
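
A train/test split with non-overlapping reference clips, as described above, can be sketched by splitting on the reference (content) identifier rather than on individual distorted videos. The function name `split_by_content`, the 30% test fraction, the fixed seed and the sample records are all assumptions for illustration; the article does not state how the actual split was chosen.

```python
import random

def split_by_content(dis_videos, test_fraction=0.3, seed=0):
    """Split distorted videos so that no reference clip appears in both sets."""
    content_ids = sorted({v['content_id'] for v in dis_videos})
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(content_ids)
    n_test = max(1, round(len(content_ids) * test_fraction))
    test_ids = set(content_ids[:n_test])
    train = [v for v in dis_videos if v['content_id'] not in test_ids]
    test = [v for v in dis_videos if v['content_id'] in test_ids]
    return train, test

# Hypothetical records: four reference clips, two distorted videos each.
videos = [
    {'content_id': c, 'asset_id': 2 * c + k, 'dmos': 50 + 10 * k}
    for c in range(4) for k in range(2)
]
train, test = split_by_content(videos)
print(len(train), len(test))
```

Splitting at the level of distorted videos instead would let encodes of the same source land in both sets, which tends to inflate test accuracy.
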
We also compare VMAF to the Video Quality Model with Variable Frame Delay (VQM-VFD) [11], considered by many as state of the art in the field. VQM-VFD is an algorithm that uses a neural network model to fuse low-level features into a final metric. It is similar to VMAF in spirit, except that it extracts features at lower levels such as spatial and temporal gradients.
[Figure: VQM-VFD performance on the NFLX-TEST dataset]
It is clear that VQM-VFD performs close to VMAF on the NFLX-TEST dataset. Since the VMAF approach allows for incorporation of new elementary metrics into its framework, VQM-VFD could serve as an elementary metric for VMAF as well.
The table below lists the performance, as measured by the SRCC, PCC and RMSE figures, of the VMAF model after fusing different combinations of the individual elementary metrics on the NFLX-TEST dataset, as well as the final performance of VMAF 0.3.1. We also list the performance of VMAF augmented with VQM-VFD. The results justify our premise that an intelligent fusion of high-performance quality metrics results in an increased correlation with human perception.
NFLX-TEST dataset

| Metric | SRCC | PCC | RMSE |
|---|---|---|---|
| VIF | 0.883 | 0.859 | 17.409 |
| ADM | 0.948 | 0.954 | 9.849 |
| VIF+ADM | 0.953 | 0.956 | 9.941 |
| VMAF 0.3.1 (VIF+ADM+MOTION) | 0.953 | 0.963 | 9.277 |
| VQM-VFD | 0.949 | 0.934 | 11.967 |
| VMAF 0.3.1+VQM-VFD | 0.959 | 0.965 | 9.159 |

Summary of Results

In the tables below we summarize the SRCC, PCC and RMSE of the different metrics discussed earlier, on the NFLX-TEST dataset and three popular public datasets: the VQEG HD (vqeghd3 collection only) [12], the LIVE Video Database [13] and the LIVE Mobile Video Database [14]. The results show that VMAF 0.3.1 outperforms other metrics in all but the LIVE dataset, where it still offers competitive performance compared to the best-performing VQM-VFD. Since VQM-VFD demonstrates good correlation across the four datasets, we are experimenting with VQM-VFD as an elementary metric for VMAF; although it is not part of the open source release VMAF 0.3.1, it may be integrated in subsequent releases.

NFLX-TEST dataset

| Metric | SRCC | PCC | RMSE |
|---|---|---|---|
| PSNR | 0.746 | 0.725 | 24.577 |
| SSIM | 0.603 | 0.417 | 40.686 |
| FastSSIM | 0.685 | 0.605 | 31.233 |
| PSNR-HVS | 0.845 | 0.839 | 18.537 |
| VQM-VFD | 0.949 | 0.934 | 11.967 |
| VMAF 0.3.1 | 0.953 | 0.963 | 9.277 |

LIVE dataset*

| Metric | SRCC | PCC | RMSE |
|---|---|---|---|
| PSNR | 0.416 | 0.394 | 16.934 |
| SSIM | 0.658 | 0.618 | 12.340 |
| FastSSIM | 0.566 | 0.561 | 13.691 |
| PSNR-HVS | 0.589 | 0.595 | 13.213 |
| VQM-VFD | 0.763 | 0.767 | 9.897 |
| VMAF 0.3.1 | 0.690 | 0.655 | 12.180 |

*For compression-only impairments (H.264/AVC and MPEG-2 Video)

VQEGHD3 dataset*

| Metric | SRCC | PCC | RMSE |
|---|---|---|---|
| PSNR | 0.772 | 0.759 | 0.738 |
| SSIM | 0.856 | 0.834 | 0.621 |
| FastSSIM | 0.910 | 0.922 | 0.415 |
| PSNR-HVS | 0.858 | 0.850 | 0.580 |
| VQM-VFD | 0.925 | 0.924 | 0.420 |
| VMAF 0.3.1 | 0.929 | 0.939 | 0.372 |

*For source content SRC01 to SRC09 and streaming-relevant impairments HRC04, HRC07, and HRC16 to HRC21

LIVE Mobile dataset

| Metric | SRCC | PCC | RMSE |
|---|---|---|---|
| PSNR | 0.632 | 0.643 | 0.850 |
| SSIM | 0.664 | 0.682 | 0.831 |
| FastSSIM | 0.747 | 0.745 | 0.718 |
| PSNR-HVS | 0.703 | 0.726 | 0.722 |
| VQM-VFD | 0.770 | 0.795 | 0.639 |
| VMAF 0.3.1 | 0.872 | 0.905 | 0.401 |

VMAF Development Kit (VDK) Open Source Package

To deliver high-quality video over the Internet, we believe that the industry needs good perceptual video quality metrics that are practical to use and easy to deploy at scale. We have developed VMAF to help us address this need. Today, we are open-sourcing the VMAF Development Kit (VDK 1.0.0) package on GitHub under Apache License Version 2.0. By open-sourcing the VDK, we hope it can evolve over time to yield improved performance.

The feature extraction (including elementary metric calculation) portion in the VDK core is computationally intensive, so it is written in C for efficiency. The control code is written in Python for fast prototyping.

The package comes with a simple command-line interface to allow a user to run VMAF in single mode (run_vmaf command) or in batch mode (run_vmaf_in_batch command, which optionally enables parallel execution). Furthermore, as feature extraction is the most expensive operation, the user can also store the feature extraction results in a datastore to reuse them later.

The package also provides a framework for further customization of the VMAF model based on:
  • The video dataset it is trained on
  • The elementary metrics and other features to be used
  • The regressor and its hyper-parameters

The command run_training takes in three configuration files: a dataset file, which contains information on the training dataset, a feature parameter file and a regressor model parameter file (containing the regressor hyper-parameters). Below is sample code that defines a dataset, a set of selected features, the regressor and its hyper-parameters.

##### define a dataset #####
dataset_name = 'example'
yuv_fmt = 'yuv420p'
width = 1920
height = 1080
ref_videos = [
   {'content_id':0, 'path':'checkerboard.yuv'},
   {'content_id':1, 'path':'flat.yuv'},
]
dis_videos = [
   {'content_id':0, 'asset_id': 0, 'dmos':100, 'path':'checkerboard.yuv'}, # ref
   {'content_id':0, 'asset_id': 1, 'dmos':50,  'path':'checkerboard_dis.yuv'},
   {'content_id':1, 'asset_id': 2, 'dmos':100,  'path':'flat.yuv'}, # ref
   {'content_id':1, 'asset_id': 3, 'dmos':80,  'path':'flat_dis.yuv'},
]

##### define features #####
feature_dict = {
   # VMAF_feature/Moment_feature are the aggregate features
   # motion, adm2, dis1st are the atom features
   'VMAF_feature':['motion', 'adm2'],
   'Moment_feature':['dis1st'], # 1st moment on dis video
}

##### define regressor and hyper-parameters #####
model_type = "LIBSVMNUSVR" # libsvm NuSVR regressor
model_param_dict = {
   # ==== preprocess: normalize each feature ==== #
   'norm_type':'clip_0to1', # rescale to within [0, 1]
   # ==== postprocess: clip final quality score ==== #
   'score_clip':[0.0, 100.0], # clip to within [0, 100]
   # ==== libsvmnusvr parameters ==== #
   'gamma':0.85, # selected
   'C':1.0, # default
   'nu':0.5, # default
   'cache_size':200 # default
}
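
Before training on a dataset file like the one above, it can be useful to confirm that it is internally consistent. The helper below is a hypothetical sketch and is not part of the VDK: it only checks that every distorted video points at a declared reference and that asset IDs are unique, using the same dictionary keys as the sample.

```python
def check_dataset(ref_videos, dis_videos):
    """Return a list of problems found; an empty list means consistent."""
    problems = []
    ref_ids = {r['content_id'] for r in ref_videos}
    if len(ref_ids) != len(ref_videos):
        problems.append('duplicate content_id in ref_videos')
    seen_assets = set()
    for d in dis_videos:
        # Every distorted video must reference a declared source clip.
        if d['content_id'] not in ref_ids:
            problems.append('asset %s: unknown content_id %s'
                            % (d['asset_id'], d['content_id']))
        # Asset IDs identify distorted videos and should not repeat.
        if d['asset_id'] in seen_assets:
            problems.append('duplicate asset_id %s' % d['asset_id'])
        seen_assets.add(d['asset_id'])
    return problems

# Same shape as the sample dataset above.
sample_refs = [
    {'content_id': 0, 'path': 'checkerboard.yuv'},
    {'content_id': 1, 'path': 'flat.yuv'},
]
sample_dis = [
    {'content_id': 0, 'asset_id': 0, 'dmos': 100, 'path': 'checkerboard.yuv'},
    {'content_id': 0, 'asset_id': 1, 'dmos': 50, 'path': 'checkerboard_dis.yuv'},
    {'content_id': 1, 'asset_id': 2, 'dmos': 100, 'path': 'flat.yuv'},
    {'content_id': 1, 'asset_id': 3, 'dmos': 80, 'path': 'flat_dis.yuv'},
]
print(check_dataset(sample_refs, sample_dis))  # []
```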

Finally, the FeatureExtractor base class can be extended to develop a customized VMAF algorithm. This can be accomplished by experimenting with other available elementary metrics and features, or inventing new ones. Similarly, the TrainTestModel base class can be extended in order to test other regression models. Please refer to CONTRIBUTING.md for more details. A user could also experiment with alternative machine learning algorithms using existing open-source Python libraries, such as scikit-learn [15], cvxopt [16], or tensorflow [17]. An example integration of scikit-learn’s random forest regressor is included in the package.

The VDK package includes the VMAF 0.3.1 algorithm with selected features and a trained SVM model based on subjective scores collected on the NFLX Video Dataset. We also invite the community to use the software package to develop improved features and regressors for the purpose of perceptual video quality assessment. We encourage users to test VMAF 0.3.1 on other datasets, and help improve it for our use case and potentially extend it to other use cases.

Our Open Questions on Quality Assessment

Viewing conditions. Netflix supports thousands of active devices covering smart TVs, game consoles, set-top boxes, computers, tablets and smartphones, resulting in widely varying viewing conditions for our members. The viewing set-up and display can significantly affect perception of quality. For example, a Netflix member watching a 720p movie encoded at 1 Mbps on a 4K 60-inch TV may have a very different perception of the quality of that same stream if it were instead viewed on a 5-inch smartphone. The current NFLX Video Dataset covers a single viewing condition -- TV viewing at a standardized distance. To augment VMAF, we are conducting subjective tests in other viewing conditions. With more data, we can generalize the algorithm such that viewing conditions (display size, distance from screen, etc.) can be inputs to the regressor.

Temporal pooling. Our current VMAF implementation calculates quality scores on a per-frame basis. In many use cases, it is desirable to temporally pool these scores to return a single value as a summary over a longer period of time. For example, a score over a scene, a score over regular time segments, or a score for an entire movie is desirable. Our current approach is a simple temporal pooling that takes the arithmetic mean of the per-frame values. However, this method has the risk of “hiding” poor quality frames. A pooling algorithm that gives more weight to lower scores may be more consistent with human perception. A good pooling mechanism is especially important when using the summary score to compare encodes of differing quality fluctuations among frames or as the target metric when optimizing an encode or streaming session. A perceptually accurate temporal pooling mechanism for VMAF and other quality metrics remains an open and challenging problem.
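
To make the “hiding” effect concrete, the sketch below compares the arithmetic mean with one possible pooling that weights low scores more heavily, a harmonic mean. The harmonic mean and the offset of 1.0 (used to avoid dividing by zero) are assumptions chosen for illustration; they are not presented here as the pooling used by VMAF, and the per-frame scores are hypothetical.

```python
def mean_pool(scores):
    """Arithmetic mean of per-frame scores."""
    return sum(scores) / len(scores)

def harmonic_pool(scores, offset=1.0):
    """Harmonic mean of per-frame scores, which low scores pull down sharply.

    The offset shifts scores away from zero before inverting and is removed
    again afterwards; its value is an assumption for this example.
    """
    return len(scores) / sum(1.0 / (s + offset) for s in scores) - offset

# Hypothetical per-frame scores: mostly good, with one poor frame.
per_frame = [90.0, 90.0, 90.0, 90.0, 10.0]
print(mean_pool(per_frame), harmonic_pool(per_frame))
```

For a clip with constant per-frame quality the two pooled values coincide, so the choice only matters when quality fluctuates.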

A consistent metric. Since VMAF incorporates full-reference elementary metrics, VMAF is highly dependent on the quality of the reference. Unfortunately, the quality of video sources may not be consistent across all titles in the Netflix catalog. Sources come into our system at resolutions ranging from SD to 4K. Even at the same resolution, the best source available may suffer from certain video quality impairments. Because of this, it can be inaccurate to compare (or summarize) VMAF scores across different titles. For example, when a video stream generated from an SD source achieves a VMAF score of 99 (out of 100), it by no means has the same perceptual quality as a video encoded from an HD source with the same score of 99. For quality monitoring, it is highly desirable that we can calculate absolute quality scores that are consistent across sources. After all, when viewers watch a Netflix show, they do not have any reference, other than the picture delivered to their screen. We would like to have an automated way to predict what opinion they form about the quality of the video delivered to them, taking into account all factors that contributed to the final presented video on that screen.

Summary

We have developed VMAF 0.3.1 and the VDK 1.0.0 software package to aid us in our work to deliver the best quality video streams to our members. Our team uses it every day in evaluating video codecs and encoding parameters and strategies, as part of our continuing pursuit of quality. VMAF, together with other metrics, has been integrated into our encoding pipeline to improve on our automated QC. We are in the early stages of using VMAF as one of the client-side metrics to monitor system-wide A/B tests.

Improving video compression standards and making smart decisions in practical encoding systems is very important in today’s Internet landscape. We believe that using the traditional metrics - metrics that do not always correlate with human perception - can hinder real advancements in video coding technology. However, always relying on manual visual testing is simply infeasible. VMAF is our attempt to address this problem, using samples from our content to help design and validate the algorithms. Similar to how the industry works together in developing new video standards, we invite the community to openly collaborate on improving video quality measures, with the ultimate goal of more efficient bandwidth usage and visually pleasing video for all.

Acknowledgments

We would like to acknowledge the following individuals for their help with the VMAF project: Joe Yuchieh Lin, Eddy Chi-Hao Wu, Professor C.-C. Jay Kuo (University of Southern California), Professor Patrick Le Callet (Université de Nantes) and Todd Goodall.

References

[1] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image Quality Assessment: From Error Visibility to Structural Similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, Apr. 2004.

[2] BT.500 : Methodology for the Subjective Assessment of the Quality of Television Pictures, https://www.itu.int/rec/R-REC-BT.500

[3] M.-J. Chen and A. C. Bovik, “Fast Structural Similarity Index Algorithm,” Journal of Real-Time Image Processing, vol. 6, no. 4, pp. 281–287, Dec. 2011.

[4] N. Ponomarenko, F. Silvestri, K. Egiazarian, M. Carli, J. Astola, and V. Lukin, “On Between-coefficient Contrast Masking of DCT Basis Functions,” in Proceedings of the 3rd International Workshop on Video Processing and Quality Metrics for Consumer Electronics (VPQM ’07), Scottsdale, Arizona, Jan. 2007.


[6] T.-J. Liu, J. Y. Lin, W. Lin, and C.-C. J. Kuo, “Visual Quality Assessment: Recent Developments, Coding Applications and Future Trends,” APSIPA Transactions on Signal and Information Processing, 2013.

[7] J. Y. Lin, T.-J. Liu, E. C.-H. Wu, and C.-C. J. Kuo, “A Fusion-based Video Quality Assessment (FVQA) Index,” APSIPA Transactions on Signal and Information Processing, 2014.

[8] C. Cortes and V. Vapnik, “Support-Vector Networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.

[9] H. Sheikh and A. Bovik, “Image Information and Visual Quality,” IEEE Transactions on Image Processing, vol. 15, no. 2, pp. 430–444, Feb. 2006.

[10] S. Li, F. Zhang, L. Ma, and K. Ngan, “Image Quality Assessment by Separately Evaluating Detail Losses and Additive Impairments,” IEEE Transactions on Multimedia, vol. 13, no. 5, pp. 935–949, Oct. 2011.

[11] S. Wolf and M. H. Pinson, “Video Quality Model for Variable Frame Delay (VQM_VFD),” U.S. Dept. Commer., Nat. Telecommun. Inf. Admin., Boulder, CO, USA, Tech. Memo TM-11-482, Sep. 2011.

[12] Video Quality Experts Group (VQEG), “Report on the Validation of Video Quality Models for High Definition Video Content,” June 2010, http://www.vqeg.org/

[13] K. Seshadrinathan, R. Soundararajan, A. C. Bovik and L. K. Cormack, "Study of Subjective and Objective Quality Assessment of Video", IEEE Transactions on Image Processing, vol.19, no.6, pp.1427-1441, June 2010.

[14] A. K. Moorthy, L. K. Choi, A. C. Bovik and G. de Veciana, "Video Quality Assessment on Mobile Devices: Subjective, Behavioral, and Objective Studies," IEEE Journal of Selected Topics in Signal Processing, vol. 6, no. 6, pp. 652-671, Oct. 2012.

[15] scikit-learn: Machine Learning in Python. http://scikit-learn.org/stable/

[16] CVXOPT: Python Software for Convex Optimization. http://cvxopt.org/

[17] TensorFlow. https://www.tensorflow.org/