Tuesday, May 3, 2016

Selecting the best artwork for videos through A/B testing

At Netflix, we are constantly looking for ways to help our 81.5M members discover great stories that they will love.  A big part of that is creating a user experience that is intuitive and fun, and that meaningfully helps members find and enjoy stories on Netflix as quickly as possible.


This blog post and the corresponding non-technical blog by my Creative Services colleague Nick Nelson take a deeper look at the key findings from our work in image selection -- focusing on how we learned, how we improved the service and how we are constantly developing new technologies to make Netflix better for our members.

Gone in 90 seconds

Broadly, we know that if we don’t capture a member’s attention within 90 seconds, that member will likely lose interest and move on to another activity. Such failed sessions sometimes occur because we did not show the right content, and sometimes because we showed the right content but did not provide sufficient evidence as to why the member should watch it. How can we make it easy for our members to quickly evaluate whether a piece of content is of interest to them?


As the old saying goes, a picture is worth a thousand words.  Neuroscientists have discovered that the human brain can process an image in as little as 13 milliseconds, and that across the board, it takes much longer to process text compared to visual information.  Will we be able to improve the experience by improving the images we display in the Netflix experience?


This blog post sheds light on the groundbreaking series of A/B tests Netflix ran, which resulted in increased member engagement.  Our goals were the following:
  1. Identify artwork that enabled members to find a story they wanted to watch faster.
  2. Ensure that our members increase engagement with each title and also watch more in aggregate.
  3. Ensure that we don’t misrepresent titles as we evaluate multiple images.  


The series of tests we ran is much like our work in any other area of the product -- we relentlessly test our way to a better member experience, using increasingly complex hypotheses built on the insights we have gained along the way.

Background and motivation

When a typical member comes to the homepage shown above, they glance at several details for each title, including the display artwork (e.g. the highlighted “Narcos” artwork in the “Popular on Netflix” row), title (“Narcos”), maturity rating (TV-MA), synopsis, star rating, etc. Through various studies, we found that our members look at the artwork first and then decide whether to look at additional details.  Knowing that, we asked ourselves: could we improve the click-through rate for that first glance?  To answer this question, we sought the support of our Creative Services team, who create compelling pieces of artwork that convey the emotion of an entire title in a single image while staying true to its spirit.  The Creative Services team worked with our studio partners, and at times with our internal design team, to create multiple artwork variants.


Examples of artwork that was used in other contexts and doesn’t naturally lend itself to the Netflix service.


Historically, this was a largely unexplored area at Netflix and in the industry in general.  Netflix would get title images from our studio partners that were originally created for a variety of purposes. Some were intended for roadside billboards, where they don’t live alongside other titles.  Other images were sourced from DVD cover art, which doesn’t work well in a grid layout across multiple form factors (TV, mobile, etc.).  Knowing that, we set out to develop a data-driven framework through which we can find the best artwork for each video, both in the context of the Netflix experience and with the goal of increasing overall engagement -- not just moving engagement from one title to another.

Testing our way into a better product

Broadly, Netflix’s A/B testing philosophy is about building incrementally, using data to drive decisions, and failing fast.  When we have a complex area of testing such as image selection, we seek to prove out the hypothesis in incremental steps with increasing rigor and sophistication.


Experiment 1 (single title test with multiple test cells)



One of the earliest tests we ran was on the single title “The Short Game” - an inspiring story about grade school students competing with each other in the game of golf.   Looking at the default artwork for this title, you might not easily realize that it is about kids, and you might skip right past it.  Could we create a few artwork variants that increase the audience for the title?


Box art take rate by test cell:
  • Cell 1 (Control): default artwork
  • Cell 2: 14% better take rate
  • Cell 3: 6% better take rate


To answer this question, we built a very simple A/B test in which members in each test cell get a different image for that title.  We measured engagement with the title for each variant - click-through rate, aggregate play duration, fraction of plays with short duration, fraction of content viewed (how far you got through a movie or series), etc.  Sure enough, we saw that we could widen the audience and increase engagement by using different artwork.
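
To make “better take rate” concrete, here is a minimal sketch (not Netflix’s actual methodology) of how one could compare an artwork variant against the control cell with a two-proportion z-test; the play and impression counts below are made up.

```python
from math import sqrt
from scipy.stats import norm

def take_rate_z_test(plays_a, impressions_a, plays_b, impressions_b):
    """Two-proportion z-test comparing the take rates of two artwork cells."""
    p_a = plays_a / impressions_a
    p_b = plays_b / impressions_b
    # Pooled proportion under the null hypothesis of equal take rates.
    p_pool = (plays_a + plays_b) / (impressions_a + impressions_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / impressions_a + 1 / impressions_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided
    return p_b / p_a - 1, z, p_value

# Hypothetical counts: control artwork vs. one variant.
lift, z, p = take_rate_z_test(plays_a=5_000, impressions_a=100_000,
                              plays_b=5_700, impressions_b=100_000)
print(f"relative lift={lift:.1%}  z={z:.2f}  p={p:.4f}")
```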


A skeptic might say that we may have simply moved hours to this title from other titles on the service.  However, it was an early signal that members are sensitive to artwork changes.  It was also a signal that there were better ways we could help our members find the types of stories they were looking for within the Netflix experience. Knowing this, we embarked on an incrementally larger test to see if we could produce a similar positive effect across a larger set of titles.

Experiment 2 (multi-cell explore-exploit test)

The next experiment ran with a significantly larger set of titles across the popularity spectrum - both blockbusters and niche titles.  The hypothesis for this test was that we could improve aggregate streaming hours for a large member allocation by selecting the best artwork for each of these titles.


This test was constructed as a two-part explore-exploit test.  The “explore” test measured engagement with each candidate artwork for a set of titles.  The “exploit” test then served the most engaging artwork (as determined by the explore test) to future users to see if we could improve aggregate streaming hours.


Explore test cells:
  • Control cell: serve default artwork for all titles
  • Explore cell 1: serve artwork variant 1 for all titles
  • Explore cell 2: serve artwork variant 2 for all titles
  • Explore cell 3: serve artwork variant 3 for all titles
We measured the best artwork variant for each title over 35 days and fed it into the exploit test.


Exploit test cells:
  • Control cell: serve default artwork for the title
  • Exploit cell 1: serve the winning artwork for the title based on metric 1
  • Exploit cell 2: serve the winning artwork for the title based on metric 2
  • Exploit cell 3: serve the winning artwork for the title based on metric 3
We then compared “total streaming hours” and “hour share of the titles we tested” across cells.


Using the explore member population, we measured the take rate (click-through rate) of all artwork variants for each title.  We computed take rate by dividing the number of plays (excluding very short plays) by the number of impressions on the device.  We had several choices for the take rate metric across different grains:
  • Should we include members who watch a few minutes of a title, or just those who watched an entire episode, or those who watched the entire show?
  • Should we aggregate take rate at the country level, region level, or across global population?
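
As an illustration of these grains, here is a small sketch of how take rate could be computed from play and impression logs with pandas; the column names, the six-minute qualified-play threshold, and the sample rows are all hypothetical.

```python
import pandas as pd

# Illustrative event logs; the real pipelines aggregate billions of rows.
impressions = pd.DataFrame({
    "lineage_id": ["art_a", "art_a", "art_b", "art_b", "art_b"],
    "country":    ["US",    "BR",    "US",    "US",    "BR"],
})
plays = pd.DataFrame({
    "lineage_id":     ["art_a", "art_b", "art_b"],
    "country":        ["US",    "US",    "BR"],
    "minutes_viewed": [1.5,     42.0,    65.0],
})

QUALIFIED_MINUTES = 6  # discard very short plays; this threshold is a made-up example

def take_rate(impressions, plays, grain):
    """Take rate = qualified plays / impressions, aggregated at the given grain."""
    qualified = plays[plays["minutes_viewed"] >= QUALIFIED_MINUTES]
    num = qualified.groupby(grain).size().rename("plays")
    den = impressions.groupby(grain).size().rename("impressions")
    out = pd.concat([num, den], axis=1).fillna(0)
    out["take_rate"] = out["plays"] / out["impressions"]
    return out

print(take_rate(impressions, plays, grain=["lineage_id"]))             # global grain
print(take_rate(impressions, plays, grain=["lineage_id", "country"]))  # country grain
```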


Using offline modeling, we narrowed our choices to 3 different take rate metrics using a combination of the above factors.  Here is a pictorial summary of how the two tests were connected.




The results from this test were unambiguous - we significantly raised the view share of the titles for which we tested multiple artwork variants, and we were also able to raise aggregate streaming hours.  This proved that we weren't simply shifting hours.  Showing members more relevant artwork drove them to watch more of something they had not discovered earlier.  We also verified that we did not negatively affect secondary metrics like short-duration plays, fraction of content viewed, etc.  Additional longitudinal A/B tests over many months confirmed that simply changing artwork periodically is not as good as finding a better-performing artwork -- the gains don’t just come from changing the artwork.


There were engineering challenges as we pursued this test.  We had to invest in two major areas: collecting impression data consistently across devices at scale, and creating stable identifiers for each artwork over time.


1. Client-side impression tracking:  One of the key components of measuring take rate is knowing how often a title image came into the viewport on the device (impressions).  This meant that every major device platform needed to track every image that came into the viewport when a member stopped to consider it, even for a fraction of a second.  Every one of these micro-events is compacted and sent periodically as part of the member session data.  Every device should measure impressions consistently, even though scrolling on an iPad is very different from navigation on a TV.  We collect billions of such impressions daily with a low loss rate, despite the many places events can be lost: a low-storage device might evict events before successfully sending them, the network might drop them, etc.
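
As a rough illustration of the client-side piece (the real trackers are device-native; this Python sketch only mirrors the idea), an impression tracker might compact viewport micro-events and flush them periodically with the session data. The `send_session_events` transport call is hypothetical.

```python
import time
from collections import Counter

def send_session_events(payload):
    """Stub for the device's session-event transport (hypothetical)."""
    print(f"flushing {len(payload)} impression records")

class ImpressionTracker:
    """Illustrative client-side tracker: compacts viewport impressions and
    flushes them periodically as part of the session payload."""

    def __init__(self, flush_interval_s=60):
        self.counts = Counter()          # (row, title_id, lineage_id) -> impressions
        self.flush_interval_s = flush_interval_s
        self.last_flush = time.monotonic()

    def on_title_entered_viewport(self, row, title_id, lineage_id):
        # Count every time an image comes into view, even for a fraction of a second.
        self.counts[(row, title_id, lineage_id)] += 1
        if time.monotonic() - self.last_flush > self.flush_interval_s:
            self.flush()

    def flush(self):
        payload = [
            {"row": r, "title_id": t, "lineage_id": l, "impressions": n}
            for (r, t, l), n in self.counts.items()
        ]
        send_session_events(payload)
        self.counts.clear()
        self.last_flush = time.monotonic()
```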


2. Stable identifiers for each artwork:  An area that was surprisingly challenging was creating stable unique ids for each artwork.  Our Creative Services team steadily makes changes to the artwork - changing title treatment, touching up to improve quality, sourcing higher resolution artwork, etc.  
[Image: anatomy of the House of Cards artwork]
The above diagram shows the anatomy of the artwork - it contains the background image, a localized title treatment in most languages we support, an optional ‘new episode’ badge, and a Netflix logo for any of our original content.


These two images have different aspect ratios and localized title treatments but have the same lineage ID.


So, we created a system that automatically grouped artwork with different aspect ratios, crops, touch-ups, and localized title treatments but the same background image.  Images that share the same background image were associated with the same “lineage ID”.


Even as Creative Services changed the title treatment and the crop, we logged the data using the lineage ID of the artwork.  Our algorithms can therefore combine data from our global member base even as members’ preferred locales vary.  This particularly improved our data in smaller countries and for less common languages.
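
Here is a simplified sketch of the grouping idea, assuming each artwork record references its source background asset (the real system’s grouping logic is not public): derive a lineage ID from that shared background, then log impressions and plays against it.

```python
import hashlib
from collections import defaultdict

def lineage_id(background_asset_id: str) -> str:
    """Derive a stable lineage ID from the source background asset.
    (A stand-in for Netflix's real grouping logic.)"""
    return "lin_" + hashlib.sha1(background_asset_id.encode()).hexdigest()[:10]

# Hypothetical artwork records: many crops/locales share one background asset.
artworks = [
    {"artwork_id": "a1", "background": "hoc_key_art",     "locale": "en-US", "aspect": "16:9"},
    {"artwork_id": "a2", "background": "hoc_key_art",     "locale": "pt-BR", "aspect": "2:3"},
    {"artwork_id": "a3", "background": "hoc_alt_key_art", "locale": "en-US", "aspect": "16:9"},
]

by_lineage = defaultdict(list)
for art in artworks:
    by_lineage[lineage_id(art["background"])].append(art["artwork_id"])

# Impressions and plays are logged against the lineage ID, so data from every
# locale and crop of the same background image can be pooled.
print(dict(by_lineage))
```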

Experiment 3 (single cell title level explore test)

While the earlier experiment was successful, there are faster and more equitable ways to learn the performance of an artwork.  We want to impose on the fewest randomly selected members for the least amount of time needed to confidently determine the best artwork for every title on the service.


Experiment 2 pre-allocated each title into several equal-sized cells -- one per artwork variant. We potentially wasted impressions because every image, including known under-performers, continued to get impressions for many days.  Also, with a given allocation size, say 2 million members, we would accurately detect the performance of images for popular titles but not for niche titles, due to sample size.  If we allocated far more members, say 20 million, then we would accurately learn the performance of artwork for niche titles, but we would over-expose poor-performing artwork for the popular titles.
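
To see why a fixed allocation struggles with niche titles, here is some back-of-the-envelope arithmetic using the standard two-proportion sample-size approximation; the baseline take rate, target lift, and per-title impression counts are made-up examples, not Netflix figures.

```python
from math import ceil

def impressions_needed(base_rate, relative_lift, z_alpha=1.96, z_beta=0.84):
    """Approximate impressions per cell to detect a relative lift in take rate
    at ~5% significance and ~80% power (standard two-proportion formula)."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    return ceil(((z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)))
                / (p2 - p1) ** 2)

needed = impressions_needed(base_rate=0.05, relative_lift=0.05)  # detect a 5% relative lift
print(f"~{needed:,} impressions per cell needed")

# Made-up exposure numbers for a fixed 2M-member allocation:
popular_title_impressions = 800_000   # most members scroll past a popular title
niche_title_impressions = 30_000      # few members ever see a niche title
print("popular title: enough data?", popular_title_impressions >= needed)
print("niche title:   enough data?", niche_title_impressions >= needed)
```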


Experiment 2 also did not handle dynamic changes to the number of images that needed evaluation; for example, we could not evaluate 10 images for a popular title while evaluating just 2 for another.


We tried to address all of these limitations in the design for a new “title level explore test”.  In this new experiment, all members of the explore population are in a single cell.  We dynamically assign an artwork variant for every (member, title) pair just before the title gets shown to the member.  In essence, we are performing the A/B test for every title, with a cell for each artwork.  Since the allocation happens at the title level, we are now able to accommodate a different number of artwork variants per title.


This new test design allowed us to get results even faster than experiment 2, since the first N members, say 1 million, who see a title can be used to evaluate the performance of its image variants.  We stay in the explore phase for as long as it takes to determine a significant winner -- typically a few days.  After that, we exploit the win and all members enjoy the benefit of seeing the winning artwork.
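
One simple way to implement per-(member, title) assignment -- an assumption on our part, not a description of Netflix’s production code -- is to hash the pair into a bucket, so a member consistently sees the same variant while the title remains in the explore phase, and titles can carry different numbers of variants.

```python
import hashlib

def assign_variant(member_id: str, title_id: str, num_variants: int) -> int:
    """Deterministically assign a (member, title) pair to one of the artwork
    variants; the same member always sees the same variant for a title."""
    key = f"{member_id}:{title_id}".encode()
    bucket = int(hashlib.md5(key).hexdigest(), 16)
    return bucket % num_variants

# A popular title can explore 10 variants while a niche one explores only 2.
# Once a significant winner is found (e.g. via the z-test shown earlier),
# the title leaves the explore phase and everyone gets the winning artwork.
print(assign_variant("member_42", "narcos", num_variants=10))
print(assign_variant("member_42", "the_short_game", num_variants=2))
```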


Here are some screenshots from the tool that we use to track relative artwork performance.
Dragons: Race to the Edge: the two marked images below significantly outperformed all others.

Conclusion

Over the course of this series of tests, we have found many interesting trends among the winning images as detailed in this blog post.  Images that have expressive facial emotion that conveys the tone of the title do particularly well.  Our framework needs to account for the fact that winning images might be quite different in various parts of the world.  Artwork featuring recognizable or polarizing characters from the title tend to do well.  Selecting the best artwork has improved the Netflix product experience in material ways.  We were able to help our members find and enjoy titles faster.


We are far from done when it comes to improving artwork selection.  There are several dimensions along which we continue to experiment.  Can we move beyond artwork and optimize across all asset types (artwork, motion billboards, trailers, montages, etc.), choosing the best asset type for a title on a single canvas?


This project brought together the many strengths of Netflix including a deep partnership between best-in-class engineering teams, our Creative Services design team, and our studio partners.  If you are interested in joining us on such exciting pursuits, then please look at our open job descriptions around product innovation and machine learning.


(on behalf of the teams that collaborated)

Friday, April 29, 2016

It’s All A/Bout Testing: The Netflix Experimentation Platform


Ever wonder how Netflix serves a great streaming experience with high-quality video and minimal playback interruptions? Thank the team of engineers and data scientists who constantly A/B test their innovations to our adaptive streaming and content delivery network algorithms. What about more obvious changes, such as the complete redesign of our UI layout or our new personalized homepage? Yes, all thoroughly A/B tested.

In fact, every product change Netflix considers goes through a rigorous A/B testing process before becoming the default user experience. Major redesigns like the ones above greatly improve our service by allowing members to find the content they want to watch faster. However, they are too risky to roll out without extensive A/B testing, which enables us to prove that the new experience is preferred over the old.

And if you ever wonder whether we really set out to test everything possible, consider that even the images associated with many titles are A/B tested, sometimes resulting in 20% to 30% more viewing for that title!

Results like these highlight why we are so obsessed with A/B testing. By following an empirical approach, we ensure that product changes are not driven by the most opinionated and vocal Netflix employees, but instead by actual data, allowing our members themselves to guide us toward the experiences they love.

In this post we’re going to discuss the Experimentation Platform: the service which makes it possible for every Netflix engineering team to implement their A/B tests with the support of a specialized engineering team. We’ll start by setting some high level context around A/B testing before covering the architecture of our current platform and how other services interact with it to bring an A/B test to life.

Overview

The general concept behind A/B testing is to create an experiment with a control group and one or more experimental groups (called “cells” within Netflix) which receive alternative treatments. Each member belongs exclusively to one cell within a given experiment, with one of the cells always designated the “default cell”. This cell represents the control group, which receives the same experience as all Netflix members not in the test. As soon as the test is live, we track specific metrics of importance, typically (but not always) streaming hours and retention. Once we have enough participants to draw statistically meaningful conclusions, we can get a read on the efficacy of each test cell and hopefully find a winner.

From the participant’s point of view, each member is usually part of several A/B tests at any given time, provided that none of those tests conflict with one another (i.e. two tests which modify the same area of a Netflix App in different ways). To help test owners track down potentially conflicting tests, we provide them with a test schedule view in ABlaze, the front end to our platform. This tool lets them filter tests across different dimensions to find other tests which may impact an area similar to their own.

[Image: ABlaze test schedule view]
There is one more topic to address before we dive further into details, and that is how members get allocated to a given test. We support two primary forms of allocation: batch and real-time.

Batch allocations give analysts the ultimate flexibility, allowing them to populate tests using custom queries as simple or complex as required. These queries resolve to a fixed and known set of members which are then added to the test. The main cons of this approach are that it lacks the ability to allocate brand new customers and cannot allocate based on real-time user behavior. And while the number of members allocated is known, one cannot guarantee that all allocated members will experience the test (e.g. if we’re testing a new feature on an iPhone, we cannot be certain that each allocated member will access Netflix on an iPhone while the test is active).

Real-Time allocations provide analysts with the ability to configure rules which are evaluated as the user interacts with Netflix. Eligible users are allocated to the test in real-time if they meet the criteria specified in the rules and are not currently in a conflicting test. As a result, this approach tackles the weaknesses inherent with the batch approach. The primary downside to real-time allocation, however, is that the calling app incurs additional latencies waiting for allocation results. Fortunately we can often run this call in parallel while the app is waiting on other information. A secondary issue with real-time allocation is that it is difficult to estimate how long it will take for the desired number of members to get allocated to a test, information which analysts need in order to determine how soon they can evaluate the results of a test.
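
Here is a minimal sketch of what real-time rule evaluation could look like, with a made-up rule format and test names: the context gathered for the request is checked against each test’s eligibility rules, and against the member’s current allocations for conflicts.

```python
# Hypothetical rule representation: each test carries eligibility rules that
# are evaluated against the request context at allocation time.
TESTS = {
    "ps4_image_selection_v2": {
        "rules": {"country": {"AU"}, "device_type": {"ps4"}},
        "conflicts_with": {"ps4_image_selection_v1"},
    },
}

def eligible(test, context, current_allocations):
    """Member is eligible if every rule matches and no conflicting test is active."""
    in_conflict = bool(test["conflicts_with"] & set(current_allocations))
    matches = all(context.get(dim) in allowed for dim, allowed in test["rules"].items())
    return matches and not in_conflict

context = {"country": "AU", "device_type": "ps4", "app_version": "3.1.0"}
current = {"homepage_redesign": 2}   # tests the member is already allocated to
for name, test in TESTS.items():
    if eligible(test, context, current):
        print(f"allocate member to {name}")
```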

A Typical A/B Test Workflow

With that background, we’re ready to dive deeper. The typical workflow involved in calling the Experimentation Platform (referred to as A/B in the diagrams for shorthand) is best explained using the following workflow for an Image Selection test. Note that there are nuances to the diagram below which I do not address in depth, in particular the architecture of the Netflix API layer which acts as a gateway between external Netflix apps and internal services.

In this example, we’re running a hypothetical A/B test with the purpose of finding the image which results in the greatest number of members watching a specific title. Each cell represents a candidate image. In the diagram we’re also assuming a call flow from a Netflix App running on a PS4, although the same flow is valid for most of our Device Apps.

[Diagram: image selection A/B test call flow]
  1. The Netflix PS4 App calls the Netflix API. As part of this call, it delivers a JSON payload containing session level information related to the user and their device.
  2. The call is processed in a script written by the PS4 App team. This script runs in the Client Adaptor Layer of the Netflix API, where each Client App team adds scripts relevant to their app. Each of these scripts comes complete with its own distinct REST endpoints. This allows the Netflix API to own functionality common to most apps, while giving each app control over logic specific to it. The PS4 App Script now calls the A/B Client, a library our team maintains, which is packaged within the Netflix API. This library allows for communication with our backend servers as well as other internal Netflix services.
  3. The A/B Client calls a set of other services to gather additional context about the member and the device.
  4. The A/B Client then calls the A/B Server for evaluation, passing along all the context available to it.
  5. In the evaluation phase:
    1. The A/B Server retrieves all test/cell combinations to which this member is already allocated.
    2. For tests utilizing the batch allocation approach, the allocations are already known at this stage.
    3. For tests utilizing real-time allocation, the A/B Server evaluates the context to see if the member should be allocated to any additional tests. If so, they are allocated.
    4. Once all evaluations and allocations are complete, the A/B Server passes the complete set of tests and cells to the A/B Client, which in turn passes them to the PS4 App Script. Note that the PS4 App has no idea if the user has been in a given test for weeks or the last few microseconds. It doesn’t need to know or care about this.
  6. Given the test/cell combinations returned to it, the PS4 App Script now acts on any tests applicable to the current client request. In our example, it will use this information to select the appropriate piece of art associated with the title it needs to display, which is returned by the service which owns this title metadata. Note that the Experimentation Platform does not actually control this behavior: doing so is up to the service which actually implements each experience within a given test. (A small sketch of this selection step follows the list.)
  7. The PS4 App Script (through the Netflix API) tells the PS4 App which image to display, along with all the other operations the PS4 App must conduct in order to correctly render the UI.
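
For step 6, here is a tiny sketch of what the app script’s side of this might look like, with hypothetical test names, cells, and image URLs; the mapping from cell to artwork belongs to the title-metadata service, not the Experimentation Platform.

```python
# Hypothetical shape of the data returned through the A/B Client:
# a map of test name -> allocated cell.
allocations = {"title_image_selection_narcos": 2}

# The artwork candidates per cell live with the title-metadata service;
# these URLs are invented for illustration.
ARTWORK_BY_CELL = {
    0: "https://img.example/narcos_default.jpg",    # control
    1: "https://img.example/narcos_variant_1.jpg",
    2: "https://img.example/narcos_variant_2.jpg",
}

cell = allocations.get("title_image_selection_narcos", 0)   # default to control
image_url = ARTWORK_BY_CELL.get(cell, ARTWORK_BY_CELL[0])
print("render", image_url)
```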

Now that we understand the call flow, let’s take a closer look at that box labelled “A/B Server”.

The Experimentation Platform

[Diagram: Experimentation Platform architecture]
The allocation and retrieval requests described in the previous section pass through REST API endpoints to our server. Test metadata pertaining to each test, including allocation rules, is stored in a Cassandra data store. It is these allocation rules which are compared to context passed from the A/B Client in order to determine a member’s eligibility to participate in a test (e.g. is this user in Australia, on a PS4, and has never previously used this version of the PS4 app).

Member allocations are also persisted in Cassandra, fronted by a caching layer in the form of an EVCache cluster, which serves to reduce the number of direct calls to Cassandra. When an app makes a request for current allocations, the AB Client first checks EVCache for allocation records pertaining to this member. If this information was previously requested within the last 3 hours (the TTL for our cache), a copy of the allocations will be returned from EVCache. If not, the AB Server makes a direct call to Cassandra, passing the allocations back to the AB Client, while simultaneously populating them in EVCache.
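
In pseudocode terms, this read/write path follows a standard cache-aside pattern; the `cache` and `cassandra` clients below are hypothetical stand-ins for EVCache and the Cassandra data store, and only the 3-hour TTL comes from the description above.

```python
ALLOCATION_TTL_SECONDS = 3 * 60 * 60   # 3-hour cache TTL, per the description above

def get_allocations(member_id, cache, cassandra):
    """Read path: check the cache first, fall back to Cassandra on a miss.
    `cache` and `cassandra` are hypothetical client objects."""
    key = f"ab_allocations:{member_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached                                   # served from the cache
    allocations = cassandra.read_allocations(member_id) # direct data-store read
    cache.set(key, allocations, ttl=ALLOCATION_TTL_SECONDS)  # repopulate the cache
    return allocations

def allocate(member_id, test_id, cell, cache, cassandra):
    """Write path: persist the new allocation, then invalidate the cache so the
    next read is a miss that picks up the fresh record."""
    cassandra.write_allocation(member_id, test_id, cell)
    cache.delete(f"ab_allocations:{member_id}")
```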

When allocations to an A/B test occur, we need to decide the cell in which to place each member. This step must be handled carefully, since the populations in each cell should be as homogeneous as possible in order to draw statistically meaningful conclusions from the test. Homogeneity is measured with respect to a set of key dimensions, of which country and device type (i.e. smart TV, game console, etc.) are the most prominent. Consequently, our goal is to make sure each cell contains similar proportions of members from each country, using similar proportions of each device type, etc. Purely random sampling can bias test results by, for instance, allocating more Australian game console users in one cell versus another. To mitigate this issue we employ a sampling method called stratified sampling, which aims to maintain homogeneity across the aforementioned key dimensions. There is a fair amount of complexity to our implementation of stratified sampling, which we plan to share in a future blog post.
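
As a highly simplified stand-in for stratified sampling (the production implementation is more sophisticated, as noted above), one could rotate through the cells within each (country, device type) stratum so that every cell receives a similar mix:

```python
from collections import defaultdict
from itertools import cycle

def stratified_assign(members, num_cells, strata_keys=("country", "device_type")):
    """Assign members to cells by rotating through cells within each stratum,
    so each cell ends up with similar proportions of countries and device types."""
    assignment = {}
    cell_cycles = defaultdict(lambda: cycle(range(num_cells)))
    for m in members:
        stratum = tuple(m[k] for k in strata_keys)
        assignment[m["member_id"]] = next(cell_cycles[stratum])
    return assignment

members = [
    {"member_id": 1, "country": "AU", "device_type": "game_console"},
    {"member_id": 2, "country": "AU", "device_type": "game_console"},
    {"member_id": 3, "country": "US", "device_type": "smart_tv"},
    {"member_id": 4, "country": "US", "device_type": "smart_tv"},
]
print(stratified_assign(members, num_cells=2))
```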

In the final step of the allocation process, we persist allocation details in Cassandra and invalidate the A/B caches associated with this member. As a result, the next time we receive a request for allocations pertaining to this member, we will experience a cache miss and execute the cache related steps described above.

We also simultaneously publish allocation events to a Kafka data pipeline, which feeds into several data stores. The feed published to Hive tables provides a source of data for ad-hoc analysis, as well as Ignite, Netflix’s internal A/B Testing visualization and analysis tool. It is within Ignite that test owners analyze metrics of interest and evaluate the results of a test. Once again, you should expect an upcoming blog post focused on Ignite in the near future.
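
As an illustration only, publishing an allocation event to Kafka might look like the following with the open-source kafka-python client; the topic name and event schema are invented, and Netflix’s actual pipeline runs on its internal data infrastructure.

```python
import json
from datetime import datetime, timezone
from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Hypothetical allocation event published when a member joins a test cell.
event = {
    "member_id": 12345,
    "test_id": "title_image_selection_narcos",
    "cell": 2,
    "allocated_at": datetime.now(timezone.utc).isoformat(),
}
producer.send("ab-allocation-events", event)
producer.flush()
```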

The latest updates to our tech stack added Spark Streaming, which ingests and transforms data from Kafka streams before persisting them in ElasticSearch, allowing us to display near real-time updates in ABlaze. Our current use cases involve simple metrics, allowing users to view test allocations in real-time across dimensions of interest. However, these additions have laid the foundation for much more sophisticated real-time analysis in the near future.

Upcoming Work

The architecture we’ve described here has worked well for us thus far. We continue to support an ever-widening set of domains: UI, Recommendations, Playback, Search, Email, Registration, and many more. Through auto-scaling we easily handle our platform’s typical traffic, which ranges from 150K to 450K requests per second. From a responsiveness standpoint, latencies fetching existing allocations range from an average of 8ms when our cache is cold to < 1ms when the cache is warm. Real-time evaluations take a bit longer, with an average latency around 50ms.

However, as our member base continues to expand globally, the speed and variety of A/B testing is growing rapidly. For some background, the general architecture we just described has been around since 2010 (with some obvious exceptions such as Kafka). Since then:

  • Netflix has grown from streaming in 2 countries to 190+
  • We’ve gone from 10+ million members to 80+ million
  • We went from dozens of device types to thousands, many with their own Netflix app

International expansion is part of the reason we’re seeing an increase in device types. In particular, there is an increase in the number of mobile devices used to stream Netflix. In this arena, we rely on batch allocations, as our current real-time allocation approach simply doesn’t work: the bandwidth on mobile devices is not reliable enough for an app to wait on us before deciding which experience to serve… all while the user is impatiently staring at a loading screen.

Additionally, some new areas of innovation conduct A/B testing on much shorter time horizons than before. Tests focused on UI changes, recommendation algorithms, etc. often run for weeks before clear effects on user behavior can be measured. However the adaptive streaming tests mentioned at the beginning of this post are conducted in a matter of hours, with internal users requiring immediate turn around time on results.

As a result, there are several aspects of our architecture which we are planning to revamp significantly. For example, while the real-time allocation mechanism allows for granular control, evaluations need to be faster and must interact more effectively with mobile devices.

We also plan to leverage the data flowing through Spark Streaming to begin forecasting per-test allocation rates given allocation rules. The goal is to address the second major drawback of the real-time allocation approach, which is an inability to foresee how much time is required to get enough members allocated to the test. Giving analysts the ability to predict allocation rates will allow for more accurate planning and coordination of tests.

These are just a couple of our upcoming challenges. If you’re simply curious to learn more about how we tackle them, stay tuned for upcoming blog posts. However, if the idea of solving these challenges and helping us build the next generation of Netflix’s Experimentation platform excites you, we’re always looking for talented engineers to join our team!