The Netflix experience is driven by a number of Machine Learning algorithms: personalized ranking, page generation, search, similarity, ratings, etc. On the 6th of January, we simultaneously launched Netflix in 130 new countries around the world, which brings the total to over 190 countries. Preparing for such a rapid expansion while ensuring each algorithm was ready to work seamlessly created new challenges for our recommendation and search teams. In this post, we highlight the four most interesting challenges we’ve encountered in making our algorithms operate globally and, most importantly, how this improved our ability to connect members worldwide with stories they'll love.
Challenge 1: Uneven Video Availability
Before we can add a video to our streaming catalog on Netflix, we need to obtain a license for it from the content owner. Most content licenses are region-specific or country-specific and are often held to terms for years at a time. Ultimately, our goal is to let members around the world enjoy all our content through global licensing, but currently our catalog varies between countries. For example, the dystopian Sci-Fi movie “Equilibrium” might be available on Netflix in the US but not in France. And “The Matrix” might be available in France but not in the US. Our recommendation models rely heavily on learning patterns from play data, particularly involving co-occurrence or sequences of plays between videos. In particular, many algorithms assume that when something was not played it is a (weak) signal that someone may not like a video, because they chose not to play it. However, in this particular scenario we will never observe any members who played both “Equilibrium” and “The Matrix”. A basic recommendation model would then learn that these two movies do not appeal to the same kinds of people just because the audiences were constrained to be different. However, if these two movies were available to the same set of members, we would likely observe a similarity between the videos and between the members who watch them. From this example, it is clear that uneven video availability potentially interferes with the quality of our recommendations.
Our search experience faces a similar challenge. Given a (partial) query from a member, we want to present the most relevant videos in the catalog. However, not accounting for availability differences reduces the quality of this ranking. For example, the top results for a given query from a ranking algorithm unaware of availability differences could include a niche video followed by a well-known one in a case where the latter is only available to a relatively small number of our global members and the former is available much more broadly.
Another aspect of content licenses is that they have start and end dates, which means that a similar problem arises not only across countries, but also within a given country across time. If we compare a well-known video that has only been available on Netflix for a single day to a niche video that has been available for six months, we might conclude that the latter is a lot more engaging. However, if the recently added, well-known video had instead been on the site for six months, it probably would have more total engagement.
One can imagine the impact these issues can have on more sophisticated search or recommendation models when they already introduce a bias in something as simple as popularity. Addressing the issue of uneven availability across both geography and time lets our algorithms provide better recommendations for a video already on our service when it becomes available in a new country.
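As a rough illustration of the popularity bias described above, the toy sketch below compares raw play counts with plays normalized by exposure. The `VideoStats` type, the `member_days_available` field, and all numbers are hypothetical and chosen only to show the effect; this is not a description of any production metric.

```python
from dataclasses import dataclass


@dataclass
class VideoStats:
    title: str
    plays: int                  # total plays observed so far
    member_days_available: int  # assumed exposure: members x days in their catalog


def exposure_adjusted_popularity(stats: VideoStats) -> float:
    """Plays per member-day of availability (0.0 if the video was never available)."""
    if stats.member_days_available == 0:
        return 0.0
    return stats.plays / stats.member_days_available


# Hypothetical numbers: a niche title available for months vs. a
# well-known title that has only just been added.
niche = VideoStats("niche title", plays=9_000, member_days_available=180_000)
well_known = VideoStats("well-known title", plays=500, member_days_available=1_000)

videos = [niche, well_known]
raw_ranking = sorted(videos, key=lambda s: s.plays, reverse=True)
adjusted_ranking = sorted(videos, key=exposure_adjusted_popularity, reverse=True)

print([v.title for v in raw_ranking])
print([v.title for v in adjusted_ranking])
```

With raw counts the niche title ranks first simply because it has been available longer; once plays are divided by exposure, the order flips.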
So how can we avoid learning catalog differences and focus on our real goal of learning great recommendations for our members? We incorporate into each algorithm the information that members have access to different catalogs based on geography and time, for example by building upon concepts from the statistical community on handling missing data.
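One simple way to picture this idea is to treat "not available" differently from "available but not played" when computing a co-occurrence similarity. The sketch below is only an illustration under that assumption: the function name, the data layout, and the member ids are hypothetical, and Jaccard similarity stands in for whatever model is actually used.

```python
from typing import Dict, Optional, Set


def availability_aware_similarity(
    video_a: str,
    video_b: str,
    plays: Dict[str, Set[str]],
    available: Dict[str, Set[str]],
) -> Optional[float]:
    """Jaccard similarity of play sets, restricted to members who could play both.

    Returns None when there is no evidence at all, rather than a
    misleading 0.0.
    """
    eligible = [m for m, catalog in available.items()
                if video_a in catalog and video_b in catalog]
    played_a = {m for m in eligible if video_a in plays.get(m, set())}
    played_b = {m for m in eligible if video_b in plays.get(m, set())}
    union = played_a | played_b
    if not union:
        return None  # missing data, not a negative signal
    return len(played_a & played_b) / len(union)


# Toy data: two members see only one title each, two see both.
available = {
    "us1": {"equilibrium"},
    "fr1": {"the_matrix"},
    "gl1": {"equilibrium", "the_matrix"},
    "gl2": {"equilibrium", "the_matrix"},
}
plays = {
    "us1": {"equilibrium"},
    "fr1": {"the_matrix"},
    "gl1": {"equilibrium", "the_matrix"},
    "gl2": {"equilibrium", "the_matrix"},
}

similarity = availability_aware_similarity(
    "equilibrium", "the_matrix", plays, available)
print(similarity)
```

A naive Jaccard over all four members would count `us1` and `fr1` as evidence that the two audiences differ, even though neither member ever had the choice.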
Challenge 2: Cultural Awareness
Another key challenge in making our algorithms work well around the world is to ensure that we can capture local variations in taste. We know that even with the same catalog worldwide we would not expect a video to have the exact same popularity across countries. For example, we expect that Bollywood movies would have a different popularity in India than in Argentina. However, should two members with similar profiles get similar recommendations if one lives in India and the other in Argentina? Perhaps if they are both watching a lot of Sci-Fi, their recommendations should be similar. Meanwhile, overall we would expect Argentine members to be recommended more Argentine cinema and Indian members more Bollywood.
An obvious approach to capture local preferences would be to build models for individual countries. However, some countries are small and we will have very little member data available there. Training a recommendation algorithm on such sparse data leads to noisy results, as the model will struggle to identify clear personalization patterns from the data. So we need a better way.
Prior to our global expansion, our approach was to group countries into regions of a reasonable size that had a relatively consistent catalog and language. We would then build individual models for each region. This could capture the taste differences between regions because we trained separate models whose hyperparameters were tuned differently. Within a region, as long as there were enough members with a certain taste preference and a reasonable amount of history, a recommendation model should be able to identify and use that pattern of taste. However, there were several problems with this approach. The first is that within a region the amount of data from a large country would dominate the model and dampen its ability to learn the local tastes of a country with a smaller number of members. It also presented the challenge of how to maintain the groupings as catalogs changed over time and memberships grew. Finally, because we’re continuously running A/B tests with model variants across many algorithms, the combinatorics involving a growing number of regions became overwhelming.
To address these challenges we sought to combine the regional models into a single global model that also improves the recommendations we make, especially in countries where we may not yet have many members. Of course, even though we are combining the data, we still need to reflect local differences in taste. This leads to the question: is local taste or personal taste more dominant? Based on the data we’ve seen so far, both aspects are important, but it is clear that taste patterns do travel globally. Intuitively, this makes sense: if a member likes Sci-Fi movies, someone on the other side of the world who also likes Sci-Fi would be a better source for recommendations than their next-door neighbor who likes food documentaries. Being able to discover worldwide communities of interest means that we can further improve our recommendations, especially for niche interests, as they will be based on more data. Then with a global algorithm we can identify new or different taste patterns that emerge over time.
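The tension between pooling data globally and preserving local taste can be illustrated with a standard shrinkage estimate. This is a swapped-in textbook technique used purely for illustration, not the model described in this post; the function name, `prior_strength`, and all numbers are assumptions.

```python
def shrunk_play_rate(
    local_plays: int,
    local_members: int,
    global_rate: float,
    prior_strength: float = 1000.0,
) -> float:
    """Blend a country's observed play rate with the global rate.

    prior_strength acts like a number of "pseudo-members" who behave
    according to the global rate. Countries with few members stay close
    to the global rate; countries with many members are driven by their
    own data.
    """
    return (local_plays + prior_strength * global_rate) / (
        local_members + prior_strength)


GLOBAL_RATE = 0.02  # hypothetical share of members who play a given title

# Large country: plenty of data, so the local signal (5%) dominates.
large = shrunk_play_rate(50_000, 1_000_000, GLOBAL_RATE)

# Small country: 3 plays from 20 members is a noisy 15%, so the
# estimate is pulled most of the way back toward the global rate.
small = shrunk_play_rate(3, 20, GLOBAL_RATE)

print(round(large, 4), round(small, 4))
```

The same intuition carries over to richer models: shared structure is learned from everyone, while local deviations are trusted only to the extent that the data supports them.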
To refine our models we can use many signals about the content and about our members. In this global context, two important taste signals could be language and location. We want to make our models aware of not just where someone is logged in from but also aspects of a video such as where it is from, what language it is in, and where it is popular. Going back to our example, this information would let us offer different recommendations to a brand new member in India as compared to Argentina, as the distribution of tastes within the two countries is different. We expand on the importance of language in the next section.
Challenge 3: Language
Netflix has now grown to support 21 languages and our catalog includes more local content than ever. This increase creates a number of challenges, especially for the instant search algorithm mentioned above. The key objective of this algorithm is to help every member find something to play whenever they search while minimizing the number of interactions. This is different from the standard ranking metrics used to evaluate information retrieval systems, which do not take the amount of interaction into account. When looking at interactions, it is clear that different languages involve very different interaction patterns. For example, Korean is usually typed using the Hangul alphabet, where syllables are composed from individual characters. To search for “올드보이” (Oldboy), in the worst possible case, a member would have to enter nine characters: “ㅇ ㅗ ㄹ ㄷ ㅡ ㅂ ㅗ ㅇ ㅣ”. Using a basic indexing for the video title, in the best case a member would still need to type three characters: “ㅇ ㅗ ㄹ”, which would be collapsed into the first syllable of that title: “올”. With a Hangul-specific indexing, a member would need to type as few as one character: “ㅇ”. Optimizing for the best results with the minimum set of interactions and automatically adapting to newly introduced languages with significantly different writing systems is an area we’re working on improving.
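To make the Hangul example concrete, the sketch below decomposes precomposed syllables into their individual characters (jamo) using the standard Unicode arithmetic for the Hangul Syllables block, and then does prefix matching at the jamo level. The function names are hypothetical, and this is only one possible way to build such an index; compound final consonants such as ㄳ are kept as a single character here, which a fuller implementation would likely split further.

```python
# Compatibility jamo, in the order used by the Unicode Hangul Syllables block.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"      # 21 vowels
JONGSEONG = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ",
             "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ",
             "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]  # 27 finals + "none"

SYLLABLE_BASE = 0xAC00   # code point of the first syllable, 가
SYLLABLE_COUNT = 11172   # 19 * 21 * 28


def to_jamo(text: str) -> str:
    """Expand each precomposed Hangul syllable into its jamo; keep other characters."""
    out = []
    for ch in text:
        offset = ord(ch) - SYLLABLE_BASE
        if 0 <= offset < SYLLABLE_COUNT:
            lead, rest = divmod(offset, 21 * 28)
            vowel, tail = divmod(rest, 28)
            out.append(CHOSEONG[lead] + JUNGSEONG[vowel] + JONGSEONG[tail])
        else:
            out.append(ch)
    return "".join(out)


def jamo_prefix_match(query: str, title: str) -> bool:
    """True if the query is a prefix of the title at the jamo level."""
    return to_jamo(title).startswith(to_jamo(query))


title = "올드보이"
print(to_jamo(title))                 # ㅇㅗㄹㄷㅡㅂㅗㅇㅣ
print(title.startswith("오"))         # False: syllable-level prefix fails
print(jamo_prefix_match("오", title))  # True
print(jamo_prefix_match("ㅇ", title))  # True: a single keystroke matches
```

Matching at the jamo level also tolerates the intermediate states an input method produces while a syllable is still being composed, for example "올드봉" on the way to "올드보이".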
Another language-related challenge relates to recommendations. As mentioned above, while taste patterns travel globally, ultimately people are most likely to enjoy content presented in a language they understand. For example, we may have a great French Sci-Fi movie on the service, but if there are no subtitles or audio available in English we wouldn’t want to recommend it to a member who likes Sci-Fi movies but only speaks English. Alternatively, if the member speaks both English and French, then there is a good chance it would be an appropriate recommendation. People also often have preferences for watching content that was originally produced in their native language, or one they are fluent in. While we constantly try to add new language subtitles and dubs to our content, we do not yet have all languages available for all content. Furthermore, different people and cultures also have different preferences for watching with subtitles or dubs. Putting this together, it seems clear that recommendations could be better with an awareness of language preferences. However, currently which languages a member understands and to what degree is not defined explicitly, so we need to infer it from ancillary data and viewing patterns.
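A deliberately naive sketch of this inference is shown below: it guesses the languages a member understands from their share of viewing hours per language, then checks whether a video offers audio or subtitles in any of them. The function names, the `min_share` threshold, and the language codes are all hypothetical, and the sketch ignores real complications such as members who watch foreign-language audio with subtitles.

```python
from typing import Dict, Set


def infer_languages(hours_by_language: Dict[str, float],
                    min_share: float = 0.1) -> Set[str]:
    """Languages that account for at least min_share of a member's viewing hours."""
    total = sum(hours_by_language.values())
    if total <= 0:
        return set()
    return {lang for lang, hours in hours_by_language.items()
            if hours / total >= min_share}


def is_watchable(video_languages: Set[str], member_languages: Set[str]) -> bool:
    """True if the video has audio or subtitles in a language the member understands."""
    return bool(video_languages & member_languages)


# Hypothetical French Sci-Fi movie with French audio and no subtitles.
french_scifi = {"fr"}

bilingual_member = infer_languages({"en": 90.0, "fr": 15.0, "ja": 1.0})
english_only_member = infer_languages({"en": 50.0})

print(sorted(bilingual_member))
print(is_watchable(french_scifi, bilingual_member))
print(is_watchable(french_scifi, english_only_member))
```

In practice such a signal would more plausibly be a soft feature in the ranking model than a hard filter, since the inferred languages are uncertain.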
Challenge 4: Tracking Quality
The objective is to build recommendation algorithms that work equally well for all of our members, no matter where they live or what language they speak. But with so many members in so many countries speaking so many languages, a challenge we now face is how to even figure out when an algorithm is sub-optimal for some subset of our members.
To handle this, we could use some of the approaches for the challenges above. For example, we could look at the performance of our algorithms by manually slicing along a set of dimensions (country, language, catalog, …). However, some of these slices lead to very sparse and noisy data. At the other end of the scale we could look at metrics observed globally, but this would dramatically limit our ability to detect issues until they impact a large number of our members. One approach to this problem is to learn how to best group observations for the purpose of automatically detecting outliers and anomalies. Just as we work on improving our recommendation algorithms, we are innovating on our metrics, instrumentation and monitoring to improve their fidelity and through them our ability to detect new problems and highlight areas to improve our service.
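The trade-off between sparse slices and coarse global metrics can be sketched with a simple per-slice check. The example below skips slices that are too small to measure reliably and flags the rest when their play rate deviates from the global rate by several standard errors (a binomial approximation). The function name, the thresholds, the slice keys, and the numbers are all hypothetical, and fixed manual slices are used here instead of learned groupings.

```python
import math
from typing import Dict, Tuple


def flag_anomalous_slices(
    slices: Dict[str, Tuple[int, int]],  # slice name -> (impressions, plays)
    min_impressions: int = 1000,
    z_threshold: float = 3.0,
) -> Dict[str, float]:
    """Return z-scores for slices whose play rate deviates from the global rate."""
    total_impressions = sum(imp for imp, _ in slices.values())
    total_plays = sum(plays for _, plays in slices.values())
    if total_impressions == 0:
        return {}
    global_rate = total_plays / total_impressions
    if global_rate in (0.0, 1.0):
        return {}  # no variance to compare against

    flagged = {}
    for name, (impressions, plays) in slices.items():
        if impressions < min_impressions:
            continue  # too sparse: any estimate would be mostly noise
        std_err = math.sqrt(global_rate * (1 - global_rate) / impressions)
        z = (plays / impressions - global_rate) / std_err
        if abs(z) >= z_threshold:
            flagged[name] = z
    return flagged


# Hypothetical (country/language) slices.
observed = {
    "BR/pt": (500_000, 50_000),
    "JP/ja": (400_000, 40_000),
    "KR/ko": (20_000, 1_200),   # noticeably lower play rate
    "IS/is": (200, 5),          # too few impressions to judge
}

flagged = flag_anomalous_slices(observed)
print(flagged)
```

The small slice is silently skipped, which is exactly the blind spot that motivates grouping observations more intelligently than by fixed dimensions.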
To support a launch of this magnitude, we examined each and every algorithm that is part of our service and began to address these challenges. Along the way, we found approaches that will make Netflix better not just for those signing up in the 130 new countries, but for all Netflix members worldwide. For example, solving the first and the second challenges lets us discover worldwide communities of interest so that we can make better recommendations. Solving the third challenge means that regardless of where our members are based, they can use Netflix in the language that suits them best and quickly find the content they’re looking for. Solving the fourth challenge means that we’re able to detect issues at a finer grain, so that our recommendation and search algorithms help all our members find content they love. Of course, our global journey is just beginning and we look forward to making our service dramatically better over time. If you are an algorithmic explorer who finds this type of adventure exciting, take a look at our current job openings.