Every day, millions of members across the globe, from thousands of devices, visit Netflix and generate millions of viewing hours. The majority of these viewing hours are generated through the videos that are recommended by our recommender systems. We continue to invest in improving our recommender systems that aid our members to discover and watch the specific content they love. We are constantly trying to improve the quality of the recommendations using the sound foundation of AB testing.
On that front, we recently AB tested introducing a new row of videos on the home screen called “Trending Now”, which shows the videos that are trending in Netflix infused with some personalization for our members. This post explains how we built the backend infrastructure that powers the Trending Now row.
Traditionally, we pre-compute many of the recommendations for our members based on a combination of explicit signals (viewing history, ratings, My List, etc.) and other implicit signals (scroll activity, navigation, etc.) within Netflix, in near-line fashion. However, the Trending Now row is computed as events happen in real time. This allows us to not only personalize this row based on the context like time of day and day of week, but also react to sudden changes in collective interests of members, due to a real-world events such as Oscars or Halloween.
There are primarily two data streams that are used to determine the trending videos:
- Play events: Videos that are played by our member
- Impression events: Videos seen by our members in their view port
Netflix embraces Service Oriented Architecture (SOA) composed of many small fine grained services that do one thing and one thing well. In that vein, Viewing History Service captures all the videos that are played by our members. Beacon is another service that captures all impression events and user activities within Netflix. The requirement of computing recommendations in real time, presents us with an exciting challenge to make our data collection/processing pipeline a low latency, highly scalable and resilient system. We chose Kafka, a distributed messaging system, for our data pipeline as it has proven to handle millions of events per second. All the data collected by the Viewing History and Beacon services are sent to Kafka.
We built a custom stream processor that consumes the play and impressions events from Kafka and computes the following aggregated data:
- Play popularity: How many times is a video played
- Take rate: Fraction of play events over impression events for a given video
The first step in the data processing layer is to join the play and impression streams. We join them by request id, which is a unique identifier used to tie the front end calls to the backend service calls. With this join, all the plays and impressions events are grouped together for a given request id as illustrated in the figure below.
This joined stream is then partitioned by video id, for all the plays and impression events of a given video to be processed at the same consumer instance. This way, each consumer will be able to atomically calculate the total number of plays and impressions data for every video. The aggregated play popularity and take rate data are persisted into Cassandra, as shown in the figure below.
Real Time Data Monitoring
Given the importance of the data quality to the recommendation system and the user experience, we continuously do canary analysis for the event streams. This involves simple validations such as the presence of mandatory attributes within an event to more complex validations such as finding the absence of an event within a time window. With appropriate alerting in place, within minutes of every UI push, we are able to catch any data regressions with this real time stream monitoring.
It is imperative that the Kafka consumers are able to keep up with the incoming load into Kafka. Processing an event that was minutes old will neither provide a real trending effect nor help find data regression issues soon.
Bringing it all together
On a live user request, the aggregated play popularity and take rate data along with other explicit signals such as members’ viewing history and past ratings are used to compute a personalized Trending now row. The following figure shows the end to end infrastructure for building Trending Now row.
Netflix has a data-driven culture that is key to our success. With billions of member viewing events and tens of millions of categorical preferences, we have endless opportunities to improve our recommendations even further.
We are in the midst of replacing our custom stream processor with Spark Streaming. Stay tuned for an upcoming tech blog on our resiliency testing on Spark Streaming.
If you would like to join us in tackling these kinds of challenges, we are hiring!