Tuesday, August 25, 2015

From Chaos to Control - Testing the resiliency of Netflix’s Content Discovery Platform

By: Leena Janardanan, Bruce Wobbe, Vilas Veeraraghavan

Merchandising Application Platform (MAP) was conceived as a middle-tier service that would handle real time requests for content discovery. MAP does this by aggregating data from disparate data sources and implementing common business logic in one distinct layer. This centralized layer helps provide common experiences across device platforms and helps reduce duplicated, and sometimes inconsistent, business logic. It also allows recommendation systems, which are typically pre-compute systems, to be de-coupled from the real time path. MAP can be compared to a big funnel through which most of the content discovery data on a user’s screen passes and is processed.
As an example, MAP generates localized row names for the personalized recommendations on the home page. This happens in real time, based on the locale of the user at the time the request is made. Similarly, application of maturity filters, localizing and sorting categories are examples of logic that lives in MAP.

Localized categories and row names, up-to-date My List and Continue Watching

A classic example of duplicated but inconsistent business logic that MAP consolidated was the “next episode” logic: the rule that determines whether a particular episode has been completed and the next episode should be shown. On one platform, the rule required that the credits had started and/or that 95% of the episode had been watched. On another platform, it simply required that 90% of the episode had been watched. MAP consolidated this logic into one simple call that all devices now use.
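
As a rough illustration of what putting this rule behind one call might look like, here is a minimal sketch. The names `PlaybackPosition` and `should_show_next_episode`, and the default 0.95 threshold, are assumptions made for the example; the rule MAP actually settled on is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaybackPosition:
    """Hypothetical view of how far a user got into an episode."""
    watched_seconds: float
    runtime_seconds: float
    credits_started: bool = False


def should_show_next_episode(pos: PlaybackPosition,
                             completion_threshold: float = 0.95) -> bool:
    """Return True if the episode counts as finished.

    The threshold is an assumed value for illustration only.
    """
    if pos.runtime_seconds <= 0:
        return False  # no usable runtime, so never advance
    fraction_watched = pos.watched_seconds / pos.runtime_seconds
    # One rule for every device: credits reached, or enough of the episode watched.
    return pos.credits_started or fraction_watched >= completion_threshold
```

With a single function like this, every device gets the same answer for the same playback position instead of each platform applying its own threshold.
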
MAP also enables discovery data to be a mix of pre-computed and real time data. On the homepage, rows like My List, Continue Watching and Trending Now are examples of real time data whereas rows like “Because you watched” are pre-computed. As an example, if a user added a title to My List on a mobile device and decided to watch the title on a Smart TV, the user would expect My List on the TV to be up-to-date immediately. What this requires is the ability to selectively update some data in real time. MAP provides the APIs and logic to detect if data has changed and update it as needed. This allows us to keep the efficiencies gained from pre-compute systems for most of the data, while also having the flexibility to keep other data fresh.
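
One way to picture selective updates is a version check per row: rows whose source has changed are re-fetched, everything else is served as cached. The sketch below assumes a hypothetical row shape (`name`, `version`, `titles`) and a hypothetical `fetch_row` callback; it is not MAP's actual API.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]  # assumed shape: {"name": str, "version": int, "titles": list}


def refresh_stale_rows(cached_rows: List[Row],
                       current_versions: Dict[str, int],
                       fetch_row: Callable[[str], Row]) -> List[Row]:
    """Re-fetch only the rows whose cached version is behind the current one."""
    refreshed = []
    for row in cached_rows:
        name = row["name"]
        latest = current_versions.get(name)
        if latest is not None and latest > row["version"]:
            refreshed.append(fetch_row(name))  # real time row that changed
        else:
            refreshed.append(row)              # pre-computed or unchanged row
    return refreshed


# Usage: only "My List" changed, so only that row is fetched again.
cached = [
    {"name": "My List", "version": 3, "titles": ["A"]},
    {"name": "Because you watched", "version": 1, "titles": ["X", "Y"]},
]
page = refresh_stale_rows(
    cached,
    {"My List": 4},
    lambda name: {"name": name, "version": 4, "titles": ["A", "B"]},
)
```
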
MAP also supports business logic required for various A/B tests, many of which are active on Netflix at any given time. Examples include inserting non-personalized rows, changing the sort order for titles within a row, and changing the contents of a row.
The services that generate this data are a mix of pre-compute and real time systems. Depending on the data, the calling patterns from devices for each type of data also vary. Some data is fetched once per session, some of it is pre-fetched when the user navigates the page/screen and other data is refreshed constantly (My List, Recently Watched, Trending Now).


MAP consists of two parts: a server and a client. The server is the workhorse that does all the data aggregation and applies business logic. This data is then stored in caches (see EVCache) that the client reads. The client primarily serves the data and is the home for resiliency logic. The client decides when a call to the server is taking too long, when to open a circuit (see Hystrix) and, if needed, what type of fallback should be served.
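
The three client-side decisions (timeout, circuit, fallback) can be sketched in a few lines. This is a simplified illustration in Python, not Hystrix or its API; the class, function names, and default values are all assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class CircuitBreaker:
    """Opens after consecutive failures, closes again after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after_seconds=5.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def is_open(self):
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self.reset_after_seconds:
            # Cool-down elapsed: close the circuit and let calls through again.
            self._opened_at = None
            self._failures = 0
            return False
        return True

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


def call_with_resilience(breaker, executor, server_call, fallback,
                         timeout_seconds=0.2):
    """Call the server with a timeout; serve the fallback on any failure."""
    if breaker.is_open():
        return fallback()  # shed load: do not touch the server at all
    future = executor.submit(server_call)
    try:
        # Raises on timeout or if server_call itself raised.
        result = future.result(timeout=timeout_seconds)
    except Exception:
        # Note: a timed-out call keeps running on its worker thread;
        # this sketch does not cancel it.
        breaker.record_failure()
        return fallback()
    breaker.record_success()
    return result
```

While the circuit is open, the server is not called at all, which is what protects a struggling server from further load.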

MAP is in the critical path of content discovery. Without a well-thought-out resiliency story, any failures in MAP would severely impact the user experience and Netflix's availability. As a result, we spend a lot of time thinking about how to make MAP resilient.

Challenges in making MAP resilient
Two approaches commonly used by MAP to improve resiliency are:
(1) Implementing fallback responses for failure scenarios
(2) Load shedding - either by opening circuits to the downstream services or by limiting retries wherever possible.

There are a number of factors that make it challenging to make MAP resilient: 
(1) MAP has numerous dependencies, which translates to multiple points of failure. In addition, the behavior of these dependencies evolves over time, especially as A/B tests are launched, and a solution that works today may not do so in six months. At some level, this is a game of Whack-A-Mole as we try to keep up with a constantly changing ecosystem.

(2) There is no one type of fallback that works for all scenarios:
    • In some cases, an empty response is the only option and devices have to be able to handle that gracefully. E.g. data for the "My List" row couldn't be retrieved.
    • Various degraded modes of performance can be supported. E.g. if the latest personalized home page cannot be delivered, fallbacks can range from stale, personalized recommendations to non-personalized recommendations.
    • In other cases, an exception/error code might be the right response, indicating to clients there is a problem and giving them the ability to adapt the user experience - skip a step in a workflow, request different data, etc.
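
These options can be read as an ordered chain: try the best source first, degrade step by step, and end with either an empty response or an error depending on what the caller can handle. The sketch below is one hypothetical way to express that; `first_available`, `DiscoveryUnavailable`, and the source labels are assumptions for illustration.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Source = Tuple[str, Callable[[], Optional[List[str]]]]


class DiscoveryUnavailable(Exception):
    """Raised when no fallback applies and the caller should adapt."""


def first_available(sources: Sequence[Source],
                    allow_empty: bool) -> Tuple[str, List[str]]:
    """Return (label, rows) from the first source that yields data."""
    for label, source in sources:
        try:
            rows = source()
        except Exception:
            continue  # this source failed, degrade to the next one
        if rows:
            return label, rows
    if allow_empty:
        return "empty", []  # the device must render this gracefully
    raise DiscoveryUnavailable("no source could provide data")


# Usage: fresh data times out, stale data is missing, so the
# non-personalized rows are served.
def fresh_personalized():
    raise TimeoutError("home page service too slow")


chain = [
    ("fresh_personalized", fresh_personalized),
    ("stale_personalized", lambda: None),
    ("non_personalized", lambda: ["Popular on Netflix"]),
]
label, rows = first_available(chain, allow_empty=False)
```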

How do we go from Chaos to Control?

Early on, failures in MAP or its dependent services caused dips in SPS (stream starts per second) like this:

It was clear that we needed to make MAP more resilient. The first question to answer was - what does resiliency mean for MAP? It came down to these expectations:
(1) Ensure an acceptable user experience during a MAP failure, e.g. that the user can browse our selection and continue to play videos
(2) Services that depend on MAP, i.e. the API service and device platforms, are not impacted by a MAP failure and continue to provide uninterrupted service
(3) Services that MAP depends on are not overwhelmed by excessive load from MAP

It is easy enough to identify obvious points of failure. For example - if a service provides data X, we could ensure that MAP has a fallback for data X being unavailable. What is harder is knowing the impact of failures in multiple services - different combinations of them - and the impact of higher latencies.
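
To see why combinations are the hard part, it helps to enumerate them. The sketch below lists every scenario in which up to a given number of dependencies fail at once; the dependency names are made up for the example.

```python
from itertools import combinations
from typing import Iterator, List, Tuple


def failure_scenarios(dependencies: List[str],
                      max_simultaneous: int) -> Iterator[Tuple[str, ...]]:
    """Yield every set of dependencies that could fail together."""
    for size in range(1, max_simultaneous + 1):
        yield from combinations(dependencies, size)


# Hypothetical dependency names, for illustration only.
deps = ["my_list", "continue_watching", "trending_now", "cache"]
scenarios = list(failure_scenarios(deps, max_simultaneous=2))
```

Even with four dependencies and at most two failing together there are already ten scenarios, and the count grows quickly as dependencies are added, before latency variations are even considered.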

This is where Latency Monkey and FIT come in. Running Latency Monkey in our production environment allows us to detect problems caused by latent services. With Latency Monkey testing, we have been able to fix incorrect behaviors and fine-tune various parameters on the backend services like:
(1) Timeouts for various calls
(2) Thresholds for opening circuits via Hystrix
(3) Fallbacks for certain use cases
(4) Thread pool settings

FIT, on the other hand, allows us to simulate specific failures. We restrict the scope of failures to a few test accounts. This allows us to validate fallbacks as well as the user experience. Using FIT, we are able to sever connections with:
(1) Cache that handles MAP reads and writes 
(2) Dependencies that MAP interfaces with
(3) MAP service itself
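
The idea of scoping a simulated failure to a few test accounts can be sketched as a check at each injection point. This is a hypothetical illustration, not FIT's actual interface; `FailureScenario`, `InjectedFailure`, and the point names are assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Set, TypeVar

T = TypeVar("T")


class InjectedFailure(Exception):
    """Simulated failure raised at an injection point."""


@dataclass
class FailureScenario:
    failing_points: Set[str]
    test_accounts: Set[str] = field(default_factory=set)

    def applies_to(self, account_id: str, point: str) -> bool:
        # Only listed test accounts ever see the simulated failure.
        return account_id in self.test_accounts and point in self.failing_points


def call_through_injection_point(scenario: FailureScenario,
                                 account_id: str,
                                 point: str,
                                 real_call: Callable[[], T]) -> T:
    """Fail the call for in-scope accounts; otherwise run it normally."""
    if scenario.applies_to(account_id, point):
        raise InjectedFailure(f"simulated failure at {point}")
    return real_call()


# Usage: the cache is "severed" for one test account only.
scenario = FailureScenario(failing_points={"cache"}, test_accounts={"test-1"})
```

Requests from any other account pass straight through to the real call, so the blast radius of the experiment stays limited to the chosen test accounts.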

What does control look like?

In a successful run of FIT or Chaos Monkey, this is what the metrics look like now:
Total requests served by MAP before and during the test (no impact)

MAP successful fallbacks during the test (high fallback rate)

On a lighter note, our failure simulations uncovered some interesting user experience issues, which have since been fixed.

  1. Simulating failures in all the dependent services of MAP server caused an odd data mismatch:
The Avengers shows graphic for Peaky Blinders

  2. Severing connections to MAP server and the cache caused these duplicate titles to be served:

  3. When the cache was made unavailable mid-session, some rows looked like this:

  4. Simulating a failure in the “My List” service caused the PS4 UI to be stuck on adding a title to My List:

In an ever-evolving ecosystem of many dependent services, the future of resiliency testing lies in automation. We have taken small but significant steps this year towards automating some of these FIT tests. The goal is to build these tests out so they run during every release and catch any regressions.

Looking ahead for MAP, there are many more problems to solve. How can we make MAP more performant? Will our caching strategy scale to the next X million customers? How do we enable faster innovation without impacting reliability? Stay tuned for updates!