Two approaches commonly used by MAP to improve resiliency are:
(1) Implementing fallback responses for failure scenarios
(2) Load shedding - either by opening circuits to the downstream services or by limiting retries wherever possible.
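To make the load-shedding idea concrete, here is a minimal sketch of a circuit breaker combined with a bounded retry. It is not Hystrix and not MAP's actual code: the names `CircuitBreaker`, `CircuitOpenError`, `call_with_limited_retries`, and the default thresholds are assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is shed because the circuit is open."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Shed load: do not touch the downstream service at all.
                raise CircuitOpenError("circuit is open")
            # Cool-down elapsed: allow a trial call (half-open state).
            # A single failure of the trial call re-opens the circuit.
            self.opened_at = None
            self.consecutive_failures = self.failure_threshold - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.consecutive_failures = 0
        return result


def call_with_limited_retries(breaker, fn, max_attempts=2):
    """Retry at most `max_attempts` times (>= 1), never against an open circuit."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise
        except Exception as exc:
            last_error = exc
    raise last_error
```

While the circuit is open, callers fail fast instead of queuing up behind a struggling dependency, and the retry cap keeps a single request from multiplying the load it sends downstream.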
Several factors make it challenging to make MAP resilient:
(2) There is no one type of fallback that works for all scenarios:
- In some cases, an empty response is the only option, and devices have to be able to handle it gracefully, for example when data for the "My List" row cannot be retrieved.
- Various degraded modes of performance can be supported. E.g. if the latest personalized home page cannot be delivered, fallbacks can range from stale, personalized recommendations to non-personalized recommendations.
- In other cases, an exception or error code might be the right response. It tells clients that there is a problem and lets them adapt the user experience, for example by skipping a step in a workflow or requesting different data.
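The three fallback styles above can be sketched as follows. This is an illustration under stated assumptions, not MAP's implementation: `DataUnavailable`, `first_available`, `home_page`, `my_list_row`, `workflow_step`, and the `STEP_UNAVAILABLE` code are all hypothetical names.

```python
class DataUnavailable(Exception):
    """Raised by a source that cannot produce a response."""


def first_available(sources):
    """Return (label, data) from the first source that succeeds.

    `sources` is an ordered list of (label, zero-argument callable),
    from most to least desirable.
    """
    for label, fetch in sources:
        try:
            return label, fetch()
        except DataUnavailable:
            continue
    raise DataUnavailable("no source could produce a response")


def home_page(fetch_fresh, fetch_stale, fetch_unpersonalized):
    # Degraded modes: fresh personalized, then stale personalized,
    # then non-personalized recommendations.
    return first_available([
        ("personalized", fetch_fresh),
        ("stale-personalized", fetch_stale),
        ("non-personalized", fetch_unpersonalized),
    ])


def my_list_row(fetch_my_list):
    # Empty response: the device is expected to render without this row.
    try:
        return fetch_my_list()
    except DataUnavailable:
        return []


def workflow_step(fetch_step):
    # Error code: tell the client explicitly so it can adapt the experience.
    try:
        return {"status": "ok", "data": fetch_step()}
    except DataUnavailable:
        return {"status": "error", "code": "STEP_UNAVAILABLE"}
```

Returning the label alongside the data lets a caller (or a test) see which degraded mode actually served the request.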
How do we go from Chaos to Control?
(1) Ensure an acceptable user experience during a MAP failure, e.g. that the user can browse our selection and continue to play videos
(2) Ensure that services that depend on MAP, i.e. the API service and device platforms, are not impacted by a MAP failure and continue to provide uninterrupted service
(3) Ensure that services MAP depends on are not overwhelmed by excessive load from MAP
It is easy enough to identify obvious points of failure. For example, if a service provides data X, we could ensure that MAP has a fallback for data X being unavailable. What is harder is knowing the impact of failures in multiple services, in different combinations, and the impact of higher latencies.
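One way to reason about combined failures is to enumerate them explicitly. The sketch below is an assumption about how such scenarios could be generated, not a description of the tooling actually used; `failure_scenarios`, `run_scenarios`, and the `check` callback are hypothetical. It models outages only; added latency could be treated as further entries in the same list.

```python
from itertools import combinations


def failure_scenarios(services, max_simultaneous=2):
    """Yield every set of services to fail together, smallest sets first."""
    for size in range(1, max_simultaneous + 1):
        for combo in combinations(services, size):
            yield frozenset(combo)


def run_scenarios(services, check, max_simultaneous=2):
    """Return the scenarios for which `check(failed_services)` is falsy.

    `check` is supplied by the caller and should return True when the
    user experience is still acceptable with those services failed.
    """
    return [
        scenario
        for scenario in failure_scenarios(services, max_simultaneous)
        if not check(scenario)
    ]
```

With three services and `max_simultaneous=2` this produces three single failures and three pairs; the count grows quickly as more services or larger combinations are included, which is part of why the combined impact is hard to predict by inspection.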
(1) Timeouts for various calls
(2) Thresholds for opening circuits via Hystrix
(3) Fallbacks for certain use cases
(4) Thread pool settings
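The four settings listed above interact, which a small sketch can show. It uses only the Python standard library and does not reproduce Hystrix's API; `DependencySettings`, `call_with_timeout`, `should_open_circuit`, and every value passed to them are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from dataclasses import dataclass


@dataclass(frozen=True)
class DependencySettings:
    timeout_s: float          # (1) how long to wait for one call
    error_threshold_pct: int  # (2) error rate at which the circuit opens
    pool_size: int            # (4) max concurrent calls to the dependency


def should_open_circuit(errors, requests, settings):
    """True when the observed error rate reaches the configured threshold."""
    if requests == 0:
        return False
    return 100 * errors / requests >= settings.error_threshold_pct


def call_with_timeout(pool, fn, timeout_s, fallback):
    """Run `fn` on the bounded pool; use `fallback` (3) on timeout or error."""
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # cancel() only has an effect if the call has not started yet;
        # a call that is already running keeps its worker thread busy.
        future.cancel()
        return fallback()
    except Exception:
        return fallback()
```

A caller would create the pool with `ThreadPoolExecutor(max_workers=settings.pool_size)`. Because a timed-out call still occupies a worker until it finishes, a long timeout paired with a small pool can exhaust the pool under slow responses, which is one reason these values need to be tuned together.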
(1) Cache that handles MAP reads and writes
(2) Dependencies that MAP interfaces with
(3) MAP service itself
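As a rough illustration of injecting faults at each of these three points, the sketch below wraps a call with a switchable failure or delay. It is a toy stand-in, not the chaos tooling used in practice; `FaultInjector`, `InjectedFailure`, and the target names are hypothetical.

```python
import time


class InjectedFailure(Exception):
    """Raised in place of the real call when a failure is injected."""


class FaultInjector:
    """Hypothetical helper that fails or slows calls to named targets."""

    def __init__(self, sleep=time.sleep):
        self._faults = {}
        self._sleep = sleep

    def fail(self, target):
        self._faults[target] = ("fail", None)

    def delay(self, target, seconds):
        self._faults[target] = ("delay", seconds)

    def clear(self, target):
        self._faults.pop(target, None)

    def wrap(self, target, fn):
        """Return `fn` wrapped so that faults for `target` apply to it."""
        def wrapped(*args, **kwargs):
            kind, value = self._faults.get(target, (None, None))
            if kind == "fail":
                raise InjectedFailure(target)
            if kind == "delay":
                self._sleep(value)  # simulate higher latency, then proceed
            return fn(*args, **kwargs)
        return wrapped
```

Wrapping the cache client, each dependency client, and the MAP client under separate target names (for example `"cache"`, `"dependency"`, `"map"`) would let a test turn faults on and off per target, mid-session if needed.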
What does control look like?
- Simulating failures in all of the MAP server's dependencies caused an odd data mismatch:
- Severing connections to MAP server and the cache caused these duplicate titles to be served:
- When the cache was made unavailable mid-session, some rows looked like this:
- Simulating a failure in the “My List” service caused the PS4 UI to get stuck while adding a title to My List: