by Kolton Andrus, Naresh Gopalani, Ben Schmaus
It's no secret that at Netflix we enjoy deliberately breaking things to test our production systems. Doing so lets us validate our assumptions and prove that our mechanisms for handling failure will work when called upon. Netflix has a tradition of implementing a range of tools that create failure, and it is our pleasure to introduce you to the latest of these solutions: FIT, or Failure Injection Testing.
FIT is a platform that simplifies the creation of failure within our ecosystem, with greater precision over what we fail and whom we impact. FIT also allows us to propagate our failures across the entirety of Netflix in a consistent and controlled manner.
Why We Built FIT
While breaking things is fun, we do not enjoy causing our customers pain. Some of our Monkeys, by design, can go a little too wild when let out of their cages. Latency Monkey in particular has bitten our developers, leaving them wary about unlocking the cage door.
Latency Monkey adds a delay and/or failure on the server side of a request for a given service. This gives us good insight into how calling applications behave when a dependency slows down: threads pile up, the network becomes congested, and so on. However, Latency Monkey impacts all calling applications, whether they want to participate or not, and can result in customer pain if fallback handling, timeouts, and bulkheads don’t work as expected. With the complexity of our system, it is virtually impossible for us to anticipate where failures will happen when turning Latency Monkey loose. Validating these behaviors is risky, but doing so regularly is critical to remaining resilient.
What we need is a way to limit the impact of failure testing while still breaking things in realistic ways. We need to contain the impact until we have confidence that the system degrades gracefully, and then widen it to exercise the failure at scale. This is where FIT comes in.
How FIT Works
Simulating failure starts when the FIT service pushes failure simulation metadata to Zuul. Requests matching the failure scope at Zuul are decorated with failure. This may be an added delay to a service call, or a failure in reaching the persistence layer. Each injection point the request touches checks the request context to determine whether there is a failure for that specific component. If one is found, the injection point simulates that failure appropriately. Below is an outline of a simulated failure, demonstrating some of the injection points at which failure can be introduced.
We only want to break those we intend, so limiting the potential blast radius is critical. To achieve this we use Zuul, which provides many powerful capabilities for inspecting and managing traffic. Before forwarding a request, Zuul checks a local store of FIT metadata to determine if this request should be impacted. If so, Zuul decorates the request with a failure context, which is then propagated to all dependent services.
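The decorate-and-propagate step described above can be sketched roughly as follows. This is an illustration only, not Zuul's or FIT's actual API: the header name, the `Request` and `FailureRule` types, and both functions are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical header used to carry the failure context between services.
FAILURE_HEADER = "x-fit-failure-context"


@dataclass
class Request:
    customer_id: str
    headers: Dict[str, str] = field(default_factory=dict)


@dataclass
class FailureRule:
    # Predicate deciding whether a request falls inside the failure scope.
    matches: Callable[[Request], bool]
    # Opaque description of what should fail (format is an assumption).
    failure_context: str


def decorate(request: Request, local_store: List[FailureRule]) -> Request:
    """Edge step: attach the first matching failure context, if any."""
    for rule in local_store:
        if rule.matches(request):
            request.headers[FAILURE_HEADER] = rule.failure_context
            break
    return request


def call_downstream(incoming: Request) -> Request:
    """Service step: build an outbound request, copying the context forward."""
    outbound = Request(customer_id=incoming.customer_id)
    if FAILURE_HEADER in incoming.headers:
        outbound.headers[FAILURE_HEADER] = incoming.headers[FAILURE_HEADER]
    return outbound
```

Requests that match no rule pass through untouched, so only traffic that was explicitly scoped in ever carries a failure context.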
For most failure tests, we use Zuul to isolate impacted requests to only a specific test account or a specific device. Once validated at that level, we expand the scope to a small percentage of production requests. If the failure test still looks good, we gradually dial up the chaos to 100%.
We have several key “building block” components that are used within Netflix. They help us to isolate failure and define fallbacks (Hystrix), communicate with dependencies (Ribbon), cache data (EVCache), or persist data (Astyanax). Each of these layers makes a perfect injection point for failure. These layers interface with the FIT context to determine whether a given request should be impacted. The failure behavior is provided to that layer, which determines how to emulate that failure in a realistic fashion: sleep for a delay period, return a 500, throw an exception, etc.
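As a rough sketch of what such a layer might do, the wrapper below looks up its own name in the failure context and, if a behavior is present, sleeps and/or raises before making the real call. All names here (`FailureBehavior`, `InjectedFailure`, `injection_point`) are hypothetical, and the context is modeled as a plain dictionary purely for illustration; none of this is the actual Hystrix, Ribbon, EVCache, or Astyanax integration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class InjectedFailure(Exception):
    """Raised in place of a real dependency error."""


@dataclass(frozen=True)
class FailureBehavior:
    delay_seconds: float = 0.0
    raise_error: bool = False


# Assumed shape of the failure context: injection point name -> behavior.
FailureContext = Dict[str, FailureBehavior]


def injection_point(
    name: str,
    context: FailureContext,
    call: Callable[[], T],
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, first simulating any failure configured for `name`."""
    behavior = context.get(name)
    if behavior is not None:
        if behavior.delay_seconds > 0:
            sleep(behavior.delay_seconds)  # emulate a slow dependency
        if behavior.raise_error:
            raise InjectedFailure(f"simulated failure at {name}")
    return call()
```

The `sleep` parameter is injectable so the delay can be observed in tests without actually waiting. A caller would wrap a dependency call along the lines of `injection_point("cache", context, lambda: cache.get(key))`, where `cache` stands in for whatever client the layer owns.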
Whether we are recreating a past outage or proactively testing the loss of a dependency, we need to know what could fail in order to build a simulation. We use an internal system that traces requests through the entirety of the Netflix ecosystem to find all of the injection points along the path. We then use these to create failure scenarios: sets of injection points that should or should not fail. One such example is our critical services scenario, which leaves available only the minimum set of services required to stream. Another may be the loss of an individual service, including its persistence and caching layers.
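The two example scenarios could be modeled as mappings from injection point to a should-fail flag. The sketch below assumes traces are available as simple lists of injection point names and that a service's own layers are named `service.layer`; both are conventions invented for this example, as are the function names.

```python
from typing import Dict, Iterable, Set


def injection_points_on_path(traces: Iterable[Iterable[str]]) -> Set[str]:
    """Collect every injection point seen across a set of traced requests."""
    found: Set[str] = set()
    for trace in traces:
        found.update(trace)
    return found


def critical_services_scenario(
    all_points: Iterable[str], critical: Set[str]
) -> Dict[str, bool]:
    """Fail everything except the critical set. True means 'should fail'."""
    return {point: point not in critical for point in sorted(all_points)}


def loss_of_service_scenario(
    all_points: Iterable[str], service: str
) -> Dict[str, bool]:
    """Fail one service and its own layers, assumed to be named 'service.x'."""
    prefix = service + "."
    return {
        point: point == service or point.startswith(prefix)
        for point in sorted(all_points)
    }
```

Expressing both scenarios against the same traced set of points keeps them comparable: each one is just a different answer to "which of these should fail?".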
Failure testing tools are only as valuable as their usage. Our device testing teams have developed automation that enables failure, launches Netflix on a device, browses through several lists, selects a video, and begins streaming. We began by validating that this process works when only our critical services are available. Currently we are extending this to identify every dependency touched during this process and to systematically fail each one individually. Because this runs continuously, it helps us identify vulnerabilities as they are introduced.
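The "fail each dependency individually" loop might look something like the sketch below. It assumes the device automation can be invoked as a function that takes the set of failed dependencies and reports whether the browse-and-play flow still succeeded; `run_flow` and `fail_each_dependency` are hypothetical names, not part of the actual tooling.

```python
from typing import Callable, FrozenSet, Iterable, List


def fail_each_dependency(
    dependencies: Iterable[str],
    run_flow: Callable[[FrozenSet[str]], bool],
) -> List[str]:
    """Return the dependencies whose individual failure breaks the flow.

    `run_flow` receives the set of failed dependencies and returns True
    if the user-facing flow still completed.
    """
    vulnerable: List[str] = []
    for dependency in dependencies:
        # Fail exactly one dependency per run so a broken flow is attributable.
        if not run_flow(frozenset({dependency})):
            vulnerable.append(dependency)
    return vulnerable
```

Anything this returns that is not on the critical services list would be a candidate for a missing or broken fallback.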
FIT has proven useful in bridging the gap between isolated testing and large-scale chaos exercises, and in making such testing self-service. It is one of many tools we have to help us build more resilient systems. The scope of the problem extends beyond just failure testing; we need a range of techniques and tools: designing for failure, better detection and faster diagnosis, regular automated testing, bulkheading, etc. If this sounds interesting to you, we’re looking for great engineers to join our reliability, cloud architecture, and API teams!