Principles of Resiliency
- A failure in a service dependency should not break the user experience for members
- The API should automatically take corrective action when one of its service dependencies fails
- The API should be able to show us what’s happening right now, in addition to what was happening 15-30 minutes ago, yesterday, last week, etc.
Keep the Streams Flowing
A call to a service dependency is treated as failed when any of the following occurs:
- A request to the remote service times out
- The thread pool and bounded task queue used to interact with a service dependency are at 100% capacity
- The client library used to interact with a service dependency throws an exception
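The three failure conditions above can be told apart with nothing more than the JDK's concurrency utilities. What follows is a minimal sketch under stated assumptions, not the API's actual implementation: the class `DependencyCall`, the `Outcome` enum, and the `attempt` method are hypothetical names introduced for illustration. It runs each call on a bounded thread pool with a bounded queue, so a full pool shows up as a rejection, a slow call as a timeout, and a client library error as an exception.

```java
import java.util.concurrent.*;

public class DependencyCall {
    /** Hypothetical classification of a single call to a service dependency. */
    public enum Outcome { SUCCESS, TIMEOUT, REJECTED, EXCEPTION }

    private final ThreadPoolExecutor pool;

    public DependencyCall(int threads, int queueSize) {
        // Fixed-size pool with a bounded queue. AbortPolicy makes submit()
        // throw RejectedExecutionException once threads and queue are full.
        this.pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),
                new ThreadPoolExecutor.AbortPolicy());
    }

    public Outcome attempt(Callable<String> call, long timeoutMillis) {
        Future<String> future;
        try {
            future = pool.submit(call);
        } catch (RejectedExecutionException e) {
            return Outcome.REJECTED;      // pool and queue at 100% capacity
        }
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return Outcome.SUCCESS;
        } catch (TimeoutException e) {
            future.cancel(true);          // ask the worker to stop; it may ignore this
            return Outcome.TIMEOUT;       // request to the remote service timed out
        } catch (ExecutionException e) {
            return Outcome.EXCEPTION;     // the client library threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Outcome.EXCEPTION;
        }
    }

    public void shutdown() {
        pool.shutdownNow();
    }
}
```

Giving each dependency its own small pool is a deliberate choice in this sketch: when one service stalls, only that pool fills up, and callers are rejected immediately instead of tying up request threads.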
When a call fails, the API responds with one of three fallback approaches:
- Custom fallback - in some cases a service’s client library provides a fallback method we can invoke, or in other cases we can use locally available data on an API server (e.g., a cookie or local JVM cache) to generate a fallback response
- Fail silent - in this case the fallback method simply returns a null value, which is useful if the data provided by the service being invoked is optional for the response that will be sent back to the requesting client
- Fail fast - used when the data is required or there’s no good fallback; the client gets a 5xx response. This can degrade the device UX, which is not ideal, but it keeps API servers healthy and lets the system recover quickly once the failing service becomes available again.
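The three fallback approaches can be sketched as small wrappers around a call to a dependency. This is an illustrative sketch only: `Fallbacks`, `customFallback`, `failSilent`, `failFast`, and `ServiceUnavailableException` are hypothetical names, and the sketch assumes a web layer elsewhere that maps the exception to a 5xx response.

```java
import java.util.concurrent.Callable;
import java.util.function.Supplier;

public class Fallbacks {
    /** Hypothetical exception; a web layer is assumed to map it to a 5xx response. */
    public static class ServiceUnavailableException extends RuntimeException {
        public ServiceUnavailableException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Custom fallback: on failure, answer from locally available data. */
    public static <T> T customFallback(Callable<T> primary, Supplier<T> fallback) {
        try {
            return primary.call();
        } catch (Exception e) {
            return fallback.get();   // e.g. a value from a cookie or a local cache
        }
    }

    /** Fail silent: on failure, return null because the data is optional. */
    public static <T> T failSilent(Callable<T> primary) {
        try {
            return primary.call();
        } catch (Exception e) {
            return null;
        }
    }

    /** Fail fast: on failure, give up immediately so the caller sees an error. */
    public static <T> T failFast(Callable<T> primary) {
        try {
            return primary.call();
        } catch (Exception e) {
            throw new ServiceUnavailableException("dependency unavailable", e);
        }
    }
}
```

A caller would pick the wrapper per dependency: `failSilent` for optional data in the response, `customFallback` where a cached or default value is acceptable, and `failFast` where the response is meaningless without the data.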