We want to provide an amazing experience to each member, winning the “moments of truth” where they decide what entertainment to enjoy. To do that, we need to understand the health of our system. To quickly and easily understand the health of the system, we need a simple metric that a diverse set of people can comprehend. In this post we will discuss how we discovered and aligned everyone around one operational metric indicating service health, enabling us to streamline production operations and improve availability. We will detail how we approach signal analysis, deviation detection, and alerting for that signal and for other general use cases.
Creating the Right Signal
In the early days of Netflix streaming, circa 2008, we manually tracked hundreds of metrics, relying on humans to detect problems. Our approach worked for tens of servers and thousands of devices, but not for the thousands of servers and millions of devices that were in our future. Complexity and human-reliant approaches don’t scale; simplicity and algorithm-driven approaches do.
We sought out a single indicator that closely approximated our most important activity: viewing. We discovered that a server-side metric related to playback starts (the act of “clicking play”) both followed a predictable pattern and fluctuated significantly when UI, device, or server problems occurred. The Netflix streaming pulse was created. We named it “SPS” for “starts per second”.
Example of stream starts per second, comparing week over week (red = current week, black = prior week)
The SPS pattern is regular within each geographic region, with variations due to external influences like major regional holidays and events. The regional pattern completes one cycle per day, peaking in the evening and bottoming out in the early morning hours. On regional holidays, when people are off work and kids are home from school, we see increased viewing in the daytime hours, as our members have more free time to enjoy viewing a title on Netflix.
Because there is consistency in the streaming behaviors of our members, with a moderate amount of data we can establish reliable predictions of how many stream starts we should expect at any point of the week. Deviations from the prediction are a powerful way to tell the entire company when there is a problem with the service. We use these deviations to trigger alerts and to help us understand when service has been fully restored.
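As a rough illustration of comparing actual starts with an expectation, the sketch below takes the median of the same weekday-and-hour slot from prior weeks as the expected value. This is only an assumed baseline for the sake of the example, not the prediction model described in this post; the `history` layout, the function names, and the sample numbers are all hypothetical.

```python
from statistics import median
from typing import Dict, List, Tuple

Slot = Tuple[int, int]  # (weekday, hour), a hypothetical way to key the week


def expected_starts(history: Dict[Slot, List[float]], slot: Slot) -> float:
    """Expected SPS for a slot: median of that slot in prior weeks (assumption)."""
    return median(history[slot])


def relative_deviation(actual: float, expected: float) -> float:
    """Signed deviation as a fraction of the expectation (negative means a drop)."""
    return (actual - expected) / expected


# Made-up SPS values for Friday 20:00 over three prior weeks.
history = {(4, 20): [1000.0, 1040.0, 980.0]}
expected = expected_starts(history, (4, 20))
deviation = relative_deviation(700.0, expected)
print(f"expected={expected:.0f} deviation={deviation:+.0%}")
```

A median is used here because it is less sensitive than a mean to a single unusual week, such as a holiday.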
Its simplicity allows SPS to be central to our internal vernacular. Production problems are categorized as “SPS impacting” or “not SPS impacting,” indicating their severity. Overall service availability is measured using expected versus actual SPS levels.
Deviation Detection Models
To maximize the power of SPS, we need reliable ways to find deviations in our actual stream starts relative to our expected stream starts. The following is a range of techniques that we have explored for detecting such deviations.
A common starting place when trying to detect change is to define a fixed boundary that characterizes normal behavior. This boundary can be described as a floor or ceiling which, if crossed, indicates a deviation from normal behavior. The simplicity of static thresholds makes them a popular choice when trying to detect the presence of, or an increase in, a signal. For example, detecting when there is an increase in CPU usage:
Example where a static threshold could be used to help detect high CPU usage
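A static ceiling of this kind can be sketched in a few lines. The function name, the 90% ceiling, and the sample readings below are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List


def breaches_ceiling(samples: Iterable[float], ceiling: float) -> List[int]:
    """Return the indices of samples that exceed a fixed ceiling."""
    return [i for i, value in enumerate(samples) if value > ceiling]


# Hypothetical CPU utilisation readings in percent, one per minute.
cpu_percent = [35.0, 41.0, 38.0, 92.0, 95.0, 40.0]
print(breaches_ceiling(cpu_percent, ceiling=90.0))  # [3, 4]
```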
However, static thresholds are insufficient for accurately capturing deviations in oscillating signals. For example, a static threshold would not be suitable for accurately detecting a drop in SPS, because SPS itself oscillates throughout the day.
Another technique that can be used to detect deviations is to use an exponential smoothing function, such as an exponential moving average, to compute an upper or lower threshold that bounds the original signal. These techniques assign exponentially decreasing weights as the observations get older. The benefit of this approach is that the bound is no longer static and can “move” with the input signal, as shown in the image below:
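The difficulty can be made concrete with a toy daily curve. The sketch below assumes a hypothetical cosine-shaped SPS pattern (peak of 180 at 20:00, trough of 20 at 08:00) and two candidate floors; none of these numbers come from real traffic.

```python
import math


def normal_sps(hour: float) -> float:
    """Hypothetical daily SPS curve: peak of 180 at 20:00, trough of 20 at 08:00."""
    return 100.0 + 80.0 * math.cos(2.0 * math.pi * (hour - 20.0) / 24.0)


def below_floor(value: float, floor: float) -> bool:
    """True when the value has dropped under a fixed floor."""
    return value < floor


low_floor = 15.0   # under the daily trough, so quiet on normal traffic
high_floor = 95.0  # high enough to notice a halved evening peak

# A 50% drop at the evening peak stays well above the low floor.
missed = not below_floor(0.5 * normal_sps(20), low_floor)

# The high floor catches that drop, but also fires on ordinary off-peak hours.
false_alarm_hours = [h for h in range(24) if below_floor(normal_sps(h), high_floor)]

print(missed, false_alarm_hours)
```

In this toy setup no single floor works: the low one misses a halved peak, and the high one alerts for many hours of every normal day.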
Example of data smoothing using moving average
Another benefit is that exponential smoothing techniques take into account all past observations, yet require only the previous smoothed value and the newest observation to be kept. These aspects make them desirable for real-time alerting.
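A minimal sketch of a moving lower bound built on an exponential moving average follows. The class name, the multiplicative `tolerance`, and the sample values are assumptions made for illustration; the only state carried between observations is a single smoothed value.

```python
from typing import Optional


class EmaLowerBound:
    """Flags values that fall below a bound derived from an exponential moving average."""

    def __init__(self, alpha: float, tolerance: float) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha          # weight given to the newest observation
        self.tolerance = tolerance  # hypothetical: allowed fractional drop below the average
        self.smoothed: Optional[float] = None

    def update(self, value: float) -> bool:
        """Check value against the current bound, then fold it into the average."""
        breached = (
            self.smoothed is not None
            and value < self.smoothed * (1.0 - self.tolerance)
        )
        if self.smoothed is None:
            self.smoothed = value  # seed with the first observation
        else:
            self.smoothed = self.alpha * value + (1.0 - self.alpha) * self.smoothed
        return breached


detector = EmaLowerBound(alpha=0.5, tolerance=0.2)
flags = [detector.update(v) for v in [100.0, 102.0, 101.0, 60.0, 99.0]]
print(flags)  # [False, False, False, True, False]
```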
Double Exponential Smoothing
To detect a change in SPS behavior, we use Double Exponential Smoothing (DES) to define an upper and lower boundary that captures the range of acceptable behavior. This technique includes a parameter that takes into account any trend in the data, which works well for following the oscillating trend in SPS. There are more advanced smoothing techniques, such as triple exponential smoothing, which also take into account seasonal trends. However, we do not use these techniques as we are interested in detecting a deviation in behavior over a short period of time which does not contain a pronounced seasonal trend.
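The standard double exponential smoothing recurrence (Holt's linear method) keeps a level and a trend, and forecasts the next point as their sum. The sketch below wraps that forecast in an upper and lower band. The multiplicative `margin`, the seeding of level and trend from the first two points, and the sample series are assumptions for illustration, not the configuration used in production.

```python
from typing import List, Sequence


def des_breaches(series: Sequence[float], alpha: float, beta: float,
                 margin: float) -> List[int]:
    """Return indices whose value falls outside a band around the DES forecast.

    alpha is the data smoothing factor and beta the trend smoothing factor.
    The band is forecast * (1 +/- margin), which assumes positive forecasts.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations to seed level and trend")
    level = series[1]
    trend = series[1] - series[0]
    breaches = []
    for index in range(2, len(series)):
        value = series[index]
        forecast = level + trend  # one-step-ahead prediction
        lower = forecast * (1.0 - margin)
        upper = forecast * (1.0 + margin)
        if not lower <= value <= upper:
            breaches.append(index)
        previous_level = level
        level = alpha * value + (1.0 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1.0 - beta) * trend
    return breaches


# Made-up series: steady growth, a sharp drop, then an immediate rebound.
sps = [100.0, 102.0, 104.0, 106.0, 60.0, 110.0]
print(des_breaches(sps, alpha=0.5, beta=0.5, margin=0.2))  # [4, 5]
```

Index 4 is the drop itself. Index 5 is flagged because the model has already followed the drop downward, so the rebound lands above the upper bound.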
Before creating a DES model, one must first select values for the data smoothing factor and the trend smoothing factor. To visualize the effect these parameters have on DES, see this interactive visualization. The estimation of these parameters is crucial as they can greatly affect accuracy. While these parameters are typically determined by an individual's intuition or trial and error, we have experimented with data-driven approaches to automatically initialize them (motivated by Gardner). We are able to apply those identified parameters to signals that share similar daily patterns and trends, for example SPS at the device level.
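One simple data-driven way to pick the two factors is a grid search that minimizes the squared one-step-ahead forecast error on historical data. The sketch below shows that idea only; it is not claimed to be the approach the text refers to, and the grid, function names, and synthetic series are hypothetical.

```python
import math
from itertools import product
from typing import Sequence, Tuple

GRID = tuple(i / 10 for i in range(1, 10))  # candidate factors 0.1 .. 0.9 (assumption)


def des_sse(series: Sequence[float], alpha: float, beta: float) -> float:
    """Sum of squared one-step-ahead DES forecast errors over series[2:]."""
    level = series[1]
    trend = series[1] - series[0]
    total = 0.0
    for value in series[2:]:
        error = value - (level + trend)
        total += error * error
        previous_level = level
        level = alpha * value + (1.0 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1.0 - beta) * trend
    return total


def fit_des(series: Sequence[float]) -> Tuple[float, float]:
    """Return the (alpha, beta) pair from GRID with the lowest forecast error."""
    return min(product(GRID, GRID), key=lambda pair: des_sse(series, *pair))


# Synthetic three-day signal with a 24-sample daily cycle.
signal = [100.0 + 50.0 * math.sin(2.0 * math.pi * i / 24.0) for i in range(72)]
alpha, beta = fit_des(signal)
print(alpha, beta)
```

Parameters fitted on one signal could then be tried on other signals with a similar daily shape, in the spirit of the reuse described above.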
The image below shows an example where DES has been used to create a lower bound in an attempt to capture a deviation in normal behavior. Shortly after 18:00 there is a drop in SPS which crosses the DES threshold, alerting us to a potential issue with our service. By alerting on this drop, we can respond and take actions to restore service health.
While DES accurately identifies a drop in SPS, it is unable to predict when the system has recovered. In the example below, the sharp recovery of SPS at approximately 20:00 is not accurately modeled by DES, causing it to underpredict and generate false alarms for a short period of time:
In spite of these shortcomings, DES has been an effective mechanism for detecting actionable deviations in SPS and other operational signals.
We have begun experimenting with Bayesian techniques in a stream mining setting to improve our ability to detect deviations in SPS. An example of this is Bayesian switchpoint detection and Markov Chain Monte Carlo (MCMC). See  for a concise introduction to using MCMC for anomaly detection and  for Bayesian switchpoint detection.
Bayesian techniques offer some advantages over DES in this setting. Those familiar with probabilistic programming techniques know that the underlying models can be fairly complex, but they can be made to be non-parametric by drawing parameters from uniform priors when possible. Using the posteriors from such calculations as priors for the next iteration allows us to create models that evolve as they receive more data.
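The idea of feeding a posterior back in as the next prior can be shown with a much simpler model than the MCMC-based ones discussed here. The sketch below substitutes a conjugate Gamma-Poisson model for a per-second start count, where the update has a closed form. The class name, the prior values, and the counts are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class GammaBelief:
    """Gamma(shape, rate) belief about the rate of a Poisson-distributed count."""

    shape: float
    rate: float

    def update(self, counts: Sequence[int]) -> "GammaBelief":
        """Conjugate update: the returned posterior serves as the next prior."""
        return GammaBelief(self.shape + sum(counts), self.rate + len(counts))

    @property
    def mean(self) -> float:
        """Posterior mean of the Poisson rate."""
        return self.shape / self.rate


prior = GammaBelief(shape=1.0, rate=0.5)  # weak prior, chosen arbitrarily
after_first = prior.update([10, 12])      # first mini-batch of counts
after_second = after_first.update([11, 9])  # posterior reused as the prior
print(round(after_second.mean, 2))
```

Updating batch by batch gives the same belief as a single update over all the data, which is what lets a model of this kind evolve as observations arrive.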
Unfortunately, our experiments with Bayesian anomaly detection have revealed downsides compared to DES. MCMC is significantly more computationally intensive than DES, so much so that some are exploiting graphics cards in order to reduce the run time, a common technique for computationally intensive processes. Furthermore, the underlying model is not as easily interpreted by a human due to the complexity of the parameter interactions. These limitations, especially the performance-related ones, restrict our ability to apply these techniques to a broad set of metrics in real time.
Bayesian techniques, however, do not solve the entire problem of data stream mining in a non-stationary setting. There exists a rich field of research on the subject of real-time data stream mining. MCMC is by design a batch process, though it can be applied in a mini-batch fashion. Our current research is incorporating learnings from the stream-mining community in stream classification and drift detection. Additionally, our Data Science and Engineering team has been working on an approach based on Robust Principal Component Analysis (Robust PCA) to deal with high cardinality data. We’re excited to see what comes from this research in 2015.
We have streamlined production operations and improved availability by creating a single directional metric that indicates service health: SPS. We have experimented with and used a number of techniques to derive additional insight from this metric, including threshold-based alerting, exponential and double exponential smoothing, and Bayesian and stream mining approaches. SPS is the pulse of Netflix streaming, focusing the minds at Netflix on ensuring streaming is working when you want it to be.
If you would like to join us in tackling these kinds of challenges, we are hiring!