As we built RAD we identified four generic challenges that are ubiquitous in outlier detection on “big data.”
- High cardinality dimensions: High cardinality data sets - especially those with large combinatorial permutations of column groupings - makes human inspection impractical.
- Minimizing False Positives: A successful anomaly detection tool must minimize false positives. In our experience there are many alerting platforms that “sound an alarm” that goes ultimately unresolved. The goal is to create alerting mechanisms that can be tuned to appropriately balance noise and information.
- Seasonality: Hourly/Weekly/Bi-weekly/Monthly seasonal effects are common and can be mis-identified as outliers deserving attention if not handled properly. Seasonal variability needs to be ignored.
- Data is not always normally distributed: This has been a particular challenge since Netflix has been growing over the last 24 months. Generally though, an outlier tool must be robust so that it works on data that is not normally distributed.
In addition to addressing the challenges above, we wanted a solution with a generic interface (supporting application development). We met these objectives with a novel algorithm encased in a wrapper for easy deployment in our ETL environment.
We initially tested techniques like moving averages with standard deviations and time series/regression models (ARIMAX) but found that these simpler methods were not robust enough in high cardinality data.
The algorithm we finally settled on uses Robust Principal Component Analysis (RPCA) to detect anomalies. PCA uses the Singular Value Decomposition (SVD) to find low rank representations of the data. The robust version of PCA (RPCA) identifies a low rank representation, random noise, and a set of outliers by repeatedly calculating the SVD and applying “thresholds” to the singular values and error for each iteration. For more information please refer to the original paper by Candes et al. (2009).
Below is an interactive visualization of the algorithm at work on a simple/random dataset and on public climate data.
Since Apache Pig is the primary ETL language at Netflix, we wrapped this algorithm in a Pig function enabling engineers to easily use it with just a few additional lines of code. We’ve open-sourced both the Java function that implements the algorithm and the Pig wrapper. The details and a sample application (with code) can be found here.
The following are two popular applications where we initially implemented this anomaly detection system at Netflix with great success.
Netflix processes millions of transactions every day across tens of thousands of banking institutions/infrastructures in both real-time and batch environments. We’ve used the above solution to detect anomalies in failures in the payment network at a bank level. With the above system, business managers were able to follow up with their counterparts in the payment industry and thereby reducing the impact on Netflix customers
Our signup flow was another important point of application. Today Netflix customers sign up across the world on hundreds of different types of browsers or devices. Identifying anomalies across unique combinations of country, browser/device and language helps our engineers understand and react to customer sign up problems in a timely manner.
A robust algorithm is paramount to the success of any anomaly detection system and RPCA has worked very well for detecting anomalies. Along with the algorithm, our focus on simplifying the implementation with a Pig wrapper made the tool a great success. The applications listed above have helped the Netflix data teams understand and react to anomalies faster--which reduces the impact to Netflix customers and our overall business.