Thursday, January 16, 2014

Improving Netflix’s Operational Visibility with Real-Time Insight Tools

By Ranjit Mavinkurve, Justin Becker and Ben Christensen


For Netflix to be successful, we have to be vigilant in supporting the tens of millions of connected devices that are used by our 40+ million members throughout 40+ countries. These members consume more than one billion hours of content every month and account for nearly a third of the downstream Internet traffic in North America during peak hours.
From an operational perspective, our system environments at Netflix are large, complex, and highly distributed. And at our scale, humans cannot continuously monitor the status of all of our systems. To maintain high availability across such a complicated system, and to help us continuously improve the experience for our customers, it is critical for us to have exceptional tools coupled with intelligent analysis to proactively detect and communicate system faults and identify areas of improvement.

In this post, we will talk about our plans to build a new set of insight tools and systems that create greater visibility into our increasingly complicated and evolving world.

Extending Our Current Insight Capabilities

Our existing insight tools include dashboards that display the status of our systems in near real time, and alerting mechanisms that notify us of major problems. While these tools are highly valuable, we have the opportunity to do even better.

Perspective
Many of our current insight tools are systems-oriented. They are built from the perspective of the system providing the metrics. This results in a proliferation of custom tools and views that require specialized knowledge to use and interpret. Also, some of these tools tend to focus more on system health and not as much on our customers’ streaming experience. 

Instead, what we need is a cohesive set of tools that provide relevant insight, effective visualization of that insight, and smooth navigability from the perspective of the tools’ consumers. The consumers of these tools are internal staff members such as engineers who want to view the health of a particular part of our system or look at some aspect of our customers’ streaming experience. To reduce the time needed to detect, diagnose, and resolve problems, it is important for these tools to be highly effective.

Context

When attempting to deeply understand a particular part of the system, or when troubleshooting a problem, it is invaluable to have access to accurate, up-to-date information and relevant context. Rich and relevant context is a highly desirable feature to have in our insight tools.

Consider the following example. Our current insight tools provide ways to visualize and analyze a metric and trigger an alert based on the metric. We use this to track stream-starts-per-second (SPS), a metric used to gauge user engagement. If SPS rises above or drops below its "normal" range, our monitoring and alerting system triggers an alert. This helps us detect that our customers are unable to stream. However, since the system does not provide context around the alert, we have to use a variety of other means to diagnose the problem, delaying resolution of the problem. Instead, suppose we had a tool tracking all changes to our environment. Now, if an alert for SPS were triggered, we could draw correlations between the system changes and the alert to provide context to help troubleshoot the problem. 

Alerts are by no means the only place where context is valuable. In fact, the insight provided by virtually any view or event is significantly more effective when relevant context is provided. 

Automation
While we need to be able to analyze and identify interesting or unusual behavior and surface any anomalies, we also need for this to happen automatically and continuously. Further, not only should it be possible to add new anomaly detection or analysis algorithms easily, but also the system itself should be able to create and apply new rules and actions for these algorithms dynamically via machine learning.

Ad Hoc Queries
In the streaming world, we have multiple facets like customer, device, movie, region, ISP, client version, membership type, etc., and we often need quick insight based on a combination of these facets. For example, we may want to quickly view the current time-weighted average bitrate for content delivered to Xbox 360s in Mexico on Megacable ISP. An insight system should be able to provide quick answers to such questions in order to enable quick analysis and troubleshooting.

Dynamic Visualizations
Our existing insight tools have dashboards with time-series graphs that are very useful and effective. With our new insight tools, we want to take our tools to a whole new level, with rich, dynamic data visualizations that visually communicate relevant, up-to-date details of the state of our environments for any operational facets of interest. For example, we want to surface interesting patterns, system events, and anomalies via visual cues within a dynamic timeline representation that is updated in near real-time.

Cohesive Insight
We have many disparate insight tools that are owned and used by separate teams. What we need is a set of tools that allow for shared communication, insights, and context in a cohesive manner. This is vital for the smooth operation of our complex, dynamic operational environments. We need a single, convenient place to communicate anomalous and informational changes as well as user-provided context.

The New Way

With our next generation of insight tools, we have the opportunity to create new and transformative ways to effectively deliver insights and extend our existing insight capabilities. We plan to build a new set of tools and systems for operational visibility that provide the insight features and capabilities that we need.

Timeliness Matters

Operational insight is much more valuable when delivered immediately. For example, when a streaming outage occurs, we want to take remedial action as soon as possible, to minimize downtime. We want anomaly detection to be automatic, but we also want it quickly, in near real-time. For ad hoc analysis as well, quick insight is the key to troubleshooting a problem and getting to a fast resolution.

Event Stream Processing

Given the importance of timeliness, a key piece of technology for the backend system for our new set of insight tools for operational visibility will be dynamic, real-time Event Stream Processing (aka Complex Event Processing) with the ability to run ad hoc dynamic queries on live event streams. Event processing enables one to identify meaningful events, such as opportunities or threats, and respond to them as quickly as possible. This fits the goals of our insight tools very well.

Scaling Challenge
Our massive scale poses an interesting challenge with regard to data scaling. How can we track many millions of metrics with fine-grained context and also support fast ad hoc queries on live data while remaining cost-effective? If we persisted every permutation of each metric along with all of its associated dimensions for all the events flowing through our system, we could end up with trillions of values, taking up several TBs of disk space every day. 

Since the value of fine-grained insight data for operational visibility is high when the data is fresh but diminishes over time, we need a system that exploits these characteristics of insight data and has the ability to make recent data available quickly, without having to persist all of the data. 

We have the opportunity to build an innovative and powerful stream processing system that meets our insight requirements, such as support for ad hoc faceted queries on live streams of big data, and has the ability to scale dynamically to handle multiple, simultaneous queries.

A Picture is Worth a Thousand Words

Good visualization helps to communicate and deliver insights effectively. As we develop our new insight tools for operational visibility, it is vital that the front-end interface to this system provide dynamic data visualizations that can communicate the insights in a very effective manner. As mentioned earlier, we want to tailor the insights and views to meet the needs of the tool’s consumers.

We envision the front-end to our new operational insight tool to be an interactive, single-page web application with rich and dynamic data visualizations and dashboards updated in real-time. The design below is a mockup (with fake data) for one of the views.


There are several components within the design: a top level navigation bar to switch between different views in the system, a breadcrumbs component highlighting the selected facets, a main view module (a map in this instance), a key metrics component, a timeline and an incident view, on the right side of the screen. The main view communicates data based on the selected facets.  


The mockup shown above (with fake data) represents another view in the system and displays routes in our edge tier with request rates, error rates and other key metrics for each route updated in near real-time.

All views in the system are dynamic and reflect the current operational state based on selected facets. A user can modify the facets and immediately see changes to the user interface.

Summary

Operational visibility with real-time insight enables us to deeply understand our operational systems, make product and service improvements, and find and fix problems quickly so that we can continue to innovate rapidly and delight our customers at every interaction. We are building a new set of tools and systems for operational visibility at Netflix with powerful insight capabilities.

Join Us!

Do you want to help design and build the next generation of innovative insight tools for operational visibility at Netflix? This is a greenfield project where you can have a large impact working in a small team. Are you interested in real-time stream processing or data visualization? We are looking for talented engineers www.netflix.com/jobs.