Wednesday, April 8, 2015

Introducing Vector: Netflix's On-Host Performance Monitoring Tool

Vector is an open source host-level performance monitoring framework, which exposes hand-picked, high-resolution system and application metrics to every engineer’s browser. Having the right metrics available on demand and at a high resolution is key to understanding how a system behaves and correctly troubleshooting performance issues. Previously, we'd log in to instances as needed, run a variety of commands, and sift through the output for the metrics that matter. Vector cuts down the time to get to those metrics, helping us respond to incidents more quickly.

Vector provides a simple way for users to visualize and analyze system and application-level metrics in near real-time. It leverages the battle-tested open source system monitoring framework Performance Co-Pilot (PCP), layering a flexible and user-friendly UI on top. The UI polls metrics at up to 1-second resolution, rendering the data in fully configurable dashboards that simplify cross-metric correlation and analysis.

PCP’s stateless model makes it lightweight and robust. Its overhead on hosts is negligible, as clients are responsible for keeping track of state, sampling rate, and computation. Additionally, metrics are not aggregated across hosts or persisted outside of the user’s browser session, keeping the framework light. Vector requires only your local browser and PCP installed on the host you wish to monitor. No intermediate collector, server, or database infrastructure is required.

We are excited to release Vector to the community and look forward to feedback and collaboration!

High-Level Architecture


Vector itself is a web application that runs completely inside the user's browser. It was built with AngularJS and leverages D3.js for charts. In the future, the Vector package will also include custom metric agents.

Vector has a “default” dashboard exposed at launch. This dashboard is a simple page that holds a few options including UI object visibility flags, widget definitions, and a set of loaded widgets. Once loaded, it will display the set of loaded widgets and present the user with controls to include any of the additional predefined widgets.

Widgets are loaded into dashboards. A widget object will contain details about a specific widget, like its name, template, style, and more importantly, the data model to be used. Data models are, in a nutshell, objects that control the metrics required for each widget and how the values are used in it. Data model prototypes are relatively simple. They extend a base WidgetDataModel prototype and define their own init and destroy functions. Most of what is done in those functions is adding and removing metrics from the metric poller list, creating callback functions that deal with the data points returned from the poller itself, and referencing the right data structure to be used in the charts.
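The data model pattern described above can be sketched roughly as follows. This is a Python illustration of the idea only, not Vector's actual JavaScript code: the MetricPollerList class, its add and remove methods, and the LoadAverageDataModel name are all hypothetical, and kernel.all.load is used simply as an example PCP metric name.

```python
class MetricPollerList:
    """Hypothetical stand-in for the poller's list of metrics to fetch."""

    def __init__(self):
        # metric name -> list of (timestamp, value) data points
        self.metrics = {}

    def add(self, name):
        # Register the metric and return the list its data points go into.
        return self.metrics.setdefault(name, [])

    def remove(self, name):
        self.metrics.pop(name, None)


class WidgetDataModel:
    """Base data model: subclasses define their own init and destroy."""

    def __init__(self, poller_list):
        self.poller_list = poller_list
        self.data = None  # data structure the widget's chart reads from

    def init(self):
        raise NotImplementedError

    def destroy(self):
        raise NotImplementedError


class LoadAverageDataModel(WidgetDataModel):
    """Illustrative data model for a load average widget."""

    METRIC = "kernel.all.load"  # example metric name

    def init(self):
        # Add the metric to the poller list and reference its data.
        self.data = self.poller_list.add(self.METRIC)

    def destroy(self):
        # Stop polling the metric once the widget is removed.
        self.poller_list.remove(self.METRIC)
        self.data = None
```

In this sketch a widget only needs a reference to its data model. Loading the widget calls init, and removing it calls destroy, so the poller list contains just the metrics that visible widgets need.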

Generic data models were also created so that they can be reused by new widgets without having to write a specific data model for each one. More details about the data models can be found on Vector's wiki page.

Metrics are polled from Performance Co-Pilot's web daemon. They are referenced by unique names, and current values are returned with a timestamp so that they can be normalized. Vector makes use of two data structures to store metrics and their values. The "raw" metric data structure holds the original metric values that came from PCP. The "derived" metric data structure holds metrics that were modified by a data model function, like a cumulative function or a normalization function.
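As an illustration of how a "derived" value might be computed from "raw" samples, the sketch below converts a cumulative counter into a per-second rate using the timestamps returned with each value. The function name and the sample data are assumptions made for this example, and the code is Python rather than Vector's JavaScript.

```python
def rate_per_second(previous, current):
    """Turn two samples of a cumulative counter into a per-second rate.

    Each sample is a (timestamp_in_seconds, value) pair.
    Returns None when no time has elapsed between the samples.
    """
    prev_ts, prev_value = previous
    cur_ts, cur_value = current
    elapsed = cur_ts - prev_ts
    if elapsed <= 0:
        return None
    return (cur_value - prev_value) / elapsed


# "Raw" structure: counter values as they might arrive (made-up numbers).
raw = [(100.0, 5000), (101.0, 5250), (103.0, 5350)]

# "Derived" structure: one rate per pair of consecutive raw samples.
derived = [
    (cur[0], rate_per_second(prev, cur))
    for prev, cur in zip(raw, raw[1:])
]
# derived == [(101.0, 250.0), (103.0, 50.0)]
```

Dividing by the elapsed time rather than by the nominal polling interval keeps the rate correct even when a poll arrives late, which is one reason a timestamp alongside each value is useful.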

The metric poller is the component that goes over the list of "raw" metrics and polls them from PCP via HTTP, given the selected polling interval. It also executes all data model functions and consequently updates the "derived" metric data structure. Charts are automatically updated every time the data structure is updated.
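The polling cycle described above could look something like the following Python sketch. It is not Vector's implementation: the MetricPoller class, its method names, and the injected fetch callable (which stands in for the HTTP request to PCP's web daemon) are all hypothetical, as are the example.* metric names.

```python
import time


class MetricPoller:
    """Minimal sketch of a poller that maintains raw and derived metrics."""

    def __init__(self, fetch):
        # fetch is a hypothetical callable standing in for the HTTP call:
        # it takes a list of metric names and returns
        # (timestamp, {name: value}).
        self.fetch = fetch
        self.raw = {}               # name -> list of (timestamp, value)
        self.derived = {}           # name -> list of (timestamp, value)
        self.derive_functions = {}  # derived name -> function(raw) -> value

    def add_metric(self, name):
        self.raw.setdefault(name, [])

    def add_derived(self, name, function):
        self.derive_functions[name] = function
        self.derived.setdefault(name, [])

    def poll_once(self):
        if not self.raw:
            return
        # 1. Poll every raw metric currently in the list.
        timestamp, values = self.fetch(sorted(self.raw))
        for name, value in values.items():
            if name in self.raw:
                self.raw[name].append((timestamp, value))
        # 2. Run the data model functions to update the derived metrics.
        for name, function in self.derive_functions.items():
            result = function(self.raw)
            if result is not None:
                self.derived[name].append((timestamp, result))

    def run(self, interval_seconds, iterations):
        # Poll at the selected interval for a fixed number of iterations.
        for _ in range(iterations):
            self.poll_once()
            time.sleep(interval_seconds)


def used_fraction(raw):
    """Example data model function: latest used value over latest total."""
    return raw["example.used"][-1][1] / raw["example.total"][-1][1]
```

Passing the fetch function in, rather than hard-coding the HTTP request, keeps the sketch independent of any particular endpoint and makes the cycle easy to exercise with canned data.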

Performance Co-Pilot (PCP) is a system performance and analysis framework. It provides metric agents, a metric collector and a web daemon that is leveraged by the metric poller to collect metric values. More details about PCP can be found at pcp.io.

Getting Started


In order to get started, you should first install Performance Co-Pilot (PCP) on each host you plan to monitor. PCP will collect the metrics and make them available for Vector. The pmcd and pmwebd services need to be running on each host, the latter of which needs to expose its tcp/44323 network port.
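Before pointing Vector at a host, it can be handy to confirm that the pmwebd port is reachable from your machine. The sketch below is one way to do that with Python's standard library; the port_is_open helper is a name made up for this example, and a successful TCP connection only shows that something is listening on the port, not that it is pmwebd.

```python
import socket


def port_is_open(host, port=44323, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False
```

For example, port_is_open("my-host.example.com") would return False if the port is blocked by a firewall or the service is not running (the hostname here is a placeholder).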

Optional monitoring agents can also be installed in order to collect specific metrics that are not supported by PCP's system agent.

Once PCP is installed, you should be able to run Vector and connect to the target host.

Performance Co-Pilot (PCP)

Vector depends on Performance Co-Pilot (PCP) to collect metrics on each host you plan to monitor.

Since Vector depends on PCP version 3.10 or higher, the packages currently available in most Linux distro repositories will not suffice. Until newer versions are available in the repositories, you should be able to install PCP from binary packages made available by the PCP development team at:

ftp.pcp.io

Or build it from source. To do so, get the current version of the source code:

$ git clone git://git.pcp.io/pcp

Then build and install:

$ cd pcp
$ ./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var
$ make
$ sudo make install
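Once PCP is installed, it may be worth checking the installed version against the 3.10 minimum mentioned above. Comparing version strings as text is a common pitfall ("3.9" sorts after "3.10"), so the sketch below compares the numeric parts instead. The meets_minimum helper is a hypothetical name, and how you obtain the version string is left to your packaging tools.

```python
import re


def meets_minimum(version, minimum=(3, 10)):
    """Return True if a version string such as "3.10.3" is >= minimum.

    Only the leading major.minor numbers are compared.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    return (int(match.group(1)), int(match.group(2))) >= minimum
```

With this helper, "3.9.10" is rejected and "3.10.3" is accepted, which a plain string comparison would get backwards.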

More information on how to install Performance Co-Pilot can be found at:


Vector

Vector is a static web application that runs inside the client's browser. It can run locally or be deployed to any available HTTP server, such as Apache or Nginx.

To run it locally, first clone the repo:

$ git clone https://github.com/Netflix/vector.git

Make sure you have Bower installed on your system. Bower is a package management system for client-side programming, optimized for front-end development.


And install all dependencies:

$ cd vector
$ bower install

You can run Vector with Gulp. Gulp is an automated task runner and includes a development web server with live reload. In order to start Gulp’s web server, first make sure you have Gulp installed on your system:


Then, install all dependencies and execute the default Gulp task:

$ npm install
$ gulp

You can also run Vector with Python's SimpleHTTPServer (a Python 2 module; on Python 3 the equivalent module is http.server):

$ cd vector/app
$ python -m SimpleHTTPServer 8080

Then open Vector in your browser:

http://localhost:8080

Then enter the hostname of the server you plan to monitor. That's it!

Widgets & Dashboards

Vector's UI is based on dashboards and widgets. You can have one dashboard per browser tab/window. Dashboards are completely configurable and can have multiple widgets. Currently, there is no limit on the number of widgets a dashboard can contain, but real-time rendering of multiple charts can consume a significant amount of CPU and slow down the application. Changes made to dashboards are not yet persisted.

Window & Interval

Vector's UI aims to be extremely simple. Besides the hostname, there are only two configuration options: window and interval. The window option allows the user to select the rolling window size, in minutes, for all widgets in a dashboard. The interval option allows the user to select the metric polling interval, in seconds. If you have many widgets in a dashboard and the application starts to show signs of slowness, you should be able to decrease the window size and/or increase the interval to reduce CPU utilization.
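To see why these two options affect CPU usage, consider how many data points each chart has to hold and redraw. The Python sketch below assumes one data point per poll, which is a simplification rather than a description of Vector's internals; the max_points and make_series names are made up for the example.

```python
from collections import deque


def max_points(window_minutes, interval_seconds):
    """Number of samples that fit in the rolling window (at least one)."""
    return max(1, (window_minutes * 60) // interval_seconds)


def make_series(window_minutes, interval_seconds):
    """Rolling buffer that drops the oldest point once the window is full."""
    return deque(maxlen=max_points(window_minutes, interval_seconds))


# A 2-minute window polled every second holds 120 points per series,
# while the same window polled every 5 seconds holds only 24.
points_at_1s = max_points(2, 1)
points_at_5s = max_points(2, 5)
```

Under this assumption, halving the window or doubling the interval halves the number of points each chart renders on every update.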

Dashboards & Widgets

Vector comes with a predefined set of widgets and dashboards that can be easily extended. Here is a short list of metrics available by default.

CPU

  • Load Average
  • Runnable
  • CPU Utilization
  • Per-CPU Utilization
  • Context Switches

Memory

  • Memory Utilization
  • Page Faults

Disk

  • Disk IOPS
  • Disk Throughput
  • Disk Utilization
  • Disk Latency

Network

  • Network Drops
  • TCP Retransmits
  • TCP Connections
  • Network Throughput
  • Network Packets

Currently, there are only two pre-configured dashboards in Vector: the "default" dashboard, with a set of commonly used widgets, and an empty dashboard. To change dashboards, click on the "widget" drop-down menu and select the desired dashboard.

Next Steps

  • More widgets and dashboards
  • User-defined dashboards
  • Metric snapshots
  • CPU Flame Graphs
  • Disk Latency Heat Maps
  • Integration with Servo
  • Support for Cassandra

Conclusion

Observability is key to understanding how an application behaves under certain conditions and is essential to successfully troubleshooting performance issues. Vector lets us closely monitor hosts in near real-time and easily correlate metrics, making them accessible to every engineer and simplifying the troubleshooting process. It has proved to be an invaluable tool in helping us achieve great performance, and we plan to continue building and improving it!

You can find Vector on GitHub and on netflix.github.io!