Wednesday, June 26, 2013

Introducing Lipstick on A(pache) Pig

by Jeff Magnusson, Charles Smith, John Lee, and Nathan Bates

We’re pleased to announce Lipstick (our Pig workflow visualization tool) as the latest addition to the suite of Netflix Open Source Software.

At Netflix, developers rely heavily on Apache Pig to productionize complex data transformations and workflows against our big data. Pig provides good facilities for code reuse in the form of Python and Java UDFs and Pig macros. It also exposes a simple grammar that lets our users express workflows on big datasets without getting “lost in the weeds” worrying about complicated MapReduce logic.

While Pig’s high level of abstraction is one of its most attractive features, scripts can quickly reach a level of complexity at which the flow of execution, and its relation to the MapReduce jobs being executed, becomes difficult to conceptualize. This tends to prolong and complicate the effort required to develop, maintain, debug, and monitor the execution of scripts in our environment. To address these concerns, we have developed Lipstick, a tool that enables developers to visualize and monitor the execution of their data flows at a logical level.

Lipstick was initially developed as a stand-alone tool that produced a graphical depiction of a Pig workflow. While that was useful, we quickly realized that combining the workflow with information about the job as it ran gave the developer insight that previously required a lot of sifting through logs (or a Pig expert) to piece together. Now, as an implementation of Pig’s Progress Notification Listener interface (PigProgressNotificationListener), Lipstick piggybacks on top of all Pig scripts executed in our environment, notifying a Lipstick server of job executions and periodically reporting progress as the script executes.
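
To make the listener idea concrete, here is a minimal conceptual sketch in Python. It is not Lipstick's code and does not use Pig's actual listener interface: the class names, method names, and status fields below are all hypothetical, and the "server" is just an in-memory dictionary standing in for a remote service. It only illustrates the pattern described above, in which callbacks fired during script execution are forwarded to a server that a UI can then query.

```python
from dataclasses import dataclass


@dataclass
class JobStatus:
    """Progress for one MapReduce job (hypothetical shape, not Lipstick's schema)."""
    job_id: str
    map_progress: float = 0.0
    reduce_progress: float = 0.0
    finished: bool = False


class InMemoryServer:
    """Stand-in for a remote server; it keeps the latest status per job."""

    def __init__(self):
        self.jobs = {}

    def report(self, script_id, status):
        self.jobs[(script_id, status.job_id)] = status


class ProgressListener:
    """Receives callbacks from a script runner and forwards them to the server."""

    def __init__(self, server):
        self.server = server

    def job_started(self, script_id, job_id):
        self.server.report(script_id, JobStatus(job_id))

    def progress_updated(self, script_id, job_id, map_progress, reduce_progress):
        self.server.report(script_id, JobStatus(job_id, map_progress, reduce_progress))

    def job_finished(self, script_id, job_id):
        self.server.report(script_id, JobStatus(job_id, 1.0, 1.0, finished=True))


def current_phase(status):
    """Decide which phase a UI would highlight for a job."""
    if status.finished:
        return "done"
    return "map" if status.map_progress < 1.0 else "reduce"


# Example: simulate one job moving from the map phase into the reduce phase.
server = InMemoryServer()
listener = ProgressListener(server)
listener.job_started("script-1", "job_0001")
listener.progress_updated("script-1", "job_0001", 1.0, 0.4)
print(current_phase(server.jobs[("script-1", "job_0001")]))
```

Keeping the listener free of any display logic is the point of the design: the script only reports what happened, and the server side decides how to present it.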


The screenshot above shows Lipstick in action.  In this example the developer would see:
  • This script compiled into four MapReduce jobs (two of which we can see represented by the blue bounding boxes)
  • Which logical operations execute in the mappers (blue header) vs the reducers (orange header)
  • Row counts from load / store / dump operations, as well as in between MapReduce jobs
Had the script been executing at the time, the boxes representing MapReduce jobs would have been flashing blue or orange to show whether they were in the map or reduce phase, and intermediate row counts would have been updating periodically as the Pig script sent heartbeats back to the Lipstick server.

Lipstick has many cool features (check out the user guide to learn more), but there are two that we think are especially useful:
  • Clicking on intermediate row counts between MapReduce jobs displays a sample of intermediate results.
  • A toggle switches between optimized and unoptimized versions of the logical plan. This allows users to easily see how Pig is applying optimizations to the script (e.g. filters pushed into the loader).
In the months we've been using Lipstick, it has already proven its worth many times over, and we are just getting started. If you would like to use Lipstick yourself or help us make it better, download it and give us your feedback. If you like building tools that make it easier to work with big data (like Lipstick), check out our jobs page as well.