by: Daniel C. Weeks
In a post last year we discussed our big data architecture and the advantages of working with big data in the cloud. One of the key points from the article is that Netflix leverages Amazon’s Simple Storage Service (S3) as the “source of truth” for all data warehousing. This differentiates us from the more traditional configuration where Hadoop’s distributed file system is the storage medium with data and compute residing in the same cluster. Decentralizing the data warehouse frees us to explore new ways to manage big data infrastructure but also introduces a new set of challenges.
From a platform management perspective, being able to run multiple clusters isolated by concern is both convenient and effective. We experiment with new software and perform live upgrades by simply diverting jobs from one cluster to another, or by adjusting the size and number of clusters based on need as opposed to capacity. Genie, our execution service, abstracts the configuration and resource management for job submissions by providing a centralized service to query across all big data resources. This cohesive infrastructure separates orchestration from execution and allows the platform team to be flexible and adapt to dynamic environments without impacting users of the system.
However, for a user of the system, understanding where and how a particular job executes can be confusing. We have hundreds of platform users, ranging from people running casual queries to ETL developers and data scientists running tens to hundreds of queries every day. Navigating the maze of tools, logs, and data to gather information about a specific run can be difficult and time consuming. Some of the most common questions we hear are:
Why did my job run slower today than yesterday?
Can we expand the cluster to speed up my job?
What cluster did my job run on?
How do I get access to task logs?
These questions can be hard to answer in our environment because clusters are not persistent. By the time someone notices a problem, the cluster that ran the query, along with detailed information, may already be gone or archived.
To help answer these questions and empower our platform users to explore and improve their job performance, we created a tool: Inviso (Latin: to go to see, visit, inspect, look at). Inviso is a job search and visualization tool intended to help big data users understand execution performance. Netflix is pleased to add Inviso to our open source portfolio under the Apache License v2.0; it is available on GitHub.
Inviso provides an easy interface to find jobs across all clusters, access other related tools, visualize performance, make detailed information accessible, and understand the environment in which jobs run.
Searching for Jobs
Finding a specific job run should be easy, but with each Hive or Pig script abstracting multiple Hadoop jobs, finding and pulling together the full execution workflow can be painful. To simplify this process, Inviso indexes every job configuration across all clusters into ElasticSearch and provides a simple search interface over them. Indexing job configurations into ElasticSearch is trivial because the structure is simple and flat. With the ability to use the full Lucene query syntax, finding jobs is straightforward and powerful.
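To make the indexing idea concrete, here is a minimal sketch of flattening a job configuration into a document and building a search request that passes a Lucene query string through to ElasticSearch. The field names, sort field, and document shape are illustrative assumptions, not Inviso's actual schema.

```python
# Sketch: flatten a Hadoop job configuration for indexing, and build an
# ElasticSearch request using the full Lucene query syntax.
# Field names and document shape are assumptions for illustration.

def flatten_config(config, job_id, cluster):
    """Turn a job's key/value configuration into one flat document."""
    doc = {"job.id": job_id, "cluster": cluster}
    doc.update(config)  # Hadoop configs are already flat key/value pairs
    return doc

def search_body(lucene_query, size=50):
    """Build a query_string search body; newest runs first."""
    return {
        "query": {"query_string": {"query": lucene_query}},
        "sort": [{"timestamp": {"order": "desc"}}],  # assumed timestamp field
        "size": size,
    }

doc = flatten_config({"mapreduce.job.user.name": "alice"}, "job_1_0001", "adhoc")
body = search_body('mapreduce.job.user.name:alice AND cluster:adhoc')
```

Because the documents are flat, any configuration property becomes a searchable field with no mapping gymnastics, which is what makes the Lucene syntax so effective here.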
The search results are displayed in a concise table in reverse chronological order, with continuous scrollback and links to related tools like the job history page, Genie, or Lipstick. Clicking a link takes you directly to the specific page for that job. Being able to look back over months of different runs of the same job allows for detailed analysis of how the job evolves over time.
In addition to the interface provided by Inviso, the ElasticSearch index is quite effective for other use cases. Since the index contains the full text of each Hive or Pig script, searching for table or UDF usage is possible as well. Internally, we use the index to find dependencies and scripts when modifying, deprecating, or upgrading data sources, UDFs, etc. For example, when we last upgraded Hive, the new version had keyword conflicts with some existing scripts, and we were able to identify the scripts and owners to upgrade prior to rolling out the new version of Hive. Others use it to identify who is using a specific table before changing its structure or retiring it.
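A dependency search of this kind can be expressed as a phrase query over the indexed script text. The field names below (`hive.query.string`, `pig.script`) are hypothetical, chosen only to illustrate the shape of such a query:

```python
# Sketch: build a Lucene query that finds every indexed job whose script
# text mentions a given table or UDF. Field names are assumptions.

def usage_query(name, fields=("hive.query.string", "pig.script")):
    """Phrase-match `name` across the assumed script-text fields."""
    clause = " OR ".join('%s:"%s"' % (f, name) for f in fields)
    return {"query": {"query_string": {"query": clause}}}

q = usage_query("prod.playback_sessions")  # hypothetical table name
```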
Visualizing Job Performance
Simply finding a job and the corresponding Hadoop resources doesn’t make it any easier to understand its performance. Stages of a Hive or Pig script might execute serially or in parallel, impacting the total runtime. Inviso correlates the various stages and lays them out in a swimlane diagram to show the parallelism. Hovering over a job provides detailed information, including the full set of counters. Which stages take the longest, and where to focus effort to improve performance, is readily apparent.
|Overview Diagram Showing Stages of a Pig Job|
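The swimlane layout itself can be sketched with a simple greedy interval placement: overlapping stages get separate lanes so the parallelism is visible. This is an illustration of the idea, not Inviso's actual implementation.

```python
# Sketch: assign each (name, start, end) stage to the first free swimlane,
# so overlapping stages land in different lanes.

def assign_lanes(stages):
    """stages: list of (name, start, end); returns {name: lane_index}."""
    lanes = []       # end time of the last stage placed in each lane
    placement = {}
    for name, start, end in sorted(stages, key=lambda s: s[1]):
        for i, lane_end in enumerate(lanes):
            if start >= lane_end:        # lane is free again
                lanes[i] = end
                placement[name] = i
                break
        else:                            # all lanes busy: open a new one
            lanes.append(end)
            placement[name] = len(lanes) - 1
    return placement

# Two stages overlap, so they occupy different lanes; the third reuses lane 0.
layout = assign_lanes([("job_1", 0, 10), ("job_2", 5, 12), ("job_3", 11, 20)])
```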
Below the workflow diagram is a detailed task diagram for each job showing the individual execution of every task attempt. Laying these out in time order shows how tasks were allocated and executed. This visual presentation can quickly convey obvious issues with jobs, including data skew, slow attempts, inconsistent resource allocation, speculative execution, and locality problems. Visualizing job performance in this compact format allows users to quickly scan the behavior of many jobs for problems. Hovering over an individual task brings up task-specific details, including counters, making it trivial to compare task details and performance.
Tasks are ordered by scheduler allocation providing insight into how many resources were available at the time and how long it took for the attempt to start. The color indicates the task type or status. Failed or killed tasks even present the failure reason and stack trace, so delving into the logs isn’t necessary. If you do want to look at a specific task log, simply select the task and click the provided link to go directly to the log for that task.
|Diagram Showing Task Details|
|Diagram of Execution Locality|
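One of the issues the task diagram exposes, data skew, also lends itself to a simple programmatic check: a few attempts running far longer than the median is the classic signature. The 4x threshold below is an arbitrary assumption for the sketch.

```python
# Sketch: flag task attempts whose duration is far above the median,
# a rough signal of data skew. The factor is an assumed threshold.

def skewed_tasks(durations, factor=4.0):
    """Return indices of tasks whose duration exceeds factor * median."""
    ordered = sorted(durations)
    median = ordered[len(ordered) // 2]
    return [i for i, d in enumerate(durations) if d > factor * median]

# Nine similar reducers and one straggler: a classic skew signature.
suspects = skewed_tasks([30, 31, 29, 33, 30, 28, 32, 30, 31, 400])
```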
The detailed information used to populate this view comes directly from the job history file produced for every MapReduce job. Inviso has a single REST endpoint to parse the detailed information for a job and represent it as a JSON structure. While this capability is similar to what the MapReduce History Server REST API provides, the difference is that Inviso returns the complete structure in a single response. Gathering this information from the History Server would require thousands of requests with the current API and could impact the performance of other tools that rely on the History Server, such as Pig and Hive clients. We also use this REST API to collect metrics, aggregate job statistics, and identify performance issues and failure causes.
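Consuming that single response for metrics collection might look like the sketch below. The field names mimic MapReduce history data but are assumptions about the exact structure Inviso returns.

```python
# Sketch: summarize a parsed job-history JSON structure for metrics
# collection. Field names are assumptions for illustration.

def summarize(history):
    """Aggregate per-task details into one job-level summary."""
    tasks = history["tasks"]
    failed = [t for t in tasks if t["status"] == "FAILED"]
    total_time = sum(t["finishTime"] - t["startTime"] for t in tasks)
    return {
        "jobId": history["jobId"],
        "taskCount": len(tasks),
        "failedCount": len(failed),
        "totalTaskMillis": total_time,
    }

sample = {
    "jobId": "job_1_0001",
    "tasks": [
        {"status": "SUCCEEDED", "startTime": 0, "finishTime": 5000},
        {"status": "FAILED", "startTime": 0, "finishTime": 2000},
    ],
}
stats = summarize(sample)
```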
With job performance we tend to think of how a job will run in isolation, but that’s rarely the case in any production environment. At Netflix, we have clusters isolated by concern: multiple clusters for production jobs, ad-hoc queries, reporting, and some dedicated to smaller efforts (e.g. benchmarking, regression testing, test data pipeline). The performance of any specific run is a function of the cluster capacity and the allocation assigned by the scheduler. If someone is running a job at the same time as our ETL pipeline, which has higher weight, they might get squeezed out due to the priority we assign ETL.
Similar to how it indexes job configurations, Inviso polls REST endpoints on each cluster's Resource Manager for current metrics and indexes the results into ElasticSearch. With this information we can query and reconstitute the state of a cluster for any timespan going back days or months. So even though a cluster may be gone or the job details purged from the system, you can look back at how busy the cluster was when a job ran to determine whether the performance was due to congestion.
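The polling step amounts to stamping each Resource Manager response and turning it into an ElasticSearch document. YARN's Resource Manager exposes cluster-wide counts at `/ws/v1/cluster/metrics`; the document shape below is an assumption for the sketch.

```python
import time

# Sketch: convert a YARN Resource Manager /ws/v1/cluster/metrics response
# into a timestamped document for ElasticSearch. Doc shape is assumed.

def to_metrics_doc(cluster, rm_response, ts=None):
    """Stamp the poll and keep the container counts we care about."""
    m = rm_response["clusterMetrics"]
    return {
        "cluster": cluster,
        "timestamp": ts if ts is not None else int(time.time() * 1000),
        "containersAllocated": m["containersAllocated"],
        "containersPending": m["containersPending"],
        "containersReserved": m["containersReserved"],
    }

resp = {"clusterMetrics": {"containersAllocated": 120,
                           "containersPending": 430,
                           "containersReserved": 8}}
doc = to_metrics_doc("etl", resp, ts=1000)
```

Because each document carries its own timestamp and cluster name, reconstituting a cluster's state for any timespan is just a range query over the index, even after the cluster itself is gone.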
|Application Stream: Running applications with ETL and Data Pipeline Activity Highlighted|
In a second graph on the same page, Inviso displays the capacity and backlog on the cluster using the running, reserved, and pending task metrics available from the Resource Manager’s REST API. This view has a range selector to adjust the timespan in the first graph and looks back over a longer period.
This view provides a way to gauge the load and backlog of the cluster. When large jobs are submitted, pending tasks spike and slowly drain as the cluster works them off. If the cluster is unable to work off the backlog, it might need to be expanded. Another insight this view provides is periods when clusters have underutilized capacity. For example, ad-hoc clusters are used less frequently at night, which is an opportune time to run a large backfill job. Inviso makes these types of usage patterns clear so we can shift resources or adjust usage patterns to take full advantage of the cluster.
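The same reads can be made programmatically over the indexed metric series: a draining pending count means the backlog cleared, and long stretches of low allocation mark windows suitable for backfill. The 30% threshold here is an arbitrary assumption for the sketch.

```python
# Sketch: read backlog and underutilization off a metric time series.
# The low-utilization threshold is an assumed cutoff.

def utilization_windows(samples, capacity, low=0.3):
    """samples: list of (timestamp, allocated); return underutilized stamps."""
    return [ts for ts, alloc in samples if alloc < low * capacity]

def backlog_cleared(pending_series):
    """True if the pending backlog drains to zero by the end of the series."""
    return pending_series[-1] == 0

idle = utilization_windows([(1, 900), (2, 100), (3, 80)], capacity=1000)
drained = backlog_cleared([500, 300, 120, 0])
```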
Putting it all Together
With the increasing size and complexity of Hadoop deployments, being able to locate jobs and understand their performance is key to running an efficient platform. Inviso provides a convenient view into the inner workings of jobs and the platform. By simply overlaying a new view on existing infrastructure, Inviso can operate inside any Hadoop environment with a small footprint and provide easy access and insight.
Given an existing cluster (in the datacenter or cloud), setting up Inviso should only take a few minutes, so give it a shot. If you like it and want to make it better, send some pull requests our way.