Wednesday, May 22, 2013

Garbage Collection Visualization

What is garbage collection visualization?

By Brian Moore

In short, “garbage collection visualization” (hereafter shortened to “gcviz”) is the process of turning gc.log[1] into x/y time-series scatter plots (pictures). That is, turning garbage collector (GC) logging output into two types of charts, GC event charts:



and heap size charts:
A GC event occurs when one of the JVM-configured collectors operates on the heap, and a heap size chart shows the size of the heap after a GC event. Both of these types of information are present in gc.log[1].

Setting the context for gcviz

Earlier, Shobana wrote about the metadata that Video Metadata Services (VMS) maintains in-memory for clients to access along extremely low-latency (microseconds), high-volume (10^11 daily requests), large-data (gigabytes) code paths. Being able to serve-up data in this setting requires a deep understanding of Java garbage collection.


Netflix is (mostly) a Java shop. Netflix deploys Java applications into Apache Tomcat, an application server (itself written in Java). Tomcat, the Netflix application, and all of the libraries that are dependencies of the Netflix application allocate from the same heap. Netflix Java applications are typically long-running (weeks/months) and have large heaps (typically tens of gigabytes).

At this scale, the overhead required to manage the heap is significant, and garbage collection pauses that last more than 500 milliseconds will almost certainly interfere with network activity.

Any GC event that “stops the world” (STW) will pause Tomcat, the application and the libraries the application needs to run. New inbound connections cannot be established (the Tomcat accept thread is blocked) and I/O on outbound connections will stall until the GC completes (each java thread in a STW event is at a safepoint and unable to accomplish meaningful work). It is therefore important to ensure that any required GC pauses are kept as small as possible to ensure that an application remains available. GC pauses are often seen as the cause of an issue when in fact they are a side-effect of the allocation behavior of the combination of Tomcat, the application and the libraries the application uses.

Before gcviz was written, Netflix determined the influence (or absence) of GC events in outages via several methods:
  • hand-crafting excel spreadsheets
  • using PrintGCStats/PrintGCFixup/GCHisto
  • visually skimming gc.log, looking for “long” stop-the-world events
While these methods were effective, they were time-consuming and difficult for many Netflix developers. Inside Netflix, gcviz has been incorporated into Netflix’s base AMI and integrated into Asgard, making visual GC event analysis of any application trivial (a click of a button) for all Netflix developers.

Why is gcviz important?
gcviz is important for several reasons:
  • Convenient: The developer loads the event and heap chart pages directly into their browser by clicking on a link in Asgard. gcviz does all the processing behind the scenes. This convenience allows us to quickly reinforce or reject the oft-repeated claim: bad application behavior is "because of long GC pauses".
  • Visual: Images can convey time-series information more quickly and densely than text.
  • Causation: gcviz can help application developers establish or refute a correlation between time-of-day-based observed application degradation and JVM-wide GC behavior.
  • Clarity: The semantics of each of the different GC event types (concurrent-mark, ParNew) can be made implicit in the image instead of being given equal weight in the text of gc.log. This is useful because, for example, a long running concurrent-sweep should be no cause for alarm because it runs in parallel with the application program.
  • Iterative: gcviz allows for quick interpretation of experimental results. What effect does modifying a particular setting or algorithm have on the heap usage or GC subsystem? gcviz allows one to quickly and visually understand any impact a GC-related change has made. gcviz is now being effectively used in canary deployments within Netflix where GC behavior is compared to an existing baseline.

Prior (and contemporary) Art
Netflix is not the first organization that has seen the benefits of visualizing garbage collection data. Other projects involved in the visualization or interpretation of garbage collection data include the following:

Why develop another solution?
While the tools mentioned above work well under their own conditions, Netflix had other requirements and constraints to consider and realized that none of the existing tools met all of Netflix’s needs. The following requirements were considered and followed during the design and implementation of gcviz. gcviz needs to:
  • Run outside the context of the application under analysis, but on each instance on which the application runs.
  • Use gc.log as the source-of-truth
    • historical analysis (no need to attach to a running vm)
    • filesystem/logs can be read even when HTTP port is busy and/or application threads are blocked
  • Read gc logs with the JVM options that Netflix commonly (but not universally) uses:
    • -verbose:gc
    • -verbose:sizes
    • -Xloggc:/apps/tomcat/logs/gc.log
    • -XX:+PrintGCDetails
    • -XX:+PrintGCDateStamps
    • -XX:+PrintTenuringDistribution
    • -XX:-UseGCOverheadLimit
    • -XX:+ExplicitGCInvokesConcurrent
    • -XX:+UseConcMarkSweepGC
    • -XX:+CMSScavengeBeforeRemark
    • -XX:+CMSParallelRemarkEnabled
  • Gracefully handle any arguments passed to the JVM.
  • Remain independent of any special terminal/display requirements (must be able to run inside and outside Netflix’s base AMI without modification)
  • Be able to run standalone to leverage special display capabilities if they are present
  • Be accessible from Asgard
  • Be dead-simple to use, requiring no specialized gc knowledge
  • Correlate netflix-internal events with gc activity. One example of a netflix-internal event is a cache refresh.
  • Retain its reports over time to enable long-term comparative analysis of a given application.

A Small Example: GC Event Chart

To understand how the GC event charts are constructed, it may be helpful to consider an example. In the picture below, gcviz visualizes a gc.log that contains three gc events:


  • a DefNew GC event at 60 seconds after JVM boot,
  • a DefNew GC event at 120 seconds after JVM boot, and
  • a DefNew GC event at 180 seconds after JVM boot


These three events took 100 milliseconds, 200 milliseconds, and 100 milliseconds, respectively. These three events would produce a chart with three DefNew “dots” on it. The “x” value of the dot would be the seconds since JVM boot, converted to absolute time, and the “y” value of the dot would be the duration in seconds. The red color of the dot indicates that DefNew is a “stop-the-world” GC event. Hopefully this is more clear in pictorial form:



Real-world GC Event Charts

In mid-2012, Netflix used gcviz to analyze some performance problems a class of applications was having. These performance problems had 80-120 second garbage collection pauses as one of its symptoms. The gcviz event chart looked like this:


After the problem was identified and fixed the event chart looked like this:


The pauses had been reduced from 120 seconds to under 5 seconds. In addition, any Netflix engineer could quickly see that the symptom had been eliminated without needing to dig through gc.log.

Real-world Heap Size Charts

In another mid-2012 event, the heap usage of a Netflix application went higher than the designers of that application intended. The application designers intended memory usage to peak around 15GB. The chart below shows that heap usage peaked at 2.0 times 10^7 kilobytes (19.07 gigabytes) and it also shows “flumes” for periods of time where the heap usage went above its previous/standard high-water mark:


after that problem was identified and fixed the high-water mark returned to 1.6 times 10^7 kilobytes (15.23 gigabytes) and the flumes reduced and the heap usage pattern became far more regular. In addition the low-water mark dropped from 11.44 gigabytes to 9.54 gigabytes:

JVM collector compatibility

Currently, gcviz supports all of the HotSpot/JDK 7 collectors, with the exception of G1.

Additionally collected data

In addition to collecting gc.log data, gcviz collects additional system information (cpu usage, network usage, underlying virtual memory usage, etc.) to help correlate gc events with application events. In addition gcviz can be configured to capture a jmap histogram of live objects by object count, bytes required and class name.

Open-sourcing details

gcviz has been open sourced under the Apache License, version 2.0 at https://github.com/Netflix/gcviz

Conclusion
As a company, Netflix considers data visualization of paramount importance. Most of Netflix’s major systems contain significant visualization components (for example, the Hystrix Dashboard and Pytheas). Most visualization at Netflix occurs on continuous data, but visualizing discrete data has its place too. Being able to quickly differentiate between a GC-indicated allocation problem and other types of error conditions has been valuable in operating the many services required to bring streaming TV and movies into living rooms all across the world.

Notes
[1] I didn’t want to lose you by quoting gc.log so early in the blog post! gc.log is a file created by the JVM option (for example) -Xloggc:/apps/tomcat/logs/gc.log. Specifying this option is recommended and common inside Netflix. Its contents are something like the following:


2013-01-01T18:30:16.651+0000: 2652735.877: [CMS-concurrent-sweep-start]
2013-01-01T18:30:21.777+0000: 2652741.003: [CMS-concurrent-sweep: 5.126/5.126 secs] [Times: user=5.13 sys=0.00, real=5.12 secs]
2013-01-01T18:30:21.777+0000: 2652741.004: [CMS-concurrent-reset-start]
2013-01-01T18:30:21.842+0000: 2652741.068: [CMS-concurrent-reset: 0.065/0.065 secs] [Times: user=0.06 sys=0.00, real=0.07 secs]
2013-01-01T19:28:47.041+0000: 2656246.267: [GC 2656246.267: [ParNewDesired survivor size 786432000 bytes,new threshold 15 (max 15)- age   1:   26395600 bytes,   26395600 total- age   2:       1376 bytes,   26396976 total- age   3:       4184 bytes,   26401160 total- age   4:    9591072 bytes,   35992232 total- age   5:     747344 bytes,   36739576 total- age   6:   18239512 bytes,   54979088 total- age   7:    7398216 bytes,   62377304 total- age   8:    4702664 bytes,   67079968 total- age   9:       5584 bytes,   67085552 total- age  10:       3728 bytes,   67089280 total- age  11:       2416 bytes,   67091696 total- age  12:   10838496 bytes,   77930192 total- age  13:    1682368 bytes,   79612560 total- age  14:   17756736 bytes,   97369296 total- age  15:    6775352 bytes,  104144648 total: 6103985K->124729K(10752000K), 0.1872850 secs] 11541109K->5564874K(29184000K), 0.1874910 secs] [Times: user=0.72 sys=0.00, real=0.19 secs]
2013-01-01T19:28:47.229+0000: 2656246.456: [GC [1 CMS-initial-mark: 5440145K(18432000K)] 5583733K(29184000K), 0.1454260 secs] [Times: user=0.15 sys=0.00, real=0.15 secs]
2013-01-01T19:28:47.375+0000: 2656246.602: [CMS-concurrent-mark-start]
2013-01-01T19:29:02.574+0000: 2656261.800: [CMS-concurrent-mark: 15.195/15.199 secs] [Times: user=15.24 sys=0.03, real=15.19 secs]
2013-01-01T19:29:02.574+0000: 2656261.801: [CMS-concurrent-preclean-start]
2013-01-01T19:29:02.638+0000: 2656261.864: [CMS-concurrent-preclean: 0.061/0.064 secs] [Times: user=0.06 sys=0.00, real=0.07 secs]
2013-01-01T19:29:02.638+0000: 2656261.865: [CMS-concurrent-abortable-preclean-start] CMS: abort preclean due to time 2013-01-01T19:29:08.589+0000: 2656267.816: [CMS-concurrent-abortable-preclean: 5.946/5.951 secs] [Times: user=5.96 sys=0.01, real=5.95 secs]