Wednesday, May 29, 2013

Conformity Monkey - Keeping your cloud instances following best practices

By Michael Fu and Cory Bennett, Engineering Tools

Cloud computing makes it much easier to launch new applications or start new instances. At Netflix, engineers can easily launch a new application in Asgard with a few clicks. With this freedom there are sometimes consequences where launched applications or instances may not follow some best practices. This can happen when an engineer isn't familiar with best practices or when those practices have not been well publicized. For example, some required security groups may be missing from instances and can cause security gaps. Or perhaps a health check url is not defined for instances in Eureka which would result in automatic failure detection and failover being disabled.

Introducing the Conformity Monkey
At Netflix, we use Conformity Monkey, another member of Simian Army, to check all instances in our cloud for their conformity. Today, we are proud to announce that the source code for Conformity Monkey, is now open and available to the public.

What is Conformity Monkey?

Conformity Monkey is a service which runs in the Amazon Web Services (AWS) cloud looking for instances that are not conforming to predefined rules for the best practices. Similar to Chaos Monkey and Janitor Monkey, the design of Conformity Monkey is flexible enough to allow extending it to work with other cloud providers and conformity rules. By default, conformity check is performed every hour. The schedule can be easily re-configured to fit your business' need.

Conformity Monkey determines whether an instance is nonconforming by applying a set of rules on it. If any of the rules determines that the instance is not conforming, the monkey sends an email notification to the owner of the instance. We provide a collection of conformity rules in the open sourced version that are currently used at Netflix and believed general enough to be used by most users. The design of Conformity Monkey also makes it simple to customize rules or to add new ones.

There can be exceptions when you want to ignore warnings of a specific conformity rule for some applications. For example, a security group to open a specific port probably is not needed by instances of some applications. We allow you to customize the set of conformity rules to be applied to a cluster of instances by excluding unneeded ones.

How Conformity Monkey Works

Conformity Monkey works in two stages: mark and notify. First, Conformity Monkey loops through all autoscaling groups in your cloud and applies the specified set of conformity rules to the instances in each group. If any conformity rule determines an instance as not conforming, the autoscaling group is marked as nonconforming and the instances that break the rule are recorded. Every autoscaling group is associated with an owner email, which can be obtained from an internal system, or can be set in a configuration file. The simplest way is using a default email address, e.g. your team's email list for all the autoscaling groups. Conformity Monkey sends email notification about the nonconforming groups to the owner, with the details of the broken conformity rule and the instances that failed the conformity check. The application owners can then take necessary actions to fix the failed instances or to exclude the conformity rule if they believe the conformity check is not necessary for the application. We allow you to set different frequency for conformity check and notification. For example, at Netflix, conformity check is performed every hour, and notification is only sent once per day at noon time. This reduces the number of emails people receive about the same conformity warning. The real-time result of conformity check for every autoscaling group is shown in a separate UI.

Configuration and Customization

The conformity rules for each autoscaling group, and the parameters used to configure each individual rule, are all configurable. You can easily customize Conformity Monkey with the most appropriate set of rules for your autoscaling groups by setting Conformity Monkey properties in a configuration file. You can also create your own rules, and we encourage you to contribute your conformity rules to the project so that all can benefit.

Storage and Costs

Conformity Monkey stores its data in an Amazon SimpleDB table by default. You can easily check the SimpleDB records to find out the last conformity check results for your autoscaling groups. At Netflix we have a UI for showing the conformity check results and we have plans to open source it in the future as well.

There could be associated costs with Amazon SimpleDB, but in most cases the activity of Conformity Monkey should be small enough to fall within Amazon's Free Usage Tier. Ultimately the costs associated with running Conformity Monkey are your responsibility. For your reference, the costs of Amazon SimpleDB can be found at


Conformity Monkey helps keep our cloud instances following best practices. We hope you find Conformity Monkey to be useful for your business as well. We'd appreciate any feedback on it. We're always looking for new members to join the team. If you are interested in working on great open source software, take a look at for current openings!

Conformity Monkey Links

Wednesday, May 22, 2013

Garbage Collection Visualization

What is garbage collection visualization?

By Brian Moore

In short, “garbage collection visualization” (hereafter shortened to “gcviz”) is the process of turning gc.log[1] into x/y time-series scatter plots (pictures). That is, turning garbage collector (GC) logging output into two types of charts, GC event charts:

and heap size charts:

A GC event occurs when one of the JVM-configured collectors operates on the heap, and a heap size chart shows the size of the heap after a GC event. Both of these types of information are present in gc.log[1].

Setting the context for gcviz

Earlier, Shobana wrote about the metadata that Video Metadata Services (VMS) maintains in-memory for clients to access along extremely low-latency (microseconds), high-volume (10^11 daily requests), large-data (gigabytes) code paths. Being able to serve-up data in this setting requires a deep understanding of Java garbage collection.

Netflix is (mostly) a Java shop. Netflix deploys Java applications into Apache Tomcat, an application server (itself written in Java). Tomcat, the Netflix application, and all of the libraries that are dependencies of the Netflix application allocate from the same heap. Netflix Java applications are typically long-running (weeks/months) and have large heaps (typically tens of gigabytes).

At this scale, the overhead required to manage the heap is significant, and garbage collection pauses that last more than 500 milliseconds will almost certainly interfere with network activity.

Any GC event that “stops the world” (STW) will pause Tomcat, the application and the libraries the application needs to run. New inbound connections cannot be established (the Tomcat accept thread is blocked) and I/O on outbound connections will stall until the GC completes (each java thread in a STW event is at a safepoint and unable to accomplish meaningful work). It is therefore important to ensure that any required GC pauses are kept as small as possible to ensure that an application remains available. GC pauses are often seen as the cause of an issue when in fact they are a side-effect of the allocation behavior of the combination of Tomcat, the application and the libraries the application uses.

Before gcviz was written, Netflix determined the influence (or absence) of GC events in outages via several methods:
  • hand-crafting excel spreadsheets
  • using PrintGCStats/PrintGCFixup/GCHisto
  • visually skimming gc.log, looking for “long” stop-the-world events
While these methods were effective, they were time-consuming and difficult for many Netflix developers. Inside Netflix, gcviz has been incorporated into Netflix’s base AMI and integrated into Asgard, making visual GC event analysis of any application trivial (a click of a button) for all Netflix developers.

Why is gcviz important?
gcviz is important for several reasons:
  • Convenient: The developer loads the event and heap chart pages directly into their browser by clicking on a link in Asgard. gcviz does all the processing behind the scenes. This convenience allows us to quickly reinforce or reject the oft-repeated claim: bad application behavior is "because of long GC pauses".
  • Visual: Images can convey time-series information more quickly and densely than text.
  • Causation: gcviz can help application developers establish or refute a correlation between time-of-day-based observed application degradation and JVM-wide GC behavior.
  • Clarity: The semantics of each of the different GC event types (concurrent-mark, ParNew) can be made implicit in the image instead of being given equal weight in the text of gc.log. This is useful because, for example, a long running concurrent-sweep should be no cause for alarm because it runs in parallel with the application program.
  • Iterative: gcviz allows for quick interpretation of experimental results. What effect does modifying a particular setting or algorithm have on the heap usage or GC subsystem? gcviz allows one to quickly and visually understand any impact a GC-related change has made. gcviz is now being effectively used in canary deployments within Netflix where GC behavior is compared to an existing baseline.

Prior (and contemporary) Art
Netflix is not the first organization that has seen the benefits of visualizing garbage collection data. Other projects involved in the visualization or interpretation of garbage collection data include the following:

Why develop another solution?
While the tools mentioned above work well under their own conditions, Netflix had other requirements and constraints to consider and realized that none of the existing tools met all of Netflix’s needs. The following requirements were considered and followed during the design and implementation of gcviz. gcviz needs to:
  • Run outside the context of the application under analysis, but on each instance on which the application runs.
  • Use gc.log as the source-of-truth
    • historical analysis (no need to attach to a running vm)
    • filesystem/logs can be read even when HTTP port is busy and/or application threads are blocked
  • Read gc logs with the JVM options that Netflix commonly (but not universally) uses:
    • -verbose:gc
    • -verbose:sizes
    • -Xloggc:/apps/tomcat/logs/gc.log
    • -XX:+PrintGCDetails
    • -XX:+PrintGCDateStamps
    • -XX:+PrintTenuringDistribution
    • -XX:-UseGCOverheadLimit
    • -XX:+ExplicitGCInvokesConcurrent
    • -XX:+UseConcMarkSweepGC
    • -XX:+CMSScavengeBeforeRemark
    • -XX:+CMSParallelRemarkEnabled
  • Gracefully handle any arguments passed to the JVM.
  • Remain independent of any special terminal/display requirements (must be able to run inside and outside Netflix’s base AMI without modification)
  • Be able to run standalone to leverage special display capabilities if they are present
  • Be accessible from Asgard
  • Be dead-simple to use, requiring no specialized gc knowledge
  • Correlate netflix-internal events with gc activity. One example of a netflix-internal event is a cache refresh.
  • Retain its reports over time to enable long-term comparative analysis of a given application.

A Small Example: GC Event Chart

To understand how the GC event charts are constructed, it may be helpful to consider an example. In the picture below, gcviz visualizes a gc.log that contains three gc events:

  • a DefNew GC event at 60 seconds after JVM boot,
  • a DefNew GC event at 120 seconds after JVM boot, and
  • a DefNew GC event at 180 seconds after JVM boot

These three events took 100 milliseconds, 200 milliseconds, and 100 milliseconds, respectively. These three events would produce a chart with three DefNew “dots” on it. The “x” value of the dot would be the seconds since JVM boot, converted to absolute time, and the “y” value of the dot would be the duration in seconds. The red color of the dot indicates that DefNew is a “stop-the-world” GC event. Hopefully this is more clear in pictorial form:

Real-world GC Event Charts

In mid-2012, Netflix used gcviz to analyze some performance problems a class of applications was having. These performance problems had 80-120 second garbage collection pauses as one of its symptoms. The gcviz event chart looked like this:

After the problem was identified and fixed the event chart looked like this:

The pauses had been reduced from 120 seconds to under 5 seconds. In addition, any Netflix engineer could quickly see that the symptom had been eliminated without needing to dig through gc.log.

Real-world Heap Size Charts

In another mid-2012 event, the heap usage of a Netflix application went higher than the designers of that application intended. The application designers intended memory usage to peak around 15GB. The chart below shows that heap usage peaked at 2.0 times 10^7 kilobytes (19.07 gigabytes) and it also shows “flumes” for periods of time where the heap usage went above its previous/standard high-water mark:

after that problem was identified and fixed the high-water mark returned to 1.6 times 10^7 kilobytes (15.23 gigabytes) and the flumes reduced and the heap usage pattern became far more regular. In addition the low-water mark dropped from 11.44 gigabytes to 9.54 gigabytes:

JVM collector compatibility

Currently, gcviz supports all of the HotSpot/JDK 7 collectors, with the exception of G1.

Additionally collected data

In addition to collecting gc.log data, gcviz collects additional system information (cpu usage, network usage, underlying virtual memory usage, etc.) to help correlate gc events with application events. In addition gcviz can be configured to capture a jmap histogram of live objects by object count, bytes required and class name.

Open-sourcing details

gcviz has been open sourced under the Apache License, version 2.0 at

As a company, Netflix considers data visualization of paramount importance. Most of Netflix’s major systems contain significant visualization components (for example, the Hystrix Dashboard and Pytheas). Most visualization at Netflix occurs on continuous data, but visualizing discrete data has its place too. Being able to quickly differentiate between a GC-indicated allocation problem and other types of error conditions has been valuable in operating the many services required to bring streaming TV and movies into living rooms all across the world.

[1] I didn’t want to lose you by quoting gc.log so early in the blog post! gc.log is a file created by the JVM option (for example) -Xloggc:/apps/tomcat/logs/gc.log. Specifying this option is recommended and common inside Netflix. Its contents are something like the following:

2013-01-01T18:30:16.651+0000: 2652735.877: [CMS-concurrent-sweep-start]
2013-01-01T18:30:21.777+0000: 2652741.003: [CMS-concurrent-sweep: 5.126/5.126 secs] [Times: user=5.13 sys=0.00, real=5.12 secs]
2013-01-01T18:30:21.777+0000: 2652741.004: [CMS-concurrent-reset-start]
2013-01-01T18:30:21.842+0000: 2652741.068: [CMS-concurrent-reset: 0.065/0.065 secs] [Times: user=0.06 sys=0.00, real=0.07 secs]
2013-01-01T19:28:47.041+0000: 2656246.267: [GC 2656246.267: [ParNewDesired survivor size 786432000 bytes,new threshold 15 (max 15)- age   1:   26395600 bytes,   26395600 total- age   2:       1376 bytes,   26396976 total- age   3:       4184 bytes,   26401160 total- age   4:    9591072 bytes,   35992232 total- age   5:     747344 bytes,   36739576 total- age   6:   18239512 bytes,   54979088 total- age   7:    7398216 bytes,   62377304 total- age   8:    4702664 bytes,   67079968 total- age   9:       5584 bytes,   67085552 total- age  10:       3728 bytes,   67089280 total- age  11:       2416 bytes,   67091696 total- age  12:   10838496 bytes,   77930192 total- age  13:    1682368 bytes,   79612560 total- age  14:   17756736 bytes,   97369296 total- age  15:    6775352 bytes,  104144648 total: 6103985K->124729K(10752000K), 0.1872850 secs] 11541109K->5564874K(29184000K), 0.1874910 secs] [Times: user=0.72 sys=0.00, real=0.19 secs]
2013-01-01T19:28:47.229+0000: 2656246.456: [GC [1 CMS-initial-mark: 5440145K(18432000K)] 5583733K(29184000K), 0.1454260 secs] [Times: user=0.15 sys=0.00, real=0.15 secs]
2013-01-01T19:28:47.375+0000: 2656246.602: [CMS-concurrent-mark-start]
2013-01-01T19:29:02.574+0000: 2656261.800: [CMS-concurrent-mark: 15.195/15.199 secs] [Times: user=15.24 sys=0.03, real=15.19 secs]
2013-01-01T19:29:02.574+0000: 2656261.801: [CMS-concurrent-preclean-start]
2013-01-01T19:29:02.638+0000: 2656261.864: [CMS-concurrent-preclean: 0.061/0.064 secs] [Times: user=0.06 sys=0.00, real=0.07 secs]
2013-01-01T19:29:02.638+0000: 2656261.865: [CMS-concurrent-abortable-preclean-start] CMS: abort preclean due to time 2013-01-01T19:29:08.589+0000: 2656267.816: [CMS-concurrent-abortable-preclean: 5.946/5.951 secs] [Times: user=5.96 sys=0.01, real=5.95 secs]

Thursday, May 16, 2013

Announcing Pytheas

Today, we are excited to bring you Pytheas : web resource and rich UI framework. This piece of software is heavily used at Netflix in building quick prototypes and web applications that explore/visualize large data sets.

Pytheas integrates Guice and Jersey frameworks to wire REST web-service endpoints together with dynamic UI controls in a web application. The framework is designed to support the most common web UI  components needed to build data exploration / dashboard style applications. It not only serves as a quick prototyping tool, but also acts as a foundation for integrating multiple data sources in a single application.

UI components bundle

The UI library bundled with Pytheas is based on a number of Javascript Open Source frameworks such as BootstrapJQuery-UIDataTablesD3 etc. It also contains a number of JQuery plugins that we wrote to support specific use cases that we encountered in building Netflix internal applications/ dashboards. Some of the plugins include support for ajax data driven selection boxes with dynamic filter control, pop-over dialog box form templates, inline portlets, breadcrumbs, loading spinner etc.

Modular Design

An application based on Pytheas framework consists of one or more Pytheas modulesEach module is loosely coupled from each other. The module is responsible for supplying its own data resources and in fact can also provide its own rendering mechanism. Each data resource is a Jersey REST endpoint owned by the module. 

By default Pytheas uses FreeMarker as the rendering template engine for each resource. The framework provide a library of reusable FreeMarker macros that can be embedded in a page to allow for rendering commonly used UI components. Each Pytheas module gets access to all the common page building blocks such as page layout containers, header, footer, navbar etc. which gets embedded by the framework.

Although Pytheas provides FreeMarker as the default template engine, the framework allows for plugging in your own template engine for each module. It'll need to supply its own Jersey Provider with it.

Getting Started

Pytheas project contains a simple helloworld application that serves as a template for building new applications using the framework. Please refer to instructions on how to run helloworld application from a command line.