Wednesday, May 29, 2013

Conformity Monkey - Keeping your cloud instances following best practices


By Michael Fu and Cory Bennett, Engineering Tools



Cloud computing makes it much easier to launch new applications or start new instances. At Netflix, engineers can launch a new application in Asgard with a few clicks. With this freedom, however, newly launched applications or instances may not follow best practices. This can happen when an engineer isn't familiar with the best practices or when those practices have not been well publicized. For example, required security groups may be missing from instances, opening security gaps. Or a health check URL may not be defined for instances in Eureka, which disables automatic failure detection and failover.

Introducing the Conformity Monkey
At Netflix, we use Conformity Monkey, another member of the Simian Army, to check all instances in our cloud for conformity. Today, we are proud to announce that the source code for Conformity Monkey is open and available to the public.

What is Conformity Monkey?

Conformity Monkey is a service that runs in the Amazon Web Services (AWS) cloud, looking for instances that do not conform to predefined best-practice rules. As with Chaos Monkey and Janitor Monkey, the design of Conformity Monkey is flexible enough to be extended to work with other cloud providers and other conformity rules. By default, a conformity check is performed every hour, and the schedule can easily be reconfigured to fit your business's needs.

Conformity Monkey determines whether an instance is nonconforming by applying a set of rules to it. If any rule determines that the instance is not conforming, the monkey sends an email notification to the owner of the instance. The open-sourced version includes a collection of conformity rules that are currently used at Netflix and that we believe are general enough for most users. The design of Conformity Monkey also makes it simple to customize rules or add new ones.
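Conceptually, each rule encapsulates one best-practice check and is applied independently to every instance. A minimal sketch of this idea (the class and field names here are hypothetical, not the project's actual Java interfaces):

```python
# Hypothetical sketch of a conformity rule: each rule checks one best
# practice and reports whether a given instance conforms.
class SecurityGroupRule:
    """Flags instances that are missing a required security group."""

    def __init__(self, required_group):
        self.required_group = required_group

    def check(self, instance):
        # `instance` is a plain dict standing in for cloud instance metadata.
        return self.required_group in instance.get("security_groups", [])

rule = SecurityGroupRule("internal-only")
conforming = rule.check({"id": "i-1234", "security_groups": ["internal-only", "web"]})
broken = rule.check({"id": "i-5678", "security_groups": ["web"]})
```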

There may be cases where you want to ignore the warnings of a specific conformity rule for some applications. For example, a security group that opens a specific port may not be needed by the instances of every application. We therefore allow you to customize the set of conformity rules applied to a cluster of instances by excluding the ones that do not apply.

How Conformity Monkey Works

Conformity Monkey works in two stages: mark and notify. First, Conformity Monkey loops through all the autoscaling groups in your cloud and applies the specified set of conformity rules to the instances in each group. If any rule determines that an instance is not conforming, the autoscaling group is marked as nonconforming and the instances that break the rule are recorded. Every autoscaling group is associated with an owner email, which can be obtained from an internal system or set in a configuration file; the simplest approach is to use a default email address, e.g. your team's mailing list, for all autoscaling groups.

Conformity Monkey then sends an email notification about each nonconforming group to its owner, with the details of the broken conformity rule and the instances that failed the check. The application owners can then fix the failing instances, or exclude the rule if they believe the check is not necessary for their application. Conformity checks and notifications can run at different frequencies: at Netflix, for example, the conformity check runs every hour, while notifications are sent only once per day, at noon. This reduces the number of emails people receive about the same conformity warning. The real-time conformity check result for every autoscaling group is shown in a separate UI.
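The mark stage described above can be sketched as a simple loop over groups and rules; the data model below is invented for illustration and is much simpler than the real implementation:

```python
# Sketch of the mark stage: apply every rule to every instance in each
# autoscaling group and record which instances break which rules.
def mark(groups, rules):
    nonconforming = {}
    for group in groups:
        failures = []
        for rule in rules:
            bad = [i["id"] for i in group["instances"] if not rule(i)]
            if bad:
                failures.append((rule.__name__, bad))
        if failures:
            nonconforming[group["name"]] = failures
    return nonconforming

def has_health_check(instance):
    # Hypothetical rule: every instance should define a health check URL.
    return bool(instance.get("health_check_url"))

groups = [{"name": "api-v1", "instances": [
    {"id": "i-1", "health_check_url": "/healthcheck"},
    {"id": "i-2"},
]}]
marked = mark(groups, [has_health_check])
```

The notify stage would then walk `marked`, look up each group's owner email, and send the details.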

Configuration and Customization

The conformity rules for each autoscaling group, and the parameters used to configure each individual rule, are all configurable. You can easily customize Conformity Monkey with the most appropriate set of rules for your autoscaling groups by setting Conformity Monkey properties in a configuration file. You can also create your own rules, and we encourage you to contribute your conformity rules to the project so that all can benefit.
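For a concrete feel, such a configuration might look something like the sketch below. The property names here are invented for illustration only; consult the project's sample configuration for the real keys:

```
# Invented property names -- illustrative only.
conformitymonkey.enabled = true
conformitymonkey.check.frequency = 1 hour
conformitymonkey.notify.frequency = 24 hours
conformitymonkey.rules = SecurityGroupRule,HealthCheckRule
conformitymonkey.cluster.myapp.excludedRules = SecurityGroupRule
```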

Storage and Costs

Conformity Monkey stores its data in an Amazon SimpleDB table by default. You can easily check the SimpleDB records to find out the latest conformity check results for your autoscaling groups. At Netflix we have a UI that shows the conformity check results, and we plan to open source it in the future as well.

There may be costs associated with Amazon SimpleDB, but in most cases the activity of Conformity Monkey should be small enough to fall within Amazon's Free Usage Tier. Ultimately, the costs of running Conformity Monkey are your responsibility. For reference, Amazon SimpleDB pricing can be found at http://aws.amazon.com/simpledb/pricing/


Summary

Conformity Monkey helps keep our cloud instances following best practices. We hope you find Conformity Monkey to be useful for your business as well. We'd appreciate any feedback on it. We're always looking for new members to join the team. If you are interested in working on great open source software, take a look at jobs.netflix.com for current openings!


Wednesday, May 22, 2013

Garbage Collection Visualization

What is garbage collection visualization?

By Brian Moore

In short, “garbage collection visualization” (hereafter shortened to “gcviz”) is the process of turning gc.log[1] into x/y time-series scatter plots. That is, gcviz turns garbage collector (GC) logging output into two types of charts. The first is GC event charts:



The second is heap size charts:
A GC event occurs when one of the JVM-configured collectors operates on the heap, and a heap size chart shows the size of the heap after a GC event. Both of these types of information are present in gc.log[1].

Setting the context for gcviz

Earlier, Shobana wrote about the metadata that Video Metadata Services (VMS) maintains in memory for clients to access via extremely low-latency (microseconds), high-volume (10^11 daily requests), large-data (gigabytes) code paths. Serving data in this setting requires a deep understanding of Java garbage collection.


Netflix is (mostly) a Java shop. Netflix deploys Java applications into Apache Tomcat, an application server (itself written in Java). Tomcat, the Netflix application, and all of the libraries that are dependencies of the Netflix application allocate from the same heap. Netflix Java applications are typically long-running (weeks/months) and have large heaps (typically tens of gigabytes).

At this scale, the overhead required to manage the heap is significant, and garbage collection pauses that last more than 500 milliseconds will almost certainly interfere with network activity.

Any GC event that “stops the world” (STW) will pause Tomcat, the application, and the libraries the application needs to run. New inbound connections cannot be established (the Tomcat accept thread is blocked) and I/O on outbound connections will stall until the GC completes (each Java thread in a STW event is at a safepoint and unable to accomplish meaningful work). It is therefore important to keep any required GC pauses as small as possible so that an application remains available. GC pauses are often seen as the cause of an issue when in fact they are a side effect of the allocation behavior of the combination of Tomcat, the application, and the libraries the application uses.

Before gcviz was written, Netflix determined the influence (or absence) of GC events in outages via several methods:
  • hand-crafting excel spreadsheets
  • using PrintGCStats/PrintGCFixup/GCHisto
  • visually skimming gc.log, looking for “long” stop-the-world events
While these methods were effective, they were time-consuming and difficult for many Netflix developers to use. Inside Netflix, gcviz has been incorporated into Netflix’s base AMI and integrated into Asgard, making visual GC event analysis of any application trivial (a click of a button) for all Netflix developers.

Why is gcviz important?
gcviz is important for several reasons:
  • Convenient: The developer loads the event and heap chart pages directly into their browser by clicking on a link in Asgard. gcviz does all the processing behind the scenes. This convenience allows us to quickly reinforce or reject the oft-repeated claim: bad application behavior is "because of long GC pauses".
  • Visual: Images can convey time-series information more quickly and densely than text.
  • Causation: gcviz can help application developers establish or refute a correlation between time-of-day-based observed application degradation and JVM-wide GC behavior.
  • Clarity: The semantics of each of the different GC event types (concurrent-mark, ParNew) can be made implicit in the image instead of being given equal weight in the text of gc.log. This is useful because, for example, a long-running concurrent-sweep is no cause for alarm: it runs concurrently with the application program.
  • Iterative: gcviz allows for quick interpretation of experimental results. What effect does modifying a particular setting or algorithm have on the heap usage or GC subsystem? gcviz allows one to quickly and visually understand any impact a GC-related change has made. gcviz is now being effectively used in canary deployments within Netflix where GC behavior is compared to an existing baseline.

Prior (and contemporary) Art
Netflix is not the first organization to see the benefits of visualizing garbage collection data; several other projects also visualize or interpret GC data.

Why develop another solution?
While the tools mentioned above work well under their own conditions, Netflix had other requirements and constraints to consider and found that none of the existing tools met all of its needs. The following requirements guided the design and implementation of gcviz. gcviz needs to:
  • Run outside the context of the application under analysis, but on each instance on which the application runs.
  • Use gc.log as the source-of-truth
    • historical analysis (no need to attach to a running vm)
    • filesystem/logs can be read even when HTTP port is busy and/or application threads are blocked
  • Read gc logs with the JVM options that Netflix commonly (but not universally) uses:
    • -verbose:gc
    • -verbose:sizes
    • -Xloggc:/apps/tomcat/logs/gc.log
    • -XX:+PrintGCDetails
    • -XX:+PrintGCDateStamps
    • -XX:+PrintTenuringDistribution
    • -XX:-UseGCOverheadLimit
    • -XX:+ExplicitGCInvokesConcurrent
    • -XX:+UseConcMarkSweepGC
    • -XX:+CMSScavengeBeforeRemark
    • -XX:+CMSParallelRemarkEnabled
  • Gracefully handle any arguments passed to the JVM.
  • Remain independent of any special terminal/display requirements (must be able to run inside and outside Netflix’s base AMI without modification)
  • Be able to run standalone to leverage special display capabilities if they are present
  • Be accessible from Asgard
  • Be dead-simple to use, requiring no specialized gc knowledge
  • Correlate netflix-internal events with gc activity. One example of a netflix-internal event is a cache refresh.
  • Retain its reports over time to enable long-term comparative analysis of a given application.

A Small Example: GC Event Chart

To understand how the GC event charts are constructed, it may be helpful to consider an example. In the picture below, gcviz visualizes a gc.log that contains three gc events:


  • a DefNew GC event at 60 seconds after JVM boot,
  • a DefNew GC event at 120 seconds after JVM boot, and
  • a DefNew GC event at 180 seconds after JVM boot


These three events took 100 milliseconds, 200 milliseconds, and 100 milliseconds, respectively, and would produce a chart with three DefNew “dots” on it. The “x” value of each dot is the seconds since JVM boot, converted to absolute time, and the “y” value is the duration in seconds. The red color of the dot indicates that DefNew is a “stop-the-world” GC event. Hopefully this is clearer in pictorial form:
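The mapping from log events to chart points can be sketched directly; the function below assumes the JVM boot time is already known (in practice it must be derived from the log):

```python
from datetime import datetime, timedelta

# Sketch: x is the event's absolute time (boot time plus seconds since
# boot), y is the pause duration in seconds.
def chart_point(boot_time, seconds_since_boot, duration_secs):
    return (boot_time + timedelta(seconds=seconds_since_boot), duration_secs)

boot = datetime(2013, 1, 1, 12, 0, 0)
events = [(60, 0.1), (120, 0.2), (180, 0.1)]  # the three DefNew events above
points = [chart_point(boot, t, d) for t, d in events]
```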



Real-world GC Event Charts

In mid-2012, Netflix used gcviz to analyze performance problems in a class of applications. Among the symptoms were 80-120 second garbage collection pauses. The gcviz event chart looked like this:


After the problem was identified and fixed, the event chart looked like this:


The pauses had been reduced from 120 seconds to under 5 seconds. In addition, any Netflix engineer could quickly see that the symptom had been eliminated without needing to dig through gc.log.

Real-world Heap Size Charts

In another mid-2012 event, the heap usage of a Netflix application grew higher than its designers intended. The designers expected memory usage to peak around 15GB, but the chart below shows that heap usage peaked at 2.0 times 10^7 kilobytes (19.07 gigabytes). It also shows “flumes” for periods of time where heap usage exceeded its previous/standard high-water mark:


After that problem was identified and fixed, the high-water mark returned to 1.6 times 10^7 kilobytes (15.23 gigabytes), the flumes diminished, and the heap usage pattern became far more regular. In addition, the low-water mark dropped from 11.44 gigabytes to 9.54 gigabytes:

JVM collector compatibility

Currently, gcviz supports all of the HotSpot/JDK 7 collectors, with the exception of G1.

Additionally collected data

In addition to collecting gc.log data, gcviz collects other system information (CPU usage, network usage, underlying virtual memory usage, etc.) to help correlate GC events with application events. gcviz can also be configured to capture a jmap histogram of live objects by object count, bytes required, and class name.

Open-sourcing details

gcviz has been open sourced under the Apache License, version 2.0 at https://github.com/Netflix/gcviz

Conclusion
As a company, Netflix considers data visualization of paramount importance. Most of Netflix’s major systems contain significant visualization components (for example, the Hystrix Dashboard and Pytheas). Most visualization at Netflix occurs on continuous data, but visualizing discrete data has its place too. Being able to quickly differentiate between a GC-indicated allocation problem and other types of error conditions has been valuable in operating the many services required to bring streaming TV and movies into living rooms all across the world.

Notes
[1] I didn’t want to lose you by quoting gc.log so early in the blog post! gc.log is a file created by the JVM option (for example) -Xloggc:/apps/tomcat/logs/gc.log. Specifying this option is recommended and common inside Netflix. Its contents are something like the following:


2013-01-01T18:30:16.651+0000: 2652735.877: [CMS-concurrent-sweep-start]
2013-01-01T18:30:21.777+0000: 2652741.003: [CMS-concurrent-sweep: 5.126/5.126 secs] [Times: user=5.13 sys=0.00, real=5.12 secs]
2013-01-01T18:30:21.777+0000: 2652741.004: [CMS-concurrent-reset-start]
2013-01-01T18:30:21.842+0000: 2652741.068: [CMS-concurrent-reset: 0.065/0.065 secs] [Times: user=0.06 sys=0.00, real=0.07 secs]
2013-01-01T19:28:47.041+0000: 2656246.267: [GC 2656246.267: [ParNew
Desired survivor size 786432000 bytes, new threshold 15 (max 15)
- age   1:   26395600 bytes,   26395600 total
- age   2:       1376 bytes,   26396976 total
- age   3:       4184 bytes,   26401160 total
- age   4:    9591072 bytes,   35992232 total
- age   5:     747344 bytes,   36739576 total
- age   6:   18239512 bytes,   54979088 total
- age   7:    7398216 bytes,   62377304 total
- age   8:    4702664 bytes,   67079968 total
- age   9:       5584 bytes,   67085552 total
- age  10:       3728 bytes,   67089280 total
- age  11:       2416 bytes,   67091696 total
- age  12:   10838496 bytes,   77930192 total
- age  13:    1682368 bytes,   79612560 total
- age  14:   17756736 bytes,   97369296 total
- age  15:    6775352 bytes,  104144648 total
: 6103985K->124729K(10752000K), 0.1872850 secs] 11541109K->5564874K(29184000K), 0.1874910 secs] [Times: user=0.72 sys=0.00, real=0.19 secs]
2013-01-01T19:28:47.229+0000: 2656246.456: [GC [1 CMS-initial-mark: 5440145K(18432000K)] 5583733K(29184000K), 0.1454260 secs] [Times: user=0.15 sys=0.00, real=0.15 secs]
2013-01-01T19:28:47.375+0000: 2656246.602: [CMS-concurrent-mark-start]
2013-01-01T19:29:02.574+0000: 2656261.800: [CMS-concurrent-mark: 15.195/15.199 secs] [Times: user=15.24 sys=0.03, real=15.19 secs]
2013-01-01T19:29:02.574+0000: 2656261.801: [CMS-concurrent-preclean-start]
2013-01-01T19:29:02.638+0000: 2656261.864: [CMS-concurrent-preclean: 0.061/0.064 secs] [Times: user=0.06 sys=0.00, real=0.07 secs]
2013-01-01T19:29:02.638+0000: 2656261.865: [CMS-concurrent-abortable-preclean-start]
 CMS: abort preclean due to time 2013-01-01T19:29:08.589+0000: 2656267.816: [CMS-concurrent-abortable-preclean: 5.946/5.951 secs] [Times: user=5.96 sys=0.01, real=5.95 secs]
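A line like the CMS-initial-mark entry above can be reduced to the few fields a plotting tool needs. The sketch below uses a deliberately simplified regular expression; real gc.log output varies considerably with the collector and JVM flags in use:

```python
import re

# One line from the sample log above.
LINE = ("2013-01-01T19:28:47.229+0000: 2656246.456: "
        "[GC [1 CMS-initial-mark: 5440145K(18432000K)] 5583733K(29184000K), "
        "0.1454260 secs] [Times: user=0.15 sys=0.00, real=0.15 secs]")

# Extract the wall-clock timestamp, the seconds-since-JVM-boot offset,
# and the real (wall-clock) duration of the event.
pattern = re.compile(
    r"^(?P<stamp>\S+): (?P<uptime>[\d.]+): .*real=(?P<real>[\d.]+) secs\]")
m = pattern.match(LINE)
event = {
    "timestamp": m.group("stamp"),
    "uptime_secs": float(m.group("uptime")),
    "real_secs": float(m.group("real")),
}
```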

Thursday, May 16, 2013

Announcing Pytheas

Today, we are excited to bring you Pytheas: a web resource and rich UI framework. This software is heavily used at Netflix for building quick prototypes and web applications that explore and visualize large data sets.

Pytheas integrates the Guice and Jersey frameworks to wire REST web-service endpoints together with dynamic UI controls in a web application. The framework is designed to support the most common web UI components needed to build data-exploration/dashboard-style applications. It not only serves as a quick prototyping tool but also acts as a foundation for integrating multiple data sources in a single application.




UI components bundle


The UI library bundled with Pytheas is based on a number of open-source JavaScript frameworks such as Bootstrap, jQuery UI, DataTables, and D3. It also contains a number of jQuery plugins that we wrote to support specific use cases we encountered while building Netflix internal applications and dashboards. The plugins include support for Ajax-data-driven selection boxes with dynamic filter controls, pop-over dialog box form templates, inline portlets, breadcrumbs, a loading spinner, and more.





Modular Design


An application based on the Pytheas framework consists of one or more Pytheas modules. The modules are loosely coupled from each other: each module is responsible for supplying its own data resources and can even provide its own rendering mechanism. Each data resource is a Jersey REST endpoint owned by the module.

By default, Pytheas uses FreeMarker as the rendering template engine for each resource. The framework provides a library of reusable FreeMarker macros that can be embedded in a page to render commonly used UI components. Each Pytheas module gets access to all the common page building blocks, such as page layout containers, header, footer, and navbar, which are embedded by the framework.

Although Pytheas provides FreeMarker as the default template engine, the framework allows you to plug in your own template engine for each module; a custom engine needs to supply its own Jersey Provider.







Getting Started


The Pytheas project contains a simple helloworld application that serves as a template for building new applications with the framework. Please refer to the project's instructions on how to run the helloworld application from the command line.



Wednesday, May 8, 2013

Object Cache for Scaling Video Metadata Management

By Shobana Radhakrishnan

As the largest Internet TV network, one of the most interesting challenges we face at Netflix is scaling services to the ever-increasing demands of over 36 million customers from over 40 countries.


Each movie or TV show on Netflix is described by a complex set of metadata. This includes the obvious information such as title, genre, synopsis, cast, maturity rating etc. It also includes links to images, trailers, encoded video files, subtitles and the individual episodes and seasons. Finally there are many tags that are used to create custom genres, such as “upbeat”, “cerebral”, “strong female lead”. These all have to be translated into many languages, so the actual text is tokenized and encoded.


This metadata must be made available for several different services, which each require a different facet of the data. Front-end services for display purposes need links to images, while algorithms that do discovery and recommendations use the tags extensively and search thousands of movies looking for the best few to show to a user. Powering this while utilizing resources extremely efficiently is one of the key goals of our Video Metadata Services (VMS) Platform.

Some examples of functionality enabled by VMS are metadata correlation for recommending titles, surfacing metadata such as actors and synopsis to help users make viewing choices (example below), and enabling streaming decisions based on device and bit rates.


As we set out to build the platform, there were a few key requirements to address:

  • Handle over 100 billion requests a day with extremely low latency for user-facing apps
  • Handle a very large dataset, 20-30GB across all countries and devices
  • Handle highly complex metadata processing (described in detail here)
  • Start up quickly so that auto-scaling works efficiently

We took advantage of the fact that real-time access to the very latest data is not necessary. New content carries a contract-start-time as part of its metadata, so the metadata is updated well in advance of the content being ready to show, and the personalization algorithms ignore a title until its start time is reached. The main reason to push an update is to fix metadata errors.
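The contract-start-time gating described above can be sketched as a simple filter (the field names here are hypothetical):

```python
from datetime import datetime

# Metadata can be published ahead of time; consumers simply ignore a
# title until its contract start time is reached.
def visible_titles(titles, now):
    return [t["title"] for t in titles if t["contract_start"] <= now]

catalog = [
    {"title": "Old Movie", "contract_start": datetime(2013, 1, 1)},
    {"title": "Upcoming Show", "contract_start": datetime(2013, 7, 1)},
]
shown = visible_titles(catalog, now=datetime(2013, 5, 8))
```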

We used the following approach in our initial cloud deployment:
  • Implement a server that interacts with existing relational database systems, periodically generates data snapshots, and uploads them to an S3 bucket
  • Configure one server per country to process that country's data and generate snapshots after evaluating data correlations and applying business and filter logic
  • Have a client-side Object Cache load the relevant snapshots, based on client application configuration, and make the data available to the application as in-memory Java objects

The cache is then refreshed periodically from snapshots generated by the VMS servers, with a frequency fine-tuned per application. The server side performs compression and various optimizations to store data very compactly in memory; the client side deserializes and further optimizes the data before constructing and refreshing the cache. The diagram below shows the architecture of this implementation.
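The refresh behavior can be sketched as follows. This is a toy single-threaded model, not the real VMS client, which also handles deserialization, optimization, and concurrency:

```python
# Sketch of a client-side object cache that swaps in each freshly
# loaded snapshot; the loader and snapshot shape are hypothetical.
class ObjectCache:
    def __init__(self, load_snapshot):
        self._load = load_snapshot
        self._data = {}

    def refresh(self):
        # Build the new snapshot fully before publishing it, so readers
        # never observe a half-loaded cache.
        self._data = self._load()

    def get(self, key):
        return self._data.get(key)

snapshots = iter([{"title:1": "v1"}, {"title:1": "v2"}])
cache = ObjectCache(lambda: next(snapshots))
cache.refresh()           # initial load
first = cache.get("title:1")
cache.refresh()           # periodic refresh picks up the new snapshot
second = cache.get("title:1")
```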


This worked very well for us and ran in production for some time, but we faced a new challenge as Netflix expanded internationally. In this architecture, we needed to configure server instances and process metadata for each country, whether the metadata was country-specific or global. For example, metadata such as trailers, subtitles, dubbing, language translations, localization, and contract information varies by country, but metadata such as genre, cast, and director does not. We started out serving the US catalog and added Canada as a second server; but when we added Latin America, we had to serve different content in every jurisdiction, which added 42 more variants, each needing its own server. The UK, Ireland, and the four Nordic countries added six more.

In addition to the operational overhead of 50 servers, this increased the client object cache footprint, because data was duplicated across the per-country setups. Client start-up time also went up, as the client worked hard to de-duplicate globally shared data to manage its footprint. This processing was repeated at each refresh, impacting user-facing application behavior. All of this prompted us to look for a solution that would scale on these fronts while supporting increasing business needs.

In a previous post we shared results from a case study for memory optimization with the NetflixGraph Library used extensively by our recommendations engine.

Following these encouraging results and also based on identifying areas of duplicate data or processing, we made a few changes in the architecture:

  • Streamlined our VMS Server to be structured around islands (a collection of countries that have similar characteristics) instead of per country
  • Moved metadata processing and de-duplication to the server side and applied memory optimization techniques based on NetflixGraph to the generated blobs
  • Enabled operationally easier configuration based on what metadata an application was interested in rather than all the data
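The island idea above can be sketched as grouping countries by shared catalog characteristics, so that countries with identical traits share one server-side blob (the groupings below are invented for illustration):

```python
from collections import defaultdict

# Group countries whose catalogs have identical characteristics into
# one "island", so a single blob can serve all of them.
def group_into_islands(country_traits):
    islands = defaultdict(list)
    for country, traits in country_traits.items():
        islands[traits].append(country)
    return {traits: sorted(countries) for traits, countries in islands.items()}

islands = group_into_islands({
    "SE": ("nordic-catalog",),
    "NO": ("nordic-catalog",),
    "GB": ("uk-catalog",),
})
```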

The architecture after these changes is shown below with the key changes highlighted.


This helped achieve a huge reduction in our cache memory footprint as well as significantly better warm-up and refresh times, in addition to simplifying operational management by requiring fewer servers.

VMS leverages and integrates with several Open Source solutions available from Netflix such as Karyon, Netflix Graph, Governator, Curator and Archaius and is a major user of our Open Source ecosystem. The entire Metadata Platform Infrastructure is also tested using the Chaos Monkey and other members of the Simian Army to ensure it is resilient.

Conclusions
An object cache with periodic refreshes is a good solution when there is a low latency requirement with relatively high tolerance for staleness for large amounts of data. Furthermore, moving heavy data processing to the server-side improves client-side cache performance characteristics, especially when the request-response pattern between the client and the server does not involve real-time queries.

Questions we continue to work on include:
  • How do we best add a large amount of extra metadata just for Netflix original series like House of Cards?
  • How do we ensure cache efficiency with ever-changing business landscape and needs?
  • How can we ensure that updates can propagate through system components quickly and become available to user-facing applications and services right away?
  • How quickly can data issues be identified/resolved for a seamless user experience?

Given the very distributed nature of the processing and systems, tooling is as important as building solid data propagation mechanisms, data correctness and memory optimization.

We have more work planned around the above, including moving to an event-based architecture for faster metadata propagation and further structural streamlining of the blob images so we can remain resilient as our data size increases over time. Stay tuned for more on these in upcoming blog posts.

If you are interested in solving such challenges, check out jobs.netflix.com or reach out to me on LinkedIn; we are always looking for top-notch developers and QA engineers to join our team.