Monday, November 26, 2012

Introducing Hystrix for Resilience Engineering

by Ben Christensen

In a distributed environment, failure of any given service is inevitable. Hystrix is a library designed to control the interactions between these distributed services providing greater tolerance of latency and failure. Hystrix does this by isolating points of access between the services, stopping cascading failures across them, and providing fallback options, all of which improve the system's overall resiliency.

Hystrix evolved out of resilience engineering work that the Netflix API team began in 2011. Over the course of 2012, Hystrix continued to evolve and mature, eventually leading to adoption across many teams within Netflix. Today tens of billions of thread-isolated and hundreds of billions of semaphore-isolated calls are executed via Hystrix every day at Netflix and a dramatic improvement in uptime and resilience has been achieved through its use.

The following links provide more context around Hystrix and the challenges that it attempts to address:

Getting Started

Hystrix is available on GitHub at http://github.com/Netflix/Hystrix

Full documentation is available at http://github.com/Netflix/Hystrix/wiki including Getting Started, How To Use, How It Works and Operations examples of how it is used in a distributed system.

You can get and build the code as follows:
$ git clone git://github.com/Netflix/Hystrix.git
$ cd Hystrix/
$ ./gradlew build

Coming Soon

In the near future we will also be releasing the real-time dashboard for monitoring Hystrix as we do at Netflix:


We hope you find Hystrix to be a useful library. We'd appreciate any and all feedback on it and look forward to fork/pulls and other forms of contribution as we work on its roadmap.

Are you interested in working on great open source software? Netflix is hiring!

http://jobs.netflix.com

Tuesday, November 20, 2012

Announcing Blitz4j - a scalable logging framework

By Karthikeyan Ranganathan

We are proud to announce Blitz4j , a critical component of the Netflix logging infrastructure that helps Netflix achieve high volume logging without affecting scalability of the applications.



What is Blitz4j?


Blitz4j is a logging framework built on top of log4j to reduce multithreaded contention and enable highly scalable logging without affecting application performance characteristics.
At Netflix, Blitz4j is used to log billions of events for monitoring, business intelligence reporting, debugging and other purposes. Blitz4j overcomes traditional log4j bottlenecks and comes built with a highly scalable and customizable asynchronous framework. Blitz4j also comes with the ability to convert the existing log4j appenders to use the asynchronous model without changing the existing log4j configurations.
Blitz4j makes runtime reconfigurations of log4j pretty easy. Blitz4j also tries to mitigate data loss and provides a way to summarize the log information during log storms.

Why is scalable logging important?


Logging is a critical part of any application infrastructure. At Netflix, we collect data for monitoring, business intelligence reporting etc. There is also a need to turn on finer grain of logging level for debugging customer issues.
In addition, in a service-oriented architecture, you depend on other central services and if those services break unexpectedly, your applications tend to log orders of magnitude higher than normal. This is where the scalability of the logging infrastructure comes to the fore. Any scalable logging infrastructure should be able to handle these kind of log storms providing useful information about the breakages without affecting the application performance.

History of Blitz4j


At Netflix, log4j has been used as a logging framework for a few years. It had worked fine for us, until the point where there was a real need to log lots of data. When our traffic increased and when the need for per-instance logging went up, log4j's frailties started to get exposed.

Problems with Log4j

Contended Synchronization with Root Logger
Log4j follows a hierarchical model and that makes it easy to turn off/on logging based on a package or a class level (if your logger definition follows that model). In this model, root logger is at the top of the hierarchy. In most cases, all loggers have to get access the root logger to log to the appenders configured there. One of the biggest problems here is that locking is needed on the root logger to write to the appenders. For a high traffic application that logs lot of data this is a big contention point as all application threads have to synchronize on the root logger.

public
  void callAppenders(LoggingEvent event) {
    int writes = 0;
   for(Category c = this; c != null; c=c.parent) {
     // Protected against simultaneous call to addAppender, removeAppender,...
      synchronized(c) {
    if(c.aai != null) {
      writes += c.aai.appendLoopOnAppenders(event);
    }
    if(!c.additive) {
      break;
    }
      }
    }

    if(writes == 0) {
      repository.emitNoAppenderWarning(this);
    }
  }


This severely limits the application scalability. Even if the critical section is executed quickly, this is a huge bottleneck in high volume logging applications.
The reason for the lock seems to be for protection against potential change in the list of appenders which should be a rare event. For us, the thread dump has exposed this contention numerous times.
Asynchronous appender to the rescue?
The longer the time spent in logging to the appender, more threads wait on the logging to complete. Any buffering here helps the scalability of the application tremendously. Some log4j appenders (such as File Appender) comes with the ability to buffer the data and that helps this problem quite a bit. The built in log4j asynchronous appender alleviates this problem quite a bit, but it still does not remove this synchronization. For us, thread dumps revealed another point of contention when logging to the appender with the use of asynchronous appender.
It was quite clear that the built-in asynchronous appender was less scalable because of this synchronization.

public
  synchronized 
  void doAppend(LoggingEvent event) {
    if(closed) {
      LogLog.error("Attempted to append to closed appender named ["+name+"].");
      return;
    }
    
    if(!isAsSevereAsThreshold(event.getLevel())) {
      return;
    }

    Filter f = this.headFilter;
    
    FILTER_LOOP:
    while(f != null) {
      switch(f.decide(event)) {
      case Filter.DENY: return;
      case Filter.ACCEPT: break FILTER_LOOP;
      case Filter.NEUTRAL: f = f.getNext();
      }
    }
    
    this.append(event);    
  }


Deadlock Vulnerability
This double locking (root logger and appender) also makes the application vulnerable to deadlocks if your appender by any chance tries to take a lock on a resource and if that resource tries to log to the appender at the same time.
Locking on Logger Cache
In log4j, loggers are cached in a Hashtable and that needs to be locked for any retrieval of a cached logger. When you want to change the log4j settings dynamically, there are 2 steps in the process.
  1. Reset and empty out all current log4j configurations
  2. Load all configurations including new configurations
During the reset process, locks have to be held on both the cache and the individual loggers. If any of the appenders tried to look up the logger from the cache at the same time,we have the classic case of locks trying to be held in opposite directions and the chance of a deadlock.

public
  void shutdown() {
    Logger root = getRootLogger();

    // begin by closing nested appenders
    root.closeNestedAppenders();

    synchronized(ht) {
      Enumeration cats = this.getCurrentLoggers();
      while(cats.hasMoreElements()) {
    Logger c = (Logger) cats.nextElement();
    c.closeNestedAppenders();
      }

Why Blitz4j?


Central to all the contention and the deadlock vulnerabilities is the locking model in log4j. If the log4j used any of the concurrent data structures with JDK 1.5 and above, most of the problems would be solved. That is exactly what blitz4j does.
Blitz4j overrides key parts of the log4j architecture to remove the locks and replace them with concurrent data structures.Blitz4j puts the emphasis more on application performance and stability rather than accuracy in logging. This means Blitz4j leans more towards the asynchronous model of logging and tries to make the logging useful by retaining the time-order of logging messages.
While the log4j's built-in asynchronous appender is similar in functionality to the one offered by blitz4j, Blitz4j comes with the following differences
  1. Remove all critical synchronizations with concurrent data structures.
  2. Extreme configurability in terms of in-memory buffer and worker threads
  3. More isolation of application threads from logging threads by replacing the wait-notify model with an executor pool model.
  4. Better handling of log messages during log storms with configurable summary.
Apart from the above, Blitz4j also provides the following.
  1. Ability to dynamically configure log4j levels for debugging production problems without affecting the application performance.
  2. Automatic conversion of any log4j appender to the asynchronous model statically or at runtime.
  3. Realtime metrics regarding performance using Servo and dynamic configurability using Archaius.
If your application is equipped with enough memory, it is possible to achieve both application and logging performance without any logging data loss. The power of this model comes to the fore during log storms when the critical dependencies break unexpectedly causing orders of magnitude increase in logging.

Why not use LogBack?

For a new project, LogBack may be an apt choice. For existing projects, there seems to be considerable amount of work to achieve the promised scalability. Besides, Blitz4j has stood the test of time and scrutiny at Netflix and given the familiarity and ubiquity of log4j, it is our architectural choice here at Netflix.

Blitz4j Performance

The  graph below from a couple of our streaming servers which logs about 300-500 lines a second, gives an indication of performance of blitz4j (with asynchronous appender) as compared to log4j (without asynchronous appender).
In a steady state, the latter is atleast 3 times more expensive than blitz4j. There are numerous spikes that happen with the log4j implementation that are due to the synchronizations talked about earlier.


Other things we observed:
When log4j (without synchronizations removed) is used with asynchronous appender, it scales much better (graph not included here), but it just takes that much higher amount of logging for the spikes to show up.

We have also observed with blit4j even with very high amount of logging turned on, the application response times remained unaffected when compared to log4j (with synchronization) where the response times degraded rapidly when the amount of logging increased.

References

 Blitz4j Source
Blitz4j Wiki


If building critical infrastructure components like this, for a service that millions of people use world wide, excites you, take a look at http://jobs.netflix.com.


Tuesday, November 6, 2012

Edda - Learn the Stories of Your Cloud Deployments

By Cory Bennett, Engineering Tools

Operating "in the cloud" has its challenges, and one of those challenges is that nothing is static. Virtual host instances are constantly coming and going, IP addresses can get reused by different applications, and firewalls suddenly appear as security configurations are updated. At Netflix we needed something to help us keep track of our ever-shifting environment within Amazon Web Services (AWS). Our solution is Edda.

Today we are proud to announce that the source code for Edda is open and available.

What is Edda?


Edda is a service that polls your AWS resources via AWS APIs and records the results. It allows you to quickly search through your resources and shows you how they have changed over time.

Previously this project was known within Netflix as Entrypoints (and mentioned in some blog posts), but the name was changed as the scope of the project grew. Edda (meaning "a tale of Norse mythology"), seemed appropriate for the new name, as our application records the tales of Asgard.

Why did we create Edda?


Dynamic Querying


At Netflix we need to be able to quickly query and analyze our AWS resources with widely varying search criteria. For instance, if we see a host with an EC2 hostname that is causing problems on one of our API servers then we need to find out what that host is and what team is responsible, Edda allows us to do this. The APIs AWS provides are fast and efficient but limited in their querying ability. There is no way to find an instance by the hostname, or find all instances in a specific Availability Zone without first fetching all the instances and iterating through them.

With Edda's REST APIs we can use matrix arguments to find the resources that we are looking for. Furthermore, we can trim out unnecessary data in the responses with Field Selectors.

Example: Find any instances that have ever had a specific public IP address:
$ curl "http://edda/api/v2/view/instances;publicIpAddress=1.2.3.4;_since=0"
["i-0123456789","i-012345678a","i-012345678b"]
Now find out what AutoScalingGroup the instances were tagged with:
$ export INST_API=http://edda/api/v2/view/instances
$ curl "$INST_API;publicIpAddress=1.2.3.4;_pp;_since=0;_expand:(instanceId,tags)"
[
  {
    "instanceId" : "i-0123456789",
    "tags" : [
      {
        "key" : "aws:autoscaling:groupName",
        "value" : "app1-v123"
      }
    ]
  },
  {
    "instanceId" : "i-012345678a",
    "tags" : [
      {
        "key" : "aws:autoscaling:groupName",
        "value" : "app2-v123"
      }
    ]
  },
  {
    "instanceId" : "i-012345678b",
    "tags" : [
      {
        "key" : "aws:autoscaling:groupName",
        "value" : "app3-v123"
      }
    ]
  }
]

History/Changes


When trying to analyze causes and impacts of outages we have found the historical data stored in Edda to be very valuable. Currently AWS does not provide APIs that allow you to see the history of your resources, but Edda records each AWS resource as versioned documents that can be recalled via the REST APIs. The "current state" is stored in memory, which allows for quick access. Previous resource states and expired resources are stored in MongoDB (by default), which allows for efficient retrieval. Not only can you see how resources looked in the past, but you can also get unified diff output quickly and see all the changes a resource has gone through.

For example, this shows the most recent change to a security group:
$ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2"
--- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810
+++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504
@@ -1,33 +1,33 @@
 {
   "class" : "com.amazonaws.services.ec2.model.SecurityGroup",
   "description" : "App1",
   "groupId" : "sg-0123456789",
   "groupName" : "app1-frontend",
   "ipPermissions" : [
     {
       "class" : "com.amazonaws.services.ec2.model.IpPermission",
       "fromPort" : 80,
       "ipProtocol" : "tcp",
       "ipRanges" : [
         "10.10.1.1/32",
         "10.10.1.2/32",
+        "10.10.1.3/32",
-        "10.10.1.4/32"
       ],
       "toPort" : 80,
       "userIdGroupPairs" : [ ]
     }
   ],
   "ipPermissionsEgress" : [ ],
   "ownerId" : "2345678912345",
   "tags" : [ ],
   "vpcId" : null
 }

High Level Architecture


Edda is a Scala application that can both run on a single instance or scale up to many instances running behind a load balancer for high availability. The data store that Edda currently supports is MongoDB, which is also versatile enough to run on either a single instance along with the Edda service, or be grown to include large replication sets. When running as a cluster, Edda will automatically select a leader which then does all the AWS polling (by default every 60 seconds) and persists the data. The other secondary servers will be refreshing their in-memory records (by default every 30 seconds) and handling REST requests.

Currently only MongoDB is supported for the persistence layer, but we are analyzing alternatives. MongoDB supports JSON documents and allows for advanced query options, both of which are necessary for Edda. However, as our previous blogs have indicated, Netflix is heavily invested in Cassandra. We are therefore looking at some options for advance query services that can work in conjunction with Cassandra.

Edda was designed to allow for easily implementing custom crawlers to track collections of resources other than those of AWS. In the near future we will be releasing some examples we have implemented which track data from AppDynamics, and others which track our Asgard applications and clusters.

Configuration


There are many configuration options for Edda. It can be configured to poll a single AWS region (as we run it here) or to poll across multiple regions. If you have multiple AWS accounts (ie. test and prod), Edda can be configured to poll both from the same instance. Edda currently polls 15 different resource types within AWS. Each collection can be individually enabled or disabled. Additionally, crawl frequency and cache refresh rates can all be tweaked.

Coming up


In the near future we are planning to release some new collections for Edda to monitor. The first will be APIs that allow us to pull details about application health and traffic patterns out of AppDynamics. We also plan to release APIs that track our Asgard application and cluster resources.

Summary


We hope you find Edda to be a useful tool. We'd appreciate any and all feedback on it. Are you interested in working on great open source software? Netflix is hiring! http://jobs.netflix.com

Edda Links