Tuesday, November 20, 2012

Announcing Blitz4j - a scalable logging framework

By Karthikeyan Ranganathan

We are proud to announce Blitz4j , a critical component of the Netflix logging infrastructure that helps Netflix achieve high volume logging without affecting scalability of the applications.



What is Blitz4j?


Blitz4j is a logging framework built on top of log4j to reduce multithreaded contention and enable highly scalable logging without affecting application performance characteristics.
At Netflix, Blitz4j is used to log billions of events for monitoring, business intelligence reporting, debugging and other purposes. Blitz4j overcomes traditional log4j bottlenecks and comes built with a highly scalable and customizable asynchronous framework. Blitz4j also comes with the ability to convert the existing log4j appenders to use the asynchronous model without changing the existing log4j configurations.
Blitz4j makes runtime reconfigurations of log4j pretty easy. Blitz4j also tries to mitigate data loss and provides a way to summarize the log information during log storms.

Why is scalable logging important?


Logging is a critical part of any application infrastructure. At Netflix, we collect data for monitoring, business intelligence reporting etc. There is also a need to turn on finer grain of logging level for debugging customer issues.
In addition, in a service-oriented architecture, you depend on other central services and if those services break unexpectedly, your applications tend to log orders of magnitude higher than normal. This is where the scalability of the logging infrastructure comes to the fore. Any scalable logging infrastructure should be able to handle these kind of log storms providing useful information about the breakages without affecting the application performance.

History of Blitz4j


At Netflix, log4j has been used as a logging framework for a few years. It had worked fine for us, until the point where there was a real need to log lots of data. When our traffic increased and when the need for per-instance logging went up, log4j's frailties started to get exposed.

Problems with Log4j

Contended Synchronization with Root Logger
Log4j follows a hierarchical model and that makes it easy to turn off/on logging based on a package or a class level (if your logger definition follows that model). In this model, root logger is at the top of the hierarchy. In most cases, all loggers have to get access the root logger to log to the appenders configured there. One of the biggest problems here is that locking is needed on the root logger to write to the appenders. For a high traffic application that logs lot of data this is a big contention point as all application threads have to synchronize on the root logger.

public
  void callAppenders(LoggingEvent event) {
    int writes = 0;
   for(Category c = this; c != null; c=c.parent) {
     // Protected against simultaneous call to addAppender, removeAppender,...
      synchronized(c) {
    if(c.aai != null) {
      writes += c.aai.appendLoopOnAppenders(event);
    }
    if(!c.additive) {
      break;
    }
      }
    }

    if(writes == 0) {
      repository.emitNoAppenderWarning(this);
    }
  }


This severely limits the application scalability. Even if the critical section is executed quickly, this is a huge bottleneck in high volume logging applications.
The reason for the lock seems to be for protection against potential change in the list of appenders which should be a rare event. For us, the thread dump has exposed this contention numerous times.
Asynchronous appender to the rescue?
The longer the time spent in logging to the appender, more threads wait on the logging to complete. Any buffering here helps the scalability of the application tremendously. Some log4j appenders (such as File Appender) comes with the ability to buffer the data and that helps this problem quite a bit. The built in log4j asynchronous appender alleviates this problem quite a bit, but it still does not remove this synchronization. For us, thread dumps revealed another point of contention when logging to the appender with the use of asynchronous appender.
It was quite clear that the built-in asynchronous appender was less scalable because of this synchronization.

public
  synchronized 
  void doAppend(LoggingEvent event) {
    if(closed) {
      LogLog.error("Attempted to append to closed appender named ["+name+"].");
      return;
    }
    
    if(!isAsSevereAsThreshold(event.getLevel())) {
      return;
    }

    Filter f = this.headFilter;
    
    FILTER_LOOP:
    while(f != null) {
      switch(f.decide(event)) {
      case Filter.DENY: return;
      case Filter.ACCEPT: break FILTER_LOOP;
      case Filter.NEUTRAL: f = f.getNext();
      }
    }
    
    this.append(event);    
  }


Deadlock Vulnerability
This double locking (root logger and appender) also makes the application vulnerable to deadlocks if your appender by any chance tries to take a lock on a resource and if that resource tries to log to the appender at the same time.
Locking on Logger Cache
In log4j, loggers are cached in a Hashtable and that needs to be locked for any retrieval of a cached logger. When you want to change the log4j settings dynamically, there are 2 steps in the process.
  1. Reset and empty out all current log4j configurations
  2. Load all configurations including new configurations
During the reset process, locks have to be held on both the cache and the individual loggers. If any of the appenders tried to look up the logger from the cache at the same time,we have the classic case of locks trying to be held in opposite directions and the chance of a deadlock.

public
  void shutdown() {
    Logger root = getRootLogger();

    // begin by closing nested appenders
    root.closeNestedAppenders();

    synchronized(ht) {
      Enumeration cats = this.getCurrentLoggers();
      while(cats.hasMoreElements()) {
    Logger c = (Logger) cats.nextElement();
    c.closeNestedAppenders();
      }

Why Blitz4j?


Central to all the contention and the deadlock vulnerabilities is the locking model in log4j. If the log4j used any of the concurrent data structures with JDK 1.5 and above, most of the problems would be solved. That is exactly what blitz4j does.
Blitz4j overrides key parts of the log4j architecture to remove the locks and replace them with concurrent data structures.Blitz4j puts the emphasis more on application performance and stability rather than accuracy in logging. This means Blitz4j leans more towards the asynchronous model of logging and tries to make the logging useful by retaining the time-order of logging messages.
While the log4j's built-in asynchronous appender is similar in functionality to the one offered by blitz4j, Blitz4j comes with the following differences
  1. Remove all critical synchronizations with concurrent data structures.
  2. Extreme configurability in terms of in-memory buffer and worker threads
  3. More isolation of application threads from logging threads by replacing the wait-notify model with an executor pool model.
  4. Better handling of log messages during log storms with configurable summary.
Apart from the above, Blitz4j also provides the following.
  1. Ability to dynamically configure log4j levels for debugging production problems without affecting the application performance.
  2. Automatic conversion of any log4j appender to the asynchronous model statically or at runtime.
  3. Realtime metrics regarding performance using Servo and dynamic configurability using Archaius.
If your application is equipped with enough memory, it is possible to achieve both application and logging performance without any logging data loss. The power of this model comes to the fore during log storms when the critical dependencies break unexpectedly causing orders of magnitude increase in logging.

Why not use LogBack?

For a new project, LogBack may be an apt choice. For existing projects, there seems to be considerable amount of work to achieve the promised scalability. Besides, Blitz4j has stood the test of time and scrutiny at Netflix and given the familiarity and ubiquity of log4j, it is our architectural choice here at Netflix.

Blitz4j Performance

The  graph below from a couple of our streaming servers which logs about 300-500 lines a second, gives an indication of performance of blitz4j (with asynchronous appender) as compared to log4j (without asynchronous appender).
In a steady state, the latter is atleast 3 times more expensive than blitz4j. There are numerous spikes that happen with the log4j implementation that are due to the synchronizations talked about earlier.


Other things we observed:
When log4j (without synchronizations removed) is used with asynchronous appender, it scales much better (graph not included here), but it just takes that much higher amount of logging for the spikes to show up.

We have also observed with blit4j even with very high amount of logging turned on, the application response times remained unaffected when compared to log4j (with synchronization) where the response times degraded rapidly when the amount of logging increased.

References

 Blitz4j Source
Blitz4j Wiki


If building critical infrastructure components like this, for a service that millions of people use world wide, excites you, take a look at http://jobs.netflix.com.