Monday, June 18, 2012

Announcing Archaius: Dynamic Properties in the Cloud

By Allen Wang and Sudhir Tonse


Netflix has a culture of being dynamic when it comes to decision making. This trait comes across in the business domain as well as in technology and operations.
It follows that we like the ability to effect changes in the behavior of our deployed services dynamically at run-time. Availability is of the utmost importance to us, so we would like to accomplish this without having to bounce servers.
Furthermore, we want the ability to dynamically change properties (and hence the logic and behavior of our services) based on a request or deployment context. For example, we want to configure properties for an application instance or request based on factors like the Amazon Region the service is deployed in, the country of origin of the request, the device the movie is playing on, etc.

What is Archaius?

 








(Image obtained from http://en.wikipedia.org/wiki/File:Calumma_tigris-2.jpg)

Archaius is the dynamic, multi-dimensional properties framework that addresses these requirements and use cases.
The code name for the project comes from an endangered species of chameleon. More information can be found at http://en.wikipedia.org/wiki/Archaius_tigris. We chose Archaius because chameleons are known for changing their color (a property) based on their environment and situation.


We are pleased to announce the public availability of Archaius as an important milestone in our continued goal of open sourcing the Netflix Platform Stack. (Available at http://github.com/netflix)

Why Archaius?

To understand why we built Archaius, we need to enumerate the pain points of configuration management and the ecosystem that the system operates in. Some of these are captured below, and drove the requirements.
  • Static updates required server pushes; this was operationally undesirable and dented the availability of the service/application.
  • A push method of updating properties could not be employed, as such a system would need to know all the server instances to push the configuration to at any given point in time (i.e. the list of hostnames and property locations).
    • This was a possibility in our own data center, where we owned all the servers. In the cloud, the instances are ephemeral and their hostnames/IP addresses are not known in advance. Furthermore, the number of these instances fluctuates based on the ASG settings. (For more information on how Netflix uses the Auto Scaling Group feature of AWS, please visit here or here.)
  • Given that property changes had to be applied at run time, it was clear that the codebase had to use a common mechanism which allowed it to consume properties in a uniform manner, from different sources (both static and dynamic).
  • There was a need to have different properties for different applications and services under different contexts. See the section "Netflix Deployment Overview" for an overview of services and context.
  • Property changes needed to be journaled. This allowed us to correlate any issues in production to a corresponding run time property change.
  • Properties had to be applied based on the context; that is, properties had to be multi-dimensional. At Netflix, the context was based on "dimensions" such as Environment (development, test, production), Deployed Region (us-east-1, us-west-1, etc.), and "Stack" (a concept in which each app and the services in its dependency graph were isolated for a specific purpose, e.g. "iPhone App launch Stack").
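To make the pull model and the journaling requirement concrete, here is a minimal sketch in plain Java. It is not the Archaius API: the class `PollingPropertyStore`, its nested `Change` type, and the method names are hypothetical, and the "source" is just a function returning a full snapshot of properties. Each instance pulls from the source itself, so nothing needs to know the instance's hostname, and every observed change is recorded so it can later be correlated with production issues.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch: an instance pulls properties from a source and journals each change.
class PollingPropertyStore {
    // One journal entry per observed change (illustrative type, not part of Archaius).
    static final class Change {
        final Instant at;
        final String key;
        final String oldValue; // null if the key was newly added
        final String newValue; // null if the key was removed
        Change(Instant at, String key, String oldValue, String newValue) {
            this.at = at;
            this.key = key;
            this.oldValue = oldValue;
            this.newValue = newValue;
        }
    }

    private final Map<String, String> current = new ConcurrentHashMap<>();
    private final List<Change> journal = new ArrayList<>();
    private final Supplier<Map<String, String>> source;

    PollingPropertyStore(Supplier<Map<String, String>> source) {
        this.source = source;
    }

    // Pulls a full snapshot (values assumed non-null) and journals only the differences.
    synchronized void pollOnce() {
        Map<String, String> snapshot = source.get();
        for (Map.Entry<String, String> e : snapshot.entrySet()) {
            String old = current.put(e.getKey(), e.getValue());
            if (!Objects.equals(old, e.getValue())) {
                journal.add(new Change(Instant.now(), e.getKey(), old, e.getValue()));
            }
        }
        // Keys that are absent from the snapshot are treated as deleted.
        for (String key : List.copyOf(current.keySet())) {
            if (!snapshot.containsKey(key)) {
                journal.add(new Change(Instant.now(), key, current.remove(key), null));
            }
        }
    }

    String get(String key, String defaultValue) {
        return current.getOrDefault(key, defaultValue);
    }

    synchronized List<Change> journal() {
        return List.copyOf(journal);
    }
}
```

In a running service, `pollOnce()` would typically be invoked on a fixed delay, for example from a `java.util.concurrent.ScheduledExecutorService`, rather than by hand.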

Use Cases/Examples

  • Enable or disable certain features based on the request context. 
  • A UI presentation logic layer may have a default configuration to display 10 Movie Box Shots in a single display row. If we determine that we would like to display 5 instead, we can do so using Archaius' Dynamic Properties.
  • We can override the behaviors of the circuit breakers. Reference: Resiliency and Circuit breakers
  • Connection and request timeouts for calls to internal and external services can be adjusted as needed
  • In case we get alerted on errors observed in certain services, we can change the Log Levels (e.g. DEBUG, WARN, etc.) dynamically for particular packages/components on these services. This enables us to parse the log files to inspect these errors. Once we are done inspecting the logs, we can reset the Log Levels using Dynamic Properties.
  • Now that Netflix is deployed in an ever-growing global infrastructure, Dynamic Properties allow us to enable different characteristics and features based on the international market.
  • Certain infrastructural components benefit from having configurations changed at runtime based on aggregate site-wide behavior. For example, a distributed cache component's TTL (time to live) can be changed at runtime based on external factors.
  • Connection pools had to be set differently for the same client library based on which application/service it was deployed in. (For example, in a lightweight, low Requests Per Second (RPS) application, the number of connections in a connection pool to a particular service/db will be set to a lower number compared to a high RPS application.)
  • The changes in properties can be effected on a particular instance, a particular region, a stack of deployed services, or an entire farm of a particular application at run-time.
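Several of the use cases above boil down to reading a typed value with a sensible default and reacting when it changes. The following is a rough sketch of that idea in plain Java; it is a stand-in written for this article, not the Archaius API. The class name `DynamicIntSketch`, the shared `STORE` map, and the property name used in the usage note are all hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative stand-in for a dynamic, typed property; not the Archaius API.
class DynamicIntSketch {
    // Shared backing store; in a real system a configuration source would update it.
    static final Map<String, String> STORE = new ConcurrentHashMap<>();

    private final String name;
    private final int defaultValue;
    private final List<Runnable> callbacks = new CopyOnWriteArrayList<>();

    DynamicIntSketch(String name, int defaultValue) {
        this.name = name;
        this.defaultValue = defaultValue;
    }

    // Re-reads the store on every call, so callers always see the latest value.
    int get() {
        String raw = STORE.get(name);
        if (raw == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return defaultValue; // fall back to the default on malformed input
        }
    }

    void addCallback(Runnable callback) {
        callbacks.add(callback);
    }

    // Simulates a runtime update arriving from a configuration source.
    void update(String newValue) {
        STORE.put(name, newValue);
        callbacks.forEach(Runnable::run);
    }
}
```

For the Box Shot example, a UI layer could hold `new DynamicIntSketch("ui.boxShotsPerRow", 10)` and call `get()` on each render. Before any update it returns the default of 10; after `update("5")` it returns 5, with no restart of the server.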

Netflix Deployment Overview




 

Example Deployment Context

  • Environment = TEST
  • Region = us-east-1
  • Stack = MyTestStack
  • AppName = cherry
The diagram above shows a hypothetical simplistic overview of a typical deployment architecture at Netflix. Netflix has several services and applications that are consumer facing. These are referred to as Edge Services/Applications. These are typically fronted by Amazon's ELB. Each application/service depends on a set of mid-tier services and persistence technologies (Amazon S3, Cassandra etc.) sometimes fronted by a distributed cache.

Every service or application has a unique "AppName" associated with it. Most services at Netflix are stateless and hosted on multiple instances deployed across multiple Availability Zones of an Amazon Region. The available environments could be "test", "production", etc. A Stack is a logical grouping. For example, an Application and the Mid-Tier Services in its dependency graph can all be logically grouped as belonging to a Stack called "MyTestStack". This is typically done to run different tests on isolated and controlled deployments.

The red oval boxes in the diagram above, labeled "Shared Libraries", represent common code used by multiple applications. For example, Astyanax, our open sourced Cassandra client, is one such shared library. It turns out that we may need to configure the connection pool differently for each application that uses the Astyanax library. Furthermore, the setting could vary across Amazon Regions and across different "Stacks" of deployments. Sometimes we may also want to tweak this connection pool parameter at runtime. These are the capabilities that Archaius offers.
In other words, the ability to target a specific subset or aggregation of components and configure their behavior, either statically (at initial loading) or at runtime, is what enables us to address the use cases outlined above.
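One way to picture context-based targeting is a property with several candidate values, each tagged with the dimensions it applies to. The sketch below is purely illustrative and is not how Archaius resolves properties internally: the class `ScopedPropertySketch` and the rule "the match with the most dimensions wins" are assumptions made for this example. The dimension names mirror the Example Deployment Context above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of multi-dimensional resolution; not Archaius's actual algorithm.
class ScopedPropertySketch {
    // A value plus the context dimensions it applies to (an empty scope is a global default).
    static final class Scoped {
        final Map<String, String> scope;
        final String value;
        Scoped(Map<String, String> scope, String value) {
            this.scope = scope;
            this.value = value;
        }
    }

    private final List<Scoped> candidates = new ArrayList<>();

    ScopedPropertySketch add(Map<String, String> scope, String value) {
        candidates.add(new Scoped(scope, value));
        return this;
    }

    // Returns the value whose scope matches the context on the most dimensions,
    // or null if nothing matches. On a tie, the candidate added first wins.
    String resolve(Map<String, String> context) {
        Scoped best = null;
        for (Scoped c : candidates) {
            // A candidate matches when every dimension in its scope equals the context's value.
            boolean matches = context.entrySet().containsAll(c.scope.entrySet());
            if (matches && (best == null || c.scope.size() > best.scope.size())) {
                best = c;
            }
        }
        return best == null ? null : best.value;
    }
}
```

For instance, a connection pool size could have a global default of "10", a value of "20" scoped to Region = us-east-1, and a value of "30" scoped to both Region = us-east-1 and AppName = cherry. Under the assumed rule, an instance running with the example context above would resolve "30".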

The examples and diagrams in this article show a representative view of how Archaius is used at Netflix. Archaius, the open sourced version of the project, is configurable and extensible to meet your specific needs and deployment environment (even if your deployment of choice is not the EC2 Cloud).

Overview of Archaius


 
Archaius includes a set of Java configuration management APIs that are used at Netflix. It is primarily implemented as an extension of the Apache Commons Configuration library. Notable features are:
  • Dynamic, Typed Properties
  • High throughput and Thread Safe Configuration operations
  • A polling framework that allows for obtaining property changes from a Configuration Source
  • A Callback mechanism that gets invoked on effective/"winning" property mutations (in the ordered hierarchy of Configurations)
  • A JMX MBean that can be accessed via JConsole to inspect and invoke operations on properties
At the heart of Archaius is the concept of a Composite Configuration, which is an ordered list of one or more Configurations. Each Configuration can be sourced from a Configuration Source such as JDBC, a REST API, or a .properties file. Configuration Sources can optionally be polled at runtime for changes. (In the above diagram, the Persisted DB Configuration Source, an RDBMS containing properties in a table, is polled periodically for changes.) The final value of a property is determined by the topmost Configuration that contains that property. That is, if a property is present in multiple Configurations, the value seen by the application is the one in the topmost slot of the hierarchy. The order of the Configurations in the hierarchy is itself configurable.
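The "topmost Configuration wins" rule can be sketched in a few lines of plain Java. This is an illustration of the lookup order only, not the Archaius implementation: the class `CompositeConfigSketch` and its methods are hypothetical, and each Configuration is reduced to a simple map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch of an ordered composite; not the Archaius implementation.
class CompositeConfigSketch {
    // Index 0 is the topmost Configuration and takes precedence over those below it.
    private final List<Map<String, String>> layers = new ArrayList<>();

    // Appends a Configuration below all previously added ones.
    CompositeConfigSketch addLayer(Map<String, String> layer) {
        layers.add(layer);
        return this;
    }

    // Returns the value from the topmost layer that contains the key, or the default.
    String get(String key, String defaultValue) {
        for (Map<String, String> layer : layers) {
            String value = layer.get(key);
            if (value != null) {
                return value;
            }
        }
        return defaultValue;
    }
}
```

Suppose a mutable map of runtime overrides is added first and a map of file defaults second. A key present only in the defaults is served from there. Once the same key appears in the overrides, that value is the one the application sees, and later edits to the defaults have no visible effect until the override is removed.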

A rough template for handling a request and using Dynamic Property based execution is shown below: 
 
void handleFeatureXYZRequest(Request params ...){
  // Read the property on every request so that a runtime change takes effect immediately.
  if (featureXYZDynamicProperty.get().equals("useLongDescription")) {
   showLongDescription();
  } else {
   showShortSnippet();
  }
}
The source code for Archaius is hosted on GitHub at https://github.com/Netflix/archaius.

References

  1. Apache Commons Configuration library
  2. Archaius Features
  3. Archaius User Guide

Conclusion

Archaius forms an important component of the Netflix Cloud Platform. It offers the ability to control various subsystems and components at runtime without any impact on the availability of the services. We hope that this is a useful addition to the list of projects open sourced by Netflix, and we invite the open source community to help us improve Archaius and other components.

Interested in helping us take Netflix Cloud Platform to the next level? We are looking for talented engineers.

- Allen Wang, Sr. Software Engineer, Cloud Platform (Core Infrastructure)
- Sudhir Tonse (@stonse), Manager, Cloud Platform (Core Infrastructure)