Monday, January 28, 2013

Announcing Ribbon: Tying the Netflix Mid-Tier Services Together


by Allen Wang and Sudhir Tonse

Netflix embraces a fine-grained Service Oriented Architecture as the underpinning of its Cloud based deployment model. Currently, we run hundreds of such fine grained services that are collectively responsible in handling the customer facing requests via a few "Edge Services" such as the Netflix API Service. A lightweight REST based protocol is the choice for inter process communication amongst these services.





The Netflix Internal Web Service Framework (aka NIWS) forms the bedrock of this architecture in the realm of communication. Together with the previously announced Eureka, which aids in service discovery, NIWS provides all the components pertinent for making REST calls.

NIWS is comprised of a REST client and server framework, based on JSR-311 which is a RESTful API specification for Java. Our services use various payload data serialization formats such as Avro, XML, JSON, Thrift and Google Protocol Buffers. NIWS provides the mechanism for serializing and deserializing the data.



Today, we are happy to announce Ribbon as the latest offering of our hugely popular and growing Open Source libraries hosted on GitHub.

Ribbon, as a first cut, mainly offers the client side software load balancing algorithms which have been battle tested at Netflix along with a few of the other components that form our Inter Process Communication stack (aka NIWS). We plan to continue open sourcing the rest the of the NIWS stack in the coming months. Please note that the loadbalancers mentioned are the internal client-side loadbalancers used alongside Eureka that are primarily used for load balancing requests to our mid-tier services. For our public facing Edge Services we continue to use Amazon's ELB Service

Deployment Topology


DiagramTypical (representative) deployment architecture at Netflix.

A typical deployment architecture at Netflix is a multi-region, multi-zone deployment which aids in better availability and resiliency.
The Amazon ELB provides load balancing for customer/device facing requests while internal mid-tier requests are handled via the Ribbon framework.

Eureka provides the service registry for all Netflix services. Ribbon clients are typically created and configured for each of the target services. Ribbon's Client component offers a good set of configuration options such as connection timeouts, retries, retry algorithm (exponential, bounded backoff)  etc.
Ribbon comes built in with a pluggable and customizable LoadBalancing component. Some of the load balancing strategies offered are listed below, with more to follow.

  • Simple Round Robin LB
  • Weighted Response Time LB
  • Zone Aware Round Robin LB
  • Random LB

Battle Tested Features

The main benefit of Ribbon is that it offers a simple inter process communication mechanism with features built based on our operational learnings and experience in the Amazon Cloud. Ribbon's Zone Aware Load Balancer for example, is built with circuit tripping logic and can be configured to favor the target service instance based on Zone Affinity (i.e it favors the zone in which the calling service itself is hosted, thus benefiting from reduced latency and savings in cost). It monitors the operational behavior of the instances running in each zone and has the capability to quickly (at real time) drop an entire Zone out of rotation. This helps us be resilient in the face of Zone outages as described in a prior blog post.

Zone Aware Load Balancer




The picture above shows the Zone Aware LoadBalancer in action.  The LoadBalancer will do the following when picking a server:
  1. The LoadBalancer will calculate and examine zone stats of all available zones. If the active requests per server for any zone has reached a configured threshold, this zone will be dropped from the active server list. In case more than one zone has reached the threshold, the zone with the most active requests per server will be dropped.
  2. Once the the worst zone is dropped, a zone will be chosen among the rest with the probability proportional to its number of instances.
  3. A server will be returned from the chosen zone with a given Rule (A Rule is a loadbalacing strategy, for example a simple Round Robin Rule)
For each request, the steps above will be repeated. That is to say, each zone related load balancing decisions are made at real time with the up-to-date statistics aiding the choice.

Some of the features of Ribbon are listed below with more to follow.

  • Easy integration with a Service Discovery component such as Netflix's Eureka
  • Runtime configuration using Archaius
  • Operational Metrics exposed via JMX and published via Servo
  • Multiple and pluggable serialization choices (via JSR-311, Jersey)
  • Asynchronous and Batch operations (coming up)
  • Automatic SLA framework (coming up)
  • Administration/Metrics console (coming up)
Please visit the wiki pages at GitHub for detailed documentation on Ribbon.

At Netflix, we typically wrap our REST calls made via Ribbon using Hystrix which provides Latency and Fault Tolerance in Distributed Systems.

If you would like to contribute to our highly scalable libraries and frameworks for ephemeral distributed environments, please take a look at http://netflix.github.com. You can follow us on twitter at @NetflixOSS.
We will be hosting a NetflixOSS Open House on the 6th of February, 2013 (Limited seats. RSVP needed).

We are constantly looking for great talent to join us and we welcome you to take a look at our Jobs page or contact @stonse for positions in the Cloud Platform Infrastructure team.

Resources

  1. Netflix Open Source Dashboard
  2. Ribbon
  3. Eureka (Service Discovery and Metadata)
  4. Archaius (Dynamic configurations)
  5. Hystrix (Latency and Fault Tolerance)
  6. Simian Army