Wednesday, June 26, 2013

HTML5 Video in IE 11 on Windows 8.1

By Anthony Park and Mark Watson.

We've previously discussed our plans to use HTML5 video with the proposed "Premium Video Extensions" in any browser which implements them.  These extensions are the future of premium video on the web, since they allow playback of premium video directly in the browser without the need to install plugins.

Today, we're excited to announce that we've been working closely with Microsoft to implement these extensions in Internet Explorer 11 on Windows 8.1.  If you install the Windows 8.1 Preview from Microsoft, you can visit Netflix.com today in Internet Explorer 11 and watch your favorite movies and TV shows using HTML5!

Microsoft implemented the Media Source Extensions (MSE) using the Media Foundation APIs within Windows.  Since Media Foundation supports hardware acceleration using the GPU, this means that we can achieve high quality 1080p video playback with minimal CPU and battery utilization.  Now a single charge gets you more of your favorite movies and TV shows!

Microsoft also has an implementation of the Encrypted Media Extensions (EME) using Microsoft PlayReady DRM.  This provides the content protection needed for premium video services like Netflix.

Finally, Microsoft implemented the Web Cryptography API (WebCrypto) in Internet Explorer, which allows us to encrypt and decrypt communication between our JavaScript application and the Netflix servers.

We expect premium video on the web to continue to shift away from using proprietary plugin technologies to using these new Premium Video Extensions.  We are thrilled to work so closely with the Microsoft team on advancing the HTML5 platform, which gets a big boost today with Internet Explorer’s cutting edge support for premium video.  We look forward to these APIs being available on all browsers.

Introducing Lipstick on A(pache) Pig

by Jeff Magnusson, Charles Smith, John Lee, and Nathan Bates

We’re pleased to announce Lipstick (our Pig workflow visualization tool) as the latest addition to the suite of Netflix Open Source Software.

At Netflix, Apache Pig is used heavily amongst developers when productionizing complex data transformations and workflows against our big data.  Pig provides good facilities for code reuse in the form of Python and Java UDFs and Pig macros. It also exposes a simple grammar that allows our users to easily express workflows on big datasets without getting “lost of the weeds” worrying about complicated MapReduce logic.

While Pig’s high level of abstraction is one of its most attractive features, scripts can quickly reach a level of complexity upon which the flow of execution, and it’s relation to the MapReduce jobs being executed, become difficult to conceptualize.  This tends to prolong and complicate the effort required to develop, maintain, debug, and monitor the execution of scripts in our environment. In order to address these concerns we have developed Lipstick, a tool that enables developers to visualize and monitor the execution of their data flows at a logical level.

Lipstick was initially developed as a stand-alone tool that produced a graphical depiction of a Pig workflow.  While useful, we quickly realized that combining the workflow with information about the job as it ran gave the developer insight that previously required a lot of sifting through logs (or a Pig expert) to piece together.   Now, as an implementation of Pig Progress Notification Listener, Lipstick piggybacks on top of all Pig scripts executed in our environment notifying a Lipstick server of job executions and periodically reporting progress as the script executes.


The screenshot above shows Lipstick in action.  In this example the developer would see:
  • This script compiled into 4 MapReduce jobs (two of which we can see represented by the blue bounding boxes)
  • Which logical operations execute in the mappers (blue header) vs the reducers (orange header)
  • Row counts from load / store / dump operations, as well as in between MapReduce jobs
Had the script been currently executing, the boxes representing MapReduce jobs would have been flashing colors (blue or orange) to represent that they were currently executing in the map or reduce phase, and intermediate row counts would have been updating periodically as the Pig script heartbeat back to the Lipstick server.

Lipstick has many cool features (check out the user guide to learn more), but there are two that we think are especially useful:
Clicking on intermediate row counts between MapReduce jobs displays a sample of intermediate results.
A toggle that switches between optimized and unoptimized versions of the logical plan.  This allows users to easily see how Pig is applying optimizations to the script (e.g. filters pushed into the loader).
In the months we've been using Lipstick, it has already proven its worth many times over and we are just getting started.  If you would like to use Lipstick yourself or help us make it better, download it and give us your feedback.  If you like building tools that makes it easier to work with big data (like Lipstick) check out our jobs page as well.

Friday, June 21, 2013

Genie is out of the bottle!

Genie is out of the bottle

by Sriram Krishnan

In a prior tech blog, we had discussed the architecture of our petabyte-scale data warehouse in the cloud. Salient features of our architecture include the use of Amazon’s Simple Storage Service (S3) as our "source of truth", leveraging the elasticity of the cloud to run multiple dynamically resizable Hadoop clusters to support various workloads, and our horizontally scalable Hadoop Platform as a Service called Genie.

Today, we are pleased to announce that Genie is now open source, and available to the public from the Netflix OSS GitHub site.

What is Genie?

Genie provides job and resource management for the Hadoop ecosystem in the cloud. From the perspective of the end-user, Genie abstracts away the physical details of various (potentially transient) Hadoop resources in the cloud, and provides a REST-ful Execution Service to submit and monitor Hadoop, Hive and Pig jobs without having to install any Hadoop clients. And from the perspective of a Hadoop administrator, Genie provides a set of Configuration Services, which serve as a registry for clusters, and their associated Hive and Pig configurations.

Why did we build Genie?

There are two main reasons why we built Genie. Firstly, we run multiple Hadoop clusters in the cloud to support different workloads at Netflix. Some of them are launched as needed, and are hence transient - for instance, we spin up “bonus” Hadoop clusters nightly to augment our resources for ETL (extract, transform, load) processing. Others are longer running (viz. our regular “SLA” and “ad-hoc” clusters) - but may still be re-spun from time to time, since we work under the operating assumption that cloud resources may go down at any time. Users need to discover the latest incarnations of these clusters by name, or by the type of workloads that they support. In the data center, this is generally not an issue since Hadoop clusters don’t come up or go down frequently, but this is much more common in the cloud.

Secondly, end-users simply want to run their Hadoop, Hive or Pig jobs - very few of them are actually interested in launching their own clusters, or even installing all the client-side software and downloading all the configurations needed to run such jobs. This is generally true in both the data center and the cloud. A REST-ful API to run jobs opens up a wealth of opportunities, which we have exploited by building web UIs, workflow templates, and visualization tools that encapsulate all our common patterns of use.

What Genie Isn’t

Genie is not a workflow scheduler, such as Oozie. Genie’s unit of execution is a single Hadoop, Hive or Pig job. Genie doesn’t schedule or run workflows - in fact, we use an enterprise scheduler (UC4) at Netflix to run our ETL.

Genie is not a task scheduler, such as the Hadoop fair share or capacity schedulers either. We think of Genie as a resource match-maker, since it matches a job to an appropriate cluster based on the job parameters and cluster properties. If there are multiple clusters that are candidates to run a job, Genie will currently choose a cluster at random. It is possible to plug in a custom load balancer to choose a cluster more optimally - however, such a load balancer is currently not available.

Finally, Genie is not an end-to-end resource management tool - it doesn’t provision or launch clusters, and neither does it scale clusters up and down based on their utilization. However, Genie is a key complementary tool, serving as a repository of clusters, and an API for job management.

How Genie Works

The following diagram explains the core components of Genie, and its two classes of Hadoop users - administrators, and end-users.

Genie itself is built on top of the following Netflix OSS components:

  • Karyon, which provides bootstrapping, runtime insights, diagnostics, and various cloud-ready hooks,
  • Eureka, which provides service registration and discovery,
  • Archaius, for dynamic property management in the cloud,
  • Ribbon, which provides Eureka integration, and client-side load-balancing for REST-ful interprocess communication, and
  • Servo, which enables exporting metrics, registering them with JMX (Java Management Extensions), and publishing them to external monitoring systems such as Amazon's CloudWatch.

Genie can be cloned from GitHub, built, and deployed into a container such as Tomcat. But it is not of much use unless someone (viz. an administrator) registers a Hadoop cluster with it. Registration of a cluster with Genie is as follows:

  • Hadoop administrators first spin up a Hadoop cluster, e.g. using the EMR client API.
  • They then upload the Hadoop and Hive configurations for this cluster (*-site.xml’s) to some location on S3.
  • Next, the administrators use the Genie client to discover a Genie instance via Eureka, and make a REST-ful call to register a cluster configuration using a unique id, and a cluster name, along with a few other properties - e.g. that it supports “SLA” jobs, and the “prod” metastore. If they are creating a new metastore configuration, then they may also have to register a new Hive or Pig configuration with Genie.

After a cluster has been registered, Genie is now ready to grant any wish to its end-users - as long as it is to submit Hadoop jobs, Hive jobs, or Pig jobs!

End-users use the Genie client to launch and monitor Hadoop jobs. The client internally uses Eureka to discover a live Genie instance, and Ribbon to perform client-side load balancing, and to communicate REST-fully with the service. Users specify job parameters, which consist of:

  • A job type, viz. Hadoop, Hive or Pig,
  • Command-line arguments for the job,
  • A set of file dependencies on S3 that can include scripts or UDFs (user defined functions).

Users must also tell Genie what kind of Hadoop cluster to pick. For this, they have a few choices - they can use a cluster name or a cluster ID to pin to a specific cluster, or they can use a schedule (e.g. SLA) and a metastore configuration (e.g. prod), which Genie will use to pick an appropriate cluster to run a job on.

Genie creates a new working directory for each job, stages all the dependencies (including Hadoop, Hive and Pig configurations for the chosen cluster), and then forks off a Hadoop client process from that working directory. It then returns a Genie job ID, which can be used by the clients to query for status, and also to get an output URI, which is browsable during and after job execution (see below). Users can monitor the standard output and error of the Hadoop clients, and also look at Hive and Pig client logs, if anything went wrong.

The Genie execution model is very simple - as mentioned earlier, Genie simply forks off a new process for each job from a new working directory. Other than simplicity, important benefits of this approach include isolation of jobs from each other and from Genie, and easy accessibility of standard output, error and job logs for our end-users (since they are browsable from the output URIs). We made a decision not to queue up jobs in Genie - if we had implemented a job queue, we would have had to implement a fair-share or capacity scheduler for Genie as well, which is already available at the Hadoop level. The downside of this approach is that a JVM is spawned for each job, which implies that Genie can only run a finite number of concurrent jobs on an instance, based on available memory.

Deployment at Netflix

Genie scales horizontally using ASGs (Auto-Scaling Groups) in the cloud, which helps us run several hundreds of concurrent Hadoop jobs in production at Netflix, with the help of Asgard for cloud management and deployment. We use Asgard (see screenshot below) to pick minimum, desired and maximum instances (for horizontal scalability) in multiple availability zones (for fault tolerance). For Genie server pushes, Asgard provides the concept of a “sequential ASG”, which lets us route traffic to new instances of Genie once a new ASG is launched, and turn off traffic to old instances by marking the old ASG out of service.

Using Asgard, we can also set up scaling policies to handle variable loads. The screenshot below shows a sample policy, which increases the number of Genie instances (by one) if the average number of running jobs per instance is greater than or equal to 25.

Usage at Netflix

Genie is being used in production at Netflix to run several thousands of Hadoop jobs daily, processing hundreds of terabytes of data. The screenshot below (from our internal Hadoop investigative tool, code named “Sherlock”) shows some of our clusters over a period of a few months.

The blue line shows one of our SLA clusters, while the orange line shows our main ad-hoc cluster. The red line shows another ad-hoc cluster, with a new experimental version of a fair-share scheduler. Genie was used to route jobs to one of the two ad-hoc clusters at random, and we measured the impact of the new scheduler on the second ad-hoc cluster. When we were satisfied with the performance of the new scheduler, we spun up another larger consolidated ad-hoc cluster with the new scheduler (also shown by the orange line), and all new ad-hoc Genie jobs were now routed to this latest incarnation. The two older clusters were terminated once all running jobs were finished (we call this a “red-black” push).

Summary

Even though Genie is now open source, and has been running in production at Netflix for months, it is still a work in progress. We think of the initial release as version 0. The data model for the services is fairly generic, but definitely biased towards running at Netflix, and in the cloud. We hope for community feedback and contributions to broaden its applicability, and enhance its capabilities.

We will be presenting Genie at the 2013 Hadoop Summit during our session titled “Genie - Hadoop Platform as a Service at Netflix”, and demoing Genie and other tools that are part of the Netflix Hadoop toolkit at the Netflix Booth. Please join us for the presentation, and/or feel free to stop by the booth, chat with the team, and provide feedback.

If you are interested in working on great open source software in the areas of big data and cloud computing, please take a look at jobs.netflix.com for current openings!

References

Genie OSS
Genie Wiki: Getting Started
Netflix Open Source Projects
@NetflixOSS Twitter Feed

Tuesday, June 18, 2013

Announcing Ice: Cloud Spend and Usage Analytics



One of the advantages of moving to the cloud was increased engineering velocity.  Every engineer who needed cloud resources was able to procure them at the click of a button.  This led to an increase in resource usage and allowed us to move more quickly as an organization.  At the same time, seeing the big picture of how many resources were used and by whom became more difficult. In addition, Netflix is a highly decentralized environment where each service team decides how many resources their services need.  The elastic nature of the cloud make capacity planning less crucial and teams can simply add resources as needed.  Viewing the broad picture of cloud resource usage becomes more difficult in such an environment.  To address both needs, Netflix created Ice.


Ice provides a birds-eye view of our large and complex cloud landscape from a usage and cost perspective.  Cloud resources are dynamically provisioned by dozens of service teams within the organization and any static snapshot of resource allocation has limited value.  The ability to trend usage patterns on a global scale, yet decompose them down to a region, availability zone, or service team provides incredible flexibility. Ice allows us to quantify our AWS footprint and to make educated decisions regarding reservation purchases and reallocation of resources.


We are thrilled to announce that today Ice joins the NetflixOSS platform.  You can get the source code on GitHub at https://github.com/Netflix/ice.


Features
Ice communicates with Amazon’s Programmatic Billing Access and maintains knowledge of the following key AWS entity categories:
  • Accounts
  • Regions
  • Services (e.g. EC2, S3, EBS)
  • Usage types (e.g. EC2 - m1.xlarge)
  • Cost and Usage Categories (On-Demand, Un-Used, Reserved, etc.)


The UI allows you to filter directly on the above categories to custom tailor your view and slice and dice your billing data.


In addition, Ice supports the definition of Application Groups. These groups are explicitly defined collections of resources in your organization. Such groups allow usage and cost information to be aggregated by individual service teams within your organization, each consisting of multiple services and resources. Ice also provides the ability to email weekly cost reports for each Application Group showing current usage and past trends.


When representing the cost profile for individual resources, Ice will factor the depreciation schedule into your cost contour, if so desired.  The ability to amortize one-time purchases, such as reservations, over time allows teams to better evaluate their month-to-month cost footprint.


Getting started
After signing up for Programmatic Billing Access, follow the instructions at https://github.com/Netflix/ice#prerequisite to get started.


Conclusion
Ice has provided Netflix with insights into our AWS usage and spending.  It helps us identify inefficient usage and influences our reservation purchases.  It provides our entire product development organization with visibility into how many cloud resource they are using and enables each team to make engineering decisions to better manager usage.  We hope that by releasing it as part of NetflixOSS, the rest of the community can realize similar, and even greater, benefits.


If you are interested in joining us on further building our distributed, scalable, and highly available NetflixOSS platform, please take a look at our jobs listing.

Friday, June 14, 2013

Isthmus - Resiliency against ELB outages

On Christmas Eve, 2012, Netflix streaming service experienced an outage.   For full details, see “A Closer Look at the Christmas Eve Outage” by Adrian Cockcroft.  This outage was particularly painful, both because of the timing, as well as the root cause - ELB control plane, was outside of our ability to correct.  While our applications were running healthy, no traffic was getting to them.  AWS teams worked diligently to correct the problem, though it took several hours to completely resolve the outage.


Following the outage, our teams had many discussions focusing on lessons and takeaways.  We wanted to understand how to strengthen our architecture so we can withstand issues like a region-wide ELB outage without service quality degradation to the Netflix users.  If we wanted to survive such outage, we needed a set of ELB’s hosted at another region that we could use to route the traffic to our backend services.  That was the starting point.


Isthmus
At end of 2012 we were already experimenting with a setup internally referred to as “Isthmus” (definition here), for a different purpose - we wanted to see if setting up a thin layer of ELB + a routing layer at remote AWS region and using persistent long distance connections between the routing layer and the backend services would improve latency of user experience.  We realized we can use a similar setup to achieve multi-regional ELB resiliency.  Under normal operation, traffic would flow through both regions.  If one of the regions would experience ELB issues, we would route via DNS all the traffic through another region.




The routing layer that we used was developed by our API team.  It’s a powerful and flexible layer that can maintain pool of connections, allows smart filtering and much more.  You can find full details at our NetflixOSS GitHub site.  Zuul is at the core of the Isthmus setup - it forwards all of user traffic and establishes the bridge (or an Isthmus) between 2 AWS regions.


We had to make some more changes to our internal infrastructure to support this effort.  Eureka - our service discovery solution normally operated within an AWS region.  In this particular setup, we needed Eureka in US-West2 region to be aware of Netflix services in US-East.  In addition, our middle-tier IPC layer - Ribbon, needed to understand whether to route requests to a service local to the region, or in a remote location.  


We route user traffic to a particular set of ELBs via DNS.  The changes were typically done by one of our engineers through the DNS provider UI console - one endpoint at a time.  This method is manual and does not work well in case of an outage.  Thus, Denominator was born - an open source library to work with DNS providers and allow such changes to be done programmatically.  Now we could automate and repeatedly execute directional DNS changes.


Putting it all together: changing user traffic in production
In the weeks following the outage, we stood up the infrastructure necessary to support Isthmus and were ready to test it out.  After some internal tests, and stress tests by simulating production-level traffic in our test environment, we deployed Isthmus in production, though it was taking no traffic yet.  Since the whole system was brand-new, we proceeded very carefully.  We started with a single endpoint, though a rather important one - our API services.  Gradually, we increased % of production traffic that it was taking: 1%, 5% and so on, until we verified that we could actually route 100% of user traffic through an Isthmus without any detrimental effects to user experience.  Traffic routing was done with DNS geo-directional changes - specifying which States to route to which endpoint.  


After success with the API service working in Isthmus mode, we proceeded to repeat the same setup with other services that enable Netflix streaming.  Not taking any chances, we’ve repeated the same gradual ramp-up and validation as we did with API.  Similar sequence, though at faster ramp-up speeds was followed for the remaining services that ensure user’s ability to browse and stream movies.


Over last 2 months we’ve been shifting production user traffic between AWS regions to reach the desired stable state - where traffic flows approximately 50/50% between 2 US-East and US-West regions.


The best way we can prove that this setup solves the problem we set out to resolve is by actually simulating an ELB outage - and verifying that we could throw all the traffic to another AWS region.  We’re currently planning such "Chaos" exercise and will be executing it shortly.


First step towards the goal of Multi-Regional Resiliency
The work that we’ve done so far improved our architecture to better handle region-wide ELB outages.  ELB is just one service dependency though - and we have many more.  Our goal is to be able to survive any region-wide issue - either a complete AWS Region failure, or a self-inflicted problem - with minimal or no service quality degradation to Netflix users.  For example, the solution we’re working on should mitigate outages like we had on June 5th.  We’re starting to replicate all the data between the AWS regions, and eventually will stand up full complement of services as well.  We’re working on such efforts now, and are looking for a few great engineers to join our Infrastructure teams.  If these are the types of challenges you enjoy - check out Netflix Jobs site for more details. To learn out more about Zuul, Eureka, Ribbon and other NetflixOSS components, join us for the upcoming NetflixOSS Meetup on July 17, 2013.

Wednesday, June 12, 2013

Women in Cloud Meetup at Netflix

By Shobana Radhakrishnan

We recently held a Meetup on our campus for Bay Area women in the Cloud Space, in collaboration with Cloud-NOW. Women from across a number of companies and backgrounds related to Cloud attended the event and participated in talks and panel discussions on various technical topics related to the cloud. I kicked off the evening, introducing Yury Izrailevsky, VP Cloud Computing and Platform Engineering at Netflix. Yury talked about the story of how Netflix entered Cloud Computing with their streaming service and scaled it by 100 times in 4 years, and how women engineers were a significant part of that effort. I also shared how Netflix engineering scales along two dimensions - strong technology and tools leveraging open source extensively, as well as a nimble culture without bureaucracy or unnecessary process.


Cloud-NOW's Rita Scroggin, as co-host, introduced this non-profit consortium of the leading women in cloud computing, focused on using technology for the overall professional development of women (cloud-now.org).

Keynote speaker Annika Jimenez took the stage next. Annika is Global head of Data Science Services at Pivotal, the brand-new Big Data spinoff of EMC and VMware, in which GE has invested as well. Annika shared with the audience the reasons for why data science is changing the way data computing is done. She showed how internet giants like Netflix, Google, Yahoo! and Facebook are leading the way to big data in the cloud, and that so much more work remains to be done.

Devika Chawla spoke next. She leads the Netflix engineering team that is moving all the customer messaging to the cloud - those messages we get inviting us to join, create accounts, watch suggested movies, provide commentary and rejoin if we happen to have lapsed. Devika's team must be able to do this across devices (phone, iPad, TV…), to individuals as well as the entire Netflix user base, and ever-faster. Building in the Cloud enables them to meet this challenge scalably and cost-effectively.

This was followed by various breakout sessions covering topics such as Cloud Security (IBM’s Uma Subramaniam with Netflix’s Jason Chan as co-host), Testing in a multi-cloud environment (Dell’s Seema Jethani with Netflix’s Sangeeta Narayanan as co-host) and Metrics for the cloud (Jeremy Edberg from Netflix and Globalization Expert Jessica Roland). Fang Ji also gave a peek at internal Netflix tools for monitoring cloud cost and performance, including AWSUsage which will be OSS soon. Panel Discussion on Engineering leadership from Verticloud leaders Ellen Salisbury and Anna Sidana, as well as insight into how the Freedom and Responsibility culture works for engineering from Netflix VP of Talent Acquisition Jessica Neal, rounded out the discussion sessions.


Afterward, participants were treated to some incredible sushi and a tour of some of our most exciting product demos, including some of the popular Netflix open source contributions. This included:
  • Asgard - web interface for application deployments and management in Amazon Web Services (AWS)
  • Garbage Collection Visualization (or GCViz) - Tool that turns the semi-structured data from the java garbage collector's gc.log into time-series charts for easy visual analysis. On Github here.
  • Genie, Lipstick and Sting - suite of tools, deployable on top of the Hadoop ecosystem, that enables even non-technical users to develop, tune, and maintain efficient Hadoop workflows and easily interact with and visualize datasets.
  • Hystrix and Turbine - used to understand API traffic patterns and system behavior. More at this blog post.

With about two petabytes of data in cloud, serving more than 36 million subscribers across more than 50 countries and territories, Netflix is always evolving new tools to manage our cloud systems, and encouraging innovation with prizes like our Cloud Prize. Deadline is September 15!
We are also always seeking the best engineering talent to help - check out jobs.netflix.com if interested in solving these challenges.

Announcing Zuul: Edge Service in the Cloud

The Netflix streaming application is a complex array of intertwined systems that work together to seamlessly provide our customers a great experience. The Netflix API is the front door to that system, supporting over 1,000 different device types and handing over 50,000 requests per second during peak hours. We are continually evolving by adding new features every day.  Our user interface teams, meanwhile,  continuously push changes to server-side client adapter scripts to support new features and AB tests.  New AWS regions are deployed to and catalogs are added for new countries to support international expansion.  To handle all of these changes, as well as other challenges in supporting a complex and high-scale system, a robust edge service that enables rapid development, great flexibility, expansive insights, and resiliency is needed.

Today, we are pleased to introduce Zuul  our answer to these challenges and the latest addition to our our open source suite of software  Although Zuul is an edge service originally designed to front the Netflix API, it is now being used in a variety of ways by a number of systems throughout Netflix. 


Zuul in Netflix's Cloud Architecture


How Does Zuul Work?

At the center of Zuul is a series of filters that are capable of performing a range of actions during the routing of HTTP requests and responses.  The following are the key characteristics of a Zuul filter:
  • Type: most often defines the stage during the routing flow when the filter will be applied (although it can be any custom string)
  • Execution Order: applied within the Type, defines the order of execution across multiple filters
  • Criteria: the conditions required in order for the filter to be executed
  • Action: the action to be executed if the Criteria are met
Here is an example of a simple filter that delays requests from a malfunctioning device in order to distribute the load on our origin:


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
class DeviceDelayFilter extends ZuulFilter {

    def static Random rand = new Random()
    
    @Override
     String filterType() {
       return 'pre'
     }

    @Override
    int filterOrder() {
       return 5
    }

    @Override
    boolean shouldFilter() {
 return  RequestContext.getRequest().
     getParameter("deviceType")?equals("BrokenDevice"):false
    }

    @Override
    Object run() {
 sleep(rand.nextInt(20000)) //Sleep for a random number of seconds
                                   //between [0-20]
    }
}

Zuul provides a framework to dynamically read, compile, and run these filters. Filters do not communicate with each other directly - instead they share state through a RequestContext which is unique to each request.

Filters are currently written in Groovy, although Zuul supports any JVM-based language. The source code for each filter is written to a specified set of directories on the Zuul server that are periodically polled for changes. Updated filters are read from disk, dynamically compiled into the running server, and are invoked by Zuul for each subsequent request.

Zuul Core Architecture

There are several standard filter types that correspond to the typical lifecycle of a request:
  • PRE filters execute before routing to the origin. Examples include request authentication, choosing origin servers, and logging debug info.
  • ROUTING filters handle routing the request to an origin. This is where the origin HTTP request is built and sent using Apache HttpClient or Netflix Ribbon.
  • POST filters execute after the request has been routed to the origin.  Examples include adding standard HTTP headers to the response, gathering statistics and metrics, and streaming the response from the origin to the client.
  • ERROR filters execute when an error occurs during one of the other phases.
Request Lifecycle
Alongside the default filter flow, Zuul allows us to create custom filter types and execute them explicitly.  For example, Zuul has a STATIC type that generates a response within Zuul instead of forwarding the request to an origin.  


How We Use Zuul

There are many ways in which Zuul helps us run the Netflix API and the overall Netflix streaming application.  Here is a short list of some of the more common examples, and for some we will go into more detail below:
  • Authentication
  • Insights
  • Stress Testing
  • Canary Testing
  • Dynamic Routing
  • Load Shedding
  • Security
  • Static Response handling
  • Multi-Region Resiliency 


Insights

Zuul gives us a lot of insight into our systems, in part by making use of other Netflix OSS components.  Hystrix is used to wrap calls to our origins, which allows us to shed and prioritize traffic when issues occur.  Ribbon is our client for all outbound requests from Zuul, which provides detailed information into network performance and errors, as well as handles software load balancing for even load distribution.  Turbine aggregates fine-grained metrics in real-time so that we can quickly observe and react to problems.  Archaius handles configuration and gives the ability to dynamically change properties. 

Because Zuul can add, change, and compile filters at run-time, system behavior can be quickly altered. We add new routes, assign authorization access rules, and categorize routes all by adding or modifying filters. And when unexpected conditions arise, Zuul has the ability to quickly intercept requests so we can explore, workaround, or fix the problem. 

The dynamic filtering capability of Zuul allows us to find and isolate problems that would normally be difficult to locate among our large volume of requests.  A filter can be written to route a specific customer or device to a separate API cluster for debugging.  This technique was used when a new page from the website needed tuning.  Performance problems, as well as unexplained errors were observed. It was difficult to debug the issues because the problems were only happening for a small set of customers. By isolating the traffic to a single instance, patterns and discrepancies in the requests could be seen in real time. Zuul has what we call a “SurgicalDebugFilter”. This is a special “pre” filter that will route a request to an isolated cluster if the patternMatches() criteria is true.  Adding this filter to match for the new page allowed us to quickly identify and analyze the problem.  Prior to using Zuul, Hadoop was being used to query through billions of logged requests to find the several thousand requests for the new page.  We were able to reduce the problem to a search through a relatively small log file on a few servers and observe behavior in real time.

The following is an example of the SurgicalDebugFilter that is used to route matched requests to a debug cluster:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
class SharpDebugFilter extends SurgicalDebugFilter {
   private static final Set<String> DEVICE_IDS = ["XXX", "YYY", "ZZZ"]
   @Override
   boolean patternMatches() {
       final RequestContext ctx = RequestContext.getCurrentContext()
       final String requestUri = ctx.getRequest().getRequestURI();
       final String contextUri = ctx.requestURI;
       String id = HTTPRequestUtils.getInstance().
           getValueFromRequestElements("deviceId");
       return DEVICE_IDS.contains(id);
  }
}
In addition to dynamically re-routing requests that match a specified criteria, we have an internal system, built on top of Zuul and Turbine, that allows us to display a real-time streaming log of all matching requests/responses across our entire cluster.  This internal system allows us to quickly find patterns of anomalous behavior, or simply observe that some segment of traffic is behaving as expected,  (by asking questions such as "how many PS3 API requests are coming from Sao Paolo”)?


Stress Testing 

Gauging the performance and capacity limits of our systems is important for us to predict our EC2 instance demands, tune our autoscaling policies, and keep track of general performance trends as new features are added.  An automated process that uses dynamic Archaius configurations within a Zuul filter steadily increases the traffic routed to a small cluster of origin servers. As the instances receive more traffic, their performance characteristics and capacity are measured. This informs us of how many EC2 instances will be needed to run at peak, whether our autoscaling policies need to be modified, and whether or not a particular build has the required performance characteristics to be pushed to production.


Multi-Region Resiliency

Zuul is central to our multi-region ELB resiliency project called Isthmus. As part of Isthmus, Zuul is used to bridge requests from the west coast cloud region to the east coast to help us have multi-region redundancy in our ELBs for our critical domains. Stay tuned for a tech blog post about our Isthmus initiative. 


Zuul OSS

Today, we are open sourcing Zuul as a few different components:
  • zuul-core - A library containing a set of core features.
  • zuul-netflix - An extension library using many Netflix OSS components:
    • Servo for insights, metrics, monitoring
    • Hystrix for real time metrics with Turbine
    • Eureka for instance discovery
    • Ribbon for routing
    • Archaius for real-time configuration
    • Astyanax for and filter persistence in Cassandra
  • zuul-filters - Filters to work with zuul-core and zuul-netflix libraries 
  • zuul-webapp-simple -  A simple example of a web application built on zuul-core including a few basic filters
  • zuul-netflix-webapp- A web application putting zuul-core, zuul-netflix, and zuul-filters together.
Netflix OSS libraries in Zuul
Putting everything together, we are also providing a web application built on zuul-core and zuul-netflix.  The application also provides many helpful filters for things such as:
  • Weighted load balancing to balance a percentage of load to a certain server or cluster for capacity testing
  • Request debugging
  • Routing filters for Apache HttpClient and Netflix Ribbon
  • Statistics collecting
We hope that this project will be useful for your application and will demonstrate the strength of our open source projects when using Zuul as a glue across them, and encourage you to contribute to Zuul to make it even better. Also, if this type of technology is as exciting to you as it is to us, please see current openings on our team: jobs  

Mikey Cohen - API Platform
Matt Hawthorne - API Platform