Thursday, June 28, 2012

Scalable Logging and Tracking

Scalable Logging

by Kedar Sadekar


At Netflix we work hard to improve personalized recommendations. We use a lot of data to make recommendations better. What may seem an arbitrary action -- scrolling up, down, left or right and how much -- actually provides us with valuable information. We work to get all the necessary data points and feedback to provide the best user experience.

It is obvious that to capture the large amount of data generated, we need a dedicated, fast, scalable and highly available and asynchronous collection system that does not slow the user experience.
In this post we discuss the decisions and considerations that went into building a service that accepts a few billion requests a day, processing and storing these requests for later use and analysis by various systems within Netflix.

Considerations

We did not want this service to disrupt the user experience, hence, the main objective was as low a latency as possible. It also needed to scale to handle billions of requests a day. The data sent to and processed by this service is noncritical data. That was an important factor in our design where we made a conscious choice of being ok with dropping data (user events) as opposed to providing a sub-optimal client experience. From the client side, the call is fire-and-forget. That essentially means that the client should not care what the end result of the call was (success/failure).

Data Size

The average size of the request and the logged data is around 16 KB (range: 800 bytes ~ 130 KB) whereas the response average is pretty consistent at around 512 bytes. Here is an example of the data (fields) that we capture: video, device, page, timestamp.

Latency

The service needs to handle a billion plus requests a day, and peak traffic could be 3 - 6 times the average when measured in terms of requests per second (RPS). To achieve our goal of having a low millisecond latency for this service. Here are some of the practises we adopted:

Holding on to the request is expensive

This service is developed using Java, deployed on a standard Tomcat container. To achieve a high throughput, we want to free up Tomcat threads as soon as we can. To do that we do not hold on to the request for any longer than required. The methodology we used was simple, we grab whatever data we need from the HTTP request object, push it onto a thread pool for processing later and flush the response to the client immediately. Holding on to the request for any longer translates to a smaller throughput per node in the cluster. A lower throughput per node in the cluster means having to scale more horizontally and scaling horizontally beyond a point is inefficient and cost ineffective.

Fail fast / first

Return as quickly as you can, which means you try to identify your failure cases first, before doing any unnecessary processing. Return as soon as you know there is no point moving forward.
An example: If your data must have some data from the cookie, try to crack the cookie first, before dealing with any other request parameters. If the cookie does not have the required data, return, don’t bother looking at any other data the request body contains.  

HTTP codes  

We captured all the 4xx / 5xx / 2xx responses that we serve. Some services don’t care about a failure, in those cases, we just returned a HTTP 202 (accepted) response. Having these metrics in place helps you tune your code, and if the calling service does not care, why bother returning a 4xx response. We have alert triggering mechanisms based on the percentage of the HTTP response codes.

Dependencies can and will slow down sometimes

We did an exercise to identify all dependencies (other Netflix services / jars) that this service depended on which were going to make across the wire calls. We have learned that however reliable and robust the dependencies are, there will be network glitches and service latency issues at some point or another. We do not want the logging service to be bogged down by such issues,
For any such service calls, we guard them by wrapping them using Java Futures with appropriate timeouts. Aggressive timeouts were specially reserved for those calls that were in the hot path (before the response is flushed). Adding a lot of metrics helped in understanding if a service was timing out too often or was the slowest.

Process Later

Once we had all the data we needed, we put into a queue for asynchronous execution by an executor pool.
The following diagram illustrates what has been described above.

 

Garbage Collection


For a service written entirely in Java, an important factor when deploying is pause times during Garbage Collections. The nature of this service is an extremely large volume of really short-lived objects. We played around with GC tuning variables to achieve the  best throughput. As part of these experiments, we tried various combinations of the parallel generational collector and the CMS (Concurrent Mark Sweep) collector too. We setup canaries taking peak production traffic for at least a couple of days with different combinations for young gen to heap ratios.
Each time we had a winner, we pitted the CMS canary against the best canary with the parallel collector. We did this 2-3 times until we were sure we had a winner.
The winner was analyzed by capturing the GC logs and mining them for timings and counts of new gen (par-new), Full GC’s and CMS failures (if any) etc.  We learned that having canaries is the only way of knowing for sure. Don’t be in a hurry to pick a winner.

Auto Scaling

Measure, Measure

Since traffic (rps) is unpredictable, at Netflix heavily leverage auto-scaling policies. There are different metrics that one could use to auto-scale a cluster, the most common ones being CPU load and RPS. We chose to primarily use RPS. CPU load is used to trigger alerts both at instance and cluster levels. A lot of the metrics gathered are powered by our own Servo code (available on github). 
We have collected the metrics over a few days, including peak traffic at weekends and then applied the policies that enable us to effectively scale in the cloud. [See reference on auto-scaling]

Have Knobs

All these throughput measurements were done in steps. We had knobs in place that allowed us to slowly ramp-up traffic, observe the system behavior and make necessary changes, gaining confidence in what the system could handle.
Here is a graph showing the RPS followed by a graph showing the average latency metrics (in milliseconds) over the same period.





Persistence

The real magic of such voluminous data collection and aggregation is actually done by our internal log collectors. Individual machines have agents that send the logs to collectors and finally to the data sink (Hive for example).



Common Infrastructure / Multiple end-points
As different teams within Netflix churn out different features and algorithms, the need to measure the efficacy and success of those never diminishes. However, those teams would love to focus on their core competencies instead of having to setup up a logging / tracking infrastructure that caters to their individual needs.
It made perfect sense for those teams to direct their traffic to the logging service. Since the data required by each team is disparate, each of these teams’ needs is considered as a new end-point on the logging service.
Supporting a new client is simple, with the main decision being whether the traffic warrants an independent cluster or can be co-deployed with a cluster that supports other end-points.

When a single service exposes multiple end-points with hundreds of millions of requests a day per end point, we needed to decide between just scaling horizontally forever or break it down into multiple clusters by functionality. There are pros / cons of doing it either way. Here are a few:

Pros of single cluster
-       Single deployment
-       One place to manage / track
Pros of multiple deployment
-       Failure in one end-point does not affect another, especially in internal dependencies
-       Ability to independently scale up/down volume
-       Easier to debug issues

Conclusion

As the traffic was ramped up, we have been able to scale up very comfortably so far learning, lessons as we went along.
Data is being analyzed multiple ways by our algorithmic teams. For example - which row types (Top 10, most recently watched etc.) did most plays emanate from. How did that vary by country and device. How far did users scroll left / right across devices - and do users ever go beyond a certain point. These and many other data points are being examined to improve our algorithms to provide users with a better viewing experience.  

References

Servo : 

Auto-scaling: 

Join Us

Like what you see and want to work on bleeding edge performance and scale?
by Kedar Sadekar, Senior Software Engineer, Product Infrastructure Team




Monday, June 25, 2012

Asgard: Web-based Cloud Management and Deployment

By Joe Sondow, Engineering Tools

For the past several years Netflix developers have been using self-service tools to build and deploy hundreds of applications and services to the Amazon cloud. One of those tools is Asgard, a web interface for application deployments and cloud management.
Asgard is named for the home of the Norse god of thunder and lightning, because Asgard is where Netflix developers go to control the clouds. I’m happy to announce that Asgard has now been open sourced on github and is available for download and use by anyone. All you’ll need is an Amazon Web Services account. Like other open source Netflix projects, Asgard is released under the Apache License, Version 2.0. Please feel free to fork the project and make improvements to it.
Some of the information in this blog post is also published in the following presentations. Note that Asgard was originally named the Netflix Application Console, or NAC.

Visual Language for the Cloud

To help people identify various types of cloud entities, Asgard uses the Tango open source icon set, with a few additions. These icons help establish a visual language to help people understand what they are looking at as they navigate. Tango icons look familiar because they are also used by Jenkins, Ubuntu, Mediawiki, Filezilla, and Gimp. Here is a sampling of Asgard's cloud icons.

Cloud Model

The Netflix cloud model includes concepts that AWS does not support directly: Applications and Clusters.

Application

Below is a diagram of some of the Amazon objects required to run a single front-end application such as Netflix’s autocomplete service.
Here’s a quick summary of the relationships of these cloud objects.
  • An Auto Scaling Group (ASG) can attach zero or more Elastic Load Balancers (ELBs) to new instances.
  • An ELB can send user traffic to instances.
  • An ASG can launch and terminate instances.
  • For each instance launch, an ASG uses a Launch Configuration.
  • The Launch Configuration specifies which Amazon Machine Image (AMI) and which Security Groups to use when launching an instance.
  • The AMI contains all the bits that will be on each instance, including the operating system, common infrastructure such as Apache and Tomcat, and a specific version of a specific Application.
  • Security Groups can restrict the traffic sources and ports to the instances.
That’s a lot of stuff to keep track of for one application.
When there are large numbers of those cloud objects in a service-oriented architecture (like Netflix has), it’s important for a user to be able to find all the relevant objects for their particular application. Asgard uses an application registry in SimpleDB and naming conventions to associate multiple cloud objects with a single application. Each application has an owner and an email address to establish who is responsible for the existence and state of the application's associated cloud objects.
Asgard limits the set of permitted characters in the application name so that the names of other cloud objects can be parsed to determine their association with an application.
Here is a screenshot of Asgard showing a filtered subset of the applications running in our production account in the Amazon cloud in the us-east-1 region:
Screenshot of a detail screen for a single application, with links to related cloud objects:

Cluster

On top of the Auto Scaling Group construct supplied by Amazon, Asgard infers an object called a Cluster which contains one or more ASGs. The ASGs are associated by naming convention. When a new ASG is created within a cluster, an incremented version number is appended to the cluster's "base name" to form the name of the new ASG. The Cluster provides Asgard users with the ability to perform a deployment that can be rolled back quickly.
Example: During a deployment, cluster obiwan contains ASGs obiwan-v063 and obiwan-v064. Here is a screenshot of a cluster in mid-deployment.
The old ASG is “disabled” meaning it is not taking traffic but remains available in case a problem occurs with the new ASG. Traffic comes from ELBs and/or from Discovery, an internal Netflix service that is not yet open sourced.

Deployment Methods

Fast Rollback

One of the primary features of Asgard is the ability to use the cluster screen shown above to deploy a new version of an application in a way that can be reversed at the first sign of trouble. This method requires more instances to be in use during deployment, but it can greatly reduce the duration of service outages caused by bad deployments.
This animated diagram shows a simplified process of using the Cluster interface to try out a deployment and roll it back quickly when there is a problem:
The animation illustrates the following deployment use case:
  1. Create the new ASG obiwan-v064
  2. Enable traffic to obiwan-v064
  3. Disable traffic on obiwan-v063
  4. Monitor results and notice that things are going badly
  5. Re-enable traffic on obiwan-v063
  6. Disable traffic on obiwan-v064
  7. Analyze logs on bad servers to diagnose problems
  8. Delete obiwan-v064

Rolling Push

Asgard also provides an alternative deployment system called a rolling push. This is similar to a conventional data center deployment of a cluster on application servers. Only one ASG is needed. Old instances get gracefully deleted and replaced by new instances one or two at a time until all the instances in the ASG have been replaced. Rolling pushes are useful:
  1. If an ASG's instances are sharded so each instance has a distinct purpose that should not be duplicated by another instance.
  2. If the clustering mechanisms of the application (such as Cassandra) cannot support sudden increases in instance count for the cluster.
Downsides to a rolling push:
  1. Replacing instances in small batches can take a long time.
  2. Reversing a bad deployment can take a long time.

Task Automation

Several common tasks are built into Asgard to automate the deployment process. Here is an animation showing a time-compressed view of a 14-minute automated rolling push in action:

Auto Scaling

Netflix focuses on the ASG as the primary unit of deployment, so Asgard also provides a variety of graphical controls for modifying an ASG and setting up metrics-driven auto scaling when desired.
CloudWatch metrics can be selected from the default provided by Amazon such as CPUUtilization, or can be custom metrics published by your application using a library like Servo for Java.

Why not the AWS Management Console?

The AWS Management Console has its uses for someone with your Amazon account password who needs to configure something Asgard does not provide. However, for everyday large-scale operations, the AWS Management Console has not yet met the needs of the Netflix cloud usage model, so we built Asgard instead. Here are some of the reasons.
  • Hide the Amazon keys

    Netflix grants its employees a lot of freedom and responsibility, including the rights and duties of enhancing and repairing production systems. Most of those systems run in the Amazon cloud. Although we want to enable hundreds of engineers to manage their own cloud apps, we prefer not to give all of them the secret keys to access the company’s Amazon accounts directly. Providing an internal console allows us to grant Asgard users access to our Amazon accounts without telling too many employees the shared cloud passwords. This strategy also saves us from needing to assign and revoke hundreds of Identity and Access Management (IAM) cloud accounts for employees.
  • Auto Scaling Groups

    As of this writing the AWS Management Console lacks support for Auto Scaling Groups (ASGs). Netflix relies on ASGs as the basic unit of deployment and management for instances of our applications. One of our goals in open sourcing Asgard is to help other Amazon customers make greater use of Amazon’s sophisticated auto scaling features. ASGs are a big part of the Netflix formula to provide reliability, redundancy, cost savings, clustering, discoverability, ease of deployment, and the ability to roll back a bad deployment quickly.
  • Enforce Conventions

    Like any growing collection of things users are allowed to create, the cloud can easily become a confusing place full of expensive, unlabeled clutter. Part of the Netflix Cloud Architecture is the use of registered services associated with cloud objects by naming convention. Asgard enforces these naming conventions in order to keep the cloud a saner place that is possible to audit and clean up regularly as things get stale, messy, or forgotten.
  • Logging

    So far the AWS console does not expose a log of recent user actions on an account. This makes it difficult to determine whom to call when a problem starts, and what recent changes might relate to the problem. Lack of logging is also a non-starter for any sensitive subsystems that legally require auditability.
  • Integrate Systems

    Having our own console empowers us to decide when we want to add integration points with our other engineering systems such as Jenkins and our internal Discovery service.
  • Automate Workflow

    Multiple steps go into a safe, intelligent deployment process. By knowing certain use cases in advance Asgard can perform all the necessary steps for a deployment based on one form submission.
  • Simplify REST API

    For common operations that other systems need to perform, we can expose and publish our own REST API to do exactly what we want in a way that hides some of the complex steps from the user.

Costs

When using cloud services, it’s important to keep a lid on your costs. As of June 5, 2012, Amazon now provides a way to track your account’s charges frequently. This data is not exposed through Asgard as of this writing, but someone in your company should keep track of your cloud costs regularly. See http://aws.typepad.com/aws/2012/06/new-programmatic-access-to-aws-billing-data.html
Starting up Asgard does not initially cause you to incur any Amazon charges, because Amazon has a free tier for SimpleDB usage and no charges for creating Security Groups, Launch Configurations, or empty Auto Scaling Groups. However, as soon as you increase the size of an ASG above zero Amazon will begin charging you for instance usage, depending on your status for Amazon’s Free Usage Tier. Creating ELBs, RDS instances, and other cloud objects can also cause you to incur charges. Become familiar with the costs before creating too many things in the cloud, and remember to delete your experiments as soon as you no longer need them. Your Amazon costs are your own responsibility, so run your cloud operations wisely.

Feature Films

By extraordinary coincidence, Thor and Thor: Tales of Asgard are now available to watch on Netflix streaming.

Conclusion

Asgard has been one of the primary tools for application deployment and cloud management at Netflix for years. By releasing Asgard to the open source community we hope more people will find the Amazon cloud and Auto Scaling easier to work with, even at large scale like Netflix. More Asgard features will be released regularly, and we welcome participation by users on GitHub.
Follow the Netflix Tech Blog and the @NetflixOSS twitter feed for more open source components of the Netflix Cloud Platform.
If you're interested in working with us to solve more of these interesting problems, have a look at the Netflix jobs page to see if something might suit you. We're hiring!

Related Resources

Asgard

Netflix Cloud Platform

Amazon Web Services

Wednesday, June 20, 2012

Netflix Recommendations: Beyond the 5 stars (Part 2)

by Xavier Amatriain and Justin Basilico (Personalization Science and Engineering)
 
In part one of this blog post, we detailed the different components of Netflix personalization. We also explained how Netflix personalization, and the service as a whole, have changed from the time we announced the Netflix Prize.The $1M Prize delivered a great return on investment for us, not only in algorithmic innovation, but also in brand awareness and attracting stars (no pun intended) to join our team. Predicting movie ratings accurately is just one aspect of our world-class recommender system. In this second part of the blog post, we will give more insight into our broader personalization technology. We will discuss some of our current models, data, and the approaches we follow to lead innovation and research in this space.

Ranking

The goal of recommender systems is to present a number of attractive items for a person to choose from. This is usually accomplished by selecting some items and sorting them in the order of expected enjoyment (or utility). Since the most common way of presenting recommended items is in some form of list, such as the various rows on Netflix, we need an appropriate ranking model that can use a wide variety of information to come up with an optimal ranking of the items for each of our members.

If you are looking for a ranking function that optimizes consumption, an obvious baseline is item popularity. The reason is clear: on average, a member is most likely to watch what most others are watching. However, popularity is the opposite of personalization: it will produce the same ordering of items for every member. Thus, the goal becomes to find a personalized ranking function that is better than item popularity, so we can better satisfy members with varying tastes.

Recall that our goal is to recommend the titles that each member is most likely to play and enjoy. One obvious way to approach this is to use the member's predicted rating of each item as an adjunct to item popularity. Using predicted ratings on their own as a ranking function can lead to items that are too niche or unfamiliar being recommended, and can exclude items that the member would want to watch even though they may not rate them highly. To compensate for this, rather than using either popularity or predicted rating on their own, we would like to produce rankings that balance both of these aspects. At this point, we are ready to build a ranking prediction model using these two features.

There are many ways one could construct a ranking function ranging from simple scoring methods, to pairwise preferences, to optimization over the entire ranking. For the purposes of illustration, let us start with a very simple scoring approach by choosing our ranking function to be a linear combination of popularity and predicted rating. This gives an equation of the form frank(u,v) = w1 p(v) + w2 r(u,v) + b, where u=user, v=video item, p=popularity and r=predicted rating. This equation defines a two-dimensional space like the one depicted below.

Once we have such a function, we can pass a set of videos through our function and sort them in descending order according to the score. You might be wondering how we can set the weights w1 and w2 in our model (the bias b is constant and thus ends up not affecting the final ordering). In other words, in our simple two-dimensional model, how do we determine whether popularity is more or less important than predicted rating? There are at least two possible approaches to this. You could sample the space of possible weights and let the members decide what makes sense after many A/B tests. This procedure might be time consuming and not very cost effective. Another possible answer involves formulating this as a machine learning problem: select positive and negative examples from your historical data and let a machine learning algorithm learn the weights that optimize your goal. This family of machine learning problems is known as "Learning to rank" and is central to application scenarios such as search engines or ad targeting. Note though that a crucial difference in the case of ranked recommendations is the importance of personalization: we do not expect a global notion of relevance, but rather look for ways of optimizing a personalized model.

As you might guess, apart from popularity and rating prediction, we have tried many other features at Netflix. Some have shown no positive effect while others have improved our ranking accuracy tremendously. The graph below shows the ranking improvement we have obtained by adding different features and optimizing the machine learning algorithm.


Many supervised classification methods can be used for ranking. Typical choices include Logistic Regression, Support Vector Machines, Neural Networks, or Decision Tree-based methods such as Gradient Boosted Decision Trees (GBDT). On the other hand, a great number of algorithms specifically designed for learning to rank have appeared in recent years such as RankSVM or RankBoost. There is no easy answer to choose which model will perform best in a given ranking problem. The simpler your feature space is, the simpler your model can be. But it is easy to get trapped in a situation where a new feature does not show value because the model cannot learn it. Or, the other way around, to conclude that a more powerful model is not useful simply because you don't have the feature space that exploits its benefits.

Data and Models

The previous discussion on the ranking algorithms highlights the importance of both data and models in creating an optimal personalized experience for our members. At Netflix, we are fortunate to have many relevant data sources and smart people who can select optimal algorithms to turn data into product features. Here are some of the data sources we can use to optimize our recommendations:
  • We have several billion item ratings from members. And we receive millions of new ratings a day.
  • We already mentioned item popularity as a baseline. But, there are many ways to compute popularity. We can compute it over various time ranges, for instance hourly, daily, or weekly. Or, we can group members by region or other similarity metrics and compute popularity within that group.
  • We receive several million stream plays each day, which include context such as duration, time of day and device type.
  • Our members add millions of items to their queues each day.
  • Each item in our catalog has rich metadata: actors, director, genre, parental rating, and reviews.
  • Presentations: We know what items we have recommended and where we have shown them, and can look at how that decision has affected the member's actions. We can also observe the member's interactions with the recommendations: scrolls, mouse-overs, clicks, or the time spent on a given page.
  • Social data has become our latest source of personalization features; we can process what connected friends have watched or rated.
  • Our members directly enter millions of search terms in the Netflix service each day.
  • All the data we have mentioned above comes from internal sources. We can also tap into external data to improve our features. For example, we can add external item data features such as box office performance or critic reviews.
  • Of course, that is not all: there are many other features such as demographics, location, language, or temporal data that can be used in our predictive models.
So, what about the models? One thing we have found at Netflix is that with the great availability of data, both in quantity and types, a thoughtful approach is required to model selection, training, and testing. We use all sorts of machine learning approaches: From unsupervised methods such as clustering algorithms to a number of supervised classifiers that have shown optimal results in various contexts. This is an incomplete list of methods you should probably know about if you are working in machine learning for personalization:
  • Linear regression
  • Logistic regression
  • Elastic nets
  • Singular Value Decomposition
  • Restricted Boltzmann Machines
  • Markov Chains
  • Latent Dirichlet Allocation
  • Association Rules
  • Gradient Boosted Decision Trees
  • Random Forests
  • Clustering techniques from the simple k-means to novel graphical approaches such as Affinity Propagation
  • Matrix factorization

Consumer Data Science

The abundance of source data, measurements and associated experiments allow us to operate a data-driven organization. Netflix has embedded this approach into its culture since the company was founded, and we have come to call it Consumer (Data) Science. Broadly speaking, the main goal of our Consumer Science approach is to innovate for members effectively. The only real failure is the failure to innovate; or as Thomas Watson Sr, founder of IBM, put it: “If you want to increase your success rate, double your failure rate.” We strive for an innovation culture that allows us to evaluate ideas rapidly, inexpensively, and objectively. And, once we test something we want to understand why it failed or succeeded. This lets us focus on the central goal of improving our service for our members.

So, how does this work in practice? It is a slight variation over the traditional scientific process called A/B testing (or bucket testing):

1. Start with a hypothesis
  • Algorithm/feature/design X will increase member engagement with our service and ultimately member retention
2. Design a test
  • Develop a solution or prototype. Ideal execution can be 2X as effective as a prototype, but not 10X.
  • Think about dependent & independent variables, control, significance…
3. Execute the test

4. Let data speak for itself

When we execute A/B tests, we track many different metrics. But we ultimately trust member engagement (e.g. hours of play) and retention. Tests usually have thousands of members and anywhere from 2 to 20 cells exploring variations of a base idea. We typically have scores of A/B tests running in parallel. A/B tests let us try radical ideas or test many approaches at the same time, but the key advantage is that they allow our decisions to be data-driven. You can read more about our approach to A/B Testing in this previous tech blog post or in some of the Quora answers by our Chief Product Officer Neil Hunt.

An interesting follow-up question that we have faced is how to integrate our machine learning approaches into this data-driven A/B test culture at Netflix. We have done this with an offline-online testing process that tries to combine the best of both worlds. The offline testing cycle is a step where we test and optimize our algorithms prior to performing online A/B testing. To measure model performance offline we track multiple metrics used in the machine learning community: from ranking measures such as normalized discounted cumulative gain, mean reciprocal rank, or fraction of concordant pairs, to classification metrics such as accuracy, precision, recall, or F-score. We also use the famous RMSE from the Netflix Prize or other more exotic metrics to track different aspects like diversity. We keep track of how well those metrics correlate to measurable online gains in our A/B tests. However, since the mapping is not perfect, offline performance is used only as an indication to make informed decisions on follow up tests.

Once offline testing has validated a hypothesis, we are ready to design and launch the A/B test that will prove the new feature valid from a member perspective. If it does, we will be ready to roll out in our continuous pursuit of the better product for our members. The diagram below illustrates the details of this process.


An extreme example of this innovation cycle is what we called the Top10 Marathon. This was a focused, 10-week effort to quickly test dozens of algorithmic ideas related to improving our Top10 row. Think of it as a 2-month hackathon with metrics. Different teams and individuals were invited to contribute ideas and code in this effort. We rolled out 6 different ideas as A/B tests each week and kept track of the offline and online metrics. The winning results are already part of our production system.

 

Conclusion

The Netflix Prize abstracted the recommendation problem to a proxy question of predicting ratings. But member ratings are only one of the many data sources we have and rating predictions are only part of our solution. Over time we have reformulated the recommendation problem to the question of optimizing the probability a member chooses to watch a title and enjoys it enough to come back to the service. More data availability enables better results. But in order to get those results, we need to have optimized approaches, appropriate metrics and rapid experimentation.

To excel at innovating personalization, it is insufficient to be methodical in our research; the space to explore is virtually infinite. At Netflix, we love choosing and watching movies and TV shows. We focus our research by translating this passion into strong intuitions about fruitful directions to pursue; under-utilized data sources, better feature representations, more appropriate models and metrics, and missed opportunities to personalize. We use data mining and other experimental approaches to incrementally inform our intuition, and so prioritize investment of effort. As with any scientific pursuit, there’s always a contribution from Lady Luck, but as the adage goes, luck favors the prepared mind. Finally, above all, we look to our members as the final judges of the quality of our recommendation approach, because this is all ultimately about increasing our members' enjoyment in their own Netflix experience. We are always looking for more people to join our team of "prepared minds". Make sure you take a look at our jobs page.

Monday, June 18, 2012

Announcing Archaius: Dynamic Properties in the Cloud

By Allen Wang and Sudhir Tonse


Netflix has a culture of being dynamic when it comes to decision making. This trait comes across both in the business domain as well as in technology and operations.
It follows that we like the ability to effect changes in the behavior of our deployed services dynamically at run-time. Availability is of the utmost importance to us, so we would like to accomplish this without having to bounce servers.
Furthermore, we want the ability to dynamically change properties (and hence the logic and behavior of our services) based on a request or deployment context. For example, we want to configure properties for an application instance or request, based on factors like the Amazon Region the service is deployed in, the country of origin (of the request), the device the movie is playing on etc.

What is Archaius?

 








(Image obtained from http://en.wikipedia.org/wiki/File:Calumma_tigris-2.jpg)

Archaius, is the dynamic, multi dimensional, properties framework that addresses these requirements and use cases.
The code name for the project comes from an endangered species of Chameleons. More information can be found at http://en.wikipedia.org/wiki/Archaius_tigris. We chose Archaius, as Chameleons are known for changing their color (a property) based on their environment and situation.


We are pleased to announce the public availability of Archaius as an important milestone in our continued goal of open sourcing the Netflix Platform Stack. (Available at http://github.com/netflix)

Why Archaius?

To understand why we built Archaius, we need to enumerate the pain points of configuration management and the ecosystem that the system operates in. Some of these are captured below, and drove the requirements.
  • Static updates require server pushes; this was operationally undesirable and caused a dent in the availability of the service/application.
  • A Push method of updating properties could not be employed as this system would need to know all the server instances to push the configuration to at any given point in time ((i.e. the list of hostnames and property locations). 
    • This was a possibility in our own data center where we owned all the servers. In the cloud, the instances are ephemeral and their hostnames/ip addresses are not known in advance. Furthermore, the number of these instances fluctuate based on the ASG settings. (for more information on how Netflix uses Auto Scaling Group feature of AWS, please visit here or here).
  • Given that property changes had to be applied at run time, it was clear that the codebase had to use a common mechanism which allowed it to consume properties in a uniform manner, from different sources (both static and dynamic).
  • There was a need to have different properties for different applications and services under different contexts. See the section "Netflix Deployment Overview" for an overview of services and context.
  • Property changes needed to be journaled. This allowed us to correlate any issues in production to a corresponding run time property change.
  • Properties had to be applied based on the Context. i.e. The property had to be multi dimensional. At Netflix, the context was based on "dimensions" such as Environment (development, test, production), Deployed Region (us-east-1, us-west-1 etc.), "Stack" (a concept in which each app and the services in its dependency graph were isolated for a specific purpose; e.g. "iPhone App launch Stack") etc.

Use Cases/Examples

  • Enable or disable certain features based on the request context. 
  • A UI presentation logic layer may have a default configuration to display 10 Movie Box Shots in a single display row. If we determine that we would like to display 5 instead, we can do so using Archaius' Dynamic Properties.
  • We can override the behaviors of the circuit breakers. Reference: Resiliency and Circuit breakers
  • Connection and request timeouts for calls to internal and external services can be adjusted as needed
  • In case we get alerted on errors observed in certain services, we can change the Log Levels (i.e. DEBUG, WARN etc.) dynamically for particular packages/components on these services. This enables us to parse the log files to inspect these errors. Once we are done inspecting the logs, we can reset the Log Levels using Dynamic Properties.
  • Now that Netflix is deployed in an ever growing global infrastructure, Dynamic Properties allow us to enable different characteristics and features based on the International market.
  • Certain infrastructural components benefit from having configurations changed at Runtime based on aggregate site wide behavior. For e.g. a distributed cache component's TTL (time to live) can be changed at runtime based on external factors.
  • Connection pools had to be set differently for the same client library based on which application/service it was deployed in. (For example, in a light weight, low Requests Per Second (RPS) application, the number of connections in a connection pool to a particular service/db will be set to a lower number compared to a high RPS application)
  • The changes in properties can be effected on on a particular instance, a particular region, a stack of deployed services or an entire farm of a particular application at run-time.

Netflix Deployment Overview




 

Example Deployment Context

  • Environment = TEST
  • Region = us-east-1
  • Stack = MyTestStack
  • AppName = cherry
The diagram above shows a hypothetical simplistic overview of a typical deployment architecture at Netflix. Netflix has several services and applications that are consumer facing. These are referred to as Edge Services/Applications. These are typically fronted by Amazon's ELB. Each application/service depends on a set of mid-tier services and persistence technologies (Amazon S3, Cassandra etc.) sometimes fronted by a distributed cache.

Every service or application has a unique "AppName" associated with it. Most services at Netflix are stateless and hosted on multiple instances deployed across multiple Availability Zones of an Amazon Region. The available environments could be "test" or "production" etc. A Stack is logical grouping. For example, an Application and the Mid-Tier Services in its dependency graph can all be logically grouped as belonging to a Stack called "MyTestStack". This is typically done to run different tests on isolated and controlled deployments.

The red oval boxes in the diagram above called "Shared Libraries" are the various common code used by multiple applications. For example, Astyanax, our open sourced Cassandra Client is one such shared library. Turns out that we may need to configure the connection pool differently for each of the applications that is using the Astyanax library. Furthermore it could vary in different Amazon Regions and within different "Stacks" of deployments. Sometimes, we may want to tweak this connection pool parameter at runtime. These are the capabilities that Archaius offers.
i.e. The ability to specifically target a subset or an aggregation of components with a view towards configuring their behavior at static (initial loading) or runtime is what enables us to address the use cases outlined above.

The examples and diagrams in this article show a representative view of how Archaius is used at Netflix. Archaius, the Open sourced version of the project is configurable and extendable to meet your specific needs and deployment environment (even if your deployment of choice is not the EC2 Cloud).

Overview of Archaius


 
Archaius includes a set of java configuration management APIs that are used at Netflix. It is primarily implemented as an extension of Apache's Common Configuration library. Notable features are:
  • Dynamic, Typed Properties
  • High throughput and Thread Safe Configuration operations
  • A polling framework that allows for obtaining property changes from a Configuration Source
  • A Callback mechanism that gets invoked on effective/"winning" property mutations (in the ordered hierarchy of Configurations)
  • A JMX MBean that can be accessed via JConsole to inspect and invoke operations on properties
At the heart of Archaius is the concept of Composite Configuration which is an ordered list of one or more Configurations. Each Configuration can be sourced from a Configuration Source such as JDBC, REST API, a .properties file etc. Configuration Sources can optionally be polled at runtime for changes (In the above diagram, the Persisted DB Configuration Source which is an RDBMS containing properties in a table, is polled every so often for changes). The final value of a property is determined based on the top most Configuration that contains that property. i.e. If a property is present in multiple configurations, the actual value seen by the application will be the value that is present in the topmost slot in the hierarchy of Configurations. The order of the configurations in the hierarchy can be configured.

A rough template for handling a request and using Dynamic Property based execution is shown below: 
 
void handleFeatureXYZRequest(Request params ...){
  if (featureXYZDynamicProperty.get().equals("useLongDescription"){
   showLongDescription();
  } else {
   showShortSnippet();
  }
}
The source code for Archaius is hosted on GitHub at https://github.com/Netflix/archaius.

References

  1. Apache's Common Configuration library
  2. Archaius Features
  3. Archaius User Guide

Conclusion

Archaius forms an important component of the Netflix Cloud Platform. It offers the ability to control various sub systems and components at runtime without any impact to the availability of the services. We hope that this is a useful addition to the list of projects open sourced by Netflix, and invite the open source community to help us improve Archaius and other components.

Interested in helping us take Netflix Cloud Platform to the next level? We are looking for talented engineers.

- Allen Wang, Sr. Software Engineer, Cloud Platform (Core Infrastructure)
- Sudhir Tonse (@stonse), Manager, Cloud Platform (Core Infrastructure)



Friday, June 15, 2012

Netflix Operations: Part I, Going Distributed

Running the Netflix Cloud

Moving to the cloud presented new challenges for us[1] and forced us to develop new design patterns for running a reliable and resilient distributed system[2].  We’ve focused many of our past posts on the technical hurdles we overcame to run successfully in the cloud.  However, we had to make operational and organizational transformations as well.  We want to share the way we think about operations at Netflix to help others going through a similar journey.  In putting this post together, we realized there’s so much to share that we decided to make this a first in a series of posts on operations at Netflix.

The old guard

When we were running out of our data center, Netflix was a large monolithic Java application running inside of a tomcat container.  Every two weeks, the deployment train left at exactly the same time and anyone wanting to deploy a production change needed to have their code checked-in and tested before departure time.  This also meant that anyone could check-in bad code and bring the entire train to a halt while the issue was diagnosed and resolved.  Deployments were heavy and risk-laden and, because of all the moving parts going into each deployment, it was handled by a centralized team that was part of ITOps.

Production support was similarly centralized within ITOps.  We had a traditional NOC that monitored charts and graphs and was called when a service interruption occurred.  They were organizationally separate from the development team.  More importantly, there was a large cultural divide between the operations and development teams because of the mismatched goals of site uptime versus features and velocity of innovation.

Built for scale in the cloud

In moving to the cloud, we saw an opportunity to recast the mold for how we build and deploy our software.  We used the cloud migration as an opportunity to re-architect our system into a service oriented architecture with hundreds of individual services.  Each service could be revved on its own deployment schedule, often weekly, empowering each team to deliver innovation at its own desired pace.  We unwound the centralized deployment team and distributed the function into the teams that owned each service.

Post-deployment support was similarly distributed as part of the cloud migration.  The NOC was disbanded and a new Site Reliability Engineering team was created within the development organization not to operate the system, but to provide system-wide analysis and development around reliability and resiliency.

The road to a distributed future

As the scale of web applications has grown over time due to the addition of features and growth of usage, the application architecture has changed radically.  There are a number of things that exemplify this: service oriented architecture, eventually consistent data stores, map-reduce, etc.  The fundamental thing that they all share is a distributed architecture that involves numerous applications, servers and interconnections.  For Netflix this meant moving from a few teams checking code into a large monolithic application running on tens of servers to having tens of engineering teams developing hundreds of component services that run on thousands of servers.


The Netflix distributed system
As all of these changes occurred on the engineering side, we had to modify the way that we think about and organize for operations as well.  Our approach has been to make operations itself a distributed system.  Each engineering team is responsible for coding, testing and operating its systems in the production environment.  The result is that each team develops the expertise in the operational areas it most needs and then we leverage that knowledge across the organization.  There is an argument that developers needing to fundamentally understand how to operate, monitor and improve the resiliency of their applications in production is a distraction from their “real” work.  However, our experience has been that the added ownership actually leads to more robust applications and greater agility than centralizing these efforts.  For example, we've found that making the developers responsible for fixing their own code at 4am has encouraged them to write more robust code that handles failure gracefully, as a way of avoiding getting another 4am call. Our developers get more work done more quickly than before.

As we grew, it quickly became clear that centralized operations was not well suited for our new use case.  Our production environment is too complex for any one team or organization to understand well, which meant that they were forced to either make ill-informed decisions based on their perceptions, or get caught in a game of telephone tag with different development teams.  We also didn’t want our engineering teams to be tightly coupled to each other when they were making changes to their applications.

In addition to distributing the operations experience throughout development, we also heavily invested in tools and automation.  We created a number of engineering teams to focus on high volume monitoring and event correlation, end-to-end continuous integration and builds, and automated deployment tools[
3][4].  These tools are critical to limiting the amount of extra work developers must do in order to manage the environment while also providing them with the information that they need to make smart decisions about when and how they deploy, vet and diagnose issues with each new code deployment.
Conclusion

Our architecture and code base evolved and adapted to our new cloud-based environment.  Our operations have evolved as well.  Both aim to be distributed and scalable.  Once a centralized function, operations is now distributed throughout the development organization.  Successful operations at Netflix is a distributed system, much like our software, that relies on algorithms, tools, and automation to scale to meet the demands of our ever-growing user-base and product.

In future posts, we’ll explore the various aspects of operations at Netflix in more depth.  If you want to see any area explored in more detail, comment below or tweet us.



If you’re passionate about building and running massive-scale web applications, we’re always looks for amazing developers and site reliability engineers.  See all our open positions at www.netflix.com/jobs.

- Ariel Tseitlin (@atseitlin), Director of Cloud Solutions
- Greg Orzell (@chaossimia), Cloud & Platform Engineering Architect

[1] http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
[2] http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
[3] http://www.slideshare.net/joesondow/asgard-the-grails-app-that-deploys-netflix-to-the-cloud
[4] http://www.slideshare.net/carleq/building-cloudtoolsfornetflixcode-mash2012