## Monday, December 9, 2013

### Announcing Suro: Backbone of Netflix's Data Pipeline

To make the best business and technical decisions, it is critical for Netflix to reliably collect application-specific data in a timely fashion. At Netflix we deploy a fairly large number of AWS EC2 instances that host our web services and applications. They collectively emit more than 1.5 million events per second during peak hours, or around 80 billion events per day. The events could be log messages, user activity records, system operational data, or any arbitrary data that our systems need to collect for business, product, and operational analysis.

Given that data is critical to our operations and yet we allow applications to generate arbitrary events, our data pipeline infrastructure needs to be highly scalable, always available, and deliver events with minimal latency, which is measured as elapsed time between the moment when an event is emitted and when the event is available for consumption by its consumers. And yes, the data pipeline needs to be resilient to our own Simian Army, particularly the Chaos Monkeys.

While various web services and applications produce events to Suro, many kinds of consumers may process such data differently. For example, our Hadoop clusters run MapReduce jobs on the collected events to generate offline business reports. Our event stream clusters generate operational reports to reflect real-time trends. Since we may dispatch events to different consumers based on changing needs, our data pipeline also needs to be dynamically configurable.

Suro, which we are proud to announce as the latest offering in the NetflixOSS family, serves as the backbone of our data pipeline. It consists of a producer client, a collector server, and a plugin framework that allows events to be dynamically filtered and dispatched to multiple consumers.
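To give a feel for the producer side, here is a minimal sketch of how an application might hand events to the Suro client. The class names loosely mirror the library, but the exact constructors, methods, and property keys shown are assumptions for illustration rather than the definitive client API:

```java
import java.util.Properties;

import com.netflix.suro.client.SuroClient;
import com.netflix.suro.message.Message;

public class SuroProducerSketch {
    public static void main(String[] args) {
        // Client configuration; the property names below are illustrative
        // assumptions, not the authoritative Suro client settings.
        Properties props = new Properties();
        props.setProperty("SuroClient.loadBalancerType", "static");           // assumed key
        props.setProperty("SuroClient.loadBalancerServer", "localhost:7101"); // assumed key

        SuroClient client = new SuroClient(props);

        // Each event carries a routing key; the Suro server uses it to decide
        // which sink(s) the message should be dispatched to.
        byte[] payload = "level=ERROR msg=\"playback failed\"".getBytes();
        client.send(new Message("my_application_log", payload));

        client.shutdown();
    }
}
```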

## History of Suro

Suro has its roots in Apache Chukwa, which was initially adopted by Netflix. The current incarnation grew out of what we learned from meeting the operational requirements of running in production over the past few years. The following are notable modifications compared to Apache Chukwa:
• Suro supports arbitrary data formats. Users can plug in their own serialization and deserialization code
• Suro instruments many tagged monitoring metrics to make itself operations friendly
• Suro integrates with NetflixOSS to be cloud friendly
• Suro supports dispatching events to multiple destinations with dynamic configuration
• Suro supports configurable store-and-forward on both client and collector

## Overall Architecture

The figure below illustrates the overall architecture of Suro. It is the single data pipeline that collects events generated by Netflix applications running in either AWS cloud or Netflix data centers. Suro also dispatches events to multiple destinations for further processing.

Such an arrangement supports two typical use cases: batch processing and real-time computation.

### Batch Processing

Many analytical reports are generated by Hadoop jobs. In fact, Suro was initially deployed just to collect data for our Big Data Platform team’s Hadoop clusters. For this, Suro aggregates data into Hadoop sequence files, and uploads them into designated S3 buckets. A distributed demuxing cluster demuxes the events in the S3 buckets to prepare them for further processing by Hadoop jobs. We hope to open source the demuxer in the next few months. Our Big Data Platform team has already open sourced pieces of our data infrastructure, such as Lipstick and Genie. Others will be coming soon. A previous post titled Hadoop Platform as a Service in the Cloud covers this in detail.

### Real-Time Computation

While offline/batch processing of events still forms the bulk of our consumer use cases, the more recent trend has been in the area of real-time stream processing. Stream consumers are typically employed to generate instant feedback, exploratory analysis, and operational insights. Summarizing application-generated log data is one example that falls into this bucket. The following graph summarizes how log events flow from applications to two different classes of consumers.

1. Applications emit events to Suro. The events include log lines.
2. Suro dispatches all the events to S3 by default. Hadoop jobs will process these events.
3. Based on a dynamically configurable routing rule, Suro also dispatches these log events to a designated Kafka cluster under a mapped topic.
4. A Druid cluster indexes the log lines on the fly, making them immediately available for querying. For example, our service automatically detects error surges for each application within a 10-minute window and sends out alerts to application owners. Application owners can then go to Druid’s UI to explore such errors.
5. A customized ElasticSearch cluster also ingests the same sets of log lines, de-duplicates them, and makes them immediately available for querying. Users are able to jump from an aggregated view on the Druid UI to individual records on ElasticSearch’s Kibana UI to see exactly what went wrong.

Of course, this is just one example of making use of real-time analysis. We are also actively looking into various technologies such as Storm and Apache Samza to apply iterative machine learning algorithms on application events.

### Suro Collector In Detail

The figure below zooms into the design of the Suro Collector service. The design is similar to SEDA: events are processed asynchronously in stages. In each stage, events are first offered to a queue; a pool of threads then consumes the events asynchronously from the queue, processes them, and sends them off to the next stage. (A generic sketch of one such stage follows the numbered list below.)

The main processing flow is as follows:
1. The client buffers events in a buffer called a message set, and sends the buffered messages to the Suro Collector.
2. The Suro Collector takes each incoming message set, deflates it if possible, and immediately returns after handing the message set to the Message Set Processor.
3. The Message Set Processor puts the message set into a queue, and has a pool of Message Router threads that route messages asynchronously.
4. A Message Router determines which sink a message should go to. If there is a filter configured for a message, the message payload will be deserialized.
5. Each sink maintains its own queue, and sends messages to its designated destination asynchronously.
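Below is a minimal, generic sketch of such a queue-backed stage. It is not Suro's actual code, just an illustration of the SEDA-style pattern described above: a bounded queue fronts a small thread pool that drains it and hands results to the next stage.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// A generic SEDA-style stage: a bounded queue plus a pool of worker threads.
// This is an illustrative sketch, not Suro's Message Set Processor implementation.
public class Stage<T> {
    private final BlockingQueue<T> queue = new ArrayBlockingQueue<>(10_000);
    private final ExecutorService workers;

    public Stage(int threads, Handler<T> handler, Stage<T> next) {
        workers = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            workers.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        T item = queue.take();           // block until work arrives
                        T result = handler.handle(item); // e.g. route or filter the message
                        if (next != null) {
                            next.offer(result);          // hand off to the next stage
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
    }

    // Producers return immediately; back-pressure is applied when the queue is full.
    public boolean offer(T item) {
        return queue.offer(item);
    }

    public interface Handler<T> {
        T handle(T item);
    }
}
```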

## Performance Measurement

Here are some results from simple stress tests.

### Test setup

• Collector cluster size: 3 m1.xlarge instances (OS: Linux, JDK: 1.7.0_25)
• Client cluster size: 6 m1.xlarge instances (OS: Linux, JDK: 1.7.0_25)
• Message batch size: 1000
• Message size: 300 bytes
• Sink config: S3 sink (batch: 1000, format: sequence file, compression: LZO, write to disk first: yes, disk type: EBS, notice type: SQS)

### Test Results

The following table summarizes the test result after warm-up:
| Total Message Count | Total Message Size | Peak Throughput | Per-Instance Average Throughput |
| --- | --- | --- | --- |
| 705 million | 211 GB | 66 K msg/sec | 60 K msg/sec |

The version open sourced today has the following components.
• Suro Client
• Suro Server
• Kafka Sink and S3 Sink
• Three Types of Message Filters

In the coming months, we will describe and open source other parts of the pipeline. We would love to collaborate with others in the community working on solutions in this domain, and hope that Suro, Genie, Lipstick, etc. provide some of the answers in this rapidly evolving and popular technology space.

## Other Event Pipelines

Suro evolved over the past few years alongside many other powerful data pipeline solutions such as Apache Flume and Facebook Scribe. Suro has overlapping features with these systems. The strength of Suro is that it is well integrated into AWS and especially the ecosystem of NetflixOSS, to support Amazon Auto Scaling, Netflix Chaos Monkey, and dynamic dispatching of events based on user-defined rules. In particular,
• The Suro client has a built-in load balancer that is aware of Netflix Eureka, and the Suro server also integrates with Netflix Eureka. Therefore, both Suro servers and applications that use the Suro client can be auto scaled.
• The Suro server uses EBS and file-backed queues to minimize message loss during unexpected EC2 terminations.
• The Suro server is able to push messages to arbitrary consumers at runtime with the help of Netflix Archaius. Users can declaratively configure the Suro server at runtime to dispatch events to multiple destinations, such as Apache Kafka, SQS, S3, and any HTTP endpoint. The dispatching can be done either in batches or in real time.
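As a rough sketch of what runtime reconfiguration with Archaius can look like, the snippet below watches a dynamic property holding a sink configuration and reacts when it changes. The property name and the JSON shape in the default value are assumptions for illustration, not Suro's actual configuration schema:

```java
import com.netflix.config.DynamicPropertyFactory;
import com.netflix.config.DynamicStringProperty;

public class SinkConfigWatcher {
    public static void main(String[] args) {
        // "suro.sink.config" is a hypothetical property name used for illustration.
        DynamicStringProperty sinkConfig = DynamicPropertyFactory.getInstance()
                .getStringProperty("suro.sink.config", "{\"default\": {\"type\": \"s3\"}}");

        // Archaius invokes the callback whenever the underlying configuration
        // source changes, so sinks can be rebuilt without restarting the server.
        sinkConfig.addCallback(() ->
                System.out.println("Sink configuration changed: " + sinkConfig.get()));

        System.out.println("Current sink configuration: " + sinkConfig.get());
    }
}
```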

## Summary

Suro has been the backbone of the data pipeline at Netflix for a few years and has evolved to handle many of the typical use cases that any Big Data infrastructure aims to solve.
In this article, we have described the top-level architecture, the use cases, and some of the components that form the overall data pipeline infrastructure at Netflix. We are happy to open source Suro and welcome input, comments, and involvement from the open source community.

If building critical big data infrastructure is your interest and passion, please take a look at http://jobs.netflix.com.

## Thursday, December 5, 2013

### Announcing Zeno - Netflix’s In-Memory Data Distribution Framework

by Drew Koszewnik

Netflix’s Video Metadata Service (VMS) is the platform that supplies all of the data about our movies and TV shows necessary to drive the Netflix experience. This data ranges from video titles and synopses to the resolutions and bitrates of streams for playback.

In previous blog posts, we revealed how we manage this large volume of data efficiently for our high availability applications by loading it, in its entirety, directly into memory. Our rationale behind this architectural decision stems from the extremely low latency tolerance required to support the billions of requests Netflix serves per day.

There are two overlapping data sets which are managed by VMS in different ways:
1. Data describing “things” (e.g. videos, genres, actors, streams, etc).
2. The relationships between these “things”.
Previously, we described our use of NetflixGraph to represent the “relationships” data set above. In contrast, we chose to represent the “things” data set as plain Java objects (POJOs). In this article, we will introduce Zeno, the framework we use to distribute, and keep up to date, gigabytes of object data on thousands of servers across the globe. This post will show how Zeno:
• Resulted in an immediate memory footprint reduction of roughly 50% for the largest in-memory dataset on Netflix’s servers.
• Significantly reduced Netflix server startup times across the board.
• Reduced the development maintenance effort required to evolve our data model.
The challenges we address with Zeno are faced by many organizations. Today, we are open sourcing Zeno. If you face some of the challenges described in this article on your team, you may be surprised by the efficiency gains you can realize by applying this library.

## What’s in the Box?

Zeno binaries are uploaded to Maven Central, and the code is available for download from GitHub. Zeno:
• Creates compact serialized representations of a set of Java objects.
• Automatically detects and removes duplication in a data set.
• Serializes data changes, so updates require minimal resource usage.
• Is efficient about memory and GC impact when deserializing data.
• Provides powerful tools to debug data sets.
• Defines a pattern for separation between data model and data operations, increasing the agility of development teams.
Zeno will create serialized representations,  but some assembly is required: Zeno does not make assumptions about how the serialized data will be transported to consumers.  This is by design; VMS is only the first use case for Zeno, and this scope limit opens up potential for a wider range of use cases.

## Data Distribution Challenges

During its initial development stages, Zeno’s primary mandate was to improve the performance of VMS’s object data propagation. Our data set is small enough to fit in memory, but loading it on the Java heap and keeping it updated throughout the life of the server presents several problem-specific performance challenges:
• It should not take a long time to initialize the data set in memory. Long server startup times impact our autoscaling ability.
• Periodic updates should not cause significant spikes in server resource usage.
• Periodic updates will create long-lived objects, and since young space collection pauses are proportional to the amount of surviving data, this can lead to longer duration stop-the-world GC pauses during data updates.
• Transferring large amounts of data takes time, and our goal is to minimize the time between data becoming available in our source of truth, and the effects of that data becoming visible to customers using the service.
In order to understand what Zeno has to offer in this arena, let’s explore the current state of the VMS architecture.

## VMS Data Distribution

The VMS architecture consists of a single server (with plenty of CPU and RAM), called the “data origination server”, which every few minutes pulls in all of the data necessary from multiple sources and reconstructs our video metadata objects.

The object data this server constructs encompasses all of the video metadata necessary to power the Netflix experience. Each time this data is constructed, we use Zeno to create a serialized representation, which we upload to a persistent file store (we choose S3 for this, but any technology can be leveraged for this use case). After upload, we notify all of the thousands of Netflix servers that new data is available. Each of the Netflix servers then downloads the data, and deserializes it using Zeno.

The actual transmission of the serialized data is out of scope for the Zeno framework. As such, use of Zeno does not require use of a persistent file store. However, we achieve fault-tolerance by using one. If the data origination server goes down for any reason, the serialized representation of our POJO instances is still available. The data across all Netflix servers will simply become “stale” until a new data origination server comes back online.

In the following two sections, we explore in more detail two specific features via which Zeno supports an extremely high level of efficiency for this data propagation strategy.

## Deltas and Snapshots

Each time the data origination server constructs POJO instances they will be slightly different than the previous time they were constructed. These differences are usually very small relative to the entire dataset. We don’t want to ship the entire dataset in order to effect a relatively small change in the data -- instead, we rely on Zeno to calculate a series of modifications to apply to the current data state in order to transform the data to the next state. We then ask Zeno to produce a small file describing these modifications, which we call a “delta” file.

Netflix servers which stay online for long periods of time can stay up to date indefinitely by using Zeno to apply these delta updates when they are available. However, due to both deployment and autoscaling, Netflix servers come up and shut down in AWS datacenters across two continents continuously. Newly started servers need a way to efficiently bootstrap their data.

To address this need, every so often the data origination server uses Zeno to create a file encompassing the entire data set. This larger “snapshot” file contains all of the data necessary to bootstrap to the current point in time. From this state, newly created servers can chain delta updates together to get to a later state.
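Here is a minimal sketch of the consumer side of this scheme. The interfaces are hypothetical stand-ins for whatever transport and Zeno reader a real deployment would wire together; the point is simply the bootstrap-then-chain-deltas flow described above:

```java
import java.io.InputStream;
import java.util.List;

// Hypothetical interfaces sketching the consumer-side bootstrap; they stand in
// for the blob transport and the Zeno reader used in a real deployment.
interface BlobStore {
    long latestSnapshotVersion();
    InputStream snapshot(long version);
    List<Long> deltaVersionsAfter(long version);
    InputStream delta(long version);
}

interface BlobReader {
    void readSnapshot(InputStream in) throws Exception;
    void readDelta(InputStream in) throws Exception;
}

public class ConsumerBootstrap {
    // Bootstrap from the newest snapshot, then chain deltas to reach the head state.
    public static long bootstrap(BlobStore store, BlobReader reader) throws Exception {
        long version = store.latestSnapshotVersion();
        reader.readSnapshot(store.snapshot(version));

        for (long deltaVersion : store.deltaVersionsAfter(version)) {
            reader.readDelta(store.delta(deltaVersion));
            version = deltaVersion;
        }
        return version; // callers keep applying new deltas from here as they appear
    }
}
```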

The delta file approach minimizes the amount of data we need to distribute to many servers. We gain multiple benefits from this minimization:
• The data update thread competes for fewer of the servers’ resources.
• We create vastly fewer long-lived objects during data updates.
The benefit of shipping less data is compounded by the distributed nature of the Netflix infrastructure. It takes a lot longer to copy data, for example, across the Atlantic Ocean to Europe than it does to copy data within the same data center. We therefore have our data origination server replicate each file (snapshot or delta) which it uploads. These files are uploaded to S3 buckets local to each AWS region in which Netflix servers operate.

Because the delta files are small, our data origination server can make them available in any AWS region worldwide just seconds after they become visible locally in the US. Consequently, even though our system is “eventually” consistent, our servers worldwide have the ability to keep their entire dataset in lockstep synchronization within seconds of each other.

Both the snapshot and delta files are encoded in a binary format we at Netflix refer to as “The Blob”. This name was used for lack of a better term during development to describe the format, and it stuck.

Now, with the open-source Zeno framework, anyone can take POJOs in any arbitrary data model and create their own blobs. Zeno will automatically calculate the minimum set of changes required to bring data up to date, and it provides convenient APIs to produce and consume deltas and snapshots. This method of data distribution, battle-tested at Netflix scale, is now easy to replicate for any data set.
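To illustrate the producer's publishing cycle in the same hedged spirit (hypothetical names again stand in for Zeno's actual writer API), each pass over the source data emits a small delta, and a full snapshot is written periodically so newly started servers can bootstrap:

```java
import java.io.FileOutputStream;
import java.io.OutputStream;

// Hypothetical writer interface for illustration; a real producer would use
// Zeno's blob writer here.
interface BlobWriter {
    void addObject(String typeName, Object obj);
    void writeSnapshot(OutputStream out) throws Exception;
    void writeDelta(OutputStream out) throws Exception;
}

public class OriginationCycle {
    private static final int CYCLES_PER_SNAPSHOT = 100; // assumed cadence, for illustration

    public static void runCycle(BlobWriter writer, Iterable<Object> videoMetadata,
                                long cycleNumber) throws Exception {
        // Re-add the current state of every object; the framework works out
        // what changed relative to the previous cycle.
        for (Object o : videoMetadata) {
            writer.addObject("VideoMetadata", o);
        }

        try (OutputStream delta = new FileOutputStream("delta-" + cycleNumber + ".blob")) {
            writer.writeDelta(delta); // small file: only the changes since the last cycle
        }
        if (cycleNumber % CYCLES_PER_SNAPSHOT == 0) {
            try (OutputStream snap = new FileOutputStream("snapshot-" + cycleNumber + ".blob")) {
                writer.writeSnapshot(snap); // full state, for newly started servers
            }
        }
        // After writing, the files would be uploaded to S3 and consumers notified.
    }
}
```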

## Data Deduplication

Many of the objects in our POJO model on the data origination server will represent duplicate data. For example, the object representing the actor Tom Hanks who starred in Forrest Gump will be identical to the object representing Tom Hanks who narrated the conductor in The Polar Express.1

All duplication in the object data set is automatically detected and removed when serialized using Zeno. Not only does this reduce the size of the serialized snapshot and delta files, but it also significantly reduces the size of the memory footprint of the data on Netflix’s servers.

Before Zeno, we spent a great deal of ad-hoc effort finding and eliminating the largest offenders of duplicate data. However, on our initial test of “blob-enabled” VMS we were very pleasantly surprised to see that complete deduplication resulted in the memory footprint of our data set being cut in half!

In addition, because data is already deduplicated in the serialized stream (and in addition because raw Zeno serialization/deserialization is extremely fast), our initial test of “blob-enabled” VMS showed reduction in initial deserialization times of between 75% and 90%. Since VMS initialization took the majority of server startup time for many of our applications, Zeno had a very significant positive impact on Netflix’s autoscaling ability.

Some detail about how data is deduplicated with Zeno is explored in the original internal presentation, most of which is now available for public review here.

So far, this article has discussed how Zeno enables efficient distribution of in-memory data. Zeno also enables the development of additional components which can operate on any data model.

## Data Model Separation

The VMS object model itself evolves with our requirements. This evolution happens frequently and regularly; as we make improvements to the Netflix experience, we constantly need new data to drive new features.

In order to prevent this constant evolution of our data model from obstructing our rapid pace of innovation, the Zeno framework defines a pattern for the creation of components which can act on any data model without requiring awareness of structure or semantics:

1. We define a set of NFTypeSerializers. These “serializers” serve simply as formulaic descriptions of the objects; they do not contain any logic.
2. Operations extend a Zeno-defined interface SerializationFramework. These operations use the serializers to traverse objects in a data model.
When we define our “operations”, we are essentially describing how to traverse an instance of any arbitrary POJO class, and the actions to take when we get to nodes of different types. We might think of this as a SAX-parser equivalent for POJOs, where we define the actions to take as we encounter specific elements in the object instances.
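A toy version of the pattern is sketched below, with hypothetical interfaces in place of Zeno's real NFTypeSerializer and SerializationFramework classes: the "serializer" only declares an object's structure, and each "operation" walks that structure to do its work, much like a SAX handler reacting to parse events.

```java
// Toy interfaces illustrating the separation; Zeno's real NFTypeSerializer and
// SerializationFramework have richer (and different) signatures.
interface FieldVisitor {
    void visitString(String fieldName, String value);
    void visitInt(String fieldName, int value);
}

interface TypeDescription<T> {
    void traverse(T instance, FieldVisitor visitor); // structure only, no logic
}

class Video {
    final int id;
    final String title;
    Video(int id, String title) { this.id = id; this.title = title; }
}

class VideoDescription implements TypeDescription<Video> {
    @Override
    public void traverse(Video v, FieldVisitor visitor) {
        visitor.visitInt("id", v.id);
        visitor.visitString("title", v.title);
    }
}

// One possible "operation": it knows nothing about Video specifically, only how
// to react to the fields it is shown.
class PrintingOperation implements FieldVisitor {
    public void visitString(String fieldName, String value) {
        System.out.println(fieldName + " = " + value);
    }
    public void visitInt(String fieldName, int value) {
        System.out.println(fieldName + " = " + value);
    }
}
```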

Zeno ships with multiple framework implementations, and the patterns can be replicated to produce new operations as well. See the documentation for complete details.

## Data Debugging

The data ultimately represented in our object model comes from multiple sources:
• Some of it is manually curated
• Some is obtained from external sources
• Some is generated by automated systems
The VMS data origination server also applies business logic to combine and index this data.

Each of these steps is a potential source of data inaccuracy, with consequences on the Netflix site and devices ranging from displaying an incorrect release year for a title to being completely unable to watch a movie or TV show.

We have found that the ability to inspect the differences between two data states goes a very long way towards pinpointing errors in the business logic which builds our object data, and detecting pure data issues. Zeno includes a proven and reliable tool for understanding the exact differences between two arbitrary sets of data in just a few moments. Details about how to effectively use this functionality are described in the documentation.

## Conclusions

The Zeno framework is a powerful component in the Netflix architecture. It provides the ability to efficiently propagate and keep up to date large datasets in RAM across many servers, and is proven to work at Netflix scale. It ships with data debugging tools applicable to any data set, and defines a pattern for separation of data model and data operations, which can increase the agility of development teams.

As of today, Zeno is available for application to similar challenges your team may be facing. Zeno will continue to be applied and updated by the team at Netflix; if you’d like to reach us to talk about it, we’ve created a Google group for that purpose. When we open sourced NetflixGraph earlier this year, we soon heard success stories from teams within Netflix and externally about efficiency improvements through its application to different problems. We hope to achieve the same success with the Zeno framework.

It's impossible to describe how satisfying it is to solve enormous scalability challenges, work with the extremely talented Netflix development team, and share responsibility for the design and control of the infrastructure responsible for serving 40 million Netflix subscribers. Does this sound interesting to you? There are many more challenges where this came from. Check out jobs.netflix.com to explore our open positions.

1 If you clicked the links above and got to the display page for these movies, I'll give you one guess where the metadata used to draw these pages came from.

## Wednesday, December 4, 2013

### Scryer: Netflix's Predictive Auto Scaling Engine - Part 2

by Danny Yuan, Neeraj Joshi, and Daniel Jacobson

In part 1 of this series, we introduced Scryer, Netflix’s predictive autoscaling engine, and discussed its use cases and how it runs in Netflix. In this second installment, we will discuss the design of Scryer ranging from the technical implementation to the algorithms that drive its predictions.

## Design of Scryer

Scryer has a simple data flow architecture. On a very high level, historical data flows into Scryer, and predicted actions flow out. The diagram below shows the architecture.

The API layer provides a RESTful interface for a web UI, as well as automation scripts to interact with Scryer.

The Data Collector module pulls metrics from a pluggable list of data sources, cleans the data, and transforms it into a format suitable for the Predictor. The data retrieval is currently done incrementally within a sliding time window to minimize the load on the data source. The data is also stored in a secondary persistent store for resiliency purposes.

The Predictor generates predictions based on a pluggable list of prediction algorithms. We implemented two prediction algorithms for production: one is an augmented linear-regression-based algorithm, and the other is based on the Fast Fourier Transform (FFT). The Predictor module also provides life cycle hooks for pre- and post-processing of predictions. A pluggable prediction combiner is then used to combine multiple predictions into a single final prediction.
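A rough sketch of what this pluggable shape could look like follows; the interface and class names here are illustrative, not Scryer's actual code:

```java
import java.util.List;

// Illustrative interfaces only; Scryer's real Predictor module may differ.
interface PredictionAlgorithm {
    double[] predict(double[] history, int horizon); // e.g. FFT-based or regression-based
}

interface PredictionCombiner {
    double[] combine(List<double[]> predictions);
}

class MaxCombiner implements PredictionCombiner {
    // A conservative choice: take the larger of the candidate predictions at each
    // future time step, so the combiner never under-provisions the cluster.
    @Override
    public double[] combine(List<double[]> predictions) {
        double[] result = predictions.get(0).clone();
        for (double[] p : predictions) {
            for (int i = 0; i < result.length; i++) {
                result[i] = Math.max(result[i], p[i]);
            }
        }
        return result;
    }
}
```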

The Action Plan Generator module uses the prediction and other control parameters (e.g., server throughput, server start time, etc.) to compute an auto scaling plan. The auto scaling plan is optimized to minimize the number of scale-up events while maintaining an optimal scale-up batch size for each scale-up event. Pre- and post-action hooks are available to apply additional padding to instance counts if required. For example, we may need to add extra instances for holidays.

The Scaler module carries out the action plan generated by the Action Plan Generator module. It allows different implementations of actions. Currently, we have implemented three different actions:
• Emitting predictions and action steps to our monitoring dashboard at scheduled times. This is great for simulating the behavior of Scryer. We can easily visualize the predictions and actions, and compare the predictions with the actual workload in the same graph.
• Scheduling each step using the AWS API for Scheduled Actions (a sketch of this follows the list)
• Scheduling actions that will scale a cluster using the EC2 API
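For the second action type, the sketch below shows roughly how one step of an action plan could be registered as an AWS scheduled action using the AWS SDK for Java. The group name, capacity, and timing are made-up values, and the region/credentials setup is omitted:

```java
import java.util.Date;

import com.amazonaws.services.autoscaling.AmazonAutoScalingClient;
import com.amazonaws.services.autoscaling.model.PutScheduledUpdateGroupActionRequest;

public class ScheduledActionSketch {
    public static void main(String[] args) {
        AmazonAutoScalingClient autoScaling = new AmazonAutoScalingClient();

        // One step of an action plan: scale the group to 120 instances at a
        // predicted ramp-up time. All values here are illustrative.
        Date rampUpTime = new Date(System.currentTimeMillis() + 30 * 60 * 1000);
        PutScheduledUpdateGroupActionRequest request =
                new PutScheduledUpdateGroupActionRequest()
                        .withAutoScalingGroupName("my-service-asg")
                        .withScheduledActionName("scryer-step-" + rampUpTime.getTime())
                        .withStartTime(rampUpTime)
                        .withMinSize(120)
                        .withDesiredCapacity(120);

        autoScaling.putScheduledUpdateGroupAction(request);
    }
}
```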

## Metrics for Prediction Algorithms

The first order of business for building the prediction algorithm is to determine which metrics are to be used for prediction and autoscaling actions. When using Amazon Auto Scaling, we normally settle on load average. Load average fits because it is a good indicator of the capacity of a cluster, and it is independent of traffic patterns. Our goal is simply to keep load average within a certain range by adjusting cluster size. However, load average is a poor fit for prediction because it is itself a result of auto scaling; it is too complicated, if not impossible, to predict something that is also changed by the prediction. A metric has to satisfy two conditions to be easily predictable:
• It has a clear, relatively stable, and preferably a recurring pattern. We can predict reliably only what has repeatedly happened in the past.
• It is independent of cluster performance. We deploy our code frequently, and the performance of a deployment may vary per deployment. If the metric depends on cluster performance, the prediction may deviate widely from the actual values of the metric.

Therefore, we decided to use user traffic for prediction. In particular, we use requests per second by default because most of our services are request-based. User traffic satisfies the two conditions above.

Once we determined which metric to predict on, we also needed to figure out how to calculate scaling actions. Since the goal of auto scaling is to ensure a cluster has a sufficient number of machines to serve all the user traffic, all we need to do is predict the size of the cluster, which depends on the average throughput of a server; roughly, the required cluster size is the predicted request rate divided by the per-server throughput.

We can get throughput metrics from our monitoring system or from stress testing. Scryer also allows users to override the throughput value manually via the web UI or by calling a RESTful API.

## Prediction Algorithms

The key to effective prediction algorithms is making use of as many signals as possible and at the same time ignoring noise in input metrics. We observed that our input metrics had the following characteristics:
• They have clear weekly periodicity for the same day of the week. That is, the traffic of two adjacent Tuesdays is more similar than that of an adjacent Tuesday and Wednesday.
• Their daily patterns are similar, albeit different in shape and scale.
• They have some small spikes and drops that we can deem as noise.
• The change in traffic is relatively constant week over week. In other words, the traffic at the same time on the same day of the week moves approximately linearly from week to week.
• There could be occasional large spikes or large drops due to system outages.
Based on our observations, we took two different approaches: FFT-based smoothing, and linear regression with clustered data points.

### FFT-Based Prediction

The idea of this algorithm is to treat incoming traffic as a combination of multiple sine curves. Noise is of high frequency and low amplitude. Therefore, we can use an FFT filter that filters out noise based on given thresholds of frequency and amplitude. The filtered result is a smoothed curve. To predict a future value, we shift the curve to find the past value that is exactly one period away. Mathematically speaking, if the filtered result is a function of time $f(t)$, and the future value is another function of time $g(t)$, then $g(t) = f(t - \omega)$, where $\omega$ is the periodicity of the function $f(t)$. The figure below illustrates the idea. The black curve is the input, and the blue curve is the smoothed result. We can see the sharp spikes are filtered out because they have much higher frequency and much smaller amplitude than the blue curve.
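A compact sketch of the core idea using Apache Commons Math is shown below. The frequency and amplitude thresholds and the one-week period are illustrative assumptions, not Scryer's actual parameters, and this particular FFT implementation requires the input length to be a power of two:

```java
import org.apache.commons.math3.complex.Complex;
import org.apache.commons.math3.transform.DftNormalization;
import org.apache.commons.math3.transform.FastFourierTransformer;
import org.apache.commons.math3.transform.TransformType;

public class FftPredictorSketch {

    // Smooth the series by zeroing out high-frequency, low-amplitude bins,
    // then predict a future point by looking back exactly one period.
    public static double predict(double[] traffic, int samplesPerWeek, int stepsAhead) {
        FastFourierTransformer fft = new FastFourierTransformer(DftNormalization.STANDARD);
        Complex[] spectrum = fft.transform(traffic, TransformType.FORWARD);

        // Illustrative noise filter: drop bins above a frequency cutoff whose
        // amplitude is small relative to the dominant component.
        double maxAmplitude = 0;
        for (Complex c : spectrum) maxAmplitude = Math.max(maxAmplitude, c.abs());
        int frequencyCutoff = 64;                     // assumed threshold
        double amplitudeCutoff = 0.05 * maxAmplitude; // assumed threshold
        for (int k = frequencyCutoff; k < spectrum.length - frequencyCutoff; k++) {
            if (spectrum[k].abs() < amplitudeCutoff) {
                spectrum[k] = Complex.ZERO;
            }
        }

        Complex[] smoothed = fft.transform(spectrum, TransformType.INVERSE);

        // g(t) = f(t - w): the predicted value is the smoothed value one week earlier.
        int t = traffic.length - 1 + stepsAhead;
        return smoothed[t - samplesPerWeek].getReal();
    }
}
```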

The FFT-based algorithm is also capable of ignoring outages. It detects outages by applying standard statistical methods. Once an outage is detected, the algorithm will iteratively apply the FFT on adjusted data until the outage is ignored. The following figure shows that a simulated big drop is reasonably ignored:

The prediction algorithm undergoes multiple iterations to gradually remove the effect of such a drop, as shown by the figure below. The first iteration is red, and the last iteration is yellow. We can see that the prediction becomes better with each iteration.

### Linear Regression on Clustered Data Points

We can’t apply linear regression directly to the input metrics, as the shape of the input is sinusoidal. However, given that each day has a similar pattern with a linear trend, we can pick data points at the same time on different days, and then apply linear regression. This approach would require a lot of days. However, if we zoom in on the data, we see that within a smaller time window, say 10 minutes, the data points have nearly identical values. Therefore, we can pick a cluster of data points from each time window, and then apply linear regression. This turns out to produce very accurate predictions. The following series of figures illustrates how this linear regression works. This method also complements the FFT-based method. Some traffic patterns may contain regular but short-lived spikes. Such spikes are not noise, but the FFT-based method unfortunately filters them out, whereas this method will predict them. The diagram below illustrates such regular patterns that would get filtered out by the FFT-based method.

The diagram below shows the workload of one of Netflix's clusters within a 30-minute window. The workload does not fluctuate by more than 4%.

Therefore, we can pick data points around the same time of day from different days within a specified time window. The number of chosen data points is progressively reduced as we move back in time. This naturally gives newer data more influence than older data. We also choose a larger set of data points from days that are highly similar to each other, based on a weight matrix. For example, Saturday's traffic is similar to that of other Saturdays and of its adjacent Sunday, so for a Saturday we choose more data points from weekends than from weekdays.

Once we have clusters of data points, we can then apply linear regression. The blue dots are selected clustered points, and the red line is the result of linear regression. Once we obtain the line, we can predict by a simple extrapolation of the line to a future time.
One potential problem with this approach is that a single outage may make a lot of points invalid, thereby skewing the regression results. This is why we still combine the FFT-based method with this algorithm. In addition, we also apply outlier detection algorithms to remove invalid points. We implemented both a distribution-based algorithm and a deviation-based algorithm, and both turned out to work well.
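A simplified sketch of the clustering-plus-regression step using Apache Commons Math follows. The window size, history depth, and the rule for thinning out older days are illustrative assumptions rather than Scryer's actual parameters:

```java
import org.apache.commons.math3.stat.regression.SimpleRegression;

public class ClusteredRegressionSketch {

    // traffic[t] is the observed request rate at minute t; predict the rate at
    // targetMinute by regressing over small same-time-of-day clusters from prior days.
    public static double predict(double[] traffic, int targetMinute) {
        int minutesPerDay = 1440;
        int windowMinutes = 10;   // assumed cluster window around the target time
        int daysBack = 14;        // assumed history depth

        SimpleRegression regression = new SimpleRegression();
        for (int day = 1; day <= daysBack; day++) {
            int center = targetMinute - day * minutesPerDay;
            if (center - windowMinutes < 0) break;
            // Take fewer points from older days so recent data carries more weight.
            int pointsForDay = Math.max(1, 2 * windowMinutes / day);
            for (int i = 0; i < pointsForDay; i++) {
                int t = center - windowMinutes + i * (2 * windowMinutes / pointsForDay);
                regression.addData(t, traffic[t]);
            }
        }
        // Extrapolate the fitted line to the future time we care about.
        return regression.predict(targetMinute);
    }
}
```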

## Future Work

While Scryer has dramatically improved our system scaling, there are still many things that we can do to make it better. We plan to improve Scryer in three areas in the near future:
• Making Scryer distributed. The current implementation of Scryer runs on a single server. It is capable of handling hundreds of clusters, and tolerates temporary system crashes because it checkpoints important state in Cassandra. That said, making it distributed can reduce Scryer’s bootstrap time, thereby reducing its potential downtime. A distributed Scryer can also be scaled up to handle many more clusters. Distributing Scryer also helps its resilience: if the single instance fails, so does Scryer. Of course, we still have Amazon Auto Scaling, but that is not optimal. Plus, because it runs on a single instance, we have to opt it out of Chaos Monkey. Opting it in would give us more data on how the system fares if it does drop out, so being distributed has that benefit as well.
• Implementing an automatic feedback loop so Scryer can auto tune. We record and monitor the accuracy of Scryer’s predictions, the effect of scaling actions, as well as instance start times. We then use such data to tune the parameters of our algorithms. This work, however, can be largely automated. We plan to implement a trend detector: if the prediction starts to consistently deviate from the actual workload, the detector will capture such a deviation and feed it to an auto-correcting module. The auto-correcting module will adjust the auto scaling accordingly, and will also tune the prediction algorithm if needed.
• Improving our prediction algorithms. For example, we ran experiments to find out how to choose clusters of data points for linear regression. We plan to automate this process so we always get accurate parameters for choosing data points. We also plan to improve the heuristics on how to filter out noise in our FFT-based algorithm.

## Conclusion

Scryer adopts a simple yet flexible design that allows users to configure its behavior with ease. It has built-in fault-tolerance features to cope with temporary data unavailability, occasional data irregularities such as outages, and system downtime. The algorithms employed by Scryer take advantage of Netflix’s traffic patterns and achieve accurate results. Although the approaches and algorithms described above are already yielding excellent results, we are constantly reviewing them in an effort to improve Scryer.

Finally, we work on these kinds of exciting challenges all the time at Netflix.  If you would like to join us in tackling such problems, check out our Jobs site.