Every running server in our production environment must be part of an auto scaling group; we even verify this with one of the Simian Army members, which locates and terminates any stray instances in the environment. The goal is to detect and terminate unhealthy instances as quickly as possible; we can then count on Amazon to replace them automatically. The same applies to any hosts that Amazon terminates or that suffer hardware failures. This works for both stateful and stateless services, because even our stateful services know how to set up everything they need when the AMI launches. However, most of our services are stateless, which makes this easy to handle and opens the door to also using auto scaling for optimization.
While availability is the most important use of auto scaling, cost and resource optimization is certainly its sexier side. Being able to allocate resources based on need, and pay for them accordingly, is one of the big promises of the cloud. Very few applications have a constant workload, and Netflix is no exception; in fact, we see a large peak-to-trough variation in our usage pattern. Auto scaling allows us to vary the size of our pools based on usage, which saves money and lets us absorb unforeseen spikes without an outage or someone manually resizing capacity.
Making it Work
Configuring auto scaling is a complex task, especially since the only usable metric AWS provides out of the box is CPU utilization, which isn't a great indicator for every application type. Briefly, the configuration consists of three primary steps: first, identify the constraining resource (e.g. memory, CPU); second, track that resource in CloudWatch, Amazon's cloud resource monitoring service; finally, configure alarms and policies to take the correct action when the profile of your constraining resource changes. We have created two levels of tooling, plus some basic scripts, to make this process easier.
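To make the third step concrete, here is a minimal sketch of wiring an alarm to a scaling policy. It uses boto3, the modern AWS SDK for Python (not the tooling described in this post, which predates it), and all group, policy, and alarm names are hypothetical:

```python
# Sketch: attach a scale-up policy to an auto scaling group and trip it
# from a CloudWatch alarm on the constraining resource (CPU here).
# Requires AWS credentials; names below are hypothetical.
import boto3

asg = boto3.client("autoscaling")
cw = boto3.client("cloudwatch")

# Scale-up policy: add 10% capacity, with a cooldown between actions.
policy = asg.put_scaling_policy(
    AutoScalingGroupName="myservice-v042",
    PolicyName="myservice-scale-up",
    AdjustmentType="PercentChangeInCapacity",
    ScalingAdjustment=10,
    Cooldown=300,
)

# Alarm: average CPU above 60% for 5 consecutive minutes fires the policy.
cw.put_metric_alarm(
    AlarmName="myservice-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "myservice-v042"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=5,
    Threshold=60.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```

A matching scale-down policy and low-CPU alarm would be configured the same way, with the thresholds and periods discussed later in this post.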
The first level of tooling is a monitoring library that provides the infrastructure to export application metrics to CloudWatch. The library provides annotations to make export easy for developers. When a field is annotated with the "@Monitor" tag, it is automatically registered with JMX and can be published to CloudWatch based on a configurable filter. More features exist, but the critical step is to tag a field for export so it can be used by the auto scaling configuration. Look for a blog post in the coming weeks discussing Netflix open sourcing this library.
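The pattern is simple: tagged fields land in a registry, and a filter decides which registered metrics get published. The actual library is Java; the following is only a Python sketch of that pattern, with all names (`monitor`, `metrics_to_publish`) invented for illustration:

```python
# Sketch of the annotate-register-filter pattern (hypothetical names;
# the real library is Java and registers with JMX).
MONITOR_REGISTRY = {}

def monitor(name, publish=True):
    """Register a zero-argument callable as an exportable metric."""
    def register(fn):
        MONITOR_REGISTRY[name] = {"value": fn, "publish": publish}
        return fn
    return register

class RequestCounter:
    def __init__(self):
        self.requests = 0

counter = RequestCounter()

@monitor("RequestCount")
def request_count():
    return counter.requests

def metrics_to_publish():
    """Apply the filter: only metrics tagged for export reach CloudWatch."""
    return {name: m["value"]() for name, m in MONITOR_REGISTRY.items()
            if m["publish"]}

counter.requests = 42
print(metrics_to_publish())  # {'RequestCount': 42}
```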
The second level of tooling is a set of features built into our Netflix Application Console (slides). The tool as a whole drives our entire cloud infrastructure, but here we'll focus on what we've added to make auto scaling easy. As part of our push process, we create a new auto scaling group for each new version of code, which means we need to make sure the entire configuration from the old group is copied to the new one. The tool also displays the rule settings and allows users to modify them in a simple HTML UI. In addition, we have some simple scripts to help set up auto scaling rules, configure SNS notifications, and create roll-back options; the scripts are available at our GitHub site.
The final and most important piece of having dynamic auto scaling work well is to understand and test the application's behavior under load. We do this by either squeezing down the traffic in production to a smaller set of servers, or generating artificial load against a single server. Not understanding how an application behaves under load, or what the true limiting factors of the application are, may result in an ineffective or even destructive auto scaling configuration.
The End Result
Below is a set of graphs showing our request traffic over two days, the number of servers we are running to support that traffic, and the aggregate CPU utilization of the pool. Notice that the server count mirrors the request rate, and that under load the aggregate CPU is essentially flat.
Scale up early, scale down slowly
At Netflix, we prefer to scale up early and scale down slowly. We advocate that teams use symmetric percentages and periods for their auto scaling policies and CloudWatch alarms; more here.
To scale up early, we recommend tripping a CloudWatch alarm at 75% of the target threshold for a short period; we typically recommend 5-10 minutes to trigger an event. Be mindful of the time required to start an instance: consider both EC2 and application startup time. The 25% headroom provides excess capacity for short, irregular request spikes, and it also protects against the loss of capacity when instances fail on startup. For example, if max CPU utilization is 80%, set the alarm to trigger after 5 minutes at 60% CPU.
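The headroom arithmetic is simple enough to state as a one-liner (the function name is ours, not from any tooling described here):

```python
def alarm_threshold(max_utilization, headroom=0.25):
    """Trip the alarm at 75% of the load-tested ceiling, leaving 25%
    headroom for request spikes and instances that fail on startup."""
    return max_utilization * (1.0 - headroom)

# If load testing showed the service degrades above 80% CPU:
print(alarm_threshold(80))  # 60.0 -> alarm after 5 minutes above 60% CPU
```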
Scaling down slowly is important to mitigate the risk of removing capacity too quickly or incorrectly. To prevent these scenarios we use time as a proxy for scaling slowly. For example: scale up by 10% if CPU utilization is greater than 60% for 5 minutes; scale down by 10% if CPU utilization is less than 30% for 20 minutes. The advantage of using time, as opposed to asymmetric scaling policies, is that it prevents capacity 'thrashing', that is, removing too much capacity and then quickly re-adding it, which can happen if the scale-down policy is too aggressive. Time-based scaling can also prevent incorrectly scaling down during an unplanned service outage. For example, suppose an edge service temporarily goes down. As requests drop during the outage, the middle tier may incorrectly scale down; but if the edge service is down for less than the configured alarm time, no scale-down event will occur.
Provision for availability zone capacity
Auto scaling policies should be defined based on the capacity needs per availability zone. This is especially critical for auto scaling groups that span multiple availability zones and use a percent-based scaling policy. For example, suppose an edge service is provisioned in three zones with the min and max set to 10 and 30 respectively, and load testing showed it can handle a maximum of 110 requests per second (RPS) per instance. With a single Elastic Load Balancer (ELB) fronting the service, requests are routed round-robin, uniformly across zones, so each zone must effectively be provisioned with enough capacity to handle one third of the total traffic. With a percent-based policy, one or more zones may become under-provisioned. Assume 1850 total RPS, or roughly 617 RPS per zone, with two zones holding 6 instances and the third holding 5, for 17 total. The zones with 6 instances are processing, on average, about 103 RPS per instance; the zone with 5 instances is processing about 123 RPS per instance, roughly 12% beyond the desired (load-tested) rate. The root of the problem is the unbalanced availability zone. A zone can become unbalanced by scaling up or down by a factor that is not a multiple of the number of zones, which tends to occur when using a percent-based policy. Also note that the aggregate computed by CloudWatch is a simple, equally weighted average, which masks the under-provisioned zone.
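The arithmetic above can be reproduced in a few lines (the function name is ours):

```python
def per_instance_rps(total_rps, instances_per_zone):
    """With round-robin routing per zone, each zone receives an equal share
    of traffic regardless of how many instances it holds."""
    zone_rps = total_rps / len(instances_per_zone)
    return [zone_rps / n for n in instances_per_zone]

# 1850 total RPS across three zones holding 6, 6, and 5 instances:
loads = per_instance_rps(1850, [6, 6, 5])
print([round(x) for x in loads])  # [103, 103, 123]
```

Note that the fleet-wide average, 1850 / 17 or about 109 RPS per instance, sits just under the 110 RPS ceiling, which is exactly how the equally weighted aggregate hides the overloaded zone.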
Auto scaling is a very powerful tool, but it can also be a double-edged sword. Without proper configuration and testing it can do more harm than good, and a number of edge cases may arise when attempting to optimize or complicate the configuration. As seen above, when configured carefully and correctly, auto scaling can increase availability while simultaneously decreasing overall costs. For more details on our tools and the lessons we have learned, check out our auto scaling GitHub project, which has the source for some of our tools as well as a lot of documentation on the wiki.