Monday, June 25, 2012

Asgard: Web-based Cloud Management and Deployment

By Joe Sondow, Engineering Tools

For the past several years Netflix developers have been using self-service tools to build and deploy hundreds of applications and services to the Amazon cloud. One of those tools is Asgard, a web interface for application deployments and cloud management.
Asgard is named for the home of the Norse god of thunder and lightning, because Asgard is where Netflix developers go to control the clouds. I’m happy to announce that Asgard has now been open sourced on GitHub and is available for download and use by anyone. All you’ll need is an Amazon Web Services account. Like other open source Netflix projects, Asgard is released under the Apache License, Version 2.0. Please feel free to fork the project and make improvements to it.
Some of the information in this blog post is also published in the following presentations. Note that Asgard was originally named the Netflix Application Console, or NAC.

Visual Language for the Cloud

To help people identify various types of cloud entities, Asgard uses the Tango open source icon set, with a few additions. These icons establish a visual language that helps people understand what they are looking at as they navigate. Tango icons look familiar because they are also used by Jenkins, Ubuntu, MediaWiki, FileZilla, and GIMP. Here is a sampling of Asgard's cloud icons.

Cloud Model

The Netflix cloud model includes concepts that AWS does not support directly: Applications and Clusters.

Application

Below is a diagram of some of the Amazon objects required to run a single front-end application such as Netflix’s autocomplete service.
Here’s a quick summary of the relationships of these cloud objects.
  • An Auto Scaling Group (ASG) can attach zero or more Elastic Load Balancers (ELBs) to new instances.
  • An ELB can send user traffic to instances.
  • An ASG can launch and terminate instances.
  • For each instance launch, an ASG uses a Launch Configuration.
  • The Launch Configuration specifies which Amazon Machine Image (AMI) and which Security Groups to use when launching an instance.
  • The AMI contains all the bits that will be on each instance, including the operating system, common infrastructure such as Apache and Tomcat, and a specific version of a specific Application.
  • Security Groups can restrict the traffic sources and ports to the instances.
That’s a lot of stuff to keep track of for one application.
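The relationships listed above can be sketched as a small data model. This is only an illustration, not Asgard's or Amazon's actual object model: the class names, field names, and the launch_instance and terminate_instance helpers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SecurityGroup:
    name: str
    allowed_ports: List[int] = field(default_factory=list)


@dataclass
class LaunchConfiguration:
    # Names the AMI and the Security Groups used for each instance launch.
    name: str
    ami_id: str
    security_groups: List[SecurityGroup]


@dataclass
class LoadBalancer:
    name: str
    instance_ids: List[str] = field(default_factory=list)


@dataclass
class AutoScalingGroup:
    name: str
    launch_configuration: LaunchConfiguration
    load_balancers: List[LoadBalancer] = field(default_factory=list)
    instance_ids: List[str] = field(default_factory=list)

    def launch_instance(self, instance_id: str) -> None:
        """Record a new instance and attach every configured ELB to it."""
        self.instance_ids.append(instance_id)
        for elb in self.load_balancers:
            elb.instance_ids.append(instance_id)

    def terminate_instance(self, instance_id: str) -> None:
        """Remove an instance from the group and from its ELBs."""
        self.instance_ids.remove(instance_id)
        for elb in self.load_balancers:
            elb.instance_ids.remove(instance_id)


# Hypothetical example values.
web = SecurityGroup("autocomplete", allowed_ports=[80])
config = LaunchConfiguration("autocomplete-lc", "ami-12345678", [web])
elb = LoadBalancer("autocomplete-frontend")
asg = AutoScalingGroup("autocomplete-v001", config, [elb])
asg.launch_instance("i-0001")
```

Even in this toy form, one application touches five kinds of objects, which is why a consistent way of tying them back to an application matters.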
When there are large numbers of those cloud objects in a service-oriented architecture (like the one Netflix has), it’s important for a user to be able to find all the relevant objects for their particular application. Asgard uses an application registry in SimpleDB and naming conventions to associate multiple cloud objects with a single application. Each application has an owner and an email address to establish who is responsible for the existence and state of the application's associated cloud objects.
Asgard limits the set of permitted characters in the application name so that the names of other cloud objects can be parsed to determine their association with an application.
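To illustrate why a restricted character set makes names parseable, here is a minimal sketch. It assumes, purely for illustration, that application names contain only lowercase letters and digits and that a hyphen separates the application name from the rest of an object's name. The exact rules Asgard enforces are not spelled out in this post, and application_for is a hypothetical helper.

```python
import re
from typing import Optional

# Assumption for this sketch: application names are lowercase alphanumerics,
# so a hyphen can never appear inside the application name itself.
APP_NAME_PATTERN = re.compile(r"^[a-z0-9]+$")


def application_for(object_name: str) -> Optional[str]:
    """Return the application a cloud object name belongs to, or None."""
    candidate = object_name.split("-", 1)[0].lower()
    return candidate if APP_NAME_PATTERN.match(candidate) else None


print(application_for("obiwan-v064"))
print(application_for("obiwan"))
```

Because the delimiter cannot occur inside an application name, the first segment of any object name identifies its application unambiguously.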
Here is a screenshot of Asgard showing a filtered subset of the applications running in our production account in the Amazon cloud in the us-east-1 region:
Screenshot of a detail screen for a single application, with links to related cloud objects:

Cluster

On top of the Auto Scaling Group construct supplied by Amazon, Asgard infers an object called a Cluster which contains one or more ASGs. The ASGs are associated by naming convention. When a new ASG is created within a cluster, an incremented version number is appended to the cluster's "base name" to form the name of the new ASG. The Cluster provides Asgard users with the ability to perform a deployment that can be rolled back quickly.
Example: During a deployment, cluster obiwan contains ASGs obiwan-v063 and obiwan-v064. Here is a screenshot of a cluster in mid-deployment.
The old ASG is “disabled”, meaning it is not taking traffic but remains available in case a problem occurs with the new ASG. Traffic comes from ELBs and/or from Discovery, an internal Netflix service that is not yet open sourced.
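The naming scheme for ASGs within a cluster can be sketched as follows. The three-digit "-vNNN" suffix is inferred from the obiwan-v063 example above. Starting an empty cluster at v000 and ignoring any upper limit on the version number are assumptions of this sketch, and next_asg_name is a hypothetical helper rather than Asgard's code.

```python
import re
from typing import List

# Matches names such as "obiwan-v063": a base name plus a 3-digit version.
VERSION_SUFFIX = re.compile(r"^(?P<base>.+)-v(?P<version>\d{3})$")


def next_asg_name(cluster_base_name: str, existing_asg_names: List[str]) -> str:
    """Build the name of the next ASG in a cluster from the existing names."""
    versions = []
    for name in existing_asg_names:
        match = VERSION_SUFFIX.match(name)
        if match and match.group("base") == cluster_base_name:
            versions.append(int(match.group("version")))
    # Assumption: an empty cluster starts at version 0.
    next_version = max(versions) + 1 if versions else 0
    return f"{cluster_base_name}-v{next_version:03d}"


print(next_asg_name("obiwan", ["obiwan-v063"]))
```

Grouping ASGs into a cluster is then just a matter of collecting every ASG whose name shares the same base name.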

Deployment Methods

Fast Rollback

One of the primary features of Asgard is the ability to use the cluster screen shown above to deploy a new version of an application in a way that can be reversed at the first sign of trouble. This method requires more instances to be in use during deployment, but it can greatly reduce the duration of service outages caused by bad deployments.
This animated diagram shows a simplified process of using the Cluster interface to try out a deployment and roll it back quickly when there is a problem:
The animation illustrates the following deployment use case:
  1. Create the new ASG obiwan-v064
  2. Enable traffic to obiwan-v064
  3. Disable traffic on obiwan-v063
  4. Monitor results and notice that things are going badly
  5. Re-enable traffic on obiwan-v063
  6. Disable traffic on obiwan-v064
  7. Analyze logs on bad servers to diagnose problems
  8. Delete obiwan-v064
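The eight steps above can be modeled with a toy simulation. Nothing here calls Amazon or Asgard: the Cluster class, its methods, and the is_healthy callback are hypothetical stand-ins for the cluster screen and for a person watching the monitoring dashboards.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Cluster:
    # Maps ASG name to whether that ASG is currently taking traffic.
    asgs: Dict[str, bool] = field(default_factory=dict)

    def create(self, name: str) -> None:
        self.asgs[name] = False

    def enable(self, name: str) -> None:
        self.asgs[name] = True

    def disable(self, name: str) -> None:
        self.asgs[name] = False

    def delete(self, name: str) -> None:
        del self.asgs[name]

    def serving(self) -> List[str]:
        return sorted(name for name, on in self.asgs.items() if on)


def deploy_with_fast_rollback(
    cluster: Cluster, old: str, new: str, is_healthy: Callable[[str], bool]
) -> bool:
    """Return True if the new ASG stays in service, False if rolled back."""
    cluster.create(new)      # step 1
    cluster.enable(new)      # step 2
    cluster.disable(old)     # step 3: old ASG stays around, just idle
    if is_healthy(new):      # step 4
        return True
    cluster.enable(old)      # step 5: old instances are still running
    cluster.disable(new)     # step 6
    # Step 7 (log analysis on the bad servers) is manual and omitted here.
    cluster.delete(new)      # step 8
    return False


cluster = Cluster({"obiwan-v063": True})
ok = deploy_with_fast_rollback(
    cluster, "obiwan-v063", "obiwan-v064", is_healthy=lambda name: False
)
print(ok, cluster.serving())
```

The rollback is fast because step 5 only flips traffic back to instances that never stopped running. The cost is that both ASGs are fully provisioned between steps 1 and 8.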

Rolling Push

Asgard also provides an alternative deployment system called a rolling push. This is similar to a conventional data center deployment of a cluster of application servers. Only one ASG is needed. Old instances get gracefully deleted and replaced by new instances one or two at a time until all the instances in the ASG have been replaced. Rolling pushes are useful:
  1. If an ASG's instances are sharded so each instance has a distinct purpose that should not be duplicated by another instance.
  2. If the clustering mechanisms of the application (such as Cassandra) cannot support sudden increases in instance count for the cluster.
Downsides to a rolling push:
  1. Replacing instances in small batches can take a long time.
  2. Reversing a bad deployment can take a long time.
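The trade-off can be seen in a toy model of a rolling push. This sketch only tracks which AMI each instance slot runs; rolling_push and its parameters are hypothetical, and real replacement also involves waiting for each new instance to become healthy.

```python
from typing import List


def rolling_push(
    instance_amis: List[str], new_ami: str, batch_size: int = 1
) -> List[List[str]]:
    """Replace instances in small batches; return the state after each batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    current = list(instance_amis)
    snapshots = []
    for start in range(0, len(current), batch_size):
        end = min(start + batch_size, len(current))
        for slot in range(start, end):
            # The old instance is terminated and its replacement launched.
            current[slot] = new_ami
        snapshots.append(list(current))
    return snapshots


steps = rolling_push(["ami-old"] * 5, "ami-new", batch_size=2)
print(len(steps))
```

The instance count never rises above its original size, which suits sharded or count-sensitive applications. The number of batches grows with the size of the ASG, though, and undoing a bad push means running the same slow loop again with the old AMI.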

Task Automation

Several common tasks are built into Asgard to automate the deployment process. Here is an animation showing a time-compressed view of a 14-minute automated rolling push in action:

Auto Scaling

Netflix focuses on the ASG as the primary unit of deployment, so Asgard also provides a variety of graphical controls for modifying an ASG and setting up metrics-driven auto scaling when desired.
CloudWatch metrics can be selected from the defaults provided by Amazon, such as CPUUtilization, or can be custom metrics published by your application using a library like Servo for Java.

Why not the AWS Management Console?

The AWS Management Console has its uses for someone with your Amazon account password who needs to configure something Asgard does not provide. However, for everyday large-scale operations, the AWS Management Console has not yet met the needs of the Netflix cloud usage model, so we built Asgard instead. Here are some of the reasons.
  • Hide the Amazon keys

    Netflix grants its employees a lot of freedom and responsibility, including the rights and duties of enhancing and repairing production systems. Most of those systems run in the Amazon cloud. Although we want to enable hundreds of engineers to manage their own cloud apps, we prefer not to give all of them the secret keys to access the company’s Amazon accounts directly. Providing an internal console allows us to grant Asgard users access to our Amazon accounts without telling too many employees the shared cloud passwords. This strategy also saves us from needing to assign and revoke hundreds of Identity and Access Management (IAM) cloud accounts for employees.
  • Auto Scaling Groups

    As of this writing the AWS Management Console lacks support for Auto Scaling Groups (ASGs). Netflix relies on ASGs as the basic unit of deployment and management for instances of our applications. One of our goals in open sourcing Asgard is to help other Amazon customers make greater use of Amazon’s sophisticated auto scaling features. ASGs are a big part of the Netflix formula to provide reliability, redundancy, cost savings, clustering, discoverability, ease of deployment, and the ability to roll back a bad deployment quickly.
  • Enforce Conventions

    Like any growing collection of things users are allowed to create, the cloud can easily become a confusing place full of expensive, unlabeled clutter. Part of the Netflix Cloud Architecture is the use of registered services associated with cloud objects by naming convention. Asgard enforces these naming conventions in order to keep the cloud a saner place that is possible to audit and clean up regularly as things get stale, messy, or forgotten.
  • Logging

    So far the AWS console does not expose a log of recent user actions on an account. This makes it difficult to determine whom to call when a problem starts, and what recent changes might relate to the problem. Lack of logging is also a non-starter for any sensitive subsystems that legally require auditability.
  • Integrate Systems

    Having our own console empowers us to decide when we want to add integration points with our other engineering systems such as Jenkins and our internal Discovery service.
  • Automate Workflow

    Multiple steps go into a safe, intelligent deployment process. By knowing certain use cases in advance, Asgard can perform all the necessary steps for a deployment based on one form submission.
  • Simplify REST API

    For common operations that other systems need to perform, we can expose and publish our own REST API to do exactly what we want in a way that hides some of the complex steps from the user.

Costs

When using cloud services, it’s important to keep a lid on your costs. As of June 5, 2012, Amazon provides a way to track your account’s charges frequently. This data is not exposed through Asgard as of this writing, but someone in your company should keep track of your cloud costs regularly. See http://aws.typepad.com/aws/2012/06/new-programmatic-access-to-aws-billing-data.html
Starting up Asgard does not initially cause you to incur any Amazon charges, because Amazon has a free tier for SimpleDB usage and no charges for creating Security Groups, Launch Configurations, or empty Auto Scaling Groups. However, as soon as you increase the size of an ASG above zero, Amazon will begin charging you for instance usage, depending on your status for Amazon’s Free Usage Tier. Creating ELBs, RDS instances, and other cloud objects can also cause you to incur charges. Become familiar with the costs before creating too many things in the cloud, and remember to delete your experiments as soon as you no longer need them. Your Amazon costs are your own responsibility, so run your cloud operations wisely.

Feature Films

By extraordinary coincidence, Thor and Thor: Tales of Asgard are now available to watch on Netflix streaming.

Conclusion

Asgard has been one of the primary tools for application deployment and cloud management at Netflix for years. By releasing Asgard to the open source community we hope more people will find the Amazon cloud and Auto Scaling easier to work with, even at large scale like Netflix. More Asgard features will be released regularly, and we welcome participation by users on GitHub.
Follow the Netflix Tech Blog and the @NetflixOSS Twitter feed for more open source components of the Netflix Cloud Platform.
If you're interested in working with us to solve more of these interesting problems, have a look at the Netflix jobs page to see if something might suit you. We're hiring!

Related Resources

Asgard

Netflix Cloud Platform

Amazon Web Services