Friday, January 4, 2013

Janitor Monkey - Keeping the Cloud Tidy and Clean

By Michael Fu and Cory Bennett, Engineering Tools

One of the great advantages of moving from a private datacenter into the cloud is that you have quick and easy access to nearly limitless new resources. Innovation and experimentation friction is greatly reduced: to push out a new application release you can quickly build up a new cluster, to get more storage just attach a new volume, to back up your data just make a snapshot, to test out a new idea just create new instances and get to work. The downside of this flexibility is that it is easy to lose track of cloud resources that are no longer needed or used. Perhaps you forgot to delete the cluster with the previous version of your application, or forgot to destroy the volume when you no longer needed the extra disk. Taking snapshots is great for backups, but do you really need them from 12 months ago? It's not just forgetfulness that can cause problems. API and network errors can cause your request to delete an unused volume to get lost.

At Netflix, when we analyzed our Amazon Web Services (AWS) usage, we found a lot of unused resources and needed a way to fix the problem. Diligent engineers can manually delete unused resources via Asgard, but we needed a way to automatically detect and clean them up. Our solution was Janitor Monkey.

We have written about our Simian Army in the past, and we are proud to announce that the source code for its newest member, Janitor Monkey, is now open and available to the public.

What is Janitor Monkey?

Janitor Monkey is a service that runs in the Amazon Web Services (AWS) cloud looking for unused resources to clean up. Similar to Chaos Monkey, the design of Janitor Monkey is flexible enough to allow extending it to work with other cloud providers and cloud resources. The service is configured to run, by default, on non-holiday weekdays at 11 AM. The schedule can be easily reconfigured to fit your business needs.
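To make the default schedule concrete, here is a minimal Python sketch of a "non-holiday weekday at 11 AM" check. It is not Janitor Monkey's actual scheduler; the function name `should_run`, the `holidays` set, and the `run_hour` parameter are all hypothetical names chosen for illustration.

```python
from datetime import date, datetime
from typing import Set


def should_run(now: datetime, holidays: Set[date], run_hour: int = 11) -> bool:
    """Return True if a cleanup pass should start at `now`.

    Hypothetical helper: runs only Monday to Friday, skips any date
    listed in `holidays`, and only during the configured hour.
    """
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    is_holiday = now.date() in holidays
    return is_weekday and not is_holiday and now.hour == run_hour
```

Changing `run_hour` or the contents of `holidays` is the kind of adjustment the configurable schedule is meant to allow.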

Janitor Monkey determines whether a resource should be a cleanup candidate by applying a set of rules to it. If any of the rules determines that the resource is a cleanup candidate, Janitor Monkey marks the resource and schedules a time to clean it up. The open source version includes a collection of rules that are currently used at Netflix and that we believe are general enough for most users. The design of Janitor Monkey also makes it simple to customize rules or to add new ones.
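The "any rule marks the resource" behavior can be sketched as follows. This is an illustrative Python model, not the project's actual API: the `Resource` class, the `Rule` type, and `apply_rules` are hypothetical names, and taking the cleanup time from the first matching rule is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Optional


@dataclass
class Resource:
    """Hypothetical stand-in for a cloud resource tracked by the janitor."""
    resource_id: str
    marked: bool = False
    cleanup_time: Optional[datetime] = None


# A rule returns the scheduled cleanup time when the resource is a
# cleanup candidate, or None when the rule does not apply.
Rule = Callable[[Resource, datetime], Optional[datetime]]


def apply_rules(resource: Resource, rules: List[Rule], now: datetime) -> bool:
    """Mark the resource if any rule flags it; otherwise leave it unmarked."""
    for rule in rules:
        cleanup_time = rule(resource, now)
        if cleanup_time is not None:
            # Assumption: the first matching rule decides the cleanup time.
            resource.marked = True
            resource.cleanup_time = cleanup_time
            return True
    resource.marked = False
    resource.cleanup_time = None
    return False
```

A custom rule in this model is just another function with the `Rule` signature appended to the list.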

Because there are cases where you want to keep an unused resource around, the owner of a resource receives a notification a configurable number of days before Janitor Monkey deletes it. This prevents a resource that is still needed from being deleted. The resource owner can then flag the resources that they want to keep as exceptions, and Janitor Monkey will leave them alone.

Over the last year Janitor Monkey has deleted over 5,000 resources running in our production and test environments. It has helped keep our costs down and has freed up engineering time which is no longer needed to manage unused resources.

Resource Types and Rules

Four types of AWS resources are currently managed by Janitor Monkey: Instances, EBS Volumes, EBS Volume Snapshots, and Auto Scaling Groups. Each of these resource types has its own rules to mark unused resources. For example, an EBS volume is marked as a cleanup candidate if it has not been attached to any instance for 30 days. As another example, an instance is cleaned up by Janitor Monkey if it has not been in any auto scaling group for more than 3 days, because at Netflix such instances are known to be experiments; all others must be in auto scaling groups. The number of retention days in these rules is configurable, so the rules can be easily customized to fit your business requirements. We plan to make Janitor Monkey support more resource types in the future, such as launch configurations, security groups, and AMIs. The design of Janitor Monkey makes adding new resource types easy.
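The two example rules above can be sketched in Python with the retention period as a parameter. The classes and function names here are hypothetical, and using the launch time as a proxy for "time spent outside any auto scaling group" is a simplifying assumption, not a description of the real rule.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Volume:
    volume_id: str
    detached_since: Optional[datetime]  # None while the volume is attached


@dataclass
class Instance:
    instance_id: str
    launched_at: datetime
    auto_scaling_group: Optional[str]  # None if not in any group


def volume_is_candidate(volume: Volume, now: datetime,
                        retention_days: int = 30) -> bool:
    """Detached for at least `retention_days` days."""
    if volume.detached_since is None:
        return False
    return now - volume.detached_since >= timedelta(days=retention_days)


def instance_is_candidate(instance: Instance, now: datetime,
                          retention_days: int = 3) -> bool:
    """Outside any auto scaling group for more than `retention_days` days.

    Assumption: an instance with no group has had none since launch.
    """
    if instance.auto_scaling_group is not None:
        return False
    return now - instance.launched_at > timedelta(days=retention_days)
```

Passing a different `retention_days` value corresponds to customizing the retention period of a rule.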

How Janitor Monkey Cleans

Janitor Monkey works in three stages: "mark, notify, delete". When Janitor Monkey marks a resource as a cleanup candidate, it schedules a time to delete the resource. The delete time is specified in the rule that marks the resource.

Every resource is associated with an owner email, which can be specified as a tag on the resource. You can also easily extend Janitor Monkey to obtain this information from your internal system. The simplest approach is to use a default email address, e.g. your team's email list, for all resources. You can configure how many days before the scheduled termination Janitor Monkey sends a notification to the resource owner. The default is 2, which means that the owner receives a notification 2 business days ahead of the termination date.

During the 2-day period the resource owner can decide whether the resource can be deleted. If a resource needs to be retained, the owner can use a simple REST interface to flag it to be excluded by Janitor Monkey. The owner can later use another REST interface to remove the flag, and Janitor Monkey will then be able to manage the resource again.

When Janitor Monkey sees a resource marked as a cleanup candidate whose scheduled termination time has passed, it deletes the resource. The resource owner can also delete the resource manually to release it earlier and save cost. If the status of the resource changes so that it is no longer a cleanup candidate (e.g. a detached EBS volume is attached to an instance), Janitor Monkey unmarks the resource and no cleanup occurs.
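The "mark, notify, delete" flow can be summarized as a small decision function. The Python below is a sketch under stated assumptions rather than the real implementation: `Action`, `subtract_business_days`, and `next_action` are hypothetical names, business days are taken to be Monday through Friday with holidays ignored, and the order in which the checks are made is a guess.

```python
from datetime import date, timedelta
from enum import Enum


class Action(Enum):
    NONE = "none"
    UNMARK = "unmark"
    NOTIFY = "notify"
    DELETE = "delete"


def subtract_business_days(day: date, n: int) -> date:
    """Step back `n` business days (Mon-Fri; holidays are ignored here)."""
    while n > 0:
        day -= timedelta(days=1)
        if day.weekday() < 5:
            n -= 1
    return day


def next_action(today: date, termination_date: date, still_candidate: bool,
                opted_out: bool, notified: bool, lead_days: int = 2) -> Action:
    """Decide what to do with an already marked resource."""
    if opted_out:
        return Action.NONE      # owner flagged the resource to be kept
    if not still_candidate:
        return Action.UNMARK    # e.g. a detached volume was re-attached
    if today >= termination_date:
        return Action.DELETE    # scheduled termination time has passed
    notify_date = subtract_business_days(termination_date, lead_days)
    if not notified and today >= notify_date:
        return Action.NOTIFY
    return Action.NONE
```

For a termination date of Monday, January 7, 2013, `subtract_business_days` with the default of 2 skips the weekend and lands on Thursday, January 3, which is when the owner would first be notified in this model.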

Configuration and Customization

The resource types managed by Janitor Monkey, the rules for each resource type to mark cleanup candidates, and the parameters used to configure each individual rule, are all configurable. You can easily customize Janitor Monkey with the most appropriate set of rules for your resources by setting Janitor Monkey properties in a configuration file. You can also create your own rules or add support for new resource types, and we encourage you to contribute your cleanup rules to the project so that all can benefit.

Auditing, Logging, and Costs

Janitor Monkey events are logged in an Amazon SimpleDB table by default. You can easily check the SimpleDB records to find out what Janitor Monkey has done. The resources managed by Janitor Monkey are also stored in SimpleDB. At Netflix we have a UI for managing the Janitor Monkey resources and we have plans to open source it in the future as well.

There could be costs associated with Amazon SimpleDB, but in most cases the activity of Janitor Monkey should be small enough to fall within Amazon's Free Usage Tier. Ultimately the costs associated with running Janitor Monkey are your responsibility. For your reference, the costs of Amazon SimpleDB can be found at http://aws.amazon.com/simpledb/pricing/

Coming Up

In the near future we are planning to release some new resource types for Janitor Monkey to manage. As mentioned earlier, the next candidate will likely be launch configurations. We will also add support for using Edda to implement existing and new Janitor Monkey rules. Edda allows us to query the history of resources, helping Janitor Monkey find unused resources more accurately and reliably.

Summary

Janitor Monkey helps keep our cloud clean and clutter-free. We hope you find Janitor Monkey to be useful for your business. We'd appreciate any feedback on it. We're always looking for new members to join the team. If you are interested in working on great open source software, take a look at jobs.netflix.com for current openings!