Monday, March 11, 2013

Python at Netflix

By Roy Rapoport, Brian Moyles, Jim Cistaro, and Corey Bertram

We’ve blogged a lot about how we use Java here at Netflix, but Python’s footprint in our environment continues to increase.  In honor of our sponsorship of PyCon, we wanted to highlight our many uses of Python at Netflix.

Developers at Netflix have the freedom to choose the technologies best suited for the job. More and more, developers turn to Python due to its rich batteries-included standard library, succinct and clean yet expressive syntax, large developer community, and the wealth of third-party libraries one can tap into to solve a given problem. Its dynamic underpinnings enable developers to rapidly iterate and innovate, two very important qualities at Netflix. These features (and more) have led to increasingly pervasive use of Python in everything from small tools to large applications: talking to AWS with boto, storing information with python-memcached and pycassa, managing processes with Envoy, polling RESTful APIs with requests, providing web interfaces with CherryPy and Bottle, and crunching data with scipy. To illustrate, here are some current projects taking advantage of Python:

Central Alert Gateway

The Central Alert Gateway (CAG) is a RESTful web application written in Python to which any process can post an alert, though the vast majority of alerts are triggered by our telemetry system, Atlas (which will be open sourced in the near future). Based on its configuration, CAG can send these alerts via email to interested parties, dispatch them to our notification system to page on-call engineers, suppress them if we’ve already alerted someone, or perform automated remediation actions (for example, reboot or terminate an EC2 instance if it starts appearing unhealthy). At our scale, we generate hundreds of thousands of alerts every day. Handling as many of these as possible automatically, and making sure we only notify people of new issues rather than telling them again about something they’re already aware of, is critical to our production efficiency (and quality of life).
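As a rough illustration of the suppress-or-route decision described above (this is not CAG's actual code), here is a minimal sketch. The class name `AlertGateway`, the severity labels, the action names, and the suppression window are all assumptions made for the example.

```python
import time
from dataclasses import dataclass, field


@dataclass
class AlertGateway:
    """Toy alert router: suppress recent repeats, otherwise pick an action."""

    suppress_seconds: float = 3600.0  # hypothetical suppression window
    # Hypothetical mapping from alert severity to the action to take.
    routes: dict = field(default_factory=lambda: {
        "info": "email",
        "critical": "page",
        "unhealthy_instance": "remediate",
    })
    _last_sent: dict = field(default_factory=dict)

    def handle(self, source, name, severity, now=None):
        """Return the action chosen for one incoming alert."""
        now = time.time() if now is None else now
        key = (source, name)
        last = self._last_sent.get(key)
        # Someone was already told about this alert recently: stay quiet.
        if last is not None and now - last < self.suppress_seconds:
            return "suppress"
        self._last_sent[key] = now
        # Unknown severities fall back to email in this sketch.
        return self.routes.get(severity, "email")
```

Passing `now` explicitly keeps the sketch easy to exercise without waiting on a real clock; a second identical alert inside the window comes back as `"suppress"`.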

Chaos Gorilla

We’ve talked before about how we use Chaos Monkey to make sure our services are resilient to the termination of any small number of instances.  As we’ve improved resiliency to instance failures, we’ve been working to set the reliability bar much, much higher.  Chaos Gorilla integrates with Asgard and Edda, and allows us to simulate the loss of an entire availability zone in a given region.  This sort of failure mode -- an AZ either going down or simply becoming inaccessible to other AZs -- happens once in a blue moon, but it’s a big enough problem that simulating it and making sure our entire ecosystem is resilient to that failure is very important to us.
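One simple question a zone-loss exercise raises is which applications would have nothing left running if a given zone disappeared. The sketch below answers that for a list of instance records; it is only an illustration, and the record fields (`app`, `zone`) and the function name are assumptions, not the format Edda or Chaos Gorilla actually use.

```python
from collections import defaultdict


def apps_at_risk(inventory, lost_zone):
    """Return apps that would have no instances left if `lost_zone` vanished."""
    survivors = defaultdict(int)
    for inst in inventory:
        # Count only instances that live outside the zone being "lost".
        survivors[inst["app"]] += inst["zone"] != lost_zone
    return sorted(app for app, count in survivors.items() if count == 0)
```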

Security Monkey and Howler Monkey

Security Monkey is designed to keep track of configuration history and alert on changes in EC2 security-related policies such as security groups, IAM roles, S3 access control lists, etc. This makes our Cloud Security team very happy, since without it there would be no way to know when, or how, a change occurred in the environment.
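The core idea of keeping a configuration history and flagging changes can be sketched in a few lines. This is an illustration under assumptions, not Security Monkey's implementation: rules are modeled as plain tuples such as `("tcp", 22, "10.0.0.0/8")`, and `ConfigHistory` and `diff_rules` are hypothetical names.

```python
def diff_rules(previous, current):
    """Compare two snapshots of rules and report what was added or removed."""
    previous, current = set(previous), set(current)
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }


class ConfigHistory:
    """Keep every distinct snapshot so we can say when and how a policy changed."""

    def __init__(self):
        self.snapshots = []  # list of (timestamp, frozenset of rules)

    def record(self, timestamp, rules):
        """Store a snapshot; return the change versus the prior one, or None."""
        rules = frozenset(rules)
        change = None
        if self.snapshots and self.snapshots[-1][1] != rules:
            change = diff_rules(self.snapshots[-1][1], rules)
        # Only the first snapshot and real changes are stored.
        if not self.snapshots or change is not None:
            self.snapshots.append((timestamp, rules))
        return change
```

A non-`None` return value from `record` is the point at which an alert would be raised.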

Howler Monkey is designed to automatically discover and keep track of SSL certificates in our environments and domain names, no matter where they may reside, and to alert us as we get close to a certificate’s expiration date, with flexible and powerful subscription and alerting mechanisms. Because of it, we went from having an SSL certificate expire unexpectedly, with production impact, about once a quarter to having no production outages due to SSL expirations in the last eighteen months. It’s a simple tool that makes a huge difference for us and our dozens of SSL certificates.
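The expiration check itself needs little more than the standard library. The sketch below is not Howler Monkey's code; the function names, the 30-day threshold, and the example hostnames are assumptions. It takes `notAfter` strings in the format Python's `ssl` module reports (for instance in the dictionary returned by `SSLSocket.getpeercert()`) and uses `ssl.cert_time_to_seconds` to parse them.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Whole days left before a certificate's notAfter time (may be negative)."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    now = now or datetime.now(timezone.utc)
    return (expires - now).days


def expiring_soon(certs, warn_days=30, now=None):
    """Return hostnames whose certificates expire within `warn_days` days."""
    return sorted(
        host
        for host, not_after in certs.items()
        if days_until_expiry(not_after, now) <= warn_days
    )
```

Here `certs` is a mapping such as `{"a.example.com": "Jan  5 09:34:43 2018 GMT"}`; how that mapping gets discovered and kept current is the harder part of the problem and is left out.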

Chronos

We push hard to always increase our speed of innovation, and at the same time reduce the cost of making changes in the environment. In the datacenter days, we forced every production change to be logged in a change control system because the first question everyone asks when looking at an issue is “What changed recently?” We found that a formal change control system didn’t work well with our culture of freedom and responsibility, so we deprecated the formal change control process for the vast majority of changes in favor of Chronos. Chronos accepts events via a REST interface and allows humans and machines to ask questions like “what happened in the last hour?” or “what software did we deploy in the last day?” It integrates with our monkeys and Asgard so the vast majority of changes in our environment are automatically reported to it, including event types such as deployments, A/B tests, security events, and other automated actions.
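To make the "what happened in the last hour?" style of question concrete, here is a minimal in-memory sketch. It is an illustration only: `Event`, `EventLog`, and the event kinds are assumed names, and the real Chronos sits behind a REST interface rather than a Python list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    when: datetime
    kind: str      # e.g. "deployment", "ab_test", "security" (assumed labels)
    message: str


class EventLog:
    """Append-only record of changes that can be queried by time window."""

    def __init__(self):
        self._events = []

    def report(self, event):
        self._events.append(event)

    def since(self, window, now, kind=None):
        """Events from the last `window` (a timedelta), newest first."""
        cutoff = now - window
        hits = [
            e for e in self._events
            if cutoff <= e.when <= now and (kind is None or e.kind == kind)
        ]
        return sorted(hits, key=lambda e: e.when, reverse=True)
```

With this shape, "what software did we deploy in the last day?" becomes `log.since(timedelta(days=1), now, kind="deployment")`.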

Aminator

Readers of the blog or those who have seen our engineers present on the Netflix Platform may have seen numerous references to baking -- our name for the process by which we take an application and turn it into a deployable Amazon Machine Image. Aminator is the tool that does the heavy lifting and produces almost every single image that powers Netflix.

Aminator attaches a foundation image to a running EC2 instance, preps the image, installs packages into the image, and turns the resultant image into a complete Netflix application. Simple in concept and execution, but absolutely critical to our success. Pre-staging images and avoiding post-launch configuration really helps when launching hundreds or thousands of instances.

Cass Ops

Netflix Cassandra Operations uses Python for automation and monitoring tools.  We have created many modules for management and maintenance of our Cassandra clusters.  These modules use REST APIs to interface with other Netflix tools to manage our instances within AWS as well as interfacing directly with the Cassandra instances themselves.  These activities include creating clusters using Asgard, tracking our inventory with Edda, monitoring Eureka to make sure clusters are visible to clients, managing Cassandra repairs and compactions, and doing software upgrades.  In addition to our internally developed tools, we take advantage of various Python packages.  We use JenkinsAPI to interface with Jenkins for both job configuration and status information on our monitoring and maintenance jobs.  Pycassa is used to access our operational data stored in Cassandra.  Boto gives us the ability to communicate with various AWS services such as S3 storage.  Paramiko allows us to ssh to instances without needing to create a subprocess.  Use of Python for these tools has allowed us to rapidly develop and enhance our tools as Cassandra has grown at Netflix.

Data Science and Engineering

Our Data Science and Engineering teams rely heavily on Python to help surface insights from the vast quantities of data produced by the organization.  Python is used in tools for monitoring data quality, managing data movement and syncing, expressing business logic inside our ETL workflows, and running various web applications to visualize data.  

One such application is Sting, a lightweight RESTful web service that slices, dices, and produces visualizations of large in-memory datasets.  Our data science teams use Sting to analyze and iterate against the results of Hive queries on our big data platform.  While a Hive query may take hours to complete, once the initial dataset is loaded in Sting, additional iterations using OLAP style operations enjoy sub-second response times.  Datasets can be set to periodically refresh, so results are kept fresh and up to date.  Sting is written entirely in Python, making heavy use of libraries such as pandas and numpy to perform fast filtering and aggregation operations.
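The pattern of loading a slow query result once and then slicing it repeatedly in memory can be sketched as follows. This is not Sting's code, and to stay dependency-free it uses only the standard library rather than pandas and numpy; `InMemoryDataset` and the column names in the usage note are assumptions.

```python
from collections import defaultdict


class InMemoryDataset:
    """Hold query results in memory so repeated slices avoid re-running the query."""

    def __init__(self, rows):
        # Each row is a dict, e.g. one record from a finished query.
        self.rows = list(rows)

    def aggregate(self, group_by, measure, where=None):
        """Sum `measure` per `group_by` value, optionally filtering rows first."""
        totals = defaultdict(float)
        for row in self.rows:
            if where is None or where(row):
                totals[row[group_by]] += row[measure]
        return dict(totals)
```

Once the rows are loaded, a call such as `ds.aggregate("device", "hours", where=lambda r: r["country"] == "US")` touches only memory. With pandas the same slice would typically be a filter followed by `groupby` and `sum`.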

General Tooling and the Service Class

Pythonistas at Netflix have been championing the adoption of Python and striving to make its power accessible to everyone within the organization. To do this we wrapped libraries for many of the OSS tools now being released by Netflix, as well as a few internal services, in a general-use ‘Service’ class. With this we have helped our users quickly and easily stand up new services that have access to many common actions such as alerting, telemetry, Eureka, and easy AWS API access. We expect to make many of these libraries available this year and will be around to chat about them at PyCon!

Here is an example of how easily we can stand up a service that has Eureka registration, Route 53 registration, a basic status page and exposes a fully functional Bottle service:
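The sketch below is a stand-in written for illustration, not the Netflix ‘Service’ class, whose API is not shown in this post. Everything in it is an assumption: the class and method names are hypothetical, registration is reduced to recording a name instead of calling Eureka or Route 53, and a bare WSGI callable takes the place of Bottle (Bottle applications are themselves WSGI applications).

```python
import json


class Service:
    """Hypothetical base class: collects routes and always serves /status."""

    def __init__(self, name):
        self.name = name
        self.routes = {"/status": self.status}
        self.registrations = []  # stand-in for Eureka / Route 53 calls

    def register(self, registry):
        # A real implementation would call the registry's API here.
        self.registrations.append(registry)

    def route(self, path):
        """Decorator that maps a URL path to a handler returning a dict."""
        def decorator(handler):
            self.routes[path] = handler
            return handler
        return decorator

    def status(self):
        return {
            "service": self.name,
            "status": "ok",
            "registered_with": self.registrations,
        }

    def __call__(self, environ, start_response):
        """WSGI entry point, so any WSGI server can host the service."""
        handler = self.routes.get(environ.get("PATH_INFO", "/"))
        headers = [("Content-Type", "application/json")]
        if handler is None:
            start_response("404 Not Found", headers)
            return [b'{"error": "not found"}']
        start_response("200 OK", headers)
        return [json.dumps(handler()).encode("utf-8")]


service = Service("hello")
service.register("eureka")
service.register("route53")


@service.route("/hello")
def hello():
    return {"message": "Hello from Python"}
```

To try it locally, the standard library's `wsgiref.simple_server.make_server("", 8080, service).serve_forever()` will serve both `/hello` and `/status`.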

These systems and applications offer a glimpse of the overall use and importance of Python at Netflix. They contribute heavily to our overall service quality, allow us to rapidly innovate, and are a whole lot of fun to work on to boot!

We’re sponsoring PyCon this year, and in addition to a slew of Netflixers attending we’ll also have a booth at the expo area and give a talk expanding on some of the use cases above.  If any of this sounds interesting, come by and chat.  Also, we’re hiring Senior Site Reliability Engineers, Senior DevOps Engineers, and Data Science Platform Engineers.