Monday, March 3, 2014

The Netflix Dynamic Scripting Platform

The Engine that powers the Netflix API
Over the past couple of years, we have optimized the Netflix API with a view towards improving performance and increasing agility. In doing so, the API has evolved from a provider of RESTful web services to a platform that distributes development and deployment of new functionality across various teams within Netflix.

At the core of the redesign is a Dynamic Scripting Platform which provides us the ability to inject code into a running Java application at any time. This means we can alter the behavior of the application without a full scale deployment. As you can imagine, this powerful capability is useful in many scenarios. The API Server is one use case, and in this post, we describe how we use this platform to support a distributed development model at Netflix.

As a reminder, devices make HTTP requests to the API to access content from a distributed network of services. Device Engineering teams use the Dynamic Scripting Platform to deploy new or updated functionality to the API Server based on their own release cycles. They do so by uploading adapter code in the form of Groovy scripts to a running server and defining custom endpoints to front those scripts. The scripts are responsible for calling into the Java API layer and constructing responses that the client application expects. Client applications can access the new functionality within minutes of the upload by requesting the new or updated endpoints. The platform can support any JVM-compatible language, although at the moment we primarily use Groovy.

Architecting for scalability is a key goal for any system we build at Netflix and the Scripting Platform is no exception. In addition, as our platform has gained adoption, we are developing supporting infrastructure that spans the entire application development lifecycle for our users. In some ways, the API is now like an internal PaaS system that needs to provide a highly performant, scalable runtime; tools to address development tasks like revision control, testing and deployment; and operational insight into system health. The sections that follow explore these areas further.

Architecture

Here is a view into the API Server under the covers.

scriptplatform_apiserver2_25.png
[1] Endpoint Controller routes endpoint requests to the corresponding groovy scripts. It consults a registry to identify the mapping between an endpoint URI and its backing scripts.

[2] Script Manager handles requests for script management operations. The management API is exposed as a RESTful interface.

[3] Script source, compiled bytecode and related metadata are stored in a Cassandra cluster, replicated across all AWS regions.

[4] Script Cache is an in memory cache that holds compiled bytecode fetched from Cassandra. This eliminates the Cassandra lookup during endpoint request processing. Scripts are compiled by the first few servers running a new API build and the compiled bytecode is persisted in Cassandra. At startup, a server looks for persisted bytecode for a script before attempting to compile it in real time. Because deploying a set of canary instances is a standard step in our delivery workflow, the canary servers are the ones to incur the one-time penalty for script compilation. The cache is refreshed periodically to pick up new scripts.

[5] Admin Console and Deployment Tools are built on top of the script management API.

Script Development and Operations

Our experience in building a Delivery Pipeline for the API Server has influenced our thinking around the workflows for script management. Now that a part of the client code resides on the server in the form of scripts, we want to simplify the ways in which Device teams integrate script management activities into their workflows. Because technologies and release processes vary across teams, our goal is to provide a set of offerings from which they can pick and choose the tools that best suit their requirements.

The diagram below illustrates a sample script workflow and calls out the tools that can be used to support it. It is worth noting that such a workflow would represent just a part of a more comprehensive build, test and deploy process used for a client application.


scriptpipelineforblogv1.png

Distributed Development

To get script developers started, we provide them with canned recipes to help with IDE setup and dependency management for the Java API and related libraries.  In order to facilitate testing of the script code, we have built a Script Test Framework based on JUnit and Mockito. The Test Framework looks for test methods within a script that have standard JUnit annotations, executes them and generates a standard JUnit result report. Tests can also be run against a live server to validate functionality in the scripts.

Additionally, we have built a REPL tool to facilitate ad-hoc testing of small scripts or snippets of groovy code that can be shared as samples, for debugging etc.


Distributed Deployment

As mentioned earlier, the release cycles of the Device teams are decoupled from those of the API Team. Device teams have the ability to dynamically create new endpoints, update the scripts backing existing endpoints or delete endpoints as part of their releases. We provide command line tools built on top of our Endpoint Management API that can be used for all deployment related operations. Device teams use these tools to integrate script deployment activities with their automated build processes and manage the lifecycle of their endpoints. The tools also integrate with our internal auditing system to track production deployments.

Admin Operations & Insight

Just as the operation of the system is distributed across several teams, so is the responsibility of monitoring and maintaining system health. Our role as the platform provider is to equip our users with the appropriate level of insight and tooling so they can assume full ownership of their endpoints. The Admin Console provides users full visibility into real time health metrics, deployment activity, server resource consumption and traffic levels for their endpoints.








Engineers on the API team can get an aggregate view of the same data, as well as other metrics like script compilation activity that are crucial to track from a server health perspective. Relevant metrics are also integrated into our real time monitoring and alerting systems.


The screenshot to the left is from a top-level dashboard view that tracks script deployment activity.

Experiences and Lessons Learnt

Operating this platform with an increasing number of teams has taught us a lot and we have had several great wins!
  • Client application developers are able to tailor the number of network calls and the size of the payload to their applications. This results in more efficient client-side development and overall, an improved user experience for Netflix customers.
  • Distributed, decoupled API development has allowed us to increase our rate of innovation.
  • Response and recovery times for certain classes of bugs have gone from days or hours down to minutes. This is even more powerful in the case of devices that cannot easily be updated or require us to go through an update process with external partners.
  • Script deployments are inherently less risky than server deployments because the impact of the former is isolated to a certain class or subset of devices. This opens the door for increased nimbleness.
  • We are building an internal developer community around the API. This provides us an opportunity to promote collaboration and sharing of  resources, best practices and code across the various Device teams.

As expected, we have had our share of learnings as well.
  • The flexibility of our platform permits users to utilize the system in ways that might be different from those envisioned at design time. The strategy that several teams employed to manage their endpoints put undue stress on the API server in terms of increased system resources, and in one instance, caused a service outage. We had to react quickly with measures in the form of usage quotas and additional self-protect switches while we identified design changes to allow us to handle such use cases.
  • When we chose Cassandra as the repository for script code for the server, our goal was to have teams use their SCM tool of choice for managing script source code. Over time, we are finding ourselves building SCM-like features into our Management API and tools, as we work to address the needs of various teams. It has become clear to us that we need to offer a unified set of tools that cover deployment and SCM functionality.
  • The distributed development model combined with the dynamic nature of the scripts makes it challenging to understand system behavior and debug issues. RxJava introduces another layer of complexity in this regard because of its asynchronous nature. All of this highlights the need for detailed insights into scripts’ usage of the API.

We are actively working on solutions for the above and will follow up with another post when we are ready to share details.

Conclusion


The evolution of the Netflix API from a web services provider to a dynamic platform has allowed us to increase our rate of innovation while providing a high degree of flexibility and control to client application developers. We are investing in infrastructure to simplify the adoption and operation of this platform. We are also continuing to evolve the platform itself as we find new use cases for it. If this type of work excites you, reach out to us - we are always looking to add talented engineers to our team!