Tuesday, July 5, 2016

Product Integration Testing at the Speed of Netflix

The Netflix member experience is delivered using a microservice architecture and is personalized to each of our 80+ million members. These services are owned by multiple teams, each with its own build and release lifecycle. This means it is imperative to have a vigilant and knowledgeable Integration Test team that ensures end-to-end quality standards are maintained even as microservices are deployed every day in a decentralized fashion.

As the Product Engineering Integration Test team, our charter is to not impact velocity of innovation while still being the gatekeepers of quality and ensuring developers get feedback quickly. Every development team is responsible for the quality of their team’s deliverables. Our goal is to work seamlessly across various engineering groups with a focus on end-to-end functionality and coordination between teams. We are a lean team with a handful of integration test engineers in an organization of 200+ engineers.

Innovating at a blistering pace while ensuring quality is maintained continues to create interesting challenges for our team. In this post, we are going to look at three such challenges:

  1. Testing and monitoring for High Impact Titles (HITs)
  2. A/B testing
  3. Global launch

Testing and monitoring High Impact Titles

There are a lot of High Impact Titles (HITs), like Orange is the New Black, that regularly launch on Netflix. HITs come in all forms and sizes. Some are serialized, some are standalone, some are just for kids, some launch with all episodes of a season at once and some launch with a few episodes every week. Several of these titles are launched with complicated A/B tests, where each test cell has a different member experience.

These titles have high visibility for our members and hence need to be tested extensively. Testing starts several weeks before launch and ramps up till launch day. After launch we monitor these titles on different device platforms across all countries.

Testing strategies differ by the phase they are in. There are different promotion strategies for different phases which makes testing/automating a complicated task. There are primarily two phases:

  1. Before title launch: Prior to launch we have to ensure that the title metadata is in place to allow a smooth operation on launch day. Since there are a lot of teams involved in the launch of a HIT, we need to make sure that all backend systems are talking to each other and to the front end UI seamlessly. The title is promoted via Spotlight (this is the large billboard-like display at the top of the Netflix homepage), teasers and trailers. However, since there is personalization at every level at Netflix, we need to create complex test cases to verify that the right kind of titles are promoted to the right member profiles. Because the system is in flux during this phase, automation is difficult, so most testing here is manual.

  2. After a title is launched: Our work does not end on launch day. We have to continuously monitor the launched titles to make sure that the member experience is not compromised in any way. The title becomes part of the larger Netflix catalog and this creates a challenge in itself. We now need to write tests that check whether the title continues to find its audience organically and whether data integrity for that title is maintained (for instance, some checks verify that episode summaries are unchanged since launch, and another check verifies that search results continue to return the title for the right search strings). But with 600 hours of Netflix original programming coming online this year alone, in addition to licensed content, we cannot rely on manual testing here. Also, once the title is launched, there are generic assumptions we can make about it, because data and promotional logic for that title will not change, e.g. number of episodes > 0 for TV shows, title is searchable (for both movies and TV shows), etc. This enables us to use automation to continuously monitor launched titles and check that the features related to every title continue to work correctly.
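The generic post-launch assumptions above lend themselves to small, data-driven checks. The following is a minimal Python sketch of that idea, not our actual monitoring code: the `Title` fields, the `check_title` helper, and the in-memory `fake_search` function are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Title:
    """Hypothetical snapshot of a launched title's metadata."""
    title_id: str
    kind: str                      # "show" or "movie"
    name: str
    episode_summaries: list = field(default_factory=list)


def check_title(title, launch_summaries, search):
    """Return a list of failed-check descriptions (empty means healthy).

    `launch_summaries` is the list of episode summaries recorded at launch;
    `search` is any callable mapping a query string to a list of title ids.
    """
    failures = []
    # TV shows must always expose at least one episode.
    if title.kind == "show" and len(title.episode_summaries) == 0:
        failures.append("no episodes")
    # Episode summaries should be unchanged since launch.
    if title.episode_summaries != launch_summaries:
        failures.append("episode summaries changed")
    # The title must be searchable by its name (movies and shows alike).
    if title.title_id not in search(title.name):
        failures.append("not searchable")
    return failures


# In-memory stand-in for a search service (an assumption for this sketch).
CATALOG = {"orange is the new black": ["t1"]}


def fake_search(query):
    return CATALOG.get(query.lower(), [])


show = Title("t1", "show", "Orange is the New Black", ["S1E1 summary"])
print(check_title(show, ["S1E1 summary"], fake_search))  # prints []
```

Because each check only depends on data that is stable after launch, the same function can be run on a schedule against every launched title.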

HIT testing is challenging and date driven. But it is an exhilarating experience to be part of a title launch, making sure that all related features and backend logic are working correctly at launch time. Celebrity sightings and cool Netflix swag are also nice perks :)

A/B Testing

We A/B test a lot. At any given time, we have a variety of A/B tests running, all with varying levels of complexity.

In the past, most of the validation behind A/B tests was a combination of automated and manual testing, where the automated tests were implemented for individual components (white box testing), while end-to-end testing (black box testing) was mostly conducted manually. As we started to experience a significant increase in the volume of A/B tests, it was not scalable to manually validate the tests end-to-end, and we started ramping up on automation.

One major challenge with adding end-to-end automation for our A/B tests was the sheer number of components to automate. Our approach was to treat test automation as a deliverable product and focus on delivering a minimum viable product (MVP) composed of reusable pieces. Our MVP requirement was to be able to assert a basic member experience by validating the data from the REST endpoints of the various microservices. This gave us a chance to iterate towards a solution instead of searching for the perfect one right from the start.  

Having a common library which would provide us with the capability to reuse and repurpose modules for every automated test was an essential starting point for us. For example, we had an A/B test which caused modifications to a member’s MyList - when automating this, we wrote a script to add/remove title(s) to/from a member’s MyList. These scripts were parameterized such that they could be reused for any future A/B test that dealt with MyList. This approach enabled us to automate our A/B tests faster since we had more reusable building blocks. We also obtained efficiency by reusing as much existing automation as possible. For example,  instead of writing our own UI automation, we were able to utilize the Netflix Test Studio to trigger test scenarios that required UI actions across various devices.
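To illustrate what a parameterized, reusable building block of this kind might look like, here is a minimal Python sketch. It is an assumption-laden example rather than our library: `InMemoryMyList` is a hypothetical stand-in for a client that would talk to a MyList REST endpoint, and `update_my_list` is an invented helper name.

```python
class InMemoryMyList:
    """Stand-in for a MyList service; a real test would call a REST endpoint."""

    def __init__(self):
        self._lists = {}

    def add(self, member_id, title_id):
        titles = self._lists.setdefault(member_id, [])
        if title_id not in titles:
            titles.append(title_id)

    def remove(self, member_id, title_id):
        if title_id in self._lists.get(member_id, []):
            self._lists[member_id].remove(title_id)

    def get(self, member_id):
        return list(self._lists.get(member_id, []))


def update_my_list(client, member_id, action, title_ids):
    """Add or remove one or more titles, then return the resulting list."""
    if action not in ("add", "remove"):
        raise ValueError(f"unknown action: {action}")
    operation = client.add if action == "add" else client.remove
    for title_id in title_ids:
        operation(member_id, title_id)
    return client.get(member_id)


client = InMemoryMyList()
update_my_list(client, "member-1", "add", ["t1", "t2"])
remaining = update_my_list(client, "member-1", "remove", ["t1"])
print(remaining)  # prints ['t2']
```

Since the member, the action, and the titles are all parameters, any later A/B test that touches MyList can call the same helper instead of reimplementing it.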

When choosing a language/platform to implement our automation in, our focus was on providing quick feedback to the product teams. For that we needed really fast test suite execution, on the order of seconds. We also wanted to make our tests as easy to implement and deploy as possible. With these two requirements in mind, we discounted our first choice, Java. Our tests would have depended on several interdependent jar files, so we would have had to account for the overhead of dependency management and versioning, and we would have been susceptible to changes across different versions of the jars. This would have significantly increased the test runtimes.

We decided to implement our automation by accessing microservices through their REST endpoints, so that we could bypass the use of jars, and avoid writing any business logic. In order to ensure the simplicity of implementation and deployment of our automation, we decided to use a combination of parameterized shell and python scripts that could be executed from a command line. There would be a single shell script to control test case execution, which would call other shell/python scripts that would function as reusable utilities.
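The control flow described above, one driver that invokes small command-line utilities and records pass/fail, can be sketched as follows. Note the substitution: our driver is a shell script, while this sketch uses Python's `subprocess` module for the same role. The test names and the placeholder commands are hypothetical; a real suite would point at its own utility scripts.

```python
import subprocess
import sys
import time


def run_test_case(name, command, timeout=120):
    """Run one test command; exit code 0 is a pass. Returns a result dict."""
    start = time.monotonic()
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
        passed = completed.returncode == 0
        output = completed.stdout + completed.stderr
    except subprocess.TimeoutExpired:
        passed, output = False, "timed out"
    return {
        "name": name,
        "passed": passed,
        "seconds": time.monotonic() - start,
        "output": output,
    }


def run_suite(test_cases):
    """Run every (name, command) pair and print one PASS/FAIL line per test."""
    results = [run_test_case(name, cmd) for name, cmd in test_cases]
    for r in results:
        status = "PASS" if r["passed"] else "FAIL"
        print(f"{status} {r['name']} ({r['seconds']:.1f}s)")
    return results


# Placeholder commands; in practice these would be paths to utility scripts.
suite = [
    ("always_passes", [sys.executable, "-c", "print('ok')"]),
    ("always_fails", [sys.executable, "-c", "import sys; sys.exit(1)"]),
]
results = run_suite(suite)
```

Printing one status line per test keeps the console output easy to scan and easy for a CI job to parse.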

This approach yielded several benefits:

  1. We were able to obtain test runtimes (including setup and teardown) within a range of 4 - 90 seconds, with a median runtime of 40 seconds. Using Java-based automation, we estimate that our median runtimes would have been between 5 and 6 minutes.
  2. Continuous Integration was simplified - All we needed was a Jenkins job which would download the code from our repo, execute the necessary scripts, and log the results. Jenkins’ built-in console log parsing was also sufficient to provide test pass/fail statistics.
  3. It is easy to get started - In order for another engineer to run our test suite, the only thing needed is access to our git repo and a terminal.

Global Launch

One of our largest projects in 2015 was to make sure we had sufficient integration testing in place to ensure Netflix’s simultaneous launch in 130 countries would go smoothly. This meant that, at a minimum, we needed our smoke test suite to be automated for every country and language combination. This effectively added another feature dimension to our automation product.

Our tests were sufficiently fast, so we initially decided that all we needed was to run our test code in a loop for each country/locale combination. The result was that tests which completed in about 15 seconds would now take a little over an hour to complete. In addition, each test log was now about 250 times larger, making it more onerous to investigate failures. We had to find a better approach to this problem, so we did two things:

  1. We utilized the Jenkins Matrix plugin to parallelize our tests so that tests for each country would run in parallel. We also had to customize our Jenkins slaves to use multiple executors so that other jobs wouldn’t queue up in the event our tests ran into any race conditions or infinite loops. This was feasible for us because our automation only had the overhead of running shell scripts, and not having to preload binaries.

  2. We didn’t want to refactor every test written up to this point, and we didn’t want every test to run against every single country/locale combination. As a result, we decided to use an opt-in model, where we could continue writing automated tests the way we had been writing them for a while, and to make a test global-ready, an additional wrapper would be added to the test. This wrapper would take in the test case id and the country/locale combination as parameters and then execute the test case with those parameters.
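The two steps above can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: a thread pool stands in for the per-country parallelism that the Jenkins Matrix plugin gave us, and the registry, the `register_case` decorator, `run_global`, and the test ids are hypothetical names. As a rough sanity check on the numbers, if the 250x log growth reflects about 250 combinations, then 15 seconds x 250 is about 62.5 minutes when run sequentially, which matches "a little over an hour".

```python
from concurrent.futures import ThreadPoolExecutor

# Registry of test cases, keyed by a hypothetical test case id.
TEST_CASES = {}
# Only ids listed here have opted in to global runs.
GLOBAL_READY = set()


def register_case(case_id, global_ready=False):
    """Decorator that registers a test and optionally opts it in globally."""
    def register(func):
        TEST_CASES[case_id] = func
        if global_ready:
            GLOBAL_READY.add(case_id)
        return func
    return register


@register_case("search_smoke", global_ready=True)
def search_smoke(country="US", locale="en-US"):
    # Placeholder assertion; a real test would query a service.
    return bool(country) and bool(locale)


@register_case("legacy_check")      # not opted in: runs with defaults only
def legacy_check(country="US", locale="en-US"):
    return True


def run_global(case_id, country, locale):
    """Wrapper: run one opted-in test for one country/locale combination."""
    if case_id not in GLOBAL_READY:
        raise ValueError(f"{case_id} has not opted in to global runs")
    passed = TEST_CASES[case_id](country=country, locale=locale)
    return (case_id, country, locale, passed)


def run_everywhere(case_id, combinations, workers=8):
    """Fan one test out over all combinations in parallel, keeping order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(run_global, case_id, country, locale)
            for country, locale in combinations
        ]
        return [f.result() for f in futures]


combos = [("US", "en-US"), ("FR", "fr-FR"), ("JP", "ja-JP")]
global_results = run_everywhere("search_smoke", combos)
```

Existing tests such as `legacy_check` keep working unchanged, and a test only pays the cost of global execution once it has explicitly opted in.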

Today, we have automation running globally that covers all high priority integration test cases including monitoring for HITs in all regions where that title is available.

Future Challenges

The pace of innovation doesn’t slow down at Netflix; it only accelerates. Consequently, our automation product continues to evolve. Some of the projects on our roadmap are:

  1. Workflow-based tests: This would include representing a test case as a workflow, or a series of steps to mimic the flow of data through the Netflix services pipeline. The reason for doing this is to reduce the overhead in investigating test failures, by easily identifying the step where the failure occurred.

  2. Alert integration: Several alerting systems are in place across Netflix. When certain alerts are triggered, it may not be relevant to execute certain test suites. This is because the tests would depend on services which may not be functioning at 100%, and would possibly fail, giving us results that would not be actionable. We need to build a system that can listen to these alerts and then determine which tests need to be run.

  3. Chaos Integration: Our tests currently assume the Netflix ecosystem is functioning at 100%; however, this may not always be the case. The reliability engineering team constantly runs chaos exercises to test the overall integrity of the system. Presently, the results of test automation in a degraded environment show upwards of a 90% failure rate. We need to enhance our test automation to provide relevant results when executed in a degraded environment.
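Since these are roadmap items, the following Python sketch is purely speculative and only shows the shape the first two ideas could take: a workflow runner that reports the step where a failure occurred, and a selector that skips suites whose dependencies are currently alerting. All names (`run_workflow`, `select_suites`, the step and service names) are hypothetical.

```python
def run_workflow(steps):
    """Run (step_name, callable) pairs in order; stop at the first failure.

    Returns (passed, failed_step), where failed_step is None on success.
    """
    for step_name, step in steps:
        try:
            ok = step()
        except Exception:
            ok = False
        if not ok:
            return (False, step_name)
    return (True, None)


def select_suites(suites, active_alerts):
    """Split suites into (runnable, skipped) based on active alerts.

    `suites` maps a suite name to the set of services it depends on;
    `active_alerts` is the set of services currently alerting.
    """
    runnable, skipped = [], []
    for suite_name, dependencies in suites.items():
        if dependencies & active_alerts:
            skipped.append(suite_name)
        else:
            runnable.append(suite_name)
    return runnable, skipped


steps = [
    ("create_member", lambda: True),
    ("add_to_my_list", lambda: False),   # simulated failure
    ("verify_homepage", lambda: True),
]
print(run_workflow(steps))  # prints (False, 'add_to_my_list')

suites = {
    "search": {"search-service"},
    "my_list": {"list-service", "member-service"},
}
print(select_suites(suites, {"list-service"}))  # prints (['search'], ['my_list'])
```

Naming the failing step up front shortens failure investigation, and skipping suites with alerting dependencies avoids producing results that nobody can act on.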

In future blog posts, we will delve deeper and talk about ongoing challenges and other initiatives. Our culture of Freedom and Responsibility plays a significant role in enabling us to adapt quickly to a rapidly evolving ecosystem. There is much more experimentation ahead, and new challenges to face. If new challenges excite you as much as they excite us, join us.