Monday, June 15, 2015

NTS: Real-time Streaming for Test Automation

by Peter Hausel and Jwalant Shah

Netflix Test Studio

Netflix members can enjoy instant access to TV shows and movies on over 1,400 different device/OS permutations. Assessing long-duration playback quality and delivering a great member experience across such a diverse set of playback devices presented a huge challenge to the team.

Netflix Test Studio (NTS) was created to give internal and external developers a consistent way to deploy and execute tests. It achieves this by abstracting away device differences. NTS also provides a standard set of tools for assessing the responsiveness and quality of the overall experience. NTS now runs over 40,000 long-running tests each day on over 600 devices around the world.


NTS is a cloud-based automation framework that lets you remote-control most Netflix Ready Devices. In this post we’ll focus on two key aspects of the framework:
  • Collecting test results in near real time.
    • A highly event-driven architecture makes this possible: the single-page UI sends JSON snippets to the device, and JavaScript listeners on the device fire events back. We also need to be able to play back events exactly as they happened, much like replaying a state machine.
  • Allowing testers to interact with both the device and various Netflix services during execution.
    • Integrated tests require control of the test execution stream in order to simulate real-world conditions: we want to simulate failures, and to pause, debug, and resume during test execution.
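The "play back events as they happened" idea above can be sketched in a few lines. This is an illustrative model only (class and field names are hypothetical, not NTS internals): events arrive as a time-ordered log, and replaying a prefix of that log reconstructs the device state at any point in time, like stepping a state machine.

```java
import java.util.*;

// Minimal sketch of event playback: replay a time-ordered event log
// (timestampMs, key, value) to reconstruct state at a given moment.
public class EventReplay {
    public static Map<String, String> replay(List<String[]> events, long untilMs) {
        Map<String, String> state = new LinkedHashMap<>();
        for (String[] e : events) {
            long ts = Long.parseLong(e[0]);
            if (ts > untilMs) break;   // stop at the requested point in time
            state.put(e[1], e[2]);     // later events overwrite earlier ones
        }
        return state;
    }

    public static void main(String[] args) {
        List<String[]> log = Arrays.asList(
            new String[]{"100", "player", "loading"},
            new String[]{"250", "player", "playing"},
            new String[]{"400", "player", "paused"});
        // State as of t=300: the player was still "playing"
        System.out.println(replay(log, 300).get("player"));
    }
}
```

Because the log is append-only and ordered, the same replay works for live monitoring and for after-the-fact debugging.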

A typical user interface for Test Execution using NTS

A Typical NTS Test:

Architecture overview

The early implementation of NTS had a relatively simple design: hijack a Netflix Ready Device for automation via various redirection methods, then have a Test Harness (test executor) coordinate the execution with the help of a central, public-facing Controller service. Eventually, we would get data out of the device via long polling, validate steps, and bubble validation results back up to the client. We built separate clusters of this architecture for each Netflix SDK version.

Original Architecture using Long Polling


Event playback is not supported

This model worked relatively well in the beginning. However, as the number of supported devices, SDKs, and test cases grew, we started seeing the limitations of this approach: messages were sometimes lost; there was no way of knowing what exactly happened; error messages were misleading; tests were hard to monitor and play back in real time; and finally, maintaining nearly identical clusters with different test content and SDK versions introduced an additional maintenance burden.

In the next iteration of the tool, we removed the Controller service and most of the polling by introducing a WebSocket proxy (built on top of JSR-356) sitting between the clients and the Test Executors. We also introduced JSON-RPC as the command protocol.
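To make the command protocol concrete, here is a minimal sketch of a JSON-RPC 2.0 request as it might travel from a client to a Test Executor. The method and parameter names (`pausePlayback`, `sessionId`) are illustrative assumptions, not the actual NTS command set.

```java
// Hypothetical sketch of a JSON-RPC 2.0 command envelope; the actual
// NTS methods and parameters are not shown in the post.
public class JsonRpcCommand {
    public static String request(int id, String method, String params) {
        return String.format(
            "{\"jsonrpc\":\"2.0\",\"id\":%d,\"method\":\"%s\",\"params\":%s}",
            id, method, params);
    }

    public static void main(String[] args) {
        // e.g. ask a device under test to pause playback mid-run
        System.out.println(request(1, "pausePlayback", "{\"sessionId\":\"abc\"}"));
    }
}
```

The fixed envelope (id, method, params) is what lets the framework correlate responses with requests over a single persistent WebSocket.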

Updated Version - Near-Realtime (Almost There)


Pub/Sub without event playback support

  • The Test Executor submits events in time-series fashion to a WebSocket bus which terminates at the Dispatcher.
  • The Client connects to a Dispatcher with session id information. There is a one-to-many relationship between a Dispatcher and Test Executors.
  • Each Dispatcher instance keeps an internal lookup from test-execution session ids to the WebSocket connections to Test Executors, and delivers messages received over those connections to the Client.
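The Dispatcher's internal lookup described above can be sketched as a concurrent map from session id to a client sink. All names here are illustrative (the real Dispatcher manages WebSocket sessions, not `Consumer` callbacks), but the fan-out logic is the same.

```java
import java.util.concurrent.*;
import java.util.function.Consumer;

// Sketch of the Dispatcher's session-id -> client lookup: messages
// arriving from a Test Executor are routed to the subscribed client.
public class Dispatcher {
    private final ConcurrentMap<String, Consumer<String>> clients = new ConcurrentHashMap<>();

    // Called when a client opens a connection for a given test session.
    public void subscribe(String sessionId, Consumer<String> clientSink) {
        clients.put(sessionId, clientSink);
    }

    // Called when a message arrives over a Test Executor's connection.
    public boolean deliver(String sessionId, String message) {
        Consumer<String> sink = clients.get(sessionId);
        if (sink == null) return false;   // no client watching this session
        sink.accept(message);
        return true;
    }
}
```

This per-instance map is also the crux of the scaling problem mentioned below: the state lives inside one proxy instance, so clients and executors for the same session must land on the same node.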

This approach solved most of our issues: fewer indirections, real-time streaming capabilities, and a push-based design. Only two issues remained: message durability was still not supported, and, more importantly, the WebSocket proxy was difficult to scale out due to its stateful nature.

At this point, we started looking into Apache Kafka to replace the internal WebSocket layer with a distributed pub/sub and message queue solution.

Current Version - Kafka

Pub/Sub with event playback support
A few interesting properties of this pub/sub system:
  • Dispatcher is responsible for handling client requests to subscribe to Test Execution events stream.
  • Kafka provides a scalable message queue between the Test Executors and the Dispatcher. Since each session id is mapped to a particular partition, and each message sent to the client includes the current Kafka offset, we can now guarantee reliable delivery of messages to clients, with support for replaying messages after a network reconnection.
  • Multiple clients can subscribe to the same stream without additional overhead, and admin users can view and monitor remote users' test executions in real time.
  • The same stream is consumed for analytics purposes as well.
  • Throughput/latency: during load testing, we consistently saw ~90-100 ms latency per message with 100 concurrent users (our test setup was 6 brokers deployed on 6 d2.xlarge instances). In our production system, latency is often lower due to batching.
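Two of the properties above are worth spelling out: a session id always hashes to the same partition (so a session's events stay ordered), and a client that remembers the last offset it saw can replay everything after it on reconnect. This sketch uses no real Kafka client; it just mirrors the idea of Kafka's default hash partitioner and offset-based consumption with plain Java.

```java
import java.util.*;

// Illustrative sketch (no Kafka client): stable session-id -> partition
// mapping, and offset-based replay after a reconnect.
public class SessionPartitioning {
    public static int partitionFor(String sessionId, int numPartitions) {
        // same idea as Kafka's default partitioner: stable hash -> partition,
        // so all events for one session land in one ordered partition
        return (sessionId.hashCode() & 0x7fffffff) % numPartitions;
    }

    // Replay every message after a known offset, as a client would
    // after a network reconnection.
    public static List<String> replayAfter(List<String> partitionLog, long lastSeenOffset) {
        return partitionLog.subList((int) (lastSeenOffset + 1), partitionLog.size());
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("e0", "e1", "e2", "e3");
        System.out.println("partition: " + partitionFor("session-42", 6));
        System.out.println("replayed:  " + replayAfter(log, 1));
    }
}
```

Because the broker, not the proxy, owns the log, any Dispatcher instance can serve the replay, which is what removed the statefulness problem of the previous design.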

Where do we go from here?

With HTTP/2 on the horizon, it’s unclear where WebSockets will fit in the long run. That said, if you need a TCP-based, persistent channel today, there is no better option. While we are actively migrating from JSR-356 (and the Tomcat WebSocket implementation) to RxNetty due to numerous issues we ran into, we continue to invest in WebSockets.

As for Kafka, the transition was not problem-free either. But Kafka solved some very hard problems for us (a distributed event bus, message durability, consuming a stream both as a distributed queue and as pub/sub, etc.), and more importantly, it opened the door to further decoupling. As a result, we are moving forward with our strategic plan to use this technology as the unified backend for our data pipeline needs.

(Engineers who worked on this project: Jwalant Shah, Joshua Hua, Matt Sun)

Thursday, June 4, 2015

Localization Technologies at Netflix

The localization program at Netflix is centered around linguistic excellence, a great team environment, and cutting-edge technology. The program is only 4 years old, which is unusual for a company of our size. We’ve built a team and toolset representative of the scope and scale a localization team needs to operate at in 2015, not one bogged down by years of legacy process and technology, as is often the case.
We haven’t been afraid to experiment with new localization models and tools, going against localization industry norms and achieving great things along the way. At Netflix we are given the freedom to trailblaze.
In this blog post we’re going to take a look at two major pieces of technology we’ve developed to assist us on our path to global domination…
Netflix Global String Repository
Having great content by itself is not enough to make Netflix successful; how the content is presented has a huge impact. Having an intuitive, easy-to-use, and localized user interface (UI) contributes significantly to Netflix's success. Netflix is available on the web and on a vast number of devices and platforms including Apple iOS, Google Android, Sony PlayStation, Microsoft Xbox, and TVs from Sony, Panasonic, etc. Each of these platforms has its own standards for internationalization, which poses a challenge to our localization team.
Here are some situations that require localization of UI strings:
- New languages are introduced
- New features are developed
- Fixes are made to current text data
Traditionally, getting UI strings translated is a high-touch process where a localization PM partners with a dev team to understand where to get the source strings from, what languages to translate them into, and where to deliver the final localized files. This gets further complicated when multiple features are being developed in parallel using different branches in Git.
Once translations are completed and the final files delivered, an application typically goes through a build, test and deploy process. For device UIs, a build might need additional approval from a third party like Apple. This causes unnecessary delays, especially in cases where a fix to a string needs to be rolled out immediately.
What if we could make this whole process transparent to the various stakeholders – developers and localization? What if we could make builds unnecessary when fixes to text need to be delivered?
In order to answer those questions we have developed a global repository for UI strings, called Global String Repository, that allows teams to store their localized string data and pull it out at runtime. We have also integrated Global String Repository with our current localization pipeline making the whole process of localization seamless. All translations are available immediately for consumption by applications.
Global String Repository allows isolation through bundles and namespaces. A bundle is a container for string data across multiple languages. A namespace is a placeholder for bundles that are being worked on. There is a default namespace that is used for publishing. A simple workflow would be:
  1. A developer makes a change to the English string data in a bundle in a namespace
  2. Translation workflows are automatically triggered
  3. Linguist completes the translation workflow
  4. Translations are made available to the bundle in the namespace
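The bundle/namespace model above can be sketched as nested maps with a fallback to the publishing namespace. This is a minimal illustrative model, not the actual Global String Repository data model; class and namespace names are assumptions.

```java
import java.util.*;

// Minimal model of bundles and namespaces: a bundle holds strings per
// language, a namespace isolates in-progress work, and lookups fall
// back to the "default" (published) namespace.
public class StringRepo {
    // namespace -> bundle -> language -> key -> value
    private final Map<String, Map<String, Map<String, Map<String, String>>>> data = new HashMap<>();

    public void put(String ns, String bundle, String lang, String key, String value) {
        data.computeIfAbsent(ns, k -> new HashMap<>())
            .computeIfAbsent(bundle, k -> new HashMap<>())
            .computeIfAbsent(lang, k -> new HashMap<>())
            .put(key, value);
    }

    public String get(String ns, String bundle, String lang, String key) {
        String v = lookup(ns, bundle, lang, key);
        // fall back to the published (default) namespace when the
        // working namespace has no override
        return v != null ? v : lookup("default", bundle, lang, key);
    }

    private String lookup(String ns, String bundle, String lang, String key) {
        return data.getOrDefault(ns, Collections.emptyMap())
                   .getOrDefault(bundle, Collections.emptyMap())
                   .getOrDefault(lang, Collections.emptyMap())
                   .get(key);
    }
}
```

The fallback is what lets a feature branch see published strings immediately while its own in-progress translations are still flowing through the workflow.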
Applications have a choice when integrating with Global String Repository:
  • Runtime: Allows fast propagation of changes to UIs
  • Build time: Uses Global String Repository solely for localization but packages the data with the builds
Global String Repository allows build time integration by making all necessary localized data available through a simple REST API.
We expose the Global String Repository via the Netflix edge APIs and it is subjected to the same scaling and availability requirements as the other metadata APIs. It is a critical piece especially for applications that are integrating at runtime. With over 60 million customers, a large portion of whom stream Netflix on devices, Global String Repository is in the critical path.
True to the Netflix way, Global String Repository comprises a back-end microservice and a UI. The microservice is built as a Java web application using Apache Cassandra and Elasticsearch. It is deployed in AWS across 3 regions, and we collect telemetry for every API interaction.
The Global String Repository UI is developed using Node.js, Bootstrap and Backbone and is also deployed in the AWS cloud.
On the client side, Global String Repository exposes REST APIs to retrieve string data and also offers a Java client with in-built caching.
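A client with built-in caching can be sketched as a thin wrapper that fetches a bundle once per locale and serves subsequent lookups from memory. Everything here is a hypothetical illustration, not the actual Netflix client API: the class name, the `fetchBundle` stub, and the endpoint it stands in for are all assumptions.

```java
import java.util.*;
import java.util.concurrent.*;

// Illustrative sketch of a caching client over a string-repository
// REST API; fetchBundle is a stub standing in for the HTTP call.
public class StringRepoClient {
    private final ConcurrentMap<String, Map<String, String>> cache = new ConcurrentHashMap<>();

    // In a real client this would call something like
    // GET /bundles/{bundle}?locale={locale}; stubbed for illustration.
    protected Map<String, String> fetchBundle(String bundle, String locale) {
        Map<String, String> strings = new HashMap<>();
        strings.put("greeting", locale.startsWith("de") ? "Hallo" : "Hello");
        return strings;
    }

    public String get(String bundle, String locale, String key) {
        // one fetch per (bundle, locale); later lookups hit the cache
        return cache
            .computeIfAbsent(bundle + "/" + locale, k -> fetchBundle(bundle, locale))
            .get(key);
    }
}
```

Caching at this layer is what makes runtime integration viable on the critical path: a string lookup costs a network round trip only on the first use of a bundle/locale pair.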
While we have Global String Repository up and running, there is still a long way to go. Some of the things we are currently working on are:
- Enhancing support for quantity strings (plurals) and gender based strings
- Making the solution more resilient to failures
- Improving scalability
- Supporting multiple export formats (Android XML, Microsoft .resx, etc.)
The Global String Repository has no binding to Netflix's business domain, so we plan on releasing it as open source software.
Netflix, as a soon-to-be global service, supports many locales across a myriad of device/UI combinations; testing this manually just does not scale. Previously, members of the localization and UI teams would manually use actual devices, from game consoles to iOS and Android, to see all of these strings in context, checking both the content and any UI issues, such as truncations.
At Netflix, we think there is always a better way; with that attitude we rethought how we do in context, on device localization testing, and Hydra was born.
The motivation behind Hydra is to catalogue every possible unique screen and allow anyone to see a specific set of screens that they are interested in, across a wide range of filters including devices and locales. For example, as a German localization specialist you could, by selecting the appropriate filters, see the non-member flow in German across PS3, Website and Android. These screens can then be reviewed in a fraction of the time it would take to get to all of those different screens across those devices.
How Screens Reach Hydra
Hydra itself does not take any of the screens; it serves to catalogue and display them. To get screens into Hydra, we leverage our existing UI automation. Through Jenkins CI jobs, data-driven tests are run in parallel across all supported locales to take screenshots and post those screens to Hydra with the appropriate metadata, including page name, feature area, major UI platform, and one critical piece of metadata: the unique screen definition.
The purpose of the unique screen definition is to have a full catalogue of screens without unnecessary overlap. This means fewer screens to review, and in the longer term it lets us compare a given screen against itself over time. The definition of a unique screen differs from UI to UI; for the browser it is a combination of page name, browser, resolution, locale, and dev environment.
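For the browser case, the unique screen definition amounts to joining those fields into a single catalogue key, so the same screen captured twice collapses to one entry. A minimal sketch (field values and the separator are illustrative assumptions):

```java
// Sketch of the browser unique-screen key: the fields named in the text
// joined into one string that identifies a screen in the catalogue.
public class ScreenKey {
    public static String browserKey(String page, String browser,
                                    String resolution, String locale, String env) {
        return String.join("|", page, browser, resolution, locale, env);
    }

    public static void main(String[] args) {
        System.out.println(browserKey("nonmember-home", "chrome", "1920x1080", "da-DK", "prod"));
    }
}
```

Keying screens this way also enables the longer-term goal mentioned above: two captures with the same key are the same logical screen, so they can be diffed against each other over time.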
The Technology
Hydra is a full-stack web application deployed to AWS. The Java-based backend has two main functions: it processes incoming screenshots and exposes data to the frontend through REST APIs. When the UI automation posts a screen to Hydra, the image file itself is written to S3, allowing for more or less infinite storage, while the much smaller metadata is written to an RDS database so it can be queried later through the REST APIs. The REST endpoints provide a mapping of query-string parameters to MySQL queries.
For example, a call to populate the values for the ‘feature’ filter would essentially map to this query:
select distinct feature where uigroup = ‘TVUI’ AND area = ‘signupwizard’ AND locale = ‘da-DK’
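The query-string-to-SQL mapping can be sketched as a small builder over the filter parameters. This is illustrative only: the `screens` table name and column handling are assumptions, and real code would use bound parameters rather than string concatenation.

```java
import java.util.*;

// Sketch of mapping filter query-string parameters onto a WHERE clause
// like the one above; table/column names are illustrative assumptions.
public class FilterQuery {
    public static String toSql(String column, Map<String, String> params) {
        StringBuilder sql = new StringBuilder(
            "select distinct " + column + " from screens where ");
        List<String> clauses = new ArrayList<>();
        for (Map.Entry<String, String> p : params.entrySet()) {
            // real code would use a prepared statement, not concatenation
            clauses.add(p.getKey() + " = '" + p.getValue() + "'");
        }
        sql.append(String.join(" AND ", clauses));
        return sql.toString();
    }
}
```

Each filter the user has already selected narrows the next filter's candidate values, which is why the endpoint rebuilds the query from the full parameter set on every call.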
The JavaScript frontend, which leverages Knockout.js, lets users select filters and view the screens that match them. Both the contents of the filters and the matching screens are retrieved through calls to the REST endpoints mentioned above.
Allowing for Scale
With Hydra in place and the automation running, adding support for a new locale becomes as easy as adding one line to an existing property file that feeds the TestNG data provider. Screens in the new locale will then flow in with the next Jenkins builds that run.
Next Steps
One known improvement is a mechanism for knowing when a screen has changed. In its current state, if a string changes, nothing automatically identifies that a screen has changed. Hydra could evolve into more or less a work queue: localization experts could log in and see only the specific set of screens that have changed.
Another feature would be the ability to map individual string keys to the screens they appear on. A translator could then change a string, search for that string key, and see the screens affected by the change, letting them view the change in context before even making it.
If what we’re doing here at Netflix with regards to localization technology excites you, please take a moment to review the open positions on our Localization Platform Engineering team:

We like big challenges and have no shortage of them to work on. We currently operate in 50 countries; by the end of 2016 that number will grow to 200. Netflix will be a truly global product, and our localization team needs to scale to support that. Challenges like these have allowed us to attract the best and brightest talent, and we’ve built a team that can do what seems impossible.