by Peter Hausel and Jwalant Shah
Netflix Test Studio
Netflix members can enjoy instant access to TV shows and movies on over 1400 different device/OS permutations. Assessing long-duration playback quality and delivering a great member experience on such a diverse set of playback devices presents a huge challenge to the team.
Netflix Test Studio (NTS) was created to give internal and external developers a consistent way to deploy and execute tests. It does this by abstracting device differences. NTS also provides a standard set of tools for assessing the responsiveness and quality of the overall experience. NTS now runs over 40,000 long-running tests each day on over 600 devices around the world.
NTS is a cloud-based automation framework that lets you remote control most Netflix Ready Devices. In this post we’ll focus on two key aspects of the framework:
- Collect test results in near-realtime.
- Allow testers to interact with both the device and various Netflix services during execution.
Integrated tests require control of the test execution stream in order to simulate real-world conditions: we want to simulate failures and to pause, debug, and resume a test while it is executing.
A typical user interface for Test Execution using NTS
A Typical NTS Test:
The early implementation of NTS had a relatively simplistic design: hijack a Netflix Ready Device for automation via various redirection methods, then have a Test Harness (test executor) coordinate the execution with the help of a central, public-facing Controller service. Eventually, we would get data out of the device via long polling, validate steps, and bubble validation results back up to the client. We built a separate cluster of this architecture for each Netflix SDK version.
Original Architecture using Long Polling
Event playback is not supported
This model worked relatively well in the beginning. However, as the number of supported devices, SDKs, and test cases grew, we started seeing the limitations of this approach: messages were sometimes lost, there was no way of knowing exactly what had happened, error messages were misleading, and tests were hard to monitor and play back in real time. Finally, maintaining almost identical clusters with different test content and SDK versions introduced an additional maintenance burden.
In the next iteration of the tool, we removed the Controller service and most of the polling by introducing a WebSockets proxy (built on top of JSR-356) that was sitting between the clients and Test Executors. We also introduced JSON-RPC as the command protocol.
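To make the command protocol concrete, here is a minimal sketch of how a JSON-RPC exchange between a client and a Test Executor could be encoded. It assumes JSON-RPC 2.0 framing; the method name `test.pause` and the `sessionId` parameter are hypothetical and are not taken from the actual NTS API.

```python
import itertools
import json

# Monotonic request ids so responses can be matched to requests.
_request_ids = itertools.count(1)


def make_request(method, params):
    """Serialize a JSON-RPC 2.0 request (assumed framing, not NTS's actual wire format)."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": method,
        "params": params,
    })


def parse_response(raw):
    """Return (id, result) for a success response; raise if it carries an error object."""
    message = json.loads(raw)
    if message.get("jsonrpc") != "2.0":
        raise ValueError("not a JSON-RPC 2.0 message")
    if "error" in message:
        error = message["error"]
        raise RuntimeError(f"{error['code']}: {error['message']}")
    return message["id"], message["result"]


# Hypothetical command: ask the executor to pause a running test session.
pause_command = make_request("test.pause", {"sessionId": "session-123"})
```

Because every request carries an id, a client can keep several commands in flight over one WebSocket connection and match each response as it arrives.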
Pub/Sub without event playback support
- The Test Executor submits events as a time series to a WebSocket bus that terminates at the Dispatcher.
- The Client connects to a Dispatcher and supplies a session ID. Each Dispatcher has a one-to-many relationship with Test Executors.
- Each Dispatcher instance keeps an internal lookup from test execution session IDs to the WebSocket connections to Test Executors, and delivers messages received over those connections to the Client.
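The routing described in the list above can be sketched in a few lines. This is an in-memory illustration only: `ExecutorConnection` is a hypothetical stand-in for a WebSocket connection to a Test Executor, and all class and method names are assumptions rather than the NTS implementation.

```python
class ExecutorConnection:
    """Hypothetical stand-in for a WebSocket connection to a Test Executor."""

    def __init__(self):
        self._listeners = []

    def add_listener(self, callback):
        self._listeners.append(callback)

    def emit(self, message):
        # Simulates a message arriving from the Test Executor.
        for callback in list(self._listeners):
            callback(message)


class Dispatcher:
    """Maps test execution session ids to executor connections."""

    def __init__(self):
        self._connection_by_session = {}  # session id -> ExecutorConnection

    def register_executor(self, session_id, connection):
        self._connection_by_session[session_id] = connection

    def connect_client(self, session_id, deliver):
        """Forward every message for this session to the client's deliver callback."""
        connection = self._connection_by_session.get(session_id)
        if connection is None:
            raise KeyError(f"no executor connection for session {session_id!r}")
        connection.add_listener(deliver)


dispatcher = Dispatcher()
executor = ExecutorConnection()
dispatcher.register_executor("session-123", executor)

received = []
dispatcher.connect_client("session-123", received.append)
executor.emit({"step": 1, "status": "passed"})
```

Note that the lookup table lives in the memory of a single Dispatcher instance, so a client has to reach the instance that holds the connection for its session.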
This approach solved most of our issues: fewer indirections, real-time streaming capabilities, and a push-based design. Only two issues remained: message durability was still not supported and, more importantly, the WebSockets proxy was difficult to scale out due to its stateful nature.
At this point, we started looking into Apache Kafka to replace the internal WebSocket layer with a distributed pub/sub and message queue solution.
Pub/Sub with event playback support
A few interesting properties of this pub/sub system:
- Dispatcher is responsible for handling client requests to subscribe to Test Execution events stream.
- Kafka provides a scalable message queue between the Test Executor and the Dispatcher. Since each session ID is mapped to a particular partition and each message sent to the client includes the current Kafka offset, we can now guarantee reliable delivery of messages to clients, with support for replaying messages after a network reconnection.
- Multiple clients can subscribe to the same stream without additional overhead, and admin users can view and monitor remote users' test executions in real time.
- The same stream is consumed for analytics purposes as well.
- Throughput/Latency: during load testing, we could get ~90-100ms latency per message consistently with 100 concurrent users (our test setup was 6 brokers deployed on 6 d2.xlarge instances). In our production system, latency is often lower due to batching.
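The session-to-partition mapping and offset-based replay described above can be illustrated with a small in-memory model. This sketch does not use a Kafka client: `PartitionedLog` is a hypothetical stand-in for a partitioned topic, and a CRC32 hash stands in for whatever key partitioner the real system uses.

```python
import zlib


class PartitionedLog:
    """In-memory stand-in for a partitioned, append-only topic (not a Kafka client)."""

    def __init__(self, num_partitions):
        self._partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, session_id):
        # Stable hash: a given session id always maps to the same partition,
        # so the events of one session stay in order.
        return zlib.crc32(session_id.encode("utf-8")) % len(self._partitions)

    def append(self, session_id, event):
        """Append an event and return its offset within the session's partition."""
        log = self._partitions[self.partition_for(session_id)]
        log.append((session_id, event))
        return len(log) - 1

    def read_from(self, session_id, offset):
        """Yield (offset, event) for this session, starting at the given offset."""
        log = self._partitions[self.partition_for(session_id)]
        for current in range(offset, len(log)):
            owner, event = log[current]
            if owner == session_id:
                yield current, event


log = PartitionedLog(num_partitions=4)
first = log.append("session-123", "test started")
log.append("session-456", "unrelated event")
log.append("session-123", "step 1 passed")

# The client saw only the first message before its connection dropped.
# On reconnect it asks for everything after the last offset it received.
last_seen_offset = first
missed = [event for _, event in log.read_from("session-123", last_seen_offset + 1)]
```

Because the client tracks the offset of the last message it received, the server side needs no per-client delivery state: any instance that can read the partition can resume the stream from that offset.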
Where do we go from here?
With HTTP/2 on the horizon, it’s unclear where WebSockets will fit in the long run. That said, if you need a TCP-based, persistent channel now, you don’t have a better option. While we are actively migrating away from JSR-356 (and Tomcat WebSocket) to RxNetty due to numerous issues we ran into, we continue to invest more in WebSockets.
As for Kafka, the transition was not problem-free either. But Kafka solved some very hard problems for us (a distributed event bus, message durability, consuming a stream both as a distributed queue and as pub/sub, etc.) and, more importantly, it opened the door to further decoupling. As a result, we are moving forward with our strategic plan to use this technology as the unified backend for our data pipeline needs.
(Engineers who worked on this project: Jwalant Shah, Joshua Hua, Matt Sun)