Monday, March 30, 2015

Billing & Payments Engineering Meetup II

On March 18th, we hosted our second Billing & Payments Engineering Meetup at Netflix. It felt truly encouraging to see the growing interest of the engineer community of the Bay Area for the event. Just like the first event, the theater was almost full.

If you missed our first Meetup, you can check it here.
For this Meetup, we decided to take a different approach. We are several teams within Netflix that are involved with billing or payments at various level. Each of us gave a presentation of our work, therefore hoping to provide the audience with a 360, transversal vision of how payments are managed at Netflix. There’s a great synergy between us and we hope it was reflected in the talks we gave.

Stay tuned on the meetup page to be notified of the next event!

Payment Processing in the Cloud - Mathieu Chauvin - Payments Engineering

Now that Netflix has gained a tremendous experience with AWS, the Payments Engineering team has re-engineered their suite of applications into the cloud. It’s the first time payments are processed from a public cloud solution at this scale.
This presentation gives more information about the technical design of this new solution, as well as the transition strategy that was adopted, for a seamless migration of more than 57 million subscribers.

Mat's team is hiring!
Senior Software Engineer/Architect - Payments Platform

Architecture about Billing Workflows in the Cloud - Sangeeta Handa & John Brandy - Billing Engineering

At Billing we are at the crossroads, where we are half way still in our old data center and half way migrated to cloud. Billing Engineering has 2 major aspects - One is batch renewal of Netflix subscribing customers and other is the APIs that change the billing state of a Netflix customer  in some way. Our topic for discussion was how Billing engineering is managing its workflow for these APIs across different processes and teams in this scenario and technology stack we are using to accomplish this.

Sangeeta’s team is hiring!

Payment Analytics at Netflix - Shankar Vedaraman - Data Science Engineering, Payments

Netflix Product has been data driven since inception and payment processing at Netflix is no different. With more than 55M customers paying Netflix on a monthly basis, there is lots of data to analyze and recommend dynamic routing of transactions to maximize approval rates. At the meetup, Shankar Vedaraman, who leads the Payment Analytics Data Science and Engineering team, presented all the different payments business processes that his team focusses on and touched upon key analytical insights that his team provides.

Shankar's team is hiring!

Security for Billing & Payments - Poorna Udupi - Product and Application Security

Poorna Udupi who leads the Product and Application Security team at Netflix, spoke about making security consumable in the form of tools, libraries and self-service applications to enable developers attain a rapid velocity of feature delivery while simultaneously being secure. Specifically speaking to the audience of billing and payments enthusiasts, he discussed a few security techniques in detail: infrastructure segmentation, tokenization, utilization of big data for fraud and abuse detection, prevention and sanitization. He provided a lightning overview of some of the open source security projects contributed by his team such as Scumblr, Sketchy and others in the pipeline that focus on automating away security functions so that his team can focus on security feature experimentation and innovation.

Poorna's team is hiring!

Escape from PCI Land - Rahul Dani - Growth Product Engineering

Rahul Dani, who leads the Growth Product Engineering team at Netflix, talked about the adventure in steering the middle tier signup apps out of PCI scope and into a PCI free environment.

Rahul's team is hiring!

Wednesday, March 11, 2015

Can Spark Streaming survive Chaos Monkey?

Netflix is a data-driven organization that places emphasis on the quality of data collected and processed. In our previous blog post, we highlighted our use cases for real-time stream processing in the context of online recommendations and data monitoring. With Spark Streaming as our choice of stream processor, we set out to evaluate and share the resiliency story for Spark Streaming in the AWS cloud environment.  A Chaos Monkey based approach, which randomly terminated instances or processes, was employed to simulate failures.

Spark on Amazon Web Services (AWS) is relevant to us as Netflix delivers its service primarily out of the AWS cloud. Stream processing systems need to be operational 24/7 and be tolerant to failures. Instances on AWS are ephemeral, which makes it imperative to ensure Spark’s resiliency.

Spark Components

Apache Spark is a fast and general-purpose cluster computing system. Spark can be deployed on top of Mesos, Yarn or Spark's own cluster manager, which allocates worker node resources to an application. Spark Driver connects to the cluster manager and is responsible for converting an application to a directed graph (DAG) of individual tasks that get executed within an executor process on the worker nodes.

Creating Chaos

Netflix streaming devices periodically send events that capture member activities, which plays a significant role in personalization. These events flow to our server side applications and are routed to Kafka. Our Spark streaming application consumes these events from Kafka and computes metrics. The deployment architecture is shown below:

Fig 2: Deployment Architecture

Our goal is to validate that there is no interruption in computing metrics when the different Spark components fail. To simulate such failures, we employed a whack-a-mole approach and killed the various Spark components.

We ran our spark streaming application on Spark Standalone. The resiliency exercise was run with Spark v1.2.0, Kafka v0.8.0 and Zookeeper v3.4.5.

Spark Streaming Resiliency

Driver Resiliency: Spark Standalone supports two modes for launching the driver application. In client mode, the driver is launched in the same process as the one where the client submits the application.  When this process dies, the application is aborted.  In cluster mode, the driver is launched from one of the worker process in the cluster.  Additionally, standalone cluster mode supports a supervise option that allows restarting the application automatically on non-zero exit codes.

Master Resiliency:  Spark scheduler uses the Master to make scheduling decisions.  To avoid single point of failure, it is best to setup a multi master standalone cluster. Spark uses Zookeeper for leader election. One of the master nodes becomes the ACTIVE node and all Worker nodes get registered to it. When this master node dies, one of the STANDBY master nodes becomes the ACTIVE node and all the Worker nodes get automatically registered to it. If there are any applications running on the cluster during the master failover, they still continue to run without a glitch.

Worker Process Resiliency: Worker process launches and monitors the Executor and Driver as child processes. When the Worker process is killed, all its child processes are also killed.  The Worker process gets automatically relaunched, which in turn restarts the Driver and/or the Executor process.

Executor Resiliency: When the Executor process is killed, they are automatically relaunched by the Worker process and any tasks that were in flight are rescheduled.

Receiver Resiliency: Receiver runs as a long running task within an Executor and follows the same resiliency characteristics of an executor.

The effect on the computed metrics due to the termination of various Spark components is shown below.

Fig 3: Behavior on Receive/Driver/Master failure

Driver Failure: The main impact is back-pressure built up due to a node failure, which results in a sudden drop in message processing rate, followed by a catch up spike, before the graph settles into steady state.

Receiver Failure: The dip in computed metrics was due to the fact that default Kafka receiver is an unreliable receiver.  Spark streaming 1.2 introduced an experimental feature called write ahead logs that would make the kafka receiver reliable.  When this is enabled, applications would incur a hit to Kafka receiver throughput.  However, this could be addressed by increasing the number of receivers.


The following table summarizes the resiliency characteristics of different Spark components:

Behaviour on Component Failure
Client Mode: The entire application is killed
Cluster Mode with supervise: The Driver is restarted on a different Worker node
Single Master: The entire application is killed
Multi Master: A STANDBY master is elected ACTIVE
Worker Process
All child processes (executor or driver) are also terminated and a new worker process is launched
A new executor is launched by the Worker process
Same as Executor as they are long running tasks inside the Executor
Worker Node
Worker, Executor and Driver processes run on Worker nodes and the behavior is same as killing them individually

We uncovered a few issues (SPARK-5967, SPARK-3495, SPARK-3496, etc.) during this exercise, but Spark Streaming team was helpful in fixing them in a timely fashion. We are also in the midst of performance testing Spark and will follow up with a blog post.

Overall, we are happy with the resiliency of spark standalone for our use cases and excited to take it to the next level where we are working towards building a unified Lambda Architecture that involves a combination of batch and real-time streaming processing.  We are in early stages of this effort, so if you interested in contributing in this area, please reach out to us.

Monday, March 9, 2015

Netflix Hack Day - Winter 2015

Last week, we hosted the latest Netflix Hack Day. Hack Day is a way for our product development teams to get away from everyday work. It's a fun, experimental, collaborative, and creative outlet.

This time, we had about 70 hacks that were produced by more than 150 engineers and designers. We shared a few examples below to give you a taste, and you can see some of our past hacks in our posts for Feb. 2014 & Aug. 2014. Note that while we think these hacks are very cool and fun, they may never become part of the Netflix product, internal infrastructure, or otherwise be used beyond Hack Day. We are surfacing them here publicly to share the spirit of the event.

Thanks to all of the hackers for putting together some incredible work in just 24 hours. If you’re interested in being a part of our next Hack Day, let us know by checking out our jobs site!

Netflix Earth
Netflix Earth is an animated 3D globe showing real-time global playback activity.

BEEP (Binge Encouragement and Enforcement Platform)
BEEP actively and abrasively encourages users to continue watching Netflix when their attention starts to stray.

In a world... where devices proliferate… darNES digs back in time to provide Netflix access to the original Nintendo Entertainment System.

Say Whaaat!!!
Say Whaaat!!! provides a more convenient way to to catch missed dialogue as you watch Netflix by displaying subtitles when you pause playback. It also provides the ability to navigate a film's timeline, caption by caption.

Net the Netflix Cheats
Don’t let your partners watch when you aren't around. Net the Netflix Cheats requires dual PIN access to watch titles that you and your partner have agreed to watch together.

And here are some pictures taken during the event.