Wednesday, July 9, 2014

Billing & Payments Engineering Meetup

On June 18th, we hosted our first Billing & Payments Engineering Meetup at Netflix.
We wanted to create a space for exchanging information and learning among professionals: a forum, or agora, for a community of people who share an interest in the engineering aspects of billing & payment systems.
The billing and payments space is a dynamic and innovative environment that requires increased attention as it evolves. Many of the Bay Area's tech companies may have different core products, yet we all monetize in fairly similar ways. Most built their billing systems in-house and had to overcome similar technical and business challenges as they grew. Moreover, as our companies expand internationally, the ability to process foreign payment methods is becoming a critical, and potentially defining, factor in maximizing the chances of success.

Several trend-setting companies responded to our invitation to speak to the large audience that came looking for tips and industry best practices.
Below is a recap of the agenda:
  • Mathieu Chauvin - Engineering Manager for Payments @ Netflix
  • Taylor Wicksell - Sr. Software Engineer for Billing @ Netflix
  • Jean-Denis Greze - Engineer @ Dropbox
  • Alec Holmes - Software Engineer @ Square
  • Emmanuel Cron - Software Engineer III, Google Wallet @ Google
  • Paul Huang - Engineering Manager @ SurveyMonkey
  • Anthony Zacharakis - Lead Engineer @ Lumos Labs
  • Shengyong Li / Feifeng Yang - Dir. Engineering Commerce / Tech Lead Payment @ Electronic Arts
Below you can find the aggregated presentations. Thanks again to the presenters for sharing this material.


After the presentations, we held a networking session and engaged in very interesting conversations. It was a great event, and another one is coming soon. Stay tuned on the meetup page to be notified!
http://www.meetup.com/Netflix-Billing-Payments-Engineering/

Netflix is always looking for talented people. If you share our passion for billing & payments innovation, check out our Careers page!
http://jobs.netflix.com/jobs.php?id=NFX00084


Sunday, July 6, 2014

Scale and Performance of a Large JavaScript Application



We recently held our second JavaScript Talks event at our Netflix headquarters in Los Gatos, Calif. Matt Seeley discussed the development approaches we use at Netflix to build the JavaScript applications which run on TV-connected devices, phones and tablets. These large, rich applications run across a wide range of devices and require carefully managing network resources, memory and rendering. This talk explores various approaches the team uses to build well-performing UIs, monitor application performance, write consistent code, and scale development across the team.

The video from the talk can be found below, and the slides are available at https://speakerdeck.com/mseeley/life-on-the-grid.

And don't forget to check out videos from our past JavaScript Talks events on the Netflix UI Engineering YouTube Channel.







Monday, June 30, 2014

Announcing Security Monkey - AWS Security Configuration Monitoring and Analysis

We are pleased to announce the open source availability of Security Monkey, our solution for monitoring and analyzing the security of our Amazon Web Services configurations.


At Netflix, responsibility for delivering the streaming service is distributed and the environment is constantly changing. Code is deployed thousands of times a day, and cloud configuration parameters are modified just as frequently. To understand and manage the risk associated with this velocity, the security team needs to understand how things are changing and how these changes impact our security posture.
Netflix delivers its service primarily out of Amazon Web Services’ (AWS) public cloud, and while AWS provides excellent visibility of systems and configurations, it has limited capabilities in terms of change tracking and evaluation. To address these limitations, we created Security Monkey - the member of the Simian Army responsible for tracking and evaluating security-related changes and configurations in our AWS environments.

Overview of Security Monkey

We envisioned and built the first version of Security Monkey in 2011. At that time, we used a few different AWS accounts and delivered the service from a single AWS region. We now use several dozen AWS accounts and leverage multiple AWS regions to deliver the Netflix service. Over its lifetime, Security Monkey has ‘evolved’ (no pun intended) to meet our changing and growing requirements.

Viewing IAM users in Security Monkey - highlighted users have active access keys.
There are a number of security-relevant AWS components and configuration items - for example, security groups, S3 bucket policies, and IAM users. Changes or misconfigurations in any of these items could create an unnecessary and dangerous security risk. We needed a way to understand how AWS configuration changes impacted our security posture. It was also critical to have access to an authoritative configuration history service for forensic and investigative purposes so that we could know how things have changed over time. We also needed these capabilities at scale across the many accounts we manage and many AWS services we use.
Security Monkey's filter interface allows you to quickly find the configurations and items you're looking for.
These needs are at the heart of what Security Monkey is - an AWS security configuration tracker and analyzer that scales for large and globally distributed cloud environments.

Architecture

At a high-level, Security Monkey consists of the following components:
  • Watcher - The component that monitors a given AWS account and technology (e.g. S3, IAM, EC2). The Watcher detects and records changes to configurations. So, if a new IAM user is created or if an S3 bucket policy changes, the Watcher will detect this and store the change in Security Monkey’s database.
  • Notifier - The component that lets a user or group of users know when a particular item has changed. This component also provides notification based on the triggering of audit rules.
  • Auditor - The component that executes a set of business rules against an AWS configuration to determine the level of risk associated with the configuration. For example, a rule may look for a security group with a rule allowing ingress from 0.0.0.0/0 (meaning the security group is open to the Internet). Or, a rule may look for an S3 policy that allows access from an unknown AWS account (meaning you may be unintentionally sharing the data stored in your S3 bucket). Security Monkey ships with a number of built-in rules, and users are free to add their own.
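To make the Auditor's rule model concrete, here is a simplified sketch of the kind of check such a rule performs. This is illustrative Python only: the function name and the dictionary shape are hypothetical, not Security Monkey's actual plugin API.

```python
def check_open_ingress(security_group):
    """Flag any ingress rule that is open to the entire Internet."""
    issues = []
    for rule in security_group.get("ingress_rules", []):
        if rule.get("cidr") == "0.0.0.0/0":
            issues.append("Security group '%s' allows ingress from 0.0.0.0/0 on port %s"
                          % (security_group["name"], rule.get("port", "all")))
    return issues

sg = {"name": "web-sg",
      "ingress_rules": [{"cidr": "0.0.0.0/0", "port": 22},
                        {"cidr": "10.0.0.0/8", "port": 443}]}
print(check_open_ingress(sg))
```

A real rule would also assign a risk score and feed the Notifier, but the essence is the same: a pure function from a configuration item to a list of findings.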

In terms of technical components, we run Security Monkey in AWS on Ubuntu Linux, and storage is provided by a PostgreSQL RDS database. We currently run Security Monkey on a single m3.large instance - this instance type has been able to easily monitor our dozens of accounts and many hundreds of changes per day.

The application itself is written in Python using the Flask framework (including a number of Flask plugins). At Netflix, we use our standard single-sign on (SSO) provider for authentication, but for the OSS version we’ve implemented Flask-Login and Flask-Security for user management. The frontend for Security Monkey’s data presentation is written in Angular Dart, and JSON data is also available via a REST API.

General Features and Operations

Security Monkey is relatively straightforward from an operational perspective. Installation and AWS account setup are covered in the installation document, and Security Monkey does not rely on other Netflix OSS components to operate. Generally, operational use includes:
  • Initial Configuration
    • Setting up one or more Security Monkey users to use/administer the application itself.
    • Setting up one or more AWS accounts for Security Monkey to monitor.
    • Configuring user-specific notification preferences (to determine whether or not a given user should be notified for configuration changes and audit reports).
  • Typical Use Cases
    • Checking historical details for a given configuration item (e.g. the different states a security group has had over time).
    • Viewing reports to check what audit issues exist (e.g. all S3 policies that reference unknown accounts or all IAM users that have active access keys).
    • Justifying audit issues (providing background or context on why a particular issue exists and is acceptable even though it may violate an audit rule).

Note on AWS CloudTrail and AWS Trusted Advisor

CloudTrail is AWS’ service that records and logs API calls. Trusted Advisor is AWS’ premium support service that automatically evaluates your cloud deployment against a set of best practices (including security checks).

Security Monkey predates both of these services and meets some of each service's goals while having unique value of its own:
  • CloudTrail provides verbose data on API calls, but has no sense of state in terms of how a particular configuration item (e.g. security group) has changed over time. Security Monkey provides exactly this capability.
  • Trusted Advisor has some excellent checks, but it is a paid service and provides no means for the user to add custom security checks. For example, Netflix has a custom check to identify whether a given IAM user matches a Netflix employee user account, something that is impossible to do via Trusted Advisor. Trusted Advisor is also a per-account service, whereas Security Monkey scales to support and monitor an arbitrary number of AWS accounts from a single Security Monkey installation.
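As a hedged sketch of what such a custom check might look like (the data sources here are hypothetical; the actual check integrates with Netflix's employee directory):

```python
def unknown_iam_users(iam_user_names, employee_logins):
    """Flag IAM users with no matching employee account; orphaned or
    unexpected users are worth investigating."""
    return sorted(u for u in iam_user_names if u not in employee_logins)

print(unknown_iam_users(["alice", "bob", "old-contractor"], {"alice", "bob"}))
```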

Open Items and Future Plans

Security Monkey has been in production use at Netflix since 2011, and we will continue to add features. The following list documents some of our planned enhancements.
  • Integration with CloudTrail for change detail (including originating IP, instance, IAM account).
  • Ability to compare different configuration items across regions or accounts.
  • CSRF protections for form POSTs.
  • Content Security Policy headers (currently awaiting a Dart issue to be addressed).
  • Additional AWS technology and configuration tracking.
  • Test integration with moto.
  • SSL certificate expiry monitoring.
  • Simpler installation script and documentation.
  • Roles/authorization capabilities for admin vs. user roles.
  • More refined AWS permissions for Security Monkey operations (the current policy in the install docs is a broader read-only role).
  • Integration with edda, our general purpose AWS change tracker. On a related note, our friends at Prezi have open sourced reddalert, a security change detector that is itself integrated with edda.

Conclusion


Security Monkey has helped the security teams @ Netflix gain better awareness of changes and security risks in our AWS environment. Its approach fits well with the general Simian Army approach of continuously monitoring and detecting potential anomalies and risky configurations, and we look forward to seeing how other AWS users choose to extend and adapt its capabilities. Security Monkey is now available on our GitHub site.

If you’re in the San Francisco Bay Area and would like to hear more about Security Monkey (and see a demo), our August Netflix OSS meetup will be focused specifically on security. It’s scheduled for August 20th and will be held at Netflix HQ in Los Gatos.

-Patrick Kelley, Kevin Glisson, and Jason Chan (Netflix Cloud Security Team)

Monday, June 16, 2014

Delivering Breaking Bad on Netflix in Ultra HD 4K

This week Netflix is pleased to begin streaming all 62 episodes of Breaking Bad in Ultra HD 4K. The 4K version comes from Sony Pictures Entertainment's beautiful remastering of Breaking Bad from the original film negatives. This 4K experience is available on select 4K Smart TVs.

As pleased as I am to announce Breaking Bad in 4K, this blog post is also intended to highlight the collaboration between Sony Pictures Entertainment and Netflix to modernize the digital supply chain that transports digital media from content studios, like Sony Pictures, to streaming retailers, like Netflix.

Netflix and Sony agreed on an early subset of IMF for the transfer of the video and audio files for Breaking Bad. IMF stands for Interoperable Master Format, an emerging SMPTE specification governing file formats and metadata for digital media archiving and B2B exchange.

IMF specifies fundamental building blocks like immutable objects, checksums, globally unique identifiers, and manifests (CPLs, or Composition Playlists). These building blocks hold promise for vastly improving the efficiency, accuracy, and scale of the global digital supply chain.
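To illustrate why those building blocks matter, here is a minimal sketch of the bookkeeping they enable: an immutable asset gets a globally unique identifier and a checksum, so any copy received downstream can be verified byte-for-byte. The hash choice and record shape below are illustrative, not what the IMF specification mandates.

```python
import hashlib
import uuid

def register_asset(data):
    """Record an immutable asset: a globally unique id plus a checksum."""
    return {"id": str(uuid.uuid4()),
            "sha256": hashlib.sha256(data).hexdigest(),
            "size": len(data)}

def verify_asset(data, record):
    """A downstream copy is valid only if size and checksum both match."""
    return (len(data) == record["size"]
            and hashlib.sha256(data).hexdigest() == record["sha256"])

record = register_asset(b"video essence bytes...")
print(verify_asset(b"video essence bytes...", record))  # True
print(verify_asset(b"corrupted bytes", record))         # False
```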

At Netflix, we are excited about IMF and we are committing significant R&D efforts towards adopting IMF for content ingestion. Netflix has an early subset of IMF in production today and we will support most of the current IMF App 2 draft by the end of 2014. We are also developing a roadmap for IMF App 2 Extended and Extended+. We are pleased that Sony Pictures is an early innovator in this space and we are looking forward to the same collaboration with additional content studio partners.

Breaking Bad is joining House of Cards season 2 and the Moving Art documentaries in our global 4K catalog. We are also adding a few more 4K movies for our USA members. We have added Smurfs 2, Ghostbusters, and Ghostbusters 2 in the United States. All of these movies were packaged in IMF by Sony Pictures.

Kevin McEntee
VP Digital Supply Chain

Wednesday, June 11, 2014

Optimizing the Netflix Streaming Experience with Data Science



On January 16, 2007, Netflix started rolling out a new feature: members could now stream movies directly in their browser without having to wait for the red envelope in the mail. This event marked a substantial shift for Netflix and the entertainment industry. A lot has changed since then. Today, Netflix delivers over 1 billion hours of streaming per month to 48 million members in more than 40 countries, and Netflix accounts for more than a third of peak Internet traffic in the US. This level of engagement results in a humongous amount of data.
 
At Netflix, we use big data for deep analysis and predictive algorithms to help provide the best experience for our members. A well-known example is the personalized movie and show recommendations tailored to each member's tastes. The Netflix Prize, launched in 2007, highlighted Netflix's focus on recommendations. Another area we're focusing on is the streaming quality of experience (QoE), which refers to the user experience once the member hits play on Netflix. This is an area that benefits significantly from data science and algorithms/models built around big data.

Netflix is committed to delivering outstanding streaming service and is investing heavily in advancing the state of the art in adaptive streaming algorithms and network technologies such as Open Connect to optimize streaming quality. Netflix won a Primetime Emmy Engineering Award in 2012 for the streaming service. To put even more focus on "streaming science," we've created a new team at Netflix that's working on innovative approaches for using our data to improve QoE. In this post, I will briefly outline the types of problems we're solving, which include:

  • Understanding the impact of QoE on user behavior
  • Creating a personalized streaming experience for each member
  • Determining what movies and shows to cache on the edge servers based on member viewing behavior
  • Improving the technical quality of the content in our catalog using viewing data and member feedback
Understanding the impact of QoE on user behavior
User behavior refers to the way users interact with the Netflix service, and we use our data to both understand and predict behavior. For example, how would a change to our product affect the number of hours that members watch? To improve the streaming experience, we look at QoE metrics that are likely to have an impact on user behavior. One metric of interest is the rebuffer rate, which is a measure of how often playback is temporarily interrupted while more data is downloaded from the server to replenish the local buffer on the client device. Another metric, bitrate, refers to the quality of the picture that is served and seen - a very low bitrate corresponds to a fuzzy picture. There is an interesting relationship between rebuffer rate and bitrate. Since network capacity is limited, picking too high a bitrate increases the risk of hitting the capacity limit, running out of data in the local buffer, and then pausing playback to refill the buffer. What's the right tradeoff?
There are many more metrics that can be used to characterize QoE, but the impact each one has on user behavior and the tradeoffs between the metrics need to be better understood. More technically, we need to determine a mapping function that can quantify and predict how changes in QoE metrics affect user behavior. Why is this important? Understanding the impact of QoE on user behavior allows us to tailor the algorithms that determine QoE and improve the aspects that have the most significant impact on our members' viewing and enjoyment.
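As a toy illustration of the rebuffer/bitrate tradeoff (this is not Netflix's actual adaptive streaming algorithm; the bitrate ladder and safety factor are made up), an adaptive player might pick the highest bitrate that sits safely below its estimated throughput:

```python
def choose_bitrate(ladder_kbps, throughput_kbps, safety=0.8):
    """Pick the highest bitrate safely below estimated throughput.

    Raising `safety` toward 1.0 improves picture quality but increases
    the risk of draining the buffer and rebuffering - the tradeoff
    described above.
    """
    usable = throughput_kbps * safety
    candidates = [b for b in sorted(ladder_kbps) if b <= usable]
    return candidates[-1] if candidates else min(ladder_kbps)

ladder = [235, 560, 1050, 3000, 5800]
print(choose_bitrate(ladder, throughput_kbps=4000))  # 3000
```

The mapping function discussed above would, in effect, tell us how members respond to different settings of knobs like `safety`.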
Improving the streaming experience

The Netflix Streaming Supply Chain: opportunities to optimize the streaming experience exist at multiple points


How do we use data to provide the best user experience once a member hits play on Netflix?

Creating a personalized streaming experience

One approach is to look at the algorithms that run in real-time or near real-time once playback has started, which determine what bitrate should be served, what server to download that content from, etc.
With vast amounts of data, the mapping function discussed above can be used to further improve the experience for our members in the aggregate, and even to personalize the streaming experience according to each member's "QoE preference." Personalization can also be based on a member's network characteristics, device, location, etc. For example, a member with a high-bandwidth connection on a home network could have very different expectations and experience compared to a member with low bandwidth on a mobile device on a cellular network.

Optimizing content caching

A set of big data problems also exists on the content delivery side. Open Connect is Netflix's own content delivery network that allows ISPs to directly connect to Netflix servers at common internet exchanges, or place a Netflix-provided storage appliance (cache) with Netflix content on it at ISP locations. The key idea here is to locate the content closer (in terms of network hops) to our members to provide a great experience.

One of several interesting problems here is to optimize decisions around content caching on these appliances based on the viewing behavior of the members served. With millions of members, a large catalog, and limited storage capacity, how should the content be cached to ensure that when a member plays a particular movie or show, it is being served out of the local cache/appliance?
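Stripped of real-world constraints, this caching decision resembles a knapsack problem. A naive greedy sketch (purely illustrative; Open Connect's actual algorithms are far more sophisticated) ranks titles by expected plays per gigabyte and fills the appliance until capacity runs out:

```python
def fill_cache(titles, capacity_gb):
    """Greedily cache the titles with the most expected plays per GB."""
    ranked = sorted(titles,
                    key=lambda t: t["expected_plays"] / t["size_gb"],
                    reverse=True)
    cached, used = [], 0.0
    for t in ranked:
        if used + t["size_gb"] <= capacity_gb:
            cached.append(t["title"])
            used += t["size_gb"]
    return cached

catalog = [{"title": "A", "expected_plays": 900, "size_gb": 3},
           {"title": "B", "expected_plays": 400, "size_gb": 8},
           {"title": "C", "expected_plays": 100, "size_gb": 1}]
print(fill_cache(catalog, capacity_gb=10))  # ['A', 'C']
```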
Improving content quality

Another approach to improving user experience involves looking at the quality of content, i.e. the video, audio, subtitles, closed captions, etc. that are part of the movie or show. Netflix receives content from the studios in the form of digital assets that are then encoded and quality checked before they go live on the content servers. Given our large and varied catalog that spans several countries and languages, the challenge is to ensure that all our movies and shows are free of quality issues such as incorrect subtitles or captions, our own encoding errors, etc.
In addition to the internal quality checks, we also receive feedback from our members when they discover issues while viewing. This data can be very noisy and may contain non-issues, issues that are not content quality related (for example, network errors encountered due to a poor connection), or general feedback about member tastes and preferences. In essence, identifying issues that are truly content quality related amounts to finding the proverbial needle in a haystack.

By combining member feedback with intrinsic factors related to viewing behavior, we're building models to predict whether a particular piece of content has a quality issue. For instance, we can detect viewing patterns such as sharp drop-offs in viewing at certain times during a show and combine them with member feedback to identify problematic content. Machine learning, along with natural language processing (NLP) and text mining techniques, can be used to build powerful models that both improve the quality of content before it goes live and use the information provided by our members to close the loop on quality, replacing content that does not meet their expectations. As we expand internationally, this problem becomes even more challenging with the addition of new movies and shows to our catalog and the increase in the number of languages.
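One small ingredient of such a model can be sketched directly. Given a retention curve (the fraction of viewers still watching at each minute), an unusually sharp single-step drop is a candidate quality issue to cross-check against member feedback. The threshold and data here are made up for illustration:

```python
def sharp_dropoffs(retention, threshold=0.2):
    """Return the minute indices where viewership falls by more than
    `threshold` in one step - a possible sign of a content defect."""
    return [i for i in range(1, len(retention))
            if retention[i - 1] - retention[i] > threshold]

# Fraction of viewers still watching at each minute of an episode
curve = [1.0, 0.97, 0.95, 0.94, 0.60, 0.58, 0.57]
print(sharp_dropoffs(curve))  # [4]
```

A production model would of course control for normal attrition patterns (credits, episode endings) before flagging anything.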
These are just a few examples of ways in which we can use data in creative ways to build models and algorithms that can deliver the perfect viewing experience for each member. There are plenty of other challenges in the streaming space that can benefit from a data science approach. If you're interested in working in this exciting space, please check out the Streaming Science & Algorithms position on the Netflix jobs site.



Tuesday, June 3, 2014

HTML5 Video in Safari on OS X Yosemite

By Anthony Park and Mark Watson.

We're excited to announce that Netflix streaming in HTML5 video is now available in Safari on OS X Yosemite! We've been working closely with Apple to implement the Premium Video Extensions in Safari, which allow playback of premium video content in the browser without the use of plugins. If you're in Apple's Mac Developer Program, or soon the OS X Beta Program, you can install the beta version of OS X Yosemite. With the OS X Yosemite Beta on a modern Mac, you can visit Netflix.com today in Safari and watch your favorite movies and TV shows using HTML5 video without the need to install any plugins.

We're especially excited that Apple implemented the Media Source Extensions (MSE) using their highly optimized video pipeline on OS X. This lets you watch Netflix in buttery smooth 1080p without hogging your CPU or draining your battery. In fact, this allows you to get up to 2 hours longer battery life on a MacBook Air streaming Netflix in 1080p - that’s enough time for one more movie!

Apple also implemented the Encrypted Media Extensions (EME), which provide the content protection needed for premium video services like Netflix.

Finally, Apple implemented the Web Cryptography API (WebCrypto) in Safari, which allows us to encrypt and decrypt communication between our JavaScript application and the Netflix servers.

The Premium Video Extensions do away with the need for proprietary plugin technologies for streaming video. In addition to Safari on OS X Yosemite, plugin-free playback is also available in IE 11 on Windows 8.1, and we look forward to a time when these APIs are available on all browsers.

Congratulations to the Apple team for advancing premium video on the web with Yosemite! We’re looking forward to the Yosemite launch this Fall.

Monday, June 2, 2014

Building Netflix Playback with Self-Assembling Components


[Image: the Netflix playback screen]

Our 48 million members are accustomed to seeing a screen like this, whether on their TV or one of the 1000+ other Netflix devices they enjoy watching on. But the simple act of pressing play calls into action a deep and complex system that handles the DRM licenses, contract evaluations, CDN selection, and more. This system is known internally as the Playback Service and is responsible for making your Netflix streaming experience seem effortless.

The original Playback Service was built before Netflix was synonymous with streaming.  As our product matured, the existing architecture became more difficult to support and started showing signs of stress.  For example, we debuted HTML5 support last summer on IE11 in a major step towards standards-based playback on web platforms.  This required adoption of a new device security model that works with the emerging HTML5 Web Cryptography API.  It would have been very challenging to integrate this new model into our original architecture due to poor separation of business and security logic.  This and other shortcomings pushed us to re-imagine our design, leading to a radically different solution with the following as our key design requirements:

  • Operation at massive scale
  • High velocity innovation
  • Reusability of components


High-level Architecture

The new Playback Service uses self-assembling components to handle the enormous volume of traffic Netflix gets each day.  The following is an overview of that architecture, with special focus on how requests are delegated to these components, which are dynamically assembled and executed within an internal rule engine.

[Diagram: Playback Service architecture]

We will examine the building blocks of this architecture and show the benefits of using small, reusable components that are automatically wired together to create an emergent system.  This post will focus on the smallest units of the new architecture and the ideas behind self-assembly.  It will be followed up by others that go deeper into how we implement these concepts in the new architecture and address some challenges inherent to this approach.


Bottom-up

We started from the bottom up, defining the building blocks of the new architecture in a way that promotes loose coupling and clear separation of concerns.  These building blocks are called Processors: entities that take zero or more inputs and generate no more than one output.  These are the smallest computational units of the architecture and behave like Commands, à la the Gang of Four.  Below is a diagram that shows a processor that takes A, B, C and generates D.
This metaphor generalizes well given that many complex tasks can be subdivided into discrete, function-like steps.  It also matches the way most engineers already think about problem decomposition.  This definition enables processors to be as specialized as necessary, promoting low interconnectedness with other parts of the system.  These qualities make the system easier to reason about, enhance, and test.

The Playback Service, like other complex systems, can be modelled as a black box that takes inputs and generates outputs.  The conversion of some input A to an output E can be defined as a function f(A) = E and modelled as a single processor.
Of course, using a single processor only makes sense for very simple systems.  More complex services would be decomposed into finer-grained processors, as illustrated below.
Here you can see that the computation of E is handled as several processor invocations.  This flow resembles a series of function calls in a Java program, but there are some fundamental differences.  The difficulty with normal functions is someone has to invoke them and decide how they are wired together.  Essentially, the decomposition of f(A) = E above is usually something the entire team needs to understand and maintain.  This places a cap on system evolution since scaling the system means scaling each engineer.  It also increases the cost of scaling the team since minimum ramp-up time is directly proportional to the system complexity.

But what if you could have functions that self-assemble?  What if processors could simply advertise their inputs/outputs and the wiring between them were an emergent property of that particular collection of processors?
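To make the idea tangible, here is a minimal, hypothetical assembler in Python (the real Playback Service is not implemented this way; this only demonstrates the principle). Each processor advertises its inputs and output, and the assembler repeatedly fires any processor whose inputs are available:

```python
class Processor:
    """A function-like unit that advertises its inputs and output."""
    def __init__(self, inputs, output, fn):
        self.inputs, self.output, self.fn = inputs, output, fn

def assemble_and_run(processors, initial_values):
    """Fire every processor whose inputs are present until the value
    set stops growing - the wiring emerges from the declarations."""
    values = dict(initial_values)
    progress = True
    while progress:
        progress = False
        for p in processors:
            if p.output not in values and all(i in values for i in p.inputs):
                values[p.output] = p.fn(*(values[i] for i in p.inputs))
                progress = True
    return values

procs = [Processor(("A",), "B", lambda a: a + 1),
         Processor(("A",), "C", lambda a: a * 2),
         Processor(("B", "C"), "E", lambda b, c: b + c),
         Processor(("J", "W"), "Y", lambda j, w: j * w)]  # dormant without J, W

print(assemble_and_run(procs, {"A": 3})["E"])  # 10
```

Notice that no processor names another: feeding in A activates only the chain that computes E, while the J,W processor stays dormant.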


Self-assembling Components

The hypothesis is that complex systems can be built efficiently if they are reduced to small, local problems that are solved in relative isolation with processors.  These small blocks are then automatically assembled to reveal a fully formed system.  Such a system would no longer require engineers to understand their entire scope before making significant contributions.  These systems would be free to scale without taxing their engineering teams proportionally.  Likewise, their teams could grow without investing in lots of onboarding time for each new member.

We can use the decomposition we did above for f(A) = E to illustrate how a self-assembly would work.  Here is a simplified version of the diagram we saw earlier.
This system solves for A => E using the processors shown.  However, this could be a more sophisticated system containing other processors that do not participate in the computation of E given A.  Consider the following, where the system’s complete set of processors is included in the diagram.

The other processors are inactive for this computation, but various combinations would become active under different inputs.  Take a case where the inputs were J and W, and processors were in place to handle these inputs such that the computation J,W => Y was possible.

The inputs J and W would trigger a different set of processors than before, leaving those that computed A => E dormant.

The set of processors triggered for a given input is an emergent property of the complete set of processors within the system.  An assembler mechanism exists to determine when each processor can participate in the computation.  It makes this decision at runtime, allowing for a fully dynamic wiring for each request.  As a result, processors can be organized in any way and do not need to be aware of each other.  This makes their functionality easier to add, remove, and update than conventional mechanisms like switch statements or inheritance, which are statically determined and more rigidly structured.

Extending traditional systems often means ramping up on a lot of code to understand where the relevant inflection points are for a change or feature.  Self-assembly relaxes the urgency for this deeper context and shifts the focus towards getting the right interaction designs for each component.  It also enables more thorough testing since processors are naturally isolated from each other and simpler to unit test.  They can also be assembled and run as a group with mocked dependencies to facilitate thorough end-to-end validation.

Self-assembly frees engineers to focus on solving local problems and adding value without having to wrestle with the entire end-to-end context.  State validation is a good example of an enhancement that requires only local context with this architecture.  The computation of J,W => Y above can be enhanced to include additional validation of V whenever it is generated.  This could be achieved by adding a new processor that operates on V as an input: illustrated below.

The new processor V => V would take a value and raise an error if that value is invalid for some reason.  This validation would be triggered whenever V is present in the system, whether or not J,W => Y is being computed.  This is by design: each processor is reused whenever its services are needed.

This validator pattern emerges often in the new Playback Service.  For example, we use it to detect whether data sent by clients has been tampered with mid-flight.  This is done using HMAC calculations to verify that the data matches a client-provided hash value.  As with other processors, the integrity protection service provided this way is available for use during any request.
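The integrity check itself is standard HMAC verification. Here is a minimal sketch (the secret and payload are made up; key management in the real service is considerably more involved):

```python
import hashlib
import hmac

def verify_payload(payload, client_mac, secret):
    """Recompute the HMAC and compare in constant time; a mismatch
    means the payload was altered in flight (or keys disagree)."""
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, client_mac)

secret = b"shared-secret"  # illustrative only
payload = b'{"title_id": 70143836, "position": 1337}'
mac = hmac.new(secret, payload, hashlib.sha256).hexdigest()
print(verify_payload(payload, mac, secret))          # True
print(verify_payload(payload + b"x", mac, secret))   # False
```

Wrapped as a validator processor, this check runs automatically for any request whose inputs include a client-provided hash.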


Challenges of Self-assembly

The use of self-assembling components offers clear advantages over hand wiring.  It enables fluid architectures that can change dynamically at runtime and simplifies feature isolation so components can evolve rapidly with minimal impact to the overall system.  Moreover, it decouples team size from system complexity so the two can scale independently.

Despite these benefits, building a working solution that enables self-assembly is non-trivial.  Such a system has to decide which operations are executed when, and in what order.  It has to manage the computation pipeline without adding too much overhead or complexity, all while scaling up with the set of processors.  It also needs to be relatively unobtrusive so developers can remain focused on building the service.  These were some of the challenges my team had to overcome when building the new Playback architecture atop the concepts of self-assembly.


Upcoming...

Subsequent blog posts will take us deeper into the workings of the new Playback Service architecture and provide more details about how we solved the challenges above and other issues intrinsic to self-assembly.  We will also be discussing how this architecture is designed to enable fully dynamic end-points (where the set of rules/processors can change for each request) as well as dynamic services where the set of end-points can change for a running server.

The new Playback Service architecture based on self-assembling components provides a flexible programming model that is easy to develop and test.  It greatly improves our ability to innovate as we continue to enhance the viewing experience for our members.

We are always looking for talented engineers to join us.  So reach out if you are excited about this kind of engineering endeavor and would like to learn more about this and other things we are working on.