Wednesday, October 19, 2016

Netflix Chaos Monkey Upgraded

We are pleased to announce a significant upgrade to one of our more popular OSS projects.  Chaos Monkey 2.0 is now on github!

Years ago, we decided to improve the resiliency of our microservice architecture.  At our scale it is guaranteed that servers on our cloud platform will sometimes suddenly fail or disappear without warning.  If we don’t have proper redundancy and automation, these disappearing servers could cause service problems.

The Freedom and Responsibility culture at Netflix doesn’t have a mechanism to force engineers to architect their code in any specific way.  Instead, we found that we could build strong alignment around resiliency by taking the pain of disappearing servers and bringing that pain forward.  We created Chaos Monkey to randomly choose servers in our production environment and turn them off during business hours.  Some people thought this was crazy, but we couldn’t depend on infrequent, naturally occurring failures to drive the right behavior.  Knowing that terminations would happen on a frequent basis created strong alignment among our engineers to build in the redundancy and automation to survive this type of incident without any impact to the millions of Netflix members around the world.

We value Chaos Monkey as a highly effective tool for improving the quality of our service.  Now Chaos Monkey has evolved.  We rewrote the service for improved maintainability and added some great new features.  The evolution of Chaos Monkey is part of our commitment to keep our open source software up to date with our current environment and needs.

Integration with Spinnaker

Chaos Monkey 2.0 is fully integrated with Spinnaker, our continuous delivery platform.
Service owners set their Chaos Monkey configs through the Spinnaker apps, Chaos Monkey gets information about how services are deployed from Spinnaker, and Chaos Monkey terminates instances through Spinnaker.

Since Spinnaker works with multiple cloud backends, Chaos Monkey does as well. In the Netflix environment, Chaos Monkey terminates virtual machine instances running on AWS and Docker containers running on Titus, our container cloud.

Integration with Spinnaker gave us the opportunity to improve the UX as well.  We interviewed our internal customers and came up with a more intuitive method of scheduling terminations.  Service owners can now express a schedule in terms of the mean time between terminations, rather than a probability over an arbitrary period of time.  We also added grouping by app, stack, or cluster, so that applications with different redundancy architectures can schedule Chaos Monkey appropriately for their configuration. Chaos Monkey now also supports specifying exceptions, so users can opt specific clusters out.  Some engineers at Netflix use this feature to opt out small clusters that are used for testing.
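As a rough sketch of what the new scheduling model means in practice (hypothetical code, not the actual Chaos Monkey implementation), a mean time between terminations can be translated into a daily termination decision by treating terminations as a Poisson process:

    import math
    import random

    def daily_termination_probability(mean_days_between_terminations: float) -> float:
        """Probability of at least one termination on a given workday,
        assuming terminations arrive as a Poisson process."""
        return 1.0 - math.exp(-1.0 / mean_days_between_terminations)

    def should_terminate_today(mean_days_between_terminations: float) -> bool:
        """Flip a biased coin once per day for each group (app, stack, or cluster)."""
        return random.random() < daily_termination_probability(mean_days_between_terminations)

    # A service owner asking for, on average, one termination every 5 workdays:
    print(round(daily_termination_probability(5), 2))   # ~0.18

Expressing the schedule this way lets owners reason about “one termination a week” instead of tuning an abstract probability.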

Chaos Monkey Spinnaker UI
Tracking Terminations

Chaos Monkey can now be configured to notify trackers: external services that receive a notification each time Chaos Monkey terminates an instance.  Internally, we use this feature to report metrics into Atlas, our telemetry platform, and Chronos, our event tracking system.  The graph below, taken from the Atlas UI, shows the number of Chaos Monkey terminations for a segment of our service.  We can see chaos in action.  Chaos Monkey even periodically terminates itself.

Chaos Monkey termination metrics in Atlas
Termination Only

Netflix only uses Chaos Monkey to terminate instances.  Previous versions of Chaos Monkey allowed the service to ssh into a box and perform other actions like burning up CPU, taking disks offline, etc.  If you currently use one of the prior versions of Chaos Monkey to run an experiment that involves anything other than turning off an instance, you may not want to upgrade since you would lose that functionality.


We also used this opportunity to introduce many small features such as automatic opt-out for canaries, cross-account terminations, and automatic disabling during an outage.  Find the code on the Netflix github account and embrace the chaos!

-Chaos Engineering Team at Netflix
Lorin Hochstein, Casey Rosenthal

Wednesday, October 12, 2016

To Be Continued: Helping you find shows to continue watching on Netflix


Our objective in improving the Netflix recommendation system is to create a personalized experience that makes it easier for our members to find great content to enjoy. The ultimate goal of our recommendation system is to know the exact perfect show for the member and just start playing it when they open Netflix. While we still have a long way to go to achieve that goal, there are areas where we can reduce the gap significantly.

When a member opens the Netflix website or app, she may be looking to discover a new movie or TV show that she has never watched before, or, alternatively, she may want to continue watching a partially-watched movie or a TV show she has been binging on. If we can reasonably predict when a member is more likely to be in the continuation mode and which shows she is more likely to resume, it makes sense to place those shows in prominent places on the home page.
While most recommendation work focuses on discovery, in this post we focus on the continuation mode and explain how we used machine learning to improve the member experience for both modes. In particular, we focus on a row called “Continue Watching” (CW) that appears on the Netflix member homepage on most platforms. This row serves as an easy way to find shows that the member has recently (partially) watched and may want to resume. As you can imagine, a significant proportion of member streaming hours are spent on content played from this row.

Continue Watching

Previously, the Netflix app on some platforms displayed a row with recently watched shows (here we use the term show broadly to include all forms of video content on Netflix, including movies and TV series) sorted by how recently each show was played. How the row was placed on the page was determined by rules that depended on the device type. For example, the website only displayed a single continuation show in the top-left corner of the page. While these are reasonable baselines, we set out to unify the member experience of the CW row across platforms and improve it along two dimensions:

  • Improve the placement of the row on the page by placing it higher when a member is more likely to resume a show (continuation mode), and lower when a member is more likely to look for a new show to watch (discovery mode)
  • Improve the ordering of recently-watched shows in the row using their likelihood to be resumed in the current session

Intuitively, there are a number of activity patterns that might indicate a member’s likelihood to be in the continuation mode. For example, a member is perhaps likely to resume a show if she:

  • is in the middle of a binge; i.e., has been recently spending a significant amount of time watching a TV show, but hasn’t yet reached its end
  • has partially watched a movie recently
  • has often watched the show around the current time of the day or on the current device
On the other hand, a discovery session is more likely if a member:
  • has just finished watching a movie or all episodes of a TV show
  • hasn’t watched anything recently
  • is new to the service
These hypotheses, along with the high fraction of streaming hours spent by members in continuation mode, motivated us to build machine learning models that can identify and harness these patterns to produce a more effective CW row.

Building a Recommendation Model for Continue Watching

To build a recommendation model for the CW row, we first need to compute a collection of features that extract patterns of the behavior that could help the model predict when someone will resume a show. These may include features about the member, the shows in the CW row, the member’s past interactions with those shows, and some contextual information. We then use these features as inputs to build machine learning models. Through an iterative process of variable selection, model training, and cross validation, we can refine and select the most relevant set of features.

While brainstorming for features, we considered many ideas for building the CW models, including the following (an illustrative sketch follows the list):

  1. Member-level features:
    • Data about member’s subscription, such as the length of subscription, country of signup, and language preferences
    • How active the member has been recently
    • Member’s past ratings and genre preferences
  2. Features encoding information about a show and interactions of the member with it:
    • How recently the show was added to the catalog, or watched by the member
    • How much of the movie/show the member watched
    • Metadata about the show, such as type, genre, and number of episodes; for example kids shows may be re-watched more
    • The rest of the catalog available to the member
    • Popularity and relevance of the show to the member
    • How often members resume this show
  3. Contextual features:
    • Current time of the day and day of the week
    • Location, at various resolutions
    • Devices used by the member
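Purely as an illustration of how such signals might be assembled for a single candidate show, here is a hypothetical sketch (the names below are made up for this example and are not the actual Netflix feature set):

    from dataclasses import dataclass

    @dataclass
    class CWCandidateFeatures:
        """Hypothetical feature bundle for one show in a member's CW row."""
        # Member-level
        subscription_age_days: int
        recent_play_hours_7d: float
        # Show and member-show interaction
        days_since_last_play: float
        fraction_watched: float
        episodes_remaining: int
        is_kids_show: bool
        resume_rate: float          # how often members resume this show
        # Context
        hour_of_day: int
        day_of_week: int
        device_type: str            # e.g. "iphone", "website", "tv"

    def to_vector(f: CWCandidateFeatures) -> list:
        """Flatten into a numeric vector for a ranking model."""
        return [
            f.subscription_age_days, f.recent_play_hours_7d,
            f.days_since_last_play, f.fraction_watched, float(f.episodes_remaining),
            float(f.is_kids_show), f.resume_rate,
            f.hour_of_day, f.day_of_week, hash(f.device_type) % 100,
        ]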

Two applications, two models

As mentioned above, we have two tasks related to organizing a member's continue watching shows: ranking the shows within the CW row and placing the CW row appropriately on the member’s homepage.

Show ranking

To rank the shows within the row, we trained a model that optimizes a ranking loss function. To train it, we used sessions where the member resumed a previously-watched show - i.e., continuation sessions - from a random set of members. Within each session, the model learns to differentiate amongst candidate shows for continuation and ranks them in the order of predicted likelihood of play. When building the model, we placed special importance on the model ranking the show that was actually played in the first position.
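The post does not spell out the exact loss, so the following is only a minimal sketch of the idea, using a pairwise logistic ranking loss over continuation sessions: the show that was actually resumed should score above every other candidate in the row.

    import numpy as np

    def pairwise_ranking_loss(w, sessions):
        """sessions: list of (X, resumed_index) pairs, where X is a
        (num_candidates x num_features) matrix for one continuation session."""
        loss = 0.0
        for X, resumed in sessions:
            scores = X @ w                                   # linear score per candidate
            margins = scores[resumed] - np.delete(scores, resumed)
            loss += np.sum(np.log1p(np.exp(-margins)))       # penalize candidates outranking the resumed show
        return loss

    # Toy example: 3 candidate shows, 4 features each, the member resumed show 1.
    rng = np.random.default_rng(0)
    sessions = [(rng.normal(size=(3, 4)), 1)]
    print(pairwise_ranking_loss(np.zeros(4), sessions))      # 2 * log(2) at w = 0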

We performed an offline evaluation to understand how well the model ranks the shows in the CW row. Our baseline for comparison was the previous system, where the shows were simply sorted by recency of last time each show was played. This recency rank is a strong baseline (much better than random) and is also used as a feature in our new model. Comparing the model vs. recency ranking, we observed significant lift in various offline metrics. The figure below displays Precision@1 of the two schemes over time. One can see that the lift in performance is much greater than the daily variation.
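For reference, Precision@1 here simply measures how often the top-ranked show in the CW row is the one the member actually resumed; a minimal sketch:

    def precision_at_1(ranked_rows, resumed_shows):
        """ranked_rows: list of CW rows (each a list of show ids, best first);
        resumed_shows: the show actually resumed in each session."""
        hits = sum(1 for row, resumed in zip(ranked_rows, resumed_shows)
                   if row and row[0] == resumed)
        return hits / len(resumed_shows)

    print(precision_at_1([["narcos", "sid"], ["sid", "narcos"]], ["narcos", "narcos"]))  # 0.5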

This model performed significantly better than recency-based ranking in an A/B test and better matched our expectations for member behavior. As an example, we learned that the members whose rows were ranked using the new model had fewer plays originating from the search page. This meant that many members had been resorting to searching for a recently-watched show because they could not easily locate it on the home page; a suboptimal experience that the model helped ameliorate.

Row placement

To place the CW row appropriately on a member’s homepage, we would like to estimate the likelihood of the member being in a continuation mode vs. a discovery mode. With that likelihood we could take different approaches. A simple approach would be to turn row placement into a binary decision problem where we consider only two candidate positions for the CW row: one position high on the page and another one lower down. By applying a threshold on the estimated likelihood of continuation, we can decide in which of these two positions to place the CW row. That threshold could be tuned to optimize some accuracy metrics. Another approach is to take the likelihood and then map it onto different positions, possibly based on the content at that location on the page. In any case, getting a good estimate of the continuation likelihood is critical for determining the row placement. In the following, we discuss two potential approaches for estimating the likelihood of the member operating in a continuation mode.
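A hedged sketch of the simple thresholded variant described above (the positions and threshold below are illustrative, not the values we use):

    HIGH_POSITION = 1   # near the top of the homepage
    LOW_POSITION = 8    # further down, after discovery rows

    def place_cw_row(p_continuation: float, threshold: float = 0.5) -> int:
        """Binary placement: put the CW row high when the member is likely in
        continuation mode, low otherwise. The threshold would be tuned offline
        and validated with A/B tests."""
        return HIGH_POSITION if p_continuation >= threshold else LOW_POSITION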

Reusing the show-ranking model

A simple approach to estimating the likelihood of continuation vs. discovery is to reuse the scores predicted by the show-ranking model. More specifically, we could calibrate the scores of individual shows in order to estimate the probability P(play(s)=1) that each show s will be resumed in the given session. We can use these individual probabilities over all the shows in the CW row to obtain an overall probability of continuation; i.e., the probability that at least one show from the CW row will be resumed. For example, under a simple assumption of independence of different plays, we can write the probability that at least one show from the CW row will be played as:
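    P(continuation) = 1 − ∏_{s ∈ CW} (1 − P(play(s) = 1))

that is, one minus the probability that none of the shows in the row is resumed, where P(play(s) = 1) is the calibrated resume probability for show s.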

Dedicated row model

In this approach, we train a binary classifier to differentiate between continuation sessions as positive labels and sessions where the user played a show for the first time (discovery sessions) as negative labels. Potential features for this model could include member-level and contextual features, as well as the interactions of the member with the most recent shows in the viewing history.
Comparing the two approaches, the first is simpler because it requires only a single model, as long as the probabilities are well calibrated. However, the second is likely to provide a more accurate estimate of continuation because we can train a classifier specifically for that task.
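A minimal sketch of the dedicated row model under these definitions (assuming scikit-learn and made-up feature values; the post does not specify the actual model class or features):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # One row per session: [fraction of last show watched, plays in last 7 days, hour of day]
    # Label: 1 = continuation session (resumed a show), 0 = discovery session (first play of a new show).
    X_train = np.array([
        [0.45, 4, 21.0],
        [1.00, 1,  9.0],
        [0.30, 6, 22.0],
        [0.00, 0, 14.0],
    ])
    y_train = np.array([1, 0, 1, 0])

    row_model = LogisticRegression().fit(X_train, y_train)

    # Estimated probability that the member's next session is a continuation session.
    p_continuation = row_model.predict_proba(np.array([[0.6, 3, 21.0]]))[0, 1]
    print(round(p_continuation, 2))

Its output could then feed the placement logic sketched earlier.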

Tuning the placement

In our experiments, we evaluated our estimates of continuation likelihood using classification metrics and achieved good offline results. However, a challenge that still remains is to find an optimal mapping for that estimated likelihood, i.e., to balance continuation and discovery. Varying the placement creates a trade-off between two types of errors in our prediction: false positives (where we incorrectly predict that the member wants to resume a show from the CW row) and false negatives (where we incorrectly predict that the member wants to discover new content). These two types of errors have different impacts on the member. In particular, a false negative makes it harder for members to continue bingeing on a show. While experienced members can find the show by scrolling down the page or by using the search functionality, the additional friction can make it more difficult for people new to the service. On the other hand, a false positive leads to wasted screen real estate, which could have been used to display more relevant recommendations for discovery. Since the impacts of the two types of errors on the member experience are difficult to measure accurately offline, we A/B tested different placement mappings and learned from online experiments which mapping led to the highest member engagement.

Context Awareness

One of our hypotheses was that continuation behavior depends on context: time, location, device, etc. If that is the case, given proper features, the trained models should be able to detect those patterns and adapt the predicted probability of resuming shows based on the current context of a member. For example, members may have habits of watching a certain show around the same time of the day (for example, watching comedies at around 10 PM on weekdays). As an example of context awareness, the following screenshots demonstrate how the model uses contextual features to distinguish between the behavior of a member on different devices. In this example, the profile has just watched a few minutes of the show “Sid the Science Kid” on an iPhone and the show “Narcos” on the Netflix website. In response, the CW model immediately ranks “Sid the Science Kid” at the top position of the CW row on the iPhone, and puts “Narcos” at the first position on the website.

Serving the Row

Members expect the CW row to be responsive and change dynamically after they watch a show. Moreover, some of the features in the model are time- and device-dependent and cannot be precomputed in advance, an approach we use for some of our other recommendation systems. Therefore, we need to compute the CW row in real-time to make sure it is fresh when we get a request for a homepage at the start of a session. To keep it fresh, we also need to update it within a session after certain user interactions and immediately push that update to the client to refresh the homepage. Computing the row on-the-fly at our scale is challenging and requires careful engineering. For example, some features are more expensive to compute for members with a longer viewing history, but we need reasonable response times for all members because continuation is a very common scenario. We collaborated with several engineering teams to create a dynamic and scalable way of serving the row to address these challenges.


Having a better Continue Watching row clearly makes it easier for our members to jump right back into the content they are enjoying, while also getting out of the way when they want to discover something new. While we’ve taken a few steps towards improving this experience, there are still many areas for improvement. One challenge is that we seek to unify how we place this row with respect to the rest of the rows on the homepage, which are predominantly focused on discovery. This is challenging because different algorithms are designed to optimize for different actions, so we need a way to balance them. We also want to be thoughtful about pushing CW too much; we want people to “Binge Responsibly” and also explore new content. We also have details to dig into, such as how to determine whether a user has actually finished a show so we can remove it from the row. This can be complicated by scenarios such as someone turning off the TV but not the playback device, or falling asleep while watching. We also keep an eye out for new ways to use the CW model in other aspects of the product.
Can’t wait to see how the Netflix Recommendation saga continues? Join us in tackling these kinds of algorithmic challenges and help write the next episode.

Wednesday, September 21, 2016

Zuul 2 : The Netflix Journey to Asynchronous, Non-Blocking Systems

We recently made a major architectural change to Zuul, our cloud gateway. Did anyone even notice!?  Probably not... Zuul 2 does the same thing that its predecessor did -- acting as the front door to Netflix’s server infrastructure, handling traffic from all Netflix users around the world.  It also routes requests, supports developers’ testing and debugging, provides deep insight into our overall service health, protects Netflix from attacks, and channels traffic to other cloud regions when an AWS region is in trouble. The major architectural difference between Zuul 2 and the original is that Zuul 2 runs on an asynchronous and non-blocking framework, using Netty.  After running in production for the last several months, the primary advantage (one that we expected when embarking on this work) is that it provides the capability for devices and web browsers to have persistent connections back to Netflix at Netflix scale.  With more than 83 million members, each with multiple connected devices, this is a massive scale challenge.  By having a persistent connection to our cloud infrastructure, we can enable lots of interesting product features and innovations, reduce overall device requests, improve device performance, and understand and debug the customer experience better.  We also hoped that Zuul 2 would offer resiliency benefits and performance improvements in terms of latencies, throughput, and costs.  But as you will learn in this post, our aspirations have differed from the results.

Differences Between Blocking vs. Non-Blocking Systems

To understand why we built Zuul 2, you must first understand the architectural differences between asynchronous and non-blocking (“async”) systems vs. multithreaded, blocking (“blocking”) systems, both in theory and in practice.  

Zuul 1 was built on the Servlet framework. Such systems are blocking and multithreaded, which means they process requests by using one thread per connection. I/O operations are done by choosing a worker thread from a thread pool to execute the I/O, and the request thread is blocked until the worker thread completes. The worker thread notifies the request thread when its work is complete. This works well with modern multi-core AWS instances handling hundreds of concurrent connections each. But when things go wrong, like backend latency increasing or devices retrying due to errors, the count of active connections and threads increases. When this happens, nodes get into trouble and can go into a death spiral where backed-up threads spike server loads and overwhelm the cluster.  To offset these risks, we built in throttling mechanisms and libraries (e.g., Hystrix) to help keep our blocking systems stable during these events.

Multithreaded System Architecture
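To make the contrast concrete, here is a deliberately simplified sketch of the thread-per-connection, blocking style (Zuul 1 itself is Java and Servlet-based; this Python sketch is only an analogy):

    import socket
    import threading

    def call_backend(request: bytes) -> bytes:
        """Placeholder for a blocking proxy call to an origin service."""
        return b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok"

    def handle_connection(conn: socket.socket) -> None:
        """One dedicated thread per connection: every I/O call blocks, holding
        the thread (and its stack) until the backend work completes."""
        with conn:
            request = conn.recv(4096)           # blocks
            response = call_backend(request)    # blocks
            conn.sendall(response)              # blocks

    def serve_blocking(port: int = 8080) -> None:
        srv = socket.socket()
        srv.bind(("0.0.0.0", port))
        srv.listen()
        while True:
            conn, _ = srv.accept()
            threading.Thread(target=handle_connection, args=(conn,)).start()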

Async systems operate differently, with generally one thread per CPU core handling all requests and responses. The lifecycle of a request and response is handled through events and callbacks. Because there is no dedicated thread per request, the cost of a connection is cheap: just a file descriptor and the addition of a listener. In the blocking model, by contrast, each connection costs a thread and carries heavy memory and system overhead. There are some efficiency gains because data stays on the same CPU, making better use of CPU-level caches and requiring fewer context switches. The fallout of backend latency and “retry storms” (customers and devices retrying requests when problems occur) is also less stressful on the system because connections and increased events in the queue are far less expensive than piling up threads.

Asynchronous and Non-blocking System Architecture
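And the asynchronous counterpart, where a single event loop multiplexes many connections and each await yields instead of blocking a thread (again an asyncio analogy, not Zuul 2’s actual Netty code):

    import asyncio

    async def call_backend(request: bytes) -> bytes:
        """Stand-in for a non-blocking call to an origin service."""
        await asyncio.sleep(0.05)               # yields to the event loop
        return b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok"

    async def handle_connection(reader, writer):
        """No dedicated thread: while the backend call is in flight, the event
        loop is free to service other connections."""
        request = await reader.read(4096)
        writer.write(await call_backend(request))
        await writer.drain()
        writer.close()

    async def serve_async(port: int = 8080):
        server = await asyncio.start_server(handle_connection, "0.0.0.0", port)
        async with server:
            await server.serve_forever()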

The advantages of async systems sound glorious, but the above benefits come at a cost to operations. Blocking systems are easy to grok and debug. A thread is always doing a single operation, so the thread’s stack is an accurate snapshot of the progress of a request or spawned task, and a thread dump can be read to follow a request spanning multiple threads by following locks. An exception thrown just pops up the stack. A “catch-all” exception handler can clean up everything that isn’t explicitly caught.

Async, by contrast, is callback-based and driven by an event loop. The event loop’s stack trace is meaningless when trying to follow a request. It is difficult to follow a request as events and callbacks are processed, and the tools to help debug this are sorely lacking. Edge cases, unhandled exceptions, and incorrectly handled state changes create dangling resources resulting in ByteBuf leaks, file descriptor leaks, lost responses, etc. These types of issues have proven to be quite difficult to debug because it is difficult to know which event wasn’t handled properly or cleaned up appropriately.

Building Non-Blocking Zuul

Building Zuul 2 within Netflix’s infrastructure was more challenging than expected. Many services within the Netflix ecosystem were built with an assumption of blocking.  Netflix’s core networking libraries are also built with blocking architectural assumptions; many libraries rely on thread local variables to build up and store context about a request. Thread local variables don’t work in an async non-blocking world where multiple requests are processed on the same thread.  Consequently, much of the complexity of building Zuul 2 was in teasing out the dark corners where thread local variables were being used. Other challenges involved converting blocking networking logic into non-blocking code, finding blocking calls deep inside libraries, fixing resource leaks, and converting core infrastructure to run asynchronously.  There is no one-size-fits-all strategy for converting blocking network logic to async; each case must be analyzed and refactored individually. The same applies to core Netflix libraries, where some code was modified and some had to be forked and refactored to work with async.  The open source project Reactive-Audit was helpful in instrumenting our servers to discover cases where our code and libraries were blocking.

We took an interesting approach to building Zuul 2. Because blocking systems can run code asynchronously, we started by first changing our Zuul Filters and filter chaining code to run asynchronously.  Zuul Filters contain the specific logic that we create to perform our gateway functions (routing, logging, reverse proxying, DDoS prevention, etc.). We refactored core Zuul, the base Zuul Filter classes, and our Zuul Filters using RxJava to allow them to run asynchronously. We now have two types of filters that are used together: async filters for I/O operations, and sync filters for logical operations that don’t require I/O.  Async Zuul Filters allowed us to execute the exact same filter logic in both a blocking system and a non-blocking system.  This gave us the ability to work with one filter set, so that we could develop gateway features for our partners while also developing the Netty-based architecture in a single codebase. With async Zuul Filters in place, building Zuul 2 was “just” a matter of making the rest of our Zuul infrastructure run asynchronously and non-blocking. The same Zuul Filters could simply drop into both architectures.
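As a loose sketch of how sync and async filters can share one chain (Zuul’s real filters are Java classes built on RxJava; the asyncio-style code below is only an analogy with hypothetical filter names):

    import asyncio

    def add_routing_header(request: dict) -> dict:
        """Sync filter: pure logic, no I/O."""
        request.setdefault("headers", {})["x-route"] = "api"
        return request

    async def lookup_origin(request: dict) -> dict:
        """Async filter: I/O-bound work is awaited, so it never blocks the loop."""
        await asyncio.sleep(0.01)                 # stand-in for a network lookup
        request["origin"] = "api.internal.example"
        return request

    async def run_filter_chain(request: dict) -> dict:
        """One chain hosts both flavors: sync filters run inline, async filters are awaited."""
        for filt in (add_routing_header, lookup_origin):
            result = filt(request)
            request = await result if asyncio.iscoroutine(result) else result
        return request

    print(asyncio.run(run_filter_chain({"path": "/v1/titles"})))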

Results of Zuul 2 in Production

Hypotheses varied greatly on the benefits of an async architecture for our gateway. Some thought we would see an order of magnitude increase in efficiency due to the reduction of context switching and more efficient use of CPU caches, while others expected that we’d see no efficiency gain at all.  Opinions also varied on the complexity of the change and the development effort.

So what did we gain by doing this architectural change? And was it worth it? This topic is hotly debated. The Cloud Gateway team pioneered the effort to create and test async-based services at Netflix. There was a lot of interest in understanding how microservices using async would operate at Netflix, and Zuul looked like an ideal service for seeing benefits. 

While we did not see a significant efficiency benefit in migrating to async and non-blocking, we did achieve the goals of connection scaling. Zuul benefits by greatly decreasing the cost of network connections, which will enable push and bi-directional communication to and from devices. These features will enable more real-time user experience innovations and will reduce overall cloud costs by replacing today’s “chatty” device protocols (which account for a significant portion of API traffic) with push notifications. There is also some resiliency advantage in handling retry storms and latency from origin systems better than the blocking model. We are continuing to improve in this area; however, it should be noted that the resiliency advantages have not been straightforward and have required effort and tuning.

With the ability to drop Zuul’s core business logic into either blocking or async architectures, we have an interesting apples-to-apples comparison of blocking to async.  So how do two systems doing the exact same real work, although in very different ways, compare in terms of features, performance and resiliency?  After running Zuul 2 in production for the last several months, our evaluation is that the more CPU-bound a system is, the less of an efficiency gain we see.  

We have several different Zuul clusters that front origin services like API, playback, website, and logging. Each origin service demands that different operations be handled by the corresponding Zuul cluster.  The Zuul cluster that fronts our API service, for example, does the most on-box work of all our clusters, including metrics calculations, logging, and decrypting incoming payloads and compressing responses.  We see no efficiency gain by swapping an async Zuul 2 for a blocking one for this cluster.  From a capacity and CPU point of view they are essentially equivalent, which makes sense given how CPU-intensive the Zuul service fronting API is. They also tend to degrade at about the same throughput per node. 

The Zuul cluster that fronts our Logging services has a different performance profile. Zuul is generally receiving logging and analytics messages from devices and is write-heavy, so requests are large, but responses are small and not encrypted by Zuul.  As a result, Zuul is doing much less work for this cluster.  While still CPU-bound, we see about a 25% increase in throughput corresponding with a 25% reduction in CPU utilization by running Netty-based Zuul.  We thus observed that the less work a system actually does, the more efficiency we gain from async. 

Overall, the value we get from this architectural change is high, with connection scaling being the primary benefit, but it does come at a cost. We have a system that is much more complex to debug, code, and test, and we are working within an ecosystem at Netflix that operates on an assumption of blocking systems. It is unlikely that the ecosystem will change anytime soon, so as we add and integrate more features to our gateway it is likely that we will need to continue to tease out thread local variables and other assumptions of blocking in client libraries and other supporting code.  We will also need to rewrite blocking calls asynchronously.  This is an engineering challenge unique to working with a well established platform and body of code that makes assumptions of blocking. Building and integrating Zuul 2 in a greenfield would have avoided some of these complexities, but we operate in an environment where these libraries and services are essential to the functionality of our gateway and operation within Netflix’s ecosystem.

We are in the process of releasing Zuul 2 as open source. Once it is released, we’d love to hear from you about your experiences with it and hope you will share your contributions! We plan on adding new features such as HTTP/2 and WebSocket support to Zuul 2 so that the community can also benefit from these innovations.

- The Cloud Gateway Team (Mikey Cohen, Mike Smith, Susheel Aroskar, Arthur Gonigberg, Gayathri Varadarajan, and Sudheer Vinukonda)

Thursday, September 15, 2016


Why IMF?

As Netflix expanded into a global entertainment platform, our supply chain needed an efficient way to vault our masters in the cloud that didn’t require a different version for every territory in which we have our service.  A few years ago we discovered the Interoperable Master Format (IMF), a standard created by the Society of Motion Picture and Television Engineers (SMPTE). The IMF framework is based on the Digital Cinema standard of component-based elements in a standard container, with assets mapped together via metadata instructions.  By using this standard, Netflix is able to hold a single set of core assets plus the unique elements needed to make those assets relevant in a local territory.

So for a title like Narcos, where the video is largely the same in all territories, we can hold the Primary AV and the specific frames that are different for, say, the Japanese title sequence version.  This reduces duplication of assets that are 95% the same: we hold that 95% once and pair it with the 5% of differences needed for a specific use case.  The format also serves to minimize the risk of multiple versions being introduced into our vault, and allows us to keep better track of our assets, as they stay within one contained package even when new elements are introduced.  This allows us to avoid “versionitis,” as outlined in this previous blog.  We can leverage one set of master assets and utilize supplemental or additional master assets in IMF to make our localized language versions, as well as any transcoded versions, without needing to store anything more than master materials.  Primary AV, supplemental AV, subtitles, non-English audio and other assets needed for global distribution can all live in an “uber” master that can be continually added to as needed rather than recreated.  When a “virtual version” is needed, only the instructions need to be created, not a whole new master.  IMF provides maximum flexibility without having to actually create every permutation of a master.

OSS for IMF:

Netflix has a history of identifying shared problems within industries and seeking solutions via open source tools.  Because many of our content partners have the same issues Netflix has with regard to global versions of their content, we saw IMF as a shared opportunity in the digital supply chain space.  In order to support IMF interoperability and share the benefits of the format with the rest of the content community, we have invested in several open source IMF tools.  One example is the IMF Transform Tool, which gives users the ability to transcode from IMF to DPP (Digital Production Partnership).  Realizing that Netflix is only one recipient of assets from content owners, we wanted to create a solution that would allow them to enjoy the benefits of IMF and still create deliverables for existing outlets.  Similarly, Netflix understands the EST business is still important to content owners, so we’re adding another open source transform tool that converts IMF to a package similar to an iTunes package (when using the Apple ProRes encoder). This will allow users to take a SMPTE-compliant IMF and convert it to a package which can be used for TVOD delivery without incurring significant costs via proprietary tools.  A final shared problem is editing the sets of instructions we mentioned earlier.  There are many great tools in the marketplace that create IMF packages, and while they are fully featured and offer powerful solutions for creating IMFs, they can be overkill for making quick changes to a CPL (Composition Playlist).  Things like adding metadata markers, EIDR numbers or other changes to the instructions for that IMF can all be done in our newly released OSS IMF CPL Editor.  This leaves the fully featured commercial software/hardware tools in facilities free for IMF creation, rather than tied up making small changes to metadata.

IMF Transforms

The IMF Transform uses other open source technologies, including Java, ffmpeg, bmxlib, and x264, in its framework.  These tools and their source code can be found on GitHub.

IMF CPL Editor

The IMF CPL Editor is cross-platform and can be compiled on Mac, Windows, or Linux operating systems.  The tool opens a composition playlist (CPL) in a timeline and lists all assets.  Supported essence files include MXF-wrapped .wav audio as well as .ttml and .imsc subtitle files. The user can add, edit, and delete audio, subtitle, and metadata assets from the timeline. The edits can be saved back to the existing CPL or saved as a new CPL, modifying the Packing List (PKL) and Asset Map as well.  The source code and compiled tool will be open source and available on GitHub.

What’s Next:

We hope others will fork these open source efforts and make even more functionality available to the growing community of IMF users.  It would be great to see a transform function to other AS-11 formats, XDCAM 50, or other widely used broadcast “play-out” formats.  In addition to the base package functionality that currently exists, Netflix will be adding supplemental package support to the IMF CPL Editor in October. We look forward to seeing what developers create. These solutions, coupled with the Photon tool Netflix has already released, create a strong foundation for making an efficient and comprehensive IMF library an achievable goal for content owners seeking to exploit their assets in the global entertainment market.

By: Chris Fetner and Brian Kenworthy

Wednesday, September 14, 2016

Netflix OSS Meetup Recap - September 2016

Last week, we welcomed roughly 200 attendees to Netflix HQ in Los Gatos for Season 4, Episode 3 of our Netflix OSS Meetup. The meetup group was created in 2013 to discuss our various OSS projects amongst the broader community of OSS enthusiasts. This episode centered around security-focused OSS releases, and speakers included both Netflix creators of security OSS as well as community users and contributors.

We started the night with an hour of networking, Mexican food, and drinks. As we kicked off the presentations, we discussed the history of security OSS at Netflix - we first released Security Monkey in 2014, and we're closing in on our tenth security release, likely by the end of 2016. The slide below provides a comprehensive timeline of the security software we've released as Netflix OSS.

Wes Miaw of Netflix began the presentations with a discussion of MSL (Message Security Layer), a modern security protocol that addresses a number of difficult security problems. Next was Patrick Kelley, also of Netflix, who gave the crowd an overview of Repoman, an upcoming OSS release that works to right-size permissions within Amazon Web Services environments.

Next up were our external speakers. Vivian Ho and Ryan Lane of Lyft discussed their use of BLESS, an SSH Certificate Authority implemented as an AWS Lambda function. They're using it in conjunction with their OSS kmsauth to provide engineers SSH access to AWS instances. Closing the presentations was Chris Dorros of OpenDNS/Cisco. Chris talked about his contribution to Lemur, the SSL/TLS certificate management system we open sourced last year. Chris has added functionality to support the DigiCert Certificate Authority. After the presentations, the crowd moved back to the cafeteria, where we'd set up demo stations for a variety of our security OSS releases.

Patrick Kelley talking about Repoman

Thanks to everyone who attended - we're planning the next meetup for early December 2016. Join our group for notifications. If you weren't able to attend, we have both the slides and video available.

Upcoming Talks from the Netflix Security Team

Below is a schedule of upcoming presentations from members of the Netflix security team (through 2016). If you'd like to hear more talks from Netflix security, some of our past presentations are available on our YouTube channel.

  • Automacon (Portland, OR) - Sept 27-29, 2016: Scott Behrens and Andy Hoernecke
  • AppSecUSA 2016 (Washington, DC) - Oct 11-14, 2016: Scott Behrens and Andy Hoernecke
  • O'Reilly Security NYC - Oct 30-Nov 2, 2016
  • Ping Identity SF (San Francisco) - Nov 2, 2016
  • QConSF (San Francisco) - Nov 7-11, 2016: "The Psychology of Security Automation," Manish Mehta
  • AWS re:Invent (Las Vegas) - Nov 28-Dec 2, 2016: "Solving the First Secret Problem: Securely Establishing Identity using the AWS Metadata Service"
  • AWS re:Invent (Las Vegas) - Nov 28-Dec 2, 2016

If you're interested in solving interesting security problems while developing OSS that the rest of the world can use, we'd love to hear from you! Please see our jobs site for openings.

By Jason Chan