Wednesday, September 30, 2015

Moving from Asgard to Spinnaker

Six years ago, Netflix successfully jumped headfirst into the AWS Cloud and along the way we ended up writing quite a lot of software to help us out. One particular project proved instrumental in allowing us to efficiently automate AWS deployments: Asgard.

Asgard created an intuitive model for cloud-based applications that has made deployment and ongoing management of AWS resources easy for hundreds of engineers at Netflix. Introducing the notion of clusters, applications, specific naming conventions, and deployment options like rolling push and red/black has ultimately yielded more productive teams who can spend more time coding business logic rather than becoming AWS experts.  What’s more, Asgard has been a successful OSS project adopted by various companies. Indeed, the utility of Asgard’s battle-hardened AWS deployment and management features is undoubtedly due to the hard work and innovation of its contributors both within Netflix and the community.

Netflix, nevertheless, has evolved since first embracing the cloud. Our footprint within AWS has expanded to meet the demand of an increasingly global audience; moreover, the number of applications required to service our customers has swelled. Our rate of innovation, which maintains our global competitive edge, has also grown. Consequently, our desire to move code rapidly, with a high degree of confidence and overall visibility, has also increased. In this regard Asgard has fallen short.

Everything required to produce a deployment artifact, in this case an AMI, has never been addressed in Asgard. Consequently, many teams at Netflix constructed their own Continuous Delivery workflows. These workflows were typically related Jenkins jobs that tied together code check-ins with building and testing, then AMI creations and, finally, deployments via Asgard. This final step involved automation against Asgard’s REST API, which was never intended to be leveraged as a first class citizen.

Roughly a year ago a new project, dubbed Spinnaker, kicked off to enable end-to-end global Continuous Delivery at Netflix. The goals of this project were to create a Continuous Delivery platform that would:
  • enable repeatable automated deployments captured as flexible pipelines and configurable pipeline stages
  • provide a global view across all the environments that an application passes through in its deployment pipeline
  • offer programmatic configuration and execution via a consistent and reliable API
  • be easy to configure, maintain, and extend  
  • be operationally resilient  
  • provide the existing benefits of Asgard without a migration



What’s more, we wanted to leverage a few lessons learned from Asgard. One particular goal of this new platform is to facilitate innovation within its umbrella. The original Asgard model was difficult to extend so the community forked Asgard to provide alternative implementations. Since these changes weren’t merged back into Asgard, those innovations were lost to the wider community. Spinnaker aims to make it easier to extend and enhance cloud deployment models in a way that doesn't require forking. Whether the community desires additional cloud providers, different deployment artifacts or new stages in a Continuous Delivery pipeline, extensions to Spinnaker will be available to everyone in the community without the need to fork. 

We additionally wanted to create a platform that, while replacing Asgard, doesn’t exclude it. A big-bang migration process off Asgard would be out of the question for Netflix and for the community. Consequently, changes to cloud assets via Asgard are completely compatible with changes to those same assets via our new platform. And vice versa!

Finally, we deliberately chose not to reimplement everything in Asgard. Ultimately, Asgard took on too much undifferentiated heavy lifting from the AWS console. Consequently, for those features that are not directly related to cluster management, such as SNS, SQS, and RDS Management, Netflix users and the community are encouraged to use the AWS Console.

Our new platform only implements those Asgard-like features related to cluster management from the point of view of an application (and even a group of related applications: a project). This application context allows you to work with a particular application’s related clusters, ASGs, instances, Security Groups, and ELBs, in all the AWS accounts in which the application is deployed.

Today, we have both systems running side by side with the vast majority of all deployments leveraging our new platform. Nevertheless, we’re not completely done with gaining the feature parity we desire with Asgard. That gap is closing rapidly and in the near future we will be sunsetting various Asgard instances running in our infrastructure. At this point, Netflix engineers aren’t committing code to Asgard’s GitHub repository; nevertheless, we happily encourage the OSS community’s active participation in Asgard going forward.

Asgard served Netflix well for quite a long time. We learned numerous lessons along our journey and are ready to focus on the future with a new platform that makes Continuous Delivery a first-class citizen at Netflix and elsewhere. We plan to share this platform, Spinnaker, with the Open Source Community in the coming months.

-Delivery Engineering Team at Netflix 

Monday, September 28, 2015

Creating Your Own EC2 Spot Market

by: Andrew Park, Darrell Denlinger, & Coburn Watson

Netflix prioritizes innovation and reliability above efficiency, and as we continue to scale globally, finding opportunities that balance these three variables becomes increasingly difficult. However, every so often there is a process or application that can shift the curve out on all three factors; for Netflix this process was incorporating hybrid autoscaling engines for our services via Scryer & Amazon Auto Scaling.


Currently over 15% of our EC2 footprint autoscales, and the majority of this usage is covered by reserved instances as we value the pricing and capacity benefits. The combination of these two factors has created an “internal spot market” that has a daily peak of over 12,000 unused instances. We have been steadily working on building an automated system that allows us to effectively utilize these troughs.


Creating the internal spot capacity is straightforward: implement auto scaling and purchase reserved instances. In this post we’ll focus on how to leverage this trough given the complexities that stem from our large scale and decentralized microservice architecture. In the subsequent post, the Encoding team discusses the technical details in automating Netflix’s internal spot market and highlights some of the lessons learned.


How the internal spot began


The initial foray into large scale borrowing started in the Spring of 2015. A new algorithm for one of our personalization services ballooned their video ranking precompute cluster, expanding the size by 5x overnight. Their precompute cluster had an SLA to complete their daily jobs between midnight and 11am, leaving over 1,500 r3.4xlarges unused during the afternoon and evening.


Motivated by the inefficiencies, we actively searched for another service that had relatively interruptible jobs that could run during the off-hours. The Encoding team, which is responsible for converting the raw master video files into consumable formats for our device ecosystem, was the perfect candidate. The initial approach applied was a borrowing schedule based on historical availability, with scale-downs manually communicated between the Personalization, Encoding, and Cloud Capacity teams.


Preliminary Manual Borrowing


As the Encoding team continued to reap the benefits of the extra capacity, they became interested in borrowing from the various sizable troughs in other instance types. Because of a lack of real time data exposing the unused capacity between our accounts, we embarked on a multi-team effort to create the necessary tooling and processes to allow borrowing to occur on a larger, more automated scale.


Current Automated Borrowing


Borrowing considerations


The first requirement for automated borrowing is building out the telemetry exposing unused reservation counts. Given our autoscaling engines operate at a minute granularity, we could not leverage AWS’ billing file as our data source. Instead, the Engineering Tools team built an API inside our deployment platform that exposed real time unused reservations at the minute level. This unused calculation combined input data from our deployment tool, monitoring system, and AWS’ reservation system.
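As a minimal sketch of the unused calculation described above (not the actual API, which this post does not show), the per-minute figure can be thought of as reservations minus running instances for each instance type. The function name, the dictionary inputs, and the sample counts below are all hypothetical, and the sketch ignores details such as availability zones and accounts.

```python
def unused_reservations(reserved, running):
    """Return unused reserved-instance counts per instance type.

    reserved: dict mapping instance type -> number of reserved instances
    running:  dict mapping instance type -> instances running this minute
    """
    unused = {}
    for instance_type, reserved_count in reserved.items():
        in_use = running.get(instance_type, 0)
        # Usage above the reservation does not create negative spare capacity.
        unused[instance_type] = max(reserved_count - in_use, 0)
    return unused


# Hypothetical one-minute snapshot.
reserved = {"r3.4xlarge": 2000, "m3.2xlarge": 500}
running = {"r3.4xlarge": 500, "m3.2xlarge": 650}
print(unused_reservations(reserved, running))
```

Recomputing this every minute from fresh inputs is what makes the number usable alongside autoscaling engines that also act at minute granularity.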


The second requirement is finding batch jobs that are short in duration or interruptible in nature. Our batch Encoding jobs had a minimum duration SLA between five minutes and an hour, making them a perfect fit for our initial twelve hour borrowing window. An additional benefit is having jobs that are resource agnostic, allowing for more borrowing opportunities as our usage landscape creates various troughs by instance type.


The last requirement is for teams to absorb the telemetry data and to set appropriate rules for when to borrow instances. The main concern was whether or not this borrowing would jeopardize capacity for services in the critical path. We alleviated this issue by placing all of our borrowing into a separate account from our production account and leveraging the financial advantages of consolidated billing. Theoretically, a perfectly automated borrowing system would have the same operational and financial results regardless of account structure, but leveraging consolidated billing creates a capacity safety net.


Conclusion


In the ideal state, the internal spot market can be the most efficient platform for running short duration or interruptible jobs through instance level bin-packing. A series of small steps moved us in the right direction, such as:
  • Identifying preliminary test candidates for resource sharing
  • Creating shorter run-time jobs or modifying jobs to be more interruptible
  • Communicating broader messaging about resource sharing
In the next post of this series, the Encoding team talks through their use cases of the internal spot market, depicting the nuances of real time borrowing at such scale. Their team is actively working through this exciting efficiency problem and many others at Netflix; please check our Jobs site if you want to help us solve these challenges!

Friday, September 25, 2015

Chaos Engineering Upgraded

Several years ago we introduced a tool called Chaos Monkey. This service pseudo-randomly plucks a server from our production deployment on AWS and kills it. At the time we were met with incredulity and skepticism. Are we crazy? In production?!?

Our reasoning was sound, and the results bore that out. Since we knew that server failures are guaranteed to happen, we wanted those failures to happen during business hours when we were on hand to fix any fallout. We knew that we could rely on engineers to build resilient solutions if we gave them the context to *expect* servers to fail. If we could align our engineers to build services that survive a server failure as a matter of course, then when it accidentally happened it wouldn’t be a big deal. In fact, our members wouldn’t even notice. This proved to be the case.

Chaos Kong

Building on the success of Chaos Monkey, we looked at an extreme case of infrastructure failure. We built Chaos Kong, which doesn’t just kill a server. It kills an entire AWS Region [1].

It is very rare that an AWS Region becomes unavailable, but it does happen. This past Sunday (September 20th, 2015) Amazon’s DynamoDB service experienced an availability issue in their US-EAST-1 Region. That instability caused more than 20 additional AWS services that are dependent on DynamoDB to fail. Some of the Internet’s biggest sites and applications were intermittently unavailable during a six- to eight-hour window that day.


Netflix did experience a brief availability blip in the affected Region, but we sidestepped any significant impact because Chaos Kong exercises prepare us for incidents like this. By running experiments on a regular basis that simulate a Regional outage, we were able to identify any systemic weaknesses early and fix them. When US-EAST-1 actually became unavailable, our system was already strong enough to handle a traffic failover.

Below is a chart of our video play metrics during a Chaos Kong exercise. These are three views of the same eight hour window. The top view shows the aggregate metric, while the bottom two show the same metric for the west region and the east region, respectively.

Chaos Kong exercise in progress

In the bottom row, you can clearly see traffic evacuate from the west region. The east region gets a corresponding bump in traffic as it steps up to play the role of savior. During the exercise, most of our attention stays focused on the top row. As long as the aggregate metric follows that relatively smooth trend, we know that our system is resilient to the failover. At the end of the exercise, you see traffic revert to the west region, and the aggregate view shows that our members did not experience any adverse effects. We run Chaos Kong exercises like this on a regular basis, and it gives us confidence that even if an entire region goes down, we can still serve our customers.

ADVANCING THE MODEL

We looked around to see what other engineering practices could benefit from these types of exercises, and we noticed that Chaos meant different things to different people. In order to carry the practice forward, we need a best-practice definition, a model that we can apply across different projects and different departments to make our services more resilient.

We want to capture the value of these exercises in a methodology that we can use to improve our systems and push the state of the art forward. At Netflix we have an extremely complex distributed system (microservice architecture) with hundreds of deploys every day. We don’t want to remove the complexity of the system; we want to thrive on it. We want to continue to accelerate flexibility and rapid development. And with that complexity, flexibility, and rapidity, we still need to have confidence in the resiliency of our system.

To have our cake and eat it too, we set out to develop a new discipline around Chaos. We developed an empirical, systems-based approach which addresses the chaos inherent in distributed systems at scale. This approach specifically builds confidence in the ability of those systems to withstand realistic conditions. We learn about the behavior of a distributed system by observing it in a controlled experiment, and we use those learnings to fortify our systems before any systemic effect can disrupt the quality service that we provide our customers. We call this new discipline Chaos Engineering.

We have published the Principles of Chaos Engineering as a living document, so that other organizations can contribute to the concepts that we outline here.

CHAOS EXPERIMENT

We put these principles into practice. At Netflix we have a microservice architecture. One of our services is called Subscriber, which handles certain user management activities and authentication. It is possible that under some rare or even unknown situation Subscriber will be crippled. This might be due to network errors, under-provisioning of resources, or even events in downstream services upon which Subscriber depends. When you have a distributed system at scale, sometimes bad things just happen that are outside any person’s control. We want confidence that our service is resilient to situations like this.

We have a steady-state definition: Our metric of interest is customer engagement, which we measure as the number of video plays that start each second. In some experiments we also look at load average and error rate on an upstream service (API). The lines that those metrics draw over time are predictable, and provide a good proxy for the steady-state of the system. We have a hypothesis: We will see no significant impact on our customer engagement over short periods of time on the order of an hour, even when Subscriber is in a degraded state. We have variables: We add latency of 30ms first to 20% then to 50% of traffic from Subscriber to its primary cache. This simulates a situation in which the Subscriber cache is over-stressed and performing poorly. Cache misses increase, which in turn increases load on other parts of the Subscriber service. Then we look for a statistically significant deviation between the variable group and the control group with respect to the system’s steady-state level of customer engagement.
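The comparison between the control group and the variable group can be illustrated with a small sketch. This is not the analysis pipeline used at Netflix, which the post does not describe; it assumes the steady-state metric is sampled as video play starts per second for each group, uses a large-sample normal approximation to compare the two means, and picks a 0.05 significance threshold. The function name, the threshold, and the sample data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def deviation_p_value(control, variable):
    """Two-sided p-value for a difference in means (normal approximation)."""
    standard_error = sqrt(
        variance(control) / len(control) + variance(variable) / len(variable)
    )
    z = (mean(variable) - mean(control)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


ALPHA = 0.05  # assumed significance threshold

# Hypothetical samples of play starts per second.
control = [100 + (i % 7) for i in range(700)]
same = [100 + ((i + 3) % 7) for i in range(700)]   # same distribution as control
degraded = [97 + (i % 7) for i in range(700)]      # about three fewer starts per second

for name, group in (("same", same), ("degraded", degraded)):
    deviates = deviation_p_value(control, group) < ALPHA
    print(name, "deviates from steady state" if deviates else "no significant deviation")
```

A p-value below the threshold would count as a deviation from steady state and therefore as evidence against the hypothesis.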

If we find a deviation from steady-state in our variable group, then we have disproved our hypothesis. That would cause us to revisit the fallbacks and dependency configuration for Subscriber. We would undertake a concerted effort to improve the resiliency story around Subscriber and the services that it touches, so that customers can count on our service even when Subscriber is in a degraded state.

If we don’t find any deviation in our variable group, then we feel more confident in our hypothesis. That translates to having more confidence in our service as a whole.

In this specific case, we did see a deviation from steady-state when 30ms latency was added to 50% of the traffic going to this service. We identified a number of steps that we could take, such as decreasing the thread pool count in an upstream service, and subsequent experiments have confirmed the bolstered resiliency of Subscriber.

CONCLUSION

We started Chaos Monkey to build confidence in our highly complex system. We don’t have to simplify or even understand the system to see that over time Chaos Monkey makes the system more resilient. By purposefully introducing realistic production conditions into a controlled run, we can uncover weaknesses before they cause bigger problems. Chaos Engineering makes our system stronger, and gives us the confidence to move quickly in a very complex system.

STAY TUNED

This blog post is part of a series. In the next post on Chaos Engineering, we will take a deeper dive into the Principles of Chaos Engineering and hypothesis building with additional examples from our production experiments. If you have thoughts on Chaos Engineering or how to advance the state of the art in this field, we’d love to hear from you. Feel free to reach out to chaos@netflix.com.

-Chaos Team at Netflix
Ali Basiri, Lorin Hochstein, Abhijit Thosar, Casey Rosenthal



1. Technically, it only simulates killing an AWS Region. For our purposes, simulating this giant infrastructure failure is sufficient, and AWS doesn’t yet provide us with a way of turning off an entire region. ;-)

Thursday, September 24, 2015

John Carmack on Developing the Netflix App for Oculus

Hi, this is Anthony Park, VP of Engineering at Netflix. We've been working with Oculus to develop a Netflix app for Samsung Gear VR. The app includes a Netflix Living Room, allowing members to get the Netflix experience from the comfort of a virtual couch, wherever they bring their Gear VR headset. It's available to Oculus users today. We've been working closely with John Carmack, CTO of Oculus and programmer extraordinaire, to bring our TV user interface to the Gear VR headset. Well, honestly, John did most of the development himself(!), so I've asked him to be a guest blogger today and share his experience with implementing the new app. Here's a sneak peek at the experience, and I'll let John take it from here...


Netflix Living Room on Gear VR



The Netflix Living Room

Despite all the talk of hardcore gamers and abstract metaverses, a lot of people want to watch movies and shows in virtual reality. In fact, during the development of Gear VR, Samsung internally referred to it as the HMT, for "Head Mounted Theater." Current VR headsets can't match a high end real world home theater, but in many conditions the "best seat in the house" may be in the Gear VR that you pull out of your backpack.

Some of us from Oculus had a meeting at Netflix HQ last month, and when things seemed to be going well, I blurted out "Grab an engineer, let's do this tomorrow!"

That was a little bit optimistic, but when Vijay Gondi and Anthony Park came down from Netflix to Dallas the following week, we did get the UI running in VR on the second day, and video playing shortly thereafter.

The plan of attack was to take the Netflix TV codebase and present it on a virtual TV screen in VR. Ideally, the Netflix code would be getting events and drawing surfaces, not even really aware that it wasn't showing up on a normal 2D screen.

I wrote a "VR 2D Shell" application that functioned like a very simplified version of our Oculus Cinema application; the big screen is rendered with our peak-quality TimeWarp layer support, and the environment gets a neat dynamic lighting effect based on the screen contents. Anything we could get into a texture could be put on the screen.

The core Netflix application uses two Android Surfaces – one for the user interface layer, and one for the decoded video layer. To present these in VR I needed to be able to reference them as OpenGL textures, so the process was: create an OpenGL texture ID, use that to initialize a SurfaceTexture object, then use that to initialize a Surface object that could be passed to Netflix.

For the UI surface, this worked great -- when the Netflix code does a swapbuffers, the VR code can have the SurfaceTexture do an update, which will latch the latest image into an EGL external image, which can then be texture mapped onto geometry by the GPU.

The video surface was a little more problematic. To provide smooth playback, the video frames are queued a half second ahead, tagged with a "release time" that the Android window compositor will use to pick the best frame each update. The SurfaceTexture interface that I could access as a normal user program only had an "Update" method that always returned the very latest frame submitted. This meant that the video came out a half second ahead of the audio, and stuttered a lot.

To fix this, I had to make a small change in the Netflix video decoding system so it would call out to my VR code right after it submitted each frame, letting me know that it had submitted something with a particular release time. I could then immediately update the surface texture and copy it out to my own frame queue, storing the release time with it. This is an unfortunate waste of memory, since I am duplicating over a dozen video frames that are also being buffered on the surface, but it gives me the timing control I need.
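The frame queue idea can be sketched in a few lines of Python. This is only an illustration of the technique, not the code in the app: the class and method names are hypothetical, frames are stand-in strings rather than copied textures, and release times are assumed to arrive in increasing order.

```python
from collections import deque


class FrameQueue:
    """Holds copied video frames together with their release times."""

    def __init__(self):
        self._frames = deque()  # (release_time, frame) in submission order
        self._current = None    # frame currently being displayed

    def submit(self, release_time, frame):
        # Called right after the decoder submits a frame.
        self._frames.append((release_time, frame))

    def frame_for(self, now):
        """Return the newest frame whose release time has arrived."""
        while self._frames and self._frames[0][0] <= now:
            self._current = self._frames.popleft()[1]
        return self._current


queue = FrameQueue()
for release_ms, frame in ((500, "f0"), (542, "f1"), (583, "f2")):
    queue.submit(release_ms, frame)

for now_ms in (100, 550, 560, 600):
    print(now_ms, queue.frame_for(now_ms))
```

Picking frames by release time rather than always taking the latest submission is what keeps the picture from running ahead of the audio.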

Initially input was handled with a Bluetooth joypad emulating the LRUD / OK buttons of a remote control, but it was important to be able to control it using just the touchpad on the side of Gear VR. Our preferred VR interface is "gaze and tap", where a cursor floats in front of you in VR, and tapping is like clicking a mouse. For most things, this is better than gamepad control, but not as good as a real mouse, especially if you have to move your head significant amounts. Netflix has support for cursors, but there is the assumption that you can turn it on and off, which we don't really have.

We wound up with some heuristics driving the behavior. I auto-hide the cursor when the movie starts playing, inhibit cursor updates briefly after swipes, and send actions on touch up instead of touch down so you can perform swipes without also triggering touches. It isn't perfect, but it works pretty well.


Layering of the Android Surfaces within the Netflix Living Room



Display

The screens on the Gear VR supported phones are all 2560x1440 resolution, which is split in half to give each eye a 1280x1440 view that covers approximately 90 degrees of your field of view. If you have tried previous Oculus headsets, that is more than twice the pixel density of DK2, and four times the pixel density of DK1. That sounds like a pretty good resolution for videos until you consider that very few people want a TV screen to occupy a 90 degree field of view. Even quite large screens are usually placed far enough away to be about half of that in real life.

The optics in the headset that magnify the image and allow your eyes to focus on it introduce both a significant spatial distortion and chromatic aberration that need to be corrected. The distortion compresses the pixels together in the center and stretches them out towards the outside, which has the positive effect of giving a somewhat higher effective resolution in the middle where you tend to be looking, but it also means that there is no perfect resolution for content to be presented in. If you size it for the middle, it will need mip maps and waste pixels on the outside. If you size it for the outside, it will be stretched over multiple pixels in the center.

For synthetic environments on mobile, we usually size our 3D renderings close to the outer range, about 1024x1024 pixels per eye, and let it be a little blurrier in the middle, because we care a lot about performance. On high end PC systems, even though the actual headset displays are lower resolution than Gear VR, sometimes higher resolution scenes are rendered to extract the maximum value from the display in the middle, even if the majority of the pixels wind up being blended together in a mip map for display.

The Netflix UI is built around a 1280x720 resolution image. If that were rendered to a giant virtual TV covering 60 degrees of your field of view in the 1024x1024 eye buffer, you would have a very poor quality image as you would only be seeing a quarter of the pixels. If you had mip maps it would be a blurry mess, otherwise all the text would be aliased, fizzing in and out as your head made tiny movements each frame.
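The "quarter of the pixels" figure can be checked with some back-of-the-envelope arithmetic. The sketch below uses the numbers from the text (a 1024x1024 eye buffer over roughly 90 degrees, a 1280x720 UI, a 60 degree wide virtual screen) and assumes, as a simplification, that pixels are spread evenly across the field of view, which ignores the projection and lens distortion. The function name is hypothetical.

```python
def visible_fraction(eye_buffer_px=1024, eye_fov_deg=90.0,
                     screen_fov_deg=60.0, ui_width=1280, ui_height=720):
    """Fraction of the UI's pixels that survive in the eye buffer."""
    pixels_per_degree = eye_buffer_px / eye_fov_deg
    covered_width = screen_fov_deg * pixels_per_degree
    covered_height = covered_width * ui_height / ui_width  # keep the 16:9 shape
    return (covered_width * covered_height) / (ui_width * ui_height)


print(f"{visible_fraction():.0%} of the UI pixels")
```

Under these assumptions the virtual screen covers about 683x384 eye buffer pixels, or roughly 28% of the 1280x720 source, which is in line with the estimate above.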

The technique we use to get around this is to have special code for just the screen part of the view that can directly sample a single textured rectangle after the necessary distortion calculations have been done, and blend that with the conventional eye buffers. These are our "Time Warp Layers". This has limited flexibility, but it gives us the best possible quality for virtual screens (and also the panoramic cube maps in Oculus 360 Photos). If you have a joypad bound to the phone, you can toggle this feature on and off by pressing the start button. It makes an enormous difference for the UI, and is a solid improvement for the video content.

Still, it is drawing a 1280 pixel wide UI over maybe 900 pixels on the screen, so something has to give. Because of the nature of the distortion, the middle of the screen winds up stretching the image slightly, and you can discern every single pixel in the UI. As you get towards the outer edges, and especially the corners, more and more of the UI pixels get blended together. Some of the Netflix UI layout is a little unfortunate for this; small text in the corners is definitely harder to read.

So forget 4K, or even full-HD. 720p HD is the highest resolution video you should even consider playing in a VR headset today.

This is where content protection comes into the picture. Most studios insist that HD content only be played in a secure execution environment to reduce opportunities for piracy. Modern Android systems' video CODECs can decode into special memory buffers that literally can't be read by anything other than the video screen scanning hardware; untrusted software running on the CPU and GPU has no ability to snoop into the buffer and steal the images. This happens at the hardware level, and is much more difficult to circumvent than software protections.

The problem for us is that to draw a virtual TV screen in VR, the GPU fundamentally needs to be able to read the movie surface as a texture. On some of the more recent phone models we have extensions to allow us to move the entire GPU framebuffer into protected memory and then get the ability to read a protected texture, but because we can't write anywhere else, we can't generate mip maps for it. We could get the higher resolution for the center of the screen, but then the periphery would be aliasing, and we lose the dynamic environment lighting effect, which is based on building a mip map of the screen down to 1x1. To top it all off, the user timing queue to get the audio synced up wouldn't be possible.

The reasonable thing to do was just limit the streams to SD resolution – 720x480. That is slightly lower than I would have chosen if the need for a secure execution environment weren't an issue, but not too much. Even at that resolution, the extreme corners are doing a little bit of pixel blending.


Flow diagram for SD video frames to allow composition with VR

In an ideal world, the bitrate / resolution tradeoff would be made slightly differently for VR. On a retina class display, many compression artifacts aren't really visible, but the highly magnified pixels in VR put them much more in your face. There is a hard limit to how much resolution is useful, but every visible compression artifact is correctable with more bitrate.




Power Consumption

For a movie viewing application, power consumption is a much bigger factor than for a short action game. My target was to be able to watch a two hour movie in VR starting at 70% battery. We hit this after quite a bit of optimization, but the VR app still draws over twice as much power as the standard Netflix Android app.

When a modern Android system is playing video, the application is only shuffling the highly compressed video data from the network to the hardware video CODEC, which decompresses it to private buffers, which are then read by the hardware composer block that performs YUV conversion and scaling directly as it feeds it to the display, without ever writing intermediate values to a framebuffer. The GPU may even be completely powered off. This is pretty marvelous – it wasn't too long ago when a PC might use 100x the power to do it all in software.

For VR, in addition to all the work that the standard application is doing, we are rendering stereo 3D scenes with tens of thousands of triangles and many megabytes of textures in each one, and then doing an additional rendering pass to correct for the distortion of the optics.

When I first brought up the system in the most straightforward way with the UI and video layers composited together every frame, the phone overheated to the thermal limit in less than 20 minutes. It was then a process of finding out what work could be avoided with minimal loss in quality.

The bulk of a viewing experience should be pure video. In that case, we only need to mip-map and display a 720x480 image, instead of composing it with the 1280x720 UI. There were no convenient hooks in the Netflix codebase to say when the UI surface was completely transparent, so I read back the bottom 1x1 pixel mip map from the previous frame's UI composition and look at the alpha channel: 0 means the UI was completely transparent, and the movie surface can be drawn by itself. 255 means the UI is solid, and the movie can be ignored. Anything in between means they need to be composited together. This gives the somewhat surprising result that subtitles cause a noticeable increase in power consumption.
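The three-way decision on the alpha value can be written down as a small function. This is a sketch of the logic as described, not the app's code: the function name and the returned labels are hypothetical, and the alpha value is assumed to be an 8-bit integer read back from the 1x1 mip level.

```python
def composition_mode(ui_alpha):
    """Choose what to draw based on the UI's 1x1 mip alpha (0-255)."""
    if ui_alpha == 0:
        return "video_only"   # UI completely transparent
    if ui_alpha == 255:
        return "ui_only"      # UI solid, the movie can be ignored
    return "composite"        # anything in between, e.g. subtitles over video


for alpha in (0, 12, 255):
    print(alpha, composition_mode(alpha))
```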

I had initially implemented the VR gaze cursor by drawing it into the UI composition surface, which was a useful check on my intersection calculations, but it meant that the UI composition had to happen every single frame, even when the UI was completely static. Moving the gaze cursor back to its own 3D geometry allowed the screen to continue reusing the previous composition when nothing was changing, which is usually more than half of the frames when browsing content.

One of the big features of our VR system is the "Asynchronous Time Warp", where redrawing the screen and distortion correcting in response to your head movement is decoupled from the application's drawing of the 3D world. Ideally, the app draws 60 stereo eye views a second in sync with Time Warp, but if the app fails to deliver a new set of frames then Time Warp will reuse the most recent one it has, re-projecting it based on the latest head tracking information. For looking around in a static environment, this works remarkably well, but it starts to show the limitations when you have smoothly animating objects in view, or your viewpoint moves sideways in front of a surface.

Because the video content is 30 or 24 fps and there is no VR viewpoint movement, I cut the scene update rate to 30 fps during movie playback for a substantial power savings. The screen is still redrawn at 60 fps, so it doesn't feel any choppier when you look around. I go back to 60 fps when the lights come up, because the gaze cursor and UI scrolling animations look significantly worse at 30 fps.
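The split between scene updates and screen redraws can be pictured with a small sketch. It assumes a 60 Hz display where Time Warp runs on every vsync regardless; `should_render_scene` is a hypothetical helper, not part of any Oculus or Netflix API.

```python
def should_render_scene(vsync_index: int, movie_playing: bool) -> bool:
    """Return True when the app should draw a new pair of eye views on this vsync.

    Assumption: during playback the scene is redrawn on every other vsync (30 fps),
    while the display itself keeps refreshing at 60 fps via Time Warp.
    """
    if movie_playing:
        return vsync_index % 2 == 0
    return True  # lights up: gaze cursor and UI animations get the full 60 fps
```

Over one second of playback this renders 30 scene frames instead of 60, while head tracking still affects every displayed frame.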

If you really don't care about the VR environment, you can go into a "void theater", where everything is black except the video screen, which obviously saves additional power. You could even go all the way to a face-locked screen with no distortion correction, which would be essentially the same power draw as the normal Netflix application, but it would be ugly and uncomfortable.




A year ago, I had a short list of the top things that I felt Gear VR needed to be successful. One of them was Netflix. It was very rewarding to be able to do this work right before Oculus Connect and make it available to all of our users in such a short timeframe. Plus, I got to watch the entire season of Daredevil from the comfort of my virtual couch. Because testing, of course.

-John

Tuesday, September 22, 2015

Announcing Electric Eye

By: Michael Russell

Netflix ships on a wide variety of devices, ranging from small thumbdrive-sized HDMI dongles to ultra-massive 100”+ curved screen HDTVs, and the wide variety of form factors leads to some interesting challenges in testing. In this post, we’re going to describe the genesis and evolution of Electric Eye, an automated computer vision and audio testing framework created to help test Netflix on all of these devices.

Let’s start with the Twenty-First Century Communications and Video Accessibility Act of 2010, or CVAA for short. Netflix creates closed caption files for all of our original programming, like Marvel’s Daredevil, Orange is the New Black, and House of Cards, and we serve closed captions for any content that we have captions for. Closed captions are sent to devices as Timed Text Markup Language (TTML), and describe what the captions should say, when and where they should appear, and when they should disappear, amongst other things. The code to display captions on devices is a combination of JavaScript served by our servers and native code on the devices. This led to an interesting question: How can we make sure that captions are showing up completely and on time? We were having humans do the work, but occasionally humans make mistakes. Given that the CVAA is the law of the land, we wanted a relatively error-proof way of ensuring compliance.

If we only ran on devices with HDMI-out, we might be able to use something like stb-tester to do the work. However, we run on a wide variety of television sets, not all of which have HDMI-out. Factor in curved screens and odd aspect ratios, and it was starting to seem like there may not be a way to do this reliably for every device. However, one of the first rules of software is that you shouldn’t let your quest for perfection get in the way of making an incremental step forward.

We decided that we’d build a prototype using OpenCV to try to handle flat-screen televisions first, and broke the problem up into two different subproblems: obtaining a testable frame from the television, and extracting the captions from the frame for comparison. To ensure our prototype didn’t cost a lot of money, we picked up a few cheap 1080p webcams from a local electronics store.

OpenCV has functionality built in to detect a checkerboard pattern on a flat surface and generate a perspective-correction matrix, as well as code to warp an image based on the matrix, which made frame acquisition extremely easy. It wasn’t very fast (manually creating a lookup table using the perspective-correction matrix for use with remap improves the speed significantly), but this was a proof of concept. Optimization could come later.
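The lookup-table idea mentioned in the parenthetical can be sketched in plain Python. This is an illustration under assumptions, not the Electric Eye code: `build_remap_table` is a hypothetical helper, and `inv_h` is assumed to be the 3x3 matrix that maps corrected output coordinates back into the camera image. In OpenCV's Python bindings the equivalent tables would be stored as arrays and handed to `cv2.remap`, instead of calling `cv2.warpPerspective` on every frame.

```python
from typing import List, Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]


def build_remap_table(inv_h: Matrix3, width: int, height: int) -> List[List[Tuple[float, float]]]:
    """For each output pixel (x, y), precompute the source coordinate to sample.

    The table depends only on the matrix and the output size, so it can be
    computed once and reused for every captured frame.
    """
    table = []
    for y in range(height):
        row = []
        for x in range(width):
            # Homogeneous transform: [u, v, w]^T = inv_h * [x, y, 1]^T
            u = inv_h[0][0] * x + inv_h[0][1] * y + inv_h[0][2]
            v = inv_h[1][0] * x + inv_h[1][1] * y + inv_h[1][2]
            w = inv_h[2][0] * x + inv_h[2][1] * y + inv_h[2][2]
            row.append((u / w, v / w))  # perspective divide
        table.append(row)
    return table
```

The per-frame cost then drops to a table lookup plus interpolation, which is where the speedup would come from.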

The second step was a bit tricky. Television screens are emissive, meaning that they emit light. This causes blurring, ghosting, and other issues when they are being recorded with a camera. In addition, we couldn’t just have the captions on a black screen since decoding video could potentially cause enough strain on a device to cause captions to be delayed or dropped. Since we wanted a true torture test, we grabbed video of running water (one of the most strenuous patterns to play back due to its unpredictable nature), reduced its brightness by 50%, and overlaid captions on top of it. We’d bake “gold truth” captions into the upper part of the screen, show the results from parsed and displayed TTML in the bottom, and look for differences.

When we tested using HDMI capture, we could apply a thresholding algorithm to the frame and get the captions out easily.
Images showing captured and marked up frames
The frame on the left is what we got from HDMI capture after using thresholding.  We could then mark up the original frame received and send that to testers.
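A minimal sketch of that thresholding-and-compare step, in plain Python rather than OpenCV. It assumes bright caption text on the dimmed video, a grayscale frame given as rows of 0..255 values, and a layout in which the reference captions in the top half line up pixel-for-pixel with the rendered captions in the bottom half. The cutoff of 200 and both function names are illustrative choices, not values from the project.

```python
from typing import List

Frame = List[List[int]]  # grayscale rows, each pixel 0..255


def threshold(frame: Frame, cutoff: int = 200) -> List[List[int]]:
    """Binary threshold: 1 where a pixel is brighter than cutoff, else 0."""
    return [[1 if px > cutoff else 0 for px in row] for row in frame]


def caption_mismatch(frame: Frame, cutoff: int = 200) -> int:
    """Count pixels where the top-half (reference) and bottom-half (rendered) masks differ."""
    mask = threshold(frame, cutoff)
    half = len(mask) // 2
    top, bottom = mask[:half], mask[half:half * 2]
    return sum(a != b for r1, r2 in zip(top, bottom) for a, b in zip(r1, r2))
```

A count of zero means the two caption regions agree; any persistent nonzero count over several frames would be a candidate defect to show a tester.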
When we worked with the result from the webcam, things weren’t as nice.
Image showing excessive glare and spotting on a trivially thresholded webcam image
Raw thresholding didn't work as well.
Glare from ceiling lights led to unique issues, and even though the content was relatively high contrast, the emissive nature of the screen caused the water to splash through the captions.

While all of the issues that we found with the prototype were a bit daunting, they were eventually solved through a combination of environmental corrections (diffuse lighting handled most of the glare issues) and traditional OpenCV image cleanup techniques, and it proved that we could use CV to help test Netflix. The prototype was eventually able to reliably detect deltas of as little as 66ms, and it showed enough promise to let us create a second prototype, but also led to us adopting some new requirements.

First, we needed to be real-time on a reasonable machine. With our unoptimized code using the UI framework in OpenCV, we were getting ~20fps on a mid-2014 MacBook Pro, but we wanted to get 30fps reliably. Second, we needed to be able to process audio to enable new types of tests. Finally, we needed to be cross-platform. OpenCV works on Windows, Mac, and Linux, but its video capture interface doesn’t expose audio data.

For prototype #2, we decided to switch over to using a creative coding framework named Cinder. Cinder is a C++ library best known for its use by advertisers, but it has OpenCV bindings available as a “CinderBlock” as well as a full audio DSP library. It works on Windows and Mac, and work is underway on a Linux fork. We also chose a new test case to prototype: A/V sync. Getting camera audio and video together using Cinder is fairly easy to do if you follow the tutorials on the Cinder site.

The content for this test already existed on Netflix: Test Patterns. These test patterns were created specifically for Netflix by Archimedia to help us test for audio and video issues. On the English 2.0 track, a 1250Hz tone starts playing 400ms before the ball hits the bottom, and once there, the sound transitions over to a 200ms-long 1000Hz tone. The highlighted areas on the circle line up with when these tones should play. This pattern repeats every six seconds.

For the test to work, we needed to be able to tell what sound was playing. Cinder provides a MonitorSpectralNode class that lets us figure out dominant tones with a little work. With that, we could grab each frame as it came in, detect when the dominant tone changed from 1250Hz to 1000Hz, display the last frame that we got from the camera, and *poof* a simple A/V sync test.
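To make the tone-detection step concrete, here is a sketch that swaps in the Goertzel algorithm for Cinder's spectral node. It is a different technique from the one the project used, chosen because it needs only the standard library. All function names are hypothetical, and the block size and 48kHz sample rate in the usage note are assumptions.

```python
import math
from typing import List, Optional, Sequence


def goertzel_power(samples: Sequence[float], sample_rate: int, freq: float) -> float:
    """Signal power at one target frequency, computed with the Goertzel algorithm."""
    n = len(samples)
    k = round(n * freq / sample_rate)  # nearest DFT bin for the target frequency
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2


def dominant_tone(samples: Sequence[float], sample_rate: int,
                  candidates: Sequence[float] = (1250.0, 1000.0)) -> float:
    """Return whichever candidate frequency carries the most power in this block.

    A real detector would also need a silence threshold; this sketch omits it.
    """
    return max(candidates, key=lambda f: goertzel_power(samples, sample_rate, f))


def transition_index(block_tones: Sequence[float]) -> Optional[int]:
    """Index of the first block where the tone switches from 1250Hz to 1000Hz."""
    for i in range(1, len(block_tones)):
        if block_tones[i - 1] == 1250.0 and block_tones[i] == 1000.0:
            return i
    return None


def sine_block(freq: float, n: int = 960, sample_rate: int = 48000) -> List[float]:
    """Synthetic test tone, used only to exercise the functions above."""
    return [math.sin(2.0 * math.pi * freq * i / sample_rate) for i in range(n)]
```

Feeding consecutive 20ms blocks (960 samples at 48kHz) through `dominant_tone` and then `transition_index` gives the block at which the camera frame should be captured.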
Perspective-corrected image showing ghosting of patterns.
The next step was getting it so that we could find the ball on the test pattern and automate the measurement process. You may notice that in this image, you can see three balls: one at 66ms, one at 100ms, and one at 133ms. This is a result of a few factors: the emissive nature of the display, the camera being slightly out of sync with the TV, and pixel response time.

Through judicious use of image processing, histogram equalization, and thresholding, we were able to get to the point where we could detect the proper ball in the frame and use basic trigonometry to start generating numbers. We only had ~33ms of precision and +/-33ms of accuracy per measurement, but with sufficient sample sizes, the data followed a bell curve around what we felt we could report as an aggregate latency number for a device.
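The aggregation step can be illustrated with a short sketch. It assumes each measurement is an offset in milliseconds quantized to roughly one camera frame, and that measurement errors are independent, so the standard error of the mean shrinks as samples accumulate. `LatencyEstimate` and `aggregate_latency` are hypothetical names.

```python
import math
import statistics
from typing import NamedTuple, Sequence


class LatencyEstimate(NamedTuple):
    mean_ms: float    # aggregate latency to report
    stdev_ms: float   # spread of individual measurements
    stderr_ms: float  # uncertainty of the mean itself


def aggregate_latency(samples_ms: Sequence[float]) -> LatencyEstimate:
    """Combine many coarse per-cycle measurements into one latency estimate."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two measurements")
    mean = statistics.mean(samples_ms)
    stdev = statistics.stdev(samples_ms)  # sample standard deviation
    return LatencyEstimate(mean, stdev, stdev / math.sqrt(len(samples_ms)))
```

With single measurements only good to about a frame, it is the shrinking `stderr_ms` that would justify reporting a finer-grained aggregate number.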
Test frame with location of orb highlighted and sample points overlaid atop the image.
Cinder isn’t perfect. We’ve encountered a lot of hardware issues for the audio pipeline because Cinder expects all parts of the pipeline to work at the same frequency. The default audio frequency on a MacBook Pro is 44.1kHz, unless you hook it up to a display via HDMI, where it changes to 48kHz. Not all webcams support both 44.1kHz and 48kHz natively, and when we can get device audio digitally, it should be (but isn’t always) 48kHz. We’ve got a workaround in place (forcing the output frequency to be the same as the selected input), and hope to have a more robust fix we can commit to the Cinder project around the time we release.

After five months of prototypes, we’re now working on version 1.0 of Electric Eye, and we’re planning on releasing the majority of the code as open source shortly after its completion. We’re adding extra tests, such as mixer latency and audio dropout detection, as well as looking at future applications like motion graphics testing, frame drop detection, frame tear detection, and more.

Our hope is that even if testers aren’t able to use Electric Eye in their work environments, they might be able to get ideas on how to more effectively utilize computer vision or audio processing in their tests to partially or fully automate defect detection, or at a minimum be motivated to try to find new and innovative ways to reduce subjectivity and manual effort in their testing.

[Update 10/23/2015: Fixed an outdated link to Cinder's tutorials.]

Monday, September 21, 2015

Introducing Lemur

by: Kevin Glisson, Jason Chan and Ben Hagen

Netflix is pleased to announce the open source release of our X.509 certificate orchestration framework: Lemur!

The Challenge of Certificate Management
Public Key Infrastructure is a set of hardware, software, people, policies, and procedures needed to create, manage, distribute, use, store, and revoke digital certificates and manage public-key encryption. PKI allows for secure communication by establishing chains of trust between two entities.


There are three main components to PKI that we are attempting to address:
  1. Public Certificate - A cryptographic document that proves the ownership of a public key, which can be used for signing, proving identity or encrypting data.
  2. Private Key - A cryptographic document that is used to decrypt data encrypted by a public key.
  3. Certificate Authorities (CAs) - Third-party or internal services that validate those they do business with. They provide confirmation that a client is talking to the server it thinks it is. Their public certificates are loaded into major operating systems and provide a basis of trust for others to build on.


The management of all the pieces needed for PKI can be a confusing and painful experience. Certificates have expiration dates: if they are allowed to expire without being replaced, communication can be interrupted, impacting a system’s availability. Private keys must never be exposed to any untrusted entities, since any loss of a private key can impact the confidentiality of communications. There is also increased complexity when creating certificates that support a diverse pool of browsers and devices. It is non-trivial to track which devices and browsers trust which certificate authorities.
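As a small illustration of the expiration problem, the following sketch flags certificates that are close to their expiration date. It assumes the notAfter timestamps have already been extracted into timezone-aware `datetime` values; `expiring_soon` and the 30-day window are hypothetical choices, not part of any particular tool.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def expiring_soon(not_after_by_name: Dict[str, datetime], now: datetime,
                  window_days: int = 30) -> List[str]:
    """Names of certificates whose notAfter falls within window_days of now.

    Certificates that have already expired are included as well.
    """
    deadline = now + timedelta(days=window_days)
    return sorted(name for name, not_after in not_after_by_name.items()
                  if not_after <= deadline)
```

Running a check like this on a schedule is one simple way to turn silent expirations into actionable warnings.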


On top of the management of these sensitive and important pieces of information, the tools used to create, manage, and interact with PKI have confusing or ambiguous options. This lack of usability can lead to mistakes and undermine the security of PKI.


For non-experts the experience of creating certificates can be an intimidating one.


Empowering the Developer

At Netflix developers are responsible for their entire application environment, and we are moving to an environment that requires the use of HTTPS for all web applications. This means developers often have to go through the process of certificate procurement and deployment for their services. Let’s take a look at what a typical procurement process might look like:
Here we see an example workflow that a developer might take when creating a new service that has TLS enabled.


There are quite a few steps to this process and much of it is typically handled by humans. Let’s enumerate them:
  1. Create Certificate Signing Request (CSR) - A CSR is a cryptographically signed request that has information such as State/Province, Location, Organization Name and other details about the entity requesting the certificate and what the certificate is for. Creating a CSR typically requires the developer to use OpenSSL commands to generate a private key and enter the correct information. The OpenSSL command line contains hundreds of options and significant flexibility. This flexibility can often intimidate developers or cause them to make mistakes that undermine the security of the certificate.


  2. Submit CSR - The developer then submits the CSR to a CA. Where to submit the CSR can be confusing. Most organizations have internal and external CAs. Internal CAs are used for inter-service or inter-node communication anywhere you have control of both sides of transmission and can thus control who to trust. External CAs are typically used when you don’t have control of both sides of a transmission. Think about your browser communicating with a banking website over HTTPS. It relies on the trust built by third parties (Symantec/Digicert, GeoTrust etc.) in order to ensure that we are talking to who we think we are. External CAs are used for the vast majority of Internet-facing websites.


  3. Approve CSR - Due to the sensitive and error-prone nature of the certificate request process, the choice is often made to inject an approval process into the workflow. In this case, a security engineer would review that a request is valid and correct before issuing the certificate.


  4. Deploy Certificate - Eventually the issued certificate needs to be placed on a server that will handle the request. It’s now up to the developer to ensure that the keys and server certificates are correctly placed and configured on the server and that the keys are kept in a safe location.


  5. Store Secrets - An optional, but important step is to ensure that secrets can be retrieved at a later date. If a server ever needs to be re-deployed these keys will be needed in order to re-use the issued certificate.
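To give a feel for the "Create Certificate Signing Request" step, here is a sketch that assembles a typical `openssl req` invocation with conservative defaults. The helper `build_csr_command` and its policy choices (requiring a CN, refusing RSA keys under 2048 bits) are illustrative assumptions; it only builds the argument list and does not run anything.

```python
from typing import Dict, List

# Conventional ordering of distinguished-name fields in the -subj string.
SUBJECT_ORDER = ["C", "ST", "L", "O", "OU", "CN"]


def build_csr_command(subject: Dict[str, str], key_path: str, csr_path: str,
                      key_bits: int = 2048) -> List[str]:
    """Build an `openssl req` argument list that creates a new key and a CSR."""
    if key_bits < 2048:
        raise ValueError("refusing to generate an RSA key shorter than 2048 bits")
    if "CN" not in subject:
        raise ValueError("subject must include a common name (CN)")
    subj = "".join(f"/{field}={subject[field]}"
                   for field in SUBJECT_ORDER if field in subject)
    return [
        "openssl", "req", "-new",
        "-newkey", f"rsa:{key_bits}",
        "-nodes",  # leaves the private key unencrypted on disk: handle with care
        "-sha256",
        "-keyout", key_path,
        "-out", csr_path,
        "-subj", subj,
    ]
```

Even this short command carries several decisions (key size, digest, key encryption, subject layout) that a developer would otherwise have to get right by hand.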


Each of these steps has the developer moving through various systems and interfaces, potentially copying and pasting sensitive key material from one system to another. This kind of information spread can lead to situations where a developer might not correctly clean up the private keys they have generated or accidentally expose the information, which could put their whole service at risk. Ideally a developer would never have to handle key material at all.


Toward Better Certificate Management

Certificate management is not a new challenge: tools like EJBCA, OpenCA, and more recently Let’s Encrypt are all helping to make certificate management easier. When setting out to make certificate management better we had two main goals: First, increase the usability and convenience of procuring a certificate in such a way that would not be intimidating to users. Second, harden the procurement process by generating high strength keys and handling them with care.


Meet Lemur!


Lemur

Lemur is a certificate management framework that acts as a broker between certificate authorities and internal deployment and management tools. This allows us to build in defaults and templates for the most common use cases, reduce the need for a developer to be exposed to sensitive key material, and provide a centralized location from which to manage and monitor all aspects of the certificate lifecycle.


We will use the following terminology throughout the rest of the discussion:


  • Issuers are internal or third-party certificate authorities.
  • Destinations are deployment targets; for TLS these would be the servers terminating web requests.
  • Sources are any certificate store; these can include third-party sources such as AWS, GAE, or even source code.
  • Notifications are ways for a subscriber to be notified about a change with their certificate.
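One way to picture how these four roles fit around a broker is the sketch below. It is not Lemur's actual plugin API: every class and function name here is hypothetical, and the in-memory classes exist only to make the example runnable.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Certificate:
    common_name: str
    body: str  # PEM text in a real system; a placeholder string here


class Issuer(ABC):
    @abstractmethod
    def issue(self, common_name: str) -> Certificate: ...


class Destination(ABC):
    @abstractmethod
    def upload(self, cert: Certificate) -> None: ...


class Source(ABC):
    @abstractmethod
    def discover(self) -> List[Certificate]: ...


class Notification(ABC):
    @abstractmethod
    def send(self, message: str) -> None: ...


def request_certificate(common_name: str, issuer: Issuer,
                        destinations: Sequence[Destination],
                        notifications: Sequence[Notification]) -> Certificate:
    """Broker flow: issue once, deploy to each destination, then notify subscribers."""
    cert = issuer.issue(common_name)
    for destination in destinations:
        destination.upload(cert)
    for notification in notifications:
        notification.send(f"issued certificate for {common_name}")
    return cert


# In-memory stand-ins, for demonstration only.
class FakeIssuer(Issuer):
    def issue(self, common_name: str) -> Certificate:
        return Certificate(common_name, body=f"FAKE-CERT-FOR-{common_name}")


class FakeDestination(Destination):
    def __init__(self) -> None:
        self.uploaded: List[Certificate] = []

    def upload(self, cert: Certificate) -> None:
        self.uploaded.append(cert)


class ListNotification(Notification):
    def __init__(self) -> None:
        self.messages: List[str] = []

    def send(self, message: str) -> None:
        self.messages.append(message)
```

The point of the shape is that the requester only names what they want; key material and certificate bodies move between issuer and destinations without passing through the developer's hands.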


Unlike many of our tools, Lemur is not tightly bound to AWS. In fact, Lemur provides several different integration points that allow it to fit into just about any existing environment.


Security engineers can leverage Lemur to act as a broker between deployment systems and certificate authorities. It provides a unified view of, and tracks, all certificates in an environment regardless of where they were issued.


Let’s take a look at what a developer's new workflow would look like using Lemur:



Some key benefits of the new workflow are:
  • Developer no longer needs to know OpenSSL commands
  • Developer no longer needs to know how to safely handle sensitive key material
  • Certificate is immediately deployed and usable
  • Keys are generated with known strength properties
  • Centralized tracking and notification
  • Common API for internal users


This interface is much more forgiving than that of a command line and allows for helpful suggestions and input validation.
For advanced users, Lemur supports all certificate options that the target issuer supports.


Lemur’s destination plugins let a developer pick an environment to upload a certificate to. Having Lemur handle the propagation of sensitive material keeps it off developers’ laptops and ensures secure transmission. Out of the box, Lemur supports multi-account AWS deployments. Over time, we hope that others can use the common plugin interface to fit their specific needs.


Even with all the things that Lemur does for us, we knew there would be use cases where certificates are not issued through Lemur. For example, a third party hosting and maintaining a marketing site, or a payment provider generating certificates for secure communication with their service.


To help with these use cases and provide the best possible visibility into an organization’s certificate deployment, Lemur has the concept of source plugins and the ability to import certificates. Source plugins allow Lemur to reach out into different environments and discover and catalog existing certificates, making them an easy way to bootstrap Lemur’s certificate management within an organization.
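The cataloging side of discovery can be sketched as a merge keyed by fingerprint, so the same certificate found in two places is tracked once. This is an illustration under assumptions, not Lemur's implementation: `fingerprint` hashes the certificate text for simplicity (real tools typically hash the DER encoding), and `merge_discovered` and the inventory layout are hypothetical.

```python
import hashlib
from typing import Dict, Iterable


def fingerprint(pem_text: str) -> str:
    """SHA-256 over the stripped certificate text (a simplification, see lead-in)."""
    return hashlib.sha256(pem_text.strip().encode("utf-8")).hexdigest()


def merge_discovered(inventory: Dict[str, dict], discovered: Iterable[dict],
                     source_name: str) -> int:
    """Add discovered certificates to the inventory; return how many were new."""
    added = 0
    for cert in discovered:
        key = fingerprint(cert["body"])
        if key not in inventory:
            inventory[key] = {"body": cert["body"], "sources": set()}
            added += 1
        # Record every place the certificate was seen, new or not.
        inventory[key]["sources"].add(source_name)
    return added
```

Tracking the set of sources per certificate is what would let a single view answer where a given certificate is deployed.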

Lemur creates, discovers and deploys certificates. It also securely stores the sensitive key material created during the procurement process. Letting Lemur handle key management provides a centralized and known method of encryption and the ability to audit the key’s usage and access.


Architecture

Lemur makes use of the following components:
  • Python 2.7, 3.4 with Flask API (including a number of helper packages)
  • AngularJS UI
  • Postgres
  • Optional use of AWS Simple Email Service (SES) for email notifications
We’re shipping Lemur with built-in plugins that allow you to issue certificates from Verisign/Symantec and allow for the discovery and deployment of certificates into AWS.


Getting Started

Lemur is available now on the Netflix Open Source site. You can try out Lemur using Docker. Detailed instructions on setup and configuration are available in our docs.


Interested in Contributing?

Feel free to reach out or submit pull requests if you have any suggestions. We’re looking forward to seeing what new plugins you create to make Lemur your own! We hope you’ll find Lemur as useful as we do!


Conclusion

Lemur is helping the Netflix security team manage our PKI by empowering developers and creating a paved road to SSL/TLS enabled applications. Lemur is available on our GitHub site now!