Monday, December 5, 2016

NetflixOSS: Announcing Hollow

“If you can cache everything in a very efficient way,
you can often change the game”

We software engineers often face problems that require the dissemination of a dataset which doesn’t fit the label “big data”.  Examples of this type of problem include:

  • Product metadata on an ecommerce site
  • Document metadata in a search engine
  • Metadata about movies and TV shows on an Internet television network

When faced with these we usually opt for one of two paths:

  • Keep the data in a centralized location (e.g., an RDBMS, NoSQL data store, or memcached cluster) for remote access by consumers
  • Serialize it (e.g., as JSON, XML, etc.) and disseminate it to consumers which keep a local copy

Scaling each of these paths presents different challenges. Centralizing the data may allow your dataset to grow indefinitely large, but:

  • There are latency and bandwidth limitations when interacting with the data
  • A remote data store is never quite as reliable as a local copy of the data

On the other hand, serializing and keeping a local copy of the data entirely in RAM can allow many orders of magnitude lower latency and higher frequency access, but this approach has scaling challenges that get more difficult as a dataset grows in size:

  • The heap footprint of the dataset grows
  • Retrieving the dataset requires downloading more bits
  • Updating the dataset may require significant CPU resources or impact GC behavior

Engineers often select a hybrid approach — cache the frequently accessed data locally and go remote for the “long-tail” data.  This approach has its own challenges:

  • Bookkeeping data structures can consume a significant amount of the cache heap footprint
  • Objects are often kept around just long enough for them to be promoted and negatively impact GC behavior

At Netflix we’ve realized that this hybrid approach often represents a false savings.  Sizing a local cache is often a careful balance between the latency of going remote for many records and the heap requirement of keeping more data local.  However, if you can cache everything in a very efficient way, you can often change the game — and get your entire dataset in memory using less heap and CPU than you would otherwise require to keep just a fraction of it.  This is where Hollow, Netflix’s latest OSS project, comes in.




Hollow is a Java library and comprehensive toolset for harnessing small to moderately sized in-memory datasets which are disseminated from a single producer to many consumers for read-only access.

“Hollow shifts the scale:
datasets for which such liberation
may never previously have been considered
can be candidates for Hollow.”


Performance

Hollow focuses narrowly on its prescribed problem set: keeping an entire, read-only dataset in-memory on consumers.  It circumvents the consequences of updating and evicting data from a partial cache.

Due to its performance characteristics, Hollow shifts the scale in terms of appropriate dataset sizes for an in-memory solution. Datasets for which such liberation may never previously have been considered can be candidates for Hollow.  For example, Hollow may be entirely appropriate for datasets which, if represented with JSON or XML, might require in excess of 100GB.

Agility

Hollow does more than simply improve performance; it also greatly enhances teams’ agility when dealing with data-related tasks.

Right from the initial experience, using Hollow is easy.  Hollow will automatically generate a custom API based on a specific data model, so that consumers can intuitively interact with the data, with the benefit of IDE code completion.
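
As a sketch of what that might look like (illustrative only; the Movie type, its fields, and the printed schema output are assumptions, not code from this post), a plain Java class is enough for Hollow to derive a schema, and the code generator can then emit a typed API against it:

    import com.netflix.hollow.core.write.HollowWriteStateEngine;
    import com.netflix.hollow.core.write.objectmapper.HollowObjectMapper;

    public class DataModelSketch {

        // A hypothetical data model: Hollow derives its schema directly
        // from plain classes like this one.
        static class Movie {
            long id;
            String title;
            int releaseYear;

            Movie(long id, String title, int releaseYear) {
                this.id = id;
                this.title = title;
                this.releaseYear = releaseYear;
            }
        }

        public static void main(String[] args) {
            HollowWriteStateEngine writeEngine = new HollowWriteStateEngine();
            HollowObjectMapper mapper = new HollowObjectMapper(writeEngine);

            // Register the data model; the write engine now knows the Movie schema.
            mapper.initializeTypeState(Movie.class);

            // From this schema, Hollow's code generator can emit a typed
            // consumer API (e.g. a MovieAPI class with typed accessors),
            // which is what enables the IDE code completion mentioned above.
            System.out.println(writeEngine.getSchemas());
        }
    }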

But the real advantages come from using Hollow on an ongoing basis.  Once your data is Hollow, it has more potential.  Imagine being able to quickly shunt your entire production dataset, current or from any point in the recent past, down to a local development workstation, load it, then exactly reproduce specific production scenarios.

Choosing Hollow will give you a head start on tooling; Hollow comes with a variety of ready-made utilities to provide insight into and manipulate your datasets.

Stability

How many nines of reliability are you after?  Three, four, five?  Nine?  As a local in-memory data store, Hollow isn’t susceptible to environmental issues, including network outages, disk failures, noisy neighbors in a centralized data store, etc.  If your data producer goes down or your consumer fails to connect to the data store, you may be operating with stale data, but the data is still present and your service is still up.

Hollow has been battle-hardened over more than two years of continuous use at Netflix. We use it to represent crucial datasets, essential to the fulfillment of the Netflix experience, on servers busily serving live customer requests at or near maximum capacity.  Although Hollow goes to extraordinary lengths to squeeze every last bit of performance out of servers’ hardware, enormous attention to detail has gone into solidifying this critical piece of our infrastructure.

Origin

Three years ago we announced Zeno, our then-current solution in this space.  Hollow replaces Zeno and is in many ways its spiritual successor.
[Figure: Zeno’s concepts of producer, consumers, data states, snapshots, and deltas are carried forward into Hollow.]

As before, the timeline for a changing dataset can be broken down into discrete data states, each of which is a complete snapshot of the data at a particular point in time.  Hollow automatically produces deltas between states; the effort required on the part of consumers to stay updated is minimized.  Hollow deduplicates data automatically to minimize the heap footprint of our datasets on consumers.
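
A minimal sketch of that snapshot-and-delta cycle, using Hollow’s low-level write API (the Movie type, file paths, and exact call sequence here are illustrative assumptions; details may differ by version, and the higher-level producer API in the quick start guide wraps these steps):

    import java.io.FileOutputStream;
    import java.io.OutputStream;

    import com.netflix.hollow.core.write.HollowBlobWriter;
    import com.netflix.hollow.core.write.HollowWriteStateEngine;
    import com.netflix.hollow.core.write.objectmapper.HollowObjectMapper;

    public class SnapshotDeltaSketch {

        static class Movie {
            long id;
            String title;
            Movie(long id, String title) { this.id = id; this.title = title; }
        }

        public static void main(String[] args) throws Exception {
            HollowWriteStateEngine writeEngine = new HollowWriteStateEngine();
            HollowObjectMapper mapper = new HollowObjectMapper(writeEngine);
            HollowBlobWriter writer = new HollowBlobWriter(writeEngine);

            // Data state 1: add every record, then write a full snapshot.
            mapper.add(new Movie(1, "Stranger Things"));
            writeEngine.prepareForWrite();
            try (OutputStream out = new FileOutputStream("/tmp/snapshot-1")) {
                writer.writeSnapshot(out);
            }

            // Data state 2: re-add the full dataset (unchanged records are
            // deduplicated automatically), then write only the difference.
            writeEngine.prepareForNextCycle();
            mapper.add(new Movie(1, "Stranger Things"));
            mapper.add(new Movie(2, "The Crown"));
            writeEngine.prepareForWrite();
            try (OutputStream out = new FileOutputStream("/tmp/delta-1-to-2")) {
                writer.writeDelta(out);
            }
        }
    }

Consumers that already hold state 1 apply only the small delta blob to reach state 2; a consumer starting cold loads the snapshot instead.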

Evolution

Hollow takes these concepts and evolves them, improving on nearly every aspect of the solution.  

Hollow eschews POJOs as an in-memory representation instead replacing them with a compact, fixed-length, strongly typed encoding of the data.  This encoding is designed to both minimize a dataset’s heap footprint and to minimize the CPU cost of accessing data on the fly.  All encoded records are packed into reusable slabs of memory which are pooled on the JVM heap to avoid impacting GC behavior on busy servers.

[Figure: An example of how OBJECT type records are laid out in memory.]
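
As a toy illustration of the idea (this is not Hollow’s actual memory layout; the field widths, masks, and one-long-per-record simplification are assumptions made for the sketch), fixed-length packing means each record’s fields live at known bit positions inside a pooled array:

    public class FixedLengthSketch {

        // Toy fixed-length encoding: every record is the same width, so a
        // record's location is computed directly from its ordinal, and the
        // slab is one long-lived array rather than millions of objects.
        static final int ID_BITS = 20;                  // up to ~1M distinct ids
        static final long ID_MASK = (1L << ID_BITS) - 1;
        static final long YEAR_MASK = (1L << 11) - 1;   // years up to 2047

        static void put(long[] slab, int ordinal, long id, long year) {
            slab[ordinal] = (year << ID_BITS) | (id & ID_MASK);
        }

        static long id(long[] slab, int ordinal) {
            return slab[ordinal] & ID_MASK;
        }

        static long year(long[] slab, int ordinal) {
            return (slab[ordinal] >>> ID_BITS) & YEAR_MASK;
        }

        public static void main(String[] args) {
            long[] slab = new long[1_000_000];  // reusable slab, kind to the GC
            put(slab, 0, 42, 2016);
            System.out.println(id(slab, 0) + ", " + year(slab, 0));  // prints 42, 2016
        }
    }

Because the slab is allocated once and reused, record churn never promotes swarms of short-lived objects into the old generation, which is the GC behavior described above.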

Hollow datasets are self-contained; no use-case-specific code needs to accompany a serialized blob in order for it to be usable by the framework.  Additionally, Hollow is designed with backwards compatibility in mind, so deployments can happen less frequently.

“Allowing for the construction of
powerful access patterns, whether
or not they were originally anticipated
while designing the data model.”

Because Hollow is all in-memory, tooling can be implemented with the assumption that random access over the entire breadth of the dataset can be accomplished without ever leaving the JVM heap.  A multitude of prefabricated tools ship with Hollow, and creation of your own tools using the basic building blocks provided by the library is straightforward.

Core to Hollow’s usage is the concept of indexing the data in various ways.  This enables O(1) access to relevant records in the data, allowing for the construction of powerful access patterns, whether or not they were originally anticipated while designing the data model.
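
For example, a primary key index over a Movie type might look like the following sketch (hedged: the type name, field name, and blob path are assumptions carried over from the earlier producer sketch):

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import com.netflix.hollow.core.index.HollowPrimaryKeyIndex;
    import com.netflix.hollow.core.read.engine.HollowBlobReader;
    import com.netflix.hollow.core.read.engine.HollowReadStateEngine;

    public class IndexSketch {
        public static void main(String[] args) throws Exception {
            // Load the snapshot written by the producer sketch above.
            HollowReadStateEngine readEngine = new HollowReadStateEngine();
            HollowBlobReader reader = new HollowBlobReader(readEngine);
            try (InputStream in = new BufferedInputStream(
                    new FileInputStream("/tmp/snapshot-1"))) {
                reader.readSnapshot(in);
            }

            // Index the Movie type by its id field...
            HollowPrimaryKeyIndex index =
                    new HollowPrimaryKeyIndex(readEngine, "Movie", "id");

            // ...then find a record's ordinal in O(1).
            int ordinal = index.getMatchingOrdinal(1L);
            System.out.println("ordinal for movie id 1: " + ordinal);
        }
    }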

Benefits

Tooling for Hollow is easy to set up and intuitive to understand.  You’ll be able to gain insights into your data, including things you didn’t know you were unaware of.

[Figure: The history tool allows for inspecting the changes in specific records over time.]

Hollow can make you operationally powerful.  If something looks wrong about a specific record, you can pinpoint exactly what changed and when it happened with a simple query into the history tool.  If disaster strikes and you accidentally publish a bad dataset, you can roll back your dataset to just before the error occurred, stopping production issues in their tracks.  Because transitioning between states is fast, this action can take effect across your entire fleet within seconds.

“Once your data is Hollow, it has more potential.”

Hollow has been enormously beneficial at Netflix; we've seen server startup times and heap footprints decrease across the board in the face of ever-increasing metadata needs.  Thanks to targeted data modeling efforts, identified through the detailed heap footprint analysis Hollow makes possible, we will be able to continue these performance improvements.

In addition to performance wins, we've seen huge productivity gains related to the dissemination of our catalog data.  This is due in part to the tooling that Hollow provides, and in part due to architectural choices which would not have been possible without it.

Conclusion

Everywhere we look, we see a problem that can be solved with Hollow.  Today, Hollow is available for the whole world to take advantage of.  

Hollow isn’t appropriate for datasets of all sizes.  If the data is large enough, keeping the entire dataset in memory isn’t feasible.  However, with the right framework, and a little bit of data modeling, that threshold is likely much higher than you think.

Documentation is available at http://hollow.how, and the code is available on GitHub.  We recommend diving into the quick start guide — you’ll have a demo up and running in minutes, and a fully production-scalable implementation of Hollow at your fingertips in about an hour.  From there, you can plug in your data model and it’s off to the races.

Once you get started, you can get help from us directly or from other users via Gitter, or by posting to Stack Overflow with the tag “hollow”.

By Drew Koszewnik

Thursday, December 1, 2016

More Efficient Mobile Encodes for Netflix Downloads


Last January, Netflix launched globally, reaching many new members in 130 countries around the world. In many of these countries, people access the internet primarily using cellular networks or still-developing broadband infrastructure. Although we have made strides in delivering the same or better video quality with fewer bits (for example, with per-title encode optimization), further innovation is required to improve video quality over low-bandwidth, unreliable networks. In this blog post, we summarize our recent work on generating more efficient video encodes, especially targeted towards low-bandwidth Internet connections. We refer to these new bitstreams as our mobile encodes.

Our first use case for these streams is the recently launched downloads feature on Android and iOS.

What’s new about our mobile encodes

We are introducing two new types of mobile encodes: AVCHi-Mobile and VP9-Mobile. The enhancements in the new bitstreams fall into three categories: (1) new video compression formats, (2) more optimal encoder settings, and (3) per-chunk bitrate optimization. All the changes combined result in better video quality for the same bitrate compared to our current streams (AVCMain).

New compression formats

Many Netflix-ready devices receive streams which are encoded using the H.264/AVC Main profile (AVCMain). This is a widely-used video compression format, with ubiquitous decoder support on web browsers, TVs, mobile devices, and other consumer devices. However, newer formats are available that offer more sophisticated video coding tools. For our mobile bitstreams we adopt two compression formats: H.264/AVC High profile and VP9 (profile 0). Similar to Main profile, the High profile of H.264/AVC enjoys broad decoder support. VP9, a royalty-free format developed by Google, is supported on the majority of Android devices, Chrome, and a growing number of consumer devices.

The High profile of H.264/AVC shares the general architecture of the Main profile and, among other features, offers additional tools that increase compression efficiency. The tools from High profile that are relevant to our use case are:
  • 8x8 transforms and Intra 8x8 prediction
  • Quantization scaling matrices
  • Separate Cb and Cr control

VP9 has a number of tools which bring improvements in compression efficiency over H.264/AVC, including:
  • Motion-predicted blocks of sizes up to 64×64
  • 1/8th pel motion vectors
  • Three switchable 8-tap subpixel interpolation filters
  • Better coding of motion vectors
  • Larger discrete cosine transforms (16×16 and 32×32 DCT)
  • Asymmetric discrete sine transform (ADST)
  • Loop filtering adapted to new block sizes
  • Segmentation maps

More optimal encoder settings

Apart from using new coding formats, optimizing encoder settings allows us to further improve compression efficiency. Examples of improved encoder settings are as follows:
  • Increased random access picture period: This parameter trades off encoding efficiency with granularity of random access points.
  • More consecutive B-frames or longer Alt-ref distance: Allowing the encoder to flexibly choose more B-frames in H.264/AVC or longer distance between Alt-ref frames in VP9 can be beneficial, especially for slowly changing scenes.
  • Larger motion search range: Results in better motion prediction and fewer intra-coded blocks.
  • More exhaustive mode evaluation: Allows an encoder to evaluate more encoding options at the expense of compute time.

Per-chunk encode optimization

In our parallel encoding pipeline, the video source is split up into a number of chunks, each of which is processed and encoded independently. For our AVCMain encodes, we analyze the video source complexity to select bitrates and resolutions optimized for that title. Whereas our AVCMain encodes use the same average bitrate for each chunk in a title, the mobile encodes optimize the bitrate for each individual chunk based on its complexity (in terms of motion, detail, film grain, texture, etc). This reduces quality fluctuations between the chunks and avoids over-allocating bits to chunks with less complex content.
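
As a toy sketch of the principle (not our actual pipeline code; the complexity scores and the simple proportional rule are assumptions, and real allocation also respects quality and buffer constraints), bits can be shifted toward complex chunks while holding the title's average bitrate fixed:

    public class PerChunkAllocation {

        /**
         * Toy per-chunk bitrate allocation: distribute the title's bit budget
         * in proportion to each chunk's complexity score, assuming chunks of
         * equal duration. The mean bitrate across chunks stays at avgKbps.
         */
        static double[] allocateKbps(double[] complexity, double avgKbps) {
            double total = 0;
            for (double c : complexity) total += c;

            double[] kbps = new double[complexity.length];
            for (int i = 0; i < complexity.length; i++) {
                // High-motion, detailed chunks get more bits; static scenes fewer.
                kbps[i] = avgKbps * complexity.length * complexity[i] / total;
            }
            return kbps;
        }

        public static void main(String[] args) {
            // Hypothetical complexity scores for four chunks of a title.
            double[] complexity = {0.5, 2.0, 1.0, 0.5};
            for (double kbps : allocateKbps(complexity, 1000)) {
                System.out.printf("%.0f kbps%n", kbps);
            }
        }
    }

In this toy model a static scene gets half the average bitrate while a high-motion chunk gets double, so quality stays more uniform across the title.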

Video compression results

In this section, we evaluate the compression performance of our new mobile encodes. The following configurations are compared:
  • AVCMain:  Our existing H.264/AVC Main profile encodes, using per-title optimization, serve as anchor for the comparison.
  • AVCHi-Mobile: H.264/AVC High profile encodes using more optimal encoder settings and per-chunk encoding.  
  • VP9-Mobile: VP9 encodes using more optimal encoder settings and per-chunk encoding.

The results were obtained on a sample of 600 full-length popular movies or TV episodes with 1080p source resolution (which adds up to about 85 million frames). We encode multiple quality points (with different resolutions) to account for the different bandwidth conditions of our members.

In our tests, we calculate PSNR and VMAF to measure video quality.  The metrics are computed after scaling the decoded videos to the original 1080p source resolution. To compare the average compression efficiency improvement, we use Bjontegaard-delta rate (BD-rate), a measure widely used in video compression. BD-rate indicates the average change in bitrate that is needed for a tested configuration to achieve the same quality as the anchor. The metric is calculated over a range of bitrate-quality points and interpolates between them to get an estimate of the relative performance of two configurations.
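
For intuition, here is a simplified sketch of a BD-rate computation (hedged: the classic Bjontegaard method fits a cubic polynomial through the (quality, log-bitrate) points, whereas this sketch uses piecewise-linear interpolation, and all of its sample numbers are made up):

    import java.util.Arrays;

    public class BdRateSketch {

        // Interpolate log10(bitrate) at quality q on one rate-quality curve.
        // Points must be sorted by increasing quality.
        static double logRateAt(double[] quality, double[] logRate, double q) {
            for (int i = 0; i < quality.length - 1; i++) {
                if (q >= quality[i] && q <= quality[i + 1]) {
                    double t = (q - quality[i]) / (quality[i + 1] - quality[i]);
                    return logRate[i] + t * (logRate[i + 1] - logRate[i]);
                }
            }
            throw new IllegalArgumentException("quality out of range: " + q);
        }

        // Average bitrate change (%) needed by "test" to match "anchor" quality.
        static double bdRate(double[] anchorKbps, double[] anchorQ,
                             double[] testKbps, double[] testQ) {
            double[] anchorLog = Arrays.stream(anchorKbps).map(Math::log10).toArray();
            double[] testLog = Arrays.stream(testKbps).map(Math::log10).toArray();

            // Integrate the log-rate difference over the overlapping quality range.
            double lo = Math.max(anchorQ[0], testQ[0]);
            double hi = Math.min(anchorQ[anchorQ.length - 1], testQ[testQ.length - 1]);

            int steps = 1000;
            double sum = 0;
            for (int i = 0; i <= steps; i++) {
                double q = lo + (hi - lo) * i / steps;
                double w = (i == 0 || i == steps) ? 0.5 : 1.0;  // trapezoid weights
                sum += w * (logRateAt(testQ, testLog, q) - logRateAt(anchorQ, anchorLog, q));
            }
            double avgLogDiff = sum / steps;

            // Convert the mean log10 bitrate difference back to a percentage.
            return (Math.pow(10, avgLogDiff) - 1) * 100;
        }

        public static void main(String[] args) {
            // Made-up rate-quality points (kbps, VMAF) for anchor and test encodes.
            double[] anchorKbps = {250, 500, 1000, 2000}, anchorVmaf = {55, 70, 82, 90};
            double[] testKbps = {250, 500, 1000, 2000}, testVmaf = {62, 77, 88, 94};
            System.out.printf("BD-rate: %.1f%%%n",
                    bdRate(anchorKbps, anchorVmaf, testKbps, testVmaf));
        }
    }

A negative result means the test configuration needs that much less bitrate, on average, to match the anchor's quality over the overlapping quality range.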

The graph below illustrates the results of the comparison. The bars represent BD-rate gains, and higher percentages indicate larger bitrate savings. The AVCHi-Mobile streams can deliver the same video quality at 15% lower bitrate according to PSNR and at 19% lower bitrate according to VMAF. The VP9-Mobile streams show more gains and can deliver an average of 36% bitrate savings according to PSNR and VMAF. This demonstrates that using the new mobile encodes requires significantly less bitrate for the same quality.



Viewing it another way, members can now receive better quality streams for the same bitrate. This is especially relevant for members with slow or expensive internet connectivity. The graph below illustrates the average quality (in terms of VMAF) at different available bit budgets for the video bitstream. For example, at 1 Mbps, our AVCHi-Mobile and VP9-Mobile streams show an average VMAF increase of 7 and 10, respectively, over AVC-Main. These gains represent noticeably better visual quality for the mobile streams.


How can I watch with the new mobile encodes?

Last month, we started re-encoding our catalog to generate the new mobile bitstreams and the effort is ongoing. The mobile encodes are being used in the brand new downloads feature. In the near future, we will also use these new bitstreams for mobile streaming to broaden the benefit for Netflix members, no matter how they’re watching.


By Andrey Norkin, Jan De Cock, Aditya Mavlankar and Anne Aaron

Tuesday, November 22, 2016

Netflix at AWS re:Invent 2016

Like many of our tech blog readers, Netflix is getting ready for AWS re:Invent in Las Vegas next week. Lots of Netflix engineers and recruiters will be in attendance, and we're looking forward to meeting and reconnecting with cloud enthusiasts and Netflix OSS users.

To make it a little easier to find our speakers at re:Invent, we're posting the schedule of Netflix talks here. We'll also have a booth on the expo floor and hope to see you there!

AES119 - Boosting Your Developers’ Productivity and Efficiency by Moving to a Container-Based Environment with Amazon ECS
Wednesday, November 30 3:40pm (Executive Summit)
Neil Hunt, Chief Product Officer

Abstract: Increasing productivity and encouraging more efficient ways for development teams to work is top of mind for nearly every IT leader. In this session, Neil Hunt, Chief Product Officer at Netflix, will discuss why the company decided to introduce a container-based approach in order to speed development time, improve resource utilization, and simplify the developer experience. Learn about the company’s technical and business goals, technology choices and tradeoffs it had to make, and benefits of using Amazon ECS.

ARC204 - From Resilience to Ubiquity - #NetflixEverywhere Global Architecture
Tuesday, November 29 9:30am
Thursday, December 1 12:30pm
Coburn Watson, Director, Performance and Reliability

Abstract: Building and evolving a pervasive, global service requires a multi-disciplined approach that balances requirements with service availability, latency, data replication, compute capacity, and efficiency. In this session, we’ll follow the Netflix journey of failure, innovation, and ubiquity. We'll review the many facets of globalization and then delve deep into the architectural patterns that enable seamless, multi-region traffic management; reliable, fast data propagation; and efficient service infrastructure. The patterns presented will be broadly applicable to internet services with global aspirations.

BDM306 - Netflix: Using Amazon S3 as the fabric of our big data ecosystem
Tuesday, November 29, 5:30pm
Wednesday, November 30, 12:30pm
Eva Tse, Director, Big Data Platform
Kurt Brown, Director, Data Platform

Abstract: Amazon S3 is the central data hub for Netflix's big data ecosystem. We currently have over 1.5 billion objects and 60+ PB of data stored in S3. As we ingest, transform, transport, and visualize data, we find this data naturally weaving in and out of S3. Amazon S3 provides us the flexibility to use an interoperable set of big data processing tools like Spark, Presto, Hive, and Pig. It serves as the hub for transporting data to additional data stores / engines like Teradata, Redshift, and Druid, as well as exporting data to reporting tools like Microstrategy and Tableau. Over time, we have built an ecosystem of services and tools to manage our data on S3. We have a federated metadata catalog service that keeps track of all our data. We have a set of data lifecycle management tools that expire data based on business rules and compliance. We also have a portal that allows users to see the cost and size of their data footprint. In this talk, we’ll dive into these major uses of S3, as well as many smaller cases, where S3 smoothly addresses an important data infrastructure need. We will also provide solutions and methodologies on how you can build your own S3 big data hub.

CON313 - Netflix: Container Scheduling, Execution, and Integration with AWS
Thursday, December 1, 2:00pm
Andrew Spyker, Manager, Netflix Container Cloud

Abstract: Members from all over the world streamed over forty-two billion hours of Netflix content last year. Various Netflix batch jobs and an increasing number of service applications use containers for their processing. In this session, Netflix presents a deep dive on the motivations and the technology powering container deployment on top of Amazon Web Services. The session covers our approach to resource management and scheduling with the open source Fenzo library, along with details of how we integrate Docker and Netflix container scheduling running on AWS. We cover the approach we have taken to deliver AWS platform features to containers such as IAM roles, VPCs, security groups, metadata proxies, and user data. We want to take advantage of native AWS container resource management using Amazon ECS to reduce operational responsibilities. We are delivering these integrations in collaboration with the Amazon ECS engineering team. The session also shares some of the results so far, and lessons learned throughout our implementation and operations.

DEV209 - Another Day in the Life of a Netflix Engineer
Wednesday, November 30, 4:00pm
Dave Hahn, Senior SRE

Abstract: Netflix is big. Really big. You just won't believe how vastly, hugely, mind-bogglingly big it is. Netflix is a large, ever-changing ecosystem serving millions of customers across the globe through cloud-based systems and a globally distributed CDN. This entertaining romp through the tech stack serves as an introduction to how we think about and design systems, the Netflix approach to operational challenges, and how other organizations can apply our thought processes and technologies. We’ll talk about:
  • The Bits - The technologies used to run a global streaming company 
  • Making the Bits Bigger - Scaling at scale 
  • Keeping an Eye Out - Billions of metrics 
  • Break all the Things - Chaos in production is key 
  • DevOps - How culture affects your velocity and uptime

DEV311 - Multi-Region Delivery Netflix Style
Thursday, December 1, 1:00pm
Andrew Glover, Engineering Manager

Abstract: Netflix rapidly deploys services across multiple AWS accounts and regions over 4,000 times a day. We’ve learned many lessons about reliability and efficiency. What’s more, we’ve built sophisticated tooling to facilitate our growing global footprint. In this session, you’ll learn about how Netflix confidently delivers services on a global scale and how, using best practices combined with freely available open source software, you can do the same.

MBL204 - How Netflix Achieves Email Delivery at Global Scale with Amazon SES
Tuesday, November 29, 10:00am
Devika Chawla, Engineering Director

Abstract: Companies around the world are using Amazon Simple Email Service (Amazon SES) to send millions of emails to their customers every day, and scaling linearly, at cost. In this session, you learn how to use the scalable and reliable infrastructure of Amazon SES. In addition, Netflix talks about their advanced Messaging program, their challenges, how SES helped them with their goals, and how they architected their solution for global scale and deliverability.

NET304 - Moving Mountains: Netflix's Migration into VPC
Thursday, December 1, 3:30pm
Andrew Braham, Manager, Cloud Network Engineering
Laurie Ferioli, Senior Program Manager

Abstract: Netflix was one of the earliest AWS customers with such large scale. By 2014, we were running hundreds of applications in Amazon EC2. That was great, until we needed to move to VPC. Given our scale, uptime requirements, and the decentralized nature of how we manage our production environment, the VPC migration (still ongoing) presented particular challenges for us and for AWS as it sought to support our move. In this talk, we discuss the starting state, our requirements and the operating principles we developed for how we wanted to drive the migration, some of the issues we ran into, and how the tight partnership with AWS helped us migrate from an EC2-Classic platform to an EC2-VPC platform.

SAC307 - The Psychology of Security Automation
Thursday, December 1, 12:30pm
Friday, December 2, 11:00am
Jason Chan, Director, Cloud Security

Abstract: Historically, relationships between developers and security teams have been challenging. Security teams sometimes see developers as careless and ignorant of risk, while developers might see security teams as dogmatic barriers to productivity. Can technologies and approaches such as the cloud, APIs, and automation lead to happier developers and more secure systems? Netflix has had success pursuing this approach, by leaning into the fundamental cloud concept of self-service, the Netflix cultural value of transparency in decision making, and the engineering efficiency principle of facilitating a “paved road.”

This session explores how security teams can use thoughtful tools and automation to improve relationships with development teams while creating a more secure and manageable environment. Topics include Netflix’s approach to IAM entity management, Elastic Load Balancing and certificate management, and general security configuration monitoring.

STG306 - Tableau Rules of Engagement in the Cloud
Thursday, December 1, 1:00pm
Srikanth Devidi, Senior Data Engineer
Albert Wong, Enterprise Reporting Platform Manager

Abstract: You have billions of events in your fact table, all of it waiting to be visualized. Enter Tableau… but wait: how can you ensure scalability and speed with your data in Amazon S3, Spark, Amazon Redshift, or Presto? In this talk, you’ll hear how Albert Wong and Srikanth Devidi at Netflix use Tableau on top of their big data stack. Albert and Srikanth also show how you can get the most out of a massive dataset using Tableau, and help guide you through the problems you may encounter along the way.