Friday, October 31, 2014

Message Security Layer: A Modern Take on Securing Communication

Netflix serves audio and video to millions of devices and subscribers across the globe. Each device has its own unique hardware and software, and differing security properties and capabilities. The communication between these devices and our servers must be secured to protect both our subscribers and our service.
When we first launched the Netflix streaming service we used a combination of HTTPS and a homegrown security mechanism called NTBA to provide that security. However, over time this combination started exhibiting growing pains. With the advent of HTML5 and the Media Source Extensions and Encrypted Media Extensions we needed something new that would be compatible with that platform. We took this as an opportunity to address many of the shortcomings of the earlier technology. The Message Security Layer (MSL) was born from these dual concerns.

Problems with HTTPS

One of the largest problems with HTTPS is the PKI infrastructure. There were a number of short-lived incidents where a renewed server certificate caused outages. We had no good way of handling revocation: our attempts to leverage CRL and OCSP technologies resulted in a complex set of workarounds to deal with infrastructure downtimes and configuration mistakes, which ultimately led to a worse user experience and brittle security mechanism with little insight into errors. Recent security breaches at certificate authorities and the issuance of intermediate certificate authorities means placing trust in one actor requires placing trust in a whole chain of actors not necessarily deserving of trust.
Another significant issue with HTTPS is the requirement for accurate time. The X.509 certificates used by HTTPS contain two timestamps and if the validating software thinks the current time is outside that time window the connection is rejected. The vast majority of devices do not know the correct time and have no way of securely learning the correct time.
Being tied to SSL and TLS, HTTPS also suffers from fundamental security issues unknown at the time of their design. Examples include padding attacks and the use of MAC-then-Encrypt, which is less secure than Encrypt-then-MAC.
There are other less obvious issues with HTTPS. Establishing a connection requires extra network round trips and depending on the implementation may result in multiple requests to supporting infrastructure such as CRL distribution points and OCSP responders in order to validate a certificate chain. As we continually improved application responsiveness and playback startup time this overhead became significant, particularly in situations with less reliable network connectivity such as Wi-Fi or mobile networks.
Even ignoring these issues, integrating new features and behaviors into HTTPS would have been extremely difficult. The specification is fixed and mandates certain behaviors. Leveraging specific device security features would require hacking the SSL/TLS stack in unintended ways: imagine generating some form of client certificate that used a dynamically generated set of device credentials.

High-level Goals

Before starting to design MSL we had to identify its high-level goals. Other than general best practices when it comes to protocol design, the following objectives are particularly important given the scale of deployment, the fact it must run on multiple platforms, and the knowledge it will be used for future unknown use cases.
  • Cross-language. Particularly subject to JavaScript constraints such as its maximum integer value and native functions found in web browsers.
  • Automatic error recovery. With millions of devices and subscribers we need devices that enter a bad state to be able to automatically recover without compromising security.
  • Performance. We do not want our application performance and responsiveness to be limited any more than it has to be. The network is by far the most expensive performance cost.
    Figure 1. HTTP vs. HTTPS Performance
  • Flexible and extensible. Whenever possible we want to take advantage of security features provided by devices and their software. Likewise if something no longer provides the security we need then there needs to be a migration path forward.
  • Standards compatible. Although related to being flexible and extensible, we paid particular attention to being standards compatible. Specifically we want to be able to leverage the Web Crypto API now available in the major web browsers.

Security Properties

MSL is a modern cryptographic protocol that takes into account the latest cryptography technologies and knowledge. It supports the following basic security properties.
  • Integrity protection. Messages in transit are protected from tampering.
  • Encryption. Message data is protected from inspection.
  • Authentication. Messages can be trusted to come from a specific device and user.
  • Non-replayable. Messages containing non-idempotent data can be non-replayable.
MSL supports two different deployment models, which we refer to as MSL network types. A single device may participate in multiple MSL networks simultaneously.
  • Trusted services network. This deployment consists of a single client device and multiple servers. The client authenticates against the servers. The servers have shared access to the same cryptographic secrets and therefore each server must trust all other servers.
  • Peer-to-peer. This is a typical p2p arrangement where each each side of the communication is mutually authenticated.
MSL Networks
Figure 2. MSL Networks


Protocol Overview

A typical MSL message consists of a header and one or more application payload chunks. Each chunk is individually protected which allows the sender and recipient to process application data as it is transmitted. A message stream may remain open indefinitely, allowing large time gaps between chunks if desired.
MSL has pluggable authentication and may leverage any number of device and user authentication types for the initial message. The initial message will provide authentication, integrity protection, and encryption if the device authentication type supports it. Future messages will make use of session keys established as a result of the initial communication.
If the recipient encounters an error when receiving a message it will respond with an error message. Error messages consist of a header that indicates the type of error that occurred. Upon receipt of the error message the original sender can attempt to recover and retransmit the original application data. For example, if the message recipient believes one side or the other is using incorrect session keys the error will indicate that new session keys should be negotiated from scratch. Or if the message recipient believes the device or user credentials are incorrect the error will request the sender re-authenticate using new credentials.
To minimize network round-trips MSL attempts to perform authentication, key negotiation, and renewal operations while it is also transmitting application data (Figure 2). As a result MSL does not impose any additional network round trips and only minimal data overhead.
Figure 3. MSL Communication w/Application Data
This may not always be possible in which case a MSL handshake must first occur, after which sensitive data such as user credentials and application data may be transmitted (Figure 3).
Figure 4. MSL Handshake followed by Application Data
Once session keys have been established they may be reused for future communication. Session keys may also be persisted to allow reuse between application executions. In a trusted services network the session keys resulting from a key negotiation with one server can be used with all other servers.

Platform Integration

Whenever possible we would like to take advantage of the security features provided by a specific platform. Doing so often provides stronger security than is possible without leveraging those features.
Some devices may already contain cryptographic keys that can be used to authenticate and secure initial communication. Likewise some devices may have already authenticated the user and it is a better user experience if the user is not required to enter their email and password again.
MSL is a plug-in architecture which allows for the easy integration of different device and user authentication schemes, session key negotiation schemes, and cryptographic algorithms. This also means that the security of any MSL deployment heavily depends on the mechanisms and algorithms it is configured with.
The plug-in architecture also means new schemes and algorithms can be incorporated without requiring a protocol redesign.

Other Features

  • Time independence. MSL does not require time to be synchronized between communicating devices. It is possible certain authentication or key negotiation schemes may impose their own time requirements.
  • Service tokens. Service tokens are very similar to HTTP cookies: they allow applications to attach arbitrary data to messages. However service tokens can be cryptographically bound to a specific device and/or user, which prevents data from being migrated without authorization.

The Release

To learn more about MSL and find out how you can use it for your own applications visit the Message Security Layer repository on GitHub.
The protocol is fully documented and guides are provided to help you use MSL securely for your own applications. Java and JavaScript implementations of a MSL stack are available as well as some example applications. Both languages fully support trusted services and peer-to-peer operation as both client and server.

MSL Today and Tomorrow

With MSL we have eliminated many of the problems we faced with HTTPS and platform integration. Its flexible and extensible design means it will be able to adapt as Netflix expands and as the cryptographic landscape changes.
We are already using MSL on many different platforms including our HTML5 player, game consoles, and upcoming CE devices. MSL can be used just as effectively to secure internal communications. In the future we envision using MSL over Web Sockets to create long-lived secure communication channels between our clients and servers.
We take security seriously at Netflix and are always looking for the best to join our team. If you are also interested in attacking the challenges of the fastest-growing online streaming service in the world, check out our job listings.

Wesley Miaw & Mitch Zollinger
Security Engineering

Thursday, October 23, 2014

FIT : Failure Injection Testing

by Kolton Andrus, Naresh Gopalani, Ben Schmaus


It's no secret that at Netflix we enjoy deliberately breaking things to test our production systems. Doing so lets us validate our assumptions and prove that our mechanisms for handling failure will work when called upon. Netflix has a tradition of implementing a range of tools that create failure, and it is our pleasure to introduce you to the latest of these solutions, FIT or Failure Injection Testing.

FIT is a platform that simplifies creation of failure within our ecosystem with a greater degree of precision for what we fail and who we will impact. FIT also allows us to propagate our failures across the entirety of Netflix in a consistent and controlled manner.

Why We Built FIT
While breaking things is fun, we do not enjoy causing our customers pain. Some of our Monkeys, by design, can go a little too wild when let out of their cages. Latency Monkey in particular has bitten our developers, leaving them wary about unlocking the cage door.

Latency monkey adds a delay and/or failure on the server side of a request for a given service. This provides us good insight into how calling applications behave when their dependency slows down - threads pile up, the network becomes congested, etc. Latency monkey also impacts all calling applications - whether they want to participate or not, and can result in customer pain if proper fallback handling, timeouts, and bulkheads don’t work as expected. With the complexity of our system it is virtually impossible for us to anticipate where failures will happen when turning latency monkey loose. Validating these behaviors often is risky, but critical to remain resilient.

What we need is a way to limit the impact of failure testing while still breaking things in realistic ways. We need to control the outcome until we have confidence that the system degrades gracefully, and then increase it to exercise the failure at scale.  This is where FIT comes in.

How FIT works
Simulating failure starts when the FIT service pushes failure simulation metadata to Zuul. Requests matching the failure scope at Zuul are decorated with failure. This may be an added delay to a service call, or failure in reaching the persistence layer. Each injection point touched checks the request context to determine if there is a failure for that specific component. If found, the injection point simulates that failure appropriately.  Below is an outline of a simulated failure, demonstrating some of the inflection points in which failure can be injected.


FIT Architecture Example.png

Failure Scope
We only want to break those we intend, so limiting the potential blast radius is critical. To achieve this we use Zuul, which provides many powerful capabilities for inspecting and managing traffic. Before forwarding a request, Zuul checks a local store of FIT metadata to determine if this request should be impacted. If so, Zuul decorates the request with a failure context, which is then propagated to all dependent services.

For most failure tests, we use Zuul to isolate impacted requests to only a specific test account or a specific device. Once validated at that level, we expand the scope to a small percentage of production requests. If the failure tests still looks good, we will gradually dial up the chaos to 100%.

Injection Points
We have several key “building block” components that are used within Netflix. They help us to isolate failure and define fallbacks (Hystrix), communicate with dependencies (Ribbon), cache data (EVCache), or persist data (Astyanax). Each of these layers make perfect inflection points to inject failure. These layers interface with the FIT context to determine if this request should be impacted. The failure behavior is provided to that layer, which determines how to emulate that failure in a realistic fashion: sleep for a delay period, return a 500, throw an exception, etc.

Failure Scenarios
Whether we are recreating a past outage, or proactively testing the loss of a dependency, we need to know what could fail in order to build a simulation. We use an internal system that traces requests through the entirety of the Netflix ecosystem to find all of the injection points along the path. We then use these to create failure scenarios, which are sets of injection points which should or should not fail. One such example is our critical services scenario, the minimum set of our services required to stream. Another may be the loss of an individual service, including its persistence and caching layers.

Automated Testing
Failure testing tools are only as valuable as their usage. Our device testing teams have developed automation which: enables failure, launches Netflix on a device, browses through several lists, selects a video, and begins streaming. We began by validating this process works if only our critical services are available. Currently we are extending this to identify every dependency touched during this process, and systematically failing each one individually. As this is running continuously, it helps us to identify vulnerabilities when introduced.

Resiliency Strategies
FIT has proven useful to bridge the gap between isolated testing and large scale chaos exercises, and make such testing self service. It is one of many tools we have to help us build more resilient systems. The scope of the problem extends beyond just failure testing, we need a range of techniques and tools: designing for failure, better detection and faster diagnosis, regular automated testing, bulkheading, etc. If this sounds interesting to you, we’re looking for great engineers to join our reliability, cloud architecture, and API teams!



Tuesday, October 7, 2014

Using Presto in our Big Data Platform on AWS

by: Eva Tse, Zhenxiao Luo, Nezih Yigitbasi @ Big Data Platform team


At Netflix, the Big Data Platform team is responsible for building a reliable data analytics platform shared across the whole company. In general, Netflix product decisions are very data driven. So we play a big role in helping different teams to gain product and consumer insights from a multi-petabyte scale data warehouse (DW). Their use cases range from analyzing A/B tests results to analyzing user streaming experience to training data models for our recommendation algorithms.

We shared our overall architecture in a previous blog post. The underpinning of our big data platform is that we leverage AWS S3 for our DW. This architecture allows us to separate compute and storage layers. It allows multiple clusters to share the same data on S3 and clusters can be long-running and yet transient (for flexibility). Our users typically write Pig or Hive jobs for ETL and data analytics.

A small subset of the ETL output and some aggregated data is transferred to Teradata for interactive querying and reporting. On the other hand, we also have the need to do low latency interactive data exploration on our broader data set on S3. These are the use cases that Presto serves exceptionally well. Seven months ago, we first deployed Presto into production and it is now an integral part of our data ecosystem. In this blog post, we would like to share our experience with Presto and how we made it work for us!

Why Presto?

We had been in search of an interactive querying engine that could work well for us. Ideally, we wanted an open source project that could handle our scale of data & processing needs, had great momentum, was well integrated with the Hive metastore, and was easy for us to integrate with our DW on S3. We were delighted when Facebook open sourced Presto.

In terms of scale, we have a 10 petabyte data warehouse on S3. Our users from different organizations query diverse data sets across expansive date ranges. For this use case, caching a specific dataset in memory would not work because cache hit rate would be extremely low unless we have an unreasonably large cache. The streaming DAG execution architecture of Presto is well-suited for this sporadic data exploration usage pattern.

In terms of integrating with our big data platform, Presto has a connector architecture that is Hadoop friendly. It allows us to easily plug in an S3 file system. We were up and running in test mode after only a month of work on the S3 file system connector in collaboration with Facebook.

In terms of usability, Presto supports ANSI SQL, which has made it very easy for our analysts and developers to get rolling with it.  As far as limitations / drawbacks, user-defined functions in Presto are more involved to develop, build, and deploy as compared to Hive and Pig. Also, for users who want to productionize their queries, they need to rewrite them in HiveQL or Pig Latin, as we don’t currently use Presto in our critical production pipelines. While there are some minor inconveniences, the benefits of being able to interactively analyze large amounts of data is a huge win for us.

Finally, Presto was already running in production at Facebook. We did some performance benchmarking and stress testing and we were impressed. We also looked under the hood and saw well designed and documented Java code. We were convinced!

Our production environment and use cases

Currently, we are running with ~250 m2.4xlarge EC2 worker instances and our coordinator is on r3.4xlarge. Our users run ~2500 queries/workday. Our Presto cluster is completely isolated from our Hadoop clusters, though they all access the same data on our S3 DW.

Almost all of our jobs are CPU bound. We set our task memory to a rather high value (i.e., 7GB, with a slight chance in oversubscribing memory) to run some of our memory intensive queries, like big joins or aggregation queries.

We do not use disk (as we don’t use HDFS) in the cluster. Hence, we will be looking to upgrade to the current generation AWS instance type (e.g. r3), which has more memory, and has better isolation and performance than the previous generation of EC2 instances.

We are running the latest Presto 0.76 release with some outstanding pull requests that are not committed yet. Ideally, we would like to contribute everything back to open source and not carry custom patches in our deployment. We are actively working with Facebook and looking forward to committing all of our pull requests.

Presto addresses our ad hoc interactive use cases. Our users always go to Presto first for quick answers and for data exploration. If Presto does not support what they need (like big join / aggregation queries that exceed our memory limit or some specific user-defined functions that are not available), then they would go back to Hive or Pig.

We are working on a Presto user interface for our internal big data portal. Our algorithm team also built an interactive data clustering application by integrating R with Presto via an open source Python Presto client.

Performance benchmark

At a high level, we compare Presto and Hive query execution time using our own datasets and users’ queries instead of running standard benchmarks like TPC-H or TPC-DS. This way, we can translate the results back to what we can expect for our use cases. The graph below shows the results of three queries: a group-by query, a join plus a group-by query, and a needle-in-a-haystack (table scan) query. We compared the performance of Presto vs. Hive 0.11 on Hadoop 2 using Parquet input files on S3, all of which we currently use in production. Each query processed the same data set with varying data sizes between ~140GB to ~210GB depending on the file format.





Cluster setup:
40 nodes m2.4xlarge

Settings we tuned:
task.shard.max-threads=32
task.max-memory=7GB
sink.max-buffer-size=1GB
hive.s3.max-client-retries=50
hive.s3.max-error-retries=50
hive.s3.max-connections=500
hive.s3.connect-timeout=5m
hive.s3.socket-timeout=5m




We understand performance test environments and numbers are hard to reproduce. What is worth noting is the relative performance of these tests. The key takeaway is that queries that take one or two map-reduce (MR) phases in Hadoop run 10 to 100 times faster in Presto. The speedup in Presto is linear to the number of MR jobs involved. For jobs that only do a table scan (i.e., I/O bound instead of CPU bound), it is highly dependent on the read performance of the file format used. We did some work on Presto / Parquet integration, which we will cover in the next section.

Our Presto contributions

The primary and initial piece of work that made Presto work for us was S3 FileSystem integration. In addition, we also worked on optimizing S3 multipart upload. We also made a few enhancements and bug fixes based on our use cases along the way: disabling recursive directory listing, json tuple generation, foreground metastore refresh, mbean for S3 filesystem monitoring, and handling S3 client socket timeout.

In general, we are committed to make Presto work better for our users and to cover more of their needs. Here are a few big enhancements that we are currently working on:

Parquet file format support

We recently upgraded our DW to use the Parquet file format (FF) for its performance on S3 and for its flexibility to integrate with different data processing engines. Hence, we are committed to make Presto work better with Parquet FF.  (For details on why we chose Parquet and what we contributed to make it work in our environment, stay tuned for an upcoming blog post).

Developing based on Facebook’s initial Parquet integration, we added support for predicate pushdown, column position based access (instead of name based access) to Parquet columns, and data type coercion. For context, we use the Hive metastore as our source of truth for metadata, and we do schema evolution on the Hive metastore. Hence, we need column position based access to work with our Hive metastore instead of using the schema information stored in Parquet files.

Here is a comparison of Presto job execution times among different FFs. We compare read performance of sequence file (a FF we have historically used), ORCFile (we benchmarked the latest integration with predicate pushdown, vectorization and lazy materialization on read) and Parquet. We also compare the performance on S3 vs. HDFS. In this test, we use the same data sets and environment as the above benchmark test. The query is a needle-in-a-haystack query that does a select and filter on a condition that returns zero rows.

Screen Shot 2014-09-30 at 2.29.03 PM.png

As next step, we will look into improving Parquet performance further by doing predicate pushdown to eliminate whole row groups, vectorization and lazy materialization on read. We believe this will make Parquet performance on par with ORC files.

ODBC / JDBC support

This is one of the biggest asks from our users. Users like to connect to our Hive DW directly to do exploratory / ad hoc reporting because it has the full dataset. Given Presto is interactive and integrated with Hive metastore, it is a natural fit.

Presto has a native ODBC driver that was recently open sourced. We made a few bug fixes and we are working on more enhancements. Overall, it is working well now for our Tableau users in extract (non-live exploration) mode. For our users who prefer to use Microstrategy, we plan to explore different options to integrate with it next.

Map data type support

All the event data generated from our Netflix services and Netflix-enabled devices comes through our Suro pipeline before landing in our DW. For flexibility, this event data is structured as key/value pairs, which get automatically stored in map columns in our DW. Users may pull out keys as a top level columns in the Hive metastore by adjusting some configurations in our data pipeline. Still, a large number of key/value pairs remain in the map because there are a large number of keys and the key space is very sparse.

It is very common for users to lookup a specific key from the map. With our current Parquet integration, looking up a key from the map column means converting the column to JSON string first then parsing it. Facebook recently added native support for array and map data types. We plan to further enhance it to support array element or map key specific column pruning and predicate pushdown for Parquet FF to improve performance.

Our wishlist

There are still a couple of items that are high on our wishlist and we would love to contribute on these when we have the bandwidth.

Big table join. It is very common for our queries to join tables as we have a lot of normalized data in our DW. We are excited to see that distributed hash join is now supported and plan to check it out. Sort-merge join would likely be useful to solve some of the even bigger join use cases that we have.

Graceful shrink.  Given Presto is used for our ad hoc use cases, and given we run it in the cloud, it would be most efficient if we could scale up the cluster during peak hours (mostly work hours) and scale down during trough hours (night time or weekends). If Presto nodes can be blacklisted and gracefully drained before shutdown, we could combine that with available JMX metrics to do heuristic-based auto expand/shrink of the cluster.

Key takeaway

Presto makes the lives of our users a lot easier. It tremendously improves their productivity.

We have learned from our experience that getting involved and contributing back to open source technologies is the best way to make sure it works for our use cases in a fast paced and evolving environment. We have been working closely with the Facebook team to discuss our use cases and align priorities. They have been open about their roadmap, quick in adding new features, and helpful in providing feedback to our contributions. We look forward to continuing to work with them and the community to make Presto even better and more comprehensive. Let us know if you are interested in sharing your experiences using Presto.

Last but not least, the Big Data Platform team at Netflix has been heads-down innovating on our platform to meet our growing business needs. We will share more of our experiences with our Parquet FF migration and Genie 2 upgrade in upcoming blog posts.

If you are interested in solving big data problems like ours, we would like to hear from you!