Monday, August 31, 2015

Announcing Sleepy Puppy - Cross-Site Scripting Payload Management for Web Application Security Testing

by: Scott Behrens and Patrick Kelley

Netflix is pleased to announce the open source release of our cross-site scripting (XSS) payload management framework: Sleepy Puppy!

The Challenge of Cross-Site Scripting

Cross-site scripting is a type of web application security vulnerability that allows an attacker to execute arbitrary client-side script in a victim’s browser. XSS has been listed on the OWASP Top 10 vulnerability list since 2004, and developers continue to struggle with mitigating controls to prevent XSS (e.g. content security policy, input validation, output encoding). According to a recent report from WhiteHat Security, a web application is 47% likely to have one or more cross-site scripting vulnerabilities. 

A number of tools are available to identify cross-site scripting issues; however, security engineers are still challenged to fully cover the scope of applications in their portfolio. Automated scans and other security controls provide a base level of coverage, but often only focus on the target application. 


Delayed XSS Testing

Delayed XSS testing is a variant of stored XSS testing that can be used to extend the scope of coverage beyond the immediate application being tested. With delayed XSS testing, security engineers inject an XSS payload on one application that may get reflected back in a separate application with a different origin. Let’s examine the following diagram.


Here we see a security engineer inject an XSS payload into the assessment target (App #1 Server) that does not result in an XSS vulnerability. However, that payload was stored in a database (DB) and reflected back in a second application not accessible to the tester. Even though the tester can’t access the vulnerable application, the vulnerability could still be used to take advantage of the user. In fact, these types of vulnerabilities can be even more dangerous than standard XSS since the potential victims are likely to be privileged users (employees, administrators, etc.).

To discover the triggering of a delayed XSS attack, the payload must alert the tester of App #2’s vulnerability in a different manner. 


Toward Better Delayed XSS Payload Management

A number of talks and tools cover XSS testing, with some focusing on the delayed variant. Tools like BeEF, PortSwigger Burp Suite Collaborator, and XSS.IO are appropriate for a number of situations and can be beneficial tools in the application security engineer’s portfolio. However, we wanted a more comprehensive XSS testing framework to simplify XSS propagation and identification and allow us to work with developers to remediate issues faster.

Without further ado, meet Sleepy Puppy! 


Sleepy Puppy

Sleepy Puppy is an XSS payload management framework that enables security engineers to simplify the process of capturing, managing, and tracking XSS propagation over long periods of time and numerous assessments.

We will use the following terminology throughout the rest of the discussion:

  • Assessments describe specific testing sessions and allow the user to optionally receive email notifications when XSS issues are identified for those assessments.
  • Payloads are XSS strings to be executed and can include the full range of XSS injection.
  • PuppyScripts are typically written in JavaScript and provide a way to collect information on where the payload executed. 
  • Captures are the screenshots and metadata collected by the default PuppyScript.
  • Generic Collector is an endpoint that allows you to optionally log additional data outside the scope of a traditional capture. 

Sleepy Puppy is highly configurable, and you can create your own payloads and PuppyScripts as needed.

Security engineers can leverage the Sleepy Puppy assessment model to categorize payloads and subscribe to email notifications when delayed cross-site scripting events are triggered.

Sleepy Puppy also exposes an API for users who may want to develop plugins for scanners such as Burp or ZAP. With Sleepy Puppy, our testing workflow now looks like this:


Testing is straightforward as Sleepy Puppy ships with a number of payloads, PuppyScripts, and an assessment. To provide a better sense of how Sleepy Puppy works in action, let’s take a look at an assessment we created for the XSS Challenge web application, a sample application that allows users to practice XSS testing.



To test the XSS Challenge web app, we created an assessment named 'XSS Game', which is highlighted above. When you click and highlight an assessment, you can see a number of payloads associated with this assessment. These payloads were automatically configured to have unique identifiers to help you correlate which payloads within your assessment have executed. Throughout the course of testing, counts of captures, collections, and access log requests are provided to quickly identify which payloads are executing. 
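The idea of uniquely identified payloads and per-payload counts can be sketched in a few lines. This is only an illustration of the bookkeeping, not Sleepy Puppy's actual code: the callback host, the `/x?u=` URL shape, and the names `Payload`, `build_payloads`, and `tally_captures` are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import count

# Hypothetical callback host; a real deployment would use its own hostname.
CALLBACK_HOST = "sleepypuppy.example.com"


@dataclass(frozen=True)
class Payload:
    payload_id: int
    markup: str


def build_payloads(templates, host=CALLBACK_HOST, start=1):
    """Give every template its own identifier so executions can be told apart."""
    ids = count(start)
    payloads = []
    for template in templates:
        payload_id = next(ids)
        markup = template.format(host=host, payload_id=payload_id)
        payloads.append(Payload(payload_id, markup))
    return payloads


def tally_captures(capture_payload_ids):
    """Count how many captures each payload identifier produced."""
    return Counter(capture_payload_ids)


# Hypothetical templates; the URL shape is an assumption for illustration.
templates = [
    '<script src="//{host}/x?u={payload_id}"></script>',
    '"><script src="//{host}/x?u={payload_id}"></script>',
]
payloads = build_payloads(templates)
counts = tally_captures([1, 1, 2])  # identifiers reported by three captures
```

Because each injected string carries its own identifier, a capture that arrives days later can still be traced back to the exact payload and assessment that produced it.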

Simply copy any payload and inject it in the web application you are testing. Injecting Sleepy Puppy payloads in stored objects that may be reflected in other applications is highly recommended. 

The default PuppyScript configured for payloads captures useful metadata including the URL, DOM with payload highlighting, user-agent, cookies, referer header, and a screenshot of the application where the payload executed. This provides the tester ample knowledge to identify the impacted application so they may mitigate the vulnerability quickly. As payloads propagate throughout a network, the tester can trace what applications the payload has executed in. For more advanced use cases, security engineers can chain PuppyScripts together and even leverage the generic collector model to capture arbitrary data from any input source. 

After the payload executes, the tester will receive an email notification (if configured) and be presented with actionable data associated with the payload execution:


Here, the security engineer can view all of the information collected in Sleepy Puppy: when the payload fired, the URL, referrer, cookies, user agent, DOM, and a screenshot.


Architecture

Sleepy Puppy makes use of the following components:

  • Python 2.7 with Flask (including a number of helper packages)
  • SQLAlchemy with configurable backend storage
  • Ace JavaScript editor for editing PuppyScripts
  • html2canvas JavaScript library for screenshot capture
  • Optional use of AWS Simple Email Service (SES) for email notifications and S3 for screenshot storage

We’re shipping Sleepy Puppy with built-in payloads, PuppyScripts and a default assessment.


Getting Started

Sleepy Puppy is available now on the Netflix Open Source site. You can try out Sleepy Puppy using Docker. Detailed instructions on setup and configuration are available on the wiki page.


Interested in Contributing?

Feel free to reach out or submit pull requests if there’s anything else you’re looking for. We hope you’ll find Sleepy Puppy as useful as we do!


Special Thanks

Thanks to Daniel Miessler for the extensive feedback after our Bay Area OWASP talk which was discussed in his blogpost.  


Conclusion

Sleepy Puppy is helping the Netflix security team identify XSS propagation through a number of systems even when those systems aren’t assessed directly. We hope that the open source community can find new and interesting uses for Sleepy Puppy, and use it to simplify their XSS testing and improve remediation times. Sleepy Puppy is available on our GitHub site now!

Tuesday, August 25, 2015

From Chaos to Control - Testing the resiliency of Netflix’s Content Discovery Platform

By: Leena Janardanan, Bruce Wobbe, Vilas Veeraraghavan

Introduction
Merchandising Application Platform (MAP) was conceived as a middle-tier service that would handle real time requests for content discovery. MAP does this by aggregating data from disparate data sources and implementing common business logic in one distinct layer. This centralized layer helps provide common experiences across device platforms and helps reduce duplicated, and sometimes inconsistent, business logic. In addition, it allows recommendation systems - which are typically pre-compute systems - to be de-coupled from the real time path. MAP can be compared to a big funnel through which most of the content discovery data on a user’s screen passes and is processed.
As an example, MAP generates localized row names for the personalized recommendations on the home page. This happens in real time, based on the locale of the user at the time the request is made. Similarly, application of maturity filters, localizing and sorting categories are examples of logic that lives in MAP.


Localized categories and row names, up-to-date My List and Continue Watching


A classic example of duplicated but inconsistent business logic that MAP consolidated was the “next episode” logic - the rule to determine if a particular episode was completed and the next episode should be shown. On one platform, an episode counted as completed once the credits had started and/or 95% of the episode had been watched. On another platform, it was simply that 90% of the episode had been watched. MAP consolidated this logic into one simple call that all devices now use.
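A single shared rule of this kind might look like the sketch below. The post does not say which rule MAP settled on, so the 95% threshold, the credits check, and the name `episode_completed` are assumptions chosen only to show the shape of one consolidated call.

```python
from typing import Optional

# Illustrative threshold only; the post does not state the consolidated rule.
COMPLETION_FRACTION = 0.95


def episode_completed(position_s: float, duration_s: float,
                      credits_start_s: Optional[float] = None) -> bool:
    """Return True when playback has reached the credits or passed the threshold."""
    if duration_s <= 0:
        return False
    if credits_start_s is not None and position_s >= credits_start_s:
        return True
    return position_s / duration_s >= COMPLETION_FRACTION
```

With one function behind one call, every device gets the same answer for the same playback position, which is the consistency the consolidation was after.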
MAP also enables discovery data to be a mix of pre-computed and real time data. On the homepage, rows like My List, Continue Watching and Trending Now are examples of real time data whereas rows like “Because you watched” are pre-computed. As an example, if a user added a title to My List on a mobile device and decided to watch the title on a Smart TV, the user would expect My List on the TV to be up-to-date immediately. What this requires is the ability to selectively update some data in real time. MAP provides the APIs and logic to detect if data has changed and update it as needed. This allows us to keep the efficiencies gained from pre-compute systems for most of the data, while also having the flexibility to keep other data fresh.
MAP also supports business logic required for various A/B tests, many of which are active on Netflix at any given time. Examples include: inserting non-personalized rows, changing the sort order for titles within a row and changing the contents of a row.
The services that generate this data are a mix of pre-compute and real time systems. Depending on the data, the calling patterns from devices for each type of data also vary. Some data is fetched once per session, some of it is pre-fetched when the user navigates the page/screen and other data is refreshed constantly (My List, Recently Watched, Trending Now).

Architecture

MAP is composed of two parts - a server and a client. The server is the workhorse which does all the data aggregation and applies business logic. This data is then stored in caches (see EVCache) that the client reads. The client primarily serves the data and is the home for resiliency logic. The client decides when a call to the server is taking too long, when to open a circuit (see Hystrix) and, if needed, what type of fallback should be served.




MAP is in the critical path of content discovery. Without a well thought out resiliency story, any failures in MAP would severely impact the user experience and Netflix's availability. As a result, we spend a lot of time thinking about how to make MAP resilient.

Challenges in making MAP resilient
Two approaches commonly used by MAP to improve resiliency are:
(1) Implementing fallback responses for failure scenarios
(2) Load shedding - either by opening circuits to the downstream services or by limiting retries wherever possible.
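To make the two approaches concrete, here is a minimal sketch of a circuit breaker that serves a fallback and stops calling a failing dependency. It is not Hystrix's API or MAP's implementation; the class name, the consecutive-failure threshold, and the cool-down behavior are simplifying assumptions.

```python
import time


class CircuitBreaker:
    """Open after consecutive failures; allow a trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after_s=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        # Once the cool-down has passed, let a trial call through.
        return self.clock() - self.opened_at < self.reset_after_s

    def call(self, primary, fallback):
        if self.is_open():
            return fallback()  # shed load: do not touch the dependency
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get the fallback immediately and the struggling downstream service receives no traffic, which covers both the fallback and the load shedding goals in one place.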

There are a number of factors that make it challenging to make MAP resilient: 
(1) MAP has numerous dependencies, which translates to multiple points of failure. In addition, the behavior of these dependencies evolves over time, especially as A/B tests are launched, and a solution that works today may not do so in 6 months. At some level, this is a game of Whack-A-Mole as we try to keep up with a constantly changing ecosystem.

(2) There is no one type of fallback that works for all scenarios:
    • In some cases, an empty response is the only option and devices have to be able to handle that gracefully. E.g. data for the "My List" row couldn't be retrieved.
    • Various degraded modes of performance can be supported. E.g. if the latest personalized home page cannot be delivered, fallbacks can range from stale, personalized recommendations to non-personalized recommendations.
    • In other cases, an exception/error code might be the right response, indicating to clients there is a problem and giving them the ability to adapt the user experience - skip a step in a workflow, request different data, etc.
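The range of fallbacks described above can be pictured as an ordered chain that degrades step by step and only raises an error when nothing is left. The sketch below is illustrative; the loader functions, row names, and `FallbackExhausted` are hypothetical stand-ins, not MAP code.

```python
class FallbackExhausted(Exception):
    """Raised when no source could produce a response."""


def first_available(sources):
    """Try each (label, loader) in order; return the first one that succeeds."""
    for label, loader in sources:
        try:
            return label, loader()
        except Exception:
            continue  # degrade to the next, less specific source
    raise FallbackExhausted("no source produced a response")


# Hypothetical loaders standing in for real services and caches.
def fresh_personalized():
    raise TimeoutError("recommendation service timed out")


def stale_personalized():
    raise KeyError("no cached page for this user")


def non_personalized():
    return ["Popular on Netflix", "Trending Now"]


label, rows = first_available([
    ("fresh", fresh_personalized),
    ("stale", stale_personalized),
    ("non-personalized", non_personalized),
])
```

Appending `("empty", list)` to the chain would model the case where an empty response is the only option, while leaving it out lets the exception signal the problem to clients.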

How do we go from Chaos to Control?

Early on, failures in MAP or its dependent services caused SPS dips like this:


It was clear that we needed to make MAP more resilient. The first question to answer was - what does resiliency mean for MAP? It came down to these expectations:
(1) Ensure an acceptable user experience during a MAP failure, e.g. that the user can browse our selection and continue to play videos
(2) Services that depend on MAP, i.e. the API service and device platforms, are not impacted by a MAP failure and continue to provide uninterrupted service
(3) Services that MAP depends on are not overwhelmed by excessive load from MAP

It is easy enough to identify obvious points of failure. For example - if a service provides data X, we could ensure that MAP has a fallback for data X being unavailable. What is harder is knowing the impact of failures in multiple services - different combinations of them - and the impact of higher latencies.

This is where the Latency Monkey and FIT come in. Running Latency Monkey in our production environment allows us to detect problems caused by latent services. With Latency Monkey testing, we have been able to fix incorrect behaviors and fine tune various parameters on the backend services like:
(1) Timeouts for various calls
(2) Thresholds for opening circuits via Hystrix
(3) Fallbacks for certain use cases
(4) Thread pool settings

FIT, on the other hand, allows us to simulate specific failures. We restrict the scope of failures to a few test accounts. This allows us to validate fallbacks as well as the user experience. Using FIT, we are able to sever connections with:
(1) Cache that handles MAP reads and writes 
(2) Dependencies that MAP interfaces with
(3) MAP service itself




What does control look like?


In a successful run of FIT or Chaos Monkey, this is what the metrics look like now:
Total requests served by MAP before and during the test (no impact)

MAP successful fallbacks during the test (high fallback rate)


On a lighter note, our failure simulations uncovered some interesting user experience issues, which have since been fixed.

  1. Simulating failures in all the dependent services of MAP server caused an odd data mismatch to happen:
The Avengers shows graphic for Peaky Blinders


  2. Severing connections to MAP server and the cache caused these duplicate titles to be served:


  3. When the cache was made unavailable mid session, some rows looked like this:


  4. Simulating a failure in the “My List” service caused the PS4 UI to be stuck on adding a title to My List:


In an ever evolving ecosystem of many dependent services, the future of resiliency testing resides in automation. We have taken small but significant steps this year towards making some of these FIT tests automated. The goal is to build these tests out so they run during every release and catch any regressions.


Looking ahead for MAP, there are many more problems to solve. How can we make MAP more performant? Will our caching strategy scale to the next X million customers? How do we enable faster innovation without impacting reliability? Stay tuned for updates!

Thursday, August 20, 2015

Fenzo: OSS Scheduler for Apache Mesos Frameworks

Bringing Netflix to our millions of subscribers is no easy task. The product comprises dozens of services in our distributed environment, each of which operates a critical component of the experience while constantly evolving with new functionality. Optimizing the launch of these services is essential for both the stability of the customer experience and overall performance and costs. To that end, we are happy to introduce Fenzo, an open source scheduler for Apache Mesos frameworks. Fenzo tightly manages the scheduling and resource assignments of these deployments.


Fenzo is now available in the Netflix OSS suite. Read on for more details about how Fenzo works and why we built it. For the impatient, you can find source code and docs on GitHub.

Why Fenzo?

Two main motivations for developing a new framework, as opposed to leveraging one of the many frameworks in the community, were to achieve scheduling optimizations and to be able to autoscale the cluster based on usage, both of which will be discussed in greater detail below. Fenzo enables frameworks to better manage ephemerality aspects that are unique to the cloud. Our use cases include a reactive stream processing system for real time operational insights and managing deployments of container based applications.


At Netflix, we see a large variation in the amount of data that our jobs process over the course of a day. Provisioning the cluster for peak usage, as is typical in data center environments, is wasteful. Also, systems may occasionally be inundated with interactive jobs from users responding to certain anomalous operational events. We need to take advantage of the cloud’s elasticity and scale the cluster up and down based on dynamic loads.


Although scaling up a cluster may seem relatively easy by watching, for example, the amount of available resources falling below a threshold, scaling down presents additional challenges. If the tasks are long lived and cannot be terminated without negative consequences, such as time consuming reconfiguration of stateful stream processing topologies, the scheduler will have to assign them such that all tasks on a host terminate at about the same time so the host can be terminated for scale down.

Scheduling Strategy

Scheduling tasks requires optimization of resource assignments to maximize the intended goals. When there are multiple resource assignments possible, picking one versus another can lead to significantly different outcomes in terms of scalability, performance, etc. As such, efficient assignment selection is a crucial aspect of a scheduler library. For example, picking assignments by evaluating every pending task with every available resource is computationally prohibitive.

Scheduling Model

Our design focused on large scale deployments with a heterogeneous mix of tasks and resources that have multiple constraints and optimization needs. If evaluating the optimal assignments takes a long time, it could create two problems:
  • resources become idle, waiting for new assignments
  • task launches experience increased latency


Fenzo adopts an approach that moves us quickly in the right direction as opposed to coming up with the most optimal set of scheduling assignments every time.


Conceptually, we think of each task as having an urgency factor that determines how soon it needs an assignment, and a fitness factor that determines how well it fits on a given host.
If the task is very urgent or if it fits very well on a given resource, we go ahead and assign that resource to the task. Otherwise, we keep the task pending until either urgency increases or we find another host with a larger fitness value.

Trading Off Scheduling Speed with Optimizations

Fenzo has knobs that let you trade off scheduling speed against assignment optimality dynamically. Fenzo employs a strategy of evaluating optimal assignments across multiple hosts, but only until a fitness value deemed “good enough” is obtained. While a user defined threshold for fitness being good enough controls the speed, a fitness evaluation plugin represents the optimality of assignments and the high level scheduling objectives for the cluster. A fitness calculator can be composed from multiple other fitness calculators, representing a multi-faceted objective.
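The "good enough" cutoff and the composition of calculators can be illustrated with a short sketch. Fenzo itself is a Java library and this is not its API; `pick_host`, `average_of`, the toy calculators, and the 0.0 to 1.0 score range are assumptions made for the example.

```python
def average_of(*calculators):
    """Compose several fitness calculators (each returning 0.0-1.0) into one."""
    def combined(task, host):
        return sum(calc(task, host) for calc in calculators) / len(calculators)
    return combined


def pick_host(task, hosts, fitness, good_enough=0.9):
    """Return (host, score, evaluated); stop early once a score is good enough."""
    best_host, best_score, evaluated = None, -1.0, 0
    for host in hosts:
        evaluated += 1
        score = fitness(task, host)
        if score > best_score:
            best_host, best_score = host, score
        if score >= good_enough:
            break  # trade optimality for speed
    return best_host, best_score, evaluated


# Toy calculators reading made-up host attributes.
def cpu_score(task, host):
    return host["cpu_score"]


def zone_score(task, host):
    return 1.0 if host["zone"] == task["zone"] else 0.0


task = {"zone": "a"}
hosts = [
    {"name": "h1", "cpu_score": 0.5, "zone": "b"},
    {"name": "h2", "cpu_score": 1.0, "zone": "a"},
    {"name": "h3", "cpu_score": 0.5, "zone": "a"},
]
fitness = average_of(cpu_score, zone_score)
host, score, evaluated = pick_host(task, hosts, fitness, good_enough=0.9)
```

Lowering `good_enough` makes the search stop sooner at the cost of possibly missing a better host; raising it above any reachable score forces every host to be evaluated.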

Task Constraints

Fenzo tasks can use optional soft or hard constraints to influence assignments to achieve locality with other tasks and/or affinity to resources. Soft constraints are satisfied on a best-effort basis and combine with the fitness calculator for scoring hosts for possible assignment. Hard constraints must be satisfied and act as a resource selection filter.

Fenzo provides all relevant cluster state information to the fitness calculators and constraints plugins so you can optimize assignments based on various aspects of jobs, resources, and time.
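The difference between the two kinds of constraints can be sketched as a filter followed by a scoring step. How Fenzo actually blends soft constraints with fitness is not described here, so the equal weighting, the function names, and the host attributes below are all assumptions.

```python
def eligible_hosts(task, hosts, hard_constraints):
    """Hard constraints act as a filter: a host must pass every one."""
    return [h for h in hosts if all(c(task, h) for c in hard_constraints)]


def score_host(task, host, fitness, soft_constraints, soft_weight=0.5):
    """Blend fitness with the fraction of soft constraints the host satisfies."""
    if not soft_constraints:
        return fitness(task, host)
    satisfied = sum(1 for c in soft_constraints if c(task, host))
    soft_score = satisfied / len(soft_constraints)
    return (1 - soft_weight) * fitness(task, host) + soft_weight * soft_score


# Hypothetical constraints and a constant fitness for illustration.
def in_required_zone(task, host):
    return host["zone"] == task["zone"]


def prefers_ssd(task, host):
    return host["ssd"]


def flat_fitness(task, host):
    return 0.5


task = {"zone": "us-east-1a"}
hosts = [
    {"name": "a", "zone": "us-east-1a", "ssd": False},
    {"name": "b", "zone": "us-east-1b", "ssd": True},
    {"name": "c", "zone": "us-east-1a", "ssd": True},
]
candidates = eligible_hosts(task, hosts, [in_required_zone])
best = max(candidates, key=lambda h: score_host(task, h, flat_fitness, [prefers_ssd]))
```

Host "b" is never scored because it fails the hard constraint, while the soft constraint only tips the choice between the remaining candidates.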

Bin Packing and Constraints Plugins

Fenzo currently has built-in fitness calculators for bin packing based on CPU, memory, or network bandwidth resources, or a combination of them.
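As a rough illustration of what a bin packing fitness calculator computes, the sketch below scores a host by how full it would be after taking the task, averaged over CPU and memory. It is not Fenzo's built-in calculator; the resource names, dictionary layout, and averaging are assumptions.

```python
RESOURCES = ("cpus", "memory_mb")


def fits(task, host):
    """True when the host has room for the task on every resource."""
    return all(host["used"][r] + task[r] <= host["total"][r] for r in RESOURCES)


def bin_packing_fitness(task, host):
    """Average post-assignment utilization; 0.0 if the task does not fit."""
    if not fits(task, host):
        return 0.0
    utilization = [(host["used"][r] + task[r]) / host["total"][r] for r in RESOURCES]
    return sum(utilization) / len(utilization)


task = {"cpus": 2, "memory_mb": 4096}
hosts = [
    {"name": "empty", "used": {"cpus": 0, "memory_mb": 0},
     "total": {"cpus": 8, "memory_mb": 32768}},
    {"name": "busy", "used": {"cpus": 4, "memory_mb": 16384},
     "total": {"cpus": 8, "memory_mb": 32768}},
    {"name": "full", "used": {"cpus": 7, "memory_mb": 30000},
     "total": {"cpus": 8, "memory_mb": 32768}},
]
best = max(hosts, key=lambda h: bin_packing_fitness(task, h))
```

Preferring the fullest host that still fits keeps other hosts empty, which is what later makes it possible to terminate them when scaling down.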


Some of the built-in constraints address common use cases of locality with respect to resource types, assigning distinct hosts to a set of tasks, balancing tasks across a given host attribute, such as the availability zone, rack location, etc.


You can customize fitness calculators and constraints by providing new plugins.

Cluster Autoscaling

Fenzo supports cluster autoscaling using two complementary strategies:
  • Thresholds based
  • Resource shortfall analysis based


Thresholds based autoscaling lets users specify rules per host group (e.g., EC2 Auto Scaling Group, ASG) being used in the cluster. For example, there may be one ASG created for compute intensive workloads using one EC2 instance type, and another for network intensive workloads. Each rule helps maintain a configured number of idle hosts available for launching new jobs quickly.
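A threshold rule of this kind reduces to comparing the current idle host count with a configured band. The sketch below is an assumption about the general shape, not Fenzo's rule interface; `ScalingRule`, `min_idle`, and `max_idle` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ScalingRule:
    group: str      # e.g. the name of an auto scaling group
    min_idle: int   # scale up when fewer hosts than this are idle
    max_idle: int   # scale down when more hosts than this are idle


def scaling_action(rule, idle_hosts):
    """Return how many hosts to add (positive) or remove (negative) for one group."""
    if idle_hosts < rule.min_idle:
        return rule.min_idle - idle_hosts
    if idle_hosts > rule.max_idle:
        return rule.max_idle - idle_hosts
    return 0


compute_rule = ScalingRule(group="compute-intensive", min_idle=2, max_idle=5)
```

Keeping a small band of idle hosts means new jobs can launch without waiting for an instance to boot, while the upper bound caps how much spare capacity is paid for.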


The resource shortfall analysis attempts to estimate the number of hosts required to satisfy the pending workload. This complements the rules based scale up during demand surges. Fenzo’s autoscaling also complements predictive autoscaling systems, such as Netflix’s Scryer.
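One simple way to picture a shortfall estimate is to total the pending demand per resource and divide by what one new host offers. This is a deliberately naive lower bound that ignores fragmentation and constraints, and it is not how Fenzo computes its estimate; `hosts_needed` and the resource names are hypothetical.

```python
import math


def hosts_needed(pending_tasks, host_capacity):
    """Lower-bound estimate: the scarcest resource decides how many hosts to add."""
    needed = 0
    for resource, per_host in host_capacity.items():
        demand = sum(task.get(resource, 0) for task in pending_tasks)
        needed = max(needed, math.ceil(demand / per_host))
    return needed


pending = [{"cpus": 4, "memory_mb": 8192} for _ in range(5)]
capacity = {"cpus": 8, "memory_mb": 32768}
estimate = hosts_needed(pending, capacity)
```

Here CPU is the binding resource: 20 CPUs of pending demand against 8 per host rounds up to three hosts, even though memory alone would need only two.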

Usage at Netflix

Fenzo is currently being used in two Mesos frameworks at Netflix for a variety of use cases including long running services and batch jobs. We have observed that the scheduler is fast at allocating resources with multiple constraints and custom fitness calculators. Also, Fenzo has allowed us to scale the cluster based on current demand instead of provisioning it for peak demand.


The table below shows the average and maximum times we have observed for each scheduling run in one of our clusters. Each scheduling run may attempt to assign resources to more than one task. The run time can vary depending on the number of tasks that need assignments, the number and types of constraints used by the tasks, and the number of hosts to choose resources from.


Scheduler run time in milliseconds
  Average: 2 ms
  Maximum: 38 ms (occasional spikes of about 30 ms)


The image below shows the number of Mesos slaves in the cluster going up and down as a result of Fenzo’s autoscaler actions over several days, representing about a 3x difference between the maximum and minimum counts.

Fenzo Usage in Mesos Frameworks

The simplified diagram above shows how Fenzo is used by an Apache Mesos framework. Fenzo’s task scheduler provides the scheduling core without interaction with Mesos itself. The framework interfaces with Mesos to get callbacks on new resource offers and task status updates. It also calls the Mesos driver to launch tasks based on Fenzo’s assignments.

Summary

Fenzo has been a great addition to our cloud platform. It gives us a high degree of control over work scheduling on Mesos, and has enabled us to strike a balance between machine efficiency and getting jobs running quickly. Out of the box Fenzo supports cluster autoscaling and bin packing. Custom schedulers can be implemented by writing your own plugins.


Source code is available on Netflix GitHub. The repository contains a sample framework that shows how to use Fenzo. Also, the JUnit tests show examples of various features including writing custom fitness calculators and constraints. The Fenzo wiki contains detailed documentation to get you started.

Monday, August 17, 2015

Netflix Releases Falcor Developer Preview

by Jafar Husain, Paul Taylor and Michael Paulson

Developers strive to create the illusion that all of their application’s data is sitting right there on the user’s device just waiting to be displayed. To make that experience a reality, data must be efficiently retrieved from the network and intelligently cached on the client.

That’s why Netflix created Falcor, a JavaScript library for efficient data fetching. Falcor powers Netflix’s mobile, desktop and TV applications.

Falcor lets you represent all your remote data sources as a single domain model via JSON Graph. Falcor makes it easy to access as much or as little of your model as you want, when you want it. You retrieve your data using familiar JavaScript operations like get, set, and call. If you know your data, you know your API.

You code the same way no matter where the data is, whether in memory on the client or over the network on the server. Falcor keeps your data in a single, coherent cache and manages stale data and cache pruning for you. Falcor automatically traverses references in your graph and makes requests as needed. It transparently handles all network communications, opportunistically batching and de-duping requests.
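To illustrate what traversing references in a graph means, here is a small in-memory sketch modeled on JSON Graph's reference convention (an object with `$type` set to `"ref"` and a `value` path). It is not Falcor's API, it performs no network requests or caching, and `ref`, `is_ref`, `get_value`, and the sample data are hypothetical.

```python
def ref(*path):
    """Build a JSON Graph-style reference to another location in the graph."""
    return {"$type": "ref", "value": list(path)}


def is_ref(node):
    return isinstance(node, dict) and node.get("$type") == "ref"


def get_value(graph, path):
    """Walk the path from the root, following a reference whenever one is met."""
    node = graph
    for key in path:
        node = node[key]
        if is_ref(node):
            # Jump to the referenced location and continue from there.
            # Note: this sketch does not guard against reference cycles.
            node = get_value(graph, node["value"])
    return node


graph = {
    "titlesById": {"42": {"name": "House of Cards", "rating": 5}},
    "myList": {"0": ref("titlesById", "42")},
}
name = get_value(graph, ["myList", "0", "name"])
```

Because the list entry is a reference rather than a copy, each title lives in exactly one place in the graph, which is what allows a single coherent cache without duplicated, drifting data.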

Today, Netflix is unveiling a developer preview of Falcor:

Falcor is still under active development and we’ll be unveiling a roadmap soon. This developer preview includes a Node version of our Falcor Router not yet in production use.

We’re excited to start developing in the open and share this library with the community, and eager for your feedback and contributions.

For ongoing updates, follow Falcor on Twitter!

Wednesday, August 5, 2015

Making Netflix.com Faster

by Kristofer Baxter

Simply put, performance matters. We know members want to immediately start browsing or watching their favorite content and have found that faster startup leads to more satisfying usage. So, when building the long-awaited update to netflix.com, the Website UI Engineering team made startup performance a first tier priority.

The impact of this effort netted a 70% reduction in startup time, and was focused in three key areas:

  1. Server and Client Rendering
  2. Universal JavaScript
  3. JavaScript Payload Reductions

Server and Client Rendering

The netflix.com legacy website stack had a hard separation between server markup and client enhancement. This was primarily due to the different programming languages used in each part of our application. On the server, there was Java with Tomcat, Struts and Tiles. On the browser client, we enhanced server-generated markup with JavaScript, primarily via jQuery.

This separation led to undesirable results in our startup time. Every time a visitor came to any page on netflix.com our Java tier would generate the majority of the response needed for the entire page's lifetime and deliver it as HTML markup. Often, users would be waiting for the generation of markup for large parts of the page they would never visit.

Our new architecture renders only a small amount of the page's markup, bootstrapping the client view. We can easily change the amount of the total view the server generates, making it easy to see the positive or negative impact. The server requires less data to deliver a response and spends less time converting data into DOM elements. Once the client JavaScript has taken over, it can retrieve all additional data for the remainder of the current and future views of a session on demand. The large wins here were the reduction of processing time in the server, and the consolidation of the rendering into one language.

We find the flexibility afforded by server and client rendering allows us to make intelligent choices of what to request and render in the server and the client, leading to a faster startup and a smoother transition between views.

Universal JavaScript

In order to support identical rendering on the client and server, we needed to rethink our rendering pipeline. Our previous architecture's separation between the generation of markup on the server and the enhancement of it on the client had to be dropped.

Three large pain points shaped our new Node.js architecture:

  1. Context switching between languages was not ideal.
  2. Enhancement of markup required too much direct coupling between server-only code generating markup and the client-only code enhancing it.
  3. We’d rather generate all our markup using the same API.

There are many solutions to this problem that don't require Universal JavaScript, but we found this lesson was most appropriate: When there are two copies of the same thing, it's fairly easy for one to be slightly different than the other. Using Universal JavaScript means the rendering logic is simply passed down to the client.

Node.js and React.js are natural fits for this style of application. With Node.js and React.js, we can render from the server and subsequently render changes entirely on the client after the initial markup and React.js components have been transmitted to the browser. This flexibility allows for the application to render the exact same output independent of the location of the rendering. The hard separation is no longer present and it's far less likely for the server and client to be different than one another.

Without shared rendering logic we couldn't have realized the potential of rendering only what was necessary on startup and everything else as data became available.

Reduce JavaScript Payload Impact

Building rich interactive experiences on the web often translates into a large JavaScript payload for users. In our new architecture, we placed significant emphasis on pruning large dependencies we can knowingly replace with smaller modules and delivering JavaScript only applicable for the current visitor.

Many of the large dependencies we relied on in the legacy architecture didn't apply in the new one. We've replaced these dependencies in favor of newer, more efficient libraries. Replacing these libraries resulted in a much smaller JavaScript payload, meaning members need less JavaScript to start browsing. We know there is significant work remaining here, and we're actively working to trim our JavaScript payload down further.

Time To Interactive

In order to test and understand the impact of our choices, we monitor a metric we call time to interactive (TTI).

Amount of time spent between first known startup of the application platform and when the UI is interactive regardless of view. Note that this does not require that the UI is done loading, but is the first point at which the customer can interact with the UI using an input device.

For applications running inside a web browser, this data is easily retrievable from the Navigation Timing API (where supported).

Work is Ongoing

We firmly believe high performance is not an optional engineering goal – it's a requirement for creating great user-experiences. We have made significant strides in startup performance, and are committed to challenging our industry’s best-practices in the pursuit of a better experience for our members.

Over the coming months we'll be investigating Service Workers, asm.js, WebAssembly, and other emerging web standards to see if we can leverage them for a more performant website experience. If you’re interested in helping create and shape the next generation of performant web user-experiences, apply here.