Monday, November 30, 2015

Linux Performance Analysis in 60,000 Milliseconds

You login to a Linux server with a performance issue: what do you check in the first minute?

At Netflix we have a massive EC2 Linux cloud, and numerous performance analysis tools to monitor and investigate its performance. These include Atlas for cloud-wide monitoring, and Vector for on-demand instance analysis. While those tools help us solve most issues, we sometimes need to login to an instance and run some standard Linux performance tools.

In this post, the Netflix Performance Engineering team will show you the first 60 seconds of an optimized performance investigation at the command line, using standard Linux tools you should have available.

First 60 Seconds: Summary

In 60 seconds you can get a high level idea of system resource usage and running processes by running the following ten commands. Look for errors and saturation metrics, as they are both easy to interpret, and then resource utilization. Saturation is where a resource has more load than it can handle, and can be exposed either as the length of a request queue, or time spent waiting.
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
Some of these commands require the sysstat package installed. The metrics these commands expose will help you complete some of the USE Method: a methodology for locating performance bottlenecks. This involves checking utilization, saturation, and error metrics for all resources (CPUs, memory, disks, e.t.c.). Also pay attention to when you have checked and exonerated a resource, as by process of elimination this narrows the targets to study, and directs any follow on investigation.

The following sections summarize these commands, with examples from a production system. For more information about these tools, see their man pages.

1. uptime

$ uptime
 23:51:26 up 21:31,  1 user,  load average: 30.02, 26.43, 19.02
This is a quick way to view the load averages, which indicate the number of tasks (processes) wanting to run. On Linux systems, these numbers include processes wanting to run on CPU, as well as processes blocked in uninterruptible I/O (usually disk I/O). This gives a high level idea of resource load (or demand), but can’t be properly understood without other tools. Worth a quick look only.

The three numbers are exponentially damped moving sum averages with a 1 minute, 5 minute, and 15 minute constant. The three numbers give us some idea of how load is changing over time. For example, if you’ve been asked to check a problem server, and the 1 minute value is much lower than the 15 minute value, then you might have logged in too late and missed the issue.

In the example above, the load averages show a recent increase, hitting 30 for the 1 minute value, compared to 19 for the 15 minute value. That the numbers are this large means a lot of something: probably CPU demand; vmstat or mpstat will confirm, which are commands 3 and 4 in this sequence.

2. dmesg | tail

$ dmesg | tail
[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child
[1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB
[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request.  Check SNMP counters.
This views the last 10 system messages, if there are any. Look for errors that can cause performance issues. The example above includes the oom-killer, and TCP dropping a request.

Don’t miss this step! dmesg is always worth checking.

3. vmstat 1

$ vmstat 1
procs ---------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
34  0    0 200889792  73708 591828    0    0     0     5    6   10 96  1  3  0  0
32  0    0 200889920  73708 591860    0    0     0   592 13284 4282 98  1  1  0  0
32  0    0 200890112  73708 591860    0    0     0     0 9501 2154 99  1  0  0  0
32  0    0 200889568  73712 591856    0    0     0    48 11900 2459 99  0  0  0  0
32  0    0 200890208  73712 591860    0    0     0     0 15898 4840 98  1  1  0  0
Short for virtual memory stat, vmstat(8) is a commonly available tool (first created for BSD decades ago). It prints a summary of key server statistics on each line.

vmstat was run with an argument of 1, to print one second summaries. The first line of output (in this version of vmstat) has some columns that show the average since boot, instead of the previous second. For now, skip the first line, unless you want to learn and remember which column is which.

Columns to check:
  • r: Number of processes running on CPU and waiting for a turn. This provides a better signal than load averages for determining CPU saturation, as it does not include I/O. To interpret: an “r” value greater than the CPU count is saturation.
  • free: Free memory in kilobytes. If there are too many digits to count, you have enough free memory. The “free -m” command, included as command 7, better explains the state of free memory.
  • si, so: Swap-ins and swap-outs. If these are non-zero, you’re out of memory.
  • us, sy, id, wa, st: These are breakdowns of CPU time, on average across all CPUs. They are user time, system time (kernel), idle, wait I/O, and stolen time (by other guests, or with Xen, the guest's own isolated driver domain).
The CPU time breakdowns will confirm if the CPUs are busy, by adding user + system time. A constant degree of wait I/O points to a disk bottleneck; this is where the CPUs are idle, because tasks are blocked waiting for pending disk I/O. You can treat wait I/O as another form of CPU idle, one that gives a clue as to why they are idle.

System time is necessary for I/O processing. A high system time average, over 20%, can be interesting to explore further: perhaps the kernel is processing the I/O inefficiently.

In the above example, CPU time is almost entirely in user-level, pointing to application level usage instead. The CPUs are also well over 90% utilized on average. This isn’t necessarily a problem; check for the degree of saturation using the “r” column.

4. mpstat -P ALL 1

$ mpstat -P ALL 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_ (32 CPU)

07:38:49 PM  CPU   %usr  %nice   %sys %iowait   %irq  %soft  %steal  %guest  %gnice  %idle
07:38:50 PM  all  98.47   0.00   0.75    0.00   0.00   0.00    0.00    0.00    0.00   0.78
07:38:50 PM    0  96.04   0.00   2.97    0.00   0.00   0.00    0.00    0.00    0.00   0.99
07:38:50 PM    1  97.00   0.00   1.00    0.00   0.00   0.00    0.00    0.00    0.00   2.00
07:38:50 PM    2  98.00   0.00   1.00    0.00   0.00   0.00    0.00    0.00    0.00   1.00
07:38:50 PM    3  96.97   0.00   0.00    0.00   0.00   0.00    0.00    0.00    0.00   3.03
This command prints CPU time breakdowns per CPU, which can be used to check for an imbalance. A single hot CPU can be evidence of a single-threaded application.

5. pidstat 1

$ pidstat 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015    _x86_64_    (32 CPU)

07:41:02 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:41:03 PM     0         9    0.00    0.94    0.00    0.94     1  rcuos/0
07:41:03 PM     0      4214    5.66    5.66    0.00   11.32    15  mesos-slave
07:41:03 PM     0      4354    0.94    0.94    0.00    1.89     8  java
07:41:03 PM     0      6521 1596.23    1.89    0.00 1598.11    27  java
07:41:03 PM     0      6564 1571.70    7.55    0.00 1579.25    28  java
07:41:03 PM 60004     60154    0.94    4.72    0.00    5.66     9  pidstat

07:41:03 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:41:04 PM     0      4214    6.00    2.00    0.00    8.00    15  mesos-slave
07:41:04 PM     0      6521 1590.00    1.00    0.00 1591.00    27  java
07:41:04 PM     0      6564 1573.00   10.00    0.00 1583.00    28  java
07:41:04 PM   108      6718    1.00    0.00    0.00    1.00     0  snmp-pass
07:41:04 PM 60004     60154    1.00    4.00    0.00    5.00     9  pidstat
Pidstat is a little like top’s per-process summary, but prints a rolling summary instead of clearing the screen. This can be useful for watching patterns over time, and also recording what you saw (copy-n-paste) into a record of your investigation.

The above example identifies two java processes as responsible for consuming CPU. The %CPU column is the total across all CPUs; 1591% shows that that java processes is consuming almost 16 CPUs.

6. iostat -xz 1

$ iostat -xz 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_ (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          73.96    0.00    3.73    0.03    0.06   22.21

Device:   rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda        0.00     0.23    0.21    0.18     4.52     2.08    34.37     0.00    9.98   13.80    5.42   2.44   0.09
xvdb        0.01     0.00    1.02    8.94   127.97   598.53   145.79     0.00    0.43    1.78    0.28   0.25   0.25
xvdc        0.01     0.00    1.02    8.86   127.79   595.94   146.50     0.00    0.45    1.82    0.30   0.27   0.26
dm-0        0.00     0.00    0.69    2.32    10.47    31.69    28.01     0.01    3.23    0.71    3.98   0.13   0.04
dm-1        0.00     0.00    0.00    0.94     0.01     3.78     8.00     0.33  345.84    0.04  346.81   0.01   0.00
dm-2        0.00     0.00    0.09    0.07     1.35     0.36    22.50     0.00    2.55    0.23    5.62   1.78   0.03
This is a great tool for understanding block devices (disks), both the workload applied and the resulting performance. Look for:
  • r/s, w/s, rkB/s, wkB/s: These are the delivered reads, writes, read Kbytes, and write Kbytes per second to the device. Use these for workload characterization. A performance problem may simply be due to an excessive load applied.
  • await: The average time for the I/O in milliseconds. This is the time that the application suffers, as it includes both time queued and time being serviced. Larger than expected average times can be an indicator of device saturation, or device problems.
  • avgqu-sz: The average number of requests issued to the device. Values greater than 1 can be evidence of saturation (although devices can typically operate on requests in parallel, especially virtual devices which front multiple back-end disks.)
  • %util: Device utilization. This is really a busy percent, showing the time each second that the device was doing work. Values greater than 60% typically lead to poor performance (which should be seen in await), although it depends on the device. Values close to 100% usually indicate saturation.
If the storage device is a logical disk device fronting many back-end disks, then 100% utilization may just mean that some I/O is being processed 100% of the time, however, the back-end disks may be far from saturated, and may be able to handle much more work.

Bear in mind that poor performing disk I/O isn’t necessarily an application issue. Many techniques are typically used to perform I/O asynchronously, so that the application doesn’t block and suffer the latency directly (e.g., read-ahead for reads, and buffering for writes).

7. free -m

$ free -m
             total       used       free     shared    buffers     cached
Mem:        245998      24545     221453         83         59        541
-/+ buffers/cache:      23944     222053
Swap:            0          0          0
The right two columns show:
  • buffers: For the buffer cache, used for block device I/O.
  • cached: For the page cache, used by file systems.
We just want to check that these aren’t near-zero in size, which can lead to higher disk I/O (confirm using iostat), and worse performance. The above example looks fine, with many Mbytes in each.

The “-/+ buffers/cache” provides less confusing values for used and free memory. Linux uses free memory for the caches, but can reclaim it quickly if applications need it. So in a way the cached memory should be included in the free memory column, which this line does. There’s even a website, linuxatemyram, about this confusion.

It can be additionally confusing if ZFS on Linux is used, as we do for some services, as ZFS has its own file system cache that isn’t reflected properly by the free -m columns. It can appear that the system is low on free memory, when that memory is in fact available for use from the ZFS cache as needed.

8. sar -n DEV 1

$ sar -n DEV 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015     _x86_64_    (32 CPU)

12:16:48 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
12:16:49 AM      eth0  18763.00   5032.00  20686.42    478.30      0.00      0.00      0.00      0.00
12:16:49 AM        lo     14.00     14.00      1.36      1.36      0.00      0.00      0.00      0.00
12:16:49 AM   docker0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

12:16:49 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
12:16:50 AM      eth0  19763.00   5101.00  21999.10    482.56      0.00      0.00      0.00      0.00
12:16:50 AM        lo     20.00     20.00      3.25      3.25      0.00      0.00      0.00      0.00
12:16:50 AM   docker0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Use this tool to check network interface throughput: rxkB/s and txkB/s, as a measure of workload, and also to check if any limit has been reached. In the above example, eth0 receive is reaching 22 Mbytes/s, which is 176 Mbits/sec (well under, say, a 1 Gbit/sec limit).

This version also has %ifutil for device utilization (max of both directions for full duplex), which is something we also use Brendan’s nicstat tool to measure. And like with nicstat, this is hard to get right, and seems to not be working in this example (0.00).

9. sar -n TCP,ETCP 1

$ sar -n TCP,ETCP 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015    _x86_64_    (32 CPU)

12:17:19 AM  active/s passive/s    iseg/s    oseg/s
12:17:20 AM      1.00      0.00  10233.00  18846.00

12:17:19 AM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
12:17:20 AM      0.00      0.00      0.00      0.00      0.00

12:17:20 AM  active/s passive/s    iseg/s    oseg/s
12:17:21 AM      1.00      0.00   8359.00   6039.00

12:17:20 AM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
12:17:21 AM      0.00      0.00      0.00      0.00      0.00
This is a summarized view of some key TCP metrics. These include:
  • active/s: Number of locally-initiated TCP connections per second (e.g., via connect()).
  • passive/s: Number of remotely-initiated TCP connections per second (e.g., via accept()).
  • retrans/s: Number of TCP retransmits per second.
The active and passive counts are often useful as a rough measure of server load: number of new accepted connections (passive), and number of downstream connections (active). It might help to think of active as outbound, and passive as inbound, but this isn’t strictly true (e.g., consider a localhost to localhost connection).

Retransmits are a sign of a network or server issue; it may be an unreliable network (e.g., the public Internet), or it may be due a server being overloaded and dropping packets. The example above shows just one new TCP connection per-second.

10. top

$ top
top - 00:15:40 up 21:56,  1 user,  load average: 31.09, 29.87, 29.92
Tasks: 871 total,   1 running, 868 sleeping,   0 stopped,   2 zombie
%Cpu(s): 96.8 us,  0.4 sy,  0.0 ni,  2.7 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  25190241+total, 24921688 used, 22698073+free,    60448 buffers
KiB Swap:        0 total,        0 used,        0 free.   554208 cached Mem

 20248 root      20   0  0.227t 0.012t  18748 S  3090  5.2  29812:58 java
  4213 root      20   0 2722544  64640  44232 S  23.5  0.0 233:35.37 mesos-slave
 66128 titancl+  20   0   24344   2332   1172 R   1.0  0.0   0:00.07 top
  5235 root      20   0 38.227g 547004  49996 S   0.7  0.2   2:02.74 java
  4299 root      20   0 20.015g 2.682g  16836 S   0.3  1.1  33:14.42 java
     1 root      20   0   33620   2920   1496 S   0.0  0.0   0:03.82 init
     2 root      20   0       0      0      0 S   0.0  0.0   0:00.02 kthreadd
     3 root      20   0       0      0      0 S   0.0  0.0   0:05.35 ksoftirqd/0
     5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
     6 root      20   0       0      0      0 S   0.0  0.0   0:06.94 kworker/u256:0
     8 root      20   0       0      0      0 S   0.0  0.0   2:38.05 rcu_sched
The top command includes many of the metrics we checked earlier. It can be handy to run it to see if anything looks wildly different from the earlier commands, which would indicate that load is variable.

A downside to top is that it is harder to see patterns over time, which may be more clear in tools like vmstat and pidstat, which provide rolling output. Evidence of intermittent issues can also be lost if you don’t pause the output quick enough (Ctrl-S to pause, Ctrl-Q to continue), and the screen clears.

Follow-on Analysis

There are many more commands and methodologies you can apply to drill deeper. See Brendan’s Linux Performance Tools tutorial from Velocity 2015, which works through over 40 commands, covering observability, benchmarking, tuning, static performance tuning, profiling, and tracing.

Tackling system reliability and performance problems at web scale is one of our passions. If you would like to join us in tackling these kinds of challenges we are hiring!

Monday, November 23, 2015

Creating Your Own EC2 Spot Market -- Part 2

In Part 1 Creating Your Own EC2 Spot Market of this series, we explained how Netflix manages its EC2 footprint and how we take advantage of our daily peak of 12,000 unused instances which we named the “internal spot market.”  This sizeable trough has significantly improved our encoding throughput, and we are pursuing other benefits from this large pool of unused resources.   
The Encoding team went through two iterations of internal spot market implementations.  The initial approach was a simple schedule-based borrowing mechanism that was quickly deployed in June in the us-east AZ to reap immediate benefits.  We applied the experience we gained to influence the next iteration of the design based on real-time availability.  
The main challenge of using the spot instances effectively is handling the dynamic nature of our instance availability.  With correct timing, running spot instances is effectively free; when the timing is off, however, any EC2 usage is billed at the on-demand price.  In this post we will discuss how the real-time, availability-based internal spot market system works and efficiently uses the unused capacity
Benefits of Extra Capacity
The encoding system at Netflix is responsible for encoding master media source files into many different output formats and bitrates for all Netflix supported devices.  A typical workload is triggered by source delivery, and sometimes the encoding system receives an entire season of a show within moments.  By leveraging the internal spot market, we have measured the equivalent of a 210% increase in encoding capacity.  With the extra boost of computing resources, we have improved our ability to handle sudden influx work and to quickly reduce our of backlog.
In addition to the production environment, the encoding infrastructure maintains 40 “farms” for development and testing.  Each farm is a complete encoding system with 20+ micro-services that matches the capability and capacity of the production environment.  
Computing resources are continuously evaluated and redistributed based on workload.  With the boost of spot market instances, the total encoding throughput increases significantly.  On the R&D side, researchers leverage these extra resources to carry out experiments in a fraction of the time it used to take.  Our QA automation is able to broaden the coverage of our comprehensive suite of continuous integration and run these jobs in less time.
Spot Market Borrowing in Action
We started the new spot market system in October, and we are encouraged by the improved performance compared to our borrowing in the first iteration.
For instance, in one of the research projects, we triggered 12,000 video encoding jobs over a weekend.  We had anticipated the work to finish in a few days, but we were pleasantly surprised to discover that the jobs were completed in only 18 hours.
The following graph captures that weekend’s activity.
The Y-axis denotes the amount of video encoder jobs queued in the messaging system, the red line represents high priority jobs, and the yellow area graph shows the amount of medium and low priority jobs.
Important Considerations
  • By launching on-demand instances in the Encoding team AWS account, the Encoding team never impacts guaranteed capacity (reserved instances) from the main Netflix account.
  • The Encoding team competes for on-demand instances with other Netflix accounts.   
  • Spot instance availability fluctuates and can become unavailable at any moment.  The encoding service needs to react to these changes swiftly.
  • It is possible to dip into unplanned on-demand usage due to sudden surge of instance usage in other Netflix accounts while we have internal spot instances running.  The benefits of borrowing must significantly outweigh the cost of these on-demand charges.
  • Available spot capacity comes in different types and sizes.  We can make the most out of them by making our jobs instance type agnostic.
Design Goals
Cost Effectiveness: Use as many spot instances as are available.  Incur as little unplanned on-demand usage as possible.
Good Citizenship: We want to minimize contention that may cause a shortage in the on-demand pool.  We take a light-handed approach by yielding spot instances to other Netflix accounts when there is competition on resources.
Automation: The Encoding team invests heavily in automation.  The encoding system is responsible for encoding activities for the entire Netflix catalog 24x7, hands free.  Spot market borrowing needs to function continuously and autonomously.
Metrics: Collect Atlas metrics to measure effectiveness, pinpoint areas of inefficiency, and trend usage patterns in near real-time.
Key Borrowing Strategies
We spend a great deal of the effort devising strategies to address the goals of Cost Effectiveness and Good Citizenship.  We started with a set of simple assumptions, and then constantly iterated using our monitoring system, allowing us to validate and fine tune the initial design to the following set of strategies below:
Real-time Availability Based Borrowing: Closely align utilization based on the fluctuating real-time spot instance availability using a Spinnaker API.  Spinnaker is a Continuous Delivery Platform that manages Netflix reservations and deployment pipelines.  It is in the optimal  position to know what instances are in use across all Netflix accounts.
Negative Surplus Monitor: Sample spot market availability, and quickly terminate (yield) borrowed instances when we detect overdraft of internal spot instances.  It enforces that our spot borrowing is treated as the lowest priority usage in the company and leads to reduced on-demand contention.
Idle Instance Detection: Detect over-allocated spot instances.  Accelerate down scaling of spot instances to improve time to release, with an additional benefit of reducing borrowing surface area.
Fair Distribution: When spot instances are abundant, distribute assignment evenly to avoid exhausting one EC2 instance type on a specific AZ.  This helps minimize on-demand shortage and contention while reducing involuntary churn due to negative surplus.
Smoothing Function: The resource scheduler evaluates assignments of EC2 instances based on a smoothed representation of workload, smoothing out jitters and spikes to prevent over-reaction.
Incremental Stepping & Fast Evaluation Cycle: Acting in incremental steps avoids over-reaction and allows us to evaluate the workload frequently for rapid self correction.  Incremental stepping also helps distribute instance usage across instance types and AZ more evenly.
Safety Margin: Reduce contention by leaving some amount of available spot instances unused.  It helps reduce involuntary termination due to minor fluctuations in usage in other Netflix accounts.
Curfew: Preemptively reduce spot usage before a predictable pattern of negative surplus inflection that drops rapidly (e.g. Nightly Netflix personal recommendation computation schedule). These curfews help minimize preventable on-demand charges.
Evacuation Monitor: A system-wide toggle to immediately evacuate all borrowing usage in case of emergency (e.g. regional traffic failover).  Eliminate on-demand contention in case of emergency.
The following graph depicts a five day span on spot usage by instance type.
This graph illustrates a few interesting points:
  • The variance in color represents different instance types in use, and in most cases the relatively even distribution of bands of color shows that instance type usage is reasonably balanced.
  • The sharp rise and drop of the peaks confirms that the encoding resource manager scales up and down relatively quickly in response to changes in workload.
  • The flat valleys show the frugality of instance usage. Spot instances are only used when there is work for them to do.
  • Not all color bands have the same height because the size of the reservation varies between instance types.  However, we are able to borrow from both large (orange) and small (green) pools, collectively satisfying the entire workload.
  • Finally, although this graph reports instance usage, it indirectly tracks the workload.  The overall shape of the graphs shows that there is no discernible pattern of the workload, such is the event driven nature of the encoding activities.
Based on the AWS billing data from October, we summed up all the borrowing hours and adjusted them relative to the r3.4xlarge instance type that makes up the Encoding reserved capacity.  With the addition of spot market instances, the effective encoding capacity increased by 210%.
Dark blue denotes spot market borrowing, and light blue represents on-demand usage.
On-demand pricing is multiple times more expensive than reserved instances, and it varies depending on instance type.  We took the October spot market usage and calculated what it would have cost with purely on-demand pricing and computed a 92% cost efficiency.
Lessons Learned
On-demand is Expensive: We already knew this fact, but the idea sinks in once we observed on-demand charges as a result of sudden overdrafts of spot usage.  A number of the strategies (e.g. Safety Margin, Curfew) listed in the above section were devised to specifically mitigate this occurrence.
Versatility: Video encoding represents 70% of our computing needs.  We made some tweaks to the video encoder to run on a much wider selection of instance types.  As a result, we were able to leverage a vast number of spot market instances during different parts of the day.
Tolerance to Interruption: The encoding system is built to withstand interruptions. This attribute works well with the internal spot market since instances can be terminated at any time.
Next Steps
Although the current spot market borrowing system is a notable improvement over the previous attempt, we are uncovering the tip of the iceberg.  In the future, we want to leverage spot market instances from different EC2 regions as they become available.  We are also heavily investing in the next generation of encoding architecture that scales more efficiently and responsively.  Here are some ideas we are exploring:
Cross Region Utilization: By borrowing from multiple EC2 regions, we triple the access to unused reservations from the current usable pool.  Using multiple regions also significantly reduces concentration of on-demand usages in a single EC2 region.
Containerization: The current encoding system is based on ASG scaling.  We are actively investing in the next generation of our encoding infrastructure using container technology.  The container model will reduce overhead in ASG scaling, minimize overhead of churning, and increase performance and throughput as Netflix continues to grow its catalog.
Resource Broker: The current borrowing system is monopolistic in that it assumes the Encoding service is the sole borrower.  It is relatively easy to implement for one borrower.  We need to create a resource broker to better coordinate access to the spot surplus when sharing amongst multiple borrowers.
In the first month of deployment, we observed significant benefits in terms of performance and throughput.  We were successful in making use of Netflix idle capacity for production, research, and QA.  Our encoding velocity increased dramatically.  Experimental research turn-around time was drastically reduced.  A comprehensive full regression test finishes in half the time it used to take.  With a cost efficiency of 92%, the spot market is not completely free but it is worth the cost.
All of these benefits translate to faster research turnaround, improved playback quality, and ultimately a better member experience.

-- Media Cloud Engineering

Friday, November 20, 2015

Sleepy Puppy Extension for Burp Suite

Netflix recently open sourced Sleepy Puppy - a cross-site scripting (XSS) payload management framework for security assessments. One of the most frequently requested features for Sleepy Puppy has been for an extension for Burp Suite, an integrated platform for web application security testing. Today, we are pleased to open source a Burp extension that allows security engineers to simplify the process of injecting payloads from Sleep Puppy and then tracking the XSS propagation over longer periods of time and over multiple assessments.

Prerequisites and Configuration
First, you need to have a copy of Burp Suite running on your system. If you do not have a copy of Burp Suite, you can download/buy Burp Suite here. You also need a Sleepy Puppy instance running on a server. You can download Sleepy Puppy here. You can try out Sleepy Puppy using Docker. Detailed instructions on setup and configuration are available on the wiki page.

Once you have these prerequisites taken care of, please download the Burp extension here.

If the Sleepy Puppy server is running over HTTPS (which we would encourage), you need to inform the Burp JVM to trust the CA that signed your Sleepy Puppy server certificate. This can be done by importing the cert from Sleepy Puppy server into a keystore and then specifying the keystore location and passphrase while starting Burp Suite. Specific instructions include:

  • Visit your Sleepy Puppy server and export the certificate using Firefox in pem format
  • Import the cert in pem format into a keystore with the command below. keytool -import -file </path/to/cert.pem> -keystore sleepypuppy_truststore.jks -alias sleepypuppy
  • You can specify the truststore information for the plugin either as an environment variable or as a JVM option.
  • Set truststore info as environmental variables and start Burp as shown below: export SLEEPYPUPPY_TRUSTSTORE_LOCATION=</path/to/sleepypuppy_truststore.jks> export SLEEPYPUPPY_TRUSTSTORE_PASSWORD=<passphrase provided while creating the truststore using keytool command above> java -jar burp.jar
  • Set truststore info as part of the Burp startup command as shown below: java -DSLEEPYPUPPY_TRUSTSTORE_PASSWORD=</path/to/sleepypuppy_truststore.jks> -DSLEEPYPUPPY_TRUSTSTORE_PASSWORD=<passphrase provided while creating the truststore using keytool command above> -jar burp.jar
Now it is time to load the Sleepy Puppy extension and explore its functionality.

Using the Extension
Once you launch Burp and load up the Sleepy Puppy extension, you will be presented with the Sleepy Puppy tab.


This tab will allow you to leverage the capabilities of Burp Suite along with the Sleepy Puppy XSS Management framework to better manage XSS testing.

Some of the features provided by the extension include:

  • Create a new assessment or select an existing assessment
  • Add payloads to your assessment and the Sleepy Puppy server from the the extension
  • When an Active Scan is conducted against a site or URL, the XSS payloads from the selected Sleepy Puppy Assessment will be executed after Burp's built-in XSS payloads
  • In Burp Intruder, the Sleepy Puppy Extension can be chosen as the payload generator for XSS testing
  • In Burp Repeater, you can replace any value in an existing request with a Sleepy Puppy payload using the context menu
  • The Sleepy Puppy tab provides statistics about any payloads that have been triggered for the selected assessment

You can watch the Sleepy Puppy extension in action at youtube.

Interested in Contributing?
Feel free to reach out or submit pull requests if there’s anything else you’re looking for. We hope you’ll find Sleepy Puppy and the Burp extension as useful as we do!

by: Rudra Peram

Monday, November 16, 2015

Global Continuous Delivery with Spinnaker

After over a year of development and production use at Netflix, we’re excited to announce that our Continuous Delivery platform, Spinnaker, is available on GitHub. Spinnaker is an open source multi-cloud Continuous Delivery platform for releasing software changes with high velocity and confidence. Spinnaker is designed with pluggability in mind; the platform aims to make it easy to extend and enhance cloud deployment models. To create a truly extensible multi-cloud platform, the Spinnaker team partnered with Google, Microsoft and Pivotal to deliver out-of-the-box cluster management and deployment. As of today, Spinnaker can deploy to and manage clusters simultaneously across both AWS and Google Cloud Platform with full feature compatibility across both cloud providers. Spinnaker also features deploys to Cloud Foundry; support for its newest addition, Microsoft Azure, is actively underway.

If you’re familiar with Netflix’s Asgard, you’ll be in good hands. Spinnaker is the replacement for Asgard and builds upon many of its concepts.  There is no need for a migration from Asgard to Spinnaker as changes to AWS assets via Asgard are completely compatible with changes to those same assets via Spinnaker and vice versa. 

Continuous Delivery with Spinnaker

Spinnaker facilitates the creation of pipelines that represent a delivery process that can begin with the creation of some deployable asset (such as an machine image, Jar file, or Docker image) and end with a deployment. We looked at the ways various Netflix teams implemented continuous delivery to the cloud and generalized the building blocks of their delivery pipelines into configurable Stages that are composable into Pipelines. Pipelines can be triggered by the completion of a Jenkins Job, manually, via a cron expression, or even via other pipelines. Spinnaker comes with a number of stages, such as baking a machine image, deploying to a cloud provider, running a Jenkins Job, or manual judgement to name a few. Pipeline stages can be run in parallel or serially. 

Spinnaker Pipelines

Spinnaker also provides cluster management capabilities and provides deep visibility into an application’s cloud footprint. Via Spinnaker’s application view, you can resize, delete, disable, and even manually deploy new server groups using strategies like Blue-Green (or Red-Black as we call it at Netflix). You can create, edit, and destroy load balancers as well as security groups. 

Cluster Management in Spinnaker
Spinnaker is a collection of JVM-based services, fronted by a customizable AngularJS single-page application. The UI leverages a rich RESTful API exposed via a gateway service. 

You can find all the code for Spinnaker on GitHubThere are also installation instructions on how to setup and deploy Spinnaker from source as well as instructions for deploying Spinnaker from pre-existing images that Kenzan and Google have created. We’ve set up a Slack channel and we are committed to leveraging StackOverflow as a means for answering community questions. Issues, questions, and pull requests are welcome. 

Monday, November 9, 2015

Netflix Hack Day - Autumn 2015

Last week, we hosted our latest installment of Netflix Hack Day. As always, Hack Day is a way for our product development staff to get away from everyday work, to have fun, experiment, collaborate, and be creative.

The following video is an inside look at what our Hack Day event looks like:

Video credit: Sean Williams

This time, we had 75 hacks that were produced by about 200 engineers and designers (and even some from the legal team!). We’ve embedded some of our favorites below.  You can also see some of our past hacks in our posts for March 2015, Feb. 2014 & Aug. 2014

While we think these hacks are very cool and fun, they may never become part of the Netflix product, internal infrastructure, or otherwise be used beyond Hack Day.  We are posting them here publicly to simply share the spirit of the event.

Thanks to all of the hackers for putting together some incredible work in just 24 hours!

Netflix VHF
Watch Netflix on your Philco Predicta, the TV of tomorrow! We converted a 1950's era TV into a smart TV that runs Netflix.

Narcos: Plata O Plomo Video Game
A fun game based on the Netflix original series, Narcos.

Stream Possible
Netflix TV experience over a 3G cell connection for the non-broadband rich parts of the world.

Ok Netflix
Ok Netflix can find the exact scene in a movie or episode from listening to the dialog that comes from that scene.  Speak the dialog into Ok Netflix and Ok Netflix will do the rest, starting the right title in the right location.

Smart Channels
A way to watch themed collections of content that are personalized and also autoplay like serialized content.

And here are some pictures taken during the event.