Monday, December 31, 2012

A Closer Look At The Christmas Eve Outage

by Adrian Cockcroft

Netflix streaming was impacted on Christmas Eve 2012 by problems in the Amazon Web Services (AWS) Elastic Load Balancer (ELB) 
service that routes network traffic to the Netflix services supporting streaming. The postmortem report by AWS can be read here.

We apologize for the inconvenience and loss of service. We’d like to explain what happened and how we continue to invest in higher availability solutions.

Partial Outage

The problems at AWS caused a partial Netflix streaming outage that started at around 12:30 PM Pacific Time on December 24 and grew in scope later that afternoon. The outage primarily affected playback on TV connected devices in the US, Canada and Latin America. Our service in the UK, Ireland and Nordic countries was not impacted.

Netflix uses hundreds of ELBs. Each one supports a distinct service or a different version of a service and provides a network address that your Web browser or streaming device calls. Netflix streaming has been implemented on over a thousand different streaming devices over the last few years, and groups of similar devices tend to depend on specific ELBs. Requests from devices are passed by the ELB to the individual servers that run the many parts of the Netflix application. Out of hundreds of ELBs in use by Netflix, a handful failed, losing their ability to pass requests to the servers behind them. None of the other AWS services failed, so our applications continued to respond normally whenever the requests were able to get through.

The Netflix Web site remained up throughout the incident, supporting sign up of new customers and streaming to Macs and PCs, although at times with higher latency and a likelihood of needing to retry. Overall streaming playback via Macs and PCs was only slightly reduced from normal levels. A few devices also saw no impact at all, as those devices have an ELB configuration that kept running throughout the incident, providing normal playback levels.

At 12:24 PM Pacific Time on December 24 network traffic stopped on a few ELBs used by a limited number of streaming devices. At around 3:30 PM on December 24, network traffic stopped on additional ELBs used by game consoles, mobile and various other devices to start up and load lists of TV shows and movies. These ELBs were patched back into service by AWS at around 10:30 PM on Christmas Eve, so game consoles etc. were impacted for about seven hours. Most customers were fully able to use the service again at this point. Some additional ELB cleanup work continued until around 8 am on December 25th, when AWS finished restoring service to all the ELBs in use by Netflix, and all devices were streaming again.

Even though Netflix streaming for many devices was impacted, this wasn't an immediate blackout. Those devices that were already running Netflix when the ELB problems started were in many cases able to continue playing additional content.

Christmas Eve is traditionally a slow Netflix night as many members celebrate with families or spend Christmas Eve in other ways than watching TV shows or movies. We see significantly higher usage on Christmas Day and increased streaming rates continue until customers go back to work or school.  While we truly regret the inconvenience this outage caused our customers on Christmas Eve, we were also fortunate to have Netflix streaming fully restored before a much higher number of our customers would have been affected.

What Broke And What Should We Do About It

In its postmortem on the outage, AWS reports that data “was deleted by a maintenance process that was inadvertently run against the production ELB state data”. This caused data to be lost in the ELB service back end, which in turn caused the outage of a number of ELBs in the US-East region across all availability zones, starting at 12:24 PM on December 24.

The problem spread gradually, causing broader impact until “at 5:02 PM PST, the team disabled several of the ELB control plane workflows”.

The AWS team had to restore the missing state data from backups, which took all night. “By 5:40 AM PST ... the new ELB state data had been verified.” AWS has put safeguards in place against this particular failure, and also says “We are confident that we could recover ELB state data in a similar event significantly faster”.

Netflix is designed to handle failure of all or part of a single availability zone in a region as we run across three zones and operate with no loss of functionality on two.  We are working on ways of extending our resiliency to handle partial or complete regional outages.
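Running across three zones while retaining full functionality on two implies that every zone must carry standing headroom. A rough sketch of that arithmetic (the numbers are illustrative, not Netflix's actual traffic):

```javascript
// Headroom needed to survive the loss of one availability zone.
// With N zones serving total demand D, the surviving N-1 zones
// must absorb all of D, so each zone needs capacity >= D / (N - 1).
function perZoneCapacity(totalDemand, zones) {
  return totalDemand / (zones - 1);
}

// Illustrative numbers: three zones, 90,000 requests/sec total.
// Each zone must be provisioned for 45,000 req/s, which means it
// normally runs at about two-thirds utilization.
const needed = perZoneCapacity(90000, 3);           // 45000
const normalUtilization = (90000 / 3) / needed;     // ~0.667
```

The same arithmetic explains why regional failover is a harder problem: a whole standby region must hold far more idle capacity than a single spare zone.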

Previous AWS outages have mostly been at the availability zone level, and we’re proud of our track record in terms of uptime, including our ability to keep Netflix streaming running while other AWS hosted services are down.

Our strategy so far has been to isolate regions, so that outages in the US or Europe do not impact each other.

It is still early days for cloud innovation and there is certainly more to do in terms of building resiliency in the cloud.
In 2012 we started to investigate running Netflix in more than one AWS region and got a better gauge on the complexity and investment needed to make these changes.

We have plans to work on this in 2013. It is an interesting and hard problem to solve, since there is a lot more data that will need to be replicated over a wide area, and the systems involved in switching traffic between regions must be extremely reliable and capable of avoiding cascading overload failures. Naive approaches could have the downside of being more expensive and more complex, and could cause new problems that might make the service less reliable. Look for upcoming blog posts as we make progress in implementing regional resiliency.

As always, we are hiring the best engineers we can find to work on these problems, and are open sourcing the solutions we develop as part of our platform.

Happy New Year and best wishes for 2013.

Thursday, December 20, 2012

Building the Netflix UI for Wii U

Hello, my name is Joubert Nel and I’m a UI engineer on the TV UI team here at Netflix. Our team builds the Netflix experiences for hundreds of TV devices, like the PlayStation 3, Wii, Apple TV, and Google TV.

We recently launched on Nintendo’s new Wii U game console. Like other Netflix UIs, we present TV shows and movies we think you’ll enjoy in a clear and fast user interface. While this UI introduces the first Netflix 1080p browse UI for game consoles, it also expands on ideas pioneered elsewhere like second screen control.

Virtual WebKit Frame

Like many of our other device UIs, our Wii U experience is built for WebKit in HTML5. Since the Wii U has two screens, we created a Virtual WebKit Frame, which partitions the UI into one area that is output to TV and one area that is output to the GamePad.

This gives us the flexibility to vary what is rendered on each screen as the design dictates, while sharing application state and logic in a single JavaScript VM. We also have a safe zone between the TV and GamePad areas so we can animate elements off the edge of the TV without appearing on the GamePad.
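As a rough illustration, the partitioning can be thought of as a simple vertical layout model. This is a hypothetical sketch, not the actual Netflix implementation; the dimensions and names are assumptions (1080p TV output, the GamePad's 480-pixel-tall screen, and an arbitrary 100-pixel safe zone):

```javascript
// Hypothetical model of a single page partitioned into a TV region,
// a safe zone, and a GamePad region stacked vertically. One DOM and
// one JavaScript VM back both screens; the renderer crops each region
// to its own display.
function makeVirtualFrame(tvHeight, safeZoneHeight, gamePadHeight) {
  return {
    tv:      { top: 0, height: tvHeight },
    safe:    { top: tvHeight, height: safeZoneHeight },
    gamePad: { top: tvHeight + safeZoneHeight, height: gamePadHeight },
    totalHeight: tvHeight + safeZoneHeight + gamePadHeight,
  };
}

const frame = makeVirtualFrame(1080, 100, 480);
// Elements animated just past y = 1080 land in the safe zone rather
// than appearing on the GamePad, which only starts at y = 1180.
```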

We started off with common Netflix TV UI engineering performance practices such as view pooling and accelerated compositing. View pooling reuses DOM elements to minimize DOM churn, and Accelerated Compositing (AC) allows us to designate certain DOM elements to be cached as a bitmap and rendered by the Wii U’s GPU.
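A minimal sketch of the view pooling idea: recycle views through a free list instead of creating and destroying them. The class and names here are illustrative, not the actual Netflix code; in the real UI the factory would produce DOM elements:

```javascript
// Recycle views through a free list to minimize DOM churn. The
// factory is pluggable so the idea stands on its own; on device it
// would create DOM elements.
class ViewPool {
  constructor(factory) {
    this.factory = factory;
    this.free = [];
    this.created = 0;
  }
  acquire() {
    if (this.free.length > 0) return this.free.pop();
    this.created += 1;
    return this.factory();
  }
  release(view) {
    // In a real UI the view would be reset/detached here.
    this.free.push(view);
  }
}

const pool = new ViewPool(() => ({ kind: 'posterView' }));
const a = pool.acquire();  // pool creates a new view
pool.release(a);
const b = pool.acquire();  // pool hands back the recycled view
```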

In WebKit, each DOM node that produces visual output has a corresponding RenderObject, stored in the Render Tree. In turn, each RenderObject is associated with a RenderLayer. Some RenderLayers get backing surfaces when hardware acceleration is enabled. These layers are called compositing layers and they paint into their backing surfaces instead of the common bitmap that represents the entire page. Subsequently, the backing surfaces are composited onto the destination bitmap. The compositor applies transformations specified by the layer’s CSS -webkit-transform to the layer’s surface before compositing it. When a layer is invalidated, only its own content needs to be repainted and re-composited. If you’re interested in learning more, I suggest reading GPU Accelerated Compositing in Chrome.


After modifying the UI to take advantage of accelerated compositing, we found that the frame rate on device was still poor during vertical navigation, even though it rendered at 60fps in desktop browsers.

When the user browses up or down in the gallery, we animate 4 rows of poster art on TV and mirror those 4 rows on the GamePad. Preparing, positioning, and animating only 4 rows allows us to reduce (expensive) structural changes to the DOM while being able to display many logical rows and support wrapping. Each row maintains up to 14 posters, requiring us to move and scale a total of 112 images during each up or down navigation. Our UI’s posters are 284 x 405 pixels and eat up 460,080 bytes of texture memory each, regardless of file size. (You need 4 bytes to represent each pixel’s RGBA value when the image is decompressed in memory.)

Layout of poster art in the gallery

To improve performance, we tried a number of animation strategies, but none yielded sufficient gains. We knew that when we kicked off an animation, there was an expensive style recalculation. But the WebKit Layout & Rendering timeline didn’t help us figure out which DOM elements were responsible.

WebKit Layout & Rendering Timeline

We worked with our platform team to profile WebKit, and were then able to see how DOM elements relate to the Recalculate Style operations.

Our instrumentation helps us visualize the Recalculate Style call stack over time:
Instrumented Call Stack over Time

Through experimentation, we discovered that for our UI, there is a material performance gain when setting inline styles instead of modifying classes on elements that participate in vertical navigation.
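A hedged sketch of what that looks like: write the computed transform into the element's inline style rather than toggling a class. The element is stubbed with a plain object here, and the function name is illustrative:

```javascript
// Position a poster by writing an inline -webkit-transform directly,
// instead of the class changes that triggered expensive Recalculate
// Style operations. The style object is a stand-in; on device it
// would be a DOM element's .style.
function positionPoster(style, x, y, scale) {
  style.webkitTransform =
    'translate3d(' + x + 'px,' + y + 'px,0) scale(' + scale + ')';
}

const style = {};                  // stand-in for posterElement.style
positionPoster(style, 284, 0, 1.1);
// style.webkitTransform is now 'translate3d(284px,0px,0) scale(1.1)'
```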

We also found that some CSS selector patterns cause deep, expensive Recalculate Style operations. It turns out that the mere presence of the following pattern in CSS triggers a deep Recalculate Style:

.list-showing #browse { … }

Moreover, a -webkit-transition with duration greater than 0 causes the Recalculate Style operations to be repeated several times during the lifetime of the animation.
After removing all CSS selectors of this pattern, the resulting Recalculate Style shape is shallower and consumes less time.
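One hedged way to avoid that pattern is to scope the state class to the element itself so the match stays shallow. Whether this restructuring applies depends on where the state class is toggled; the selectors below reuse the names from the example above:

```css
/* Before: a descendant selector like this forces a deep
   Recalculate Style whenever the state class changes high
   in the tree. */
.list-showing #browse { … }

/* After: putting the state class on the element itself keeps
   the recalculation local to that element. */
#browse.list-showing { … }
```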

Delivering great experiences

Our team builds innovative UIs, experiments with new concepts using A/B testing, and continually delivers new features. We also have to make sure our UIs perform fast on a wide range of hardware, from inexpensive consumer electronics devices all the way up to more powerful devices like the Wii U and PS3.

If this kind of innovation excites you as much as it does me, join our team!

Monday, December 17, 2012

Complexity In The Digital Supply Chain

Netflix launched in Denmark, Norway, Sweden, and Finland on Oct. 15th. I just returned from a trip to Europe to review the content deliveries with European studios that prepared content for this launch.

This trip reinforced for me that today’s Digital Supply Chain for the streaming video industry is awash in accidental complexity. Fortunately the incentives to fix the supply chain are beginning to emerge. Netflix needs to innovate on the supply chain so that we can effectively increase licensing spending to create an outstanding member experience. The content owning studios need to innovate on the supply chain so that they can develop an effective, permanent, and growing sales channel for digital distribution customers like Netflix. Finally, post production houses have a fantastic opportunity to pivot their businesses to eliminate this complexity for their content owning customers.

Everyone loves Star Trek because it paints a picture of a future that many of us see as fantastic and hopefully inevitable. Warp factor 5 space travel, beamed transport over global distances, and automated food replicators all bring simplicity to the mundane aspects of living and free up the characters to pursue existence on a higher plane of intellectual pursuits and exploration.

The equivalent of Star Trek for the Digital Supply Chain is an online experience for content buyers where they browse available studio content catalogs and make selections for content to license on behalf of their consumers. Once an ‘order’ is completed on this system, the materials (video, audio, timed text, artwork, meta-data) flow into retailers’ systems automatically and out to customers in a short and predictable amount of time, 99% of the time. Eliminating today’s supply chain complexity will allow all of us to focus on continuing to innovate with production teams to bring amazing new experiences like 3D, 4K video, and many innovations not yet invented to our customers’ homes.

We are nowhere close to this supply chain today, but there are no fundamental technology barriers to building it. What I am describing is largely what has been available to consumers since 2007, when Netflix began streaming. If Netflix can build this experience for our customers, then conceivably the industry can collaborate to build the same thing for the supply chain. Given the level of cooperation needed, I predict it will take five to ten years to gain a shared set of motivations, standards, and engineering work to make this happen. Netflix, especially our Digital Supply Chain team, will be heavily involved due to our early scale in digital distribution.

To realize the construction of the Starship Enterprise, we need to innovate on two distinct but complementary tracks. They are:
  1. Materials quality: Video, audio, text, artwork, and descriptive meta data for all of the needed spoken languages
  2. B2B order and catalog management: Global online systems to track content orders and to curate content catalogs

Materials Quality
Netflix invested heavily in 2012 in making it easier to deliver high quality video, audio, text, artwork, and meta data to Netflix. We expanded our accepted video formats to include the de facto industry standard of Apple ProRes. We built a new team, Content Partner Operations, to engage content owners and post production houses and mentor their efforts to prepare content for Netflix.

The Content Partner Operations team also began to engage video and audio technology partners to include support for the file formats called out by the Netflix Delivery Specification in the equipment they provide to the industry to prepare and QC digital content. Throughout 2013 you will see the Netflix Delivery Specification supported by a growing list of those equipment manufacturers. Additionally, the Content Partner Operations team will establish a certification process for post production houses’ ability to prepare content for Netflix. Content owners that are new to Netflix delivery will be able to turn to any one of many post production houses certified to deliver to Netflix from all of our regions around the world.

Content owners’ ability to prepare content for Netflix varies considerably. Those content owners who perform the best are those who understand the lineage of all of the files they send to Netflix. Let me illustrate this ‘lineage’ reference with an example.

There is a movie available for Netflix streaming that was so magnificently filmed, it won an Oscar for Cinematography. It was filmed widescreen in a 2.20:1 aspect ratio, but it was available for streaming on Netflix in a modified 4:3 aspect ratio. How can this happen? I attribute this poor customer experience to an industry wide epidemic of ‘versionitis’. After this film was produced, it was released in many formats. It was released in theaters, mastered for Blu-ray, formatted for airplane in flight viewing, and formatted for the 4:3 televisions that prevailed in the era of this film. The creation of many versions of the film makes perfect sense, but versioning becomes versionitis when retailers like Netflix neglect to clearly specify which version they want and when content owners don’t have a good handle on which versions they have. The first delivery made to Netflix of this film must have been derived from the 4:3 broadcast television cut. Netflix QC initially missed this problem and we put this version up for our streaming customers. We eventually realized our error and requested a re-delivery from the content owner to receive this film in the original aspect ratio that the filmmakers intended for viewing the film. Versionitis from the initial delivery resulted in a poor customer experience, and then Netflix and the content owner incurred new and unplanned spending to execute new deliveries to fix the customer experience.

Our recent trip to Europe revealed that the common theme of those studios that struggled with delivery was versionitis. They were not sure which cut of video to deliver or if those cuts of video were aligned with language subtitle files for the content. The studios that performed the best have a well established digital archive that avoids versionitis. They know the lineage of all of their video sources and those video files’ alignment with their correlated subtitle files.

There is a link between content owner revenue and content owner delivery skill. Frequently Netflix finds itself looking for opportunities to grow its streaming catalogs quickly with budget dollars that have not yet been allocated. Increasingly the Netflix deal teams are considering the effectiveness of a content owner’s delivery abilities when making those spending decisions. Simply put, content owners who can deliver quickly and without error are getting more licensing revenue from Netflix than those content owners suffering from versionitis and the resulting delivery problems.

B2B order and catalog management
Today Netflix has a set of tools for managing content orders and curating our content catalogs. These tools are internal to our business and we currently engage the industry for delivery tracking through phone calls and emails containing spreadsheets of content data.

We can do a lot better than to engage the industry with spreadsheets attached to email. We will rectify this in the first half of 2013 with the release of the initial versions of our Content Partner Portal. The universal reaction to reviewing our Nordic launch with content owners was that we were showing them great data (timeliness, error rates, etc) about their deliveries but that they need to see such data much more frequently. The Content Partner Portal will allow all of these metrics to be shared in real time with content owner operations teams while the deliveries are happening. We also foresee that the Content Partner Portal will be used by the Netflix deal team to objectively assess the delivery performance of content owners when planning additional spending.

We also see a role for shared industry standards to help with delivery tracking and catalog curation. The EIDR initiative, for identifying content and versions of content, offers the potential for alignment across companies in the Digital Supply Chain. We are building the ability to label titles with EIDR into our new Content Partner Portal.

Final thoughts
Today’s supply chain is messy and not well suited to help companies in our industry to fully embrace the rapidly growing channel of internet streaming. We are a long way from the Starship Enterprise equivalent of the Digital Supply Chain but the growing global consumer demand for internet streaming clearly provides the incentive to invest together in modernizing the supply chain.

Netflix has many initiatives underway to innovate in developing the supply chain in 2013, some of which were discussed in this post, and we look forward to continuing to collaborate with our content owning partners on supply chain innovation efforts.

Netflix is hiring for open positions in our Digital Supply Chain team. Please visit to see our open positions. We also put together a short video about the supply chain for a recent job fair. Here is a link to that video.

Kevin McEntee
VP Digital Supply Chain

Tuesday, December 11, 2012

Hystrix Dashboard + Turbine Stream Aggregator

by Ben Christensen, Puneet Oberai and Ben Schmaus

Two weeks ago we introduced Hystrix, a library for engineering resilience into distributed systems. Today we're open sourcing the Hystrix dashboard application, as well as a new companion project called Turbine that provides low latency event stream aggregation.

The Hystrix dashboard has significantly improved our operations by reducing discovery and recovery times during operational events. The duration of most production incidents (already less frequent due to Hystrix) is far shorter, with diminished impact, because we are now able to get realtime insights (1-2 second latency) into system behavior.

The following snapshot shows six HystrixCommands being used by the Netflix API. Under the hood of this example dashboard, Turbine is aggregating data from 581 servers into a single stream of metrics supporting the dashboard application, which in turn streams the aggregated data to the browser for display in the UI.
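Turbine pushes that aggregate to consumers as a continuous event stream of JSON payloads. A minimal sketch of parsing one such server-sent event into a metrics object; the field names here are hypothetical, not the actual Hystrix/Turbine schema:

```javascript
// Parse one server-sent-event "data:" line from an aggregated
// metrics stream into an object. Field names are illustrative;
// Turbine sums counters like these across all reporting hosts.
function parseMetricsEvent(line) {
  if (!line.startsWith('data:')) return null;  // ignore non-data lines
  return JSON.parse(line.slice('data:'.length).trim());
}

// One aggregated sample, as if summed across a 581-server cluster.
const sample = parseMetricsEvent(
  'data: {"name":"GetVideoMetadata","requestCount":12034,' +
  '"errorCount":7,"reportingHosts":581}'
);
```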

When a circuit is failing, it changes colors (gradient from green through yellow, orange and red) such as this:

The diagram below shows one "circuit" from the dashboard along with explanations of what all of the data represents.

We've purposefully tried to pack a lot of information into the dashboard so that engineers can quickly consume and correlate data.

The following video shows the dashboard operating with data from a Netflix API cluster:

The Turbine deployment at Netflix connects to thousands of Hystrix-enabled servers and aggregates realtime streams from them. Netflix uses Turbine with a Eureka plugin that handles instances joining and leaving clusters (due to autoscaling, red/black deployments, or just being unhealthy).

Our alerting systems have also started migrating to Turbine-powered metrics streams, so that a single metric has dozens or hundreds of data points per minute. This high resolution of metrics data makes for better and faster alerting.

The Hystrix dashboard can be used either to monitor an individual instance without Turbine or in conjunction with Turbine to monitor multi-machine clusters:

Turbine can be found on Github at:

Dashboard documentation is at:

We expect people to want to customize the UI, so the JavaScript modules have been implemented in a way that lets them be easily used standalone in existing dashboards and applications. We also expect different perspectives on how to visualize and represent data and look forward to contributions back to both Hystrix and Turbine.

We are always looking for talented engineers so if you're interested in this type of work contact us via

Monday, December 10, 2012

Videos of the Netflix talks at AWS Re:Invent

by Adrian Cockcroft

Most of the talks and panel sessions at AWS Re:Invent were recorded, but there are so many sessions that it's hard to find the Netflix ones. Here's a link to all of the videos posted by AWS that mention Netflix:

They are presented below in what seems like a natural order that tells the Netflix story, starting with the migration and video encoding talks, then talking about availability, Cassandra based storage, "big data" and security architecture, ending up with operations and cost optimization. Unfortunately a talk on Chaos Monkey had technical issues with the recording and is not available.

Embracing the Cloud

Presented by Neil Hunt - Chief Product Officer, and Yury Izrailevsky - VP Cloud and Platform Engineering.

Join the product and cloud computing leaders of Netflix to discuss why and how the company moved to Amazon Web Services. From early experiments for media transcoding, to building the operational skills to optimize costs and the creation of the Simian Army, this session guides business leaders through real world examples of evaluating and adopting cloud computing.


Netflix's Encoding Transformation

Presented by Kevin McEntee, VP Digital Supply Chain.

Netflix designed a massive scale cloud based media transcoding system from scratch for processing professionally produced studio content. We bucked the common industry trend of vertical scaling and, instead, designed a horizontally scaled elastic system using AWS to meet the unique scale and time constraints of our business. Come hear how we designed this system, how it continues to get less expensive for Netflix, and how AWS represents a transformative opportunity in the wider media owning industry.


Highly Available Architecture at Netflix

Presented by Adrian Cockcroft (@adrianco) Director of Architecture

This talk describes a set of architectural patterns that support highly available services that are also scalable, low cost, low latency and allow agile continuous deployment development practices. The building blocks for these patterns have been released at as open source projects for others to use.


Optimizing Your Cassandra Database on AWS

Presented by Ruslan Meshenberg - Director of Cloud Platform Engineering and Gregg Ulrich - Cassandra DevOps Manager

For a service like Netflix, data is crucial. In this session, Netflix details how they chose and leveraged Cassandra, a highly-available and scalable open source key/value store. In this presentation they discuss why they chose Cassandra, the tools and processes they developed to quickly and safely move data into AWS without sacrificing availability or performance, and best practices that help Cassandra work well in AWS.

Data Science with Elastic Map Reduce

Presented by Kurt Brown - Director, Data Science Engineering Platform

In this talk, we dive into the Netflix Data Science & Engineering architecture. Not just the what, but also the why. Some key topics include the big data technologies we leverage (Cassandra, Hadoop, Pig + Python, and Hive), our use of Amazon S3 as our central data hub, our use of multiple persistent Amazon Elastic MapReduce (EMR) clusters, how we leverage the elasticity of AWS, our data science as a service approach, how we make our hybrid AWS / data center setup work well, and more.

Security Panel

Featuring Jason Chan, Director of Cloud Security Architecture.

Learn from fellow customers, including Jason Chan of Netflix, Khawaja Shams of NASA, and Rahul Sharma of Averail, who have leveraged the AWS secure platform to build business critical applications and services. During this panel discussion, our panelists share their experiences utilizing the AWS platform to operate some of the world’s largest and most critical applications.

How Netflix Operates Clouds for Maximum Freedom and Agility

Presented by Jeremy Edberg (@jedberg), Reliability Architect

In this session, learn how Netflix has embraced DevOps and leveraged all that Amazon has to offer to allow our developers maximum freedom and agility.

Optimizing Costs with AWS

Presented by Coburn Watson - Manager, Cloud Performance Engineering

Find out how Netflix, one of the largest, most well-known and satisfied AWS customers, develop and run their applications efficiently on AWS. The manager of the Netflix Cloud Performance Engineering team outlines a common-sense approach to effectively managing AWS usage costs while giving the engineers unconstrained operational freedom.

Intro to Chaos Monkey and the Simian Army

Presented by Ariel Tsetlin - Director of Cloud Solutions

Why were the monkeys created, what makes up the Simian Army, and how do we run and manage them in the production environment?

Unfortunately the video recording had technical problems.

In Closing...

We had a great time and enjoyed the opportunity to have a large number of Netflix executives, managers and architects tell the "Netflix in the Cloud" story in much more detail than usual. Hopefully this summary makes it easier to watch all our talks and follow that story.

Monday, December 3, 2012

AWS Re:Invent was Awesome!

by Adrian Cockcroft

There was a very strong Netflix presence at AWS Re:Invent in Las Vegas this week, from Reed Hastings appearing in the opening keynote, to a packed series of ten talks by Netflix management and engineers, and our very own expo booth. The event was a huge success: over 6,000 attendees, great new product and service announcements, and excellent organization. We are looking forward to doing it again next year.

Wednesday Morning Keynote

The opening keynote with Andy Jassy contains an exciting review of the Curiosity Mars landing showing how AWS was used to feed information and process images for the watching world. Immediately afterwards (at 36'40") Andy sits down with Reed Hastings.

Reed talks about taking inspiration from Nicholas Carr's book "the Big Switch" to realize that cloud would be the future, and over the last four years, Netflix has moved from initial investigation to having deployed about 95% of our capacity on AWS. By the end of next year Reed aims to be 100% on AWS and to be the biggest business entirely hosted on AWS apart from Amazon Retail. Streaming in 2008 was around a million hours a month; now it's over a billion hours a month. A thousandfold increase over four years is difficult to plan for, and while Netflix took the risk of being an early adopter of AWS in 2009, we were avoiding a bigger risk of being unable to build out capacity for streaming ourselves. "The key is that now we're on a cost curve and an architecture... that as all of this room does more with AWS we benefit, by that collective effect that gets you to scale and brings prices down."

Andy points out that Amazon Retail competes with Netflix in the video space, and asks what gave Reed the confidence to move to AWS. Reed replies that Jeff Bezos and Andy have both been very clear that AWS is a great business that should be run independently and the more that Amazon Retail competes with Netflix, the better symbol Netflix is that it's safe to run on AWS. Andy replies "Netflix is every bit as important a customer of AWS as Amazon Retail, and that's true for all of our external customers".

The discussion moves on to the future of cloud, and Reed points out that as wonderful as AWS is, we are still in the assembly language phase of cloud computing. Developers shouldn't have to be picking individual instance types, just as they no longer need to worry about CPU register allocation because compilers handle that for them. Over the coming years, the cloud will add the ability to move live instances between instance types. We can see that this is technically possible because VMware does that today with VMotion, but bringing this capability to public cloud would allow cost optimization, improvements in bisection bandwidth, and great improvements in efficiency. There are great technical challenges to do this seamlessly at scale, and Reed wished Andy well in tackling these hard problems in the coming years.

The second area of future development is consumer devices that are touch based, understand voice commands and are backed by ever more powerful cloud based services. For Netflix, the problem is to pick the best movies to show on a small screen for a particular person at that point in time, from a huge catalog of TV shows and movies. The ability to cheaply throw large amounts of compute power at this ranking problem lets Netflix experiment rapidly to improve the customer experience.

In the final exchange, Andy asks what advice he can give to the audience, and Reed says to build products that you find exciting, and to watch House of Cards on Netflix on February 1st next year.

Next Andy talks about the rate at which AWS introduces and updates products, from 61 in 2010, to 82 in 2011 to 158 in 2012. He then went on to introduce AWS Redshift, a low cost data warehouse as a service that we are keen to evaluate as we replace our existing datacenter based data warehouse with a cloud based solution.

Along with presentations from NASDAQ and SAP, Andy finishes up with examples of mission critical applications that are running on AWS, including a huge diagram showing the Obama For America election back end, consisting of over 200 applications. We were excited to find out that the OFA tech team were using the Netflix open source management console Asgard to manage their deployments on AWS, and to see the Asgard icon scattered across this diagram. During the conference we met the OFA team and many other AWS end users who have also started using various @NetflixOSS projects.

Thursday Morning Keynote

The second day keynote starts off with Werner Vogels talking about architecture. Starting around 43 minutes in, he describes some 21st Century Architectural patterns which are used by AWS itself and are also very similar to the Netflix architectural practices. After a long demo from Matt Wood that used the AWS Console to laboriously do what Asgard does in a few clicks, there is an interesting description of how S3 was designed for resilience and scalability by Alyssa Henry, the VP of Storage Services for AWS.

Werner returns to talk about some more architectural principles and a customer talk from Animoto, then announces two new high end instance types that will become available in the coming weeks. The cr1.8xlarge has 240GB of RAM and two 120GB solid state disks; it's ideal for running in-memory analytics. The hs1.8xlarge has 114GB of RAM and twenty-four 2TB hard drives in the instance; it's ideal for running data warehouses, and is clearly the raw back end instance behind the Redshift data warehouse product announced the day before. Finally he discusses data driven architectures and introduces AWS Data Pipeline, then Matt Wood comes on again to do a demo.

Thursday Afternoon Fireside Chat

The final keynote, a fireside chat with Werner Vogels and Jeff Bezos, has interesting discussions of lean start-up principles and the nature of innovation. At 29'50" they discuss Netflix and the issues of competition between Amazon Prime and Netflix. Jeff says there is no issue, "We bust our butt every day for Netflix", and Werner says the way AWS works is the same for everyone; there are no special cases for Netflix or anyone else.

The discussion continues with an introduction to the 10,000 year clock and the Blue Origin vertical take off and vertical landing spaceship that Jeff is also involved in as side projects.

Netflix in the Expo Hall and @NetflixOSS

The exhibition area was impressive, with many interesting vendors that highlight the strong ecosystem around AWS. Netflix had a small booth which was aimed primarily at recruiting, but also provided a place to meet with the speakers and to meet people using the @NetflixOSS platform components. Over the last year Netflix has been gradually open sourcing our platform. While we aren't finished yet, it is now emerging as a way for other companies to rapidly adopt the same highly available architecture on AWS that has been very successful for Netflix.

More Coming Soon

There were a large number of presentations at AWS Re:Invent. The organizers have stated that videos of all the presentations will be posted to their YouTube channel, and some slides are already available. Netflix also archives its presentations, and we plan to link to the videos of Netflix talks when they are posted. Here's a list of what's coming, with links to some of the slides.

Wed 1:00-1:45
Coburn Watson
Optimizing Costs with AWS

Wed 2:05-2:55
Kevin McEntee
Netflix’s Transcoding Transformation

Wed 3:25-4:15
Neil Hunt / Yury Izrailevsky
Netflix: Embracing the Cloud

Wed 4:30-5:20
Adrian Cockcroft
High Availability Architecture at Netflix

Thu 10:30-11:20
Jeremy Edberg
Rainmakers – Operating Clouds

Thu 11:35-12:25
Kurt Brown
Data Science with Elastic Map Reduce (EMR)

Thu 11:35-12:25
Jason Chan
Security Panel: Learn from CISOs working with AWS

Thu 3:00-3:50
Adrian Cockcroft
Compute & Networking Masters Customer Panel

Thu 3:00-3:50
Ruslan Meshenberg/Gregg Ulrich
Optimizing Your Cassandra Database on AWS

Thu 4:05-4:55
Ariel Tseitlin
Intro to Chaos Monkey and the Simian Army

Monday, November 26, 2012

Introducing Hystrix for Resilience Engineering

by Ben Christensen

In a distributed environment, failure of any given service is inevitable. Hystrix is a library designed to control the interactions between these distributed services, providing greater tolerance of latency and failure. Hystrix does this by isolating points of access between the services, stopping cascading failures across them, and providing fallback options, all of which improve the system's overall resiliency.
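As an illustration of the command pattern Hystrix uses, here is a minimal sketch of a HystrixCommand. The command name and the simulated failing dependency are hypothetical: the dependency call runs inside run(), isolated on a thread pool, and getFallback() supplies the response whenever the call fails, times out, or the circuit breaker is open.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Hypothetical command wrapping a remote user-service call.
public class GetUserNameCommand extends HystrixCommand<String> {
    private final long userId;

    public GetUserNameCommand(long userId) {
        // Commands in the same group share a thread pool by default,
        // isolating this dependency from the rest of the application.
        super(HystrixCommandGroupKey.Factory.asKey("UserService"));
        this.userId = userId;
    }

    @Override
    protected String run() {
        // The real dependency call would go here; we simulate a failure.
        throw new RuntimeException("user service unavailable for user " + userId);
    }

    @Override
    protected String getFallback() {
        // Served when run() throws, times out, or the circuit is open.
        return "Guest";
    }
}
```

Because run() always throws in this sketch, `new GetUserNameCommand(42L).execute()` returns the fallback value "Guest" instead of propagating the failure to the caller.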

Hystrix evolved out of resilience engineering work that the Netflix API team began in 2011. Over the course of 2012, Hystrix continued to evolve and mature, eventually leading to adoption across many teams within Netflix. Today tens of billions of thread-isolated and hundreds of billions of semaphore-isolated calls are executed via Hystrix every day at Netflix and a dramatic improvement in uptime and resilience has been achieved through its use.

The following links provide more context around Hystrix and the challenges that it attempts to address:

Getting Started

Hystrix is available on GitHub at

Full documentation is available on the project wiki, including Getting Started, How To Use, How It Works and Operations pages, with examples of how it is used in a distributed system.

You can get and build the code as follows:
$ git clone git://
$ cd Hystrix/
$ ./gradlew build

Coming Soon

In the near future we will also be releasing the real-time dashboard that we use to monitor Hystrix at Netflix.

We hope you find Hystrix to be a useful library. We'd appreciate any and all feedback on it and look forward to fork/pulls and other forms of contribution as we work on its roadmap.

Are you interested in working on great open source software? Netflix is hiring!

Tuesday, November 20, 2012

Announcing Blitz4j - a scalable logging framework

By Karthikeyan Ranganathan

We are proud to announce Blitz4j, a critical component of the Netflix logging infrastructure that helps Netflix achieve high volume logging without affecting the scalability of its applications.

What is Blitz4j?

Blitz4j is a logging framework built on top of log4j to reduce multithreaded contention and enable highly scalable logging without affecting application performance characteristics.
At Netflix, Blitz4j is used to log billions of events for monitoring, business intelligence reporting, debugging and other purposes. Blitz4j overcomes traditional log4j bottlenecks with a highly scalable and customizable asynchronous framework. Blitz4j also comes with the ability to convert existing log4j appenders to the asynchronous model without changing the existing log4j configuration.
Blitz4j makes runtime reconfigurations of log4j pretty easy. Blitz4j also tries to mitigate data loss and provides a way to summarize the log information during log storms.

Why is scalable logging important?

Logging is a critical part of any application infrastructure. At Netflix, we collect data for monitoring, business intelligence reporting, etc. There is also a need to turn on finer-grained logging levels for debugging customer issues.
In addition, in a service-oriented architecture, you depend on other central services, and if those services break unexpectedly, your applications tend to log orders of magnitude more than normal. This is where the scalability of the logging infrastructure comes to the fore. Any scalable logging infrastructure should be able to handle this kind of log storm, providing useful information about the breakage without affecting application performance.

History of Blitz4j

At Netflix, log4j has been used as a logging framework for a few years. It worked fine for us until there was a real need to log lots of data. When our traffic increased and the need for per-instance logging went up, log4j's frailties started to get exposed.

Problems with Log4j

Contended Synchronization with Root Logger
Log4j follows a hierarchical model that makes it easy to turn logging on or off at a package or class level (if your logger definitions follow that model). In this model, the root logger is at the top of the hierarchy, and in most cases all loggers have to access the root logger to log to the appenders configured there. One of the biggest problems here is that a lock must be taken on the root logger to write to the appenders. For a high traffic application that logs a lot of data, this is a big contention point, as all application threads have to synchronize on the root logger.

  public void callAppenders(LoggingEvent event) {
    int writes = 0;
    for (Category c = this; c != null; c = c.parent) {
      // Protected against simultaneous call to addAppender, removeAppender,...
      synchronized (c) {
        if (c.aai != null) {
          writes += c.aai.appendLoopOnAppenders(event);
        }
        if (!c.additive) {
          break;
        }
      }
    }
    if (writes == 0) {
      repository.emitNoAppenderWarning(this);
    }
  }

This severely limits application scalability. Even if the critical section executes quickly, it is a huge bottleneck in high volume logging applications.
The reason for the lock seems to be protection against a potential change in the list of appenders, which should be a rare event. For us, thread dumps have exposed this contention numerous times.
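To see why this serializes the whole application, here is a minimal sketch (our own illustration, not log4j code) of the same contention pattern: every logging call must enter a single shared synchronized block, analogous to the `synchronized(c)` on the root logger in callAppenders, so calls execute one at a time no matter how many threads or cores are available.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the log4j bottleneck in miniature: all "logging" threads
// funnel through one monitor, so logging is fully serialized.
public class RootLoggerContention {
    private static final Object ROOT_LOCK = new Object(); // stands in for the root logger
    private static long events = 0;

    static void log(String msg) {
        synchronized (ROOT_LOCK) { // analogous to synchronized(c) in callAppenders
            events++;              // the appender work happens while the lock is held
        }
    }

    public static long run(int threads, int perThread) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                for (int j = 0; j < perThread; j++) {
                    log("event");
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return events;
    }
}
```

With 8 threads each logging 100,000 events, all 800,000 calls are funneled through the single lock; adding threads adds contention, not throughput.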
Asynchronous appender to the rescue?
The longer the time spent logging to the appender, the more threads wait for logging to complete, so any buffering here helps the scalability of the application tremendously. Some log4j appenders (such as the File Appender) come with the ability to buffer data, which helps quite a bit. The built-in log4j asynchronous appender alleviates the problem further, but it still does not remove the synchronization: for us, thread dumps revealed another point of contention when logging to the appender through the asynchronous appender.
It was quite clear that the built-in asynchronous appender was less scalable because of this synchronization.

  public synchronized void doAppend(LoggingEvent event) {
    if (closed) {
      LogLog.error("Attempted to append to closed appender named [" + name + "].");
      return;
    }
    if (!isAsSevereAsThreshold(event.getLevel())) {
      return;
    }
    Filter f = this.headFilter;
    FILTER_LOOP:
    while (f != null) {
      switch (f.decide(event)) {
      case Filter.DENY: return;
      case Filter.ACCEPT: break FILTER_LOOP;
      case Filter.NEUTRAL: f = f.getNext();
      }
    }
    this.append(event);
  }

Deadlock Vulnerability
This double locking (on the root logger and on the appender) also makes the application vulnerable to deadlock: if your appender takes a lock on some resource while another thread holding that resource tries to log to the appender at the same time, the two threads block each other forever.
Locking on Logger Cache
In log4j, loggers are cached in a Hashtable, which must be locked for any retrieval of a cached logger. When you want to change log4j settings dynamically, there are two steps in the process.
  1. Reset and empty out all current log4j configurations
  2. Load all configurations including new configurations
During the reset process, locks have to be held on both the cache and the individual loggers. If any of the appenders tries to look up a logger from the cache at the same time, we have the classic case of locks being acquired in opposite order and the chance of a deadlock.

  public void shutdown() {
    Logger root = getRootLogger();

    // begin by closing nested appenders
    root.closeNestedAppenders();

    synchronized (ht) {
      Enumeration cats = this.getCurrentLoggers();
      while (cats.hasMoreElements()) {
        Logger c = (Logger) cats.nextElement();
        c.closeNestedAppenders();
      }
    }
  }

Why Blitz4j?

Central to all the contention and deadlock vulnerabilities is the locking model in log4j. If log4j used the concurrent data structures available in JDK 1.5 and above, most of the problems would be solved. That is exactly what Blitz4j does.
Blitz4j overrides key parts of the log4j architecture to remove the locks and replace them with concurrent data structures. Blitz4j puts the emphasis on application performance and stability rather than accuracy in logging. This means Blitz4j leans towards an asynchronous model of logging and tries to keep the logging useful by retaining the time order of log messages.
While log4j's built-in asynchronous appender is similar in functionality to the one offered by Blitz4j, Blitz4j comes with the following differences:
  1. It removes all critical synchronizations, using concurrent data structures instead.
  2. It is extremely configurable in terms of in-memory buffer size and worker threads.
  3. It isolates application threads from logging threads more completely, replacing the wait-notify model with an executor pool model.
  4. It handles log messages better during log storms, with a configurable summary.
Apart from the above, Blitz4j also provides the following:
  1. The ability to dynamically configure log4j levels for debugging production problems without affecting application performance.
  2. Automatic conversion of any log4j appender to the asynchronous model, statically or at runtime.
  3. Real-time performance metrics using Servo and dynamic configurability using Archaius.
If your application is equipped with enough memory, it is possible to achieve both application and logging performance without any loss of logging data. The power of this model comes to the fore during log storms, when critical dependencies break unexpectedly and cause an orders-of-magnitude increase in logging.
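The asynchronous model described above can be sketched as follows. This is an illustrative sketch under our own assumptions, not Blitz4j's actual implementation: application threads enqueue events on a bounded concurrent queue and return immediately, a single worker thread drains the queue and performs the slow appender I/O, and during a log storm overflow events are counted for later summary rather than blocking the application.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of an asynchronous appender: application threads never wait on
// appender I/O; one worker thread drains the buffer in the background.
public class AsyncAppenderSketch {
    private final BlockingQueue<String> buffer = new LinkedBlockingQueue<>(10_000);
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final AtomicLong written = new AtomicLong();
    private final AtomicLong dropped = new AtomicLong();

    public AsyncAppenderSketch() {
        worker.submit(() -> {
            try {
                while (true) {
                    buffer.take();               // only the worker blocks here
                    written.incrementAndGet();   // stands in for real appender I/O
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }

    // Called from application threads: non-blocking, returns immediately.
    public void append(String event) {
        if (!buffer.offer(event)) {
            dropped.incrementAndGet();  // log storm: count and summarize, don't block
        }
    }

    public long written() { return written.get(); }
    public long dropped() { return dropped.get(); }

    public void shutdown() {
        worker.shutdownNow();
    }
}
```

The design choice mirrors the trade-off described above: when the buffer is full, append() drops and counts rather than blocking, favoring application performance and stability over logging accuracy.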

Why not use LogBack?

For a new project, LogBack may be an apt choice. For existing projects, there seems to be a considerable amount of work needed to achieve the promised scalability. Besides, Blitz4j has stood the test of time and scrutiny at Netflix, and given the familiarity and ubiquity of log4j, it is our architectural choice here at Netflix.

Blitz4j Performance

The graph below, from a couple of our streaming servers that log about 300-500 lines a second, gives an indication of the performance of Blitz4j (with asynchronous appender) as compared to log4j (without asynchronous appender).
In a steady state, the latter is at least 3 times more expensive than Blitz4j. The numerous spikes that occur with the log4j implementation are due to the synchronizations discussed earlier.

Other things we observed:
When log4j (with its synchronizations intact) is used with the asynchronous appender, it scales much better (graph not included here), but it simply takes a higher volume of logging for the spikes to show up.

We have also observed that with Blitz4j, even with a very high amount of logging turned on, application response times remained unaffected, whereas with log4j (with synchronization) response times degraded rapidly as the amount of logging increased.


Blitz4j Source
Blitz4j Wiki

If building critical infrastructure components like this, for a service that millions of people use worldwide, excites you, take a look at