Tuesday, March 29, 2011

NoSQL @ Netflix Talk (Part 1)

A few weeks ago, I gave the first in a series of planned talks on the topic of NoSQL @ Netflix. By now, it is widely known that Netflix has achieved something remarkable over the past 2 years – accelerated subscriber growth with an ever-improving streaming experience. In addition to streaming more titles to more devices in both the US and Canada, Netflix has moved its infrastructure, data, and applications to the AWS cloud.

In the spirit of helping others with similar needs, we are sharing our experiences with AWS and NoSQL technologies via this tech blog and several speaking appearances at conferences. Via these efforts, we hope to foster both improvements in cloud and NoSQL offerings and collaboration with open-source communities.

The NoSQL @ Netflix series specifically aims to share our recommendations on the best use of NoSQL technologies in high-traffic websites. What makes our experience unique is that we are using publicly available NoSQL and cloud technology to serve high-traffic customer-driven read-write workloads. Once again:


Netflix’s NoSQL Use-cases = public NoSQL +
public cloud +
customer traffic +
R/W workload +
high traffic conditions

The video below was loosely based on the following whitepaper. Some of the key questions addressed by the video and whitepaper are as follows:

  • What sort of data can you move to NoSQL?
  • Which NoSQL technologies are we working with?
  • How did we translate RDBMS concepts to NoSQL?

Driven by a culture that prizes curiosity and continuous improvement, Netflix is already pushing NoSQL technology and adoption further. If you would like to work with us on these technologies, have a look at our open positions.

The slides are available here

Caption: The first 10 minutes are from sponsors and the last 30 minutes are Q & A.

Siddharth "Sid" Anand, Cloud Systems

    Tuesday, March 8, 2011

    Cloud Connect Keynote : Complexity and Freedom



    On March 8th, 2011, I was fortunate to be able to deliver 10 minutes of the keynote address for the Cloud Connect conference in Santa Clara, California. Here are some of the points I made during the talk.

    Availability

    We started this cloud re-architecture effort in 2008 in the aftermath of an outage of our DVD shipping software in August of that year. An unfortunate confluence of events caused our systems to go down. We had singleton vertically scaled databases for both our website and the nascent Netflix streaming functionality. We knew those two systems were equally vulnerable. We had to re-architect for high availability and move to a service oriented architecture spread across redundant data centers.

    Why Cloud?

    In August of 2008, there were already web-based startups that were not building data centers because they were building in the cloud. Some of those startups will grow to be as big as Netflix, so Netflix gave serious consideration to building for the cloud during this re-architecture effort.

    Why AWS (Amazon Web Services)?

    Our definition of cloud is a public, shared, and multi-tenant cloud. AWS is the market leader and has been able to create a continuous and virtuous cycle. Large AWS customers demand (and receive) continuous improvements from AWS. Those improvements, in turn, attract more large customers and the cycle then repeats itself. Netflix has benefited nicely from jumping on and riding that virtuous cycle.

    Agility

    We went to the cloud looking for high availability. We found availability, and we are happy that we found a lot of new agility as well. Our software developers and our business found new agility by eliminating a lot of complexity.

    Essential vs. Accidental Complexity : No Silver Bullet

    In 1986, Dr. Fred Brooks of the University of North Carolina at Chapel Hill wrote his famous paper entitled 'No Silver Bullet'. This paper touches on a lot of things, but the thing most relevant to this post is the contrast Brooks paints between essential complexity and accidental complexity. Essential complexity is caused by the problem to be solved, and nothing can remove it. A vital example of essential complexity at Netflix is our personalized movie recommendation system. Accidental complexity relates to problems that we create on our own and which can be fixed. One example of retired accidental complexity that Brooks wrote about was coding large-scale systems in assembly language because adequate high-level languages were not yet viable. That accidental complexity had largely been retired by 1986, when Brooks wrote the paper.

    Accidental complexity is generational. Every new application domain repeats the cycle of early phases of accidental complexity that are eventually retired. In the mid-1990s I was writing code that parsed raw HTTP request headers. Everyone had to do that so they could write the early dynamic web applications that many of us worked on in those days.
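
    To make that kind of accidental complexity concrete, here is a minimal sketch of hand-parsing a raw HTTP request head. It is written in Python purely for illustration and is not the code from that era; the function name parse_request_head and the sample request are hypothetical, and real parsers handle many more edge cases.

```python
def parse_request_head(raw: bytes):
    """Split a raw HTTP/1.x request head into (method, target, version, headers).

    Illustrative only: a hypothetical helper, not a complete HTTP parser.
    """
    # The head ends at the first blank line; anything after it is the body.
    head, _, _body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # Request line, e.g. "GET /movies?id=42 HTTP/1.0"
    method, target, version = lines[0].split(" ", 2)

    headers = {}
    for line in lines[1:]:
        if not line:
            continue
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"malformed header line: {line!r}")
        key = name.strip().lower()  # header names are case-insensitive
        value = value.strip()
        # Repeated headers are joined with a comma.
        headers[key] = f"{headers[key]}, {value}" if key in headers else value
    return method, target, version, headers


raw = (
    b"GET /movies?id=42 HTTP/1.0\r\n"
    b"Host: example.com\r\n"
    b"Accept: text/html\r\n"
    b"Accept: text/plain\r\n"
    b"\r\n"
)
method, target, version, headers = parse_request_head(raw)
```

    Today a web framework or standard library does this work, which is the sense in which that generation's accidental complexity has been retired.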

    Building and running data centers is the accidental complexity of the 2011 generation. If you are building a data center that hosts fewer than several tens of thousands of machines, then you are inviting complexity, centralized control, and process that you don't need for your business. At Netflix, recurring issues of data center space, equipment upgrades, power and cooling fire drills, and data center moves were all accidental complexities that distracted from software development towards our essential complexities.

    Running data centers also requires an accurate capacity forecast so the equipment needed to add capacity is racked, stacked, and tested before it is needed. For Netflix, an accurate capacity forecast requires an accurate business forecast. Netflix's good fortune has made this difficult. We started 2010 with just over 12 million subscribers and finished the year with over 20 million subscribers, far above what we predicted at the beginning of 2010. The newly added load put us at risk of running out of data center capacity. At the same time we were re-architecting for the cloud. We moved over 80% of our customer transactions, mostly for movie discovery and streaming, to the AWS cloud. The elasticity of the cloud enabled us to absorb that growth with little pain. The move to the cloud also allowed us to eliminate a lot of the centralized process required to run data centers.

    Killing Process : Freedom and Responsibility

    You may want to take a look at the Netflix Culture Deck, found at jobs.netflix.com. It talks about how we love killing process and a lot about our value of Freedom and Responsibility. Here are 2 relevant sentences from the culture deck:

    1. Our model is to increase employee freedom as we grow, rather than limit it.

    2. Responsible people thrive on freedom and are worthy of freedom.

    Implementing Freedom and Responsibility in our service oriented cloud architecture means the following things:

    1. Each engineering team owns their own deployment. They push changes and re-architect when they need to without seeking widespread alignment and without a sign-off process.

    2. Software developers own capacity procurement. In the cloud, adding CPU and storage is a simple API call.

    3. We don't have a single point of control over cloud spending. We've had a few bugs that consumed extra resources, but we also had those when we had a more centralized process for adding capacity to our data center.

    Centralized process and control were needed in the past to help manage the complexity of operating our own data centers. We eliminated a lot of that complexity by moving to the cloud, and these three facts of operating in the cloud at Netflix have delivered a tremendous new agility as our business and engineering teams continue to grow.

    Availability and Agility

    We moved to the cloud looking for availability. We have also found a tremendous agility by eliminating complexity, process, and control. There was a steep learning curve and moments of doubt along the way, but the end result is that Netflix software developers now have a lot more freedom to innovate and evolve our architectures rapidly as the business continues its rapid growth. We continue to seek great talent to add to our engineering teams. I hope you'll take a look at our open positions at jobs.netflix.com.


    Thanks,

    Kevin McEntee
    VP Engineering, Systems & ECommerce