Monday, October 29, 2012

Governator - Lifecycle and Dependency Injection

By Jordan Zimmerman


Yahhh – PostConstruct me now unt PreDestroy me lay-tuh

Governator

Governator is a library of extensions and utilities that extend Google Guice to provide:
  • Classpath scanning and automatic binding
  • Lifecycle management
  • Configuration to field mapping
  • Field validation
  • Parallelized object warmup
  • Lazy singleton support
  • Generic binding annotations
Governator is another component in the growing Netflix Open Source Platform. See the Governator website at Github for complete details: https://github.com/Netflix/governator.

Why the name Governator? It Governs the bootstrapping process and all my OSS projects end in "tor" (see: Curator and Exhibitor) - also it makes me laugh ;)

What is Dependency Injection?

According to Wikipedia, Dependency Injection "is a software design pattern that allows a choice of component to be made at run-time rather than compile time." It is also known as the Inversion of Control Pattern which was originally described by Martin Fowler.

As has been described on the Netflix Tech Blog and elsewhere, Netflix utilizes a highly distributed architecture comprised of many individual service types and many shared libraries. The Netflix Platform Team is responsible for making those libraries and services easy to configure and use. As the number of these libraries and services has grown, our existing methods have reached their limits. That is why we're moving to a Dependency Injection based system. This system will make it much easier for teams to use common libraries without having to worry about initialization and object lifecycle details.

What is Object Lifecycle?

In many programming languages, Java in particular, there is much more to the lifecycle of an object than just its allocation. The object may need to register itself with existing objects, it may need to load property files, it may need to warm caches, etc.

At Netflix, our shared libraries must load dynamic properties before they are ready to be used (see the Tech Blog post regarding Archaius). Additionally, libraries such as Astyanax need to warm internal caches before they can perform at optimum levels. Users of a library shouldn't need to concern themselves with these details if possible.

Google Guice

Guice is a Dependency Injection container created and open sourced by Google. We've decided to standardize on Guice because it is a small library that is well supported, has strong acceptance by the Java community and is relatively easy to extend to our needs. Using a library like Guice makes the
introduction of Dependency Injection much easier.

Summary

We hope you find Governator as useful as we do. We'd appreciate any and all feedback on it. Are you interested in working on great open source software? Netflix is hiring! http://jobs.netflix.com


Post-mortem of October 22,2012 AWS degradation

By Jeremy Edberg & Ariel Tseitlin

On Monday, October 22nd, Amazon experienced a service degradation.  It was highly publicized due to the fact that it took down many popular websites for multiple hours. Netflix however, while not completely unscathed, handled the outage with very little customer impact.  We did some things well and could have done some things better, and we'd like to share with you the timeline of the outage from our perspective and some of the best practices we used to minimize customer impact.

Event Timeline

On Monday, just after 8:30am, we noticed that a couple of large websites that are hosted on Amazon were having problems and displaying errors.  We took a survey of our own monitoring system and found no impact to our systems.  At 10:40am, Amazon updated their status board showing degradation in the EBS service.  Since Netflix focuses on making sure services can handle individual instance failure and since we avoid using EBS for data persistence, we still did not see any impact to our service.  

At around 11am, some Netflix customers started to have intermittent problems.  Since much of our client software is designed for resilience against intermittent server problems, most customers did not notice.  At 11:15am, the problem became significant enough that we opened an internal alert and began investigating the cause of the problem.  At the time, the issue was exhibiting itself as a network issue, not an EBS issue, which caused some initial confusion.  We should have opened an alert earlier, which would have helped us narrow down the issue faster and let us remediate sooner.

When we were able to narrow down the network issue to a single zone, Amazon was also able to confirm that the degradation was limited to a single Availability Zone.  Once we learned the impact was isolated to one AZ, we began evacuating the affected zone.

Due to previous single zone outages, one of the drills we run is a zone evacuation drill.  Between the zone evacuation drill and our learnings from previous outages, the decision to evacuate the troubled zone was an easy one -- we expected it to be as quick and painless as it was during past drills.  So that is what we did.

In the past we identified zone evacuations as a good way of solving problems isolated to a single zone and as such have made it easy in Asgard to do this with a few clicks per application.  That preparation came in handy on Monday when we were able to evacuate the troubled zone in just 20 minutes and completely restore service to all customers.  

Building for High Availability

We’ve developed a few patterns for improving the availability of our service.
Past outages and a mindset for designing in resiliency at the start have taught us a few best-practices about building high availability systems.

Redundancy

One of the most important things that we do is we build all of our software to operate in three Availability Zones.  Right along with that is making each app resilient to a single instance failing.  These two things together are what made zone evacuation easier for us.  We stopped sending traffic to the affected zone and everything kept running.  In some cases we needed to actually remove the instances from the zone, but this too was done with a just a few clicks to reconfigure the auto scaling group.

We apply the same three Availability Zone redundancy model to our Cassandra clusters.  We configure all our clusters to use a replication factor of three, with each replica located in a different Availability Zone.  This allowed Cassandra to handle the outage remarkably well.  When a single zone became unavailable, we didn't need to do anything.  Cassandra routed requests around the unavailable zone and when it recovered, the ring was repaired.

Simian Army

Everyone has the best intentions when building software.  Good developers and architects think about error handling, corner cases, and building resilient systems.  However, thinking about them isn’t enough.  To ensure resiliency on an ongoing basis, you need to alway test your system’s capabilities and its ability to handle rare events.  That’s why we built the Simian Army:  Chaos Monkey to test resilience to instance failure, Latency Monkey to test resilience to network and service degradation, and Chaos Gorilla to test resilience to zone outage.  A future improvement we want to make is expanding the Chaos Gorilla to make zone evacuation a one-click operation, making the decision even easier.  Once we build up our muscles further, we want to introduce Chaos Kong to test resilience to a complete regional outage.

Easy tooling & Automation

The last thing that made zone evacuation an easy decision is our cloud management tool, known as Asgard .  With just a couple of clicks, service owners are able to stop the traffic to the instances or delete the instances as necessary.

Embracing Mistakes

Every time we have an outage, we make sure that we have an incident review.  The purpose of these reviews is not to place blame, but to learn what we can do better.  After each incident we put together a detailed timeline and then ask ourselves, “What could we have done better?  How could we lessen the impact next time? How could we have detected the problem sooner?”  We then take those answers and try to solve classes of problems instead of just the most recent problem.  This is how we develop our best practices.

Conclusion

We weathered this last AWS outage quite well and learned a few more lessons to improve on. With each outage, we look for opportunities to improve both the way our system is built and the way we detect and react to failure.  While we feel we’ve built a highly available and reliable service, there’s always room to grow and improve.  

If you like thinking about high availability and how to build more resilient systems, we have many openings throughout the company, including a few Site Reliability Engineering positions.