Thursday, September 6, 2012

Netflix @ Recsys 2012

By Xavier Amatriain


We are just a few days away from the 2012 ACM Recommender Systems Conference (#Recsys2012), that this year will take place in Dublin, Ireland. Over the years, Recsys has been and still is one of our favorite conferences, not only because of how relevant the area is to our business, but also because of its unique blend of academic research and industrial applications.

In fact, if you had to mention a single company that is identified with recommender systems and technologies, that would probably be Netflix. The Netflix Prize started a year before the first RecSys conference in Minneapolis, and it impacted Recommender Systems researchers and practitioners in many ways. So, it comes as no surprise that the relation between the conference and Netflix also goes a long way. Netflix has been involved in the conference throughout the years. And, this time in Dublin is not going to be any different. Not only is Netflix a proud sponsor of the conference, but you will also have the chance to listen to presentations and meet some of the people that make the wheels of the Netflix recommendations turn. Here are some of the highlights of Netflix' participation:
  • Harald Steck and Xavier Amatriain are involved in organizing the workshop on "Recommender Utility Evaluation: Beyond RMSE". We believe that finding the right evaluation metrics is one of the key issues for recommender systems. This workshop will be a great event to not only discover the latest research in the area, but also to brainstorm and discuss on the issue of recsys evaluation.
  • On that same workshop, you should not miss the keynote by our Director of Innovation Carlos Gomez-Uribe. The talk is entitled "Challenges and Limitations in the Offline and Online Evaluation of Recommender Systems: A Netflix Case Study". Carlos will give some insights into how we deal with online A/B and offline experimental metrics.
  • On Tuesday, Xavier Amatriain will be giving a 90 minute tutorial on "Building industrial-scale real-world Recommender Systems". In this tutorial, he will talk about all those things that matter in a recommender system, and are usually outside of the academic focus. He will describe different ways that recommendations can be presented to the users, evaluation through A/B testing, data, and software architectures.
Besides Harald, Carlos and Xavier, you should also look forward to meeting other members of the personalization team at Netflix. Rex Lam, Justin Basilico, and Kelvin Jiang will also be attending the conference. We are all looking forward to meeting old and new friends and interacting with the Recsys community during the conference. If you want to make sure we meet, feel free to leave a comment to this post. See you in Dublin!

Tuesday, September 4, 2012

Netflix Shares Cloud Load Balancing And Failover Tool: Eureka!

By Karthikeyan Ranganathan

We are proud to announce Eureka, a service registry that is a critical component of the Netflix infrastructure in the AWS cloud, and underpins our mid-tier load balancing, deployment automation, data storage, caching and various other services.

What is Eureka?

Eureka is a REST based service that is primarily used in the AWS cloud for locating services for the purpose of load balancing and failover of middle-tier servers. We call this service, the Eureka Server. Eureka also comes with a java-based client component, the Eureka Client, which makes interactions with the service much easier. The client also has a built-in load balancer that does basic round-robin load balancing. At Netflix, a much more sophisticated load balancer wraps Eureka to provide weighted load balancing based on several factors like traffic, resource usage, error conditions etc to provide superior resiliency. We have previously referred to Eureka as the Netflix discovery service.

What is the need for Eureka?

In AWS cloud, because of it's inherent nature, servers come and go. Unlike the traditional load balancers which work with servers with well known IP addresses and host names, in AWS load balancing requires much more sophistication in registering and de-registering servers with the load balancer on the fly. Since AWS does not yet provide a middle tier load balancer, Eureka fills a big gap in that area.

How different is Eureka from AWS ELB?

AWS Elastic Load Balancer is a load balancing solution for edge services exposed to end-user web traffic. Eureka fills the need for mid-tier load balancing. While you can theoretically put your mid-tier services behind the AWS ELB, you expose them to the outside world and thereby losing all the usefulness of the AWS security groups.

One of the solutions that the proxy based load balancers including AWS ELB  offer that Eureka does not offer out of the box is sticky session based load balancing. In fact, at Netflix, the majority of mid-tier services are stateless and prefer non-sticky based load balancing as it works very well with the AWS autoscaling model.

AWS ELB is a traditional proxy-based load balancing solution whereas with Eureka load balancing happens at the instance/server/host level. The client instances know all the information about which servers they need to talk to and can contact them directly.

Another important aspect that differentiates proxy-based load balancing from load balancing using Eureka is that your application can be resilient to the outages of the load balancers, since the information regarding the available servers is cached on the client. This does require a small amount of memory, but buys better resiliency. There is also a small decrease in latency since contacting the server directly avoids two network hops through a proxy.

How different is Eureka from Route 53?

Route 53 is a naming service, and while Eureka provides naming for the mid-tier servers the similarity stops there. Route 53 is a DNS service which can host your DNS records even for non-AWS data centers. Route 53 can also do latency based routing across AWS regions. Eureka is analogous to internal DNS and has nothing to do with the DNS servers across the world. Eureka is also region-isolated in the sense that it does not know about servers in other AWS regions. It's primary purpose of holding information is for load balancing within a region.
While you can register your mid-tier servers with Route 53 and rely on AWS security groups to protect your servers from public access, your mid-tier server identity is still exposed to the external world. It also comes with the drawback of the traditional DNS based load balancing solutions where the traffic can still be routed to servers which may not be healthy or may not even exist (in the case of AWS cloud where servers can disappear anytime).

How is Eureka used at Netflix?

At Netflix, Eureka is used for the following purposes apart from mid-tier load balancing.

  • For aiding Netflix Asgard - an open source tool for managing cloud deployments.
    • Fast rollback of versions in case of problems avoiding the re-launch of 100's of instances.
    • In rolling pushes, for avoiding propagation of a new version to all instances.
  • For our Cassandra deployments to take instances out of traffic for maintenance.
  • For our Memcached based Evcache services to identify the list of nodes in the ring.
  • For carrying other additional application specific metadata about services.
When should I use Eureka?

You typically run in the AWS cloud and you have a host of middle tier services which you do not want to register with AWS ELB or expose traffic from outside world. You are either looking for a simple round-robin load balancing solution or are willing to write your own wrapper around Eureka based on your load balancing need. You do not have the need for sticky sessions and load session data from an external cache such as memcached. More importantly, if your architecture fits the model where a client based load balancer is favored, Eureka is well positioned to fit that usage.

How does the application client and application server communicate?

The communication technology could be anything you like. Eureka helps you find the information of the services you would want to communicate with but does not impose any restrictions on the protocol or method of communication. For instance, you can use Eureka to obtain the target server address and use protocols such as thrift, http(s) or any other RPC mechanisms.

High level architecture
Eureka High level ArchitectureThe architecture above depicts how Eureka is deployed at Netflix and this is how you would typically run it. There is one Eureka cluster per region which knows only about instances in its region. There is at the least one Eureka server per zone to handle zone failures.

Services register with Eureka and then send heartbeats to renew their leases every 30 seconds.If the client cannot renew the lease for a few times, it is taken out of the server registry in about 90 seconds.The registration information and the renewals are replicated to all the Eureka nodes in the cluster. The clients from any zone can look up the registry information (happens every 30 seconds) to locate their services (which could be in any zone) and make remote calls.

Non-java services and clients

For services that are non-java based, you have a choice of implementing the client part of Eureka in the language of the service or you can run a "sidekick" which is essentially a Java application with an embedded Eureka client that handles the registrations and heartbeats. REST based endpoints are also exposed for all operations that are supported by the Eureka client. Non-java clients can use the REST endpoints to query for information about other services.

Configurability

With Eureka you can add or remove cluster nodes on the fly. You can tune the internal configurations from timeouts to thread pools. Eureka uses Archaius and configurations can be tuned dynamically.

Resilience

Eureka benefits from the experience gained over several years operating on AWS, with resiliency built into both the client and the servers.Eureka clients are built to handle the failure of one or more Eureka servers. Since Eureka clients have the registry cache information in them, they can operate reasonably well, even when all of the Eureka servers go down.

Eureka Servers are resilient to other Eureka peers going down. Even during a network partition between the clients and servers, the servers have built-in resiliency to prevent a large scale outage.

Multiple Regions

Deploying Eureka in multiple AWS regions is a trivial task. Eureka clusters between regions do not communicate with one another.

Monitoring

Eureka uses Servo to track a lot information in both the client and the server for performance, monitoring and alerting.The data is typically available in the JMX registry and can be exported to Amazon Cloud Watch.

Stay tuned for

Asgard and Eureka integration.
More sophisticated Netflix mid-tier load balancing solutions.

If building critical infrastructure components like this, for a service that millions of people use world wide, excites you, take a look at jobs.netflix.com.