Friday, July 29, 2016

Distributed Resource Scheduling with Apache Mesos


Netflix uses Apache Mesos to run a mix of batch, stream processing, and service-style workloads. For over two years, we have seen increasing usage across a variety of use cases, including real-time anomaly detection, training and model-building batch jobs, machine learning orchestration, and Node.js-based microservices. The recent release of Apache Mesos 1.0 reflects the maturity of a technology that has evolved significantly since we first started to experiment with it.

Our initial use of Apache Mesos was motivated by fine-grained resource allocation to tasks of various sizes that can be bin packed onto a single EC2 instance. In the absence of Mesos, or a similar resource manager, we would have had to either forgo fine-grained allocation and run a larger number of instances with suboptimal usage, or develop a technology similar to Mesos, or at least a subset of it.

The increasing adoption of containers for stream processing and batch jobs continues to drive usage of Mesos-based resource scheduling. More recently, the developer benefits of working with Docker-based containers brought a set of service-style workloads onto Mesos clusters. We present here an overview of some of the projects using Apache Mesos across Netflix engineering. We show the different use cases they address and how they each use the technology effectively. For further details on each of the projects, we provide links to other posts in the sections below.

Cloud native scheduling using Apache Mesos

In order to allocate resources from various EC2 instances to tasks, we need a resource manager that makes the resources available for scheduling and carries out the logistics of launching and monitoring the tasks over a distributed set of EC2 instances. Apache Mesos separates the allocation of resources to “frameworks” that wish to use the cluster from the scheduling of those resources to tasks by the frameworks. While Mesos determines how many resources are allocated to a framework, the framework’s scheduler determines which resources to assign to which tasks, and when. The schedulers are presented with a relatively simple API so they can focus on the scheduling logic and react to failures, which are inevitable in a distributed system. This allows users to write different schedulers that cater to various use cases, instead of Mesos having to be the single monolithic scheduler for all use cases. The diagram below from the Mesos documentation shows “Framework 1” receiving an offer from “Agent 1” and launching two tasks.
MesosArchitecture.jpg
The Mesos community has seen multiple schedulers developed over time that cater to specific use cases and present specific APIs to their users.
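
The split between offers and scheduling can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Mesos API: the Offer and Task classes and the schedule function are hypothetical names, and only CPUs and memory are modeled. The resource manager's side is reduced to a list of offers; the framework's side decides which pending tasks to launch against each offer.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    # Hypothetical stand-in for a resource offer made on behalf of one agent.
    agent: str
    cpus: float
    mem_mb: int


@dataclass
class Task:
    # Hypothetical task description with its resource request.
    name: str
    cpus: float
    mem_mb: int


def schedule(offers, pending):
    """Framework-side step: match pending tasks to the offers received.

    Returns (launches, remaining): launches maps agent name to the task
    names to start there; remaining holds tasks that fit in no offer.
    """
    launches = {}
    remaining = list(pending)
    for offer in offers:
        cpus, mem = offer.cpus, offer.mem_mb
        placed, still_waiting = [], []
        for task in remaining:
            if task.cpus <= cpus and task.mem_mb <= mem:
                cpus -= task.cpus
                mem -= task.mem_mb
                placed.append(task.name)
            else:
                still_waiting.append(task)
        if placed:
            launches[offer.agent] = placed
        remaining = still_waiting
    return launches, remaining


offers = [Offer("agent-1", cpus=4.0, mem_mb=4096)]
pending = [
    Task("task-1", cpus=1.0, mem_mb=1024),
    Task("task-2", cpus=2.0, mem_mb=2048),
    Task("task-3", cpus=2.0, mem_mb=2048),
]
launches, remaining = schedule(offers, pending)
```

With this input the framework launches task-1 and task-2 on agent-1 and keeps task-3 pending until a later offer arrives, which mirrors the two-task launch in the diagram.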

Netflix runs various microservices in an elastic cloud, AWS EC2. Operating Mesos clusters in a cloud native environment required us to ensure that the schedulers can handle two aspects beyond what schedulers operating in a data center environment handle: the increased ephemerality of the agents running the tasks, and the ability to autoscale the Mesos agent cluster based on demand. Also, the use cases we had in mind called for more advanced scheduling of resources than a first-fit assignment. For example, we wanted to bin pack tasks onto agents by their use of CPUs, memory, and network bandwidth in order to minimize fragmentation of resources. Bin packing also helps us free up as many agents as possible, which eases scaling down the agent cluster by terminating idle agents without terminating running tasks.
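
To make the difference between first fit and bin packing concrete, here is a small Python sketch. It is illustrative only: the agent dictionaries, the resource names, and the scoring rule (mean utilization after placement) are assumptions chosen for the example, not the fitness calculation of any particular scheduler.

```python
RESOURCES = ("cpus", "mem_mb", "net_mbps")


def fits(task, agent):
    """True if the agent has room for the task on every resource."""
    return all(agent["used"][r] + task[r] <= agent["total"][r] for r in RESOURCES)


def bin_pack_fitness(task, agent):
    """Mean utilization across resources if the task were placed on this agent."""
    return sum(
        (agent["used"][r] + task[r]) / agent["total"][r] for r in RESOURCES
    ) / len(RESOURCES)


def first_fit(task, agents):
    """Pick the first agent with enough room, ignoring how full it is."""
    return next((a for a in agents if fits(task, a)), None)


def best_fit(task, agents):
    """Pick the agent that ends up most utilized, leaving emptier agents untouched."""
    candidates = [a for a in agents if fits(task, a)]
    return max(candidates, key=lambda a: bin_pack_fitness(task, a), default=None)


def make_agent(name, used_cpus, used_mem_mb, used_net_mbps):
    # Hypothetical agent with fixed capacity and the given current usage.
    return {
        "name": name,
        "total": {"cpus": 8, "mem_mb": 16384, "net_mbps": 1000},
        "used": {"cpus": used_cpus, "mem_mb": used_mem_mb, "net_mbps": used_net_mbps},
    }


agents = [make_agent("idle", 0, 0, 0), make_agent("half-full", 4, 8192, 500)]
task = {"cpus": 2, "mem_mb": 4096, "net_mbps": 250}

first_choice = first_fit(task, agents)["name"]
packed_choice = best_fit(task, agents)["name"]
```

First fit places the task on the idle agent because it comes first, while the bin-packing score prefers the half-full agent. The idle agent stays empty and remains a candidate for termination when the cluster scales down.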

Identifying a gap in such capabilities among the existing schedulers, last year we contributed a scheduling library called Fenzo. Fenzo autoscales the agent cluster based on demand and assigns resources to tasks based on multiple scheduling objectives composed via fitness criteria and constraints. The fitness criteria and the constraints are extensible via plugins, with a few common implementations built in, such as bin packing and spreading tasks of a job across EC2 availability zones for high availability. Any Mesos framework that runs on the JVM can use the Fenzo Java library.
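
The idea of composing constraints with a fitness criterion can be sketched as follows. This Python example does not use or reproduce the Fenzo Java API; the function names, the agent fields, and the zone names are hypothetical. Constraints act as filters on candidate agents, and the fitness function ranks whatever passes the filters.

```python
from collections import Counter


def zone_balance_constraint(job_zone_counts):
    """Constraint: allow only zones holding the fewest tasks of this job so far."""
    def allowed(agent, all_zones):
        fewest = min(job_zone_counts.get(z, 0) for z in all_zones)
        return job_zone_counts.get(agent["zone"], 0) == fewest
    return allowed


def cpu_bin_pack_fitness(task_cpus, agent):
    """Fitness: CPU utilization of the agent if the task were placed on it."""
    return (agent["used_cpus"] + task_cpus) / agent["total_cpus"]


def assign(task_cpus, agents, constraints, fitness):
    """Filter agents by capacity and constraints, then pick the fittest one."""
    zones = {a["zone"] for a in agents}
    candidates = [
        a for a in agents
        if a["used_cpus"] + task_cpus <= a["total_cpus"]
        and all(constraint(a, zones) for constraint in constraints)
    ]
    return max(candidates, key=lambda a: fitness(task_cpus, a), default=None)


agents = [
    {"name": "a1", "zone": "us-east-1a", "total_cpus": 8, "used_cpus": 6},
    {"name": "b1", "zone": "us-east-1b", "total_cpus": 8, "used_cpus": 2},
    {"name": "a2", "zone": "us-east-1a", "total_cpus": 8, "used_cpus": 0},
]

job_zone_counts = Counter()
placements = []
for _ in range(2):  # place two 2-CPU tasks of the same job
    chosen = assign(2, agents, [zone_balance_constraint(job_zone_counts)], cpu_bin_pack_fitness)
    chosen["used_cpus"] += 2
    job_zone_counts[chosen["zone"]] += 1
    placements.append(chosen["name"])
```

The first task goes to a1, the fullest agent that still fits it. For the second task, the zone constraint rules out both agents in us-east-1a, so bin packing chooses among what is left and the task lands on b1.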

Mesos at Netflix

Here are three projects currently running Apache Mesos clusters.

Mantis

Mantis is a reactive stream processing platform that operates as a cloud native service with a focus on operational data streams. Mantis covers varied use cases including real-time dashboarding, alerting, anomaly detection, metric generation, and ad-hoc interactive exploration of streaming data. We created Mantis to make it easy for teams to get access to real-time events and build applications on top of them. Currently, Mantis is processing event streams of up to 8 million events per second and running hundreds of stream-processing jobs around the clock. One such job focuses on individual titles, processing fine-grained insights to figure out if, for example, there are playback issues with House of Cards, Season 4, Episode 1 on iPads in Brazil. This amounts to tracking millions of unique combinations of data all the time.

The Mantis platform comprises a master and an agent cluster. Users submit stream-processing applications as jobs that run as one or more workers on the agent cluster. The master uses the Fenzo scheduling library with Apache Mesos to optimally assign resources to a job’s workers. One such assignment objective places perpetual stream processing jobs on agents separate from those running transient interactive jobs. This helps scale down the agent cluster when the transient jobs complete. The diagram below shows the Mantis architecture. Workers from the various jobs may run on the same agent using Cgroups-based resource isolation.

MantisArchitectureImage.png
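
The objective of keeping perpetual and transient jobs on separate agents can be approximated with a simple affinity score. The sketch below is an assumption-laden illustration, not Mantis code: the job kinds, the score values, and the agent records are made up, and capacity checks are left out to keep the focus on the separation itself.

```python
def duration_affinity(job_kind, agent):
    """Score an agent for a job kind ("perpetual" or "transient").

    1.0 if the agent runs only this kind, 0.5 if it is empty,
    0.0 if it already runs the other kind.
    """
    kinds = agent["kinds"]
    if not kinds:
        return 0.5
    return 1.0 if kinds == {job_kind} else 0.0


def place(job_kind, agents):
    """Place a worker on the agent with the highest affinity for its job kind."""
    chosen = max(agents, key=lambda a: duration_affinity(job_kind, a))
    chosen["kinds"].add(job_kind)
    return chosen["name"]


agents = [
    {"name": "agent-1", "kinds": {"perpetual"}},
    {"name": "agent-2", "kinds": set()},
    {"name": "agent-3", "kinds": {"transient"}},
]

transient_choice = place("transient", agents)
perpetual_choice = place("perpetual", agents)

# Agents hosting no perpetual work become idle once transient jobs finish.
reclaimable = [a["name"] for a in agents if a["kinds"] <= {"transient"}]
```

Because the transient worker joins agent-3 and the perpetual worker joins agent-1, agent-2 and agent-3 carry no long-running work and can be released as soon as the transient jobs complete.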

Titus

Titus is a Docker container job management and execution platform. Initially, Titus served batch jobs that included algorithm training (similar titles for recommendations, A/B test cell analysis, etc.) as well as hourly ad-hoc reporting and analysis jobs. More recently, Titus has started to support service-style jobs (Netflix microservices) that need a consistent local development experience as well as more fine-grained resource management. Titus' initial service-style use is for the API re-architecture using server-side Node.js.


The above architecture diagram for Titus shows its Master using Fenzo to assign resources from Mesos agents. Titus provides tight integration into the Netflix microservices and AWS ecosystem, including integrations for service discovery, software based load balancing, monitoring, and our CI/CD pipeline, Spinnaker. The ability to write custom executors in Mesos allows us to easily tune the container runtime to fit in with the rest of the ecosystem.

Meson

Meson is a general purpose workflow orchestration and scheduling framework that was built to manage machine learning pipelines.

Meson caters to a heterogeneous mix of jobs with varying resource requirements for CPU, memory, and disk space. It supports the running of Spark jobs along with other batch jobs in a shared cluster. Tasks are resource isolated on the agents using Cgroups based isolation. The Meson scheduler evaluates readiness of tasks based on a graph and launches the ready tasks using resource offers from Mesos. Failure handling includes re-launching failed tasks as well as terminating tasks determined to have gone astray.
MesonArchitecture.png

The above diagram shows Meson’s architecture. The Meson team is currently working on enhancing its scheduling capabilities using the Fenzo scheduling library.  
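
Graph-based readiness and failure handling of the kind described above can be sketched in Python. This is not Meson's implementation: the pipeline, the status strings, and the retry limit are hypothetical, and launching against Mesos resource offers is reduced to marking a task as running.

```python
def ready_tasks(dependencies, status):
    """Return pending tasks whose upstream tasks have all succeeded, sorted by name."""
    return sorted(
        task for task, upstream in dependencies.items()
        if status.get(task, "pending") == "pending"
        and all(status.get(u) == "succeeded" for u in upstream)
    )


def record_result(status, attempts, task, succeeded, max_attempts=3):
    """Mark a task succeeded, or re-queue it until its attempts run out."""
    attempts[task] = attempts.get(task, 0) + 1
    if succeeded:
        status[task] = "succeeded"
    elif attempts[task] < max_attempts:
        status[task] = "pending"  # eligible for re-launch
    else:
        status[task] = "failed"


# Hypothetical machine learning pipeline: task -> list of upstream tasks.
dependencies = {
    "extract": [],
    "features": ["extract"],
    "train": ["features"],
    "report": ["extract"],
}
status, attempts = {}, {}

first_wave = ready_tasks(dependencies, status)
status["extract"] = "running"
record_result(status, attempts, "extract", succeeded=True)

second_wave = ready_tasks(dependencies, status)
status.update(features="running", report="running")
record_result(status, attempts, "features", succeeded=False)
record_result(status, attempts, "report", succeeded=True)

retry_wave = ready_tasks(dependencies, status)
```

Only extract is ready at first. Once it succeeds, features and report become ready together. The failed features task returns to the pending set for another attempt, and train stays blocked until features succeeds.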

Continuing work with Apache Mesos

As we continue to evolve the Mantis, Titus, and Meson projects, Apache Mesos provides a stable, reliable, and scalable resource management platform. We engage with the Mesos community through our open source contribution, Fenzo, and by exchanging ideas at MesosCon conferences. Connect with us at the upcoming MesosCon Europe 2016, or see our past sessions from 2014, 2015, and earlier this year (Lessons learned and Meson).

Our future work on these projects includes adding SLAs (service level agreements, such as disparate capacity guarantees for service and batch style jobs), security hardening of agents and containers, increasing operational efficiency and visibility, and adoption across a broader set of use cases. There are exciting projects in the pipeline across Mesos, Fenzo, and our frameworks to make these and other efforts successful.

If you are interested in helping us evolve our resource scheduling and container deployment projects, join our Container Platform or Personalization Infrastructure teams.

  • Sharma Podila, Andrew Spyker, Neeraj Joshi, Antony Arokiasamy