Tuesday, August 25, 2015

From Chaos to Control - Testing the resiliency of Netflix’s Content Discovery Platform

By: Leena Janardanan, Bruce Wobbe, Vilas Veeraraghavan

Introduction
The Merchandising Application Platform (MAP) was conceived as a middle-tier service that would handle real time requests for content discovery. MAP does this by aggregating data from disparate data sources and implementing common business logic in one distinct layer. This centralized layer helps provide common experiences across device platforms and reduces duplicated, and sometimes inconsistent, business logic. In addition, it allows recommendation systems - which are typically pre-compute systems - to be de-coupled from the real time path. MAP can be compared to a big funnel: most of the content discovery data on a user's screen passes through it and is processed there.
As an example, MAP generates localized row names for the personalized recommendations on the home page. This happens in real time, based on the locale of the user at the time the request is made. Similarly, applying maturity filters and localizing and sorting categories are examples of logic that lives in MAP.


Localized categories and row names, up-to-date My List and Continue Watching


A classic example of duplicated but inconsistent business logic that MAP consolidated was the "next episode" logic - the rule to determine if a particular episode was completed and the next episode should be shown. On one platform, it required that the credits had started and/or that 95% of the episode had been watched. On another platform, it was simply that 90% of the episode had to be watched. MAP consolidated this logic into one simple call that all devices now use.
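To make the consolidation concrete, here is a minimal sketch of what such a shared rule could look like. The class name, signature and thresholds are hypothetical, not MAP's actual implementation.

// Hypothetical sketch of a single, consolidated "next episode" rule that all
// device platforms consume instead of re-implementing their own thresholds.
public final class NextEpisodeRule {

    // Assumed threshold for illustration; the real value is owned by MAP.
    private static final double COMPLETION_THRESHOLD = 0.95;

    public boolean shouldShowNextEpisode(long positionMs, long durationMs, boolean creditsStarted) {
        if (durationMs <= 0) {
            return false; // guard against missing duration data
        }
        double fractionWatched = (double) positionMs / durationMs;
        return creditsStarted || fractionWatched >= COMPLETION_THRESHOLD;
    }
}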
 MAP also enables discovery data to be a mix of pre-computed and real time data. On the homepage, rows like My List, Continue Watching and Trending Now are examples of real time data whereas rows like “Because you watched” are pre-computed. As an example, if a user added a title to My List on a mobile device and decided to watch the title on a Smart TV, the user would expect My List on the TV to be up-to-date immediately. What this requires is the ability to selectively update some data in real time. MAP provides the APIs and logic to detect if data has changed and update it as needed. This allows us to keep the efficiencies gained from pre-compute systems for most of the data, while also having the flexibility to keep other data fresh.
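As a rough illustration of that selective refresh - all names here are hypothetical rather than MAP's real interfaces - the idea boils down to comparing a row's pre-computed version against the real time source and re-fetching only the rows that changed:

import java.util.List;

// Hypothetical sketch: most rows are served straight from the pre-computed page,
// while rows flagged as real time (My List, Continue Watching, Trending Now) are
// refreshed only if their version has changed since the page was built.
public final class RowRefresher {

    interface RealTimeSource {
        long currentVersion(String rowId);
        List<String> fetchRow(String rowId);
    }

    private final RealTimeSource realTime;

    public RowRefresher(RealTimeSource realTime) {
        this.realTime = realTime;
    }

    public List<String> rowForDisplay(String rowId, long precomputedVersion,
                                      List<String> precomputedTitles, boolean isRealTimeRow) {
        if (isRealTimeRow && realTime.currentVersion(rowId) != precomputedVersion) {
            return realTime.fetchRow(rowId);   // refresh only the rows that changed
        }
        return precomputedTitles;              // keep the pre-compute efficiency for the rest
    }
}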
MAP also supports business logic required for various A/B tests, many of which are active on Netflix at any given time. Examples include: inserting non-personalized rows, changing the sort order for titles within a row, and changing the contents of a row.
The services that generate this data are a mix of pre-compute and real time systems. Depending on the data, the calling patterns from devices for each type of data also vary. Some data is fetched once per session, some of it is pre-fetched when the user navigates the page/screen and other data is refreshed constantly (My List, Recently Watched, Trending Now).

Architecture

MAP comprises two parts - a server and a client. The server is the workhorse: it does all the data aggregation and applies the business logic. This data is then stored in caches (see EVCache) that the client reads. The client primarily serves the data and is the home for resiliency logic: it decides when a call to the server is taking too long, when to open a circuit (see Hystrix) and, if needed, what type of fallback should be served.
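As a rough illustration of the client-side pattern (not MAP's actual code; the command and helper names are made up), a Hystrix command wrapping a read of MAP data might look like this:

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Illustrative only: wraps a read of MAP data so that slow calls and open
// circuits fall back to a degraded response instead of failing the device.
public class FetchRowsCommand extends HystrixCommand<List<String>> {

    private final String profileId;

    public FetchRowsCommand(String profileId) {
        super(HystrixCommandGroupKey.Factory.asKey("MAPServer"));
        this.profileId = profileId;
    }

    @Override
    protected List<String> run() {
        // Normal path: read the pre-aggregated rows that the MAP server wrote to the cache.
        return fetchRowsFromCache(profileId);
    }

    @Override
    protected List<String> getFallback() {
        // Invoked on timeout, error, or an open circuit: serve a degraded response
        // (here an empty row list, which the device must handle gracefully).
        return Collections.emptyList();
    }

    private List<String> fetchRowsFromCache(String profileId) {
        // Stand-in for the real cache read (hypothetical).
        return Arrays.asList("Continue Watching", "Trending Now", "My List");
    }
}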




MAP is in the critical path of content discovery. Without a well thought out resiliency story, any failures in MAP would severely impact the user experience and Netflix's availability. As a result, we spend a lot of time thinking about how to make MAP resilient.

Challenges in making MAP resilient
Two approaches commonly used by MAP to improve resiliency are:
(1) Implementing fallback responses for failure scenarios
(2) Load shedding - either by opening circuits to the downstream services or by limiting retries wherever possible.

There are a number of factors that make it challenging to make MAP resilient: 
(1) MAP has numerous dependencies, which translates to multiple points of failure. In addition, the behavior of these dependencies evolves over time, especially as A/B tests are launched, and a solution that works today may not do so in 6 months. At some level, this is a game of Whack-A-Mole as we try to keep up with a constantly changing ecosystem.

(2) There is no one type of fallback that works for all scenarios:
    • In some cases, an empty response is the only option and devices have to be able to handle that gracefully. E.g. data for the "My List" row couldn't be retrieved.
    • Various degraded modes of performance can be supported. E.g. if the latest personalized home page cannot be delivered, fallbacks can range from stale, personalized recommendations to non-personalized recommendations.
    • In other cases, an exception/error code might be the right response, indicating to clients there is a problem and giving them the ability to adapt the user experience - skip a step in a workflow, request different data, etc.

How do we go from Chaos to Control?

Early on, failures in MAP or its dependent services caused SPS (stream starts per second) dips like this:


It was clear that we needed to make MAP more resilient. The first question to answer was - what does resiliency mean for MAP? It came down to these expectations:
(1) Ensure an acceptable user experience during a MAP failure, e.g. that the user can browse our selection and continue to play videos
(2) Services that depend on MAP (i.e., the API service and device platforms) are not impacted by a MAP failure and continue to provide uninterrupted service
(3) Services that MAP depends on are not overwhelmed by excessive load from MAP

It is easy enough to identify obvious points of failure. For example - if a service provides data X, we could ensure that MAP has a fallback for data X being unavailable. What is harder is knowing the impact of failures in multiple services - different combinations of them - and the impact of higher latencies.

This is where Latency Monkey and FIT come in. Running Latency Monkey in our production environment allows us to detect problems caused by latent services. With Latency Monkey testing, we have been able to fix incorrect behaviors and fine-tune various parameters on the backend services (see the sketch after this list), such as:
(1) Timeouts for various calls
(2) Thresholds for opening circuits via Hystrix
(3) Fallbacks for certain use cases
(4) Thread pool settings
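For context, these kinds of knobs are typically expressed through Hystrix's Setter API. The sketch below is illustrative only - the values shown are not the ones MAP uses - and fallbacks (item 3) live in each command's getFallback(), as shown earlier:

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;
import com.netflix.hystrix.HystrixThreadPoolProperties;

public final class MapDependencyCommandConfig {

    // Illustrative values only; the real numbers come out of Latency Monkey testing.
    public static HystrixCommand.Setter dependencyCallSetter() {
        return HystrixCommand.Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("MapDependency"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        // (1) Timeout for the call
                        .withExecutionTimeoutInMilliseconds(250)
                        // (2) Thresholds for opening the circuit
                        .withCircuitBreakerRequestVolumeThreshold(20)
                        .withCircuitBreakerErrorThresholdPercentage(50))
                .andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
                        // (4) Thread pool sizing for this dependency
                        .withCoreSize(10));
    }

    private MapDependencyCommandConfig() {
    }
}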

FIT, on the other hand, allows us to simulate specific failures. We restrict the scope of failures to a few test accounts. This allows us to validate fallbacks as well as the user experience. Using FIT, we are able to sever connections to:
(1) The cache that handles MAP reads and writes
(2) The dependencies that MAP interfaces with
(3) The MAP service itself




What does control look like?


In a successful run of FIT or Chaos Monkey, this is what the metrics look like now:
Total requests served by MAP before and during the test (no impact)

MAP successful fallbacks during the test (high fallback rate)


On a lighter note, our failure simulations uncovered some interesting user experience issues, which have since been fixed.

  1. Simulating failures in all the dependent services of MAP server caused an odd data mismatch to happen:
The Avengers shows the graphic for Peaky Blinders


  2. Severing connections to MAP server and the cache caused these duplicate titles to be served:


  3. When the cache was made unavailable mid session, some rows looked like this:


  4. Simulating a failure in the "My List" service caused the PS4 UI to be stuck on adding a title to My List:


In an ever-evolving ecosystem of many dependent services, the future of resiliency testing resides in automation. We have taken small but significant steps this year towards automating some of these FIT tests. The goal is to build these tests out so they run during every release and catch any regressions.


Looking ahead for MAP, there are many more problems to solve. How can we make MAP more performant? Will our caching strategy scale to the next X million customers? How do we enable faster innovation without impacting reliability? Stay tuned for updates!

Thursday, August 20, 2015

Fenzo: OSS Scheduler for Apache Mesos Frameworks

Bringing Netflix to our millions of subscribers is no easy task. The product comprises dozens of services in our distributed environment, each of which operates a critical component of the experience while constantly evolving with new functionality. Optimizing the launch of these services is essential for the stability of the customer experience as well as for overall performance and cost. To that end, we are happy to introduce Fenzo, an open source scheduler for Apache Mesos frameworks. Fenzo tightly manages the scheduling and resource assignments of these deployments.


Fenzo is now available in the Netflix OSS suite. Read on for more details about how Fenzo works and why we built it. For the impatient, you can find source code and docs on GitHub.

Why Fenzo?

Two main motivations for developing a new framework, as opposed to leveraging one of the many frameworks in the community, were to achieve scheduling optimizations and to be able to autoscale the cluster based on usage, both of which will be discussed in greater detail below. Fenzo enables frameworks to better manage ephemerality aspects that are unique to the cloud. Our use cases include a reactive stream processing system for real time operational insights and managing deployments of container based applications.


At Netflix, we see a large variation in the amount of data that our jobs process over the course of a day. Provisioning the cluster for peak usage, as is typical in data center environments, is wasteful. Also, systems may occasionally be inundated with interactive jobs from users responding to certain anomalous operational events. We need to take advantage of the cloud’s elasticity and scale the cluster up and down based on dynamic loads.


Although scaling up a cluster may seem relatively easy - for example, by watching the amount of available resources fall below a threshold - scaling down presents additional challenges. If the tasks are long-lived and cannot be terminated without negative consequences, such as a time-consuming reconfiguration of stateful stream processing topologies, the scheduler has to place them such that all tasks on a host terminate at about the same time, so the host can be terminated for scale down.

Scheduling Strategy

Scheduling tasks requires optimization of resource assignments to maximize the intended goals. When there are multiple resource assignments possible, picking one versus another can lead to significantly different outcomes in terms of scalability, performance, etc. As such, efficient assignment selection is a crucial aspect of a scheduler library. For example, picking assignments by evaluating every pending task with every available resource is computationally prohibitive.

Scheduling Model

Our design focused on large scale deployments with a heterogeneous mix of tasks and resources that have multiple constraints and optimization needs. If evaluating the most optimal assignments takes a long time, it could create two problems:
  • resources become idle, waiting for new assignments
  • task launches experience increased latency


Fenzo adopts an approach that moves us quickly in the right direction, as opposed to computing the optimal set of scheduling assignments every time.


Conceptually, we think of tasks as having an urgency factor that determines how soon they need an assignment, and a fitness factor that determines how well they fit on a given host.
If the task is very urgent or if it fits very well on a given resource, we go ahead and assign that resource to the task. Otherwise, we keep the task pending until either urgency increases or we find another host with a larger fitness value.

Trading Off Scheduling Speed with Optimizations

Fenzo has knobs that let you trade off scheduling speed against assignment optimality dynamically. Fenzo evaluates assignments across multiple hosts, but only until a fitness value deemed "good enough" is obtained. A user-defined threshold for what counts as good enough controls the speed, while a fitness evaluation plugin expresses the optimality of assignments and the high-level scheduling objectives for the cluster. A fitness calculator can be composed from multiple other fitness calculators, representing a multi-faceted objective.
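Conceptually, the "good enough" strategy amounts to scoring candidate hosts with a pluggable fitness function and stopping as soon as a host clears a user-defined threshold. The sketch below is illustrative, not Fenzo's actual API:

import java.util.List;
import java.util.function.ToDoubleBiFunction;

public final class GoodEnoughAssignment {

    // Returns the chosen host, or null if the candidate list is empty.
    public static <T, H> H pickHost(T task, List<H> hosts,
                                    ToDoubleBiFunction<T, H> fitness, double goodEnough) {
        H best = null;
        double bestScore = -1.0;
        for (H host : hosts) {
            double score = fitness.applyAsDouble(task, host);
            if (score > bestScore) {
                best = host;
                bestScore = score;
            }
            if (score >= goodEnough) {
                break; // "good enough": trade a little optimality for scheduling speed
            }
        }
        return best;
    }

    private GoodEnoughAssignment() {
    }
}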

Task Constraints

Fenzo tasks can use optional soft or hard constraints to influence assignments to achieve locality with other tasks and/or affinity to resources. Soft constraints are satisfied on a best-effort basis and combine with the fitness calculator when scoring hosts for possible assignment. Hard constraints must be satisfied and act as a resource selection filter.
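One conceptual way to think about the two kinds of constraints (illustrative interfaces, not Fenzo's real plugin API):

interface HardConstraint<T, H> {
    // Acts as a filter: a host that fails is never considered for the task.
    boolean isSatisfied(T task, H host);
}

interface SoftConstraint<T, H> {
    // Combines with the fitness score: a higher value makes the host more
    // attractive, but a low value does not disqualify it.
    double score(T task, H host);
}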

Fenzo provides all relevant cluster state information to the fitness calculators and constraints plugins so you can optimize assignments based on various aspects of jobs, resources, and time.

Bin Packing and Constraints Plugins

Fenzo currently has built-in fitness calculators for bin packing based on CPU, memory, or network bandwidth resources, or a combination of them.
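As an illustration of the idea behind bin packing fitness (not Fenzo's built-in implementation), a CPU bin packer can simply score a host by how full its CPUs would be after the assignment:

public final class CpuBinPackingFitness {

    // Returns a fitness in [0, 1]; higher means a tighter pack, 0 means the task does not fit.
    public static double fitness(double taskCpus, double usedCpus, double totalCpus) {
        double usedAfterAssignment = usedCpus + taskCpus;
        if (usedAfterAssignment > totalCpus) {
            return 0.0;
        }
        return usedAfterAssignment / totalCpus;
    }

    private CpuBinPackingFitness() {
    }
}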


Some of the built-in constraints address common use cases: locality with respect to resource types, assigning distinct hosts to a set of tasks, and balancing tasks across a given host attribute, such as availability zone or rack location.


You can customize fitness calculators and constraints by providing new plugins.

Cluster Autoscaling

Fenzo supports cluster autoscaling using two complementary strategies:
  • Thresholds based
  • Resource shortfall analysis based


Thresholds based autoscaling lets users specify rules per host group (e.g., EC2 Auto Scaling Group, ASG) being used in the cluster. For example, there may be one ASG created for compute intensive workloads using one EC2 instance type, and another for network intensive workloads. Each rule helps maintain a configured number of idle hosts available for launching new jobs quickly.
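As a rough sketch of what such a threshold rule boils down to (illustrative only, not Fenzo's actual rule interface):

public final class IdleHostThresholdRule {

    private final String hostGroup;     // e.g., the name of an EC2 ASG
    private final int minIdleHosts;     // scale up when idle hosts drop below this
    private final int maxIdleHosts;     // scale down when idle hosts exceed this

    public IdleHostThresholdRule(String hostGroup, int minIdleHosts, int maxIdleHosts) {
        this.hostGroup = hostGroup;
        this.minIdleHosts = minIdleHosts;
        this.maxIdleHosts = maxIdleHosts;
    }

    // Returns the number of hosts to add (positive) or remove (negative) for this group.
    public int desiredDelta(int currentIdleHosts) {
        if (currentIdleHosts < minIdleHosts) {
            return minIdleHosts - currentIdleHosts;
        }
        if (currentIdleHosts > maxIdleHosts) {
            return -(currentIdleHosts - maxIdleHosts);
        }
        return 0;
    }

    public String hostGroup() {
        return hostGroup;
    }
}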


The resource shortfall analysis attempts to estimate the number of hosts required to satisfy the pending workload. This complements the rules based scale up during demand surges. Fenzo’s autoscaling also complements predictive autoscaling systems, such as Netflix’s Scryer.
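At its simplest, a shortfall estimate divides the pending workload's resource needs by a host's capacity and takes the most constrained dimension. The sketch below is conceptual, not Fenzo's implementation:

public final class ShortfallEstimate {

    // Returns how many additional hosts are needed for the pending workload,
    // taking the most constrained resource dimension.
    public static int hostsNeeded(double pendingCpus, double pendingMemoryGb,
                                  double cpusPerHost, double memoryGbPerHost) {
        int byCpu = (int) Math.ceil(pendingCpus / cpusPerHost);
        int byMemory = (int) Math.ceil(pendingMemoryGb / memoryGbPerHost);
        return Math.max(byCpu, byMemory);
    }

    private ShortfallEstimate() {
    }
}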

Usage at Netflix

Fenzo is currently being used in two Mesos frameworks at Netflix for a variety of use cases including long running services and batch jobs. We have observed that the scheduler is fast at allocating resources with multiple constraints and custom fitness calculators. Also, Fenzo has allowed us to scale the cluster based on current demand instead of provisioning it for peak demand.


The table below shows the average and maximum times we have observed for each scheduling run in one of our clusters. Each scheduling run may attempt to assign resources to more than one task. The run time can vary depending on the number of tasks that need assignments, the number and types of constraints used by the tasks, and the number of hosts to choose resources from.


Scheduler run time in milliseconds:
  • Average: 2 ms
  • Maximum: 38 ms (occasional spikes of about 30 ms)


The image below shows the number of Mesos slaves in the cluster going up and down as a result of Fenzo's autoscaler actions over several days, representing about a 3X difference between the maximum and minimum counts.

Fenzo Usage in Mesos Frameworks

The simplified diagram above shows how Fenzo is used by an Apache Mesos framework. Fenzo's task scheduler provides the scheduling core without interacting with Mesos itself. The framework interfaces with Mesos to get callbacks on new resource offers and task status updates, and calls the Mesos driver to launch tasks based on Fenzo's assignments.

Summary

Fenzo has been a great addition to our cloud platform. It gives us a high degree of control over work scheduling on Mesos, and has enabled us to strike a balance between machine efficiency and getting jobs running quickly. Out of the box Fenzo supports cluster autoscaling and bin packing. Custom schedulers can be implemented by writing your own plugins.


Source code is available on the Netflix GitHub. The repository contains a sample framework that shows how to use Fenzo. Also, the JUnit tests show examples of various features, including writing custom fitness calculators and constraints. The Fenzo wiki contains detailed documentation to get you started.

Monday, August 17, 2015

Netflix Releases Falcor Developer Preview

by Jafar Husain, Paul Taylor and Michael Paulson

Developers strive to create the illusion that all of their application’s data is sitting right there on the user’s device just waiting to be displayed. To make that experience a reality, data must be efficiently retrieved from the network and intelligently cached on the client.

That’s why Netflix created Falcor, a JavaScript library for efficient data fetching. Falcor powers Netflix’s mobile, desktop and TV applications.

Falcor lets you represent all your remote data sources as a single domain model via JSON Graph. Falcor makes it easy to access as much or as little of your model as you want, when you want it. You retrieve your data using familiar JavaScript operations like get, set, and call. If you know your data, you know your API.

You code the same way no matter where the data is, whether in memory on the client or over the network on the server. Falcor keeps your data in a single, coherent cache and manages stale data and cache pruning for you. Falcor automatically traverses references in your graph and makes requests as needed. It transparently handles all network communications, opportunistically batching and de-duping requests.

Today, Netflix is unveiling a developer preview of Falcor:

Falcor is still under active development and we'll be unveiling a roadmap soon. This developer preview includes a Node version of our Falcor Router, which is not yet in production use.

We’re excited to start developing in the open and share this library with the community, and eager for your feedback and contributions.

For ongoing updates, follow Falcor on Twitter!

Wednesday, August 5, 2015

Making Netflix.com Faster

by Kristofer Baxter

Simply put, performance matters. We know members want to immediately start browsing or watching their favorite content and have found that faster startup leads to more satisfying usage. So, when building the long-awaited update to netflix.com, the Website UI Engineering team made startup performance a first tier priority.

This effort netted a 70% reduction in startup time and focused on three key areas:

  1. Server and Client Rendering
  2. Universal JavaScript
  3. JavaScript Payload Reductions

Server and Client Rendering

The netflix.com legacy website stack had a hard separation between server markup and client enhancement. This was primarily due to the different programming languages used in each part of our application. On the server, there was Java with Tomcat, Struts and Tiles. On the browser client, we enhanced server-generated markup with JavaScript, primarily via jQuery.

This separation led to undesirable results in our startup time. Every time a visitor came to any page on netflix.com our Java tier would generate the majority of the response needed for the entire page's lifetime and deliver it as HTML markup. Often, users would be waiting for the generation of markup for large parts of the page they would never visit.

Our new architecture renders only a small amount of the page's markup, bootstrapping the client view. We can easily change the amount of the total view the server generates, making it easy to see the positive or negative impact. The server requires less data to deliver a response and spends less time converting data into DOM elements. Once the client JavaScript has taken over, it can retrieve all additional data for the remainder of the current and future views of a session on demand. The large wins here were the reduction of processing time in the server, and the consolidation of the rendering into one language.

We find the flexibility afforded by server and client rendering allows us to make intelligent choices of what to request and render in the server and the client, leading to a faster startup and a smoother transition between views.

Universal JavaScript

In order to support identical rendering on the client and server, we needed to rethink our rendering pipeline. Our previous architecture's separation between the generation of markup on the server and the enhancement of it on the client had to be dropped.

Three large pain points shaped our new Node.js architecture:

  1. Context switching between languages was not ideal.
  2. Enhancement of markup required too much direct coupling between server-only code generating markup and the client-only code enhancing it.
  3. We’d rather generate all our markup using the same API.

There are many solutions to this problem that don't require Universal JavaScript, but we found this lesson was most appropriate: When there are two copies of the same thing, it's fairly easy for one to be slightly different than the other. Using Universal JavaScript means the rendering logic is simply passed down to the client.

Node.js and React.js are natural fits for this style of application. With Node.js and React.js, we can render from the server and subsequently render changes entirely on the client after the initial markup and React.js components have been transmitted to the browser. This flexibility allows for the application to render the exact same output independent of the location of the rendering. The hard separation is no longer present and it's far less likely for the server and client to be different than one another.

Without shared rendering logic we couldn't have realized the potential of rendering only what was necessary on startup and everything else as data became available.

Reduce JavaScript Payload Impact

Building rich interactive experiences on the web often translates into a large JavaScript payload for users. In our new architecture, we placed significant emphasis on pruning large dependencies we can knowingly replace with smaller modules and delivering JavaScript only applicable for the current visitor.

Many of the large dependencies we relied on in the legacy architecture didn't apply in the new one. We've replaced these dependencies in favor of newer, more efficient libraries. Replacing these libraries resulted in a much smaller JavaScript payload, meaning members need less JavaScript to start browsing. We know there is significant work remaining here, and we're actively working to trim our JavaScript payload down further.

Time To Interactive

In order to test and understand the impact of our choices, we monitor a metric we call time to interactive (tti).

Amount of time spent between first known startup of the application platform and when the UI is interactive regardless of view. Note that this does not require that the UI is done loading, but is the first point at which the customer can interact with the UI using an input device.

For applications running inside a web browser, this data is easily retrievable from the Navigation Timing API (where supported).

Work is Ongoing

We firmly believe high performance is not an optional engineering goal – it's a requirement for creating great user-experiences. We have made significant strides in startup performance, and are committed to challenging our industry’s best-practices in the pursuit of a better experience for our members.

Over the coming months we'll be investigating Service Workers, ASM.js, Web Assembly, and other emerging web standards to see if we can leverage them for a more performant website experience. If you’re interested in helping create and shape the next generation of performant web user-experiences apply here.

Monday, August 3, 2015

Netflix at Velocity 2015: Linux Performance Tools

There are many performance tools nowadays for Linux, but how do they all fit together, and when do we use them? At Velocity 2015, I gave a 90 minute tutorial on Linux performance tools. I’ve spoken on this topic before, but given a 90 minute time slot I was able to include more methodologies, tools, and live demonstrations, making it the most complete tour of the topic I’ve done. The video and slides are below.

In this tutorial I summarize traditional and advanced performance tools, including: top, ps, vmstat, iostat, mpstat, free, strace, tcpdump, netstat, nicstat, pidstat, swapon, lsof, sar, ss, iptraf, iotop, slabtop, pcstat, tiptop, rdmsr, lmbench, fio, pchar, perf_events, ftrace, SystemTap, ktap, sysdig, and eBPF; and reference many more. I also include updated tools diagrams for observability, sar, benchmarking, and tuning (including the image above).

This tutorial can be shared with a wide audience – anyone working on Linux systems – as a free crash course on Linux performance tools. I hope people enjoy it and find it useful. Here's the playlist.

Part 1 (youtube) (54 mins):

Part 2 (youtube) (45 mins):

Slides (slideshare):

At Netflix, we have Atlas for cloud-wide monitoring, and Vector for on-demand instance analysis. Much of the time we don't need to login to instances directly, but when we do, this tutorial covers the tools we use.

Thanks to O'Reilly for hosting a great conference, and those who attended.

If you are passionate about the content in this tutorial, we're hiring, particularly for senior SREs and performance engineers: see Netflix jobs!

Tuesday, July 28, 2015

Tuning Tomcat For A High Throughput, Fail Fast System

Problem

Netflix has a number of high throughput, low latency mid-tier services. In one of these services, it was observed that when there was a huge surge in traffic in a very short span of time, the machines became cpu starved and unresponsive. This led to a bad experience for the clients of this service: they would get a mix of read and connect timeouts. Read timeouts can be particularly bad if they are set to be very high, because the client machines will wait to hear from the server for a long time. In an SOA, this can lead to a ripple effect, as the clients of these clients will also start getting read timeouts and all services can slow down. Under normal circumstances, the machines had an ample amount of cpu free and the service was not cpu intensive. So, why did this happen? In order to understand that, let's first look at the high level stack for this service. The request flow looks like this:




On simulating the traffic surge in the test environment, it was found that the reason for cpu starvation was improper Apache and Tomcat configuration. On a sudden increase in traffic, multiple Apache workers became busy and a very large number of Tomcat threads also got busy. There was a huge jump in system cpu, as none of the threads could do any meaningful work since most of the time the cpu was context switching.

Solution

Since this was a mid-tier service, there was not much use for Apache. So, instead of tuning two systems (Apache and Tomcat), it was decided to simplify the stack and get rid of Apache. To understand why too many Tomcat threads got busy, let's look at the Tomcat threading model.

High Level Threading Model for Tomcat Http Connector

Tomcat has an acceptor thread to accept connections. In addition, there is a pool of worker threads which do the real work. The high level flow for an incoming request is:
  1. TCP handshake between the OS and the client for establishing a connection. Depending on the OS implementation there can be a single queue for holding the connections or there can be multiple queues. In the case of multiple queues, one holds incomplete connections which have not yet completed the tcp handshake. Once completed, connections are moved to the completed connection queue for consumption by the application. The "acceptCount" parameter in the Tomcat configuration controls the size of these queues.
  2. The Tomcat acceptor thread accepts connections from the completed connection queue.
  3. It checks if a worker thread is available in the free thread pool. If not, it creates a worker thread if the number of active threads < maxThreads; otherwise it waits for a worker thread to become free.
  4. Once a free worker thread is found, the acceptor thread hands the connection to it and goes back to listening for new connections.
  5. The worker thread does the actual job of reading input from the connection, processing the request and sending the response to the client. If the connection is not keep alive, it closes the connection and places itself in the free thread pool. For a keep alive connection, it waits for more data to become available on the connection. If data does not become available before keepAliveTimeout, it closes the connection and makes itself available in the free thread pool.
If the number of Tomcat threads and the acceptCount value are set too high, a sudden increase in traffic will fill up the OS queues and make all the worker threads busy. When more requests are sent to the machines than the system can handle, this "queuing" of requests is inevitable and will lead to more busy threads, eventually causing cpu starvation. Hence, the crux of the solution is to avoid too much queuing of requests at multiple points (the OS and the Tomcat threads) and to fail fast (return HTTP status 503) as soon as the application's maximum capacity is reached. Here is a recommendation for doing this in practice:

Fail fast in case the system capacity for a machine is hit

Estimate the number of threads expected to be busy at peak load. If the server responds in 5 ms on average for a request, then a single thread can do a maximum of 200 requests per second (rps). If the machine has a quad core cpu, it can do a maximum of 800 rps. Now assume that 4 requests (since the assumption is that the machine is a quad core) come in parallel and hit the machine. This will make 4 worker threads busy. For the next 5 ms all these threads will be busy. The total rps to the system is the maximum value of 800, so in the next 5 ms, 4 more requests will come in and make another 4 threads busy. Subsequent requests will pick up one of the previously busy threads that has become free. So, on average there should not be more than 8 threads busy at 800 rps. The behavior will be a little different in practice because all system resources, like cpu, are shared. Hence one should experiment for the total throughput the system can sustain and calculate the expected number of busy threads. This provides a baseline for the number of threads needed to sustain peak load. In order to provide some buffer, let's more than triple the number of max threads needed, to 30. This buffer is arbitrary and can be further tuned if needed. In our experiments we used a slightly more than 3x buffer and it worked well.

Track the number of active concurrent requests in memory and use it for fast failing. If the number of concurrent requests is near the estimated number of active threads (8 in our example), then return an HTTP status code of 503. This prevents too many worker threads from becoming busy, because once the peak throughput is hit, any extra threads that become active will be doing the very lightweight job of returning 503 and will then be available for further processing.
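Here is a minimal sketch of such a fail-fast guard, assuming a standard servlet stack; the threshold of 8 comes from the capacity estimate above, and this is illustrative rather than our production code:

import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

// Illustrative fail-fast filter: track in-flight requests in memory and return
// 503 once the estimated capacity is exceeded, so extra worker threads do only
// a lightweight amount of work and free up quickly.
public class FailFastFilter implements Filter {

    private static final int MAX_CONCURRENT_REQUESTS = 8; // from the capacity estimate above
    private final AtomicInteger inFlight = new AtomicInteger();

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        if (inFlight.incrementAndGet() > MAX_CONCURRENT_REQUESTS) {
            inFlight.decrementAndGet();
            ((HttpServletResponse) response).sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
            return;
        }
        try {
            chain.doFilter(request, response);
        } finally {
            inFlight.decrementAndGet();
        }
    }

    @Override
    public void init(FilterConfig filterConfig) {
    }

    @Override
    public void destroy() {
    }
}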

Configure Operating System parameters

The acceptCount parameter for Tomcat dictates the length of the queues at the OS level for completing tcp handshake operations (the details are OS specific). It's important to tune this parameter, otherwise one can have issues establishing connections to the machine, or connections can queue up excessively in the OS queues, which will lead to read timeouts. The implementation details of handling incomplete and complete connections vary across OSes. There can be a single queue of connections or multiple queues for incomplete and complete connections (please refer to the References section for details). So, a nice way to tune the acceptCount parameter is to start with a small value and keep increasing it until the connection errors go away.

Having too large a value for acceptCount means that incoming requests can get accepted at the OS level. However, if the incoming rps is more than what a machine can handle, all the worker threads will eventually become busy and the acceptor thread will wait for a worker thread to become free. More requests will continue to pile up in the OS queues, since the acceptor thread consumes them only when a worker thread becomes available. In the worst case, these requests will time out while waiting in the OS queues, but will still be processed by the server once they are picked up by Tomcat's acceptor thread. This is a complete waste of processing resources, as the client will never receive the response.

If the value of acceptCount is too small, then at a high rps there will not be enough space for the OS to accept connections and make them available to the acceptor thread. In this case, connect timeout errors will be returned to the client well before the server's actual throughput limit is reached.

Hence, experiment by starting with a small value like 10 for acceptCount and keep increasing it until there are no connection errors from the server.
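For reference, these knobs are plain Connector attributes. The sketch below sets them programmatically on an embedded Tomcat connector; the same attributes normally live on the <Connector> element in server.xml, and the values are starting points rather than recommendations:

import org.apache.catalina.connector.Connector;

// Illustrative starting point only, not a recommended configuration.
public final class ConnectorTuning {

    public static Connector tunedHttpConnector() {
        Connector connector = new Connector("HTTP/1.1");
        connector.setPort(8080);
        // Worker thread cap, sized from the capacity estimate plus a buffer.
        connector.setProperty("maxThreads", "30");
        // OS-level backlog: start small and raise it until connect errors stop.
        connector.setProperty("acceptCount", "10");
        // Only relevant when HTTP keep-alive is enabled.
        connector.setProperty("keepAliveTimeout", "5000");
        return connector;
    }

    private ConnectorTuning() {
    }
}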

With both of the changes above, even if all the worker threads become busy in the worst case, the servers will not be cpu starved and will be able to do as much work as possible (maximum throughput).

Other considerations

As explained above, each incoming connection is ultimately handed to a Tomcat worker thread. If HTTP keep-alive is turned on, a worker thread will continue to listen on a connection and will not be available in the free thread pool. So, if the clients are not smart enough to close a connection once it is no longer being actively used, the server can very easily run out of worker threads. If keep-alive is turned on, one has to size the server farm with this constraint in mind.

Alternatively, if keep-alive is turned off, one does not have to worry about inactive connections tying up worker threads. However, in this case one has to pay the price of opening and closing the connection on each call. Further, this will also create a lot of sockets in the TIME_WAIT state, which can put pressure on the servers.

It's best to make this choice based on the application's use cases and to test the performance by running experiments.

Results

Multiple experiments were run with different configurations. The results are shown below. The dark blue line is the original configuration with Apache and Tomcat. All the others are different configurations of the stack with only Tomcat.


Throughput
Note the drop after a sustained period of traffic higher than what can be served by the server.


Busy Apache Workers


Idle cpu
Note that the original configuration got so busy that it was not even able to publish the stats for idle cpu on a continuous basis. The stats were published (with a value of 0) for the base configuration only intermittently, as highlighted in the red circles.


Server average latency to process a request

Note

It's possible to achieve the same results by tuning Apache and Tomcat to work together. However, since there was not much use for Apache in our service, we found the above model simpler, with one less moving part. It's best to make choices through a combination of understanding the system and experimentation and testing in a real-world environment to verify hypotheses.

References

  1. https://books.google.com/books/about/UNIX_Network_Programming.html?id=ptSC4LpwGA0C&source=kp_cover&hl=en
  2. http://www.sean.de/Solaris/soltune.html
  3. https://tomcat.apache.org/tomcat-7.0-doc/config/http.html
  4. http://grepcode.com/project/repository.springsource.com/org.apache.coyote/com.springsource.org.apache.coyote/

Acknowledgment

I would like to thank Mohan Doraiswamy for his suggestions in this effort.