Scalability - dealing with overload

Last time we looked at a good way of measuring load. This time we're going to look at ways to deal with overload, whether that's caused by bursts of traffic, glitches in response times, or some other mechanism. Overload in this context means that requests are arriving faster than the system can process them, where the overall response time is the request latency (queuing time) plus the processing time.

The first approach to dealing with overload is to just let it happen. This is actually a reasonable approach if the overload is temporary and only happens occasionally. Your system will catch up once the traffic goes back to normal levels. The danger is that the overload is sustained. If the system keeps accepting requests that arrive faster than the rate at which they can be processed, the system response times will degrade until something goes "bang". Extrapolating from realistic simulations involving a cute bunny and a rocket-propelled grenade, the result will not be pretty.
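To put rough numbers on "the system will catch up", here's a back-of-the-envelope sketch; the figures are invented purely for illustration:

```csharp
using System;

// Invented example numbers: a system that can process 100 requests/second
// sees a 60-second burst at 150 requests/second, after which traffic drops
// back to 80 requests/second.

double capacity     = 100;  // requests processed per second
double burstRate    = 150;  // arrival rate during the burst
double burstSeconds = 60;   // length of the burst
double normalRate   = 80;   // arrival rate after the burst

double backlog   = (burstRate - capacity) * burstSeconds;  // 3,000 requests queued up
double drainTime = backlog / (capacity - normalRate);      // 150 seconds to catch up

Console.WriteLine($"Backlog after burst: {backlog} requests");
Console.WriteLine($"Time to catch up:    {drainTime} seconds");
```

As long as the burst ends, the backlog drains and response times recover; if it doesn't end, the backlog (and the queuing time) just keeps growing.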

A better approach is to decouple the request arrival rate from the processing rate by inserting an "incoming" queue between the arrival thread pool and the transaction thread pool. Requests are taken off the incoming queue for processing at the maximum rate that the transaction thread pool can handle while still maintaining good response times.
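As a concrete sketch, here's roughly what that decoupling could look like in .NET, using System.Threading.Channels as the incoming queue; the class and member names are invented for the example:

```csharp
using System.Linq;
using System.Threading.Channels;
using System.Threading.Tasks;

public record Request(string Payload);

public class RequestPipeline
{
    // The incoming queue. Unbounded here to keep the sketch short; the next
    // paragraphs explain why you want to bound it in practice.
    private readonly Channel<Request> _incoming = Channel.CreateUnbounded<Request>();

    // Called from the arrival threads: enqueue and return immediately, so the
    // arrival rate no longer dictates the processing rate.
    public ValueTask AcceptAsync(Request request) =>
        _incoming.Writer.WriteAsync(request);

    // The transaction "thread pool": a fixed number of workers draining the
    // queue at the maximum rate they can sustain.
    public Task StartWorkers(int workerCount) =>
        Task.WhenAll(Enumerable.Range(0, workerCount).Select(_ => Task.Run(async () =>
        {
            await foreach (var request in _incoming.Reader.ReadAllAsync())
            {
                await ProcessAsync(request); // the real transaction work goes here
            }
        })));

    private static Task ProcessAsync(Request request) => Task.Delay(10); // stand-in
}
```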

Of course if the overload is sustained, we're just moving the problem around. As the queue grows and grows, the latency (and therefore the overall response times) will grow with it. If the queue is unbounded, it will eventually cause significant memory effects - first polluting the CPU caches and then consuming enough memory that your system will undergo a rapid unscheduled disassembly again.

To avoid this, you have to limit the length of the incoming queue. And once that queue is full, you have to start rejecting new requests. For example, if your transport is HTTP you could respond to new requests with HTTP 503 (Service Unavailable) along with a "Retry-After" header suggesting an appropriate number of seconds to wait before sending the next request. In effect you push back upstream, saying "we're full", and politely turn people away at the door. In this way you can keep the response time both reasonable and consistent for the requests that aren't rejected. The system runs up to maximum load and then degrades gracefully without crashing or corrupting data. This is an example of a staged event-driven architecture (SEDA).

A logical design of a real-life implementation might look something like this:
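In code, one possible rendering of that design uses ASP.NET Core minimal APIs with a bounded channel; the framework choice, endpoint, type names and limits below are illustrative assumptions rather than requirements:

```csharp
// Assumes the default ASP.NET Core web project template with implicit usings enabled.
using System.Threading.Channels;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// Bounded incoming queue: at most 1,000 requests waiting at any one time.
var incoming = Channel.CreateBounded<Order>(new BoundedChannelOptions(1000)
{
    FullMode = BoundedChannelFullMode.Wait // TryWrite returns false when the queue is full
});

// Arrival side: accept if there's room, otherwise push back with 503 + Retry-After.
app.MapPost("/orders", (Order order, HttpContext context) =>
{
    if (incoming.Writer.TryWrite(order))
    {
        return Results.Accepted();
    }

    context.Response.Headers["Retry-After"] = "5"; // "we're full, try again in ~5 seconds"
    return Results.StatusCode(StatusCodes.Status503ServiceUnavailable);
});

// Processing side: a fixed pool of workers draining the queue at its own pace.
for (var i = 0; i < Environment.ProcessorCount; i++)
{
    _ = Task.Run(async () =>
    {
        await foreach (var order in incoming.Reader.ReadAllAsync())
        {
            await ProcessOrderAsync(order); // the real transaction work goes here
        }
    });
}

app.Run();

static Task ProcessOrderAsync(Order order) => Task.Delay(10); // stand-in

record Order(string Id);
```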

More generally, system reliability is often about enforcing explicit limits, whether that's queue lengths, message rates, payload sizes, bandwidth, response times, etc. With explicit limits you can define the specific boundaries of your system and then use those limits in your testing. Don't illustrate an interface between two services with an arrow saying "JSON over HTTP". Say instead "JSON over HTTP with a max message rate of 100 per second, a max message size of 5 KB and a max ACK time of 50 milliseconds". This is a big part of the design-by-contract philosophy. Unbounded anything is a reliability anti-pattern. Without explicit limits, systems fail in unpredictable and unexpected ways.
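One way to make that stick is to write the limits down as data that both the implementation and the tests read, rather than leaving them in a diagram. A sketch, with the type, numbers and service name made up for illustration:

```csharp
using System;

public sealed record InterfaceContract(
    int MaxMessagesPerSecond,
    int MaxMessageBytes,
    TimeSpan MaxAckTime);

public static class Contracts
{
    // "JSON over HTTP with a max message rate of 100 per second, a max
    // message size of 5 KB and a max ACK time of 50 milliseconds"
    public static readonly InterfaceContract OrderService = new(
        MaxMessagesPerSecond: 100,
        MaxMessageBytes: 5 * 1024,
        MaxAckTime: TimeSpan.FromMilliseconds(50));
}

// The same constants then drive the load tests, e.g.
//   Assert.True(measuredAckTime <= Contracts.OrderService.MaxAckTime);
```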

ASP.NET implementation note: If you're using ASP.NET in classic mode (don't do that), it uses a maximum of 12 concurrent threads per CPU core before it starts returning HTTP 503 responses. If you're doing non-blocking requests, that should be fine. If your requests block on calls to a back-end processing service, make those calls asynchronous to improve scalability. In integrated mode, ASP.NET limits the number of requests rather than the thread usage. You can read a lot more detail in Thomas Marquardt's blog entry on asynchronous work in ASP.NET.
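For illustration, here's roughly what the blocking and asynchronous versions of such a request look like side by side; this sketch assumes ASP.NET MVC on .NET 4.5 or later and a hypothetical back-end URL:

```csharp
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using System.Web.Mvc; // assumes ASP.NET MVC on .NET 4.5+; adjust for your actual stack

public class OrdersController : Controller
{
    private static readonly HttpClient BackEnd = new HttpClient();

    // Blocking version: the scarce request thread sits idle for the whole
    // duration of the back-end call.
    public ActionResult SubmitBlocking(string id)
    {
        using (var client = new WebClient())
        {
            client.DownloadString("http://backend/orders/" + id);
        }
        return new HttpStatusCodeResult(200);
    }

    // Asynchronous version: the request thread goes back to the pool while the
    // back-end call is in flight, so the same thread budget can serve many
    // more concurrent requests.
    public async Task<ActionResult> Submit(string id)
    {
        var response = await BackEnd.GetAsync("http://backend/orders/" + id);
        return new HttpStatusCodeResult((int)response.StatusCode);
    }
}
```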