Queues Don't Fix Overload (2014)

Back in the 90s, I was part of a team working on an R&D project on flexible manufacturing. One of the central concepts was the use of buffers (also called decouplers) between manufacturing stations. These buffers were the space needed to hold unfinished products while they waited for the next station, typically a conveyor belt. The main benefit was that both stations could run at slightly different rhythms without causing problems for each other, either by starving the next station or by overwhelming it. Queues are pretty similar. You can have a consumer and a producer working, and the queue in between prevents them from being blocked by each other. Thus I now see the main role of queues as decouplers.

Queues fix throughput volatility (not throughput mismatch) at the cost of added latency. If your widget producer is producing 1000 widgets every half-hour and 0 every other half-hour, and your widget consumer needs to consume 100 widgets every six minutes, a 1000-widget queue solves the problem, in exchange for a half-hour increase in end-to-end processing time. But, as the title and article allude, your widget producer and widget consumer still need to process widgets at the same rate on average. The longer the time window needed for those averages to match, the larger your queue needs to be (and the higher the latency).
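A rough way to see the difference between volatility and mismatch is a tick-based simulation. The sketch below assumes the worst case for the numbers above, where the whole 1000-widget batch lands at the top of each hour; `simulate` and its parameters are hypothetical names chosen for illustration.

```python
def simulate(hours, batch, per_tick, ticks_per_hour=10):
    """Return (peak_depth, final_depth) of the queue.

    Assumes one batch arrives at the start of each hour and the consumer
    removes up to `per_tick` widgets on each of the ten six-minute ticks.
    """
    depth = 0
    peak = 0
    for _ in range(hours):
        depth += batch                      # bursty producer
        peak = max(peak, depth)
        for _ in range(ticks_per_hour):
            depth -= min(depth, per_tick)   # steady consumer
    return peak, depth


# Matched average rates: 1000 in per hour, 10 * 100 out per hour.
print(simulate(24, 1000, 100))   # (1000, 0)
# Mismatched rates: 1000 in per hour, 10 * 90 out per hour.
print(simulate(24, 1000, 90))    # (3300, 2400)
```

With matched averages the depth never exceeds one batch and returns to zero every hour. With a 10% shortfall on the consumer side, the backlog grows by 100 widgets per hour and no fixed queue size is ever enough.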

That's also why queues are inappropriate for addressing traffic spikes caused by market dynamics (celebrity news, sales events etc.). Those dynamics typically occur over the course of several hours, whereas a web request's latency SLA is on the order of several seconds.

The author clearly has a wealth of real experience, but I have trouble reconciling some of it with the “real world.”

Supposing that you have “too many” messages in your queue, commanding your frontend client to retry its transaction that would’ve added one more, instead of accepting and enqueuing one additional job, doesn’t seem to me to change much. Instead of creating a mess for whoever is in charge of those servers, the mess is created directly in view of the end user, who sees whatever you show them when their transaction is being retried.

Their point about the bottleneck being the real problem that must be addressed if loads are going to be sustained at such a high level is indisputable, though.

I think I would define the necessary rule as: the queue’s maximum size just needs to be greater than the spikes you expect, but that’s of course no insight, just a definition.

I have found queues to be incredibly valuable at solving situations where load has occasional spikes, but urgency of the jobs being done is low. For instance, every time a user views a piece of content you want to make sure that you increment a counter of how many times the content has been viewed, and you also want to touch the timestamp of when that user last did a thing. If that happens even two hours late, it’s probably gonna be fine. The thing that the queue pattern excels at in the realm of Web applications, especially, is allowing you to have an HTTP GET which can be served entirely by a Web worker that is only allowed to talk to a read replica, which allows extensive horizontal scale. Analytics and other incidentals can be handled async in background jobs (and indeed, in emergencies, load-shedding those ancillary things has barely any impact).

I recognize that all of this probably sounds “obvious” - but I have seen enough codebases that do synchronous writes during GET transactions that I would stop short of calling this “common knowledge.”
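The view-counter case described above could be sketched roughly as follows. This is a minimal illustration, not a production design: `ViewTracker`, `record_view`, and `drain` are hypothetical names, and an in-process `queue.Queue` stands in for whatever job system is actually in use.

```python
import queue
from collections import Counter


class ViewTracker:
    """Hypothetical helper that records views off the request path."""

    def __init__(self, max_pending):
        self.pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0
        self.counts = Counter()

    def record_view(self, content_id):
        """Called from the GET handler; never blocks the request."""
        try:
            self.pending.put_nowait(content_id)
            return True
        except queue.Full:
            self.dropped += 1      # shed the ancillary work, keep serving reads
            return False

    def drain(self):
        """Called by a background job; applies the queued increments."""
        while True:
            try:
                content_id = self.pending.get_nowait()
            except queue.Empty:
                return
            self.counts[content_id] += 1


tracker = ViewTracker(max_pending=3)
results = [tracker.record_view("post-1") for _ in range(5)]
tracker.drain()
print(results)                                     # [True, True, True, False, False]
print(tracker.counts["post-1"], tracker.dropped)   # 3 2
```

The bound on `pending` is what makes the shedding explicit: when the background job falls behind, the counter loses a few increments but the read path is unaffected, and `dropped` tells you how often that happens.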

Once you've read this, pick up Harchol-Balter's book Performance Modeling and Design of Computer Systems: Queueing Theory in Action. It's a really good introduction to the depth of this topic, and you'll come out with superpowers you didn't have before.

Or you could just play Factorio.

Similar things apply to buffered channels in Go: buffering frequently hides deadlocks and similar bugs, because you don't observe them until the buffer is full. So buffers should generally be kept at zero or very small, so that synchronisation issues are caught early.

Two past submissions with notable discussions:

https://news.ycombinator.com/item?id=39041477 - 18 Jan 2024, 153 comments

https://news.ycombinator.com/item?id=8632043 - 19 Nov 2014, 60 comments

Reminds me of this other queuing theory blogpost classic https://medium.com/swlh/fifo-considered-harmful-793b76f98374

TLDR LIFO (stack, not queue) is often a better choice for many workloads, despite violating our sense of fairness.
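A toy model shows why this can hold when requests have deadlines. The sketch below is an assumption-laden illustration and not taken from the linked post: one request is served per tick, two arrive per tick, and a request only counts if service starts within `deadline` ticks of its arrival. The function name and parameters are hypothetical.

```python
from collections import deque


def served_on_time(policy, ticks, arrivals_per_tick=2, deadline=3):
    """Count requests whose service starts within `deadline` ticks of arrival.

    Toy model: one request is served per tick, so the server is overloaded
    whenever arrivals_per_tick > 1.
    """
    waiting = deque()
    on_time = 0
    for now in range(ticks):
        for _ in range(arrivals_per_tick):
            waiting.append(now)              # store the arrival time
        arrived = waiting.popleft() if policy == "fifo" else waiting.pop()
        if now - arrived <= deadline:
            on_time += 1
    return on_time


print(served_on_time("fifo", 100))  # 7
print(served_on_time("lifo", 100))  # 100
```

Under FIFO the server soon spends all its time on requests that are already too old to matter. Under LIFO the backlog grows just as fast, but the work that does get done is still useful; the oldest requests are effectively abandoned, which is the fairness trade-off mentioned above.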

"People misuse queues all the time. The most egregious case being to fix issues with slow apps, and consequently, with overload. But to say why, I'll need to take bits of talks and texts I have around the place, and content that I have written in more details about in Erlang in Anger."

This feels like an unfinished gripe—summarize the key point now rather than promising a future deep dive. Backing claims with a concise example would make the argument useful instead of vague.

Hum... The biggest hype in process engineering by the 00s was buffer size reduction. Because those buffers interfere with each other in chaotic ways, and they tend to turn "small problems that blow up soon, with small consequences" into "a huge accumulated problem that blows up hours after it appeared, with business-risking consequences".

Queues are pretty similar.

That's a very good analogy. Queues are not there to solve overload (and they never were), they are there as an *architecture tool* that allows decoupling and *can* (not always) ease the scaling of the queue process (workers).

I think back-pressure should always be implemented from the very beginning, as it also helps with defining the requirements of what the service should be able to handle.

You're essentially describing what a silicon engineer would call independent "clock domains" (the stations) and "clock-domain-crossing signals" (the workpieces.) And, indeed, you would also tend to handle clock-domain-crossing signals by sticking an async FIFO between the two clock domains.

In manufacturing you have mass. Stuff has weight to it and sometimes I think it would be best to imagine data has mass.

In the widget factory there is the option to put stuff in the warehouse until you need it. Great in principle, but if you are the guy having to do the heavy lifting to get stuff crammed into the warehouse, and retrieved, then you can end up wondering why you are in the job, which promised so much more than spending all day in the warehouse rather than making stuff.

With web applications we will gladly get gigabytes of stuff from the other side of the world, just in case we need it. If all of that data weighed grams or even tonnes, then we would do things very differently, to be more like the Toyota Way, with just-in-time and the rest of it.

Hence my suggestion when building for the web, imagine every byte has mass. Design accordingly.

Reducing buffer size puts back pressure on the whole system, which can be valuable to manage load (but often throttles faster stages and that throttling makes people uncomfortable). A meaningful metric is how much of the buffer is used at any given time, along with the throughput. If the buffer is backed up, that says there's a bottleneck on the consumption side of the buffer and more bandwidth is needed there. For whatever reason, adjusting buffer sizes is the more common action taken. A buffer provides throughput management but it also provides info/metrics about the operation of the system.

You can also observe this in games like Dyson Sphere Program [1] (which is all workers and queues and buffers), where adding a buffer storage section to a belt only hides the fact that you are under-producing one of the components required.

The buffer smooths out bursty flow but you don't want that in the middle of the pipeline, as it actually represents mid-pipeline inefficiency. You should actually be fixing the upstream or downstream problem.

[1] or other automation games like Factorio, Mindustry

Huh. That's so obvious but I never would have thought of it.

I don't think this is a good resource for an intro tbh. Unless you are interested in proofs and have some probability basics covered, it feels quite dense.

I liked Principles of Product Development Flow a lot more because it was easier to digest, although it's a different application of queuing theory.

Even if you just read the first few chapters of this you will not come out unchanged.

> the queue’s maximum size just needs to be greater than the spikes you expect

There is one truth I have come to know, said by someone far wiser than me: A queue is either empty or full. Which is to say a queue can either handle all the data coming in, or it can’t. When it can’t, it will fill to capacity. This is a probabilistic thing, and you can only decide how many nines to plan for. And it’s worse than it looks at first, because queuing theory is very non-intuitive, with non-linearities that make it very hard to reason about without having your nose rubbed in it.

So that means that, yes, you can keep doubling the size of your queue. And no, you can’t ever make it big enough to deal with a Poisson distribution. And while you’re at it you will likely need to add workers. And you’re still back to capacity planning and deciding how much money to throw at the problem.

What you may be getting at, and what the article sort of failed at, is that queues are still super valuable for smoothing small spikes, or even large predictable ones. But a queue alone, without backpressure or overflow handling, will likely cause systems to fall over. Sometimes in ways that are hard to recover from, especially if you have some kind of microservices-inspired architecture, where one thing going offline causes another queue elsewhere to fill. Or worse, bringing a failed service back online stresses another system, causing it to fall offline. (Not meant to be a dig at microservices, by the way.)
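To put a number on that non-linearity, here is a small sketch using the textbook M/M/1 model (Poisson arrivals, exponentially distributed service times, one server). That model is an assumption made for illustration and real systems will differ, but its mean number of jobs in the system, rho / (1 - rho), shows the shape of the curve. The function name is hypothetical.

```python
def mean_in_system(utilisation):
    """Mean number of jobs in an M/M/1 system: rho / (1 - rho)."""
    if not 0 <= utilisation < 1:
        raise ValueError("the M/M/1 queue is only stable for 0 <= rho < 1")
    return utilisation / (1 - utilisation)


for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    print(f"utilisation {rho:.2f}: about {mean_in_system(rho):.1f} jobs in the system")
```

Going from 50% to 90% utilisation takes the average from one job to nine, and 99% takes it to ninety-nine. Near saturation, a small increase in load produces a very large increase in backlog, which is why sizing a queue for "the spikes you expect" is harder than it sounds.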

I agree with this - queues may not be the end-all solution, but they are a valuable tool in our kit.

And in the right situations, they can be enough.

I'm not sure how you came away with that impression. Three out of three reviewers say they overall enjoyed the book. The complaints fall mostly into four buckets:

- "I wish the book was simpler" (Jesse)

- "I wish the book was more advanced" (Murat)

- "I wish software engineering was more advanced" (Andrew)

- "I didn't understand the arguments the author made for why studying single-server exponential response time systems helps with drawing conclusions for time-sharing, heavy-tailed response time systems" (Jesse)

None of these paint the book in a bad colour, as far as I can tell. They say more about the reader's expectations than the book itself.

That is also a good book containing a few practical applications of queueing theory, but it won't do anything to help you analyse your own systems on a more fundamental level.

I'll note that speedrunners absolutely buffer mid-pipeline in Factorio, and not just for hand-crafting purposes. Sometimes you're waiting for R&D, sometimes you're just running half the machines for twice as long, giving you the same output while saving on build costs. The actual bottlenecks are constantly shifting. "I'm not speedrunning!" you might say, but every regular game could've started as a speedrun that could've gotten you to where you are faster.

Understanding the tendency of mid-pipeline buffers to hide problems is useful, but scorning them entirely is also suboptimal.

Sure. But I'm trying to connect what you said:

> you'll come out with superpowers you didn't have before.

with the impressions from the reviewers.

I don't think they got superpowers from the book. In fact, their outcomes mirror my own outcomes when going deep into some math topics and then bringing them to work.

OK, queues.

People misuse queues all the time. The most egregious case being to fix issues with slow apps, and consequently, with overload. But to say why, I'll need to take bits of talks and texts I have around the place, and content that I have written about in more detail in Erlang in Anger.

To oversimplify things, most of the projects I end up working on can be visualized as a very large bathroom sink. User and data input are flowing from the faucet, down 'til the output of the system:

So under normal operations, your system can handle all the data that comes in, and carry it out fine:

Water goes in, water goes out, everyone's happy. However, from time to time, you'll see temporary overload on your system. If you do messaging, this is going to be around sporting events or events like New Year's Eve. If you're a news site, it's gonna be when a big thing happens (Elections in the US, Royal baby in the UK, someone says they dislike French as a language in Quebec).

During that time, you may experience that temporary overload:

The data that comes out of the system is still limited, and input comes in faster and faster. Web people will use stuff like caches at that point to reduce the input and output required. Other systems will use a huge buffer (a queue, or in this case, a sink) to hold the temporary data.

The problem comes when you inevitably encounter prolonged overload. It's when you look at your system load and go "oh crap", and it's not coming down ever. Turns out Obama doesn't want to turn in his birth certificate, the royal baby doesn't look like the father, and someone says Quebec should be better off with Parisian French, and the rumor mill is going for days and weeks at a time:

All of a sudden, the buffers, queues, whatever, can't deal with it anymore. You're in a critical state where you can see smoke rising from your servers, or if in the cloud, things are as bad as usual, but more!

The system inevitably crashes:

Woops, everyone is dead, you're in the office at 3am (who knew so many people in the US, disgusted with their "Kenyan" president, now want news on the royal baby, while Quebec people look up 'royale with cheese baby' for some reason) trying to keep things up.

You look at your stack traces, at your queue, at your DB slow queries, at the APIs you call. You spend weeks at a time optimizing every component, making sure it's always going to be good and solid. Things keep crashing, but you hit the point where every time, it takes 2-3 days more.

At the end of it, you see a crapload of problems still happening, but the failures are a week apart, which slows down your optimizing in immense ways because it's incredibly hard to measure things when they take weeks to go bad.

You go "okay I'm all out of ideas, let's buy a bigger server." The system in the end looks like this, and it's still failing:

Except now it's an unmaintainable piece of garbage full of dirty hacks to make it work that cost 5 times what it used to, and you've been paid for months optimizing it for no god damn reason because it still dies when overloaded.

The problem? That red arrow there. You're hitting some hard limit that even through all of your profiling, you didn't consider properly. This can be a database, an API to an external service, disk speed, bandwidth or general I/O limits, paging speed, CPU limits, whatever.

You've spent months optimizing your super service only to find out that, at some point in time, you went past the best speed it can reach without larger changes, and the day your system got to have an operational speed greater than this hard limit, you doomed yourself to an everlasting series of system failures.

The disheartening part about it is that you discover this once your system is popular and has people using it and its APIs, when changing it to be better is very expensive and hard. Especially since you'll probably have to revisit assumptions you've made in its core design. Woops.

So what do you need? You'll need to pick what has to give whenever stuff goes bad. You'll have to pick between blocking on input (back-pressure), or dropping data on the floor (load-shedding). And that happens all the time in the real world, we just don't want to do it as developers, as if it were an admission of failure.

Bouncers in front of a club, water spillways to go around dams, the pressure mechanism that keeps you from putting more gas in a full tank, and so on. They're all there to impose a system-wide flow control to keep operations safe.

In [non-critical] software? Who cares! We never shed load because that makes stakeholders angry, and we never think about back-pressure. Usually the back-pressure in the system is implicit: 'tis slow.

A function/method call to something ends up taking longer? It's slow. Not enough people think of it as back-pressure making its way through your system. In fact, slow distributed systems are often the canary in the overload coal mine. The problem is that everyone just stands around and goes "durr why is everything so slow??" and devs go "I don't know! It just is! It's hard, okay!"

That's usually because somewhere in the system (possibly the network, or something that is nearly impossible to observe without proper tooling, such as TCP incast), something is clogged and everything else is pushing it back to the edge of your system, to the user.

And that back-pressure making the system slower? It slows down the rate at which users can input data. It's what is likely keeping your whole stack alive. And you know when people start using queues? Right there. When operations take too long and block stuff up, people introduce a freaking queue in the system.

And the effects are instant. The application that was sluggish is now fast again. Of course you need to redesign the whole interface and interactions and reporting mechanisms to become asynchronous, but man is it fast!

Except at some point the queue spills over, and you lose all of the data. There's a serious meeting that then takes place where everyone discusses how this could possibly have happened. Dev #3 suggests more workers are added, Dev #6 recommends the queue gets persistency so that when it crashes, no requests are lost.

"Cool," says everyone. Off to work. Except at some point, the system dies again. And the queue comes back up, but it's already full and uuugh. Dev #5 goes in and thinks "oh yeah, we could add more queues" (I swear I've seen this unfold back when I didn't know better). People say "oh yeah, that increases capacity" and off they go.

And then it dies again. And nobody ever thought of that sneaky red arrow there:

Maybe they do it without knowing, and decide to go with MongoDB because it's "faster than Postgres" (heh). Who knows.

The real problem is that everyone involved used queues as an optimization mechanism. With them, new problems are now part of the system, which is a nightmare to maintain. Usually, these problems will come in the form of ruining the end-to-end principle by using a persistent queue as a fire-and-forget mechanism or assuming tasks can't be replayed or lost. You have more places that can time out, require new ways to detect failures and communicate them back to users, and so on.

Those can be worked around, don't get me wrong. The issue is that they're being introduced as part of a solution that's not appropriate for the problem it's built to solve. All of this was just premature optimization. Even when everyone involved took measures, reacted to real failures in real pain points, etc. The issue is that nobody considered what the true, central business end of things is, and what its limits are. People considered these limits locally in each sub-component, more or less, and not always.

But someone should have picked what had to give: do you stop people from inputting stuff in the system, or do you shed load? Those are inescapable choices, where inaction leads to system failure.

And you know what's cool? If you identify these bottlenecks you have for real in your system, and you put them behind proper back-pressure mechanisms, your system won't even have the right to become slow.

Step 1: identify the bottleneck. Step 2: ask the bottleneck for permission to pile more data in:
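One way to read "ask the bottleneck for permission" is a hard limit on in-flight work at the slowest resource, with an immediate refusal when the limit is reached. The sketch below is one possible interpretation, not the article's implementation; `Bottleneck`, `Overloaded`, and `max_in_flight` are hypothetical names.

```python
import threading


class Overloaded(Exception):
    """Raised when the bottleneck refuses more work."""


class Bottleneck:
    """Hypothetical wrapper around the slowest resource (a DB, an API, ...)."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def call(self, work):
        # Ask for permission first; refuse immediately rather than queue.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("try again later")
        try:
            return work()
        finally:
            self._slots.release()


db = Bottleneck(max_in_flight=1)
print(db.call(lambda: "row"))   # row
```

Callers that receive `Overloaded` can back off, retry, or report the refusal to the user. The value of `max_in_flight` is the operational limit: it should come from measuring what the bottleneck can actually sustain, not from a guess.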

Depending on where you put your probe, you can optimize for different levels of latency and throughput, but what you're going to do is define proper operational limits of your system.

When people blindly apply a queue as a buffer, all they're doing is creating a bigger buffer to accumulate data that is in-flight, only to lose it sooner or later. You're making failures more rare, but you're making their magnitude worse.
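The trade-off in that last sentence is simple arithmetic. The sketch below assumes constant rates and sustained overload, with made-up numbers; `overflow` is a hypothetical helper.

```python
def overflow(buffer_size, arrivals_per_sec, served_per_sec):
    """Return (seconds_until_full, items_in_flight_when_full).

    Assumes constant rates and sustained overload (arrivals > served).
    """
    surplus = arrivals_per_sec - served_per_sec
    if surplus <= 0:
        raise ValueError("no sustained overload: the buffer never fills")
    return buffer_size / surplus, buffer_size


for size in (1_000, 100_000):
    seconds, at_risk = overflow(size, arrivals_per_sec=110, served_per_sec=100)
    print(f"buffer={size}: full after {seconds:.0f}s, {at_risk} items at risk")
```

With a 10% overload, a buffer a hundred times larger buys a hundred times more time before it fills, and puts a hundred times more in-flight data at risk when it does.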

When you shed load and define proper operational limits to your system, you don't have these. What you may have is customers that are as unhappy (because in either case, they can't do what your system promises right), but with proper back-pressure or load-shedding, you gain:

- Proper metrics of your quality of service
- An API that will be designed with either in mind (back-pressure lets you know when you're in an overload situation, and when to retry or whatever, and load-shedding lets the user know that some data was lost so they can work around that)
- Fewer night pages
- Fewer critical rushes to get everything fixed because it's dying all the time
- A way to monetize your services through varying account limits and priority lanes
- You act as a more reliable endpoint for everyone who depends on you

To make stuff usable, a proper idempotent API with end-to-end principles in mind will make it so these instances of back-pressure and load shedding should rarely be a problem for your callers, because they can safely retry requests and know if they worked.
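As a rough illustration of why idempotency makes retries safe, the sketch below has a server that deduplicates on a caller-chosen request key and a client that retries when told the server is busy. All names (`Service`, `Busy`, `submit_with_retry`, `busy_for`) are hypothetical, and the in-memory dict stands in for whatever durable store a real service would use.

```python
class Busy(Exception):
    """Back-pressure signal: the caller should retry later."""


class Service:
    """Hypothetical endpoint that deduplicates on a caller-chosen key."""

    def __init__(self, busy_for=0):
        self.results = {}          # request key -> stored outcome
        self._busy_for = busy_for  # number of calls to refuse (simulated overload)

    def submit(self, key, payload):
        if key in self.results:    # replayed request: return the first outcome
            return self.results[key]
        if self._busy_for > 0:
            self._busy_for -= 1
            raise Busy()
        self.results[key] = f"stored {payload}"
        return self.results[key]


def submit_with_retry(service, key, payload, attempts=5):
    """Retry on Busy; safe only because submit() is idempotent per key."""
    for _ in range(attempts):
        try:
            return service.submit(key, payload)
        except Busy:
            continue               # a real client would back off before retrying
    raise Busy()


service = Service(busy_for=2)
print(submit_with_retry(service, "req-1", "hello"))  # stored hello
print(submit_with_retry(service, "req-1", "hello"))  # stored hello (replay)
print(len(service.results))                          # 1
```

Because the key identifies the request rather than the attempt, the caller can retry after a refusal or a timeout without risking a duplicate, and can find out afterwards whether the request worked.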

So when I rant about/against queues, it's because queues will often be (but not always) applied in ways that totally mess up end-to-end principles for no good reason. It's because of bad system engineering, where people are trying to make an 18-wheeler go through a straw and wondering why the hell things go bad. In the end the queue just makes things worse. And when it goes bad, it goes really bad, because everyone tried to close their eyes shut and ignore the fact they built a dam to solve flooding problems upstream of the dam.

And then of course, there's the use case where you use the queue as a messaging mechanism between front-end threads/processes (think PHP, Ruby, CGI apps in general, and so on) because your language doesn't support inter-process communications. It's marginally better than using a MySQL table (which I've seen done a few times and even took part in), but infinitely worse than picking a tool that supports the messaging mechanisms you need to implement your solution right.

Hacker Times
