Epoll vs. io_uring in Linux

> But my students weren’t as happy as I was - they wanted to build something genuinely useful, and they were really disappointed that our “product” had strong architectural limits and couldn’t outperform titans like nginx and haproxy.

I took a (very brief) look at the github repo [1], it doesn't look like you're doing anything with cpu pinning.

You can probably eke (thanks) out a bit more performance if you cpu pin your threads and cpu pin your listen sockets (sockopt SO_INCOMING_CPU).
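A rough sketch of what that pinning could look like for one worker. The helper names `pin_current_thread` and `make_worker_listener`, the backlog of 128, and the one-listener-per-worker layout with SO_REUSEPORT are my assumptions, not something taken from the TinyGate repo. Per socket(7), SO_INCOMING_CPU is settable since Linux 4.4; how strongly the kernel honours it for listen sockets depends on the kernel version, so treat it as a hint.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Pin the calling thread to a single CPU (pid 0 means "this thread").
 * Returns 0 on success, -1 on failure with errno set. */
static int pin_current_thread(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Hypothetical helper: one listen socket per worker, all sharing the same
 * port via SO_REUSEPORT, each hinting the CPU it wants packets from.
 * Returns the listening fd, or -1 on failure. */
static int make_worker_listener(int cpu, uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int one = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0) {
        close(fd);
        return -1;
    }
#ifdef SO_INCOMING_CPU
    /* Best effort: older kernels reject this, and that is not fatal. */
    (void)setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, sizeof(cpu));
#endif

    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(port),
        .sin_addr.s_addr = htonl(INADDR_ANY),
    };
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 128) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Each worker thread would call `pin_current_thread(n)` first and then `make_worker_listener(n, port)`, so the thread and its accept queue prefer the same CPU.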

If you also cpu align your outgoing sockets, you should get a significant boost, but afaik, there's no great api for that. Linux does have an api for compatible NICs (traffic steering/flow steering) which can work, but if you know what hash your NIC uses (it's probably toeplitz) and you manage source port selection to your backend, you can pick ports that will hash properly.
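To make the port-picking idea concrete, here is a minimal sketch under loud assumptions: the NIC hashes the IPv4/TCP 4-tuple with Toeplitz, `example_rss_key` is only a placeholder (the real key and indirection table come from something like `ethtool -x <dev>`), and the hash maps to a queue by a plain modulo, which real indirection tables only approximate. `rss_hash_tcp4` and `pick_source_port` are names I made up.

```c
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: placeholder key for illustration; your NIC's key will differ. */
static const uint8_t example_rss_key[40] = {
    0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2, 0x41, 0x67,
    0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0, 0xd0, 0xca, 0x2b, 0xcb,
    0xae, 0x7b, 0x30, 0xb4, 0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30,
    0xf2, 0x0c, 0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
};

/* Toeplitz hash: for every set input bit, XOR in the 32-bit window of the
 * key that starts at that bit position. */
static uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
                              const uint8_t *data, size_t len)
{
    uint32_t hash = 0;
    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                      ((uint32_t)key[2] << 8) | (uint32_t)key[3];
    size_t next_bit = 32; /* index of the next key bit to shift in */

    for (size_t i = 0; i < len; i++) {
        for (int b = 7; b >= 0; b--) {
            if (data[i] & (1u << b))
                hash ^= window;
            uint32_t in = 0;
            if (next_bit < key_len * 8)
                in = (uint32_t)(key[next_bit / 8] >> (7 - next_bit % 8)) & 1u;
            window = (window << 1) | in;
            next_bit++;
        }
    }
    return hash;
}

/* Hash a received packet's 4-tuple (values in host order, serialized
 * big-endian as source address, destination address, source port,
 * destination port). */
static uint32_t rss_hash_tcp4(const uint8_t *key, size_t key_len,
                              uint32_t src_ip, uint32_t dst_ip,
                              uint16_t src_port, uint16_t dst_port)
{
    uint8_t in[12] = {
        (uint8_t)(src_ip >> 24), (uint8_t)(src_ip >> 16),
        (uint8_t)(src_ip >> 8),  (uint8_t)src_ip,
        (uint8_t)(dst_ip >> 24), (uint8_t)(dst_ip >> 16),
        (uint8_t)(dst_ip >> 8),  (uint8_t)dst_ip,
        (uint8_t)(src_port >> 8), (uint8_t)src_port,
        (uint8_t)(dst_port >> 8), (uint8_t)dst_port,
    };
    return toeplitz_hash(key, key_len, in, sizeof(in));
}

/* Find a local port so that the backend's replies (backend -> proxy) hash
 * to `want_queue`. ASSUMPTION: queue = hash % n_queues. Returns 0 if no
 * port in the 32768..60999 range matches. */
static uint16_t pick_source_port(const uint8_t *key, size_t key_len,
                                 uint32_t backend_ip, uint16_t backend_port,
                                 uint32_t local_ip, unsigned n_queues,
                                 unsigned want_queue)
{
    for (unsigned p = 32768; p <= 60999; p++) {
        uint32_t h = rss_hash_tcp4(key, key_len, backend_ip, local_ip,
                                   backend_port, (uint16_t)p);
        if (h % n_queues == want_queue)
            return (uint16_t)p;
    }
    return 0;
}
```

You would then bind() the outgoing socket to the returned port before connect(), and of course skip ports that are already in use.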

The goal is for your proxy to be able to handle packets without any cross cpu communication.

[1] https://github.com/sibexico/TinyGate

If you write one with DPDK, it'll be infinitely more complex but you'll have the opportunity to blow away nginx in performance.

If you make one run on an FPGA, it'll be even more complex.

The lesson is that cutting through abstraction like a hot knife through butter is a necessary mindset for performance but also makes things more difficult. Sockets and thread-per-connection were good approaches when networks were very slow relative to CPUs, and they're still often the simplest approach today.

Such a great article!

This sent me through a rabbit hole of uring, kernel development and C. I've been a rust and c++ dev for quite a few years now, but there's such a simplicity and even artistic feel to small(ish) C programs.

I've not yet tested the shared buffers for my io_uring-based web server, but that's because instead of reading from a file and writing, I send directly from an mmapped region.

But really, I want to sendfile with io_uring, but that's not supported yet.

My writeup, with extra buzzwords like Rust and kTLS: https://blog.habets.se/2025/04/io-uring-ktls-and-rust-for-ze...

It was on HN too: https://news.ycombinator.com/item?id=44980865

Take a look at https://github.com/concurrencykit/ck and https://github.com/microsoft/mimalloc; they would fit well in a zero-copy, memory-aligned reverse proxy. Also, if you want to add DDoS protection and more advanced L4 stuff, check out https://docs.ebpf.io/ebpf-library/libxdp/libxdp/

Boost asio if you love C++ and asynchronous networking.

Yes, io_uring is significantly faster than epoll (I think I saw something like 20% more req/s with io_uring). The catch is that it's opt-in at the kernel level and disabled just about everywhere for security reasons. I think it has direct memory sharing between the kernel and userland, which is kind of yikes. There have been multiple exploits that hit io_uring in recent times. That's why even projects that chase the highest performance possible (like Go) don't bake io_uring in as a sane default. If you want to take the risk, you can always enable it yourself for your favourite language. It is faster, but the cost is possible exploits.

The year is 2050; there are 20 different ways to poll a socket on Linux.

The author takes a very benchmark-focused view of this topic, which only tells part of the story, particularly for complex systems. Note that a number of very similar interfaces existed on other platforms, like Windows, long before io_uring, but that doesn't make Linux's I/O system worse or slower than those platforms. A fast server is likely to be fast with either a multiplexing or an async API if implemented correctly, in almost all cases.

Basically, v0 and v1 of the repo are completely different implementations, each written almost from scratch. I'm now working on the third implementation, which I believe will be the last one. :) Completely different architectural choices were made.

I would be interested to see benchmarks for that patch

FYI you can use sendfile ish with uring, since splice(2) is implemented. Not as user friendly as sendfile, but should work fairly similarly.

Yeah, the plan was to apply optimizations at the other levels first, and then move on to allocators. I'm studying allocators right now with my students; the previous post on the blog was about a custom allocator in Zig.

I switched out asio's epoll backend for its io_uring in a database server and CPU utilization shot up. Probably depends on usage and the specifics of how it's integrated into the event code.

Boost is so inconvenient, they're huge dynamic libraries that are a pain to build and use. Even when I was already using CMake, getting Boost installed in a way where it could be discovered was super annoying. (I was on Mac, though)

The main reason why it gets disabled is fixed now, the latest RC got cBPF support and as such you can restrict what OPs can be run now instead of just fully disabling it.

It depends quite a bit: I've had times when my POSIX emulation of io_uring (with poll, not epoll) was faster than io_uring. For large zero-copy buffers, however, io_uring is king. io_uring is also useful even for non-asynchronous I/O, since it can run a chain of operations from a single submission (mkdir + open, for example).

For something like networking, if you are maximizing packets per second, you'll hit kernel limits[1] very quickly and instead have to start leveraging features like GSO/GRO or completely bypass the network stack.

1: https://github.com/axboe/liburing/discussions/1346

RHEL 9 and 10 now fully support io_uring by default. It is very recent, but this covers a lot of corporate Linux installs. Gemini 'said' Ubuntu and SuSE support it as well, but did not provide any links to prove it.

https://access.redhat.com/solutions/4723221

Go should reconsider support. They should have a 'go' at it.

For a project like Go, wouldn't it be an option to do one-time iouring feature detection in the runtime startup? Exploits are an issue for the entire OS, not the program choosing to use iouring, yeah?

Any kind of poll mode networking:

With RDMA, DPDK, and io_uring, it’s really kind of up to the user to do the memory isolation.

In io_uring’s case, though, you can’t do much because the rings are in the kernel.

I’m hopeful, though, that with LLMs things will get better.

But it’s just a hard problem to solve. It's very difficult to do in the kernel itself, and folks don’t really even understand tuning for it.

Yes, even for io_uring: single-shot first, and then multishot to go even faster.

I'm now a Windows developer, mostly working with Linux and FreeBSD. Thanks for the pointer, I'll look at how it works on Windows systems.

There is no benchmark in the post. There is analysis, discussion and code examples for epoll and io_uring usage.

If it's still running on more than a single core, and your students want it to go faster, aligning the work to cpus will almost certainly be useful.

I saw you mentioned Windows development elsewhere. You might be interested to know that Microsoft pioneered Receive Side Scaling and Send Side Scaling. If you try your proxy out on Windows, be sure to hook into those systems there.

The less work your proxy does, the more important avoiding cross core communication is.

On the Linux side sendfile is implemented via splice. So it is a more generic API that covers the sendfile case.
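For anyone who hasn't used it, this is roughly what the "sendfile via splice" shape looks like with plain splice(2): file to pipe, then pipe to the destination. It's a sketch with a made-up helper name (`splice_file_to_fd`) and blocking fds assumed. With io_uring the same two steps would be queued as splice operations (liburing exposes `io_uring_prep_splice`) instead of being called directly.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

/* Move up to `len` bytes from `in_fd` (a regular file, read from its
 * current offset) to `out_fd` (e.g. a socket) through an intermediate pipe,
 * without copying through a userspace buffer.
 * Returns the number of bytes moved, or -1 on error. */
static ssize_t splice_file_to_fd(int in_fd, int out_fd, size_t len)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;

    size_t sent = 0;
    int failed = 0;
    while (sent < len && !failed) {
        /* Step 1: file -> pipe. Moves at most one pipe's worth per call. */
        ssize_t n = splice(in_fd, NULL, p[1], NULL, len - sent, SPLICE_F_MOVE);
        if (n == 0)
            break;                      /* end of file */
        if (n < 0) {
            failed = 1;
            break;
        }
        /* Step 2: pipe -> destination. Drain everything we just queued. */
        while (n > 0) {
            ssize_t m = splice(p[0], NULL, out_fd, NULL, (size_t)n,
                               SPLICE_F_MOVE);
            if (m <= 0) {
                failed = 1;
                break;
            }
            n -= m;
            sent += (size_t)m;
        }
    }
    close(p[0]);
    close(p[1]);
    return failed ? -1 : (ssize_t)sent;
}
```

In a real server you would keep the pipe around per connection instead of creating one per call, and handle EAGAIN on non-blocking sockets.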

No async I/O framework exists that uses everything io_uring can do; they are all built around the poll model. As such, io_uring will always do worse behind those poll-like abstractions.

The two things that make io_uring fast are chaining of operations and zero-syscall mode. The former would require rewriting every async I/O framework and library to make use of it, and then every user-facing app would need to be rewritten too, since all you’d get now are completions of operations instead of waiting until you can run an operation.

That’s paradoxically what you can expect on a busy server - your CPU can spend time doing work that would have been previously IO wait time. Of course, it could be a bug in the implementation where you’re spinning doing no work erroneously, but depends on the details.

In addition to the other discussion: it's important to measure outcomes and not just look at the CPU meter.

At the same load, how did latency look for A vs. B?

What were throughput and latency at maximum load like for A vs. B? For whichever one had the smaller max throughput, what did latency look like for the other option?

For bonus points while testing: is there another observable metric to indicate available capacity, if CPU % free is less useful?

Asio also comes in standalone form and both versions are header-only. Not necessarily directly related to your comment but adding it on, anyway.

You can statically link boost

Also it’s nice for things like SPI which have no user space non-blocking API.

Well the reason it's disabled now is the recent history of pretty bad vulnerabilities. It probably needs to go a while without new vulnerabilities before it makes sense to enable by default. It's pretty complex completely unsafe C code, after all.

Pin threads to cores, and make sure threads on different cores aren’t writing to the same 64- or 128-byte block. Look up “false sharing”.
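A small sketch of the usual fix, assuming a 64-byte cache line (use 128 on hardware where adjacent lines are prefetched in pairs). The struct and function names are made up for illustration: each worker gets its own counter padded out to a full line, so two cores never write to the same block.

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* ASSUMPTION: typical x86-64 line size */
#define MAX_WORKERS 64  /* hypothetical upper bound on worker threads */

/* alignas on the member raises the struct's alignment to a full cache
 * line, which also pads sizeof(struct worker_stats) to CACHE_LINE. */
struct worker_stats {
    alignas(CACHE_LINE) atomic_uint_fast64_t requests;
};

/* One slot per worker; neighbouring slots live on different cache lines. */
static struct worker_stats stats[MAX_WORKERS];

/* Called only by worker `id`, so the line stays in that core's cache. */
static inline void count_request(unsigned id)
{
    atomic_fetch_add_explicit(&stats[id].requests, 1, memory_order_relaxed);
}

/* Called occasionally by a reporting thread; reads every slot. */
static uint64_t total_requests(unsigned n_workers)
{
    uint64_t sum = 0;
    for (unsigned i = 0; i < n_workers; i++)
        sum += (uint64_t)atomic_load_explicit(&stats[i].requests,
                                              memory_order_relaxed);
    return sum;
}
```

Without the alignment, eight of these counters would share one line, and every increment on one core would invalidate that line on the other seven.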

Yeah, the explanation that I've usually heard for this sort of thing is that it's intended to get back CPU time that's lost when too many system threads are blocking to keep something on every core even during I/O (or pay for it in terms of the context switching overhead if you compensate for this with an extremely large number of system threads). The theory is that you'll avoid idle CPU compared to the common "one thread per core" way of doing things due to some of them being idle during I/O, at the cost of using some extra CPU to handle more things in user space. Obviously how much this helps can vary between use cases, but the measure of how much it's helping (or if it's maybe not helping at all!) is throughput, not CPU utilization.

This makes no sense. Epoll is already non-blocking, you never waste time waiting for I/O as long as there is work to do. Io_uring only boosts CPU efficiency (batching of syscalls, for example), it does not reduce blocking.

It's still seccomp'd off in most environments because io-uring is still a seccomp bypass that doesn't play well with kernel security systems (audit subsystem), even if it weren't also like the #1 or #2 exploit vector for privesc.

The ring buffers are in shared memory not kernel private. The ring buffers (submission and completion) are shared between kernel space and user space. User publishes requests via submission queue entries (updates tail of buffer while kernel reads head of the buffer), kernel shifts the submission queue buffer on its end and returns a completion queue event by publishing to completion buffer. User pulls from this buffer (specifically the head, kernel updates tail of buffer) in user space.
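The head/tail protocol described here can be modelled in plain userspace C. This is a toy single-producer/single-consumer ring, not io_uring's actual layout: the names and the 8-entry size are made up, and in the real thing the producer and consumer sit on opposite sides of the kernel boundary (the application produces on the submission ring, the kernel produces on the completion ring).

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u /* power of two, so index = counter & (size - 1) */

struct ring {
    _Atomic uint32_t head;          /* advanced only by the consumer */
    _Atomic uint32_t tail;          /* advanced only by the producer */
    uint64_t slots[RING_ENTRIES];
};

/* Producer side: write the slot first, then publish it by storing the new
 * tail with release ordering. Returns false if the ring is full. */
static bool ring_push(struct ring *r, uint64_t value)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_ENTRIES)
        return false;
    r->slots[tail & (RING_ENTRIES - 1)] = value;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: read the tail with acquire ordering so the slot contents
 * are visible, copy the entry out, then free the slot by advancing head.
 * Returns false if the ring is empty. */
static bool ring_pop(struct ring *r, uint64_t *out)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->slots[head & (RING_ENTRIES - 1)];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

Because each index has exactly one writer, no locks are needed; that is what lets the two sides exchange entries through shared memory without a syscall per entry.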

That is an imprecise explanation being conveyed to you - thread per core is still valuable in an io_uring world. The reason is how computers are built: it’s inherent in which kinds of operations are cheap and what happens otherwise, so the I/O model doesn’t matter.

Specifically, code that isn’t thread-per-core has the following issues:

* you have to use atomics/locks to synchronize data access. This involves expensive HW operations to implement the semantics of what an atomic operation is

* you have to deal with lock contention and cache contention

* when an OS migrates the core that is executing your code, you’ve suddenly got cold caches all over the place (icache, dcache, and TLB).

There’s also a bunch of related things that pop up - even if you do thread per core, the processor interrupts for events probably land on a different CPU resulting in extra overhead within the OS to deliver the event to you.

Io_uring doesn’t “handle more things in user space”. It specifically avoids a bunch of overheads; you’re context switching less (other cores can execute the OS code to process your request) and you can pipeline I/O (you can tell the OS “do IO A, then B, then C and tell me when that’s all done”) and you get fewer memory copies (the kernel reads into your buffer directly without needing to create another copy although this is more nuanced).

Anyway, the better mental model is specifically io_uring is more efficient and thus CPUs spend less time standing around waiting for things to happen at the hardware level (context switching, waiting for locks, etc). If the CPUs weren’t actually spending much time waiting, then you don’t get much benefit. This is the same phenomenon as Jevons paradox in economics; IO gets cheaper so you can do more of it within a given time unit and thus your CPUs end up more often having real work to do.

That’s solved as of last week, you can use cBPF now to disable functionality.

To clarify, I'm not talking specifically about io_uring but (multithreaded) async concurrency models in general. The explanation needs to be imprecise because not all of them work the same way, so it's impossible to say anything correct about all of them without intentionally leaving out some specifics.

That’s solved as of last week, you can use cBPF now to disable functionality.

How solved? AFAIK it's not meaningfully shipped but happy to hear otherwise.

First, I want to tell you how exactly I got to this point and why I started researching different options for handling asynchronous I/O on Linux… Last year, my students and I built a reverse proxy server called TinyGate. It was super simple, worker-based, and it basically worked well. Of course, I didn’t expect it to be very fast, but it was an educational project, and since we’d made a real, kind of production-ready tool, I was really proud of it. But my students weren’t as happy as I was - they wanted to build something genuinely useful, and they were really disappointed that our “product” had strong architectural limits and couldn’t outperform titans like nginx and haproxy. So they literally forced me to research together how those tools work under the hood and how to handle asynchronous I/O to cut down on the heavy overhead… Long story short, we made a second version of TinyGate, based on epoll. It still lost to nginx/haproxy in benchmarks, but it had a dramatic performance boost compared to the first version. But epoll isn’t perfect either (as I’ll explain below), and we eventually switched to io_uring, which led to a full rewrite of our project from scratch, again… So it’s a really interesting topic, and today I’ll share an overview of the two queueing systems Linux gives you for asynchronous I/O.

epoll heritage

When I first started developing for Linux, epoll was a new feature, and for scalable event handling it basically had no alternatives. Everyone used it to manage asynchronous execution - there was no other real choice. The problem is, epoll relies heavily on syscalls: it tells you when I/O is possible, but you still have to call read()/write() yourself afterward - that’s roughly two syscalls per I/O event (epoll_wait plus the read or write), on top of the one-time epoll_ctl registration. Each of these syscalls crosses the boundary between user and kernel mode, which adds up to HUGE overhead once you’re handling a lot of connections. But we have a solution! About 17 years after epoll landed in the Linux kernel (2002), io_uring appeared (2019)! Instead of telling you when I/O is possible, it tells you when I/O is done - no readiness loop, and far fewer associated syscalls.

The kernel consumes submissions from memory shared between your app and the kernel, and posts completions back into that same shared memory - both live in ring buffers, hence the name. The catch: by default you still have to call io_uring_enter() to tell the kernel “go check the submission queue” - but one call can submit a whole batch of operations and reap a whole batch of completions, instead of one syscall pair per operation like with epoll + read. If you want close to zero syscalls during steady state, there’s IORING_SETUP_SQPOLL, which spins up a dedicated kernel thread that polls the submission queue for you - at the cost of that thread burning CPU (more on this below).

A little comparison

Basic architecture: as I said before, epoll notifies you when I/O is possible, io_uring notifies you when I/O is done. Where epoll makes every I/O operation cross the kernel boundary, io_uring lets you pay a small “setup fee” once (creating the ring) plus a per-batch fee (the io_uring_enter() call) instead of a fee per operation. So instead of a syscall pair per I/O, you get a syscall per batch of I/Os - or, with SQPOLL, close to none at all. As you can see, with a ton of I/O happening, this saves a lot of syscalls.

On relatively new systems where io_uring is supported (kernel v5.1+, released in 2019), there’s often not much reason to reach for epoll. The shift from a readiness model to a completion model is a huge architectural change - it moves a big part of the work out of your application and into the kernel.

Let’s code!

Of course, I won’t leave you without some code showing how both systems work. We’ll use C. (The io_uring example uses liburing, the userspace helper library - install it via liburing-dev/liburing-devel, or drop down to the raw io_uring_setup/io_uring_enter syscalls if you want zero dependencies.)

epoll

Let’s make a simple example of how epoll works. We’ll create the instance, register a file descriptor (stdin, in our case), and process the incoming event.
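Here is a minimal sketch of those steps, not necessarily the exact listing from the TinyGate work. The helper name `epoll_read_once` is made up for illustration, and it takes the descriptor as a parameter so the same code works for stdin, a pipe, or a socket.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <unistd.h>

/* Wait until `fd` is readable, then read from it once.
 * Returns the number of bytes read (0 on EOF) or -1 on error.
 * Note: epoll_ctl() fails with EPERM for regular files, so this works for
 * terminals, pipes, and sockets, but not for stdin redirected from a file. */
static ssize_t epoll_read_once(int fd, void *buf, size_t len)
{
    int ep = epoll_create1(0);                  /* create the epoll instance */
    if (ep < 0)
        return -1;

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) < 0) {    /* register the fd */
        close(ep);
        return -1;
    }

    struct epoll_event ready;
    int n;
    do {
        n = epoll_wait(ep, &ready, 1, -1);      /* block until fd is ready */
    } while (n < 0 && errno == EINTR);

    ssize_t got = -1;
    if (n == 1 && (ready.events & (EPOLLIN | EPOLLHUP)))
        got = read(ready.data.fd, buf, len);    /* now do the actual I/O */

    close(ep);
    return got;
}
```

To try it against stdin, call `epoll_read_once(STDIN_FILENO, buf, sizeof buf)` from a `main` and type a line. A real server would create the instance once and call epoll_wait in a loop with a larger events array.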

As you can see, once the instance exists, this example uses three syscalls: epoll_ctl (a one-time registration), then epoll_wait and read for the event - so two syscalls per actual I/O event, like I mentioned above. The code itself is pretty easy to follow.

io_uring

Now let’s do the same thing with io_uring instead of epoll.

What can we see here?

Similar instance creation step.
No epoll_ctl registration step needed.
No readiness check needed before submission.
No separate read() call at completion.

Yeah, io_uring takes way fewer resources for this - though, as noted above, there’s still one io_uring_enter() call hiding inside io_uring_submit() and io_uring_wait_cqe() unless you’re running with SQPOLL.

When you test these examples, keep in mind that for the sake of simplicity, some important parts are missing. For example, it will block forever if stdin never produces any data, and the io_uring example skips checking for a NULL sqe (which io_uring_get_sqe() can return if the submission queue is full).

Something additional about io_uring

Zero-copy. For real zero-copy I/O, register your buffers ahead of time with io_uring_register_buffers() - this avoids the kernel re-mapping memory on every single operation. For network sends specifically, look at IORING_OP_SEND_ZC (kernel 6.0+ needed), which skips copying the buffer into the kernel entirely.
SQPOLL uses CPU. Even when your queue is empty, IORING_SETUP_SQPOLL keeps a kernel thread spinning and polling, which burns CPU. There’s an idle timeout (sq_thread_idle) after which it backs off to sleeping, but it’s not free.
Asynchronous error handling. Errors come back (and must be handled) asynchronously, as part of the cqe’s res field - not as a direct return value like a normal synchronous syscall.
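To illustrate that last point: a completion's res field carries either a non-negative result (a byte count, a new fd, or 0) or a negated errno value, and the errno variable itself is not set. A small sketch of how a completion handler might sort results; the enum and function names are hypothetical, and which errors count as retryable is an assumption that depends on the operation.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical classification of a completion's res value. */
enum cqe_outcome { CQE_OK, CQE_RETRY, CQE_FAILED };

/* res >= 0 is success; res < 0 is -errno from the failed operation. */
static enum cqe_outcome classify_cqe_res(int res)
{
    if (res >= 0)
        return CQE_OK;
    /* ASSUMPTION: treat these as transient and resubmit the request. */
    if (res == -EAGAIN || res == -EINTR)
        return CQE_RETRY;
    return CQE_FAILED;
}

/* Human-readable text for logging; note the sign flip before strerror(). */
static const char *cqe_res_message(int res)
{
    return res < 0 ? strerror(-res) : "ok";
}
```

In a completion loop you would call `classify_cqe_res(cqe->res)` for every cqe and use the request's user data to find which operation it belongs to.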

Summary

io_uring is the new standard for async I/O in the modern Linux world, and honestly, I don’t see much reason to still reach for epoll on a system that has it. For a from-scratch project on a modern Linux server, like our TinyGate rewrite, io_uring is absolutely the way to go. I’m a die-hard supporter of dropping support for old systems as soon as it’s reasonable - if you’re still running a kernel released more than 7 years ago, in my opinion, that’s not a great idea…
