If you make one run on an FPGA, it'll be even more complex.
The lesson is that cutting through abstraction like a hot knife through butter is a necessary mindset for performance but also makes things more difficult. Sockets and thread-per-connection were good approaches when networks were very slow relative to CPUs, and they're still often the simplest approach today.
This sent me down a rabbit hole of io_uring, kernel development and C. I've been a Rust and C++ dev for quite a few years now, but there's such a simplicity and even artistic feel to small(ish) C programs.
I took a (very brief) look at the GitHub repo [1], and it doesn't look like you're doing anything with CPU pinning.
You can probably eke (thanks) out a bit more performance if you cpu pin your threads and cpu pin your listen sockets (sockopt SO_INCOMING_CPU).
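A rough sketch of what that could look like on Linux, assuming one worker thread per CPU and one SO_REUSEPORT listener per worker. The helper names `pin_current_thread` and `listen_on_cpu` are made up for illustration and are not from the repo; how strongly the kernel honours SO_INCOMING_CPU when picking a listener depends on the kernel version.

```c
#define _GNU_SOURCE
#include <netinet/in.h>
#include <sched.h>
#include <sys/socket.h>
#include <unistd.h>

/* Pin the calling thread to a single CPU. Returns 0 on success, -1 on error. */
static int pin_current_thread(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means "the calling thread" */
    return sched_setaffinity(0, sizeof(set), &set);
}

/*
 * Hypothetical helper: create one listener per worker. All listeners share
 * `port` via SO_REUSEPORT, and each one is tagged with its worker's CPU via
 * SO_INCOMING_CPU. Returns the listening fd, or -1 on error.
 */
static int listen_on_cpu(unsigned short port, int cpu)
{
    int one = 1;
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(port),
        .sin_addr.s_addr = htonl(INADDR_ANY),
    };
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0 ||
        setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, sizeof(cpu)) < 0 ||
        bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, SOMAXCONN) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Each worker would call `pin_current_thread(cpu)` first and then `listen_on_cpu(port, cpu)` with the same CPU number, so the thread and its listener agree on where they live.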
If you also CPU-align your outgoing sockets, you should get a significant boost, but afaik, there's no great API for that. Linux does have an API for compatible NICs (traffic steering/flow steering) which can work, but if you know what hash your NIC uses (it's probably Toeplitz) and you manage source port selection to your backend, you can pick ports that will hash properly.
The goal is for your proxy to be able to handle packets without any cross cpu communication.
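To make the port-selection idea concrete, here is a sketch of a software Toeplitz hash plus a port search. Several things here are assumptions: the key is the sample key from Microsoft's RSS documentation and is only a placeholder (your NIC's real key is usually visible with `ethtool -x`), the queue is taken as `hash % n_queues` although real NICs go through an indirection table, and `rss_hash_tcp4` and `pick_local_port` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder key (sample key from Microsoft's RSS docs). Use your NIC's key. */
static const uint8_t RSS_KEY[40] = {
    0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2, 0x41, 0x67,
    0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0, 0xd0, 0xca, 0x2b, 0xcb,
    0xae, 0x7b, 0x30, 0xb4, 0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30,
    0xf2, 0x0c, 0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
};

/* Toeplitz hash: for every set input bit, XOR in the 32-bit key window
 * that starts at that bit position. Requires key_len >= 4. */
static uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
                              const uint8_t *data, size_t len)
{
    uint32_t hash = 0;
    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                      ((uint32_t)key[2] << 8) | key[3];

    for (size_t i = 0; i < len; i++) {
        for (int bit = 7; bit >= 0; bit--) {
            if (data[i] & (1u << bit))
                hash ^= window;
            /* Slide the window one bit left, pulling in the next key bit. */
            uint32_t next = (i + 4 < key_len) ? (key[i + 4] >> bit) & 1u : 0;
            window = (window << 1) | next;
        }
    }
    return hash;
}

/* Hash a TCP/IPv4 4-tuple as seen on the *incoming* packet (host byte order). */
static uint32_t rss_hash_tcp4(uint32_t src_ip, uint32_t dst_ip,
                              uint16_t src_port, uint16_t dst_port)
{
    uint8_t in[12] = {
        (uint8_t)(src_ip >> 24), (uint8_t)(src_ip >> 16),
        (uint8_t)(src_ip >> 8),  (uint8_t)src_ip,
        (uint8_t)(dst_ip >> 24), (uint8_t)(dst_ip >> 16),
        (uint8_t)(dst_ip >> 8),  (uint8_t)dst_ip,
        (uint8_t)(src_port >> 8), (uint8_t)src_port,
        (uint8_t)(dst_port >> 8), (uint8_t)dst_port,
    };
    return toeplitz_hash(RSS_KEY, sizeof(RSS_KEY), in, sizeof(in));
}

/* Find a local port so that replies from the backend land on wanted_queue.
 * Replies have src = backend and dst = us. Returns 0 if no port matches. */
static uint16_t pick_local_port(uint32_t backend_ip, uint16_t backend_port,
                                uint32_t local_ip, unsigned wanted_queue,
                                unsigned n_queues)
{
    for (uint32_t port = 32768; port <= 60999; port++) {
        uint32_t h = rss_hash_tcp4(backend_ip, local_ip, backend_port,
                                   (uint16_t)port);
        if (h % n_queues == wanted_queue)
            return (uint16_t)port;
    }
    return 0;
}
```

The proxy would then bind() the outgoing socket to the returned port before calling connect(). Which CPU a given queue maps to is a separate matter of IRQ affinity, so this only helps once queues and worker CPUs are lined up.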
But really, I want to sendfile with io_uring, but that's not supported yet.
My writeup, with extra buzzwords like Rust and kTLS: https://blog.habets.se/2025/04/io-uring-ktls-and-rust-for-ze...
It was on HN too: https://news.ycombinator.com/item?id=44980865
For something like networking, if you are maximizing packets per second, you'll hit kernel limits[1] very quickly and instead have to start leveraging features like GSO/GRO or completely bypass the network stack.
The two things that make io_uring fast are chaining of operations and zero-syscall mode. To use the former, every async I/O framework or library would need to be rewritten, and then all user-facing apps would need to be rewritten too, since all you'd get now are completions of operations instead of waiting until you can run an operation.
At the same load, how did latency look for A vs B?
What was throughput and latency at maximum load like for A vs B? For whichever one had the smaller max throughput, what did latency look like for the other option at that load?
For bonus points while testing: is there another observable metric to indicate available capacity, if CPU % free is less useful?
I saw you mentioned Windows development elsewhere. You might be interested to know that Microsoft pioneered Receive Side Scaling and Send Side Scaling. If you try your proxy out on Windows, be sure to hook into those systems there.
The less work your proxy does, the more important avoiding cross core communication is.
https://access.redhat.com/solutions/4723221
Go should reconsider support. They should have a 'go' at it.
With RDMA, DPDK, and io_uring, it’s really kind of up to the user to do the memory isolation.
In io_uring’s case, though, you can’t do much because the rings are in the kernel.
I’m hopeful, though, that with LLMs things will get better.
But it’s just a hard problem to solve. It’s very difficult to do in the kernel itself, and folks don’t really even understand tuning for it.
Specifically, code that isn’t thread-per-core has the following issues:
* you have to use atomics/locks to synchronize data access. This involves expensive HW operations to implement the semantics of what an atomic operation is.
* you have to deal with lock contention and cache contention.
* when the OS migrates your thread to a different core, you’ve suddenly got cold caches all over the place (icache, dcache, and TLB).
There’s also a bunch of related things that pop up - even if you do thread per core, the processor interrupts for events probably land on a different CPU resulting in extra overhead within the OS to deliver the event to you.
io_uring doesn’t “handle more things in user space”. It specifically avoids a bunch of overheads: you’re context switching less (other cores can execute the OS code to process your request), you can pipeline I/O (you can tell the OS “do IO A, then B, then C and tell me when that’s all done”), and you get fewer memory copies (the kernel reads into your buffer directly without needing to create another copy, although this is more nuanced).
Anyway, the better mental model is that io_uring is more efficient, and thus CPUs spend less time standing around waiting for things to happen at the hardware level (context switching, waiting for locks, etc). If the CPUs weren’t actually spending much time waiting, then you don’t get much benefit. This is the same phenomenon as the Jevons paradox in economics: IO gets cheaper, so you can do more of it within a given time unit, and thus your CPUs more often end up having real work to do.
First, I want to tell you how exactly I got to this point and why I started researching different options for handling asynchronous I/O on Linux… Last year, my students and I built a reverse proxy server called TinyGate. It was super simple, worker-based, and it basically worked well. Of course, I didn’t expect it to be very fast, but it was an educational project, and since we’d made a real, kind of production-ready tool, I was really proud of it. But my students weren’t as happy as I was - they wanted to build something genuinely useful, and they were really disappointed that our “product” had strong architectural limits and couldn’t outperform titans like nginx and haproxy. So they literally forced me to research together how those tools work under the hood and how to handle asynchronous I/O to cut down on the heavy overhead… Long story short, we made a second version of TinyGate, based on epoll. It still lost to nginx/haproxy in benchmarks, but it had a dramatic performance boost compared to the first version. But epoll isn’t perfect either (as I’ll explain below), and we eventually switched to io_uring, which led to a full rewrite of our project from scratch, again… So it’s a really interesting topic, and today I’ll share an overview of the two queueing systems Linux gives you for asynchronous I/O.
When I first started developing for Linux, epoll was a new feature, and basically it had no alternatives. Everyone used it to manage asynchronous execution - there was no other choice. The problem is, epoll relies heavily on syscalls: it tells you when I/O is possible, but you still have to call read()/write() yourself afterward - that’s two syscalls per I/O event (epoll_wait plus the read or write), on top of the one-time epoll_ctl registration. Each of these syscalls causes a context switch between user and kernel mode, which creates HUGE overhead once you’re handling a lot of connections. But we have a solution! About 17 years after epoll landed in the Linux kernel (2002), io_uring appeared (2019)! Instead of telling you when I/O is possible, it tells you when I/O is done - no polling loop, and far fewer associated syscalls.
The kernel consumes submissions from memory shared between your app and the kernel, and posts completions back into that same shared memory - both live in ring buffers, hence the name. The catch: by default you still have to call io_uring_enter() to tell the kernel “go check the submission queue” - but one call can submit a whole batch of operations and reap a whole batch of completions, instead of one syscall pair per operation like with epoll + read. If you want close to zero syscalls during steady state, there’s IORING_SETUP_SQPOLL, which spins up a dedicated kernel thread that polls the submission queue for you - at the cost of that thread burning CPU (more on this below).
Basic architecture: as I said before, epoll notifies you when I/O is possible, io_uring notifies you when I/O is done. Where epoll makes every I/O operation cross the kernel boundary, io_uring lets you pay a small “setup fee” once (creating the ring) plus a per-batch fee (the io_uring_enter() call) instead of a fee per operation. So instead of a syscall pair per I/O, you get a syscall per batch of I/Os - or, with SQPOLL, close to none at all. As you can see, with a ton of I/O happening, this saves a lot of syscalls.
On relatively new systems where io_uring is supported (kernel v5.1+, released in 2019), there’s often not much reason to reach for epoll. The shift from a readiness model to a completion model is a huge architectural change - it moves a big part of the work out of your application and into the kernel.
Of course, I won’t leave you without some code showing how both systems work. We’ll use C. (The io_uring example uses liburing, the userspace helper library - install it via liburing-dev/liburing-devel, or drop down to the raw io_uring_setup/io_uring_enter syscalls if you want zero dependencies.)
Let’s make a simple example of how epoll works. We’ll create the instance, register a file descriptor (stdin, in our case), and process the incoming event.
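A minimal sketch of that flow, wrapped in a function so it can be pointed at any pollable descriptor. `epoll_read_once` is a name made up for this sketch, and error handling is kept to the bare minimum.

```c
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Wait until fd becomes readable, then read from it once.
 * Returns the number of bytes read, or -1 on error. */
static ssize_t epoll_read_once(int fd, char *buf, size_t len)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    struct epoll_event ready;
    ssize_t got = -1;
    int n;

    /* Create the epoll instance. */
    int epfd = epoll_create1(0);
    if (epfd < 0) {
        perror("epoll_create1");
        return -1;
    }

    /* One-time registration. Note: this fails with EPERM for regular
     * files, so stdin must be a terminal, pipe or socket. */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        perror("epoll_ctl");
        close(epfd);
        return -1;
    }

    /* Syscall 1 per event: block until the kernel says "readable". */
    n = epoll_wait(epfd, &ready, 1, -1);
    if (n < 0)
        perror("epoll_wait");
    else if (n == 1 && (ready.events & (EPOLLIN | EPOLLHUP)))
        /* Syscall 2 per event: do the actual I/O ourselves. */
        got = read(ready.data.fd, buf, len);

    close(epfd);
    return got;
}
```

To match the text, call it from main() as `epoll_read_once(STDIN_FILENO, buf, sizeof(buf))` and print what comes back.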
As you can see, not counting the setup call that creates the epoll instance, this example uses three syscalls in total: epoll_ctl (a one-time registration), then epoll_wait and read for the event - so two syscalls per actual I/O event, like I mentioned above. The code itself is pretty easy to follow.
Now let’s do the same thing with io_uring instead of epoll.
What can we see here?
Yeah, io_uring needs fewer syscalls per operation for this - though, as noted above, it’s not zero: io_uring_submit() still calls io_uring_enter() unless you’re running with SQPOLL, and io_uring_wait_cqe() calls it too whenever it has to block waiting for a completion.
When you test these examples, keep in mind that for the sake of simplicity, some important parts are missing. For example, it will block forever if stdin never produces any data, and the io_uring example skips checking for a NULL sqe (which io_uring_get_sqe() can return if the submission queue is full).
* Pre-register your buffers with io_uring_register_buffers() - this avoids the kernel re-mapping memory on every single operation. For network sends specifically, look at IORING_OP_SEND_ZC (kernel 6.0+ needed), which skips copying the buffer into the kernel entirely.
* IORING_SETUP_SQPOLL keeps a kernel thread spinning and polling, which burns CPU. There’s an idle timeout (sq_thread_idle) after which it backs off to sleeping, but it’s not free.
* Results and errors come back in the cqe’s res field (a negative errno value on failure) - not as a direct return value like a normal synchronous syscall.

io_uring is the new standard for async I/O in the modern Linux world, and honestly, I don’t see much reason to still reach for epoll on a system that has it. For a from-scratch project on a modern Linux server, like our TinyGate rewrite, io_uring is absolutely the way to go. I’m a die-hard supporter of dropping support for old systems as soon as it’s reasonable - if you’re still running a kernel released more than 7 years ago, in my opinion, that’s not a great idea…