Async Rust never left the MVP state

> Futures aren't (trivially) inlined

In my programming language I wrote custom pass for inlining async function calls within other async functions. It generally works and allows to remove some boilerplate, but it increases result binary size a lot.

Technically Rust can do the same.

Great article! Love these types of deep dives into optimizations. Hope the project goal works out!

I've felt before that compilers often don't put much effort into optimizing the "trivial" cases.

Overly dramatic title for the content, though. I would have clicked "Async Rust Optimizations the Compiler Still Misses" too you know

Agree with the other commenters that the title is a bit too dramatic. The content was well written and got the point across.

I still don’t have enough experience to have a strong opinion on Rust async, but some things did standout.

On the good side, it’s nice being able to have explicit runtimes. Instead of polluting the whole project to be async, you can do the opposite. Be sync first and use the runtime on IO “edges”. This was a great fit to a project that I’m working on and it seems like a pretty similar strategy to what zig is doing with IO code. This largely solved the function colloring problem in this particular case. Strict separation of IO and CPU bound code was a requirement regardless of the async stuff, so using the explicit IO runtime was natural.

On the bad side, it seems crazy to me how much the whole ecosystem depends on tokio. It’s almost like Java’s GC was optional, but in practice everyone just used the same third party GC runtime and pulling any library forced you to just use that runtime. This sort of central dependency is simply not healthy.

Async seems like an underbaked idea across the board. Regular code was already async. When you need to wait for an async operation, the thread sleeps until ready and the kernel abstracts it away. But We didn’t like structuring code into logical threads, so we added callback systems for events. Then realized callbacks are very hard to reason about and that sequential control is better.

So threads was the right programming model.

Now language runtimes prefer “green threads” for portability and performance but most languages don’t provide that properly. Instead we have awkward coloring of async/non-async and all these problems around scheduling, priority, and no-preemption. It’s a worse scheduling and process model than 1970.

I like this article already because it took me to the goals of Rust for 2026. We use the language in our team, but we haven't needed to go very deep to do the stuff we need. Yet, I really enjoy witnessing the development of a language from ground up with so much community feedback.

I somehow miss noticing that in C++ and I have no idea how it is working in other domains.

My only gripe is that a lot of it is feeling a bit kick-starter-y, with each of the goals needing specific funding. Is that the best model we've found so far?

The duplicate-state collapse (hoisting the match out of the await branches like in his process_command example) is the single easiest pattern anyone can apply to existing async code today. No compiler work needed, just a refactor.

I recently started working with Rust async. The main issue I am currently facing is code duplication: I have to duplicate every function that I want to support both asynchronous and blocking APIs. This could be great to have a `maybe-async`. I took a look at the available crates to work around this (maybe-async, bisync), but they all have issues or hard limitations.

This is the type of ugly but necessary discussions that have been happening in c++ for a while.

I never really liked the viral nature of async in rust when it was introduced.

I wish rust the best of luck and with more people like this rust could have a brighter future.

Response to title: so you’re saying it’s viable

Does this kind of thing make noticeable difference when applied to more complicated async functions?

Examples in the blog seem too simple make any conclusions

This has been on my mind lately too with the talk of the new CPUs. Zen 7 sounds like it'll be a beast & coding against 1 out of dozens of cores would be a pity

Async Rust on small embedded chips like ESP32 feels revolutionary. This project looks promising.

There are much more problems, like async drop.

I like it more how Zig is approaching async with the new IO. It avoids function coloring.

Love Rust. They simply missed the mark with async. Swing and a miss.

The risk they took was very calculated. Unfortunately they’re bad at math and chose the wrong trade-offs.

Ah well. Shit happens.

It's so funny that people will do anything to hate on Rust, including nitpicking a few bytes of overhead for a future while they reach for an entire thread or runtime to handle async in their favourite language.

> Futures aren't (trivially) inlined

Technically Rust can do the same.

Response to title: so you’re saying it’s viable

Agree with the other commenters that the title is a bit too dramatic. The content was well written and got the point across.

I still don’t have enough experience to have a strong opinion on Rust async, but some things did standout.

What's the alternative? I'm happy to use tokio, but i'm happy other folks can enjoy other executors (smol, async-std, glommio, etc). I think the situation is OK because tokio is well-maintained, even though it's not part of the standard library, and i'm afraid making it part of the standard library would make it harder to use other executors, and harder to port the standard library to other platforms.

But maybe my fears are unfounded.

As you mentioned Java, it’s interesting to notice that it has had similar problems throughout its history: logging (now it’s settled on slf4j but you still find libraries using something else), commons (first Apache Commons, now Guava), JSON (it has settled on Jackson but things like Gson and Simple-json are not uncommon to see), nullability annotations ( first with unofficial distributions of JSR-305 which never became official, then checker framework , and lately with everything migrating to JSpecify). All this basic stuff needs to be provided by the language to avoid this fragmentation and quasi de facto libraries from appearing.

Great article! Love these types of deep dives into optimizations. Hope the project goal works out!

I've felt before that compilers often don't put much effort into optimizing the "trivial" cases.

Overly dramatic title for the content, though. I would have clicked "Async Rust Optimizations the Compiler Still Misses" too you know

So threads was the right programming model.

There are much more problems, like async drop.

This has been on my mind lately too with the talk of the new CPUs. Zen 7 sounds like it'll be a beast & coding against 1 out of dozens of cores would be a pity

This is the type of ugly but necessary discussions that have been happening in c++ for a while.

I never really liked the viral nature of async in rust when it was introduced.

I wish rust the best of luck and with more people like this rust could have a brighter future.

Async Rust on small embedded chips like ESP32 feels revolutionary. This project looks promising.

I like it more how Zig is approaching async with the new IO. It avoids function coloring.

There is work happening on keyword generics[0], which would let a function be generic over keywords like `async` and `const`.

For now the best option to write code that wants to live in both worlds is sans-io. Thomas Eizinger at Fireguard has written a good article about this[1] pattern. Not only does it nicely solve the sync/async issue, but it also makes testing easier and opens the door to techniques like DST[2]

I have my own writing on the topic[3], which highlights that the problem is wider than just async vs sync due to different executors.

0: https://github.com/rust-lang/effects-initiative

1: https://www.firezone.dev/blog/sans-io

2: https://notes.eatonphil.com/2024-08-20-deterministic-simulat...

3: https://hugotunius.se/2024/03/08/on-async-rust.html

It'll depend immensely on what you're actually doing, but if it's simple enough you may be able to make a macro that subs out the types & awaits

So on the title, I picked this because it's simply the truth. Since async landed in 2019 or so, not much has changed.

Yes, we can have async in traits and closures now. But those are updates to the typesystem, not to the async machinery itself. Wakers are a little bit easier to work with, but that's an update to std/core.

As I understand it, the people who landed async Rust were quite burnt out and got less active and no one has picked up the torch. (Though there's 1 PR open from some google folk that will optimize how captured variables are laid out in memory, which is really nice to have) Since I and the people I work with are heavy async users, I think it's maybe up to me to do it or at least start it. Free as in puppy I guess.

So yeah, the title is a little baitey, but I do stand behind it.

Agree on title. Too dramatic.

The author seems to be obsessing about the overhead for trivial functions. He's bothered by overhead for states for "panicked" and "returned". That's not a big problem. Most useful async blocks are big enough that the overhead for the error cases disappears.

He may have a point about lack of inlining. But what tends to limit capacity for large numbers of activities is the state space required per activity.

> Regular code was already async. When you need to wait for an async operation, the thread sleeps until ready and the kernel abstracts it away

Not really. I’ve observed async code often is written in such a way that it doesn’t maximize how much concurrency can be expressed (eg instead of writing “here’s N I/O operations to do them all concurrently” it’s “for operation X, await process(x)”). However, in a threaded world this concurrency problem gets worse because you have no way to optimize towards such concurrency - threads are inherently and inescapably too heavy weight to express concurrency in an efficient way.

This is is not a new lesson - work stealing executors have long been known to offer significantly lower latency with more consistent P99 than traditional threads. This has been known since forever - in the early 00s this is why Apple developed GCD. Threads simply don’t provide any richer information it needs in the scheduler to the kernel about the workload and kernel threads are an insanely heavy mechanism for achieving fine grained concurrency and even worse when this concurrency is I/O or a mixed workload instead of pure compute that’s embarrassingly easily to parallelize.

Do all programs need this level of performance? No, probably not. But it is significantly more trivial to achieve a higher performance bar and in practice achieve a latency and throughput level that traditional approaches can’t match with the same level of effort.

You can tell async is directionally kind of correct in that io_uring is the kernel’s approach to high performance I/O and it looks nothing like traditional threading and syscalls and completion looks a lot closer to async concurrency (although granted exploiting it fully is much harder in an async world because async/await is an insufficient number of colors to express how async tasks interrelate)

> the thread sleeps until ready and the kernel abstracts it away.

Sure, but once you involve the kernel and OS scheduler things get 3 to 4 orders of magnitude slower than what they should be.

The last time I was working on our coroutine/scheduling code creating and joining a thread that exited instantly was ~200us, and creating one of our green threads, scheduling it and waiting for it was ~400ns.

You don't need to wait 10 years for someone else to design yet another absurdly complex async framework, you can roll your own green threads/stackful coroutines in any systems language with 20 lines of ASM.

I think that callbacks are actually easier to reason about:

When it comes time to test your concurrent processing, to ensure you handle race conditions properly, that is much easier with callbacks because you can control their scheduling. Since each callback represents a discrete unit, you see which events can be reordered. This enables you to more easily consider all the different orderings.

Instead with threads it is easy to just ignore the orderings and not think about this complexity happening in a different thread and when it can influence the current thread. It isn't simpler, it is simplistic. Moreover, you cannot really change the scheduling and test the concurrent scenarios without introducing artificial barriers to stall the threads or stubbing the I/O so you can pass in a mock that you will then instrument with a callback to control the ordering...

The problem with callbacks is that the call stack when captured isn't the logical callstack unless you are in one of the few libraries/runtimes that put in the work to make the call stacks make sense. Otherwise you need good error definitions.

You can of course mix the paradigms and have the worst of both worlds.

What is kernel in this context?

The problem comes from trying to sit on both chairs: we want async but want to be able to opt out. This is what causes most of the ugliness, including function colouring. Just look at golang, where everything is async with no way to change it, it's great. It's, probably, not well-suited for things like microcontrollers, where every byte matters, but if you can afford the overhead, it's so much better than Rust async. Before async Rust was an interesting and reasonable language, now it's just a hot mess that makes your eyes bleed for no reason.

Threads are neither better or worse than async+callbacks. They are different. There are problems which map nicely to threads and there are problems which are much nicer to express with async.

> So threads was the right programming model.

For problems that aren't overly concerned with performance/memory, yes. You should probably reach for threads as a default, unless you know a priori that your problem is not in this common bucket.

Unfortunately there is quite a lot of bookkeeping overhead in the kernel for threads, and context switches are fairly expensive, so in a number of high performance scenarios we may not be able to afford kernel threading

As I understand, "green threads" are also expensive, for example you either need to allocate a large stack for each "thread", or hook stack allocation to grow the stack dynamically (like Go does), and if you grow the stack, you might have to move it and cannot have pointers to stack objects.

Proper modern languages offer both, you can keep your threads and reach out to async only when it makes sense to do.

Now the languages that don't offer choice is another matter.

I think you are correct, in so far that often N:M threading is overkill for the problem at hand. However, some IO bound problems truly do require it. I haven't kept up with the details, but AFAIK the fallout from Spectre and Meltdown also means context switches are more expensive than they were historically, which is another downside with regular threads.

I also want to address something that I've seen in several sub-threads here: Rust's specific async implementation. The key limitation, compared to the likes of Go and JS, is that Rust attempts to implement async as a zero-cost abstraction, which is a much harder problem than what Go and JS does. Saying some variant of "Rust should just do the same thing as Go", is missing the point.

Does this kind of thing make noticeable difference when applied to more complicated async functions?

Examples in the blog seem too simple make any conclusions

Hi, author here. I mention in the blog that I've tried to quickly hack two of the simplest optimizations in the compiler and it resulted in 2%-5% binary size savings in real embedded (async) codebases. And a quick and probably deeply flawed synthetic benchmark on the desktop showed a 3% perf increase.

So yes, it does really matter. Keep in mind that optimizations stack. We're preventing LLVM from doing it's thing. So if we make the futures themselves smaller, LLVM will be able to optimize more. So small changes really compound.

I somehow miss noticing that in C++ and I have no idea how it is working in other domains.

My only gripe is that a lot of it is feeling a bit kick-starter-y, with each of the goals needing specific funding. Is that the best model we've found so far?

> I somehow miss noticing that in C++ and I have no idea how it is working in other domains.

There seems to be some consensus even within the C++ ISO committee that the evolution process of that language is somewhat broken, mostly due to its size and the way it is organized.

> My only gripe is that a lot of it is feeling a bit kick-starter-y, with each of the goals needing specific funding. Is that the best model we've found so far?

Sadly, this seems to be the way things go once a technology catches on, commercially. Can't blame large donors for sponsoring only the parts they are interested in. Fortunately, considerable funding of TweedeGolf comes from (Dutch) government, I think.

In open source I guess there's two types of work: 1. features 2. maintenance

You can 'sell' new features. They cost money to create, but they solve real problems. Those problems also cost money and if that's more than the cost of creating the feature, companies are willing to put in money (generally).

Maintenance is harder. But there are now some maintainer funds! Like the one from RustNL: https://rustnl.org/maintainers/ These are broader ongoing work and backed by many orgs chipping in a little bit.

Idk if it's the best model, but at least it seems to kinda work

Love Rust. They simply missed the mark with async. Swing and a miss.

The risk they took was very calculated. Unfortunately they’re bad at math and chose the wrong trade-offs.

Ah well. Shit happens.

I think Rust has a pretty solid async implementation, compared to other systems languages. I struggle to point out another systems language with a working and actually used async implementation.

> Unfortunately they’re bad at math and chose the wrong trade-offs

They chose the exact same tradeoffs as C++'s async/await (and the same overall model as Python/NodeJS), so I'm not sure what that says about programming as a whole.

I know the people and the company behind this article. They do anything but "hate on Rust".

You could've deduced that from the fact that someone who puts this amount of energy in a detailed article about intricacies of an area of "foo", quite certainly does not "hate on foo".

It's more that I and people I know love Rust, and enjoy it, and want it to be better. I want it to be relentlessly optimized.

You realize this article talks about Rust on embedded hardware specifically, where you don’t have threads or big runtimes? There is no hate going on here either, just attempts to make things better. Might I suggest you click through to the homepage and I think you’ll figure out the rest.

That's a bit rich given the abuse that Rust evangelists dish out to every other language in the world.

Nobody seriously tries to run Golang or Java on an MCU. But they do run Rust code.

I _love_ Rust and use it whenever I can. I still find the comments in here to be quite appropriate. Async Rust leaves me with a (subjective!) feeling that something isn't quite right. Not that I know how it _should_ be, but that feeling is very different from the non-async parts of the language that almost always leaves me with a warm fuzzy feeling of joy.

I don't know enough about the domain to be objectively helpful, so it's all wishy-washy feelings on my part. I keep reaching for orchestrating things with threads in Rust where most people would probably reach for async these days. The only language where I've felt fine embracing the blessed async system is Haskell and its green threads (which I understand come with their own host of problems).

But maybe my fears are unfounded.

As of now I don’t think there’s an alternative. I’m not a Rust expert but the core issue to me is that “async” goes beyond just having a Futures scheduler. Async stuff usually needs network, disk, os interaction, future utilities(spawn) and these are all things the runtime (tokio) provides. It’s pretty hard to be compatible with each other unless the language itself provides those.

> I somehow miss noticing that in C++ and I have no idea how it is working in other domains.

There seems to be some consensus even within the C++ ISO committee that the evolution process of that language is somewhat broken, mostly due to its size and the way it is organized.

> My only gripe is that a lot of it is feeling a bit kick-starter-y, with each of the goals needing specific funding. Is that the best model we've found so far?

In open source I guess there's two types of work: 1. features 2. maintenance

Idk if it's the best model, but at least it seems to kinda work

I think Rust has a pretty solid async implementation, compared to other systems languages. I struggle to point out another systems language with a working and actually used async implementation.

It'll depend immensely on what you're actually doing, but if it's simple enough you may be able to make a macro that subs out the types & awaits

What is kernel in this context?

Saw that but couldn't find what code it gives that improvement on. Is it some embedded application written with Embassy?

> Unfortunately they’re bad at math and chose the wrong trade-offs

They chose the exact same tradeoffs as C++'s async/await (and the same overall model as Python/NodeJS), so I'm not sure what that says about programming as a whole.

Async in Rust and C++ is nothing like it is in Python or NodeJS. Choose your own runtime is a very different model than having a default one.

Not to mention Tokio (most popular runtime for Rust) is multi-threaded by default. So you have to deal with multithreading bugs as well as normal async ones. That is not the case with most async languages. For example both Python and NodeJS use a single thread to execute async code.

There is work happening on keyword generics[0], which would let a function be generic over keywords like `async` and `const`.

I have my own writing on the topic[3], which highlights that the problem is wider than just async vs sync due to different executors.

0: https://github.com/rust-lang/effects-initiative

1: https://www.firezone.dev/blog/sans-io

2: https://notes.eatonphil.com/2024-08-20-deterministic-simulat...

3: https://hugotunius.se/2024/03/08/on-async-rust.html

> Regular code was already async. When you need to wait for an async operation, the thread sleeps until ready and the kernel abstracts it away

So on the title, I picked this because it's simply the truth. Since async landed in 2019 or so, not much has changed.

So yeah, the title is a little baitey, but I do stand behind it.

Agree on title. Too dramatic.

He may have a point about lack of inlining. But what tends to limit capacity for large numbers of activities is the state space required per activity.

> the thread sleeps until ready and the kernel abstracts it away.

Sure, but once you involve the kernel and OS scheduler things get 3 to 4 orders of magnitude slower than what they should be.

I think that callbacks are actually easier to reason about:

You can of course mix the paradigms and have the worst of both worlds.

Threads are neither better or worse than async+callbacks. They are different. There are problems which map nicely to threads and there are problems which are much nicer to express with async.

Considering the latest commits and issues in effects-initiative are about 2 years old, the keyword generics initiative seems effectively dead.

I may have missed something, but how does “sans-io” deal with CPU heavy code? For example, if there’s some heavy decoding/encoding required on the data? Does the event loop only drive the network side and the heavy part is done after the loop is finished?

I am not saying threads are the model for all programming problems. For example a dependency graph like an excel spreadsheet can be analyzed and parallelized.

But as you observed, async/await fails to express concurrency any better. It’s also a thread, it’s just a worse implementation.

> threads are inherently and inescapably too heavy weight to express concurrency in an efficient way

Your oremise is wrong. There are many counterexamples to this.

Some of the burnout no doubt being due to the catastrophizing of every decision by the community and the extreme rhetoric used across the board.

Great to see people wanting to get involved with the project, though. That’s the beauty of open source: if it aggravates you, you can fix it.

I think it's partially accurate, and partially a consequence of how async fractures the design space, so it will always feel like a somewhat separate thing, or at least until we figure out how to make APIs agnostic to async-ness.

> Most useful async blocks are big enough that the overhead for the error cases disappears.

Is it really though?

In my experience many Rust applications/libraries can be quite heavy on the indirection. One of the points from the article is that contrary to sync Rust, in async Rust each indirection has a runtime cost. Example from the article:

    async fn bar(blah: SomeType) -> OtherType {
       foo(blah).await
    }

I would naively expect the above to be a 'free' indirection, paying only a compile-time cost for the compiler to inline the code. But after reading the article I understand this is not true, and it has a runtime cost as well.

In my experience, it's not uncommon to have an async trait method for which many implementations are actually synchronous. For example, different tables in your DB need to perform some calculations, but only some tables reference other tables. In that case, the method needs to be async and take a handle to the DB as parameter, but many table entries can perform the calculation on their own without using the handle (or any async operation).

This may look like a case of over-optimization, but given how many times i've seen this pattern, i assume it builds up to a lot of unnecessary fluff in huge codebases. To be clear, in that case, the concern is not really about runtime speed (which is super fast), but rather about code bloat for compilation time and binary size.

> Most useful async blocks are big enough that the overhead for the error cases disappears.

Most useful async blocks are deeply nested, so the overhead compounds rapidly. Check the size of futures in a decently large Tokio codebase sometime

He's optimizing for embedded no-std situation. These things do matter in constrained environments.

> [...] That's not a big problem [...]

Depends somewhat on your expectations, I suppose. Compared to Python, Java, sure, but Rust off course strives to offer "zero-cost" high level concepts.

I think the critique is in the same realm of C++'s std::function. Convenience, sure, but far from zero-cost.

> Agree on title. Too dramatic.

not just too dramatic

given that all the things they list are

non essential optimizations,

and some fall under "micro optimizations I wouldn't be sure rust even wants",

and given how far the current async is away from it's old MVP state,

it's more like outright dishonest then overly dramatic

like the kind of click bait which is saying the author does cares neither about respecting the reader nor cares about honest communication, which for someone wanting to do open source contributions is kinda ... not so clever

through in general I agree rust should have more HIR/MIR optimizations, at least in release mode. E.g. its very common that a async function is not pub and in all places directly awaited (or other wise can be proven to only be called once), in that case neither `Returned` nor `Panicked` is needed, as it can't be called again after either. Similar `Unresumed` is not needed either as you can directly call the code up to the first await (and with such a transform their points about "inlining" and "asyncfns without await still having a state machine" would also "just go away"TM, at least in some places.). Similar the whole `.map_or(a,b)` family of functions is IMHO a anti-pattern, introducing more function with unclear operator ordering and removal of the signaling `unwrap_` and no benefits outside of minimal shortening a `.map(b).unwrap_or(a)` and some micro opt. is ... not productive on a already complicated language. Instead guaranteed optimizations for the kind of patterns a `.map(b).unwrap_or(a)` inline to would be much better.

1. Why can’t we have better green threads implementations with better scheduling models?

2. Unchecked array operations are a lot faster. Manual memory management is a lot faster. Shared memory is a lot faster.

Usually when you see someone reach for sharp and less expressive tools it’s justified by a hot code path. But here we jump immediately to the perf hack?

3. How many simultaneous async operations does your program have?

I agree. I don’t think callbacks are an underbaked language feature.

> It's, probably, not well-suited for things like microcontrollers, where every byte matters, but if you can afford the overhead, it's so much better than Rust async.

There is one hill I'll die on, as far as programming languages go, which is that more people should study Céu's structured synchronous concurrency model. It specifically was designed to run on microcontrollers: it compiles down to a finite state machine with very little memory overhead (a few bytes per event).

It has some limitations in terms of how its "scheduler" scales when there are many trails activated by the same event, but breaking things up into multiple asynchronous modules would likely alleviate that problem.

I'm certain a language that would suppprt the "Globally Asynchronous, Locally Synchronous" (GALS) paradigm could have their cake and eat it too. Meaning something that combines support for a green threading model of choice for async events, with structured local reactivity a la Céu.

F'Santanna, the creator of Céu, actually has been chipping away at a new programming language called Atmos that does support the GALS paradigm. However, it's a research language that compiles to Lua 5.4. So it won't really compete with the low-level programming languages there.

[0] https://ceu-lang.org/

[1] https://github.com/atmos-lang/atmos

Everything is not async in Go.

If your threads are "free" you can just run 400 copies of a synchronous code and blocking in one just frees the thread to work on other. async within same goroutine is still very much opt in (you have to manually create goroutine that writes to channel that you then receive on), it just isn't needed where "spawn a thread for each connecton" costs you barely few kb per connection.

> not well-suited for things like microcontrollers, where every byte matters

except when a RAM fetch is so expensive a load is basically an async call - and it's a single machine code instruction at the same time

Such as? The entire premise of async is that callbacks were a mistake because they broke sequential reasoning and control.

Every explanation of the feature starts with managing callback hell.

Saw that but couldn't find what code it gives that improvement on. Is it some embedded application written with Embassy?

Considering the latest commits and issues in effects-initiative are about 2 years old, the keyword generics initiative seems effectively dead.

Async in Rust and C++ is nothing like it is in Python or NodeJS. Choose your own runtime is a very different model than having a default one.

> Async in Rust and C++ is nothing like it is in Python or NodeJS. Choose your own runtime is a very different model than having a default one.

Python still has pluggable eventloops - this is sort of mandatory to interact with weird things like GUI toolkits, and Python's standard event loop was standardised pretty late in the game. Early on there was even an ecosystem split between Twisted and competing event loops implementations.

> For example both Python and NodeJS use a single thread to execute async code

I'd argue this is more a historical artefact of how the languages functioned before futures were introduced, rather than an inherent limitation.

> threads are inherently and inescapably too heavy weight to express concurrency in an efficient way

Your oremise is wrong. There are many counterexamples to this.

> Most useful async blocks are big enough that the overhead for the error cases disappears.

Most useful async blocks are deeply nested, so the overhead compounds rapidly. Check the size of futures in a decently large Tokio codebase sometime

> Most useful async blocks are big enough that the overhead for the error cases disappears.

Is it really though?

    async fn bar(blah: SomeType) -> OtherType {
       foo(blah).await
    }

He's optimizing for embedded no-std situation. These things do matter in constrained environments.

> Agree on title. Too dramatic.

not just too dramatic

given that all the things they list are

non essential optimizations,

and some fall under "micro optimizations I wouldn't be sure rust even wants",

and given how far the current async is away from it's old MVP state,

it's more like outright dishonest then overly dramatic

I agree. I don’t think callbacks are an underbaked language feature.

Some of the burnout no doubt being due to the catastrophizing of every decision by the community and the extreme rhetoric used across the board.

Great to see people wanting to get involved with the project, though. That’s the beauty of open source: if it aggravates you, you can fix it.

As an example of this, i remember a huge debate at the time about `await foo()` vs `foo().await` syntax. The community was really divided on that one, and there was a lot of drama because that's the kind of design decision you can't really walk back from.

Retrospectively, i think everyone is satisfied with the adopted syntax.

I am a beginner to Rust but I've coded with gevent in Python for many years and later moved to Go. Goroutines and gevent greenlets work seamlessly with synchronous code, with no headache. I know there've been tons of blog posts and such saying they're actually far inferior and riskier but I've really never had any issues with them. I am not sure why more languages don't go with a green thread-like approach.

I am not saying threads are the model for all programming problems. For example a dependency graph like an excel spreadsheet can be analyzed and parallelized.

But as you observed, async/await fails to express concurrency any better. It’s also a thread, it’s just a worse implementation.

That’s incorrect. Even when expressed suboptimally, it still tends to result in overall higher throughput and consistently lower latency (work stealing executors specifically). And when you’re in this world, you can always do an optimization pass to better express the concurrency. If you’ve not written it async to start with, then you’re boned and have no easy escape hatch to optimize with.

> [...] That's not a big problem [...]

Depends somewhat on your expectations, I suppose. Compared to Python, Java, sure, but Rust off course strives to offer "zero-cost" high level concepts.

I think the critique is in the same realm of C++'s std::function. Convenience, sure, but far from zero-cost.

1. Why can’t we have better green threads implementations with better scheduling models?

2. Unchecked array operations are a lot faster. Manual memory management is a lot faster. Shared memory is a lot faster.

Usually when you see someone reach for sharp and less expressive tools it’s justified by a hot code path. But here we jump immediately to the perf hack?

3. How many simultaneous async operations does your program have?

Everything is not async in Go.

Proper modern languages offer both, you can keep your threads and reach out to async only when it makes sense to do.

Now the languages that don't offer choice is another matter.

Nobody seriously tries to run Golang or Java on an MCU. But they do run Rust code.

That's a bit rich given the abuse that Rust evangelists dish out to every other language in the world.

It's more that I and people I know love Rust, and enjoy it, and want it to be better. I want it to be relentlessly optimized.

To the point it got replaced by std::function_ref() in C++26.

Well, if you offload heavy compute into an async task, then usually it depends strictly on how many concurrent inputs you are given. But even something as “simple” as a performance editor benefits from this if done well - that’s why JS text editors have reasonably acceptable performance whereas Java IDEs always struggled (historically anyway since even Java has adopted green threads).

What GP meant - what everyone means when they say this - is that goroutines are always M:N threading and so there is no such thing as function coloring. In Rust to get M:N threading you have to use async and in practice every library you use has to use async. Hence function coloring, and two separate ecosystems of libraries in the same language.

>and if you grow the stack, you might have to move it

Most stacks are tiny and have bounded growth. Really large stacks usually happen with deep recursion, but it's not a very common pattern in non-functional languages (and functional languages have tail call optimization). OS threads allocate megabytes upfront to accommodate the worst case, which is not that common. And a tiny stack is very fast to copy. The larger the stack becomes, the less likely it is to grow further.

>cannot have pointers to stack objects

In Go, pointers that escape from a function force heap allocation, because it's unsafe to refer to the contents of a destroyed stack frame later on in principle. And if we only have pointers that never escape, it's relatively trivial to relocate such pointers during stack copying: just detect that a pointer is within the address range of the stack being relocated and recalculate it based on the new stack's base address.

works fine in Go.

Yes, you're not getting Rust performance (tho good part of it is their own compiler vs using all LLVM goodness) but performance is good enough and benefits for developers are great, having goroutines be so cheap means you don't even need to do anything explicitly async to get what you want

> So threads was the right programming model.

For problems that aren't overly concerned with performance/memory, yes. You should probably reach for threads as a default, unless you know a priori that your problem is not in this common bucket.

In that sentence I’m referring to the abstract idea of a thread of execution as a model of programming, not OS threads. A green thread implementation could do it too.

But what you said about kernel implementation is true. But are we really saying that the primary motivation for async/await is performance? How many programmers would give that answer? How many programs are actually hitting that bottleneck?

Doesn’t that buck the trend of every other language development in the past 20 years, emphasizing correctness and expressively over raw performance?

I know the people and the company behind this article. They do anything but "hate on Rust".

You could've deduced that from the fact that someone who puts this amount of energy in a detailed article about intricacies of an area of "foo", quite certainly does not "hate on foo".

Not the article, the comments here man.

The article is fine besides the bait title.

> Async in Rust and C++ is nothing like it is in Python or NodeJS. Choose your own runtime is a very different model than having a default one.

> For example both Python and NodeJS use a single thread to execute async code

I'd argue this is more a historical artefact of how the languages functioned before futures were introduced, rather than an inherent limitation.

> It's, probably, not well-suited for things like microcontrollers, where every byte matters, but if you can afford the overhead, it's so much better than Rust async.

[0] https://ceu-lang.org/

[1] https://github.com/atmos-lang/atmos

> not well-suited for things like microcontrollers, where every byte matters

except when a RAM fetch is so expensive a load is basically an async call - and it's a single machine code instruction at the same time

It is an inherent limitation. Multithreading is not free after all. One of the big pros of async programming is the concurrency you get within a single thread. When you make the async runtime multithreaded by default (like Tokio) you don't get this advantage anymore.

Retrospectively, i think everyone is satisfied with the adopted syntax.

> Retrospectively, i think everyone is satisfied with the adopted syntax.

Maybe it’s a case of agree and commit, since it can’t really be walked back.

Such as? The entire premise of async is that callbacks were a mistake because they broke sequential reasoning and control.

Every explanation of the feature starts with managing callback hell.

Beware, they are different concepts.

Threads offer concurrent execution, async (futures) offer concurrent waiting. Loosely speaking, threads make sense for CPU bound problems, while async makes sense for IO bound problems.

The callbacks should be just hidden from programmer, that's what async/await are for.

Why can’t you do the same optimization? Are you maxing out you OS system resources on thread overhead?

Are you sure Java's UI issues are caused by threading and not just Swing being a glitchy pile of junk?

For example, if you don't explicitly call the java.awt.Toolkit.sync() method after updating the UI state (which according to the docs "is useful for animation"), Swing will in my experience introduce seemingly random delays and UI lag because it just doesn't bother sending the UI updates to the window system.

You think IDEs are written in JS because of the performance benefits of the threading model?

I thought it was because they could copy chromium.

Maybe you remember performance of IDEs from 15 years ago because that definitely isn't my experience.

To the point it got replaced by std::function_ref() in C++26.

Exactly. And I guess that is also the gist of the article: async Rust needs additional TLC.

>and if you grow the stack, you might have to move it

>cannot have pointers to stack objects

Not the article, the comments here man.

The article is fine besides the bait title.

In that sentence I’m referring to the abstract idea of a thread of execution as a model of programming, not OS threads. A green thread implementation could do it too.

Doesn’t that buck the trend of every other language development in the past 20 years, emphasizing correctness and expressively over raw performance?

> But are we really saying that the primary motivation for async/await is performance?

The original motivation for not using OS threads was indeed performance. Async/await is mostly syntax sugar to fix some of the ergonomic problems of writing continuation-based code (Rust more or less skipped the intermediate "callback hell" with futures that Javascript/Python et al suffered through).

> But are we really saying that the primary motivation for async/await is performance?

Of course - what else would it be? The whole async trend started because moving away from each http request spawning (or being bound to) an OS thread gave quite extreme improvements in requests/second metrics, didn't it?

Importantly though, performance might be worse depending on use case and program. Specifically with scheduling in user space it can negatively impact branch prediction as your CPU is already hyper optimized for doing things differently.

It's all nuanced and what to choose requires careful evaluation.

works fine in Go.

Rust chose a different design space for their async implementation though, so what works well for Go wouldn't work well for Rust. In particular, the Rust devs wanted zero-cost FFI that external code doesn't need to know about, which precludes Go-like green threads.

Beware, they are different concepts.

Threads offer concurrent execution, async (futures) offer concurrent waiting. Loosely speaking, threads make sense for CPU bound problems, while async makes sense for IO bound problems.

The callbacks should be just hidden from programmer, that's what async/await are for.

Why can’t you do the same optimization? Are you maxing out you OS system resources on thread overhead?

You can put tokio in single-threaded mode if you prefer - it's an explicit performance tradeoff. The multithreaded work-stealing executor has higher throughput at the expense of needing more synchronisation.

Or you can schedule your thread-local tasks in a LocalSet to run them all on the owning thread, while keeping the other threads around to handle tasks that are intentionally parallel.

The general theme here is that tokio (and C++ equivalents) provide you the flexibility to do more things than the native Python/Node runtime does (and yes, the defaults take advantage of this). But the underlying intention is the same (and post-GIL we expect to see some movement in this direction on the Python front as well).

> Retrospectively, i think everyone is satisfied with the adopted syntax.

Maybe it’s a case of agree and commit, since it can’t really be walked back.

Various prominent people have said years after that .await was the correct choice after all

Are you sure Java's UI issues are caused by threading and not just Swing being a glitchy pile of junk?

Maybe you remember performance of IDEs from 15 years ago because that definitely isn't my experience.

Exactly. And I guess that is also the gist of the article: async Rust needs additional TLC.

You think IDEs are written in JS because of the performance benefits of the threading model?

I thought it was because they could copy chromium.

Why do you think they don’t struggle with input latency? Because the non blocking nature built into the browser model is so powerful and you cannot get that with threads.

> But are we really saying that the primary motivation for async/await is performance?

In some languages, yes, in others (js/python) async is just workaround about not having proper threading.

> But are we really saying that the primary motivation for async/await is performance?

It's all nuanced and what to choose requires careful evaluation.

I agree. Managing many http requests or responses was a motivating problem.

What I question is whether 1. Most programs resemble that, so that they make it an invasive feature of every general purpose language. 2. Whether programmers are making a conscious choice because they ruled out the perf overhead of the simpler model we have by default.

Or you can schedule your thread-local tasks in a LocalSet to run them all on the owning thread, while keeping the other threads around to handle tasks that are intentionally parallel.

Various prominent people have said years after that .await was the correct choice after all

Why do you think they don’t struggle with input latency? Because the non blocking nature built into the browser model is so powerful and you cannot get that with threads.

I disagree with the premise. I cannot imagine a better latency experience than blocking loop IDEs like VS6.

Which inputs are getting latency? The keyboard? The files?

> the non blocking nature

https://youtu.be/bzkRVzciAZg?si=BuBXxHTgN0OqsAhI

Are you sure that latency-sensitive parts are written in async JS instead of having a separate UI thread (pool)? I have no idea myself, but without knowing the details it's hard to argue. Note, that browsers themselves, are usually written in languages like C++ or Rust. They run JS, but aren't written in it

In some languages, yes, in others (js/python) async is just workaround about not having proper threading.

Python used multiple threads to handle I/O long before async/await was a glimmer in anyone's mind (despite the GIL). nodejs is one of the very few languages I can think of that was born single-threaded and used an asynchronous runtime from the get-go

I agree. Managing many http requests or responses was a motivating problem.

That is why we have the function colouring problem and a split ecosystem in the first place - if it were obviously better in all cases, we'd make async the default, and get rid of the split altogether (and there are languages, like Erlang, that fall on this side of the fence)

I disagree with the premise. I cannot imagine a better latency experience than blocking loop IDEs like VS6.

Which inputs are getting latency? The keyboard? The files?

> the non blocking nature

https://youtu.be/bzkRVzciAZg?si=BuBXxHTgN0OqsAhI

Yes they are, the UI layer is mostly JS, outside the rendering and layout engines.

If you implement threads and code that reacts to an input queue (e.g. PostMessage, queue_push, mq_send, ...), you've implemented (probably a bad version of) async threads. And yes, that's exactly what Windows 1.0 did and what made it great.

But God help you if you have to change the code. Async threads are a way to organize it and make it workable for humans.

Yes they are, the UI layer is mostly JS, outside the rendering and layout engines.

I've previously explained async bloat and some work-arounds for it, but would much prefer to solve the issue at the root, in the compiler. I've submitted a Project Goal, and am looking for help to fund the effort.

I love me some async Rust! It's amazing how we can write executor agnostic code that can run concurrently on huge servers and tiny microcontrollers.

But especially on those tiny microcontrollers we notice that async Rust is far from the zero cost abstractions we were promised. That's because every byte of binary size counts and async introduces a lot of bloat. This bloat exists on desktops and servers as well, but it's much less noticable when you have substantially more memory and compute available.

I've previously explained some work-arounds for this issue, but would much prefer to get to the root of the problem, and work on improving async bloat in the compiler. As such I have submitted a Project Goal.

This is part 2 of my blog series on this topic. See part 1 for the initial exploration of the topic and what you can do when writing async code to avoid some of the bloat. In this second part we'll dive into the internals and translate the methods of blog 1 into optimizations for the compiler.

What I won't be talking about is the often discussed problem of futures becoming bigger than necessary and them doing a lot of copying. People are aware of that already. In fact, there is an open PR that tackles part of it: https://github.com/rust-lang/rust/pull/135527

Anatomy of a generated future

We're going to be looking at this code:

fn foo() -> impl Future<Output = i32> {
    async { 5 }
}

fn bar() -> impl Future<Output = i32> {
    async {
        foo().await + foo().await
    }
}

godbolt

We're using the desugared syntax for futures because it's easier to see what's happening.

So what does the bar future look like?

There are two await points, so the state machine must have at least two states, right?

Well, yes. But there's more.

Luckily we can ask the compiler to dump MIR for us at various passes. An interesting pass is the coroutine_resume pass. This is the last async-specific MIR pass. Why is this important? Well, async is a language feature that still exists in MIR, but not in LLVM IR. So the transformation of async to state machine happens as a MIR pass.

The bar function generates 360 lines of MIR. Pretty crazy, right? Although this gets optimized somewhat later on, the non-async version uses only 23 lines for this.

The compiler also outputs the CoroutineLayout. It's basically an enum with these states (comments my own):

variant_fields: {
    Unresumed(0): [], // Starting state
    Returned (1): [],
    Panicked (2): [],
    Suspend0 (3): [_s1], // At await point 1, _s1 = the foo future
    Suspend1 (4): [_s0, _s2], // At await point 2, _s0 = result of _s1, s2 = the second foo future
},

So what are Returned and Panicked?

Well, Future::poll is a safe function. Calling it must not induce any UB, even when the future is done. So after Suspend1 the future returns Ready and the future is changed to the Returned state. Once polled again in that state, the poll function will panic.

The Panicked state exists so that after an async fn has panicked, but the catch-unwind mechanism was used to catch it, the future can't be polled anymore. Polling a future in the Panicked state will panic. If this mechanism wasn't there, we could poll the future again after a panic. But the future may be in an incomplete state and so that could cause UB. This mechanism is very similar to mutex poisoning.

(I'm 90% sure I'm correct about the Panicked state, but I can't really find any docs that actually describe this.)

Cool, this seems reasonable.

Why panic?

But is it reasonable? Futures in the Returned state will panic. But they don't have to. The only thing we can't do is cause UB to happen.

Panics are relatively expensive. They introduce a path with a side-effect that's not easily optimized out. What if instead, we just return Pending again? Nothing unsafe going on, so we fulfill the contract of the Future type.

I've hacked this in the compiler to try it out and saw a 2%-5% reduction in binary size for async embedded firmware.

So I propose this should be a switch, just like overflow-checks = false is for integer overflow. In debug builds it would still panic so that wrong behavior is immediately visible, but in release builds we get smaller futures.

Similarly, when panic=abort is used, we might be able to get rid of the Panicked state altogether. I want to look into the repercussions of that.

Always a state machine

We've looked at bar, but not yet at foo.

fn foo() -> impl Future<Output = i32> {
    async { 5 }
}

Let's implement it manually, to see what the optimal solution would be.

struct FooFut;
impl Future for FooFut {
    type Output = i32;
    
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Self::Output> {
        Poll::Ready(5)
    }
}

Easy right? We don't need any state. We just return the number.

Let's see what the generated MIR is for the version the compiler gives us:

// MIR for `foo::{closure#0}` 0 coroutine_resume
/* coroutine_layout = CoroutineLayout {
    field_tys: {},
    variant_fields: {
        Unresumed(0): [],
        Returned (1): [],
        Panicked (2): [],
    },
    storage_conflicts: BitMatrix(0x0) {},
} */

fn foo::{closure#0}(_1: Pin<&mut {async block@src\main.rs:5:5: 5:10}>, _2: &mut Context<'_>) -> Poll<i32> {
    debug _task_context => _2;
    let mut _0: core::task::Poll<i32>;
    let mut _3: i32;
    let mut _4: u32;
    let mut _5: &mut {async block@src\main.rs:5:5: 5:10};

    bb0: {
        _5 = copy (_1.0: &mut {async block@src\main.rs:5:5: 5:10});
        _4 = discriminant((*_5));
        switchInt(move _4) -> [0: bb1, 1: bb4, otherwise: bb5];
    }

    bb1: {
        _3 = const 5_i32;
        goto -> bb3;
    }

    bb2: {
        _0 = Poll::<i32>::Ready(move _3);
        discriminant((*_5)) = 1;
        return;
    }

    bb3: {
        goto -> bb2;
    }

    bb4: {
        assert(const false, "`async fn` resumed after completion") -> [success: bb4, unwind unreachable];
    }

    bb5: {
        unreachable;
    }
}

Yikes! That's a lot of code!

Notice at line 4 that we still have the 3 default states and at line 22 that we're still switching on it. There's a big optimization opportunity here that we're not using, i.e. to have no states and always return Poll::Ready(5) on every poll.

I've also hacked this in the compiler and it saved 0.2% of binary size. Not a lot, but it's quite a simple optimization, so likely still worthwhile.

It does change the behavior a bit, but only for executors that aren't compliant. It means that the future always returns Ready. The behavior in the compiler right now is that any subsequent polls will panic.

LLVM to the rescue?

Ok, so the MIR output isn't great. But LLVM will pick up the pieces right?

Well, sometimes, yeah. But only when the futures are simple enough and you're running opt-level=3. If the future grows too complex (which happens fast because futures nest very deeply in idiomatic async Rust code) or you're optimizing for size (which we often do with embedded or wasm), LLVM doesn't optimize this all away.

Here's an example in godbolt: https://godbolt.org/z/58ahb3nne

If you look through the generated assembly, you'll notice that it does know that foo returns 5, but that it doesn't optimize the answer of bar to be 10. The poll function of foo is also still called. This is done because of the potential panic the compiler can't fully account for. It doesn't realize foo is only called once and won't ever panic in practice.

If we comment out the panicking branch in the IR, we see it gets optimized better: https://godbolt.org/z/38KqjsY8E

LLVM is not our savior here sadly. We really do need to give it good inputs.

It does better with opt-level=3, but eventually can't keep up either when the code gets less trivial. And that's because we're relying on LLVM to spot that it should optimize out the things we're asking it to do.

Futures aren't (trivially) inlined

Inlining is great since it enables further optimization passes. Sadly, generated Rust futures are never inlined. After each future gets its implementation, then LLVM and the linker get an opportunity to do inlining. But as we've seen above, that's too late.

The prime opportunity for inlining is this:

async fn foo(blah: SomeType) -> OtherType {
    // ...
}

async fn bar(blah: SomeType) -> OtherType {
    foo(blah).await
}

This is a pattern that happens a lot when creating abstractions using traits. With the current compiler, bar gets its own state machine that calls the foo state machine, which is very wasteful. Instead,bar could also become foo by just returning the foo future.

Things get a little more difficult when we add a preamble and a postamble to the example.

async fn foo(blah: bool) -> i32 {
    // ...
}

async fn bar(input: u32) -> i32 {
    let blah = input > 10; // Preamble
    let result = foo(blah).await;
    result * 2 // Postamble
}

This pattern is common when translating an async function from one signature to another, which happens for trait impls.

Note that bar doesn't need any async state of its own here either. No data is kept over the single await point that isn't captured by foo. bar can't simply become foo, but we can mostly rely on the state of foo. The manual implementation would be something like:

enum BarFut {
    Unresumed { input: u32 },
    Inlined { foo: FooFut }
}
impl Future for BarFut {
    type Output = i32;
    
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // Ignoring pin projection here
        loop {
            match self {
                Unresumed { input } => {
                    let blah = input > 10; // Preamble
                    *self = BarFut::Inlined { foo: foo(blah) };
                },
                Inlined { foo } => {
                    break foo
                        .poll(cx)
                        .map(|result| result * 2) // Postamble
                },
            }
        }
    }
}

That's a lot better than what's currently generated. If only we were allowed to execute the code up to the first await point, then we could get rid of the Unresumed state. But "futures don't do anything unless polled" is guaranteed, so we can't change that.

There are more optimizations you could do with inlining if you were able to query properties of the futures you're polling. I don't think this is possible, at least not with the current architecture in rustc. Every async block is transformed individually and no data is kept about it afterwards.

For example, if you could query if a future always returns ready at the first poll, you wouldn't have to create a state for the await point in the future of the caller. If that were possible and you could apply these optimizations recursively, you could collapse a lot of futures into much simpler state machines.

I haven't tested out inlining yet, but I think this would significantly help binary size and performance.

Collapsing states

The state machine gets an extra state for each await point in the async block. But there's code where multiple states could be collapsed into 1.

Take this example:

pub async fn process_command() {
    match get_command() {
        CommandId::A => send_response(123).await,
        CommandId::B => send_response(456).await,
    }
}

It's very natural to write it that way. But what happens is that we're getting two identical states:

/* coroutine_layout = CoroutineLayout {
    field_tys: {
        _s0: CoroutineSavedTy { // Identical to _s1
            ty: Coroutine(
                DefId(0:11 ~ mir_test[b831]::send_response::{closure#0}),
                [
                    (),
                    std::future::ResumeTy,
                    (),
                    (),
                    (u32,),
                ],
            ),
            source_info: SourceInfo {
                span: src/main.rs:13:25: 13:49 (#14),
                scope: scope[0],
            },
            ignore_for_traits: false,
        },
        _s1: CoroutineSavedTy { // Identical to _s0
            ty: Coroutine(
                DefId(0:11 ~ mir_test[b831]::send_response::{closure#0}),
                [
                    (),
                    std::future::ResumeTy,
                    (),
                    (),
                    (u32,),
                ],
            ),
            source_info: SourceInfo {
                span: src/main.rs:14:25: 14:49 (#16),
                scope: scope[0],
            },
            ignore_for_traits: false,
        },
    },
    variant_fields: {
        Unresumed(0): [],
        Returned (1): [],
        Panicked (2): [],
        Suspend0 (3): [_s0], // 2 states
        Suspend1 (4): [_s1],
    },
    storage_conflicts: BitMatrix(2x2) {
        (_s0, _s0),
        (_s1, _s1),
    },
} */

The MIR for this function is 456 lines long and many basic blocks are essentially duplicates.

We can refactor the code manually to:

pub async fn process_command() {
    let response = match get_command() {
        CommandId::A => 123,
        CommandId::B => 456,
    };
    send_response(response).await;
}

Here we don't get the duplicate states:

/* coroutine_layout = CoroutineLayout {
    field_tys: {
        _s0: CoroutineSavedTy {
            ty: Coroutine(
                DefId(0:11 ~ mir_test[b831]::send_response::{closure#0}),
                [
                    (),
                    std::future::ResumeTy,
                    (),
                    (),
                    (u32,),
                ],
            ),
            source_info: SourceInfo {
                span: src/main.rs:16:5: 16:34 (#14),
                scope: scope[1],
            },
            ignore_for_traits: false,
        },
    },
    variant_fields: {
        Unresumed(0): [],
        Returned (1): [],
        Panicked (2): [],
        Suspend0 (3): [_s0],
    },
    storage_conflicts: BitMatrix(1x1) {
        (_s0, _s0),
    },
} */

The total MIR length is now 302 lines and nothing is duplicated.

So it seems like a good optimization pass to search for identical code paths and states and collapse them into one. This optimization probably stacks well with the inlining pass.

Some testing results

Replace Returned' panic with Poll::Pending: 2-5% binary size savings on embedded.
When no await, no statemachine: 0.2% binary size savings on embedded.
Together: ~3% perf increase on x86 in synthetic benchmark with smol executor.

Future inlining should have a greater effect again.

Ultimately it's hard to know the improvements until after it can be benched in real systems.

Summary

Hopefully this article shed some light on some of the async Rust issues!

I would love to work on these items in the compiler:

Returned state no longer panics in release mode
Async blocks without awaits should not get state machines, but just return ready every time
Future inlining for futures with a single await
Collapse identical states

Links to my hacks:

No panics in poll after ready: https://github.com/rust-lang/rust/compare/main...diondokter:rust:resume-pending
No await, no statemachine: https://github.com/rust-lang/rust/compare/main...diondokter:rust:no-statemachine-when-no-await

Supporting this Project Goal

I want to work on this in the compiler and as such have submitted it as a Project Goal: https://rust-lang.github.io/rust-project-goals/2026/async-statemachine-optimisation.html

But I need your help because I can't do much without funding.

If you're with a company or organization that would benefit from this work and would be willing to (partially) fund it, please contact me at dion@tweedegolf.com. The scope is flexible and so is the amount of funding required. However, I have estimated that €30k could get all or at least a lot of this work done.

Hacker Times

Hacker Times

Async Rust never left the MVP state

Discussion

Discussion

Anatomy of a generated future

Why panic?

Always a state machine

LLVM to the rescue?

Futures aren't (trivially) inlined

Collapsing states

Some testing results

Summary

Supporting this Project Goal