- Proposal from 2020 about async functions which are forced to run to completion (and thereby would use graceful cancellation if necessary). Quite old, but I still feel that no better idea has come up so far. https://github.com/Matthias247/rfcs/pull/1
- Proposal for unified cancellation between sync and async Rust ("A case for CancellationTokens" - https://gist.github.com/Matthias247/354941ebcc4d2270d07ff0c6...)
- Exploration of an implementation of the above: https://github.com/Matthias247/min_cancel_token
These are all common traps. Cancellation in async Rust adds a whole new dimension to state management in async Rust (futures).
While developing the mea (Make Easy Async) [1] library, I document the cancel-safety properties whenever they're non-trivial.
Additionally, I recall [2] an instance where a careless async cancellation disrupted the IO stack.
[1] https://github.com/fast/mea
[2] https://www.reddit.com/r/rust/comments/1gfi5r1/comment/luido...
There was only one threaded web server, https://lib.rs/crates/rouille . It has 1.1M lines of code (including deps). Its hello-world example reaches only 26Krps on my machine (Apple M4 Pro). It also has a bug that makes it problematic to use in production: https://github.com/tiny-http/tiny-http/issues/221 .
I wrote the threaded web server servlin: https://lib.rs/crates/servlin . It uses async internally. It has 221K lines of code. Its hello-world example reaches 102Krps on my machine.
https://lib.rs/crates/ehttpd is another one but it has no tests and it seems abandoned. It does an impressive 113Krps without async, using only 8K lines of code.
For comparison, the popular Axum async web server has 4.3M lines of code and its hello-world example reaches 190Krps on my machine.
The popular threaded Postgres client uses Tokio internally and has 1M lines of code: http://lib.rs/postgres .
Recently a threaded Postgres client was released. It has 500K lines of code: https://lib.rs/crates/postgres_sync .
There was no ergonomic way to signal cancellation to threads, so I wrote one: https://crates.io/crates/permit .
Rust's threaded libraries are starting to catch up to the async libraries!
---
I measured lines of code with `rm -rf deps.filtered && cargo vendor-filterer --platform=aarch64-apple-darwin --exclude-crate-path='*#tests' deps.filtered && tokei deps.filtered`.
I ran web servers with `cargo run --release --example hello-world` and measured throughput with `rewrk -c 1000 -d 10s -h http://127.0.0.1:3000/`.
is the title like that on purpose?
The examples presented for "cancel unsafe" futures suggest to me that the root of the problem is a misalignment between expectations and reality:
Example 1: one future cancelled on error in the other
let res = tokio::try_join!( do_stuff_async(), more_async_work(), );
Example 2: data not written out on cancellation
let buffer: &[u8] = /* ... */; writer.write_all(buffer).await?;
Both of these cases are claimed to not be cancel-safe, because the work gets interrupted and so not driven to completion. But again, what else is supposed to happen? If you want the work to finish regardless of the async context being cancelled, then don't put it in the same async context but spawn a task instead.
I feel like I must be missing something obvious that keeps me from understanding the author's issue here. I thought work getting dropped on cancellation is exactly how futures are supposed to work. What's the nuance that I'm missing?
Glad to see it converted to a blog post. Talks are great, but blogs are much easier to share and reference.
What's he trying to do? Get a clean program shutdown? That's moderately difficult in threaded programs, and async has problems, too. The use case here is unclear.
The real use cases involve when you're sending messages back and forth to a remote site, and the remote site goes away. Now you need to dispose of the state on your end.
It's really not about "cancelling async Rust" which is what I expected, even if it didn't make much sense.
I really hope we get async drop soon.
I'm sure experienced async Rust programmers always have these things in mind, but Rust is also about preventing these kinds of easy-to-miss behaviours, be it via the type system or otherwise.
I don't like the "cancel safety" term. Not only is it unrelated to Rust's concept of safety, it's also unnecessarily judgemental.
Safe/unsafe implies there's a better or worse behavior, but what is desirable for cancellation to do is highly context-dependent.
Futures awaiting spawned tasks are called "cancellation safe", because they won't stop the task when dropped. But that's not an inherently safe behavior – leaving tasks running after their spawner has been cancelled could be a bug: piling up work that won't be used, and even interfering with the rest of the program by keeping locks locked or ports used. OTOH a spawn handle that stops the task when dropped would be called "cancellation unsafe", despite being a very useful construct specifically for propagating cleanup to dependent tasks.
This abstraction has served me well and facilitates stepping through code in a debugger, though I drop out of it when I need to reason at a lower level.
Maybe it's a bit contrived, but it's also the kind of code you'd sprinkle through your system in response to "nothing seems to be happening and I don't know why".
Let's say my code looks like this
async fn a() {
    b().await
}

async fn b() {
    c().await;
    d().await
}

async fn c() {}

async fn d() {}
Where does an issue occur which causes `d` not to be called? Is it some sort of cancellation in `c`? Or some upstream action in `a`? Or am I missing context?
It also wouldn't help when you have no valid state to restore to, as in the mutex example in the post.
I am asking because I've noticed that many developers with previous experience from "task-based" languages (specifically the JS/TS world) tend to grasp the basics of Rust async quickly enough, but then run into expectation-misalignment problems similar to the examples that you used in your post. That in turn has made me want to understand whether Rust futures are themselves difficult or strange, or whether Rust futures merely appear simple and familiar while being completely different in very subtle ways. I suppose it's a combination of both.
The note on mpsc::Sender::send losing the message on drop [1] was actually added by me [2], after I wrote the Oxide RFD on cancellations [3] that this talk is a distilled form of. So even the great folks on the Tokio project hadn't documented this particular landmine.
[1] https://docs.rs/tokio/latest/tokio/sync/mpsc/struct.Sender.h...
If you want to tie multiple actions together as an atomic unit, you need the other side to have some concept of transactions, and you need to use it.
So on cancellation, the transaction times out and nothing is written. Bad but safe.
The problem is the same on other platforms. For example, what if you're on Python and writing to the DB throws an exception? Your app just dies and the transaction times out. Unfortunate but safe.
If it does not run transactionally you have a problem in any execution scenario.
`d` not being called would happen because of actions in `a`.
If `a` were rewritten as
async fn a() {
    try_join!(b(), c(), d())
}
Then if `c` ends up failing in the try_join, `b` will be halted, and thus the `d` inside `b` won't be executed.

Of course... it's obviously not as simple as "just give me a way to turn it off", but more importantly, I just don't see this concern being addressed by the Powers That Be. Am I just not looking hard enough? Did I miss the Rust blog post titled "hey - so you didn't want to use async but the libraries that you did want to use ship with async so you're up shit creek... Here's what our plan for that is"?
I'm sorry. I generally lurk because I don't consider myself up to the caliber of others on this website, but nonetheless the few posts I make do end up being about async because it does make me feel quite hopeless at times. Hopefully someone can look past my ignorance/incompetence/selfishness/immaturity and tell me it's all going to be okay.
Also, as another comment on the thread points out [1], languages where futures are active by default can have the opposite problem.
IMHO async is an anti-pattern, and probably the final straw that will prevent me from ever finishing learning Rust. Once one learns pass-by-value and copy-on-write semantics (Clojure, PHP arrays), the world starts looking like a spreadsheet instead of spaghetti code. I feel that a Rust-like language could be built with no borrow checker, simply by allocating twice the memory. Since that gets ever-less expensive, I'm just not willing to die on the hill of efficiency anymore. I predict that someday Rust will be relegated to porting scripting languages to a bare-metal runtime, but will not be recommended for new work.
That said, I think that Rust would make a great teaching tool in an academic setting, as the epitome of imperative languages. Maybe something great will come of it, like Swift from Objective-C or Kotlin from Java. And having grown up on C++, I have a soft spot in my heart for solving the hard problems in the fastest way possible. Maybe a voxel game in Rust, I dunno.
(This is related to the fact that Rust doesn't have async drop — you can't run async code on drop, other than spawning a new task to do the cleanup.)
This is prong 3 of my cancel correctness framework (that the cancellation violates a system property, in this case a cleanup property.) The solution here is to ensure the connection is in a pristine state before handing it out the next time it's used.
To this day I'm not aware of a better way to express what's become a set of increasingly complex state machines (the most recent improvement being to make the state machines responsive to user input). Nextest's runner loop is structured mostly like a GUI event loop, but without explicit state machines. It's quite nice being able to write code that's this complex in a bug-free manner.
Is this not enough? What could go wrong? If the network connection dies or the task is cancelled, I'm assuming the database server cleans up the connection state and does a rollback automatically.
And adding async Drop will probably add a whole new set of footguns.
That kind of thinking made sense in the 90s when things followed Moore’s law. But DRAM was one of the first things to fail to keep up: https://ourworldindata.org/grapher/historical-cost-of-comput... and it barely gets cheaper anymore. That’s why mobile phones still only have 16 GB of memory despite having 4 GB a decade ago.
And there’s all sorts of problems that Rust doesn’t necessarily make a great fit for. But Rust’s target marketplace is where you’d otherwise use a low level language like C or C++. If you can just heap allocate everything and aggressively create copies all over the place, then why would you ever use those languages in the first place?
And for what it’s worth Rust is finding a lot of success even replacing all the tooling in other language ecosystems like Ruby, Python, and JS precisely because the tools in those ecosystems written in the native language end up being horribly slow. And memory allocation and randomly deep copying arrays are the kinds of things that add up and make things slow (in addition to GC pauses, slow startups, interpreter costs etc).
And you can always choose not to do async in Rust although personally I’m a huge fan as it makes it really clear where you have sprinkled in I/O in places you shouldn’t have.
It analyses code. If it finds RAII/linearity/single ownership, it manages memory exactly like Rust does.
But if it does not, it falls back to reference counting.
So it does what Rust does, but automagically, without polluting the code.
So copy-on-write, pass-by-value, or doubled memory are not the only options for improving on Rust.
If that's what you're looking for, have you considered OCaml?
Because Rust is ultimately constructing a state machine which is run by the caller, the execution of that state machine can be interrupted or partially executed at any of the `await` points. Or, more accurately, the caller can simply not advance the state machine.
So, the `try_join` macro can start work on the various functions and if any of them fail, the others are ultimately cancelled. Which can happen before those functions finish fully executing.
This is particularly bad if there's a partial state change.
I'm not entirely sure what that means for memory allocation.
Oxide looks to be superb engineering up and down the whole stack, and if it drives more rust code into linux all the better.
Now that linode has been consumed by Akamai, we need an alternative.
LoL, an insane amount of things. TCP connections are an illusion of safety; for the purposes of database commits, use UDP packets as a model instead. It'll be much closer to reality.
I used to write web backends in Clojure, and justified it with the fact that the JVM has some of the best profiling tools available (I still believe this), and the JVM itself exposes lots of knobs to not only fine-tune the GC, but even choose a GC! (This cannot be overstated; garbage collectors tend to be deeply integrated into a language's runtime, and it's amazing to me that the Java platform manages to ship several garbage collectors, each of which is optimal in its own specific situations).
After rewriting an NLP-heavy web app in Rust, I saw massive performance gains over the original Clojure version, even though both aggressively copy data and the Rust version is full of atomic refcounts (atomic refcounting is not the fastest GC out there...)
The binary emitted by rustc is also much smaller. ~10 MB static binary vs. GraalVM's ~80 MB native images (and longer build times, since classpath analysis and reflection scanning require a lot of work)
What surprised me the most is how high-level Rust feels in practice. I can use pattern matching, async/await, functional programming idioms, etc., and it ends up being fast anyway. Coming from Clojure, Rust syntax trying its best to be expression-oriented is a key differentiator from other languages in its target domain (notably, C++). I sometimes miss TypeScript's anonymous enums, but Rust's type system can express a lot of runtime behavior, and it's partly why many jokingly state "if it compiles, it's likely correct". Then there are the little things, like how Rust's Futures don't immediately start in the background. In contrast, JavaScript Promises are immediately pushed to a microtask queue, so cancelling a Promise is impossible by design.
Overall, it's the little things like this -- and the toolchain (cargo, clippy, rustfmt) -- that have kept me using Rust. I can write high-level code and still compile down to a ~5 MB binary and outperform idiomatic code in other languages I'm familiar with (e.g. Clojure, Java, and TypeScript).
Right, and that is one of the absolute worst things about the Rust ecosystem. Most programs don't benefit from async, and should use plain old threads because they are much easier to work with.
1) I learned about pin in Rust to prevent values from moving in memory.
2) I learned about the html <summary> tag (the turndown arrows in your article that work with Javascript disabled) hah.
I can see how dealing with stream and resource cleanup in async code could be a chore. It sounds like you were able to do that in a fairly declarative manner, which is what I always strive for as well.
I think my hesitation with async is that I already went down that road early in my programming life with cooperative threads/multitasking on Mac OS 9 and earlier. There always seems to be yet another brittle edge case to deal with, so it can feel infuriating playing whack-a-mole until they're all nailed down.
For example, pinning memory looks a lot like locking handles in Mac OS. Handles were pointers to pointers, so it was a bare hands way to implement a memory defragmenter before runtimes were smart enough to handle it. If apps used handles, then blocks of data could be unlocked, moved somewhere else in memory, and then re-locked. Code had to do an extra hop through each handle to get to the original pointer, which was a frequent source of bugs because one async process might be working on a block, yield, and then have another async process move the handle out from under it.
The lock's state was stored in a flag in the memory manager, basically a small bit of metadata. I haven't investigated, but I suspect that Rust may be able to handle locking more efficiently, perhaps more like reference counting or the borrow checker where it can infer whether a pointer is locked without storing that flag somewhere (but I could be wrong).
Apple abandoned handles when it migrated to OS 10 and Darwin inherited protected memory and better virtual memory from FreeBSD. Although now that I write this out, I'm not sure that they solved in-process fragmentation. I think they just give apps the full 32 or 64 bit address space so that effectively there is always another region available for the next allocation, and let the virtual memory subsystem consolidate 4k memory blocks into contiguous strips internally. The dereferencing of memory step became implicit rather than explicit, as well as hidden from apps, so that whole classes of bugs became unreachable.
Anyway, that's why I prefer the runtime to handle more of this. I want strong guarantees that I can terminate a process and all locks inside it will get freed as well. I can pretty much rely on that even in hacky languages like PHP.
My frustration with all of this is that we could/should have demanded better runtimes. We could have had realtime unixes where task switching and memory allocation were effectively free. Unfortunately the powers that be (Mac OS and Windows) had runtimes that were too entrenched with too many users relying on quirks and so they dragged their feet and never did better. Languages like Rust were forced to get very clever and go to the ends of the earth to work around that. Then when companies like Google and Facebook won the internet lottery, they pulled the ladder up behind them by unilaterally handing down decrees from on high that developers should use bare hands techniques, rather than putting real resources into reforming the fundamentals so that we wouldn't have to.
What I'm trying to say is that your solution is clever and solves a common pattern in about the simplest way possible, but is not as simple as synchronous-blocking unix pipes to child processes in shell scripts. That's in no way a criticism. I have similar feelings about stuff like Docker and Kubernetes after reading about Podman. If we could magically go back and see the initial assumptions that led us down the road we're on, we might have tried different approaches. It's all of those roads not taken that haunt me, because they represent so much of my workload each day.
https://kushallabs.com/understanding-concurrency-in-go-green...
So lots of concepts are worth learning like atomicity, ACID compliance, write ahead logs (WALs), statically detecting livelocks and deadlocks (or making them unreachable), consensus algorithms like Raft and Paxos, state transfer algorithms like software transaction memory (STM), connectionless state transfer like hash trees and Merkle trees, etc.
The key insight is that manual management of tasks is, for the most part, not tenable by humans. It's better to take a step back and work at a higher level of abstraction. For example, declarative programming works in terms of goals/specifications/tests, so that the runner has more freedom to cancel and restart/retry tasks arbitrarily. That way the user can fire off a workload and wait until all of the tasks match a success criteria, and even treat that process as idempotent so it can all be run again without harm. In this way, trees of success criteria can be composed to manage a task pool.
I'd probably point to CockroachDB as one of the best task-cancellers, since it doesn't have a shutdown procedure. Its process can simply be terminated by the user with control-c, then it reconciles any outstanding transactions the next time it's booted, which just adds some latency. If an entire database can do that, then "this is the way".
I think that Rust is making an admirable attempt to attack challenges that have already been solved better in other ways. I just don't have much use for its arsenal.
For example, I wasted 2 years of my life trying to write a NAT-punching peer to peer networking framework for games around 2005, but was first exposed to synchronous blocking vs asynchronous nonblocking networking in the late 90s when I read Beej's Guide to Network Programming:
I was hopelessly trying to mimic the functionality of libraries like RakNet and Zoidcom without knowing some fundamentals that I wouldn't fully understand for years:
https://www.reddit.com/r/gamedev/comments/93kr9h/recommended...
20 years later, Rust has iroh:
https://github.com/n0-computer/iroh
I realize there is some irony in pointing to a Rust library as a final solution.
But my point is that when developers reached high levels of financial success and power, they didn't go back to address the fundamentals. NAT was always an abomination to me. And as far as I know, they kept it in IPv6. Someone like Google should have provided a way to get around it that's not as heavy as WebRTC. So many developer years of work have been wasted due to the mistakes of the status quo. So that we wander in the desert for years using lackluster paradigms because we don't know that better stuff exists.
Knowing what I know now, I would have created open source C (portable) libraries to solve NAT punching, state transfer with a software transactional memory (STM) or Raft, entity state machines (like in Unity), movement prediction/dead reckoning, etc etc etc to form the basis of a distributed computing network for virtual worlds and let the developer community solve that. Someone will do that in a year or two with AI now I assume.
Ok you kinda got me. I realize after writing this out that I wouldn't use Rust for new work, but it's not so much about the language itself as building upon proven layers to "get real work done". The lower the level of abstraction, the harder that is to do. So it's hard for me to see the problem which Rust is trying to solve.
If we imagine a function passing a block of memory to sub functions which may write bytes to it randomly, then each of those writes may allocate another block. If those allocations are similar in size to the VM block size, then each invocation can potentially double the amount of memory used.
A do-one-thing-and-do-it-well (DOTADIW?) program works in a one-shot fashion where the main process fires off child processes that return and free the memory that was passed by value. Surrounded by pipes, so that data is transmuted by each process and sent to the next one. VM usage may grow large temporarily per-process, but overall we can think of each concurrent process as roughly doubling the amount of memory.
Writing this out, I realized that the worst case might be more like every byte changing in a 4k block, so a 4096 times increase in memory. Which still might be reasonable, since we accept roughly a 200x speed decrease for scripting languages. It might be worth profiling PHP to see how much memory increases when every byte in a passed array is modified. Maybe they use a clever tree or refcount strategy to reduce the amount of storage needed when arrays are modified. Or maybe they just copy the entire array?
Another avenue of research might be determining whether a smarter runtime could work with "virtual" VMs (VVMs?) to use a really small block size, maybe 4 or 8 bytes to match the memory bus. I'd be willing to live with a 4x or 8x increase in memory to avoid borrow checkers, refcounts or garbage collection.
-
Edit: after all these years, I finally looked up how PHP handles copy-on-write, and it does copy the whole array on write unfortunately:
http://hengrui-li.blogspot.com/2011/08/php-copy-on-write-how...
If I were to write something like this today, I'd maybe use "smart" associative arrays of some kind instead of contiguous arrays, so that only the modified section would get copied. Internally that might be a B-Tree with perhaps 8 bytes per leaf to hold N primitives like 1 double, 2 floats, etc. In practice, a larger size like 16-256 bytes per leaf might improve performance at the cost of memory.
Looks like ZFS deduplication only copies the blocks within the file that changed, not the entire file. Their strategy could be used for a VM so that copy-on-write between processes only copies the 4k blocks that change. Then if it was a realtime unix, functions could be synchronous blocking processes that could be called with little or no overhead.
This is the level of work that would be required to replace Rust with simpler metaphors, and why it hasn't happened yet.
List a couple
> TCP connections are an illusion of safety
Why?
It is not as simple as synchronous pipes, but it also has far better edge case and error handling.
For example, on Unix, if you press ctrl-Z to pause execution, nextest will send SIGTSTP to test processes and also pause its internal timers (resuming them when you type in fg or bg). That kind of bookkeeping is pretty hard to do with linear code, and especially hard to coordinate across subprocesses.
State machines with message passing (as seen in GUI apps) are very helpful at handling this, but they're quite hard to write by hand.
The async keyword in Rust allows you to write state machines that look somewhat like linear code (though with the big cancellation asterisk).
Not really. The talk describes problems that can show up in any environment where you have concurrency and cancellation. To adapt some examples: a thread that consumes a message from a channel but is killed before it can process it, has still resulted in that message being lost. A synchronous task that needs to temporarily violate invariants in some data structure that can't be updated atomically, has still left that data structure in an invalid state when it gets killed part way through.
> Arguably the Go language's goroutines strike a good balance between cooperate and preemptive threads/multitasking.
Goroutines are pretty nice. It's especially nice that Go has avoided the function colouring problem. I'm not convinced that having to litter your code with selects to make your goroutines cancellable is good, though. And if you don't care about being able to cancel tasks, you can write async Rust in a way that ensures they won't be cancelled by accident fairly easily. Unless there's some better way to write cancellable goroutines that I'm not familiar with.
> The key insight is that manual management of tasks is, for the most part, not tenable by humans. It's better to take a step back and work at a higher level of abstraction.
Of course it's always important to look at systems as a whole. But to build larger systems out of smaller components you need to actually build the small components.
> I'd probably point to CockroachDB as one of the best task-cancellers, since it doesn't have a shutdown procedure. Its process can simply be terminated by the user with control-c, then it reconciles any outstanding transactions the next time it's booted, which just adds some latency. If an entire database can do that, then "this is the way".
I'm not familiar with CockroachDB specifically, but I do think a database should generally have a more involved happy-path shutdown procedure than that. In particular, I would like the database not to begin processing new transactions if it is not going to be able to finish them before it needs to shut down, even if not finishing them wouldn't violate ACID or any of my invariants.
I'm a big fan of the type system and how expressive I feel with Rust. The compiler is incredibly helpful too. rust-analyzer is a superpower. Just yesterday I embarked on a pretty big refactor and all it took was changing a couple of types—and then fixing the 500 problems vscode was pointing out.
Being able to jump in at the deep end like this in a ~90kloc codebase is only feasible (to me) because I know the tooling has my back.
It's not the perfect tool for every project. But it's a really great choice for a really large number of projects. I encourage you to try it a little more in a variety of domains to see if it clicks.
This is an edited, written version of my RustConf 2025 talk about cancellations in async Rust. Like the written version of my RustConf 2023 talk, I’ve tried to retain the feel of a talk while making it readable as a standalone blog entry. Some links:
Let’s start with a simple example – you decide to read from a channel in a loop and gather a bunch of messages:
loop {
    match rx.recv().await {
        Ok(msg) => process(msg),
        Err(_) => return,
    }
}
All good, nothing wrong with this, but you realize sometimes the channel is empty for long periods of time, so you add a timeout and print a message:
loop {
    match timeout(Duration::from_secs(5), rx.recv()).await {
        Ok(Ok(msg)) => process(msg),
        Ok(Err(_)) => return,
        Err(_) => println!("no messages for 5 seconds"),
    }
}
There’s nothing wrong with this code—it behaves as expected.
Now you realize you need to write a bunch of messages out to a channel in a loop:
loop {
    let msg = next_message();
    match tx.send(msg).await {
        Ok(_) => println!("sent successfully"),
        Err(_) => return,
    }
}
But sometimes the channel gets too full and blocks, so you add a timeout and print a message:
loop {
    let msg = next_message();
    match timeout(Duration::from_secs(5), tx.send(msg)).await {
        Ok(Ok(_)) => println!("sent successfully"),
        Ok(Err(_)) => return,
        Err(_) => println!("no space for 5 seconds"),
    }
}
It turns out that this code is often incorrect, because not all messages make their way to the channel.
Hi, I’m Rain, and this post is about cancelling async Rust. This post is split into three parts:
Before we begin, I want to lay my cards on the table – I really love async Rust!

Me speaking at RustConf 2023.
I gave a talk at RustConf a couple of years ago about how async Rust is a great fit for signal handling in complex applications.
I’m also the author of cargo-nextest, a next-generation test runner for Rust, where async Rust is the best way I know of to express some really complex algorithms that I wouldn’t know how to express otherwise. I wrote a blog post about this a few years ago.
Now, I work at Oxide Computer Company, where we make cloud-in-a-box computers. We make vertically integrated systems where you provide power and networking on one end, and the software you want to run on the other end, and we take care of everything in between.
Of course, we use Rust everywhere, and in particular we use async Rust extensively for our higher-level software, such as storage, networking and the customer-facing management API. But along the way we’ve encountered a number of issues around async cancellation, and a lot of this post is about what we learned along the way.
What does cancellation mean? Logically, a cancellation is exactly what it sounds like: you start some work, and then change your mind and decide to stop doing that work.
As you might imagine this is a useful thing to do:
For example, you might kick off some work and pipe its output to the head command. But then you change your mind: you want to cancel the work rather than continue it to completion.
Before we talk about async Rust, it’s worth thinking about how you’d do cancellations in synchronous Rust.
One option is to have some kind of flag you periodically check, maybe stored in an atomic:
while !should_cancel.load(Ordering::Relaxed) {
    expensive_operation();
}
This approach is fine for smaller bits of code but doesn’t really scale well to large chunks of code since you’d have to sprinkle these checks everywhere.
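The flag-checking pattern above can be made concrete with a std-only sketch (names like `run_until_cancelled` are illustrative; the sleep stands in for `expensive_operation`):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Run work repeatedly on a background thread until the flag is set,
// returning how many iterations completed before cancellation was seen.
fn run_until_cancelled(should_cancel: Arc<AtomicBool>) -> thread::JoinHandle<u32> {
    thread::spawn(move || {
        let mut iterations = 0;
        while !should_cancel.load(Ordering::Relaxed) {
            // stand-in for expensive_operation()
            thread::sleep(Duration::from_millis(5));
            iterations += 1;
        }
        iterations
    })
}

fn main() {
    let should_cancel = Arc::new(AtomicBool::new(false));
    let worker = run_until_cancelled(Arc::clone(&should_cancel));
    thread::sleep(Duration::from_millis(50));
    should_cancel.store(true, Ordering::Relaxed);
    let iterations = worker.join().unwrap();
    assert!(iterations >= 1);
    println!("worker observed the flag after {iterations} iterations");
}
```

Note that the worker only notices the flag at its next check, which is exactly why this approach doesn't scale to large chunks of code.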
A related option, if you’re working with a framework as part of your work, is to panic with a special payload of some kind.
A third option is to kill the whole process. This is a very heavyweight approach, but an effective one if you spawn separate processes to do your work.
Rather than kill the whole process, can you kill a single thread? In general, no: Rust’s standard library deliberately provides no way to forcibly kill a thread, because the thread could be holding locks or could leave shared state in an arbitrary broken state.
All of these options are suboptimal or of limited use in some way. In general, the way I think about it is that there isn’t a universal protocol for cancellation in synchronous Rust.
In contrast, there is such a protocol in async Rust, and in fact cancellations are extraordinarily easy to perform in async Rust.
Why is that so? To understand that, let’s look at what a future is.
Here’s a simple example of a future:
// This creates a state machine.
let future = async {
let data = request().await;
process(data).await
};
// Nothing executes yet. `future` is just a struct in memory.
In this future, you first perform a network request which returns some data, and then you process it.
The Rust compiler looks at this future and generates a state machine, which is just a struct or enum in memory:
// The compiler generates something like:
enum MyFuture {
Start,
WaitingForNetwork(NetworkFuture),
WaitingForProcess(ProcessFuture, Data),
Done(Result),
}
// It's just data, no running code!
If you’ve written async Rust before the async and await keywords, you’ve probably written code like it by hand. It’s basically just an enum describing all the possible states the future can be in.
The compiler also generates an implementation of the Future trait for this future:
impl Future for MyFuture {
fn poll(/* ... */) -> Poll<Self::Output> {
match self {
Start => { /* ... */ }
WaitingForNetwork(fut) => { /* ... */ }
// etc
}
}
}
and when you call .await on the future, it gets translated down to this underlying poll function. It is only when await or this poll function is called that something actually happens.
Note that this is diametrically opposed to how async works in other languages like Go, JavaScript, or C#. In those languages, when you create a future to await on, it starts doing its thing, immediately, in the background:
// JavaScript: starts running immediately
const promise = fetch('/api/data');
That’s regardless of whether you await it or not.
In Rust, this get call does nothing until you actually call .await on it:
// Rust: just data, does nothing!
let future = reqwest::get("/api/data");
I know I sound a bit like a broken record here, but if you can take away one thing from this post, it would be that futures are passive, and completely inert until awaited or polled.
So what does the universal protocol to cancel futures look like? It is simply to drop the future — to stop awaiting or polling it. Since a future is just a state machine, you can throw it away at any time the poll function isn’t actively being called.
let future = some_async_work();
drop(future); // cancelled
The upshot of all this is that any Rust future can be cancelled at any await point.
Given how hard cancellation tends to be in synchronous environments, the ability to easily cancel futures in async Rust is extraordinarily powerful—in many ways its greatest strength!
But there is a flip side, which is that cancelling futures is far, far too easy. This is for two reasons.
First, it’s just way too easy to quietly drop a future. As we’re going to see, there are all kinds of code patterns that lead to silently dropping futures.
Now this wouldn’t be so bad, if not for the second reason: that cancellation of parent futures propagates down to child futures.
Because of Rust’s single ownership model, child futures are owned by parent ones. If a parent future is dropped or cancelled, the same happens to the child.
To figure out whether a child future’s cancellation can cause issues, you have to look at its parent, and grandparent, and so on. Reasoning about cancellation becomes a very complicated non-local operation.
I’m going to cover some examples in a bit, but before we do that I want to talk about a couple terms, some of which you might have seen references to already.
The first term is cancel safety. You might have seen mentions of this in the Tokio documentation. Cancel safety, as generally defined, means the property of a future that can be cancelled (i.e. dropped) without any side effects.
For example, a Tokio sleep future is cancel safe: you can just stop waiting on the sleep and it’s completely fine.
let future = tokio::time::sleep(Duration::from_secs(1));
drop(future); // this has no side effects
An example of a future that is not cancel safe is Tokio’s MPSC send, which sends a message over a channel:
let message = /* ... */;
let future = sender.send(message);
drop(future); // message is lost!
If this future is dropped, the message is lost forever.
The important thing is that cancel safety is a local property of an individual future.
But cancel safety is not all that one needs to care about. What actually matters is the context the cancellation happens in, or in other words whether the cancellation actually causes some kind of larger property in the system to be violated.
To capture this I tend to use a different term called cancel correctness, which I define as a global property of system correctness in the face of cancellations. (This isn’t a standard term, but it’s a framing I’ve found really helpful in understanding cancellations.)
When is cancel correctness violated? It requires three things:
1. The system has a cancel-unsafe future somewhere within it. As we’ll see, many APIs that are cancel-unsafe can be reworked to be cancel-safe. If there aren’t any cancel-unsafe futures in the system, then the system is cancel correct.
2. A cancel-unsafe future is actually cancelled. This may sound a bit trivial, but if cancel-unsafe futures are always run to completion, then the system can’t have cancel correctness bugs.
3. Cancelling the future violates some property of the system. This could be data loss as with Sender::send, some kind of invariant violation, or some kind of cleanup that must be performed but isn’t.
So a lot of making Rust async robust is about trying to tackle one of these three things.
I want to zoom in for a second on invariant violations and talk about an example of a Tokio API that is very prone to cancel correctness issues: Tokio mutexes.
The way Tokio mutexes work is: you create a mutex, you lock it which gives you mutable access to the data underneath, and then you unlock it by releasing the mutex.
let guard = mutex.lock().await;
// Access guard.data, protected by the mutex...
drop(guard);
If you look at the lock function’s documentation, in the “cancel safety” section it says:
This method uses a queue to fairly distribute locks in the order they were requested. Cancelling a call to lock makes you lose your place in the queue.
Okay, so not totally cancel safe, but the only kind of unsafety is fairness, which doesn’t sound too bad.
But the problems lie in what you actually do with the mutex. In practice, most uses of mutexes are in order to temporarily violate invariants that are otherwise upheld when a lock isn’t held.
I’ll use a real world example of a cancel correctness bug that we found at my job at Oxide: we had code to manage a bunch of data sent over by our computers, which we call sleds. The shared state was guarded by a mutex, and a typical operation was:
lock the mutex, take the data out of an Option (temporarily leaving the field in a None state), process it, and put it back. Here’s a rough sketch of what that looks like:
let guard = mutex.lock().await;
// guard.data is Option<T>: Some to begin with
let data = guard.data.take(); // guard.data is now None
let new_data = process_data(data);
guard.data = Some(new_data); // guard.data is Some again
This is all well and good, but the problem is that the action being performed actually had an await point within it:
let guard = mutex.lock().await;
// guard.data is Option<T>: Some to begin with
let data = guard.data.take(); // guard.data is now None
// DANGER: cancellation here leaves data in None state!
let new_data = process_data(data).await;
guard.data = Some(new_data); // guard.data is Some again
If the code that operated on the mutex got cancelled at that await point, then the data would be stuck in the invalid None state. Not great!
And keep in mind the non-local reasoning aspect: when doing this analysis, you need to look at the whole chain of callers.
Now that we’ve talked about some of the bad things that can happen during cancellations, it’s worth asking what kinds of code patterns lead to futures being cancelled.
The most straightforward example, and maybe a bit of a silly one, is that you create a future but simply forget to call .await on it.
some_async_work(); // missing .await
Now Rust actually warns you if you don’t call .await on the future:
warning: unused implementer of `Future` that must be used
|
11 | some_async_work();
| ^^^^^^^^^^^^^^^^^
|
= note: futures do nothing unless you `.await` or poll them
But a code pattern I’ve sometimes made mistakes with is that the future returns a Result, and you want to ignore the result so you assign it to an underscore like so:
let _ = some_async_work(); // future returns Result
If I forget to call .await on the future, Rust doesn’t warn me about it at all, and then I’m left scratching my head about why this code didn’t run. I know this sounds really silly and basic, but I’ve made this mistake a bunch of times.
(After my talk, it was pointed out to me that Clippy 1.67 and above have a let_underscore_future warn-by-default lint for this. Hooray!)
Another example of futures being cancelled is try operations, such as Tokio’s try_join macro. For example:
async fn do_stuff_async() -> Result<(), &'static str> {
    // async work
    Ok(())
}
async fn more_async_work() -> Result<(), &'static str> {
    // more here
    Ok(())
}
let res = tokio::try_join!(
do_stuff_async(),
more_async_work(),
);
// ...
If you call try_join with a bunch of futures, and all of them succeed, it’s all good. But if one of them fails, the rest simply get cancelled.
In fact, at Oxide we had a pretty bad bug around this: we had code to stop a bunch of services, all expressed as futures. We used try_join:
try_join!(
stop_service_a(),
stop_service_b(),
stop_service_c(),
)?;
If one of these operations failed for whatever reason, we would stop running the code to wait for the other services to exit. Oops!
But perhaps the most well-known source of cancellations is Tokio’s select macro. Select is this incredibly beautiful operation. It is called with a set of futures, and it drives all of them forward concurrently:
tokio::select! {
result1 = future1 => handle_result1(result1),
result2 = future2 => handle_result2(result2),
}
Each future has a code block associated with it (above, handle_result1 and handle_result2). If one of the futures completes, the corresponding code block is called. But also, all of the other futures are always cancelled!
For a variety of reasons, select statements in general, and select loops in particular, are particularly prone to cancel correctness issues. So a lot of the documentation about cancel safety talks about select loops. But I want to emphasize here that select is not the only source of cancellations, just a particularly notable one.
So, now that we’ve looked at all of these issues with cancellations, what can be done about it?
First, I want to break the bad news to you – there is no general, fully reliable solution for this in Rust today. But in our experience there are a few patterns that have been successful at reducing the likelihood of cancellation bugs.
Going back to our definition of cancel correctness, there are three prongs, all of which come together to produce a bug: a cancel-unsafe future exists somewhere in the system, that future is actually cancelled, and the cancellation violates some property of the system.
Most solutions we’ve come up with try and tackle one of these prongs.
Let’s look at the first prong: the system has a cancel-unsafe future somewhere in it. Can we use code patterns to make futures be cancel-safe? It turns out we can! I’ll give you two examples here.
The first is MPSC sends. Let’s come back to the example from earlier where we would lose messages entirely:
loop {
let msg = next_message();
match timeout(Duration::from_secs(5), tx.send(msg)).await {
Ok(Ok(_)) => println!("sent successfully"),
Ok(Err(_)) => return,
Err(_) => println!("no space for 5 seconds"),
}
}
Can we find a way to make this cancel safe?
In this case, yes, and we do so by breaking up the operation into two parts:
loop {
match timeout(Duration::from_secs(5), tx.reserve()).await {
Ok(Ok(permit)) => {
permit.send(next_message());
println!("sent successfully");
}
Ok(Err(_)) => return,
Err(_) => println!("no space for 5 seconds"),
}
}
(I want to put an asterisk here that reserve is not entirely cancel-safe, since Tokio’s MPSC follows a first-in-first-out pattern and dropping the future means losing your place in line. Keep this in mind for now.)
Update 2025-10-24: The code sample now calls next_message after a permit has been reserved. Thanks to quad on Lobsters for the correction.
The second is with Tokio’s AsyncWrite.
If you’ve written synchronous Rust you’re probably familiar with the write_all method, which writes an entire buffer out:
use std::io::Write;
let buffer: &[u8] = /* ... */;
writer.write_all(buffer)?;
In synchronous Rust, this is a great API. But within async Rust, the write_all pattern is absolutely not cancel safe! If the future is dropped before completion, you have no idea how much of this buffer was written out.
use tokio::io::AsyncWriteExt;
let buffer: &[u8] = /* ... */;
writer.write_all(buffer).await?; // Not cancel-safe!
But there’s an alternative API that is cancel-safe, called write_all_buf. This API is carefully designed to enable the reporting of partial progress, and it doesn’t just accept a buffer, but rather something that looks like a cursor on top of it:
use tokio::io::AsyncWriteExt;
let mut buffer: io::Cursor<&[u8]> = /* ... */;
writer.write_all_buf(&mut buffer).await?;
When part of the buffer is written out, the cursor is advanced by that number of bytes. So if you call write_all_buf in a loop, you’ll be resuming from this partial progress, which works great.
Going back to the three prongs: the second prong is about actually cancelling futures. What code patterns can be used to not cancel futures? Here are a couple of examples.
The first one is, in a place like a select loop, resume futures rather than cancelling them each time. You’d typically achieve this by pinning a future, and then polling a mutable reference to that future. For example:
let mut future = Box::pin(channel.reserve());
loop {
tokio::select! {
permit = &mut future => break permit,
_ = other_condition => continue,
}
}
Coming back to our example of MPSC sends, the one asterisk with reserve is that cancelling it makes you lose your place in line. Instead, if you pin the reserve future and poll a mutable reference to it, you don’t lose your place in line.
(Does the difference here matter? It depends, but you can now have this strategy available to you.)
The second example is to use tasks. I mentioned earlier that futures in Rust are diametrically opposed to similar notions in languages like JavaScript. Well, there’s an alternative in async Rust that’s much closer to the JavaScript idea, and that’s tasks.
A fun example is that at Oxide, we have an HTTP server called Dropshot. Previously, whenever an HTTP request came in, we’d use a future for it, and drop the future if the TCP connection was closed.
// Before: Future cancelled on TCP close
handle_request(req).await;
This was really bad because future cancellations could happen due to the behavior of not just the parent future, but of a remote process on the other side of a network connection! This is a rather extreme form of non-local reasoning.
We addressed this by spinning up a task for each HTTP request, and by running the code to completion even if the connection is closed:
// After: Task runs to completion
tokio::spawn(handle_request(req));
The last thing I want to say is that this sucks!
The promise of Rust is that you don’t need to do this kind of non-local reasoning—that you can understand important behavior by looking at code directly around the behavior, then use the type system to scale that up to global correctness. Almost everything in Rust, from & and &mut to unsafe, is geared towards making that possible. However, future cancellations fly directly in the face of that, and I think they’re probably the least Rusty part of Rust. This is all really unfortunate.
Can we come up with something more systematic than this kind of ad-hoc reasoning?
There doesn’t exist anything in safe Rust today, but there are a few different ideas people have come up with. I wanted to give a nod to those ideas:
All of these options have really significant implementation challenges, though. This blog post from boats covers some of these solutions, and the implementation challenges with them.
In this post, we: looked at what cancellation means in synchronous and async Rust, and why dropping a future is async Rust’s universal cancellation protocol; defined cancel safety as a local property of a future and cancel correctness as a global property of a system; surveyed code patterns that silently cancel futures, such as forgotten .awaits, try_join, and select; and discussed patterns that reduce the likelihood of cancellation bugs.
Some of the recommendations are: prefer cancel-safe APIs like reserve and write_all_buf over their cancel-unsafe counterparts; avoid holding broken invariants across await points, especially under a Tokio mutex; resume pinned futures in select loops rather than recreating them each iteration; and spawn tasks for work that must run to completion.
There’s a very deep well of complexity here, a lot more than I can cover in one blog post.
If you’re curious about any of these, check out this link where I’ve put together a collection of documents and blog posts about these concepts. In particular, I’d recommend reading these two Oxide RFDs:
Thank you for reading this post to the end! And thanks to many of my coworkers at Oxide for reviewing the talk and the RFDs linked above, and for suggestions and constructive feedback.