- Proposal from 2020 about async functions which are forced to run to completion (and thereby would use graceful cancellation if necessary). Quite old, but I still feel that no better idea has come up so far. https://github.com/Matthias247/rfcs/pull/1
- Proposal for unified cancellation between sync and async Rust ("A case for CancellationTokens" - https://gist.github.com/Matthias247/354941ebcc4d2270d07ff0c6...)
- Exploration of an implementation of the above: https://github.com/Matthias247/min_cancel_token
These are all common traps. Cancellation in async Rust adds a whole new dimension to state management in async Rust (futures).
While developing the mea (Make Easy Async) [1] library, I document the cancel-safety properties whenever they're non-trivial.
Additionally, I recall [2] an instance where a careless async cancellation disrupted the IO stack.
[1] https://github.com/fast/mea
[2] https://www.reddit.com/r/rust/comments/1gfi5r1/comment/luido...
There was only one threaded web server, https://lib.rs/crates/rouille . It has 1.1M lines of code (including deps). Its hello-world example reaches only 26Krps on my machine (Apple M4 Pro). It also has a bug that makes it problematic to use in production: https://github.com/tiny-http/tiny-http/issues/221 .
I wrote the threaded web server servlin: https://lib.rs/crates/servlin . It uses async internally. It has 221K lines of code. Its hello-world example reaches 102Krps on my machine.
https://lib.rs/crates/ehttpd is another one but it has no tests and it seems abandoned. It does an impressive 113Krps without async, using only 8K lines of code.
For comparison, the popular Axum async web server has 4.3M lines of code and its hello-world example reaches 190Krps on my machine.
The popular threaded Postgres client uses Tokio internally and has 1M lines of code: http://lib.rs/postgres .
Recently a threaded Postgres client was released. It has 500K lines of code: https://lib.rs/crates/postgres_sync .
There was no ergonomic way to signal cancellation to threads, so I wrote one: https://crates.io/crates/permit .
Rust's threaded libraries are starting to catch up to the async libraries!
---
I measured lines of code with `rm -rf deps.filtered && cargo vendor-filterer --platform=aarch64-apple-darwin --exclude-crate-path='*#tests' deps.filtered && tokei deps.filtered`.
I ran web servers with `cargo run --release --example hello-world` and measured throughput with `rewrk -c 1000 -d 10s -h http://127.0.0.1:3000/`.
is the title like that on purpose?
The examples presented for "cancel unsafe" futures suggest to me that the root of the problem is a misalignment between expectations and reality:
Example 1: one future cancelled on error in the other
let res = tokio::try_join!( do_stuff_async(), more_async_work(), );
Example 2: data not written out on cancellation
let buffer: &[u8] = /* ... */; writer.write_all(buffer).await?;
Both of these cases are claimed to not be cancel-safe, because the work gets interrupted and so not driven to completion. But again, what else is supposed to happen? If you want the work to finish regardless of the async context being cancelled, then don't put it in the same async context but spawn a task instead.
I feel like I must be missing something obvious that keeps me from understanding the author's issue here. I thought work getting dropped on cancellation is exactly how futures are supposed to work. What's the nuance that I'm missing?
Glad to see it converted to a blog post. Talks are great, but blogs are much easier to share and reference.
What's he trying to do? Get a clean program shutdown? That's moderately difficult in threaded programs, and async has problems, too. The use case here is unclear.
The real use cases involve when you're sending messages back and forth to a remote site, and the remote site goes away. Now you need to dispose of the state on your end.
It's really not about "cancelling async Rust" which is what I expected, even if it didn't make much sense.
I really hope we get async drop soon.
I'm sure experienced async Rust programmers always have these things in mind, but Rust is also about preventing these kinds of easy-to-miss behaviours, be it via the type system or otherwise.
I don't like the "cancel safety" term. Not only is it unrelated to Rust's concept of safety, it's also unnecessarily judgemental.
Safe/unsafe implies there's a better or worse behavior, but what is desirable for cancellation to do is highly context-dependent.
Futures awaiting spawned tasks are called "cancellation safe", because they won't stop the task when dropped. But that's not an inherently safe behavior – leaving tasks running after their spawner has been cancelled could be a bug: piling up work that won't be used, and even interfering with the rest of the program by keeping locks locked or ports used. OTOH a spawn handle that stops the task when dropped would be called "cancellation unsafe", despite being a very useful construct specifically for propagating cleanup to dependent tasks.
This abstraction has served me well and facilitates stepping through code in a debugger, though I drop out of it when I need to reason at a lower level.
Maybe it's a bit contrived, but it's also the kind of code you'd sprinkle through your system in response to "nothing seems to be happening and I don't know why".
Let's say my code looks like this
async fn a() {
    b().await
}

async fn b() {
    c().await;
    d().await
}

async fn c() {}

async fn d() {}
Where does an issue occur which causes `d` not to be called? Is it some sort of cancellation in `c`? Or some upstream action in `a`? Or am I missing context?
It also wouldn't help when you have no valid state to restore to, as in the mutex example in the post.
I am asking because I've noticed that many developers with previous experience from "task-based" languages (specifically the JS/TS world) tend to grasp the basics of Rust async quickly enough, but then run into expectation-misalignment problems similar to the examples that you used in your post. That in turn has made me want to understand whether Rust futures are themselves difficult or strange, or whether Rust futures merely appear simple and familiar while being completely different in very subtle ways. I suppose it's a combination of both.
The note on mpsc::Sender::send losing the message on drop [1] was actually added by me [2], after I wrote the Oxide RFD on cancellations [3] that this talk is a distilled form of. So even the great folks on the Tokio project hadn't documented this particular landmine.
[1] https://docs.rs/tokio/latest/tokio/sync/mpsc/struct.Sender.h...
If you want to tie multiple actions together as an atomic unit, you need the other side to have some concept of transactions, and you need to use it.
So on cancellation, the transaction times out and nothing is written. Bad but safe.
The problem is the same on other platforms. For example, what if you're on Python and writing to the DB throws an exception? Your app just dies and the transaction times out. Unfortunate but safe.
If it does not run transactionally you have a problem in any execution scenario.
`d` not being called would happen because of actions in `a`.
If `a` were rewritten as
async fn a() {
    try_join!(b(), c(), d())
}
Then if `c` ends up failing in the try_join, `b` will be halted, and thus the `d` inside `b` won't be executed.

Of course... it's obviously not as simple as "just give me a way to turn it off", but more importantly, I just don't see this concern being addressed by the Powers That Be. Am I just not looking hard enough? Did I miss the Rust blog post titled "hey - so you didn't want to use async but the libraries that you did want to use ship with async so you're up shit creek... Here's what our plan for that is"?
I'm sorry. I generally lurk because I don't consider myself up to the caliber of others on this website, but nonetheless the few posts I make do end up being about async because it does make me feel quite hopeless at times. Hopefully someone can look past my ignorance/incompetence/selfishness/immaturity and tell me it's all going to be okay.
Also, as another comment on the thread points out [1], languages where futures are active by default can have the opposite problem.
IMHO async is an anti-pattern, and probably the final straw that will prevent me from ever finishing learning Rust. Once one learns pass-by-value and copy-on-write semantics (Clojure, PHP arrays), the world starts looking like a spreadsheet instead of spaghetti code. I feel that a Rust-like language could be built with no borrow checker, simply by allocating twice the memory. Since that gets ever-less expensive, I'm just not willing to die on the hill of efficiency anymore. I predict that someday Rust will be relegated to porting scripting languages to a bare-metal runtime, but will not be recommended for new work.
That said, I think that Rust would make a great teaching tool in an academic setting, as the epitome of imperative languages. Maybe something great will come of it, like Swift from Objective-C or Kotlin from Java. And having grown up on C++, I have a soft spot in my heart for solving the hard problems in the fastest way possible. Maybe a voxel game in Rust, I dunno.
(This is related to the fact that Rust doesn't have async drop — you can't run async code on drop, other than spawning a new task to do the cleanup.)
This is prong 3 of my cancel correctness framework (that the cancellation violates a system property, in this case a cleanup property.) The solution here is to ensure the connection is in a pristine state before handing it out the next time it's used.
To this day I'm not aware of a better way to express what's become a set of increasingly complex state machines (the most recent improvement being to make the state machines responsive to user input). Nextest's runner loop is structured mostly like a GUI event loop, but without explicit state machines. It's quite nice being able to write code that's this complex in a bug-free manner.
Is this not enough? What could go wrong? If the network connection dies or the task is cancelled, I'm assuming the database server cleans up the connection state and does a rollback automatically.
And adding async Drop will probably add a whole new set of footguns.
That kind of thinking made sense in the 90s when things followed Moore’s law. But DRAM was one of the first things to fail to keep up: https://ourworldindata.org/grapher/historical-cost-of-comput... and it barely gets cheaper anymore. That’s why mobile phones still only have 16 GB of memory despite having 4 GB a decade ago.
And there’s all sorts of problems that Rust doesn’t necessarily make a great fit for. But Rust’s target marketplace is where you’d otherwise use a low level language like C or C++. If you can just heap allocate everything and aggressively create copies all over the place, then why would you ever use those languages in the first place?
And for what it’s worth Rust is finding a lot of success even replacing all the tooling in other language ecosystems like Ruby, Python, and JS precisely because the tools in those ecosystems written in the native language end up being horribly slow. And memory allocation and randomly deep copying arrays are the kinds of things that add up and make things slow (in addition to GC pauses, slow startups, interpreter costs etc).
And you can always choose not to do async in Rust although personally I’m a huge fan as it makes it really clear where you have sprinkled in I/O in places you shouldn’t have.
It analyses code. If it finds RAII/linearity/single ownership, it manages memory exactly like Rust does.
But if it does not, it falls back to reference counting.
So it does what Rust does, but automagically, without polluting the code.
So copy-on-write, pass-by-value, or doubled memory are not the only options for improving on Rust.
If that's what you're looking for, have you considered OCaml?
Because Rust is ultimately constructing a state machine which is run by the caller, the execution of that state machine can be interrupted or partially executed at any of the `await` points. Or, more accurately, the caller can simply not advance the state machine.
So, the `try_join` macro can start work on the various functions and if any of them fail, the others are ultimately cancelled. Which can happen before those functions finish fully executing.
This is particularly bad if there's a partial state change.
I'm not entirely sure what that means for memory allocation.
Oxide looks to be superb engineering up and down the whole stack, and if it drives more rust code into linux all the better.
Now that linode has been consumed by Akamai, we need an alternative.
LoL, an insane amount of things. TCP connections are an illusion of safety; for the purposes of database commits, use UDP packets as a model instead. It'll be much closer to reality.
I used to write web backends in Clojure, and justified it with the fact that the JVM has some of the best profiling tools available (I still believe this), and the JVM itself exposes lots of knobs to not only fine-tune the GC, but even choose a GC! (This cannot be overstated; garbage collectors tend to be deeply integrated into a language's runtime, and it's amazing to me that the Java platform manages to ship several garbage collectors, each of which is optimal in its own specific situations).
After rewriting an NLP-heavy web app in Rust, I saw massive performance gains over the original Clojure version, even though both aggressively copy data and the Rust version is full of atomic refcounts (atomic refcounting is not the fastest GC out there...)
The binary emitted by rustc is also much smaller. ~10 MB static binary vs. GraalVM's ~80 MB native images (and longer build times, since classpath analysis and reflection scanning require a lot of work)
What surprised me the most is how high-level Rust feels in practice. I can use pattern matching, async/await, functional programming idioms, etc., and it ends up being fast anyway. Coming from Clojure, Rust syntax trying its best to be expression-oriented is a key differentiator from other languages in its target domain (notably, C++). I sometimes miss TypeScript's anonymous enums, but Rust's type system can express a lot of runtime behavior, and it's partly why many jokingly state "if it compiles, it's likely correct". Then there are the little things, like how Rust's Futures don't immediately start in the background. In contrast, JavaScript Promises are immediately pushed to a microtask queue, so cancelling a Promise is impossible by design.
Overall, it's the little things like this -- and the toolchain (cargo, clippy, rustfmt) -- that have kept me using Rust. I can write high-level code and still compile down to a ~5 MB binary and outperform idiomatic code in other languages I'm familiar with (e.g. Clojure, Java, and TypeScript).
Right, and that is one of the absolute worst things about the Rust ecosystem. Most programs don't benefit from async, and should use plain old threads because they are much easier to work with.
1) I learned about pin in Rust to prevent values from moving in memory.
2) I learned about the html <summary> tag (the turndown arrows in your article that work with Javascript disabled) hah.
I can see how dealing with stream and resource cleanup in async code could be a chore. It sounds like you were able to do that in a fairly declarative manner, which is what I always strive for as well.
I think my hesitation with async is that I already went down that road early in my programming life with cooperative threads/multitasking on Mac OS 9 and earlier. There always seems to be yet another brittle edge case to deal with, so it can feel infuriating playing whack-a-mole until they're all nailed down.
For example, pinning memory looks a lot like locking handles in Mac OS. Handles were pointers to pointers, so it was a bare hands way to implement a memory defragmenter before runtimes were smart enough to handle it. If apps used handles, then blocks of data could be unlocked, moved somewhere else in memory, and then re-locked. Code had to do an extra hop through each handle to get to the original pointer, which was a frequent source of bugs because one async process might be working on a block, yield, and then have another async process move the handle out from under it.
The lock's state was stored in a flag in the memory manager, basically a small bit of metadata. I haven't investigated, but I suspect that Rust may be able to handle locking more efficiently, perhaps more like reference counting or the borrow checker where it can infer whether a pointer is locked without storing that flag somewhere (but I could be wrong).
Apple abandoned handles when it migrated to OS 10 and Darwin inherited protected memory and better virtual memory from FreeBSD. Although now that I write this out, I'm not sure that they solved in-process fragmentation. I think they just give apps the full 32 or 64 bit address space so that effectively there is always another region available for the next allocation, and let the virtual memory subsystem consolidate 4k memory blocks into contiguous strips internally. The dereferencing of memory step became implicit rather than explicit, as well as hidden from apps, so that whole classes of bugs became unreachable.
Anyway, that's why I prefer the runtime to handle more of this. I want strong guarantees that I can terminate a process and all locks inside it will get freed as well. I can pretty much rely on that even in hacky languages like PHP.
My frustration with all of this is that we could/should have demanded better runtimes. We could have had realtime unixes where task switching and memory allocation were effectively free. Unfortunately the powers that be (Mac OS and Windows) had runtimes that were too entrenched with too many users relying on quirks and so they dragged their feet and never did better. Languages like Rust were forced to get very clever and go to the ends of the earth to work around that. Then when companies like Google and Facebook won the internet lottery, they pulled the ladder up behind them by unilaterally handing down decrees from on high that developers should use bare hands techniques, rather than putting real resources into reforming the fundamentals so that we wouldn't have to.
What I'm trying to say is that your solution is clever and solves a common pattern in about the simplest way possible, but is not as simple as synchronous-blocking unix pipes to child processes in shell scripts. That's in no way a criticism. I have similar feelings about stuff like Docker and Kubernetes after reading about Podman. If we could magically go back and see the initial assumptions that led us down the road we're on, we might have tried different approaches. It's all of those roads not taken that haunt me, because they represent so much of my workload each day.
https://kushallabs.com/understanding-concurrency-in-go-green...
So lots of concepts are worth learning like atomicity, ACID compliance, write ahead logs (WALs), statically detecting livelocks and deadlocks (or making them unreachable), consensus algorithms like Raft and Paxos, state transfer algorithms like software transaction memory (STM), connectionless state transfer like hash trees and Merkle trees, etc.
The key insight is that manual management of tasks is, for the most part, not tenable by humans. It's better to take a step back and work at a higher level of abstraction. For example, declarative programming works in terms of goals/specifications/tests, so that the runner has more freedom to cancel and restart/retry tasks arbitrarily. That way the user can fire off a workload and wait until all of the tasks match a success criteria, and even treat that process as idempotent so it can all be run again without harm. In this way, trees of success criteria can be composed to manage a task pool.
I'd probably point to CockroachDB as one of the best task-cancellers, since it doesn't have a shutdown procedure. Its process can simply be terminated by the user with control-c, then it reconciles any outstanding transactions the next time it's booted, which just adds some latency. If an entire database can do that, then "this is the way".
I think that Rust is making an admirable attempt to attack challenges that have already been solved better in other ways. I just don't have much use for its arsenal.
For example, I wasted 2 years of my life trying to write a NAT-punching peer to peer networking framework for games around 2005, but was first exposed to synchronous blocking vs asynchronous nonblocking networking in the late 90s when I read Beej's Guide to Network Programming:
I was hopelessly trying to mimic the functionality of libraries like RakNet and Zoidcom without knowing some fundamentals that I wouldn't fully understand for years:
https://www.reddit.com/r/gamedev/comments/93kr9h/recommended...
20 years later, Rust has iroh:
https://github.com/n0-computer/iroh
I realize there is some irony in pointing to a Rust library as a final solution.
But my point is that when developers reached high levels of financial success and power, they didn't go back to address the fundamentals. NAT was always an abomination to me. And as far as I know, they kept it in IPv6. Someone like Google should have provided a way to get around it that's not as heavy as WebRTC. So many developer years of work have been wasted due to the mistakes of the status quo. So that we wander in the desert for years using lackluster paradigms because we don't know that better stuff exists.
Knowing what I know now, I would have created open source C (portable) libraries to solve NAT punching, state transfer with a software transactional memory (STM) or Raft, entity state machines (like in Unity), movement prediction/dead reckoning, etc etc etc to form the basis of a distributed computing network for virtual worlds and let the developer community solve that. Someone will do that in a year or two with AI now I assume.
Ok you kinda got me. I realize after writing this out that I wouldn't use Rust for new work, but it's not so much about the language itself as building upon proven layers to "get real work done". The lower the level of abstraction, the harder that is to do. So it's hard for me to see the problem which Rust is trying to solve.
If we imagine a function passing a block of memory to sub functions which may write bytes to it randomly, then each of those writes may allocate another block. If those allocations are similar in size to the VM block size, then each invocation can potentially double the amount of memory used.
A do-one-thing-and-do-it-well (DOTADIW?) program works in a one-shot fashion where the main process fires off child processes that return and free the memory that was passed by value. Surrounded by pipes, so that data is transmuted by each process and sent to the next one. VM usage may grow large temporarily per-process, but overall we can think of each concurrent process as roughly doubling the amount of memory.
Writing this out, I realized that the worst case might be more like every byte changing in a 4k block, so a 4096 times increase in memory. Which still might be reasonable, since we accept roughly a 200x speed decrease for scripting languages. It might be worth profiling PHP to see how much memory increases when every byte in a passed array is modified. Maybe they use a clever tree or refcount strategy to reduce the amount of storage needed when arrays are modified. Or maybe they just copy the entire array?
Another avenue of research might be determining whether a smarter runtime could work with "virtual" VMs (VVMs?) to use a really small block size, maybe 4 or 8 bytes to match the memory bus. I'd be willing to live with a 4x or 8x increase in memory to avoid borrow checkers, refcounts or garbage collection.
-
Edit: after all these years, I finally looked up how PHP handles copy-on-write, and it does copy the whole array on write unfortunately:
http://hengrui-li.blogspot.com/2011/08/php-copy-on-write-how...
If I were to write something like this today, I'd maybe use "smart" associative arrays of some kind instead of contiguous arrays, so that only the modified section would get copied. Internally that might be a B-Tree with perhaps 8 bytes per leaf to hold N primitives like 1 double, 2 floats, etc. In practice, a larger size like 16-256 bytes per leaf might improve performance at the cost of memory.
Looks like ZFS deduplication only copies the blocks within the file that changed, not the entire file. Their strategy could be used for a VM so that copy-on-write between processes only copies the 4k blocks that change. Then if it was a realtime unix, functions could be synchronous blocking processes that could be called with little or no overhead.
This is the level of work that would be required to replace Rust with simpler metaphors, and why it hasn't happened yet.
List a couple
> TCP connections are an illusion of safety
Why?
It is not as simple as synchronous pipes, but it also has far better edge case and error handling.
For example, on Unix, if you press ctrl-Z to pause execution, nextest will send SIGTSTP to test processes and also pause its internal timers (resuming them when you type in fg or bg). That kind of bookkeeping is pretty hard to do with linear code, and especially hard to coordinate across subprocesses.
State machines with message passing (as seen in GUI apps) are very helpful at handling this, but they're quite hard to write by hand.
The async keyword in Rust allows you to write state machines that look somewhat like linear code (though with the big cancellation asterisk).
Not really. The talk describes problems that can show up in any environment where you have concurrency and cancellation. To adapt some examples: a thread that consumes a message from a channel but is killed before it can process it, has still resulted in that message being lost. A synchronous task that needs to temporarily violate invariants in some data structure that can't be updated atomically, has still left that data structure in an invalid state when it gets killed part way through.
> Arguably the Go language's goroutines strike a good balance between cooperate and preemptive threads/multitasking.
Goroutines are pretty nice. It's especially nice that Go has avoided the function colouring problem. I'm not convinced that having to litter your code with selects to make your goroutines cancellable is good, though. And if you don't care about being able to cancel tasks, you can write async Rust in a way that ensures they won't be cancelled by accident fairly easily. Unless there's some better way to write cancellable goroutines that I'm not familiar with.
> The key insight is that manual management of tasks is, for the most part, not tenable by humans. It's better to take a step back and work at a higher level of abstraction.
Of course it's always important to look at systems as a whole. But to build larger systems out of smaller components you need to actually build the small components.
> I'd probably point to CockroachDB as one of the best task-cancellers, since it doesn't have a shutdown procedure. Its process can simply be terminated by the user with control-c, then it reconciles any outstanding transactions the next time it's booted, which just adds some latency. If an entire database can do that, then "this is the way".
I'm not familiar with CockroachDB specifically, but I do think a database should generally have a more involved happy-path shutdown procedure than that. In particular, I would like the database not to begin processing new transactions if it is not going to be able to finish them before it needs to shut down, even if not finishing them wouldn't violate ACID or any of my invariants.
I'm a big fan of the type system and how expressive I feel with Rust. The compiler is incredibly helpful too. rust-analyzer is a superpower. Just yesterday I embarked on a pretty big refactor and all it took was changing a couple of types—and then fixing the 500 problems vscode was pointing out.
Being able to jump in at the deep end like this in a ~90kloc codebase is only feasible (to me) because I know the tooling has my back.
It's not the perfect tool for every project. But it's a really great choice for a really large number of projects. I encourage you to try it a little more in a variety of domains to see if it clicks.
This is an edited, written version of my RustConf 2025 talk about cancellations in async Rust. Like the written version of my RustConf 2023 talk, I’ve tried to retain the feel of a talk while making it readable as a standalone blog entry. Some links:
Let’s start with a simple example – you decide to read from a channel in a loop and gather a bunch of messages:
loop {
    match rx.recv().await {
        Ok(msg) => process(msg),
        Err(_) => return,
    }
}
All good, nothing wrong with this, but you realize sometimes the channel is empty for long periods of time, so you add a timeout and print a message:
loop {
    match timeout(Duration::from_secs(5), rx.recv()).await {
        Ok(Ok(msg)) => process(msg),
        Ok(Err(_)) => return,
        Err(_) => println!("no messages for 5 seconds"),
    }
}
There’s nothing wrong with this code—it behaves as expected.
Now you realize you need to write a bunch of messages out to a channel in a loop:
loop {
    let msg = next_message();
    match tx.send(msg).await {
        Ok(_) => println!("sent successfully"),
        Err(_) => return,
    }
}
But sometimes the channel gets too full and blocks, so you add a timeout and print a message:
loop {
    let msg = next_message();
    match timeout(Duration::from_secs(5), tx.send(msg)).await {
        Ok(Ok(_)) => println!("sent successfully"),
        Ok(Err(_)) => return,
        Err(_) => println!("no space for 5 seconds"),
    }
}
It turns out that this code is often incorrect, because not all messages make their way to the channel.
Hi, I’m Rain, and this post is about cancelling async Rust. This post is split into three parts:
Before we begin, I want to lay my cards on the table – I really love async Rust!

Me speaking at RustConf 2023.
I gave a talk at RustConf a couple of years ago about how async Rust is a great fit for signal handling in complex applications.
I’m also the author of cargo-nextest, a next-generation test runner for Rust, where async Rust is the best way I know of to express some really complex algorithms that I wouldn’t know how to express otherwise. I wrote a blog post about this a few years ago.
Now, I work at Oxide Computer Company, where we make cloud-in-a-box computers. We make vertically integrated systems where you provide power and networking on one end, and the software you want to run on the other end, and we take care of everything in between.
Of course, we use Rust everywhere, and in particular we use async Rust extensively for our higher-level software, such as storage, networking and the customer-facing management API. But along the way we’ve encountered a number of issues around async cancellation, and a lot of this post is about what we learned along the way.
What does cancellation mean? Logically, a cancellation is exactly what it sounds like: you start some work, and then change your mind and decide to stop doing that work.
As you might imagine this is a useful thing to do:
For example, you might kick off some work and pipe its output to the head command. But then you change your mind: you want to cancel the work rather than continue it to completion.
Before we talk about async Rust, it’s worth thinking about how you’d do cancellations in synchronous Rust.
One option is to have some kind of flag you periodically check, maybe stored in an atomic:
while !should_cancel.load(Ordering::Relaxed) {
    expensive_operation();
}
This approach is fine for smaller bits of code but doesn’t really scale well to large chunks of code since you’d have to sprinkle these checks everywhere.
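The flag-checking pattern above can be made concrete with a std-only sketch (names like `run_until_cancelled` are illustrative; the sleep stands in for `expensive_operation`):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Run work repeatedly on a background thread until the flag is set,
// returning how many iterations completed before cancellation was seen.
fn run_until_cancelled(should_cancel: Arc<AtomicBool>) -> thread::JoinHandle<u32> {
    thread::spawn(move || {
        let mut iterations = 0;
        while !should_cancel.load(Ordering::Relaxed) {
            // stand-in for expensive_operation()
            thread::sleep(Duration::from_millis(5));
            iterations += 1;
        }
        iterations
    })
}

fn main() {
    let should_cancel = Arc::new(AtomicBool::new(false));
    let worker = run_until_cancelled(Arc::clone(&should_cancel));
    thread::sleep(Duration::from_millis(50));
    should_cancel.store(true, Ordering::Relaxed);
    let iterations = worker.join().unwrap();
    assert!(iterations >= 1);
    println!("worker observed the flag after {iterations} iterations");
}
```

Note that the worker only notices the flag at its next check, which is exactly why this approach doesn't scale to large chunks of code.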
A related option, if you’re working with a framework as part of your work, is to panic with a special payload of some kind.
A third option is to kill the whole process. This is a very heavyweight approach, but an effective one if you spawn separate processes to do your work.
Rather than kill the whole process, can you kill a single thread? In general, no: Rust’s standard library deliberately provides no way to forcibly kill a thread, because the thread could be holding locks or could leave shared state in an arbitrary broken state.
All of these options are suboptimal or of limited use in some way. In general, the way I think about it is that there isn’t a universal protocol for cancellation in synchronous Rust.
In contrast, there is such a protocol in async Rust, and in fact cancellations are extraordinarily easy to perform in async Rust.
Why is that so? To understand that, let’s look at what a future is.
Here’s a simple example of a future:
// This creates a state machine.
let future = async {
let data = request().await;
process(data).await
};
// Nothing executes yet. `future` is just a struct in memory.
In this future, you first perform a network request which returns some data, and then you process it.
The Rust compiler looks at this future and generates a state machine, which is just a struct or enum in memory:
// The compiler generates something like:
enum MyFuture {
Start,
WaitingForNetwork(NetworkFuture),
WaitingForProcess(ProcessFuture, Data),
Done(Result),
}
// It's just data, no running code!
If you’ve written async Rust before the async and await keywords, you’ve probably written code like it by hand. It’s basically just an enum describing all the possible states the future can be in.
The compiler also generates an implementation of the Future trait for this future:
impl Future for MyFuture {
fn poll(/* ... */) -> Poll<Self::Output> {
match self {
Start => { /* ... */ }
WaitingForNetwork(fut) => { /* ... */ }
// etc
}
}
}
and when you call .await on the future, it gets translated down to this underlying poll function. It is only when await or this poll function is called that something actually happens.
Note that this is diametrically opposed to how async works in other languages like Go, JavaScript, or C#. In those languages, when you create a future to await on, it starts doing its thing, immediately, in the background:
// JavaScript: starts running immediately
const promise = fetch('/api/data');
That’s regardless of whether you await it or not.
In Rust, this get call does nothing until you actually call .await on it:
// Rust: just data, does nothing!
let future = reqwest::get("/api/data");
I know I sound a bit like a broken record here, but if you can take away one thing from this post, it would be that futures are passive, and completely inert until awaited or polled.
So what does the universal protocol to cancel futures look like? It is simply to drop the future — to stop awaiting or polling it. Since a future is just a state machine, you can throw it away at any time the poll function isn’t actively being called.
let future = some_async_work();
drop(future); // cancelled
The upshot of all this is that any Rust future can be cancelled at any await point.
Given how hard cancellation tends to be in synchronous environments, the ability to easily cancel futures in async Rust is extraordinarily powerful—in many ways its greatest strength!
But there is a flip side, which is that cancelling futures is far, far too easy. This is for two reasons.
First, it’s just way too easy to quietly drop a future. As we’re going to see, there are all kinds of code patterns that lead to silently dropping futures.
Now this wouldn’t be so bad, if not for the second reason: that cancellation of parent futures propagates down to child futures.
Because of Rust’s single ownership model, child futures are owned by parent ones. If a parent future is dropped or cancelled, the same happens to the child.
To figure out whether a child future’s cancellation can cause issues, you have to look at its parent, and grandparent, and so on. Reasoning about cancellation becomes a very complicated non-local operation.
I’m going to cover some examples in a bit, but before we do that I want to talk about a couple terms, some of which you might have seen references to already.
The first term is cancel safety. You might have seen mentions of this in the Tokio documentation. Cancel safety, as generally defined, means the property of a future that can be cancelled (i.e. dropped) without any side effects.
For example, a Tokio sleep future is cancel safe: you can just stop waiting on the sleep and it’s completely fine.
let future = tokio::time::sleep(Duration::from_secs(1));
drop(future); // this has no side effects
An example of a future that is not cancel safe is Tokio’s MPSC send, which sends a message over a channel:
let message = /* ... */;
let future = sender.send(message);
drop(future); // message is lost!
If this future is dropped, the message is lost forever.
The important thing is that cancel safety is a local property of an individual future.
But cancel safety is not all that one needs to care about. What actually matters is the context the cancellation happens in, or in other words whether the cancellation actually causes some kind of larger property in the system to be violated.
To capture this I tend to use a different term called cancel correctness, which I define as a global property of system correctness in the face of cancellations. (This isn’t a standard term, but it’s a framing I’ve found really helpful in understanding cancellations.)
When is cancel correctness violated? It requires three things:
1. The system has a cancel-unsafe future somewhere within it. As we’ll see, many APIs that are cancel-unsafe can be reworked to be cancel-safe. If there aren’t any cancel-unsafe futures in the system, then the system is cancel correct.
2. A cancel-unsafe future is actually cancelled. This may sound a bit trivial, but if cancel-unsafe futures are always run to completion, then the system can’t have cancel correctness bugs.
3. Cancelling the future violates some property of the system. This could be data loss as with Sender::send, some kind of invariant violation, or some kind of cleanup that must be performed but isn’t.
So a lot of making Rust async robust is about trying to tackle one of these three things.
I want to zoom in for a second on invariant violations and talk about an example of a Tokio API that is very prone to cancel correctness issues: Tokio mutexes.
The way Tokio mutexes work is: you create a mutex, you lock it which gives you mutable access to the data underneath, and then you unlock it by releasing the mutex.
let guard = mutex.lock().await;
// Access guard.data, protected by the mutex...
drop(guard);
If you look at the lock function’s documentation, in the “cancel safety” section it says:
This method uses a queue to fairly distribute locks in the order they were requested. Cancelling a call to lock makes you lose your place in the queue.
Okay, so not totally cancel safe, but the only kind of unsafety is fairness, which doesn’t sound too bad.
But the problems lie in what you actually do with the mutex. In practice, most uses of mutexes are in order to temporarily violate invariants that are otherwise upheld when a lock isn’t held.
I’ll use a real world example of a cancel correctness bug that we found at my job at Oxide: we had code to manage a bunch of data sent over by our computers, which we call sleds. The shared state was guarded by a mutex, and a typical operation was:
lock the mutex, take the data out of an Option (temporarily leaving the field in a None state), process it, and put it back. Here’s a rough sketch of what that looks like:
let guard = mutex.lock().await;
// guard.data is Option<T>: Some to begin with
let data = guard.data.take(); // guard.data is now None
let new_data = process_data(data);
guard.data = Some(new_data); // guard.data is Some again
This is all well and good, but the problem is that the action being performed actually had an await point within it:
let guard = mutex.lock().await;
// guard.data is Option<T>: Some to begin with
let data = guard.data.take(); // guard.data is now None
// DANGER: cancellation here leaves data in None state!
let new_data = process_data(data).await;
guard.data = Some(new_data); // guard.data is Some again
If the code that operated on the mutex got cancelled at that await point, then the data would be stuck in the invalid None state. Not great!
And keep in mind the non-local reasoning aspect: when doing this analysis, you need to look at the whole chain of callers.
Now that we’ve talked about some of the bad things that can happen during cancellations, it’s worth asking what kinds of code patterns lead to futures being cancelled.
The most straightforward example, and maybe a bit of a silly one, is that you create a future but simply forget to call .await on it.
some_async_work(); // missing .await
Now Rust actually warns you if you don’t call .await on the future:
warning: unused implementer of `Future` that must be used
|
11 | some_async_work();
| ^^^^^^^^^^^^^^^^^
|
= note: futures do nothing unless you `.await` or poll them
But a code pattern I’ve sometimes made mistakes with is that the future returns a Result, and you want to ignore the result so you assign it to an underscore like so:
let _ = some_async_work(); // future returns Result
If I forget to call .await on the future, Rust doesn’t warn me about it at all, and then I’m left scratching my head about why this code didn’t run. I know this sounds really silly and basic, but I’ve made this mistake a bunch of times.
(After my talk, it was pointed out to me that Clippy 1.67 and above have a let_underscore_future warn-by-default lint for this. Hooray!)
Another example of futures being cancelled is try operations, such as Tokio’s try_join macro. For example:
async fn do_stuff_async() -> Result<(), &'static str> {
    // async work
    Ok(())
}
async fn more_async_work() -> Result<(), &'static str> {
    // more here
    Ok(())
}
let res = tokio::try_join!(
do_stuff_async(),
more_async_work(),
);
// ...
If you call try_join with a bunch of futures, and all of them succeed, it’s all good. But if one of them fails, the rest simply get cancelled.
In fact, at Oxide we had a pretty bad bug around this: we had code to stop a bunch of services, all expressed as futures. We used try_join:
try_join!(
stop_service_a(),
stop_service_b(),
stop_service_c(),
)?;
If one of these operations failed for whatever reason, we would stop running the code to wait for the other services to exit. Oops!
But perhaps the most well-known source of cancellations is Tokio’s select macro. Select is this incredibly beautiful operation. It is called with a set of futures, and it drives all of them forward concurrently:
tokio::select! {
result1 = future1 => handle_result1(result1),
result2 = future2 => handle_result2(result2),
}
Each future has a code block associated with it (above, handle_result1 and handle_result2). If one of the futures completes, the corresponding code block is called. But also, all of the other futures are always cancelled!
For a variety of reasons, select statements in general, and select loops in particular, are particularly prone to cancel correctness issues. So a lot of the documentation about cancel safety talks about select loops. But I want to emphasize here that select is not the only source of cancellations, just a particularly notable one.
So, now that we’ve looked at all of these issues with cancellations, what can be done about it?
First, I want to break the bad news to you – there is no general, fully reliable solution for this in Rust today. But in our experience there are a few patterns that have been successful at reducing the likelihood of cancellation bugs.
Going back to our definition of cancel correctness, there are three prongs, all of which come together to produce a bug: a cancel-unsafe future exists somewhere in the system, that future is actually cancelled, and the cancellation violates some property of the system.
Most solutions we’ve come up with try and tackle one of these prongs.
Let’s look at the first prong: the system has a cancel-unsafe future somewhere in it. Can we use code patterns to make futures be cancel-safe? It turns out we can! I’ll give you two examples here.
The first is MPSC sends. Let’s come back to the example from earlier where we would lose messages entirely:
loop {
let msg = next_message();
match timeout(Duration::from_secs(5), tx.send(msg)).await {
Ok(Ok(_)) => println!("sent successfully"),
Ok(Err(_)) => return,
Err(_) => println!("no space for 5 seconds"),
}
}
Can we find a way to make this cancel safe?
In this case, yes, and we do so by breaking up the operation into two parts:
loop {
match timeout(Duration::from_secs(5), tx.reserve()).await {
Ok(Ok(permit)) => {
permit.send(next_message());
println!("sent successfully");
}
Ok(Err(_)) => return,
Err(_) => println!("no space for 5 seconds"),
}
}
(I want to put an asterisk here that reserve is not entirely cancel-safe, since Tokio’s MPSC follows a first-in-first-out pattern and dropping the future means losing your place in line. Keep this in mind for now.)
Update 2025-10-24: The code sample now calls next_message after a permit has been reserved. Thanks to quad on Lobsters for the correction.
The second is with Tokio’s AsyncWrite.
If you’ve written synchronous Rust you’re probably familiar with the write_all method, which writes an entire buffer out:
use std::io::Write;
let buffer: &[u8] = /* ... */;
writer.write_all(buffer)?;
In synchronous Rust, this is a great API. But within async Rust, the write_all pattern is absolutely not cancel safe! If the future is dropped before completion, you have no idea how much of this buffer was written out.
use tokio::io::AsyncWriteExt;
let buffer: &[u8] = /* ... */;
writer.write_all(buffer).await?; // Not cancel-safe!
But there’s an alternative API that is cancel-safe, called write_all_buf. This API is carefully designed to enable the reporting of partial progress, and it doesn’t just accept a buffer, but rather something that looks like a cursor on top of it:
use tokio::io::AsyncWriteExt;
let mut buffer: io::Cursor<&[u8]> = /* ... */;
writer.write_all_buf(&mut buffer).await?;
When part of the buffer is written out, the cursor is advanced by that number of bytes. So if you call write_all_buf in a loop, you’ll be resuming from this partial progress, which works great.
Going back to the three prongs: the second prong is about actually cancelling futures. What code patterns can be used to not cancel futures? Here are a couple of examples.
The first one is, in a place like a select loop, resume futures rather than cancelling them each time. You’d typically achieve this by pinning a future, and then polling a mutable reference to that future. For example:
let mut future = Box::pin(channel.reserve());
loop {
tokio::select! {
permit = &mut future => break permit,
_ = other_condition => continue,
}
}
Coming back to our example of MPSC sends, the one asterisk with reserve is that cancelling it makes you lose your place in line. Instead, if you pin the reserve future and poll a mutable reference to it, you don’t lose your place in line.
(Does the difference here matter? It depends, but you can now have this strategy available to you.)
The second example is to use tasks. I mentioned earlier that futures in Rust are diametrically opposed to similar notions in languages like JavaScript. Well, there’s an alternative in async Rust that’s much closer to the JavaScript idea, and that’s tasks.
A fun example is that at Oxide, we have an HTTP server called Dropshot. Previously, whenever an HTTP request came in, we’d use a future for it, and drop the future if the TCP connection was closed.
// Before: Future cancelled on TCP close
handle_request(req).await;
This was really bad because future cancellations could happen due to the behavior of not just the parent future, but of a remote process on the other side of a network connection! This is a rather extreme form of non-local reasoning.
We addressed this by spinning up a task for each HTTP request, and by running the code to completion even if the connection is closed:
// After: Task runs to completion
tokio::spawn(handle_request(req));
The last thing I want to say is that this sucks!
The promise of Rust is that you don’t need to do this kind of non-local reasoning—that you can understand important behavior by looking at code directly around the behavior, then use the type system to scale that up to global correctness. Almost everything in Rust, from & and &mut to unsafe, is geared towards making that possible. However, future cancellations fly directly in the face of that, and I think they’re probably the least Rusty part of Rust. This is all really unfortunate.
Can we come up with something more systematic than this kind of ad-hoc reasoning?
There doesn’t exist anything in safe Rust today, but there are a few different ideas people have come up with. I wanted to give a nod to those ideas:
All of these options have really significant implementation challenges, though. This blog post from boats covers some of these solutions, and the implementation challenges with them.
In this post, we: looked at what cancellation means in synchronous and async Rust, and why dropping a future is async Rust’s universal cancellation protocol; defined cancel safety as a local property of a future and cancel correctness as a global property of a system; surveyed code patterns that silently cancel futures, such as forgotten .awaits, try_join, and select; and discussed patterns that reduce the likelihood of cancellation bugs.
Some of the recommendations are: prefer cancel-safe APIs like reserve and write_all_buf over their cancel-unsafe counterparts; avoid holding broken invariants across await points, especially under a Tokio mutex; resume pinned futures in select loops rather than recreating them each iteration; and spawn tasks for work that must run to completion.
There’s a very deep well of complexity here, a lot more than I can cover in one blog post.
If you’re curious about any of these, check out this link where I’ve put together a collection of documents and blog posts about these concepts. In particular, I’d recommend reading these two Oxide RFDs:
Thank you for reading this post to the end! And thanks to many of my coworkers at Oxide for reviewing the talk and the RFDs linked above, and for suggestions and constructive feedback.