I hope they can minimize the bookkeeping costs, because I don't see it gaining traction in AI if it hurts the performance of big kernels.
The comparison to NVIDIA's stdexec is worth looking at. stdexec uses a sender/receiver model which is more explicit about the execution context. Rust's Future trait abstracts over that, which is ergonomic but means you're relying on the executor to do the right thing with GPU-specific scheduling constraints.
Practically, the biggest win here is probably for the cases shayonj mentioned: mixed compute/memory pipelines where you want one warp loading while another computes. That's exactly where the warp specialization boilerplate becomes painful. If async/await can express that cleanly without runtime overhead, that is a real improvement.
Training pipelines are full of data preparation that is first written on the CPU and then moved to the GPU, and you are always thinking about what to keep on the CPU and what to put on the GPU, when it is worth creating a tensor, or whether tiling would be better instead. I guess your company is betting on solving problems like this (and async/await is needed for serving inference requests directly on the GPU, for example).
My question is a little different: how do you want to handle the SIMD question? Should a Rust function run on the warp as a machine with length-32 arrays as data types, or should we always "hope" for autovectorization to work (especially with Rust's iterator helpers)?
I am, bluntly, sick of Async taking over Rust ecosystems. Embedded and web/HTTP have already fallen. I'm optimistic this won't take hold on the GPU; we'll see. Async splits the ecosystem. I see it as the biggest threat to Rust staying a useful tool.
I use rust on the GPU for the following: 3d graphics via WGPU, cuFFT via FFI, custom kernels via Cudarc, and ML via Burn and Candle. Thankfully these are all Async-free.
Is the goal with this project (generally, not specifically async) to have an equivalent to e.g. CUDA, but in Rust? Or is there another intended use-case that I'm missing?
Here with the async/await approach, it seems like there needs to be manual bookkeeping at runtime to know what has finished and what has not, and _then_ decide which warp to put this new computation in. Do you anticipate that there will be a measurable performance difference?
GPU-wide memory is not quite as scarce on datacenter cards or systems with unified memory. One could also have local executors with local futures that are `!Send` and place them in a faster address space.
The anticipated benefits are similar to the benefits of async/await on CPU: better ergonomics for the developer writing concurrent code, better utilization of shared/limited resources, fewer concurrency bugs.
This one's especially clear because you reference "the cases shayonj mentioned", but shayonj's comment[1] doesn't mention any use cases; it does, however, make a comparison to "NVIDIA's stdexec", which seems like it might have gotten mixed into what your model was trying to say in the preceding paragraph?
This is really annoying. Please stop.
Re: heterogeneous workloads: I'm told by a friend in HPC that the old advice about avoiding divergent branches within warps is no longer much of an issue – is that true?
> Async splits the ecosystem. I see it as the biggest threat to Rust staying a useful tool.
Someone somewhere convinced you there is an async coloring problem. That person was wrong: async is an inherent property of some operations. Adding it as a type-level construct gives visibility to those inherent behaviors, and with that, more freedom in how you compose them.
GPUs are still not practically-Turing-complete in the sense that there are strict restrictions on loops/goto/IO/waiting (there are a bunch of band-aids to make it pretend it's not a functional programming model).
So I am not sure retrofitting a Ferrari to cosplay as an Amazon delivery van is useful for anything other than a tech showcase?
Good tech showcase though :)
[1] Nearly all.
flip the colouring problem on its head
In years prior I wouldn't have even bothered, but it's 2026 and AMD's drivers actually come with a recent version of torch that 'just works' on windows. Anything is possible :)
I was just annoyed enough by spending a couple of minutes trying to decode what had the semblance of something interesting that I felt compelled to write my response :)
There are a ton of interesting top-level comments and questions posted in this thread. It's such a waste this one is at the top.
I understand that with newer GPUs you have clever partitioning/pipelining such that block A takes branch A while block B takes branch B, with syncs/barriers essentially relying on some smart 'oracle' to schedule these in a way that still fits the SIMT model.
It still doesn't feel Turing complete to me. Is there an nvidia doc you can refer me to?
> In SIMT, all threads in the warp are executing the same kernel code, but each thread may follow different branches through the code. That is, though all threads of the program execute the same code, threads do not need to follow the same execution path.
This doesn't say anything about dependencies of multiple warps.
I am just saying it's not as flexible/cost-free as it would be on a 'normal' von Neumann-style CPU.
I would love to see Rust-based code that obviates the need to write CUDA kernels (including compiling to different architectures). It feels icky to use/introduce things like async/await in the context of a GPU programming model which is very different from a traditional Rust programming model.
You still have to worry about different architectures and the streaming nature at the end of the day.
I am very interested in this topic, so I am curious to learn how the latest GPUs help manage this divergence problem.
February 17, 2026 · 15 min read
GPU code can now use Rust's async/await. We share the reasons why and what this unlocks for GPU programming.
At VectorWare, we are building the first GPU-native software company. Today, we are excited to announce that we can successfully use Rust's Future trait and async/await on the GPU. This milestone marks a significant step towards our vision of enabling developers to write complex, high-performance applications that leverage the full power of GPU hardware using familiar Rust abstractions.
GPU programming traditionally focuses on data parallelism. A developer writes a single operation and the GPU runs that operation in parallel across different parts of the data.
fn conceptual_gpu_kernel(data) {
    // All threads in all warps do the same thing to different parts of data
    data[thread_id] = data[thread_id] * 2;
}
This model works well for standalone and uniform tasks such as graphics rendering, matrix multiplication, and image processing.
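For concreteness, here is a minimal sketch of that conceptual kernel as a real Rust nvptx kernel. The crate attributes mirror the examples later in this post, and the index intrinsics from `core::arch::nvptx` as well as the kernel name and signature are assumptions made for illustration:

// A sketch of the conceptual kernel above as an nvptx kernel. Each thread
// computes its own global index and doubles one element of `data`.
#![no_std]
#![feature(abi_ptx)]
#![feature(stdarch_nvptx)]

use core::arch::nvptx::{_block_dim_x, _block_idx_x, _thread_idx_x};

#[unsafe(no_mangle)]
pub unsafe extern "ptx-kernel" fn double_in_place(data: *mut f32, len: u32) {
    // Global thread index from block and thread coordinates.
    let i = unsafe {
        (_block_idx_x() as u32) * (_block_dim_x() as u32) + (_thread_idx_x() as u32)
    };
    if i < len {
        // Every thread performs the same operation on a different element.
        unsafe { *data.add(i as usize) *= 2.0 };
    }
}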
As GPU programs grow more sophisticated, developers use warp specialization to introduce more complex control flow and dynamic behavior. With warp specialization, different parts of the GPU run different parts of the program concurrently.
fn conceptual_gpu_kernel(data) {
    let communication = ...;
    if warp == 0 {
        // Have warp 0 load data from main memory
        load(data, communication);
    } else if warp == 1 {
        // Have warp 1 compute A on loaded data and forward it to B
        compute_A(communication);
    } else {
        // Have warps 2 and 3 compute B on loaded data and store it
        compute_B(communication, data);
    }
}
Warp specialization shifts GPU logic from uniform data parallelism to explicit task-based parallelism. This enables more sophisticated programs that make better use of the hardware. For example, one warp can load data from memory while another performs computations to improve utilization of both compute and memory.
This added expressiveness comes at a cost. Developers must manually manage concurrency and synchronization because there is no language or runtime support for doing so. Similar to threading and synchronization on the CPU, this is error-prone and difficult to reason about.
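To make that cost concrete, the following is a deliberately simplified sketch (not code from a real kernel) of the kind of hand-rolled handshake warp specialization implies: one warp publishes into a shared slot, another spins on an atomic flag. The slot/flag layout and the split into producer and consumer functions are assumptions for illustration; getting the orderings, flag resets, and warp assignments right is exactly the bookkeeping that is easy to get wrong.

use core::hint::spin_loop;
use core::sync::atomic::{AtomicU32, Ordering};

// Run by the "loader" warp: publish a value, then flip the ready flag.
fn producer_warp(slot: &AtomicU32, ready: &AtomicU32, value: u32) {
    slot.store(value, Ordering::Relaxed);
    // Release ordering makes the stored value visible before the flag flips.
    ready.store(1, Ordering::Release);
}

// Run by a "compute" warp: spin until the producer signals, then read.
fn consumer_warp(slot: &AtomicU32, ready: &AtomicU32) -> u32 {
    // Forgetting this wait, or resetting the flag incorrectly for the next
    // round, is a classic source of bugs in hand-written pipelines.
    while ready.load(Ordering::Acquire) == 0 {
        spin_loop();
    }
    slot.load(Ordering::Relaxed)
}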
There are many projects that aim to provide the benefits of warp specialization without the pain of manual concurrency and synchronization.
JAX models GPU programs as computation graphs that encode dependencies between operations. The JAX compiler analyzes this graph to determine ordering, parallelism, and placement before generating the program that executes. This allows JAX to manage and optimize execution while presenting a high-level programming model in a Python-based DSL. The same model supports multiple hardware backends, including CPUs and TPUs, without changing user code.
Triton expresses computation in terms of blocks that execute independently on the GPU. Like JAX, Triton uses a Python-based DSL to define how these blocks should execute. The Triton compiler lowers block definitions through a multi-level pipeline of MLIR dialects, where it applies block-level data-flow analysis to manage and optimize the generated program.
More recently, NVIDIA introduced CUDA Tile. Like Triton, CUDA Tile organizes computation around blocks. It additionally introduces "tiles" as first-class units of data. Tiles make data dependencies explicit rather than inferred, which improves both performance opportunities and reasoning about correctness. CUDA Tile ingests code written in existing languages such as Python, lowers it to an MLIR dialect called Tile IR, and executes on the GPU.
We are excited and inspired by these efforts, especially CUDA Tile. We think it is a great idea to have GPU programs structured around explicit units of work and data, separating the definition of concurrency from its execution. We believe that GPU hardware aligns naturally with structured concurrency and changing the software to match will enable safer and more performant code.
These higher-level approaches to GPU programming require developers to structure code in new and specific ways. This can make them a poor fit for some classes of applications.
Additionally, a new programming paradigm and ecosystem is a significant barrier to adoption. Developers use JAX and Triton primarily for machine learning workloads where they align well with the underlying computation. CUDA Tile is newer and more general but has yet to see broader adoption. Virtually no one writes their entire application with these technologies. Instead, they write parts of their application in these frameworks and other parts in more traditional languages and models.
Code reuse is also limited. Existing CPU libraries assume a conventional language runtime and execution model and cannot be reused directly. Existing GPU libraries rely on manual concurrency management and similarly do not compose with these frameworks.
Ideally, we want an abstraction that captures the benefits of explicit and structured concurrency without requiring a new language or ecosystem. It should compose with existing CPU code and execution models. It should provide fine-grained control when needed, similar to warp specialization. It should also provide ergonomic defaults for the common case.
Future trait and async/await
We believe Rust's Future trait and async/await provide such an abstraction. They encode structured concurrency directly in an existing language without committing to a specific execution model.
A future represents a computation that may not be complete yet. A future does not specify whether it runs on a thread, a core, a block, a tile, or a warp. It does not care about the hardware or operating system it runs on. The Future trait itself is intentionally minimal. Its core operation is poll, which returns either Ready or Pending. Everything else is layered on top. This separation is what allows the same async code to be driven in different environments. For more detailed info, see the Rust async book.
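For reference, the entire surface of the trait, as defined in `core::future` (stability and lang-item attributes omitted), is just this:

use core::pin::Pin;
use core::task::{Context, Poll};

// A single `poll` method: either the value is Ready, or the future registers
// interest via the Context's waker and returns Pending.
pub trait Future {
    type Output;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;
}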
Like JAX's computation graphs, futures are deferred and composable. Developers construct programs as values before executing them. This allows the compiler to analyze dependencies and composition ahead of execution while preserving the shape of user code.
Like Triton's blocks, futures naturally express independent units of concurrency. Depending on how futures are combined, they represent whether a block of work runs serially or in parallel. Developers express concurrency using normal Rust control flow, trait implementations, and future combinators rather than a separate DSL.
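As a small illustration (not code from this post's kernels, and with hypothetical function names), two loads that do not depend on each other can be combined with `futures_util::future::join` rather than awaited sequentially, leaving it to whichever executor drives the composed future to interleave them:

use futures_util::future;

// Stand-in for an asynchronous load.
async fn load(index: usize) -> u32 {
    index as u32
}

// Neither load depends on the other, so they are composed, not sequenced.
async fn load_pair(a: usize, b: usize) -> (u32, u32) {
    future::join(load(a), load(b)).await
}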
Like CUDA Tile's explicit tiles and data dependencies, Rust's ownership model makes data constraints explicit in the program structure. Futures capture the data they operate on and that captured state becomes part of the compiler-generated state machine. Ownership, borrowing, Pin, and bounds such as Send and Sync encode how data can be shared and transferred between concurrent units of work.
Warp specialization is not typically described this way, but in effect, it reduces to manually written task state machines. Futures compile down to state machines that the Rust compiler generates and manages automatically.
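To make that concrete, here is a hand-written future that is roughly what the compiler generates for an async block with a single suspension point. This is an illustrative sketch, not actual compiler output:

use core::future::Future;
use core::pin::Pin;
use core::task::{Context, Poll};

// The state machine is just an enum plus a `poll` that advances it.
enum TwoStep {
    // Not yet suspended: the captured value is held until the first poll.
    Start(i32),
    // Suspended once; the next poll produces the result.
    Resumed(i32),
    Done,
}

impl Future for TwoStep {
    type Output = i32;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<i32> {
        match *self {
            TwoStep::Start(x) => {
                *self = TwoStep::Resumed(x);
                // Request another poll, then suspend at the "await point".
                cx.waker().wake_by_ref();
                Poll::Pending
            }
            TwoStep::Resumed(x) => {
                *self = TwoStep::Done;
                Poll::Ready(x * 2)
            }
            TwoStep::Done => panic!("polled after completion"),
        }
    }
}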
Because Rust's futures are just compiler-generated state machines there is no reason they cannot run on the GPU. That is exactly what we have done.
async/await running on the GPU
Running async/await on the GPU is difficult to demonstrate visually because the code looks and runs like ordinary Rust. By design, the same syntax used on the CPU runs unchanged on the GPU.
Here we define a small set of async functions and invoke them from a single GPU kernel using block_on. Together, they exercise the core features of Rust's async model: simple futures, chained futures, conditionals, multi-step workflows, async blocks, and third-party combinators.
// Simple async functions that we will call from the GPU kernel below.
async fn async_double(x: i32) -> i32 {
    x * 2
}

async fn async_add_then_double(a: i32, b: i32) -> i32 {
    let sum = a + b;
    async_double(sum).await
}

async fn async_conditional(x: i32, do_double: bool) -> i32 {
    if do_double {
        async_double(x).await
    } else {
        x
    }
}

async fn async_multi_step(x: i32) -> i32 {
    let step1 = async_double(x).await;
    let step2 = async_double(step1).await;
    step2
}
#[unsafe(no_mangle)]
pub unsafe extern "ptx-kernel" fn demo_async(
    val: i32,
    flag: u8,
) {
    // Basic async functions with a single await execute correctly on the device.
    let doubled = block_on(async_double(val));

    // Chaining multiple async calls works as expected.
    let chained = block_on(async_add_then_double(val, doubled));

    // Conditionals inside async code are supported.
    let conditional = block_on(async_conditional(val, flag != 0));

    // Async functions with multiple await points also work.
    let multi_step = block_on(async_multi_step(val));

    // Async blocks work and compose naturally.
    let from_block = block_on(async {
        let doubled_a = async_double(val).await;
        let doubled_b = async_double(chained).await;
        doubled_a.wrapping_add(doubled_b)
    });

    // CPU-based async utilities also work. Here we use combinators from the
    // `futures_util` crate to build and compose futures without writing new
    // async functions.
    use futures_util::future::ready;
    use futures_util::FutureExt;
    let from_combinator = block_on(
        ready(val).then(move |v| ready(v.wrapping_mul(2).wrapping_add(100)))
    );
}
Getting this all working required fixing bugs and closing gaps across multiple compiler backends. We also encountered issues in NVIDIA's ptxas tool, which we reported and worked around.
Using async/await makes it ergonomic to express concurrency on the GPU. However, in Rust, futures do not execute themselves; they must be driven to completion by an executor. Rust deliberately does not include a built-in executor; instead, third parties provide executors with different features and tradeoffs.
Our initial goal was to prove that Rust's async model could run on the GPU at all. To do that, we started with a simple block_on as our executor. block_on takes a single future and drives it to completion by repeatedly polling it on the current thread. While simple and blocking, it was sufficient to demonstrate that futures and async/await could compile to correct GPU code. While the block_on executor may seem limiting, because futures are lazy and composable we were still able to express complex concurrent workloads via combinators and async functions.
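For illustration, a minimal block_on along these lines fits in a few lines: a no-op waker plus a poll loop on the current thread. This sketch is not necessarily the executor used here.

use core::future::Future;
use core::pin::pin;
use core::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

fn block_on<F: Future>(fut: F) -> F::Output {
    // A waker that does nothing: we poll in a loop regardless.
    fn raw_waker() -> RawWaker {
        fn no_op(_: *const ()) {}
        fn clone(_: *const ()) -> RawWaker {
            raw_waker()
        }
        static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, no_op, no_op, no_op);
        RawWaker::new(core::ptr::null(), &VTABLE)
    }

    // Safety: the vtable functions trivially uphold the RawWaker contract
    // because they do nothing.
    let waker = unsafe { Waker::from_raw(raw_waker()) };
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            // No interrupts: just poll again until progress is possible.
            Poll::Pending => core::hint::spin_loop(),
        }
    }
}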
Once we had futures working end to end, we moved to a more capable executor. The Embassy executor is designed for embedded systems and operates in Rust's #![no_std] environment. This makes it a natural fit for GPUs, which lack a traditional operating system and thus do not support Rust's standard library. Adapting it to run on the GPU required very few changes. This ability to reuse existing open source libraries is much better than what exists in other (non-Rust) GPU ecosystems.
Here we construct three independent async tasks that loop indefinitely and increment counters in shared state to demonstrate scheduling. The tasks themselves do not perform useful computation. Each task awaits a simple future that performs work in small increments and yields periodically. This allows the executor to interleave progress between tasks.
#![no_std]
#![feature(abi_ptx)]
#![feature(stdarch_nvptx)]

use core::future::Future;
use core::pin::Pin;
use core::sync::atomic::{AtomicU32, Ordering};
use core::task::{Context, Poll};

use embassy_executor::Executor;
use ptx_embassy_shared::SharedState;

pub struct InfiniteWorkFuture {
    pub shared: &'static SharedState,
    pub iteration_counter: &'static AtomicU32,
}
impl Future for InfiniteWorkFuture {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        // Check if host requested stop
        if self.shared.stop_flag.load(Ordering::Relaxed) != 0 {
            unsafe { core::arch::nvptx::trap() };
        }

        // Track iterations and activity for demonstration purposes
        self.iteration_counter.fetch_add(1, Ordering::Relaxed);
        self.shared.last_activity.fetch_add(1, Ordering::Relaxed);

        // Simulate work
        unsafe {
            core::arch::nvptx::_nanosleep(100);
        }

        cx.waker().wake_by_ref();
        Poll::Pending
    }
}
// Three very similar tasks, incrementing different variables
#[embassy_executor::task]
async fn task_a(shared: &'static SharedState) {
    InfiniteWorkFuture {
        iteration_counter: &shared.task_a_iterations,
        shared,
    }.await
}

#[embassy_executor::task]
async fn task_b(shared: &'static SharedState) {
    InfiniteWorkFuture {
        iteration_counter: &shared.task_b_iterations,
        shared,
    }.await
}

#[embassy_executor::task]
async fn task_c(shared: &'static SharedState) {
    InfiniteWorkFuture {
        iteration_counter: &shared.task_c_iterations,
        shared,
    }.await
}
#[unsafe(no_mangle)]
pub unsafe extern "ptx-kernel" fn run_forever(shared_state: *mut SharedState) {
    // ... executor setup and initialization ...

    // Safety: the CPU needs to ensure the buffer stays alive
    // for as long as this is running
    let shared: &'static SharedState = unsafe { &*shared_state };

    executor.run(|spawner| {
        // Spawning can fail if a task's pool is exhausted; ignore errors here.
        spawner.spawn(task_a(shared)).ok();
        spawner.spawn(task_b(shared)).ok();
        spawner.spawn(task_c(shared)).ok();
    });
}
Below is an Asciinema recording of the GPU running the async tasks via Embassy's executor. Performance is not representative as the example runs empty infinite loops and uses atomics to track activity. The important point is that multiple tasks execute concurrently on the GPU, driven by an existing, production-grade executor using Rust's regular async/await.
Taken together, we think Rust and its async model are a strong fit for the GPU. Notably, similar ideas are emerging in other language ecosystems, such as NVIDIA's stdexec work for C++. The difference is these abstractions already exist in Rust, are widely used, and are supported by a mature ecosystem of executors and libraries.
async/await on the GPU
Futures are cooperative. If a future does not yield, it can starve other work and degrade performance. This is not unique to GPUs, as cooperative multitasking on CPUs has the same failure mode.
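A common way to stay cooperative is to insert explicit yield points. The sketch below is illustrative (not code from this post): a future that suspends once per poll, immediately re-arming its waker, so an executor can interleave other tasks between slices of work.

use core::future::Future;
use core::pin::Pin;
use core::task::{Context, Poll};

struct YieldNow {
    yielded: bool,
}

impl Future for YieldNow {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            // Ask to be polled again, then hand control back to the executor.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

async fn chunked_work() {
    for _ in 0..1024 {
        // ... one slice of work ...
        YieldNow { yielded: false }.await;
    }
}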
GPUs do not provide interrupts. As a result, an executor running on the device must periodically poll futures to determine whether they can make progress. This involves spin loops or similar waiting mechanisms. APIs such as nanosleep can trade latency for efficiency, but this remains less efficient than interrupt-driven execution and reflects a limitation of current GPU architectures. We have some ideas for how to mitigate this and are experimenting with different approaches.
Driving futures and maintaining scheduling state increases register pressure. On GPUs, this can reduce occupancy and impact performance.
Finally, Rust's async model on the GPU still carries the same function coloring problem that exists on the CPU.
On the CPU, executors such as Tokio, Glommio, and Smol make different tradeoffs around scheduling, latency, and throughput. We expect a similar diversity to emerge on the GPU. We are experimenting with GPU-native executors designed specifically around GPU hardware characteristics.
A GPU-native executor could leverage mechanisms such as CUDA Graphs or CUDA Tile for efficient task scheduling or shared memory for fast communication between concurrent tasks. It could also integrate more deeply with GPU scheduling primitives than a direct port of an embedded or CPU-focused executor.
At VectorWare, we have recently enabled std on the GPU. Futures are no_std compatible, so this does not impact their core functionality. However, having the Rust standard library available on the GPU opens the door to richer runtimes and tighter integration with existing Rust async libraries.
Finally, while we believe futures and async/await map well to GPU hardware and align naturally with efforts such as CUDA Tile, they are not the only way to express concurrency. We are exploring alternative Rust-based approaches with different tradeoffs and will share more about those experiments in future posts.
We completed this work months ago. The speed at which we are able to make progress on the GPU is a testament to the power of Rust's abstractions and ecosystem.
As a company, we understand that not everyone uses Rust. Our future products will support multiple programming languages and runtimes. However, we believe Rust is uniquely well suited to building high-performance, reliable GPU-native applications and that is what we are most excited about.
Follow us on X, Bluesky, LinkedIn, or subscribe to our blog to stay updated on our progress. We will be sharing more about our work in the coming months. You can also reach us at hello@vectorware.com.