> Built for laptops with soldered memory and no upgrade path. If you have an RTX card sitting there with 8GB of VRAM and you're getting swapped to SSD, this puts that VRAM to work.

Well, that does at least answer my immediate question about why I would ever swap from expensive RAM to really expensive RAM:) Feels niche, but when you want it it's a good idea.

> Built for laptops with soldered memory and no upgrade path. If you have an RTX card sitting there with 8GB of VRAM and you're getting swapped to SSD, this puts that VRAM to work.

Well, that does at least answer my immediate question about why I would ever swap from expensive RAM to really expensive RAM:) Feels niche, but when you want it it's a good idea.

Nice idea, but something has gone very wrong here:

>Sequential throughput: ~1.3 GB/s

[on a RTX 3070 Laptop]

This RTX 3070 chip is on PCIe 4.0 x16 which should give 64GB/s. The 8GB of GDDR6 is 448GB/s.

Swapping to an NVMe drive would be twice as fast, but with higher latency.

Given my dev machine has 32GB of RAM and 32GB of VRAM that sits mostly idle when I'm not running AI models, this is not that bad of an idea.

What about backpressure, how does it handle requirements for VRAM allocation when VRAM is used for swap space?

With X11 it's not that bad (buffers are pre-allocated), but with Wayland allocations are a lot more dynamic, so running low on VRAM can easily crash the whole desktop. I just had a few of such crashes with Hyprland+llama-server+KVM switching between computers without freeing VRAM.

I remember this being a thing done a while back using linux's MTD/phram drivers - https://wiki.archlinux.org/title/Swap_on_video_RAM - not sure if that's still relevant though as I don't know how it'll interact with DRM and how it handles reserving some of the vram - the suggested limit using xorg.conf is probably pretty obsolete now.

That page also has a fuse filesystem implementation on top of opencl - https://github.com/Overv/vramfs - which may be more compatible.

For windows I saw something similar to this years ago. An experimental proof of concept driver that allows the creation of a ram drive from vram for NVIDIA cards. Sequential is fast as you'd expect, random has lots of room for improvement.

>GpuRamDrive

>Create a virtual drive backed by GPU RAM.

https://github.com/prsyahmi/GpuRamDrive

Fork with AMD support:

https://github.com/brzz/GpuRamDrive/

Remember how 16GBs used to be an enterprise level database mainframe?

Well, GPUs also have stupid amounts of compute on them. I have to imagine that there is some kind of database format that's useful with GPU compute attached.

Since the data is already in VRAM, the GPU can sort, join, or otherwise manipulate data as needed.

I've implemented the same idea with OpenCL: https://github.com/theblazehen/vramblk

There is originally https://github.com/Overv/vramfs however that has the overhead of a FUSE filesystem + loop device when using as a swap device.

The performance is rather lackluster however, it's far from a miracle "now you effectively have more ram for a 90% performance drop" - it definitely feels like traditional swapping

Similar but using OpenCL APIs, so it works on AMD (for some definition of "works" since their drivers are quite buggy): https://libguestfs.org/nbdkit-vram-plugin.1.html

It seems obvious to me that this should be a built in functionality of the kernel.

The kernels job is to manage resources - and GPU ram is one such resource, and it can be used for many of the same uses as regular ram.

I'm more interested in the opposite. Nvidia linux drivers crash when you try to address more VRAM than you have. It'd be nice if they didn't.

sidenote: it is possible to use Vulkan to map GPU memory to CPU space and even map it back to CUDA: https://x.com/steeve/status/2055042304344231978?s=20

>Sequential throughput: ~1.3 GB/s

sounds VERY low, also, wouldn't random read/write speed be MUCH more relevant here?

I seriously looked at this as a way to improve the RAM situation in a QNAP 2U unit that I was having trouble sourcing RAM for. It's somewhat annoying that legit memory-over-PCIe is gated on PCIe5 and chipset support.

In the end I just had to bite the bullet and take a gamble on finding ECC DDR4 RAM that would work with the ancient AMD chipset...

This particular implementation seems to be running over too many layers to be particularly performant. Why not a custom block driver instead?

Does anyone these days really use swap for anything than S4 suspend ?

I have long wished it was possible to do this. What a great bit of code. Thanks.

Finally a use for the expensive ram when it's not needed in workloads!

Now if it could be dynamically used and vacated on other GPU workloads?

You want to waste VRAM, in this economy?

No software benchmarks? BAR for RAM is cool but I want to see how much it _actually_ beats pcie nvme

Nvme ssd weights much less than GPU, and it matters for a laptop.

use your car for an anchor on a big boat!

RAM disks have always fascinated me. In a different timeline every PC has a 100gb of RAM and 50TB HDDs are the norm.

And now, I want HDMI as direct connection hight speed network socket.

I mean, you prompted something useful out of an AI, good job. But then use that to ask for donation? Feels weird, man.

This is why I read HN.

I mean, cool, but I’d rather not?

Another possible reason that occurred to me: what if you have VRAM but you're not using it all the time? For example, let's say you bought a GPU because you like to play video games. When you're not actively gaming, you probably don't need 16 GB of VRAM just to render the desktop. Might as well use it for something else, right?

Edit: Although, this is predicated on the system being able to release VRAM that is acting as swap when it's time to start a game. Can it do that?

What about backpressure, how does it handle requirements for VRAM allocation when VRAM is used for swap space?

That page also has a fuse filesystem implementation on top of opencl - https://github.com/Overv/vramfs - which may be more compatible.

Edit: Although, this is predicated on the system being able to release VRAM that is acting as swap when it's time to start a game. Can it do that?

I am catching up on comments

The reason I wrote this is I run this laptop in hybrid (AMD display + NVIDIA as swap). So all at VRAM was going to waste.

On your question re: switchable swap. It's on my to-do list ;)

It's easy enough to 'offline' swap space on Linux normally so I suspect that would work fine, as long as you didn't instantly run out of RAM when doing so.

Nice idea, but something has gone very wrong here:

>Sequential throughput: ~1.3 GB/s

[on a RTX 3070 Laptop]

This RTX 3070 chip is on PCIe 4.0 x16 which should give 64GB/s. The 8GB of GDDR6 is 448GB/s.

Swapping to an NVMe drive would be twice as fast, but with higher latency.

Gen 4.0 x16 is 32 GB/s in each direction, but the way this is implemented is not the way you'd go about this if you wanted high performance.

Edit: Their benchmarks are also run using ZRAM, which compresses pages before writing to swap. Not sure what the performance overhead of that is, but it's probably quite a bit.

First of all, it's a userspace program hooking the nbd driver, which is known for being slow. It also uses a bounce buffer in userspace before transferring to the GPU. So when the kernel needs to swap a page, it has to first copy it into a userspace facing buffer. The userspace program that has to wake back up and issue the cuda operation to copy the page into device memory.

nbd also doesn't really do a good job of supporting high queue depth or merging adjacent accesses. So if the kernel is issuing a bunch of 4K page swaps without any coalescing, you're going to end up with at least million kernel/userspace context switches per second just to handle 4 GB/s (4 GB / 4K page), let alone 64 GB/s. And that's just the NBD portion, forget the mess that is the NVIDIA driver. PCIe can move a lot of data, but in order to get anything even resembling the full bandwidth, you have to have use DMA engines with long page lists. Having to set up a transfer for every 4K page over PCIe will not reach full saturation of the bus.

Swapping to NVMe is a very optimized path -> the swapper can submit lists of pages directly to the NVMe driver and the controller can DMA them directly out of RAM, no copies or context switches CPU side at all.

This could probably be improved by migrating to the ublk driver as it might let you avoid the userspace bounce buffer. It'd also be able to have multiple write queues to at least set up CUDA copies in parallel.

Swapping to a NVMe will also consume PE cycles on your NAND, ie wearing it out over time.

RAM/VRAM don’t degrade from use.

Given my dev machine has 32GB of RAM and 32GB of VRAM that sits mostly idle when I'm not running AI models, this is not that bad of an idea.

this is the pcmasterrace equivalent of being all upper body and with scrawny legs lol

>GpuRamDrive

>Create a virtual drive backed by GPU RAM.

https://github.com/prsyahmi/GpuRamDrive

Fork with AMD support:

https://github.com/brzz/GpuRamDrive/

Thanks for sharing this, good read :-)

Remember how 16GBs used to be an enterprise level database mainframe?

Well, GPUs also have stupid amounts of compute on them. I have to imagine that there is some kind of database format that's useful with GPU compute attached.

Since the data is already in VRAM, the GPU can sort, join, or otherwise manipulate data as needed.

GPU-accelerated databases have a long history. I founded HeavyAI (previously MapD/OmniSci) in 2013, but there are or have been many other startups in this space, such as Voltron Data, Kinetica, Sqream, etc. And now you have major players like IBM, Starburst, and Microsoft (which just announced Fabric SQL on GPU today) working on their own GPU-accelerated systems. GPUs have a huge advantage in terms of compute, memory, and interconnect bandwidth over CPU, as long as you can keep them fed with data.

I believe within 2-3 years databases and data warehouses on GPU will be common. The widespread use of agents to query data will be a part of this, as there will be a need to run far more queries at lower latency than needed for the ETL and BI workloads of the past.

oh god please don't create more demand for GPUs

Can we somehow make them work with 1 TB PCIes so we can churn through way more data?

Similar but using OpenCL APIs, so it works on AMD (for some definition of "works" since their drivers are quite buggy): https://libguestfs.org/nbdkit-vram-plugin.1.html

>Sequential throughput: ~1.3 GB/s

sounds VERY low, also, wouldn't random read/write speed be MUCH more relevant here?

Finally a use for the expensive ram when it's not needed in workloads!

Now if it could be dynamically used and vacated on other GPU workloads?

No software benchmarks? BAR for RAM is cool but I want to see how much it _actually_ beats pcie nvme

Nvme ssd weights much less than GPU, and it matters for a laptop.

I mean, you prompted something useful out of an AI, good job. But then use that to ask for donation? Feels weird, man.

This is why I read HN.

I am catching up on comments

The reason I wrote this is I run this laptop in hybrid (AMD display + NVIDIA as swap). So all at VRAM was going to waste.

On your question re: switchable swap. It's on my to-do list ;)

Gen 4.0 x16 is 32 GB/s in each direction, but the way this is implemented is not the way you'd go about this if you wanted high performance.

Edit: Their benchmarks are also run using ZRAM, which compresses pages before writing to swap. Not sure what the performance overhead of that is, but it's probably quite a bit.

yup. it's nbd and userspace making it slow. zram on the other hand adds little.

one can get rid of zram and just reimplement some compression in shaders but I think that would be a pointless optimization.

Swapping to a NVMe will also consume PE cycles on your NAND, ie wearing it out over time.

RAM/VRAM don’t degrade from use.

flash is a consumable, yes.

but flash endurance isn't a strong argument here. you probably have O(TB) of flash, and aren't going to produce PB of swap writes any time soon. if you do a lot of swapping to a small flash device, it'll happen sooner.

I'm typing from a quite old 4GB laptop, which swaps heavily to a 250G SATA ssd. sure, it's not great, but it also costs zero. currently 9GB of swap is used, and it's not really noticeable. if I open 20 more tabs, it can introduce pauses.

google says this drive was released in 2014, and SMART says POH is about 10 years.

SMART also says wear leveling count is 665 and total written is 165327189538 LBAs (78834 GiB, or 338 drive-writes). I'm not expecting it to die soon, though using a 4G laptop is a bit of a stunt these days...

the point is that a system that has sustained heavy swapping for years has not generates so many writes to worry much. a modern system with 10x speed and 10x capacity (and probably less RAM deficit) would have even less effect. even for QDR with it's few-hundred cycle endurance spec...

This was a consideration when I wrote this

this is the pcmasterrace equivalent of being all upper body and with scrawny legs lol

It's fine for dense models where you need them in VRAM, less so for MoE where you're offloading layers to ram. But 32/32 is pretty good for both in the popular ~30b range right now.

Thanks for sharing this, good read :-)

Insightful take, looking into these

oh god please don't create more demand for GPUs

Can we somehow make them work with 1 TB PCIes so we can churn through way more data?

Have you heard of the "Radeon Pro SSG" ??

It must have failed because I never heard of an update to this GPU. But AMD definitely made a GPU with 4x NVMe SSDs attached to the GPU.

Possibly LSM compaction.

Thank you for pointing me to this

I'm more interested in the opposite. Nvidia linux drivers crash when you try to address more VRAM than you have. It'd be nice if they didn't.

They already do that on windows and it kinda sucks. If you are targeting something like LMStudio or ComfyUI, both of those have superior methods to do exactly this.

In the end I just had to bite the bullet and take a gamble on finding ECC DDR4 RAM that would work with the ancient AMD chipset...

This particular implementation seems to be running over too many layers to be particularly performant. Why not a custom block driver instead?

Memory on an expansion card isn't gated on PCIe 5, it's gated on CXL support. CXL and PCIe use the same electrical/physical layer but the protocol is very different.

The problem with putting (system) RAM on a PCIe card is that PCIe is not a cache-coherent interconnect. If you have a cache line that resides on your GPU sitting inside your processor's cache a remote modification to that memory by either the GPU, another CPU core or some other PCIe device with NOT invalidate the CPU cache line. You also have the fun situation that if it's modified on both ends simultaneously the resulting state will be non-deterministic.

Device drivers have to be very careful about synchronization when accessing memory-like areas on PCIe. CXL adds a cache coherency protocol among other things, so that invalidations and snoops can be exchanged over the interconnect.

Does anyone these days really use swap for anything than S4 suspend ?

https://news.ycombinator.com/item?id=40697318

This HN comment and the linked post brought up a lot of good points. The main takeaway is that swap should primarily be considered a mechanism for equality of reclamation, not for emergency extra memory, where equality of reclamation means file-backed pages and anonymous pages are subject to similar criteria for being evicted from physical memory.

I used to have zero swap on my Linux desktop and this convinced me to add at least a small swap partition.

It's useful on lower RAM systems as the least frequently used memory can be moved to swap, freeing up more RAM for stuff that needs it. Even when using zram it works out pretty well on my laptop with 8GB of RAM, it'll often have 4GB+ in zram swap space compressed down to only 1GB or so of physical RAM usage.

It really depends on what you run and how much RAM you have to do it in. I run some machines into swap just by running a couple browsers and some containers in the background on a 16GB laptop. I've also run a single light browser and essentially nothing else on 4GB and been fine:)

You want to waste VRAM, in this economy?

It's 1GiB. What could it cost, $10?

use your car for an anchor on a big boat!

There's probably somebody in Monaco who does that

I mean, if you aren’t using the car while using the boat and it won’t really damage the car… yes?

RAM disks have always fascinated me. In a different timeline every PC has a 100gb of RAM and 50TB HDDs are the norm.

Back when HDDs were all there was ramdisks were interesting, but SSDs pretty much killed most of that as they have massively increased IOPS over disks.

Hard drives that huge scare me as it would take days to backup all the data off them.

I mean, cool, but I’d rather not?

So don't. Not everything is for you.

Wouldn't it be faster to swap to vram if you are sitting there with 8gigs of it unused than swapping to ssd and burning its write cycles, assuming you absolutely need swap

The HN equivalent of “1 star - I’ve never eaten at this restaurant” type of reviews.

They already do that on windows and it kinda sucks. If you are targeting something like LMStudio or ComfyUI, both of those have superior methods to do exactly this.

There's probably somebody in Monaco who does that

The HN equivalent of “1 star - I’ve never eaten at this restaurant” type of reviews.

Wouldn't it be faster to swap to vram if you are sitting there with 8gigs of it unused than swapping to ssd and burning its write cycles, assuming you absolutely need swap

flash is a consumable, yes.

google says this drive was released in 2014, and SMART says POH is about 10 years.

This was a consideration when I wrote this

yup. it's nbd and userspace making it slow. zram on the other hand adds little.

one can get rid of zram and just reimplement some compression in shaders but I think that would be a pointless optimization.

It's fine for dense models where you need them in VRAM, less so for MoE where you're offloading layers to ram. But 32/32 is pretty good for both in the popular ~30b range right now.

Insightful take, looking into these

Have you heard of the "Radeon Pro SSG" ??

It must have failed because I never heard of an update to this GPU. But AMD definitely made a GPU with 4x NVMe SSDs attached to the GPU.

You are able to use GPU Direct Storage to communicate between the GPU and PCIE storage devices. It's nice, but it's not typically as performant as one would like, in comparison to the onboard memory.

https://docs.nvidia.com/gpudirect-storage/

https://github.com/microsoft/DirectStorage/tree/main

Memory on an expansion card isn't gated on PCIe 5, it's gated on CXL support. CXL and PCIe use the same electrical/physical layer but the protocol is very different.

It’s deterministic. But as the user you don’t know enough to know what was determined.

https://news.ycombinator.com/item?id=40697318

I used to have zero swap on my Linux desktop and this convinced me to add at least a small swap partition.

My point is not to say that swap should not be configured on a Linux system. On bare-metal machines, I personally always set a swap partition equal in size to the amount of RAM because I usually want to be able to put the machine into S4 (suspend to disk).

I don't consider swap to be emergency RAM storage. I know that the kernel will decide by itself to use swap even if it has plenty of available RAM and the swappiness threshold is not reached.

Nevertheless, my two decent laptops (one with 16 GB RAM, the other with 64 GB RAM) never swap, even with Docker Swarm and multiple stacks, multiple VMs, desktop activities, and gaming.

It's been a while since I last saw a physical machine actively swapping.

I understand that some limited hardware may need swap, but I can't see such hardware having a GPU with plenty of VRAM.

That said, hacking things is always fun :)

I just set swappiness to zero years ago and never looked back.

It's 1GiB. What could it cost, $10?

can't stand the thought of people not understanding the above reply[0]

[0]: https://knowyourmeme.com/memes/its-one-banana-michael-what-c...

I'm very surprised, I run a docker swarm with multiple docker stacks on my 16GB RAM laptop which is also my main machine. i have multiple browsers each with multiple tabs. I also run multiple VM (Qemu/KVM) and even by gaming on top of all of that, I can not make it swap.

Edit: Typo

I mean, if you aren’t using the car while using the boat and it won’t really damage the car… yes?

And when you use your car, then what?

So don't. Not everything is for you.

Didn't you hear? The author of this daemon is going around and forcibly installing it on anyone's computer that has soldered memory and an Nvidia GPU. I heard even he brings a Ludovico-technique chair with him and straps you in and pins your eyelids open like A Clockwork Orange so you have to watch.

Back when HDDs were all there was ramdisks were interesting, but SSDs pretty much killed most of that as they have massively increased IOPS over disks.

Hard drives that huge scare me as it would take days to backup all the data off them.

In my fantasy RAM was the predominate technology over flash.

It's easy enough to 'offline' swap space on Linux normally so I suspect that would work fine, as long as you didn't instantly run out of RAM when doing so.

Possibly LSM compaction.

Thank you for pointing me to this

You are able to use GPU Direct Storage to communicate between the GPU and PCIE storage devices. It's nice, but it's not typically as performant as one would like, in comparison to the onboard memory.

https://docs.nvidia.com/gpudirect-storage/

https://github.com/microsoft/DirectStorage/tree/main

It’s deterministic. But as the user you don’t know enough to know what was determined.

Edit: Typo

can't stand the thought of people not understanding the above reply[0]

[0]: https://knowyourmeme.com/memes/its-one-banana-michael-what-c...

And when you use your car, then what?

In my fantasy RAM was the predominate technology over flash.

I've implemented the same idea with OpenCL: https://github.com/theblazehen/vramblk

There is originally https://github.com/Overv/vramfs however that has the overhead of a FUSE filesystem + loop device when using as a swap device.

The performance is rather lackluster however, it's far from a miracle "now you effectively have more ram for a 90% performance drop" - it definitely feels like traditional swapping

If you have enough swap on disk available, it should be fine.

I just set swappiness to zero years ago and never looked back.

That’s like the complete opposite advice. Chris said the lowest recommended swappiness is 1. I have it set to 100.

I don't consider swap to be emergency RAM storage. I know that the kernel will decide by itself to use swap even if it has plenty of available RAM and the swappiness threshold is not reached.

Nevertheless, my two decent laptops (one with 16 GB RAM, the other with 64 GB RAM) never swap, even with Docker Swarm and multiple stacks, multiple VMs, desktop activities, and gaming.

It's been a while since I last saw a physical machine actively swapping.

I understand that some limited hardware may need swap, but I can't see such hardware having a GPU with plenty of VRAM.

That said, hacking things is always fun :)

If you have enough swap on disk available, it should be fine.

sidenote: it is possible to use Vulkan to map GPU memory to CPU space and even map it back to CUDA: https://x.com/steeve/status/2055042304344231978?s=20

It seems obvious to me that this should be a built in functionality of the kernel.

The kernels job is to manage resources - and GPU ram is one such resource, and it can be used for many of the same uses as regular ram.

I have long wished it was possible to do this. What a great bit of code. Thanks.

That’s like the complete opposite advice. Chris said the lowest recommended swappiness is 1. I have it set to 100.

And now, I want HDMI as direct connection hight speed network socket.

HDMI is largely unidirectional.

nbd-vram

Use your NVIDIA GPU's VRAM as swap space on Linux.

Built for hybrid graphics laptops with soldered memory and no upgrade path. The display runs off the integrated AMD/ATI GPU. The NVIDIA card sits idle most of the time, its VRAM completely unused. This puts that VRAM to work as high-priority swap.

Tested on: AMD/ATI + RTX 3070 Laptop (GA104M, 16 GB RAM, 8 GB VRAM), driver 580.159.03, kernel 6.17, Pop!_OS. Allocated 7 GB for swap. End result including zram and SSD swap: ~46 GB total addressable memory, tripled from stock. Overflow order: RAM fills, then VRAM absorbs the spill (PCIe), then zram compresses the rest (CPU), then SSD only if everything else is exhausted.

demo

How it works

A small daemon allocates VRAM via the CUDA driver API, then serves it as a block device using the NBD (Network Block Device) protocol over a Unix socket. The kernel's built-in nbd driver connects to it and exposes /dev/nbdX. From there it's a normal swap device.

Data path: kernel swap subsystem - /dev/nbdX - nbd kernel driver - Unix socket - nbd-vram daemon - cuMemcpyHtoD/DtoH - GPU VRAM.

No kernel module to write or maintain. No NVIDIA kernel symbols. Survives kernel and driver updates without rebuilding anything.

Why not the NVIDIA P2P API?

The "obvious" approach is nvidia_p2p_get_pages_persistent, which pins VRAM pages in BAR1 so the CPU can access them directly via ioremap_wc. Every existing project that tried this route hits the same wall: the NVIDIA driver returns EINVAL on consumer GeForce GPUs. Both the persistent and non-persistent variants, both flag values. It's gated at the RM level for Quadro/datacenter SKUs only, regardless of driver version.

The other approach - directly ioremap_wc the BAR1 physical address without going through the P2P API - also doesn't work. The GPU's internal page tables only have ~16 MiB of BAR1 mapped (just the display framebuffer). Reads from the rest return zeros. mkswap appears to succeed, then swapon fails because the swap header isn't actually there.

The NBD approach sidesteps all of this. cuMemcpyHtoD and cuMemcpyDtoH work on any CUDA GPU without any special permissions.

Requirements

NVIDIA GPU with CUDA support (any consumer RTX/GTX card)
NVIDIA driver with libcuda.so.1 (no CUDA toolkit needed)
Linux kernel 3.0+ (nbd module, built into most distros)
nbd-client package
gcc, make

Install

git clone https://github.com/c0dejedi/nbd-vram
cd nbd-vram
sudo ./install.sh
sudo systemctl start vram-swap-nbd

Verify:

swapon --show
# NAME       TYPE      SIZE USED PRIO
# /dev/nbd0  partition   7G   0B 1500

The service is enabled on install, so it comes up automatically on every boot.

Configuration

Edit /etc/systemd/system/vram-swap-nbd.service:

Environment=VRAM_SETUP_SIZE_MB=7168    # how much VRAM to use
Environment=VRAM_SWAP_PRIORITY=1500   # swap priority (higher = used first)

The daemon tries the requested size first and backs off in 512 MiB steps if the GPU is short on memory - so it will grab as much as it can even if the display compositor is already loaded. VRAM_SETUP_SIZE_MB is the ceiling, not a hard requirement.

After changing, run sudo systemctl daemon-reload && sudo systemctl restart vram-swap-nbd.

Power management

The installer asks whether to enable power-aware management on first install. If enabled, the service automatically stops when you unplug from AC (or when battery drops below a threshold), and restarts when power is restored. Manual systemctl stop is always respected and won't be overridden.

To change settings after install, edit /etc/nbd-vram.conf. Changes take effect on the next poll (within 60 seconds) or immediately on the next AC plug/unplug event.

Smoke test (without installing)

sudo bash test-nbd.sh

Allocates VRAM, connects the NBD device, does a 1 MiB write/readback check, activates swap, then prints teardown instructions. install.sh handles teardown automatically if a test instance is running.

To stress the full partition after the smoke test passes:

sudo bash test-fill.sh

Writes the entire VRAM partition with zeros, verifies a sample read back, then auto-restores swap on exit.

Performance

Tested on RTX 3070 Laptop (8 GB VRAM), kernel 6.17, Pop!_OS. Compared against NVMe cryptswap (dm-crypt, PCIe 4.0). All benchmarks run with O_DIRECT to bypass page cache.

Three benchmarks are in benchmarks/. Each runs NVMe first, then starts the VRAM service and runs the same test against the block device. State is restored on exit.

sudo bash benchmarks/bench-throughput.sh   # sequential read/write (dd, 2 GiB, O_DIRECT)
sudo bash benchmarks/bench-iops.sh         # 4K random IOPS (fio, libaio, iodepth=32)
sudo bash benchmarks/bench-latency.sh      # per-operation latency (ioping, 20 requests)

fio and ioping are installed automatically if missing.

Sequential throughput (dd, 2 GiB)

bench-throughput

Device	Write	Read
NVMe	2.7 GB/s	2.9 GB/s
VRAM (nbd)	1.1 GB/s	2.3 GB/s

VRAM is slower for large sequential transfers. The bottleneck is the NBD + CUDA userspace round-trip - every block crosses a Unix socket and a cuMemcpy call, which adds overhead that NVMe's direct kernel block path doesn't pay. Sequential throughput is not the primary swap workload (the kernel swaps individual 4K pages, not 4 MiB streams) - see the IOPS and latency benchmarks below.

4K random IOPS (fio, libaio, iodepth=32)

bench-iops

Device	Read IOPS	Write IOPS	Avg latency
NVMe	45.4k	45.3k	343 us
VRAM (nbd)	28.7k	28.7k	550 us

NVMe wins for sustained random I/O. At iodepth=32, NVMe can have 32 requests genuinely in flight simultaneously; the NBD+CUDA path serialises them through the daemon, so the depth advantage is reduced. The VRAM daemon also adds CPU overhead that the NVMe path does not pay. For continuous high-throughput swap pressure, NVMe is faster.

The picture changes for sporadic access - see the latency benchmark below.

Per-operation latency (ioping, 4K reads, 1 request/sec)

bench-latency

Device	Min	Avg	Max
NVMe	120 us	9.05 ms	10.1 ms
VRAM (nbd)	134 us	335 us	490 us

VRAM is 27x faster average latency. The NVMe drive is physically capable of ~112 us (visible on the warmup request) but APST (Autonomous Power State Transitions) puts it to sleep between requests. At 1 request per second - the rate of sporadic swap access - it wakes cold almost every time and pays a ~9 ms penalty. VRAM has no power states and responds in 133-490 us consistently.

This is the scenario that matters most in practice. Memory pressure on a laptop is rarely a sustained GB/s flood - it is individual 4K page faults arriving seconds apart. Every one of those faults stalls waiting for the swap device to respond. At 9 ms per fault, NVMe swap is felt. At 335 us, VRAM swap is not.

Uninstall

sudo bash uninstall.sh

License

MIT - Sean Lobjoit (c0dejedi)

HDMI is largely unidirectional.

But it's 48 gbp/s unidirectional.

Hacker Times

Hacker Times

Use your Nvidia GPU's VRAM as swap space on Linux

Discussion

Discussion

nbd-vram

How it works

Why not the NVIDIA P2P API?

Requirements

Install

Configuration

Power management

Smoke test (without installing)

Performance

Sequential throughput (dd, 2 GiB)

4K random IOPS (fio, libaio, iodepth=32)

Per-operation latency (ioping, 4K reads, 1 request/sec)

Uninstall

License