I recently went down a rabbit hole trying to understand how Linux handles page faults, what mmap actually does at the physical page level, and how userfaultfd lets userspace take over that fault handling. The motivation was a specific problem: making Virtual Machine (VM) snapshot restore fast by lazily populating guest memory. But the underlying mechanisms are general Linux concepts that I think are worth understanding on their own. This post is less about any specific Virtual Machine Monitor (VMM) and more about the Linux memory model that makes lazy restore possible, and where it breaks down.
Linux processes do not interact with physical RAM (Random Access Memory) directly. Every memory address a process uses is a virtual address, and the kernel maintains a set of page tables that translate virtual addresses to physical addresses. The hardware, specifically the Memory Management Unit (MMU), uses these page tables to translate every memory access, walking them whenever the translation is not already cached in the TLB (Translation Lookaside Buffer).
The key insight, and the thing that took me a while to fully appreciate, is that virtual memory mappings do not need to have physical pages behind them immediately. When a process calls mmap to allocate a region of memory, the kernel sets up a Virtual Memory Area (VMA) in the process’s address space. This VMA describes the region and includes its start address, length, permissions, and what backs it, whether that is anonymous memory, a file, shared memory, or something else. But the kernel does not allocate physical pages yet. The page table entries for that region simply do not exist.
After an mmap call for an anonymous, read-write, 4KB region, the process gets a virtual address and a VMA describing the mapping. But the page table entry is empty and no physical RAM is allocated yet. The physical commitment happens later, on first access.
The first time the process actually reads or writes to an address in that region, the MMU tries to translate the virtual address and finds no page table entry. This triggers a page fault, which is a CPU exception that transfers control to the kernel’s page fault handler. The kernel looks up the faulting address in the process’s VMA list, sees that it is a valid mapping (not a segfault), allocates a physical page, fills it with zeros for anonymous memory or reads the data from the backing file for file-backed mappings, installs a page table entry pointing the virtual address at the new physical page, and returns to the faulting instruction, which re-executes and succeeds.
flowchart LR
A[Process accesses\nvirtual address] --> B{Page table\nentry?}
B -->|Exists| C[Translate and continue]
B -->|Missing| D[PAGE FAULT\nkernel allocates page\nfills with data\nupdates page tables]
D --> E[Retry the access]
E --> B
This is demand paging. The process thinks it has memory, but the physical allocation is deferred until the page is actually touched. It is completely transparent to the process because the fault, the allocation, and the page table update all happen inside the kernel before the process’s instruction retries. The process never observes the fault.
This is also why a process can mmap a 1TB region and not use any physical RAM until it starts touching pages. The mapping exists in the virtual address space, but the physical commitment happens one page at a time, on demand.
Understanding what mmap does at the physical page level matters for the problem I was trying to solve, because different mmap calls produce fundamentally different relationships between virtual addresses and physical pages.
Anonymous private (MAP_PRIVATE | MAP_ANONYMOUS). The kernel eventually backs this with physical pages from the page allocator, private to the process. If the process forks, the child gets copy-on-write references to the same pages, and each side gets its own copy on first write. This is what malloc uses under the hood for large allocations.
Anonymous shared (MAP_SHARED | MAP_ANONYMOUS). Backed by pages in a kernel tmpfs instance, so multiple processes can map the same region and see each other’s writes immediately. This is commonly used for shared memory between a VMM and device backends like vhost-user.
File-backed private (MAP_PRIVATE with a file fd). Pages are populated from the file on first access, but writes go to private copy-on-write pages that are never written back to the file. The process sees the file contents initially, but any modifications it makes are its own.
File-backed shared (MAP_SHARED with a file fd). Pages come from the file’s page cache, writes are visible to other processes mapping the same file, and changes are eventually flushed back to disk.
The important thing is that the mapping type determines which physical pages back the region and how they behave. This identity, meaning the relationship between the virtual address and the specific physical pages, is what other subsystems can depend on.
A shared anonymous mapping is backed by tmpfs pages that other processes can see writes to, while a private file-backed mapping is backed by page cache pages that become copy-on-write and private to the process after the first write. Same virtual address in both cases, but the physical pages behind them behave very differently.
Here is the thing that tripped me up initially. If you mmap a file over an existing anonymous mapping at the same virtual address using MAP_FIXED, the old mapping is destroyed. The kernel tears down the old VMAs, removes the old page table entries, and creates new ones. The virtual address is the same, but the physical pages behind it are completely different.
If you start with a shared anonymous mapping where the virtual address points to tmpfs-backed physical pages A, B, and C, and then mmap a file over the same address with MAP_FIXED, the virtual address now points to completely different file-backed physical pages X, Y, and Z. Same virtual address, different physical pages. Anything that was still using pages A, B, and C is now looking at stale memory.
A Virtual Machine Monitor (VMM) like Cloud Hypervisor, Firecracker, or QEMU manages guest memory by allocating a large region of host virtual memory via mmap and registering it with KVM (Kernel-based Virtual Machine) as the guest’s physical RAM. This creates a multi-layer translation scheme.
flowchart LR
G["Guest Physical Address\n(what the guest sees)"] -->|"EPT / NPT\nhardware translation"| P["Host Physical Address\n(actual RAM)"]
H["Host Virtual Address\n(VMM's mmap region)"] -.->|"resolved via host page tables\nat EPT setup time"| P
K[KVM memory slots] -.->|wired to| P
V[VFIO DMA pinning] -.->|wired to| P
VH[vhost-user backends] -.->|wired to| P
The guest kernel sees Guest Physical Addresses starting at 0. KVM uses a second layer of hardware page tables to translate Guest Physical Addresses directly to Host Physical Addresses. Intel calls them Extended Page Tables (EPT) while AMD calls them Nested Page Tables (NPT), but both do the same thing. At runtime this translation happens entirely in hardware, which is what makes virtualization fast. The Host Virtual Address from the VMM’s mmap region is only involved at setup time, when KVM resolves it through the host page tables to find the Host Physical Address and populate the EPT/NPT entry.
The critical point is that multiple subsystems end up wired to the same host physical pages. KVM memory slots map guest physical ranges to host virtual addresses, and through the host page tables, to specific host physical pages. VFIO (Virtual Function I/O) device passthrough pins those physical pages for DMA (Direct Memory Access) so that hardware devices can read and write guest memory directly. Vhost-user device backends share the mapping so they can access guest memory from a separate process. All of these depend on the physical page identity of the guest RAM mapping.
When a VMM snapshots a running VM, it writes the entire guest memory contents to a file along with the CPU and device state. To restore, a new VMM process allocates fresh guest memory via mmap, reads the entire snapshot file into that region, reconstructs the device and CPU state, and resumes the virtual CPUs (vCPUs).
flowchart LR
subgraph SNAP["Snapshot on disk"]
direction TB
MR[memory-ranges]
CF[config.json]
SF[state.json]
end
SNAP -->|"read ALL pages +\nrebuild VM state"| FULL[Guest RAM\nfully populated]
FULL -->|"resumes only after\nall pages copied"| RUN[Guest starts running]
The guest resumes at the exact instruction where it was paused. Running processes, open files, network connections, kernel state, everything is exactly as it was. The guest has no idea anything happened.
The problem is the memory read. For a 4GB guest, the VM waits for 4GB of sequential disk I/O before it can run. This scales linearly with guest memory size and is the dominant cost in restore latency.
The natural first thought is to skip the read entirely. Just mmap(MAP_PRIVATE) the snapshot file over the guest memory region with MAP_FIXED. The kernel would page in data lazily as the guest touches it. No upfront I/O, near-instant restore, and the kernel handles everything.
The problem is what I described earlier about physical page identity. An mmap overlay replaces the existing mapping. The old physical pages are gone, and new physical pages backed by the snapshot file’s page cache take their place at the same virtual address.
flowchart TD
subgraph before["Before: mmap overlay replaces the mapping"]
direction LR
VA["Virtual address\n0x7f00..."] -->|"points to"| OLD["Physical pages A, B, C"]
KVM1[KVM] -.->|wired to| OLD
VFIO1[VFIO DMA] -.->|pinned to| OLD
VH1[vhost-user] -.->|sharing| OLD
end
subgraph after["After: mapping replaced"]
direction LR
VA2["Virtual address\n0x7f00..."] -->|"now points to"| NEW["Physical pages X, Y, Z"]
KVM2[KVM] -.->|"rebuilds via\nMMU notifiers"| NEW
VFIO2[VFIO DMA] -.->|still pinned to| OLD2["Old pages A, B, C\n(stale)"]
VH2[vhost-user] -.->|still sharing| OLD2
end
To understand why this breaks, it helps to look at what each subsystem actually holds onto.
When KVM sets up a memory slot via KVM_SET_USER_MEMORY_REGION, it takes the host virtual address range and, on first guest access, builds EPT/NPT entries that translate Guest Physical Addresses directly to the Host Physical Addresses behind those virtual addresses. KVM does register MMU notifiers with the host kernel, so when an mmap overlay tears down the old mapping, the kernel notifies KVM and the stale EPT/NPT entries get invalidated. On the next guest access, KVM would rebuild them pointing to the new physical pages. So KVM itself can eventually resync. But “eventually” is doing a lot of work in that sentence, and the other subsystems are not so forgiving.
VFIO makes it worse. When a physical device is passed through to a guest via VFIO, the kernel pins the physical pages behind the guest RAM mapping and programs the device’s IOMMU (Input/Output Memory Management Unit) to allow DMA to those specific physical addresses. Pinning means the kernel promises not to move or reclaim those pages, because hardware is going to write directly to them. If you mmap over the guest RAM region, the new mapping gets new physical pages, but the IOMMU is still programmed with the old physical addresses. The device keeps doing DMA to the old pages, which are no longer the guest’s memory. At best you get silent data corruption, at worst the device writes to pages that have been reclaimed and assigned to something else entirely.
Vhost-user has a similar problem. Vhost-user device backends run in a separate process and share the guest RAM mapping via file-descriptor passing. The backend process mmaps the same backing object and gets access to the same physical pages. If the VMM replaces its mapping, the backend process is still looking at the original pages. The guest and its device backend are now operating on different physical memory, and neither knows it.
KVM can resync through MMU notifiers, but the IOMMU and the vhost-user backend cannot. The IOMMU is still programmed with the old host physical addresses. The backend process still has the old shared pages mapped. The device is DMAing into pages that no longer back guest RAM, and the backend is reading and writing pages that the guest will never see. The mapping identity is load-bearing, and replacing it silently breaks every consumer that was wired to the original physical pages.
Once I understood this, userfaultfd started making a lot more sense as the right tool for the job.
Userfaultfd is a Linux mechanism that lets a userspace thread intercept and handle page faults. It has been available since kernel 4.3, with additional event features such as non-cooperative mode and fork/remap/remove tracking added in 4.11. Instead of the kernel resolving a missing page on its own by allocating a zero page or reading from a file, the kernel delivers a fault event to a file descriptor, and a userspace handler resolves it by providing data for the faulting page.
The mechanism works through a simple protocol.
First you create a userfaultfd file descriptor via the userfaultfd(2) syscall. Then you negotiate features with the kernel via the UFFDIO_API ioctl, which is where you tell the kernel what kinds of faults you want to handle, such as missing pages on anonymous memory, shared memory, or hugepages. After that you register memory ranges via UFFDIO_REGISTER. Once a range is registered, any access to an unpopulated page in that range will generate a fault event instead of the kernel’s normal zero-page allocation. The handler then waits for fault events by reading from or polling the uffd file descriptor, where each event is a 32-byte message containing the faulting address. Finally, it resolves faults via UFFDIO_COPY to provide page data or UFFDIO_ZEROPAGE to zero-fill. The kernel installs the page and wakes the faulting thread.
sequenceDiagram
participant A as Process Thread
participant K as Host Kernel
participant H as Fault Handler Thread
A->>K: Access address in registered range
Note over K: Page fault, no physical page here
Note over K: Valid mapping, uffd-registered
K-->>A: Suspend faulting thread
K->>H: Deliver fault event (faulting address)
Note over H: Read fault event
Note over H: Look up backing data for this page
Note over H: Prepare page contents
H->>K: UFFDIO_COPY (page data + target address)
Note over K: Allocate physical page<br/>within the existing mapping
Note over K: Copy data in, update page tables
K-->>A: Wake faulting thread
Note over A: Retry access, succeeds this time
Note over A: Continues execution
The key property, and the thing that makes userfaultfd different from mmap, is that the mapping is not replaced. When the handler resolves a fault with UFFDIO_COPY, the kernel allocates a physical page within the original mapping and copies data into it. The mapping type does not change. The VMA is the same. After resolution, the page is indistinguishable from one that was always there.
An mmap overlay gives you new physical pages at the same virtual address by destroying the old mapping. Userfaultfd populates the existing mapping with physical pages on demand, so the mapping identity is preserved.
With this understanding of userfaultfd, applying it to the snapshot restore problem becomes straightforward. Instead of reading the entire snapshot into guest memory before the VM can start, the VMM registers the guest RAM region with userfaultfd and starts the VM immediately. When a vCPU touches an unpopulated page, the fault travels through multiple layers.
The guest executes a memory access and the guest kernel’s page tables translate the guest virtual address to a Guest Physical Address. The CPU then walks the EPT/NPT to translate the Guest Physical Address to a Host Physical Address, but the page has no physical backing because we skipped the eager copy. So the CPU triggers a VM exit. KVM handles the VM exit and converts it to a host page fault on the corresponding address in the VMM’s mmap region. The host kernel sees the fault, checks the userfaultfd registration, suspends the vCPU thread, and writes a fault message to the uffd fd. The handler thread wakes up, computes the offset into the snapshot file, reads the page data, and calls UFFDIO_COPY. The kernel populates the page within the existing mapping and wakes the vCPU thread. The vCPU re-enters guest mode, the EPT/NPT walk succeeds this time, and the guest instruction completes.
sequenceDiagram
participant G as vCPU Thread
participant K as Host Kernel
participant H as Fault Handler Thread
participant S as Snapshot File
Note over G: Guest accesses an<br/>unpopulated page
G->>K: Hardware page table walk fails (no backing page)
Note over K: VM exit, back to host mode
Note over K: Converted to host page fault
Note over K: Address is uffd-registered
K-->>G: Suspend vCPU thread
K->>H: Deliver fault event (faulting address)
Note over H: Compute offset into snapshot file
H->>S: Read one page of data
S-->>H: Page contents
H->>K: UFFDIO_COPY (write page into existing mapping)
Note over K: Allocate physical page in original mapping
Note over K: Copy data in, update page tables
K-->>G: Wake vCPU thread
Note over G: Re-enter guest mode
Note over G: Page table walk succeeds
Note over G: Guest instruction completes
After the fault is resolved, the page is permanently populated. Future accesses go through the EPT/NPT at full hardware speed with no faults, no VM exits, and no handler involvement. The page is just normal memory. The guest never participates in any of this. It does not know pages are missing, it does not know they are being loaded lazily. The hardware plus the uffd handler conspire to make the memory appear to have always been there.
Because the mapping is preserved, KVM memory slots, VFIO DMA, and vhost-user shared memory all continue to work as if the pages had always been present. This is what mmap overlay could not provide.
A VM with multiple vCPUs can generate concurrent faults on the same page. Two vCPUs touch the same missing page at the same time, and the kernel queues separate fault events for each. The handler processes the first one with UFFDIO_COPY. When it processes the second event for the same page, UFFDIO_COPY returns EEXIST because the page is already populated. This is a benign race, so the handler calls UFFDIO_WAKE to unblock any remaining waiters and moves on. No data corruption, no double-copy.
This pattern avoids the need to serialize all fault handling through a lock, which would add unnecessary latency. The kernel does the right thing when a page is resolved by another fault while the handler is still processing.
One detail I found interesting is that userfaultfd requires explicit feature negotiation based on the type of memory mapping you want to intercept. By default, userfaultfd only handles faults on anonymous private mappings. If you want to intercept faults on shared memory, which VMMs use for vhost-user, or hugepage-backed memory, which VMMs use for performance, you need to request specific kernel features.
UFFD_FEATURE_MISSING_SHMEM is needed for MAP_SHARED anonymous memory that is backed by tmpfs/shmem, while UFFD_FEATURE_MISSING_HUGETLBFS is needed for hugepage-backed mappings.
The kernel advertises which features it supports during the UFFDIO_API handshake. If a VMM’s memory zones use shared memory or hugepages and the kernel does not support the corresponding feature, the setup fails. This is a deliberate design in the Linux API because the kernel makes you opt in to each mapping type so there is no ambiguity about what will be intercepted.
After registering a range with UFFDIO_REGISTER, the kernel reports which ioctls are available for that range, typically COPY and WAKE. If a range does not support the ioctls needed for fault resolution, that is another early failure point.
Firecracker is the most prominent VMM that ships userfaultfd-based lazy restore today. Its design is worth looking at because it reflects a deliberate architectural choice about where the fault handling logic should live.
Rather than handling faults internally, Firecracker passes the userfaultfd file descriptor to an external process over a Unix socket. The external handler, which the orchestrator provides, resolves faults however it wants. Typically it mmaps the snapshot file and provides a pointer into that mapping as the source for UFFDIO_COPY, so each fault is resolved with a single ioctl that copies directly from the mmapped snapshot into the faulting page, with no file I/O syscalls in the hot path.
sequenceDiagram
participant F as Firecracker VMM
participant K as Host Kernel
participant E as External Handler
Note over F,E: Setup (one-time)
F->>E: Pass uffd fd over Unix socket
Note over F,E: Per-fault (runtime)
Note over F: vCPU hits missing page
F->>K: Hardware page table walk fails
Note over K: VM exit, host page fault
K-->>F: Suspend vCPU thread
K->>E: Deliver fault event via uffd fd
Note over E: Look up page in mmapped snapshot
E->>K: UFFDIO_COPY (one ioctl per fault)
Note over K: Populate page in original mapping
K-->>F: Wake vCPU thread
Note over F: vCPU resumes
This gives orchestrators full control over page serving. They could serve pages from a network store, decompress on the fly, prioritize certain memory regions, or implement their own prefetching strategy. The tradeoff is deployment complexity because you need an external handler component and a Unix socket protocol.
An alternative design is to keep the handler inside the VMM process as a thread. Each fault would do a seek + read from the snapshot file plus the ioctl, which is three syscalls per page instead of one. The tradeoff is simplicity because there is no external component, no socket protocol, and no deployment change. The per-fault throughput is lower, but for many workloads the restore-time latency improvement is the same because the VM returns near-instantly either way and the cost is spread over time as pages are touched.
One thing worth thinking about is what happens in the first moments after the VM resumes. The guest kernel was paused mid-execution. When it wakes up, it does not gently touch one page at a time. It resumes all vCPUs simultaneously, and those vCPUs immediately start executing whatever they were doing before the snapshot. The scheduler runs, timers fire, interrupted syscalls retry, and every running process picks up where it left off. All of this generates a burst of memory accesses across many pages at once.
This initial period is the worst case for on-demand paging. Every page the guest touches for the first time triggers the full fault path, which includes a VM exit, a host page fault, the uffd handler doing a file read, and the ioctl to resolve it. With multiple vCPUs faulting concurrently through a single handler thread, you get serialization. The vCPUs queue up behind the handler, and each one is blocked until its page is served.
In practice, the intensity of this storm depends heavily on the workload. A mostly-idle guest with a few sleeping processes might only touch a handful of pages in the first 100 milliseconds. A guest running a busy web server with active connections and timers might hit hundreds of pages across all vCPUs within the first few milliseconds. The second case is where the handler becomes a bottleneck and the guest experiences noticeable latency on those early memory accesses.
This is solvable in a few ways. A background prefetch thread can start reading pages ahead of the fault path, populating commonly-accessed regions like the kernel text, stack pages, and scheduler data structures before the vCPUs get to them. The handler itself can be made multi-threaded so multiple faults are served in parallel. Or the orchestrator can implement a warm-up strategy, where the VM is restored and given a short settling period before being added to the load balancer. None of these are in the current implementation, but they are natural next steps.
Nothing is free, and I think it is worth being explicit about the costs.
Per-page fault overhead. Each fault involves a VM exit, a host page fault, a context switch to the handler thread, syscalls for data retrieval, and a context switch back. This is significantly more expensive than a normal memory access. For workloads that immediately sweep through all their memory after restore, like a full garbage collection or a memset or a checkpoint verification pass, the cumulative fault overhead could make on-demand slower than just doing the eager copy upfront.
Sequential read vs random read. Eager copy reads the snapshot file sequentially, which is optimal for disk I/O. The userfaultfd handler seeks to random offsets for each fault, which is worse on HDDs and cold page caches. On SSDs (Solid State Drives) or when the snapshot is warm in the page cache, this matters less.
Handler throughput. A single handler thread processing faults sequentially becomes a bottleneck under heavy concurrent fault pressure from many vCPUs. This is a known limitation. Options include multi-threaded handlers or the external handler model.
Kernel feature availability. Userfaultfd with UFFD_FEATURE_MISSING_SHMEM and UFFD_FEATURE_MISSING_HUGETLBFS requires relatively recent kernels. This is not universally available.
Complexity for the orchestrator. With eager copy, restore is atomic in the sense that the VM either starts with all its memory or it does not. With on-demand, the VM starts immediately but page faults can fail because of a disk error or a corrupted snapshot. The failure mode shifts from “restore fails” to “guest crashes mid-execution on a bad page.” The orchestrator needs to handle this differently.
Snapshot restore latency, meaning the time from “start restoring” to “VM is running,” is the number that on-demand paging makes dramatically better. But for platforms that manage many VMs, restore latency is only one dimension. The other is what happens when you are restoring dozens or hundreds of VMs concurrently from large snapshot images, possibly the same image.
Consider a serverless or function-as-a-service platform that keeps a pool of pre-snapshotted VMs and restores clones on demand as requests arrive. During a traffic spike, the platform might need to restore 50 VMs in a short window. With eager copy, each restore reads the full snapshot file sequentially. 50 concurrent restores of a 4GB image means 200GB of sequential reads hitting the storage layer at once. The I/O bandwidth becomes the bottleneck, and restore latency degrades for everyone.
On-demand paging changes the I/O pattern but does not eliminate the bandwidth problem. Instead of 200GB of upfront reads, you get 200GB of random reads spread over time as the guests touch pages. If all 50 guests hit their hot pages in the first few seconds, the storage backend still sees a burst, just distributed as random 4KB reads instead of sequential streams. On SSDs this is fine since random read throughput is high. On network-attached storage or shared storage backends with limited IOPS (Input/Output Operations Per Second), it can become a different kind of bottleneck.
The shape of the problem also changes with shared base images. If many VMs are restored from the same snapshot, the host page cache becomes an asset. The first VM faults in a page and the handler reads it from disk. Subsequent VMs faulting the same page offset find the data already in the page cache, making the read effectively free. This is one of the natural advantages of the seek-and-read approach over mmap, because the kernel’s page cache does the deduplication transparently. In a high-concurrency restore scenario with shared base images, the effective I/O cost per VM drops significantly after the first few restores warm the cache.
There is also the question of what happens when the handler cannot keep up. If the storage backend is slow or the handler thread is saturated, vCPU threads pile up waiting for pages. The guest experiences this as inexplicable latency on memory accesses. Unlike eager copy where all the pain is upfront and bounded, on-demand paging spreads the cost over time and the worst case is harder to predict. For latency-sensitive workloads, this unpredictability can be worse than a known, bounded restore delay.
The right answer probably depends on the workload mix. For platforms where restore latency is the dominant metric and guests have sparse memory access patterns, on-demand paging is a clear win. For platforms that need predictable per-request latency and can tolerate a longer restore window, eager copy with a warm pool might be simpler to reason about. And for platforms that do both, a hybrid approach where the handler prefetches likely-hot pages while serving faults on demand could offer the best of both.
Userfaultfd was not designed solely for VM snapshot restore. AFAIK, post-copy live migration was one of the original motivations, where a VM starts on the destination host before all its memory has been transferred and missing pages are faulted in from the source over the network (QEMU has supported this since 2.6). CRIU uses it for lazy process restore, serving checkpoint pages on demand instead of reading them all upfront. And distributed shared memory systems use it to fetch pages from remote nodes on access, presenting a flat address space backed by network storage.
In every case, the pattern is the same where you have memory that logically has data in it, but you defer populating the physical pages until something actually touches them.