Reduce gVisor Cold Starts with GPU Snapshotting

Reducing GPU Cold Starts with Memory Snapshots: Restoring CUDA Workloads in Seconds

If you run AI models in production, you have a relationship with cold starts whether you want one or not.

A three-minute startup time changes how you scale. You keep GPUs warm that could have been released. You over-provision to avoid making users wait. You stretch cooldown periods because scaling down too quickly creates pain on the next spike. The application starts accumulating complexity around one problem: getting a model ready to serve traffic fast enough.

At Cerebrium, we have been obsessed with the cold start problem since day one. That obsession has pushed us to rethink almost every layer of our infrastructure:

Custom VM images for faster node scale-ups
A custom image runtime for sub-second container image cold starts
A highly available, low-latency orchestrator for routing workloads across regions and clouds
CPU and GPU memory snapshots for restoring fully warmed containers in seconds

As more companies move large custom AI models into production, they hit the same wall. Our customers run large language models, real-time avatars, transcription models, diffusion models, and other GPU-heavy workloads where startup time can vary from a few seconds to more than five minutes.

Most of that time is spent on work that gets a container ready to serve requests: importing libraries, loading model weights, initializing CUDA, compiling kernels, and warming up the runtime. That is the core problem checkpointing solves. Instead of rebuilding the same runtime from scratch each time a new container starts, we snapshot the fully initialized container - including CPU memory, GPU memory, process state, model weights, and compiled kernels - and restore it directly into a new container in a fraction of the time.

For some workloads, this reduces cold start time by more than 80%!

This post explains how we built CPU and GPU memory checkpointing at Cerebrium, how it works inside our highly customized gVisor-based runtime, and what it took to make real CUDA workloads like vLLM restore reliably and quickly.

Where do the minutes actually go?

It's tempting to think of cold starts as simply “pulling the image”: downloading the application image onto the machine that will run the container. But for AI workloads, that is only the first part of getting a model ready to serve traffic, and it is no longer the bottleneck. We have already solved the container download problem. The real cost in a CPU or GPU container is everything that happens after the image is on the machine and the application starts initializing.

That initialization path includes importing Python modules, loading PyTorch, assembling model weights, copying them onto the GPU, and running the framework’s warmup path - torch.compile, CUDA graph capture, KV cache initialization, and whatever else the serving stack needs before it can take traffic.

Every one of these stages is deterministic.

Importing PyTorch produces the same loaded modules every time. Building the model and copying weights onto the GPU produces the same bytes in GPU memory every time. torch.compile and CUDA graph capture produce the same kernels every time.

Yet on every scale-up, we pay to recompute a result that is known.

That is what checkpointing changes.

The idea is simple: do the expensive startup work once, freeze the result, and restore it on demand.

Concretely, taking a checkpoint means:

Pause execution: pause all application processes, threads, and, crucially, GPU work.
Dump memory: serialize the in-memory state from both CPU and GPU to files.
Upload: push those files to fast, durable storage.

Restoring runs the same process in reverse. We pull the checkpoint files down, rehydrate CPU and GPU memory, repair the pieces of state that cannot survive a move, and unpause the workload.

The restored application process is the same warmed-up runtime we froze earlier: PyTorch has already been imported, model weights are already resident on the GPU, kernels are already compiled, and the application is ready to serve traffic.
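The freeze-and-restore idea can be illustrated with a deliberately small analogy. The sketch below is not how a GPU checkpoint is taken: it uses Python's pickle on an ordinary dictionary, and the names expensive_warmup, checkpoint, restore, and start are hypothetical. It only shows the shape of the flow: do the deterministic work once, dump the result to a file, and rehydrate it on later starts.

```python
import pickle
import tempfile
import time
from pathlib import Path


def expensive_warmup() -> dict:
    """Stand-in for imports, weight loading, and kernel compilation."""
    time.sleep(0.2)  # pretend this is minutes of deterministic work
    return {"weights": [i * 0.5 for i in range(1000)], "compiled": True}


def checkpoint(state: dict, path: Path) -> None:
    """Serialize the warmed state to a file (the 'dump memory' step)."""
    path.write_bytes(pickle.dumps(state))


def restore(path: Path) -> dict:
    """Rehydrate the warmed state instead of recomputing it."""
    return pickle.loads(path.read_bytes())


def start(path: Path) -> dict:
    """Restore if a checkpoint exists, otherwise warm up and checkpoint."""
    if path.exists():
        return restore(path)
    state = expensive_warmup()
    checkpoint(state, path)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = Path(tmp) / "warm-state.bin"
        for attempt in ("cold", "restored"):
            began = time.perf_counter()
            start(ckpt)
            print(f"{attempt} start took {time.perf_counter() - began:.3f}s")
```

The second call skips the warmup entirely, which is the property the real system relies on at a much larger scale.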

The mental model is straightforward. Making it work reliably for real GPU workloads is not.

High-level architecture

At a high level, checkpointing needs to sit in the one place where it can control the full lifecycle of a container: between the container runtime and the sandbox running the workload.

Cerebrium runs user workloads inside gVisor sandboxes for isolation. To support checkpointing, we extended that runtime path so that when a container starts, we can make a decision before the normal boot sequence completes:

Should this container start from scratch, or should it be restored from a checkpoint?

If no checkpoint exists, the container follows the normal path. The image starts, the application boots, models load, GPU memory is populated, and the workload becomes ready. Once the container is fully warmed, the user can trigger a checkpoint. At that point, we pause the workload, capture its CPU and GPU state, write the checkpoint to disk, and upload it to fast storage.

If a checkpoint does exist, we skip the normal startup path. Instead of launching the container and waiting for Python imports, model loading, GPU transfers, torch.compile, and CUDA graph capture, we restore the saved state directly into the sandbox. The process resumes as if it had just finished warming up.

That sounds simple, but it requires the runtime to answer a few questions at exactly the right time:

Which workload is being started?
Does a compatible checkpoint exist for this image, GPU type, machine type, and runtime version?
Where is the checkpoint stored?
Is the checkpoint already cached locally on the host?
Should we restore, or fall back to a clean boot?

To make this work, we added two components to the node runtime.

The first is a small checkpoint service that runs on every host. It handles the operational side of checkpointing: downloading checkpoints, uploading new ones, caching them locally, evicting old or corrupted checkpoints, and reporting restore status.

The second is a modified gVisor containerd shim. This is the piece that sits in the container startup path. It intercepts container creation, checks whether a checkpoint can be restored, and either continues with the normal boot flow or replaces that flow with a restore.

In other words, the checkpoint service moves and manages the snapshot files. The shim decides whether a new container should boot normally or wake up from a snapshot.
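To make the caching and eviction duties of the checkpoint service concrete, here is a minimal sketch of a local cache that verifies a checksum before serving a file and drops entries that fail it. Everything in it is an assumption for illustration: the class name LocalCheckpointCache, the use of SHA-256, the flat one-file-per-key layout, and the simple oldest-first eviction are not taken from our service.

```python
import hashlib
from pathlib import Path
from typing import Optional


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large checkpoints are not read at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


class LocalCheckpointCache:
    """Hypothetical on-host cache: one file per checkpoint key."""

    def __init__(self, root: Path, max_entries: int = 4) -> None:
        self.root = root
        self.max_entries = max_entries
        root.mkdir(parents=True, exist_ok=True)

    def put(self, key: str, data: bytes) -> str:
        """Store a checkpoint and return its digest for later verification."""
        path = self.root / key
        path.write_bytes(data)
        self._evict_old(keep=path)
        return sha256_of(path)

    def get(self, key: str, expected_sha256: str) -> Optional[Path]:
        """Return the cached file only if it exists and matches its digest."""
        path = self.root / key
        if not path.exists():
            return None
        if sha256_of(path) != expected_sha256:
            path.unlink()  # corrupted entry: evict so it is fetched again
            return None
        return path

    def _evict_old(self, keep: Path) -> None:
        """Remove the oldest entries (by mtime) beyond max_entries."""
        others = sorted(
            (p for p in self.root.iterdir() if p != keep),
            key=lambda p: p.stat().st_mtime,
        )
        excess = len(others) + 1 - self.max_entries
        for stale in others[: max(0, excess)]:
            stale.unlink()
```

A miss from get (either absent or corrupted) is the signal to download from durable storage, or to fall back to a clean boot.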

The hardest part was not the API between those two components. It was timing.

Containerd starts a sandbox through a fixed sequence:

Sandbox Create → Sandbox Start → Container Create → Container Start

The natural place to decide whether to restore is when the sandbox starts. But at that point, we do not yet have enough information about the container image to know whether a checkpoint exists. The image information only becomes available later, during container creation.

So we had to reorder the startup sequence slightly.

When containerd asks us to start the sandbox, we defer the real start. We keep containerd satisfied with the expected status responses, but delay the actual sandbox startup until container creation, once we know which image is being launched and whether a matching checkpoint exists.
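The reordering can be modelled as a tiny state machine. This is a toy sketch, not the shim itself: the class DeferredStartShim, its method names, the "running" status string, and the find_checkpoint callback are all hypothetical stand-ins for the real containerd shim interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class DeferredStartShim:
    """Toy model of the reordered lifecycle; all names are hypothetical."""

    find_checkpoint: Callable[[str], Optional[str]]
    events: List[str] = field(default_factory=list)
    start_requested: bool = False
    sandbox_started: bool = False

    def sandbox_create(self) -> None:
        self.events.append("sandbox-created")

    def sandbox_start(self) -> str:
        # Defer the real start, but answer with the status the caller expects.
        self.start_requested = True
        return "running"

    def container_create(self, image: str) -> str:
        # The image is only known here, so this is where the decision happens.
        if not self.start_requested:
            raise RuntimeError("container created before sandbox start")
        checkpoint = self.find_checkpoint(image)
        self.sandbox_started = True  # the deferred start finally happens
        if checkpoint is not None:
            self.events.append(f"restore:{checkpoint}")
            return "restore"
        self.events.append(f"boot:{image}")
        return "boot"


if __name__ == "__main__":
    known = {"registry.example/app:v1": "ckpt-001"}  # illustrative data
    shim = DeferredStartShim(find_checkpoint=known.get)
    shim.sandbox_create()
    shim.sandbox_start()
    print(shim.container_create("registry.example/app:v1"))
```

The point of the model is that sandbox_start only records intent, while container_create performs the start on whichever path the checkpoint lookup selects.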

At that point, we choose one of two paths:

Normal boot: start the sandbox, launch the container, let the application initialize, and optionally checkpoint it once warm.
Checkpoint restore: download or locate the checkpoint, restore CPU and GPU memory into the sandbox, repair runtime state that cannot survive a move, and resume the process.

The work is mostly the same work the runtime would already do. The key change is that we moved the restore decision from sandbox start to container creation, where the image information is finally available and we can determine whether a matching checkpoint exists.

That small reordering is what lets checkpointing feel transparent from the user’s perspective. They start a workload the same way, but once a checkpoint exists, future scale-ups restore the warmed process instead of rebuilding it from scratch.

As we tested and developed the feature, we ran into several edge cases that were not obvious from the available documentation. Where possible, we are working to move those fixes upstream so that the next team adopting this technology does not have to rediscover the same issues.

Some of the issues we uncovered included:

A race condition in the TCP network stack that stopped the network from working when the container received many packets during the checkpointing process.
A race condition that crashed gVisor when running within containerd if a checkpoint took longer than a few seconds.
Support for Container Device Interface (CDI) injection for NVIDIA GPUs.

Checkpoint distribution: why the storage layer matters more than you'd think

A checkpoint of a warmed-up GPU container is large. One of our test workloads is around 9 GiB, while a checkpoint for DeepSeek V4 FP8 running on vLLM would be around 640 GB. Restoring is only worth it if we can move that much data faster than the container would have cold-started on its own. That makes the storage and network path the single most important design decision in the whole system.

The math is unforgiving:

For our 9 GiB test workload on a g5.12xlarge, a full vLLM cold start took around 50 seconds. Restoring from the 9 GiB checkpoint reduced startup to 2.25 seconds from S3 and 9 seconds from local NVMe.

We use S3 as the default restore path because it is fast enough and portable across the clouds and regions Cerebrium supports. Local NVMe is fast when the checkpoint is already cached on the node, while object storage remains the durable source of truth.

These results are specific to g5.12xlarge. On nodes with higher network bandwidth or faster local storage, restore times improve further.
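The arithmetic behind those numbers is worth spelling out. The sketch below uses only the figures quoted above (9 GiB, 50 seconds, 2.25 seconds, 9 seconds); the function names are ours for illustration. Note that the "implied throughput" is an end-to-end figure: it folds in any restore work that is not data movement, so it should be read as a lower bound on the raw transfer rate.

```python
def implied_throughput_gib_s(size_gib: float, seconds: float) -> float:
    """End-to-end GiB/s needed to move size_gib within the given time."""
    return size_gib / seconds


def reduction_pct(cold_start_s: float, restore_s: float) -> float:
    """Percentage of the cold start time removed by restoring instead."""
    return (1.0 - restore_s / cold_start_s) * 100.0


SIZE_GIB = 9.0
COLD_START_S = 50.0
RESTORE_S = {"S3": 2.25, "local NVMe": 9.0}

# Below this rate, restoring is slower than simply cold-starting.
break_even = implied_throughput_gib_s(SIZE_GIB, COLD_START_S)
print(f"break-even: {break_even:.2f} GiB/s")  # break-even: 0.18 GiB/s

for source, seconds in RESTORE_S.items():
    rate = implied_throughput_gib_s(SIZE_GIB, seconds)
    saved = reduction_pct(COLD_START_S, seconds)
    print(f"{source}: {rate:.2f} GiB/s, {saved:.1f}% reduction")
# S3: 4.00 GiB/s, 95.5% reduction
# local NVMe: 1.00 GiB/s, 82.0% reduction
```

The break-even rate scales with checkpoint size: the larger the snapshot relative to the cold start it replaces, the more bandwidth the restore path needs just to come out ahead.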

The hard part: real workloads are messy

Checkpointing is easiest when the workload’s state lives entirely in memory. Real GPU workloads are rarely that clean.

A snapshot can preserve the warmed-up runtime, but it cannot blindly preserve every external dependency around it. After restore, the application may still hold references to a filesystem path, socket, IP address, device handle, or driver state that was valid before the move but invalid after it. That is where most of the finicky behavior comes from.

Network state is the first obvious example. Open TCP connections are tied to the original runtime environment. After restore, those connections have been terminated, and the container may also have a different external IP. This breaks frameworks that use the container’s external IP for internal heartbeats, worker coordination, or control-plane communication. In vLLM, for example, this meant the process could restore successfully but still fail internally because parts of the runtime were trying to communicate through an address that was no longer valid. The fix was to pin internal framework communication to loopback using VLLM_HOST_IP=127.0.0.1, so that worker coordination no longer depended on the external IP assigned to the container.

Multiprocessing creates another class of problems. Many Python serving frameworks use worker processes, and if those workers are created with fork, they can inherit NVIDIA driver file descriptors from the parent process. That matters because the checkpoint system needs a clean understanding of which processes actually own GPU state. Leaked driver file descriptors can make the runtime believe the GPU is still in use by processes that should not block checkpointing, or cause restore behavior that is difficult to reason about. For vLLM, the fix was to use spawn instead of fork for GPU workers with VLLM_WORKER_MULTIPROC_METHOD=spawn, so child processes start cleanly instead of inheriting GPU driver state from the parent.

Local runtime files are another subtle edge. Frameworks often create Unix sockets, temporary files, lock files, and coordination state on local disk. If that local filesystem is not restored with the checkpoint, the process can wake up expecting files that no longer exist. This is one of the more annoying failure modes because the process may look healthy from the outside while workers silently fail to communicate internally. In vLLM, we solved this by moving restore-critical RPC state to a small preserved path using VLLM_RPC_BASE_PATH=/run/cuda-ckpt.
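The three vLLM settings described above can be collected in one place. This is a small sketch of how a launcher might apply them: the environment variable names and values are the ones quoted in this post, while the helper checkpoint_safe_env and the constant CHECKPOINT_SAFE_ENV are hypothetical names.

```python
import os
from typing import Dict, Mapping, Optional

# Values taken from the fixes described above.
CHECKPOINT_SAFE_ENV = {
    # Keep internal coordination on loopback, not the external IP.
    "VLLM_HOST_IP": "127.0.0.1",
    # Start GPU workers with spawn so they do not inherit driver descriptors.
    "VLLM_WORKER_MULTIPROC_METHOD": "spawn",
    # Put restore-critical RPC state on a path preserved with the checkpoint.
    "VLLM_RPC_BASE_PATH": "/run/cuda-ckpt",
}


def checkpoint_safe_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the checkpoint-safe overrides."""
    env = dict(os.environ if base is None else base)
    env.update(CHECKPOINT_SAFE_ENV)
    return env


if __name__ == "__main__":
    for name in CHECKPOINT_SAFE_ENV:
        print(f"{name}={checkpoint_safe_env()[name]}")
```

The returned mapping can be passed as the env argument of subprocess.Popen when starting the serving process, so the settings are in place before any worker is created.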

The timing of the checkpoint also matters. A checkpoint needs a consistent view of CPU and GPU memory. If CUDA work is still running while the snapshot is taken, the checkpoint may be inconsistent or unsafe to restore. In practice, this means checkpointing has to happen after the workload has finished warming up and reached a known idle state. For some frameworks, that requires an explicit readiness step: load the model, run the warmup pass, wait for compilation or CUDA graph capture to finish, and only then trigger the checkpoint.
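An explicit readiness step might look like the following sketch. The functions wait_until and warm_then_checkpoint are hypothetical, and the callbacks are placeholders for whatever the serving stack provides. In a PyTorch workload, one plausible way to implement the idle check is to call torch.cuda.synchronize(), which blocks until queued work on the current device has finished, and then report idle.

```python
import time
from typing import Callable


def wait_until(
    is_idle: Callable[[], bool], timeout_s: float = 60.0, poll_s: float = 0.05
) -> None:
    """Poll a readiness predicate until it is true or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while not is_idle():
        if time.monotonic() >= deadline:
            raise TimeoutError("workload never reached an idle state")
        time.sleep(poll_s)


def warm_then_checkpoint(
    load_model: Callable[[], None],
    warmup: Callable[[], None],
    is_idle: Callable[[], bool],
    trigger_checkpoint: Callable[[], None],
    timeout_s: float = 60.0,
) -> None:
    """Run the warmup steps in order and checkpoint only once idle."""
    load_model()
    warmup()
    # If the workload never settles, this raises and no checkpoint is taken.
    wait_until(is_idle, timeout_s=timeout_s)
    trigger_checkpoint()
```

The ordering is the whole point: the checkpoint trigger is unreachable until loading, warmup, and the idle check have all completed.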

Another optimization is deciding what should not be checkpointed. vLLM sleep mode is useful here because it can drop transient state like the KV cache before the checkpoint is taken. The KV cache can be large, and preserving it makes the checkpoint bigger, slower to upload, and slower to restore. For many workloads, the cache is not worth carrying across restores because it is request-specific and can be rebuilt naturally once traffic resumes. In those cases, putting vLLM into sleep mode before checkpointing dramatically reduces the snapshot size and improves restore performance. We expose this as a choice rather than forcing one behavior: users can decide whether they want to preserve that state across restores or discard it to make checkpointing faster.

The last constraint is compatibility. A GPU memory checkpoint is not a portable artifact in the same way a container image is. It is tied to the environment it was created in: GPU type, CPU architecture, machine type, driver/runtime compatibility, and gVisor version. A checkpoint created on one hardware and runtime shape cannot safely be restored onto an arbitrary other one. Because of that, we key checkpoints by compatibility, not just by application. The restore path only uses a checkpoint when the target environment matches the original checkpoint environment.
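Keying by compatibility can be sketched as hashing every field that affects restorability into a single lookup key. The field list below follows the constraints named above; the class CheckpointEnv, the exact field names, the example values, and the choice of SHA-256 over a JSON encoding are assumptions for illustration, not a description of our key format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CheckpointEnv:
    """Hypothetical description of the environment a checkpoint depends on."""

    image_digest: str
    gpu_type: str
    cpu_arch: str
    machine_type: str
    driver_version: str
    gvisor_version: str

    def key(self) -> str:
        """Stable key: any change to any field produces a different key."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def can_restore(checkpoint_env: CheckpointEnv, target_env: CheckpointEnv) -> bool:
    """Only restore when the target matches the checkpoint's environment."""
    return checkpoint_env.key() == target_env.key()


if __name__ == "__main__":
    # Example values are placeholders, not real versions.
    created_on = CheckpointEnv(
        image_digest="sha256:example",
        gpu_type="A10G",
        cpu_arch="x86_64",
        machine_type="g5.12xlarge",
        driver_version="driver-x",
        gvisor_version="gvisor-y",
    )
    print(created_on.key()[:16])
```

Storing checkpoints under this key means a lookup on a mismatched node simply misses, and the container falls back to a clean boot.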

The bigger pattern is that GPU memory checkpointing is not just “dump memory and reload it.” It is about separating state that can be frozen from state that must be recreated, reconnected, or moved into a checkpoint-safe location.

That is also why the feature is opt-in and workload-aware. Different serving stacks depend on different filesystems, sockets, device handles, networking assumptions, and framework internals. Making checkpointing production-ready means validating those assumptions explicitly, rather than treating every GPU workload as if it can be paused, moved, and resumed in exactly the same way.

The results: 71% reduction in cold starts

We benchmarked Cerebrium against Baseten and Modal across six workloads of varying functionality. For each workload, we ran 100 cold-start requests over a 24-hour period on the same GPU classes across providers (A10, L40S, etc.).

This is not a perfect apples-to-apples comparison. Each platform controls its own underlying node shape, and the same workload can cold-start differently on a g6e.48xlarge than on a g6e.12xlarge. Users do not always get control over that exact placement on every platform. Still, this reflects the real-world experience customers care about: how quickly and consistently a workload becomes ready when the platform has to scale from cold.

Baseten references snapshotting in some materials, but we were unable to use it as a generally available, self-serve feature during our benchmark. As a result, we compared against their cached cold-start behavior, which was the reproducible path available to us. Caching helps reduce image and model download time, but it does not remove framework initialization or GPU warmup work. For workloads dominated by CUDA graph capture, torch.compile, SGLang startup, or serving-runtime preparation, that post-download work is often the expensive part.

Across the benchmark suite, Cerebrium snapshots reduced cold starts by an average of 71% compared to running the same workloads on Cerebrium without snapshots, with reductions as high as 88% on vLLM. Compared to Baseten’s cached cold-start numbers, Cerebrium snapshots were 85% faster on average, and up to 94% faster on vLLM.

Against Modal snapshots, Cerebrium had a lower p0 restore time on 4 of the 6 workloads, with an average p0 restore time that was ~21% lower across the suite. More importantly, Cerebrium had a lower worst-case restore time on all 6 workloads, with an average max restore time that was ~27% lower. That consistency matters for cold starts: a single slow restore can still create a bad end-user experience. You can see our full benchmark implementations here.

Checkpointing is not the right tool for every workload. If your application already starts in a few seconds, caching may be enough. But when cold starts are dominated by deterministic initialization (importing frameworks, loading models, compiling kernels, capturing CUDA graphs, or warming KV caches), checkpointing changes the scaling model. You can scale down more aggressively when traffic drops, restore quickly when demand returns, and avoid keeping GPUs warm just to protect users from cold starts. That means better utilization, lower infrastructure costs, and a better experience for end customers because capacity can come online fast enough to meet demand.

If you want to try checkpointing, check out our docs here and our examples repo.

Want serverless GPUs that start in seconds instead of minutes? Sign up for Cerebrium and deploy your first model in a few lines of code.
