*this is incorrect per the author’s response, my apologies.
For instance, it goes into (nano)vLLM internals and doesn’t mention PagedAttention once (one of the core ideas that vLLM is based on)[1].
Also mentions that Part 2 will cover dense vs MoE’s, which is weird because nanovllm hardcodes a dense Qwen3 into the source.
Here are better (imo) explainers about how vLLM works:
- https://hamzaelshafie.bearblog.dev/paged-attention-from-firs...
- https://www.aleksagordic.com/blog/vllm
- https://huggingface.co/blog/continuous_batching
Aleksa’s blog is a bit in the weeds for my taste but it’s really worth working through.
A lot of the magic of vLLM happens in the PagedAttention kernels, which are really succinctly implemented in nanovllm. And the codebase is great and readable by itself!
—
That said, I need to clarify: the content was not written by AI, and certainly not generated from a database in one shot. If there's some agent + prompt that can produce what I wrote, I'd love to learn it—it would've saved me two weekends :)
Before addressing your questions further, some context: I'm a developer with no ML background but plenty of Cloud Infra experience. I'm currently building an open-source AI Infra project, which is why I studied nano-vllm. So my writing reflects some gaps in ML knowledge.
To your specific points:
> it goes into (nano)vLLM internals and doesn't mention PagedAttention once
I didn't find any explicit "paged attention" naming in nano-vllm. After reading the first article you linked—specifically the "Paged KV Caching" section—I believe the block management logic and CPU/GPU block mapping it describes are exactly what I covered in both posts. It may not be the full picture of paged attention, but I interpreted what I saw in the code and captured the core idea. I think that's a reasonable outcome.
> Part 2 will cover dense vs MoE's, which is weird because nanovllm hardcodes a dense Qwen3 into the source
This reflects my learning approach and background. Same as point 1—I may not have realized the block design was the famous PagedAttention implementation, so I didn't name it as such. For point 2, seeing a dense Qwen3 naturally made me wonder how it differs from the xx-B-A-yy-B MoE models I'd seen on Hugging Face—specifically what changes in the decoder layers. That curiosity led me to learn about MoE and write it up for others with the same questions.
---
I completely understand that in this era, people care more about whether what they're reading is AI-generated—no one wants to waste time on low-effort slop with no human involvement.
But as I explained above—and as my hand-drawn Excalidraw diagrams show (I haven't seen an LLM produce diagrams with logic that satisfies me)—this is the result of learning shaped by my own knowledge background and preferences.
No offense intended to @yz-yu, by the way. I miss the times when more people wrote in an eccentric style -- like Steve Yegge -- but that doesn't detract from what you wrote.
Next time you want to lie about using LLMs, OP, either use the techniques from our paper: https://arxiv.org/abs/2510.15061
Or, if you're lazier, there's a "logit_bias" technique you can use to ban the em-dash in language models.
But you were too lazy to do that, and you lied about not using AI. Shame on you big time.
Also, even if you somehow didn't AI generate this, you sure as shit got infected by the LLM mind-virus and now you write/talk like it. That's basically just as bad. Either square up with proof that you overused EM dashes before late 2022 (like Dang!) or fix your writing style.
https://arxiv.org/abs/2409.01754
https://arxiv.org/abs/2508.01491
https://aclanthology.org/2025.acl-short.47/
https://arxiv.org/abs/2506.06166
https://en.wikipedia.org/wiki/Wikipedia:Signs_of_AI_writing
https://osf.io/preprints/psyarxiv/wzveh_v1
https://arxiv.org/abs/2506.08872
https://aclanthology.org/2025.findings-acl.987/
https://aclanthology.org/2025.coling-main.426/
https://aclanthology.org/2025.iwsds-1.37/
https://www.medrxiv.org/content/10.1101/2024.05.14.24307373v...
https://journals.sagepub.com/doi/full/10.1177/21522715251379...
So if you're being accused of just spewing AI, then double down and spew what looks EVEN MORE like AI. What are you even doing?
Hasn't made me change the way I write, though. Especially because I never actually type an em dash character myself. Back when I started using computers, we only had ASCII, so I got used to writing with double dashes. Nowadays, a lot of software is smart enough to convert a double dash into an em dash. Discourse does that and that's how I ended up being accused of being an AI bot.
When deploying large language models in production, the inference engine becomes a critical piece of infrastructure. Every LLM API you use — OpenAI, Claude, DeepSeek — is sitting on top of an inference engine like this. While most developers interact with LLMs through high-level APIs, understanding what happens beneath the surface—how prompts are processed, how requests are batched, and how GPU resources are managed—can significantly impact system design decisions.
This two-part series explores these internals through Nano-vLLM, a minimal (~1,200 lines of Python) yet production-grade implementation that distills the core ideas behind vLLM, one of the most widely adopted open-source inference engines.
Nano-vLLM was created by a contributor to DeepSeek, whose name appears on the technical reports of models like DeepSeek-V3 and R1. Despite its minimal codebase, it implements the essential features that make vLLM production-ready: prefix caching, tensor parallelism, CUDA graph compilation, and torch compilation optimizations. Benchmarks show it achieving throughput comparable to—or even slightly exceeding—the full vLLM implementation. This makes it an ideal lens for understanding inference engine design without getting lost in the complexity of supporting dozens of model architectures and hardware backends.
In Part 1, we focus on the engineering architecture: how the system is organized, how requests flow through the pipeline, and how scheduling decisions are made. We will treat the actual model computation as a black box for now—Part 2 will open that box to explore attention mechanisms, KV cache internals, and tensor parallelism at the computation level.
The entry point to Nano-vLLM is straightforward: an LLM class with a generate method. You pass in an array of prompts and sampling parameters, and get back the generated text. But behind this simple interface lies a carefully designed pipeline that transforms text into tokens, schedules computation efficiently, and manages GPU resources.
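As a rough illustration, using the engine looks something like the sketch below (class and argument names are illustrative and may differ slightly from the actual nano-vllm API):

```python
# Illustrative usage sketch; exact import paths and argument names may differ.
from nanovllm import LLM, SamplingParams

llm = LLM("Qwen/Qwen3-0.6B", tensor_parallel_size=1)      # load weights onto the GPU(s)
params = SamplingParams(temperature=0.6, max_tokens=256)  # per-request sampling config

outputs = llm.generate(["Explain KV caching in one sentence."], params)
print(outputs[0])  # generated output for the first prompt (assumed to carry the text)
```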

When generate is called, each prompt string goes through a tokenizer—a model-specific component that splits natural language into tokens, the fundamental units that LLMs process. Different model families (Qwen, LLaMA, DeepSeek) use different tokenizers, which is why a prompt of the same length may produce different token counts across models. The tokenizer converts each prompt into a sequence: an internal data structure representing a variable-length array of token IDs. This sequence becomes the core unit of work flowing through the rest of the system.
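To make that concrete, here is a minimal sketch of the conversion, using Hugging Face's AutoTokenizer and a simplified stand-in for the internal Sequence class (field names are illustrative, not nano-vllm's exact definition):

```python
# Simplified stand-in for nano-vllm's internal Sequence; real fields differ.
from dataclasses import dataclass, field
from transformers import AutoTokenizer

@dataclass
class Sequence:
    token_ids: list[int]                                   # variable-length array of token IDs
    output_ids: list[int] = field(default_factory=list)    # tokens generated so far

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
seq = Sequence(token_ids=tokenizer.encode("What is an inference engine?"))
print(len(seq.token_ids))  # token count depends on the model family's tokenizer
```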
Here’s where the architecture gets interesting. Rather than processing each sequence immediately, the system adopts a producer-consumer pattern with the Scheduler at its center. The add_request method acts as the producer: it converts prompts to sequences and places them into the Scheduler’s queue. Meanwhile, a separate step loop acts as the consumer, pulling batches of sequences from the Scheduler for processing. This decoupling is key—it allows the system to accumulate multiple sequences and process them together, which is where the performance gains come from.
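Conceptually, the split looks something like this (a simplified sketch of the engine, not the exact nano-vllm code; tokenize, step, and the scheduler's methods are assumed to exist elsewhere):

```python
# Simplified sketch of the producer-consumer split inside the engine.
class LLMEngine:
    def __init__(self, scheduler):
        self.scheduler = scheduler

    def add_request(self, prompt, sampling_params):
        seq = self.tokenize(prompt)                # producer: prompt -> Sequence
        self.scheduler.add(seq, sampling_params)   # place it in the Scheduler's queue

    def generate(self, prompts, sampling_params):
        for prompt in prompts:                     # producer side: enqueue all work first
            self.add_request(prompt, sampling_params)

        outputs = []
        while not self.scheduler.is_finished():    # consumer side: drain in batches
            outputs.extend(self.step())            # one batched prefill or decode step
        return outputs
```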
Why does batching matter? GPU computation has significant fixed overhead—initializing CUDA kernels, transferring data between CPU and GPU memory, and synchronizing results. If you process one sequence at a time, you pay this overhead for every single request. By batching multiple sequences together, you amortize this overhead across many requests, dramatically improving overall throughput.
However, batching comes with a trade-off. When three prompts are batched together, each must wait for the others to complete before any results are returned. The total time for the batch is determined by the slowest sequence. This means: larger batches yield higher throughput but potentially higher latency for individual requests; smaller batches yield lower latency but reduced throughput. This is a fundamental tension in inference engine design, and the batch size parameters you configure directly control this trade-off.
Before diving into the Scheduler, we need to understand a crucial distinction. LLM inference happens in two phases:

- Prefill: the entire prompt is processed in a single pass, filling the KV cache for every prompt token at once.
- Decode: the model generates one token at a time, with each step reading the cached results and producing exactly one new token.

For a single sequence, there is exactly one prefill phase followed by many decode steps. The Scheduler needs to distinguish between these phases because they have very different computational characteristics—prefill processes many tokens at once, while decode processes just one token per step.
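A toy illustration of how much work each phase does per step, for a 700-token prompt that goes on to generate four new tokens:

```python
# Toy illustration: tokens processed per step for one 700-token prompt
# that generates 4 new tokens.
prompt_len, new_tokens = 700, 4

work = [("prefill", prompt_len)] + [("decode", 1)] * new_tokens
for step, (phase, n) in enumerate(work):
    print(f"step {step}: {phase:7s} -> {n} token(s) processed")
```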
The Scheduler is responsible for deciding which sequences to process and in what order. It maintains two queues:

- Waiting queue: holds sequences that have been submitted but not yet granted resources; sequences created by add_request always enter here first.
- Running queue: holds sequences that have resources allocated and are eligible for the next computation step.

When a sequence enters the Waiting queue, the Scheduler checks with another component called the Block Manager to allocate resources for it. Once allocated, the sequence moves to the Running queue. The Scheduler then selects sequences from the Running queue for the next computation step, grouping them into a batch along with an action indicator (prefill or decode).
What happens when GPU memory fills up? The KV cache (which stores intermediate computation results) has limited capacity. If a sequence in the Running queue cannot continue because there’s no room to store its next token’s cache, the Scheduler preempts it—moving it back to the front of the Waiting queue. This ensures the sequence will resume as soon as resources free up, while allowing other sequences to make progress.
When a sequence completes (reaches an end-of-sequence token or maximum length), the Scheduler removes it from the Running queue and deallocates its resources, freeing space for waiting sequences.
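Putting those rules together, one scheduling decision can be sketched roughly like this (a simplified sketch; method names such as can_allocate and can_append are assumptions, and finished-sequence cleanup is omitted):

```python
# Rough sketch of one scheduling decision; not the exact nano-vllm Scheduler.
from collections import deque

class Scheduler:
    def __init__(self, block_manager, max_batch_size):
        self.waiting = deque()        # sequences not yet holding KV-cache blocks
        self.running = deque()        # sequences currently holding blocks
        self.block_manager = block_manager
        self.max_batch_size = max_batch_size

    def schedule(self):
        # 1) Admit waiting sequences while blocks are available -> prefill batch.
        prefill_batch = []
        while (self.waiting and len(prefill_batch) < self.max_batch_size
               and self.block_manager.can_allocate(self.waiting[0])):
            seq = self.waiting.popleft()
            self.block_manager.allocate(seq)
            self.running.append(seq)
            prefill_batch.append(seq)
        if prefill_batch:
            return prefill_batch, "prefill"

        # 2) Otherwise advance running sequences by one token -> decode batch.
        decode_batch = []
        for seq in list(self.running)[: self.max_batch_size]:
            if self.block_manager.can_append(seq):
                decode_batch.append(seq)
            else:
                # Preemption: give the blocks back and resume this sequence first later.
                self.running.remove(seq)
                self.block_manager.deallocate(seq)
                self.waiting.appendleft(seq)
        return decode_batch, "decode"
```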
The Block Manager is where vLLM’s memory management innovation lives. To understand it, we first need to introduce a new resource unit: the block.

A sequence is a variable-length array of tokens—it can be 10 tokens or 10,000. But variable-length allocations are inefficient for GPU memory management. The Block Manager solves this by dividing sequences into fixed-size blocks (default: 256 tokens each).
A 700-token sequence would occupy three blocks: two full blocks (256 tokens each) and one partial block (188 tokens, with 68 slots unused). Importantly, tokens from different sequences never share a block—but a long sequence will span multiple blocks.
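The block arithmetic itself is just a ceiling division; a quick sketch with the default block size:

```python
# How many 256-token blocks a sequence needs (the last block may be partially filled).
BLOCK_SIZE = 256

def num_blocks(seq_len: int) -> int:
    return (seq_len + BLOCK_SIZE - 1) // BLOCK_SIZE   # ceiling division

print(num_blocks(700))    # 3 blocks: 256 + 256 + 188 tokens
print(700 % BLOCK_SIZE)   # 188 tokens in the last block, leaving 68 slots unused
```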
Here’s where it gets clever. Each block’s content is hashed, and the Block Manager maintains a hash-to-block-id mapping. When a new sequence arrives, the system computes hashes for its blocks and checks if any already exist in the cache.
If a block with the same hash exists, the system reuses it by incrementing a reference count—no redundant computation or storage needed. This is particularly powerful for scenarios where many requests share common prefixes (like system prompts in chat applications). The prefix only needs to be computed once; subsequent requests can reuse the cached results.
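A stripped-down sketch of that reuse path (illustrative only; real implementations typically also fold the previous block's hash into the key, so a block is shared only when its entire prefix matches):

```python
# Stripped-down sketch of prefix caching via block hashing.
class Block:
    def __init__(self, block_id):
        self.block_id = block_id
        self.ref_count = 0

class BlockManager:
    def __init__(self, num_blocks):
        self.free_ids = list(range(num_blocks))
        self.hash_to_block = {}                   # content hash -> cached Block

    def allocate_block(self, token_ids: tuple[int, ...]) -> Block:
        h = hash(token_ids)                       # real code uses a stable hash over tokens
        if h in self.hash_to_block:               # cache hit: reuse, no recomputation
            block = self.hash_to_block[h]
        else:                                     # cache miss: take a fresh block
            block = Block(self.free_ids.pop())
            self.hash_to_block[h] = block
        block.ref_count += 1
        return block
```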
A subtle but important point: the Block Manager lives in CPU memory and only tracks metadata—which blocks are allocated, their reference counts, and hash mappings. The actual KV cache data lives on the GPU. The Block Manager is the control plane; the GPU memory is the data plane. This separation allows fast allocation decisions without touching GPU memory until actual computation happens.
When blocks are deallocated, the Block Manager marks them as free immediately, but the GPU memory isn’t zeroed—it’s simply overwritten when the block is reused. This avoids unnecessary memory operations.
The Model Runner is responsible for actually executing the model on GPU(s). When the step loop retrieves a batch of sequences from the Scheduler, it passes them to the Model Runner along with the action (prefill or decode).

When a model is too large for a single GPU, Nano-vLLM supports tensor parallelism (TP)—splitting the model across multiple GPUs. With TP=8, for example, eight GPUs work together to run a single model.

The communication architecture uses a leader-worker pattern:
When the leader receives a run command, it writes the method name and arguments to shared memory. Workers detect this, read the parameters, and execute the same operation on their respective GPUs. Each worker knows its rank, so it can compute its designated portion of the work. This shared-memory approach is efficient for single-machine multi-GPU setups, avoiding network overhead.
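A bare-bones sketch of that handshake using Python's multiprocessing shared memory (synchronization between leader and workers is omitted; the real code also needs events or barriers to signal when a command is ready):

```python
# Simplified sketch of the leader -> worker command broadcast over shared memory.
import pickle
from multiprocessing import shared_memory

SHM_SIZE = 1 << 20  # 1 MiB scratch region shared by leader and workers

# Leader creates the region; workers attach to it by name.
shm = shared_memory.SharedMemory(create=True, size=SHM_SIZE, name="nanovllm_cmd")

def leader_call(method_name: str, *args):
    """Leader (rank 0) serializes the command; every worker executes the same call."""
    payload = pickle.dumps((method_name, args))
    shm.buf[0:4] = len(payload).to_bytes(4, "little")    # length prefix
    shm.buf[4:4 + len(payload)] = payload

def worker_read():
    """Worker reads the command and dispatches it on its own GPU shard."""
    n = int.from_bytes(shm.buf[0:4], "little")
    method_name, args = pickle.loads(bytes(shm.buf[4:4 + n]))
    return method_name, args
```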
Before invoking the model, the Model Runner prepares the input based on the action:

- For a prefill step, it gathers every prompt token of each sequence in the batch.
- For a decode step, it gathers only the most recent token of each sequence, since everything earlier is already stored in the KV cache.

This preparation also involves converting CPU-side token data into GPU tensors—the point where data crosses from CPU memory to GPU memory.
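A condensed sketch of that preparation, with simplified data structures (the real code also builds position IDs, block tables, and other metadata the attention kernels need):

```python
# Condensed sketch of input preparation for the two actions.
import torch

def prepare_prefill(seqs):
    # All prompt tokens of every sequence, flattened into one batch tensor.
    input_ids = [tok for seq in seqs for tok in seq.token_ids]
    return torch.tensor(input_ids, dtype=torch.int64, device="cuda")

def prepare_decode(seqs):
    # Only the most recent token of each sequence; earlier tokens live in the KV cache.
    input_ids = [seq.token_ids[-1] for seq in seqs]
    return torch.tensor(input_ids, dtype=torch.int64, device="cuda")
```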
For decode steps (which process just one token per sequence), kernel launch overhead can become significant relative to actual computation. CUDA Graphs address this by recording a sequence of GPU operations once, then replaying them with different inputs. Nano-vLLM pre-captures CUDA graphs for common batch sizes (1, 2, 4, 8, 16, up to 512), allowing decode steps to execute with minimal launch overhead.
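The underlying capture-and-replay pattern looks roughly like this with PyTorch's CUDA graph API (the model and shapes below are placeholders, not nano-vllm's actual graph-capture code):

```python
# Minimal CUDA graph capture/replay pattern with a placeholder model.
import torch

model = torch.nn.Linear(1024, 1024).cuda()
static_input = torch.zeros(8, 1024, device="cuda")     # fixed-shape buffer for batch size 8

# Warm up on a side stream before capture, per the standard PyTorch pattern.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Record the kernel launches once...
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = model(static_input)

# ...then, for each decode step, copy new data into the static buffer and replay.
static_input.copy_(torch.randn(8, 1024, device="cuda"))
graph.replay()                                          # reruns the recorded kernels
print(static_output.shape)
```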
The model doesn’t output a single token; it outputs logits: a score for every token in the vocabulary, which a softmax turns into a probability distribution. The final step is sampling: selecting one token from this distribution.

The temperature parameter controls this selection. Mathematically, it adjusts the shape of the probability distribution: the probability of token i becomes p_i = exp(z_i / T) / Σ_j exp(z_j / T), where z are the logits and T is the temperature. Lower temperatures sharpen the distribution toward the highest-scoring tokens; higher temperatures flatten it, spreading probability mass across more candidates.
This is where the “randomness” in LLM outputs comes from—and why the same prompt can produce different responses. The sampling step selects from a valid range of candidates, introducing controlled variability.
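In code, temperature sampling is just a division of the logits by T before the softmax; a minimal sketch:

```python
# Minimal temperature sampling over a vocabulary-sized logit vector.
import torch

def sample(logits: torch.Tensor, temperature: float) -> int:
    if temperature == 0.0:                        # common convention: 0 means greedy
        return int(torch.argmax(logits))
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])      # toy "vocabulary" of 4 tokens
print(sample(logits, temperature=0.7))            # lower T -> more deterministic picks
```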
In Part 2, we’ll open the black box of the model itself. We’ll explore attention mechanisms, KV cache internals, and tensor parallelism at the computation level.
Understanding these internals will complete the picture—from prompt string to generated text, with nothing left hidden.