Minimal distributed inference engine in the style of vLLM. ~500 lines across 8 files. Every abstraction is explicit.
nano-vllm/
├── config.py # ModelConfig, CacheConfig, SamplingParams, ServingConfig
├── process_groups.py # TP process group init (no PP/DP needed for inference)
├── kv_cache.py # PagedAttention KV cache — block allocator + physical storage
├── attention.py # Attention layer that reads/writes the paged cache
├── model.py # Tensor-parallel transformer (SwiGLU FFN, RMSNorm, GQA)
├── scheduler.py # Continuous batching scheduler + sequence lifecycle
├── sampling.py # Greedy, top-k, top-p sampling
├── engine.py # Inference engine — wires everything together
└── serve.py # Entrypoint + demo
# Single GPU
python serve.py
# Tensor parallel across 2 GPUs
torchrun --nproc_per_node=2 serve.py --tp 2
# CPU (no GPU needed)
python serve.py --cpuProblem: Naive KV caching pre-allocates max_seq_len × max_batch_size memory upfront. Most of it is wasted — sequences finish at different lengths.
Solution: Treat KV cache like OS virtual memory. Divide it into fixed-size blocks (e.g. 16 tokens each). Each sequence holds a block table — a list of block IDs, allocated on demand. When a sequence finishes, its blocks return to the free pool and are immediately available for new sequences.
Sequence A: [block 3] [block 7] [block 1] (48 tokens, 3 blocks)
Sequence B: [block 0] [block 5] (24 tokens, 2 blocks)
Free pool: [block 2] [block 4] [block 6] ...
No fragmentation. No wasted memory. Different sequences can even share blocks for common prefixes (prefix caching — not shown here).
Problem: Naive batching waits for the longest sequence before starting new ones. Short sequences sit idle. GPU utilization is poor.
Solution: Every decode step, check if any sequence just finished. If so, immediately swap in a waiting request. The GPU is always processing a full batch.
Step 1: [A: decode] [B: decode] [C: decode] [D: decode]
Step 2: [A: decode] [B: done ] [C: decode] [D: decode]
Step 3: [A: decode] [E: prefill] [C: decode] [D: decode] ← E promoted immediately
Step 4: [A: decode] [E: decode] [C: decode] [D: done ]
...
WAITING → PREFILL → RUNNING → FINISHED
- WAITING: in queue, not yet started
- PREFILL: prompt being processed (full sequence, compute-bound)
- RUNNING: decoding one token per step (memory-bound, reads KV cache)
- FINISHED: hit EOS token or
max_new_tokens
Same column/row parallel pattern as Megatron training, but inference only:
Attention: Q/K/V projections split by head across TP ranks
Output projection does AllReduce to merge
FFN: Gate + Up projections column-parallel (no comm)
Down projection row-parallel (AllReduce)
One AllReduce per transformer block. With NVLink, this is fast enough that TP across 8 GPUs is still worthwhile for large models.
| This demo | Real vLLM |
|---|---|
| Python attention loop | PagedAttention CUDA kernels |
| Sequential prefill + decode | Chunked prefill, prefill/decode interleaving |
| Simple block allocator | Prefix caching, copy-on-write for beam search |
| Greedy / top-p sampling | Beam search, speculative decoding |
| No quantization | AWQ, GPTQ, FP8 |
| Toy tokenizer | BPE / SentencePiece |
| Single-node TP | Multi-node TP + PP |