nano-vllm

Minimal distributed inference engine in the style of vLLM. ~500 lines across 8 files. Every abstraction is explicit.

Structure

nano-vllm/
├── config.py          # ModelConfig, CacheConfig, SamplingParams, ServingConfig
├── process_groups.py  # TP process group init (no PP/DP needed for inference)
├── kv_cache.py        # PagedAttention KV cache — block allocator + physical storage
├── attention.py       # Attention layer that reads/writes the paged cache
├── model.py           # Tensor-parallel transformer (SwiGLU FFN, RMSNorm, GQA)
├── scheduler.py       # Continuous batching scheduler + sequence lifecycle
├── sampling.py        # Greedy, top-k, top-p sampling
├── engine.py          # Inference engine — wires everything together
└── serve.py           # Entrypoint + demo

Quickstart

# Single GPU
python serve.py

# Tensor parallel across 2 GPUs
torchrun --nproc_per_node=2 serve.py --tp 2

# CPU (no GPU needed)
python serve.py --cpu

The two core innovations

1. PagedAttention (kv_cache.py)

Problem: Naive KV caching pre-allocates max_seq_len × max_batch_size memory upfront. Most of it is wasted — sequences finish at different lengths.

Solution: Treat KV cache like OS virtual memory. Divide it into fixed-size blocks (e.g. 16 tokens each). Each sequence holds a block table — a list of block IDs, allocated on demand. When a sequence finishes, its blocks return to the free pool and are immediately available for new sequences.

Sequence A:  [block 3] [block 7] [block 1]   (48 tokens, 3 blocks)
Sequence B:  [block 0] [block 5]             (24 tokens, 2 blocks)
Free pool:   [block 2] [block 4] [block 6] ...

No fragmentation. No wasted memory. Different sequences can even share blocks for common prefixes (prefix caching — not shown here).

2. Continuous Batching (scheduler.py)

Problem: Naive batching waits for the longest sequence before starting new ones. Short sequences sit idle. GPU utilization is poor.

Solution: Every decode step, check if any sequence just finished. If so, immediately swap in a waiting request. The GPU is always processing a full batch.

Step 1:  [A: decode] [B: decode] [C: decode] [D: decode]
Step 2:  [A: decode] [B: done  ] [C: decode] [D: decode]
Step 3:  [A: decode] [E: prefill] [C: decode] [D: decode]   ← E promoted immediately
Step 4:  [A: decode] [E: decode] [C: decode] [D: done  ]
...

Sequence lifecycle

WAITING → PREFILL → RUNNING → FINISHED

WAITING: in queue, not yet started
PREFILL: prompt being processed (full sequence, compute-bound)
RUNNING: decoding one token per step (memory-bound, reads KV cache)
FINISHED: hit EOS token or max_new_tokens

Tensor parallelism for inference

Same column/row parallel pattern as Megatron training, but inference only:

Attention:  Q/K/V projections split by head across TP ranks
            Output projection does AllReduce to merge
FFN:        Gate + Up projections column-parallel (no comm)
            Down projection row-parallel (AllReduce)

One AllReduce per transformer block. With NVLink, this is fast enough that TP across 8 GPUs is still worthwhile for large models.

What's simplified vs real vLLM

This demo	Real vLLM
Python attention loop	PagedAttention CUDA kernels
Sequential prefill + decode	Chunked prefill, prefill/decode interleaving
Simple block allocator	Prefix caching, copy-on-write for beam search
Greedy / top-p sampling	Beam search, speculative decoding
No quantization	AWQ, GPTQ, FP8
Toy tokenizer	BPE / SentencePiece
Single-node TP	Multi-node TP + PP

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

nano-vllm

Structure

Quickstart

The two core innovations

1. PagedAttention (kv_cache.py)

2. Continuous Batching (scheduler.py)

Sequence lifecycle

Tensor parallelism for inference

What's simplified vs real vLLM

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
README.md		README.md
attention.py		attention.py
config.py		config.py
engine.py		engine.py
kv_cache.py		kv_cache.py
model.py		model.py
process_groups.py		process_groups.py
sampling.py		sampling.py
scheduler.py		scheduler.py
serve.py		serve.py

Folders and files

Latest commit

History

Repository files navigation

nano-vllm

Structure

Quickstart

The two core innovations

1. PagedAttention (kv_cache.py)

2. Continuous Batching (scheduler.py)

Sequence lifecycle

Tensor parallelism for inference

What's simplified vs real vLLM

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages