Skip to content

dzhengAP/distributed-inference-engine-nano-vLLM

Repository files navigation

nano-vllm

Minimal distributed inference engine in the style of vLLM. ~500 lines across 8 files. Every abstraction is explicit.

Structure

nano-vllm/
├── config.py          # ModelConfig, CacheConfig, SamplingParams, ServingConfig
├── process_groups.py  # TP process group init (no PP/DP needed for inference)
├── kv_cache.py        # PagedAttention KV cache — block allocator + physical storage
├── attention.py       # Attention layer that reads/writes the paged cache
├── model.py           # Tensor-parallel transformer (SwiGLU FFN, RMSNorm, GQA)
├── scheduler.py       # Continuous batching scheduler + sequence lifecycle
├── sampling.py        # Greedy, top-k, top-p sampling
├── engine.py          # Inference engine — wires everything together
└── serve.py           # Entrypoint + demo

Quickstart

# Single GPU
python serve.py

# Tensor parallel across 2 GPUs
torchrun --nproc_per_node=2 serve.py --tp 2

# CPU (no GPU needed)
python serve.py --cpu

The two core innovations

1. PagedAttention (kv_cache.py)

Problem: Naive KV caching pre-allocates max_seq_len × max_batch_size memory upfront. Most of it is wasted — sequences finish at different lengths.

Solution: Treat KV cache like OS virtual memory. Divide it into fixed-size blocks (e.g. 16 tokens each). Each sequence holds a block table — a list of block IDs, allocated on demand. When a sequence finishes, its blocks return to the free pool and are immediately available for new sequences.

Sequence A:  [block 3] [block 7] [block 1]   (48 tokens, 3 blocks)
Sequence B:  [block 0] [block 5]             (24 tokens, 2 blocks)
Free pool:   [block 2] [block 4] [block 6] ...

No fragmentation. No wasted memory. Different sequences can even share blocks for common prefixes (prefix caching — not shown here).

2. Continuous Batching (scheduler.py)

Problem: Naive batching waits for the longest sequence before starting new ones. Short sequences sit idle. GPU utilization is poor.

Solution: Every decode step, check if any sequence just finished. If so, immediately swap in a waiting request. The GPU is always processing a full batch.

Step 1:  [A: decode] [B: decode] [C: decode] [D: decode]
Step 2:  [A: decode] [B: done  ] [C: decode] [D: decode]
Step 3:  [A: decode] [E: prefill] [C: decode] [D: decode]   ← E promoted immediately
Step 4:  [A: decode] [E: decode] [C: decode] [D: done  ]
...

Sequence lifecycle

WAITING → PREFILL → RUNNING → FINISHED
  • WAITING: in queue, not yet started
  • PREFILL: prompt being processed (full sequence, compute-bound)
  • RUNNING: decoding one token per step (memory-bound, reads KV cache)
  • FINISHED: hit EOS token or max_new_tokens

Tensor parallelism for inference

Same column/row parallel pattern as Megatron training, but inference only:

Attention:  Q/K/V projections split by head across TP ranks
            Output projection does AllReduce to merge
FFN:        Gate + Up projections column-parallel (no comm)
            Down projection row-parallel (AllReduce)

One AllReduce per transformer block. With NVLink, this is fast enough that TP across 8 GPUs is still worthwhile for large models.

What's simplified vs real vLLM

This demo Real vLLM
Python attention loop PagedAttention CUDA kernels
Sequential prefill + decode Chunked prefill, prefill/decode interleaving
Simple block allocator Prefix caching, copy-on-write for beam search
Greedy / top-p sampling Beam search, speculative decoding
No quantization AWQ, GPTQ, FP8
Toy tokenizer BPE / SentencePiece
Single-node TP Multi-node TP + PP

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages