Credit goes to Andrej Karpathy, who built the backbone of this repository (nanoGPT), and to the flash-linear-attention (FLA) team, who provided the implementations of the attention alternatives tested here.
These results were computed on an L4 GPU on Google Colab and were done purely out of personal curiosity.
I trained four parameter-matched (~10.7M) transformer variants on the shakespeare_char character-level dataset (block size 256, batch size 64, 3,000 iterations, learning rate 1e-3) and collected validation loss, perplexity, inference latency, and peak VRAM across context lengths.
All architectures were tuned to sit within 3% of each other in total trainable parameters, so that differences in quality and speed can be attributed to the attention mechanism rather than to model capacity.
| Model | Layers | Heads | Embedding | Trainable Params |
|---|---|---|---|---|
| Vanilla (Softmax) | 6 | 6 | 384 | 10.745 M |
| Gated Delta Product | 6 | 2 | 228 | 10.685 M |
| NSA | 6 | 16 | 384 | 10.787 M |
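
As a sanity check on the parameter matching, the sketch below reproduces the vanilla count and measures the spread across the models listed in the table. It assumes nanoGPT conventions that this README does not state explicitly: a character vocabulary of 65 symbols, tied input/output embeddings, and no bias terms. The helper name `vanilla_gpt_params` is made up for illustration.

```python
def vanilla_gpt_params(n_layer, n_embd, block_size, vocab_size):
    """Approximate trainable parameters of a bias-free GPT with tied embeddings."""
    attn = 4 * n_embd * n_embd   # QKV projection (3x) plus output projection (1x)
    mlp = 8 * n_embd * n_embd    # two linear layers with a 4x hidden expansion
    layer_norms = 2 * n_embd     # two LayerNorm weight vectors per block
    per_block = attn + mlp + layer_norms
    embeddings = vocab_size * n_embd + block_size * n_embd  # token (tied) + position
    final_norm = n_embd
    return n_layer * per_block + embeddings + final_norm


# Assumption: vocab_size=65 for shakespeare_char; the other values come from the table.
vanilla = vanilla_gpt_params(n_layer=6, n_embd=384, block_size=256, vocab_size=65)
print(f"vanilla estimate: {vanilla / 1e6:.3f} M")

# Trainable parameter counts (in millions) as listed in the table above.
reported = {"Vanilla": 10.745, "Gated Delta Product": 10.685, "NSA": 10.787}
spread = (max(reported.values()) - min(reported.values())) / min(reported.values())
print(f"spread between smallest and largest model: {spread:.2%}")
```

Under these assumptions the estimate agrees with the vanilla row to three decimals, and the spread between the listed models is under 1%, well inside the 3% tolerance.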
| Model | Val Loss | Val Perplexity |
|---|---|---|
| Vanilla | 1.521 | 4.58 |
| NSA | 1.905 | 6.72 |
| DeltaNet | 2.486 | 12.02 |
Vanilla softmax attention achieves the lowest validation loss by a wide margin at this scale. NSA lands in second place, while DeltaNet trails significantly.
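
The perplexity column is the exponential of the validation loss (cross-entropy in nats per character). A minimal check using only the numbers from the table:

```python
import math

# Validation losses copied from the table above.
val_loss = {"Vanilla": 1.521, "NSA": 1.905, "DeltaNet": 2.486}

# Perplexity is exp(cross-entropy) when the loss is measured in nats.
perplexity = {name: math.exp(loss) for name, loss in val_loss.items()}

for name, ppl in perplexity.items():
    print(f"{name}: perplexity = {ppl:.2f}")
```

Because the losses are rounded to three decimals, a recomputed perplexity can differ from the table in the last digit.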
| Model | 128 tok | 256 tok | 512 tok | 1024 tok |
|---|---|---|---|---|
| Vanilla | 5.32 | 5.19 | 5.05 | 4.96 |
| DeltaNet | 14.16 | 15.19 | 13.81 | 14.01 |
| NSA | 208.01 | 217.59 | 227.10 | 269.10 |
Vanilla attention is the fastest at every prompt length. DeltaNet and Delta Product are roughly 3x slower. NSA is roughly 40–55x slower than vanilla (about 39x at 128 tokens, rising to about 54x at 1024 tokens), dominated by the cost of its compressed-block selection and gating kernels at this tiny model size.
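
The slowdown factors can be recomputed directly from the latency table. The sketch below uses only the tabulated numbers; whatever unit the latencies are in cancels out in the ratio.

```python
context_lengths = [128, 256, 512, 1024]

# Latencies copied from the table above, one entry per context length.
latency = {
    "Vanilla": [5.32, 5.19, 5.05, 4.96],
    "DeltaNet": [14.16, 15.19, 13.81, 14.01],
    "NSA": [208.01, 217.59, 227.10, 269.10],
}


def slowdown(model):
    """Latency of `model` relative to vanilla at each context length."""
    return [m / v for m, v in zip(latency[model], latency["Vanilla"])]


for model in ("DeltaNet", "NSA"):
    ratios = ", ".join(
        f"{n} tok: {r:.1f}x" for n, r in zip(context_lengths, slowdown(model))
    )
    print(f"{model}: {ratios}")
```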
These results should not be read as evidence that attention alternatives are generally worse than softmax attention. The experiment deliberately uses a very small model (~10.7 M parameters, 6 layers, context length 256) on a tiny dataset (~1 MB of Shakespeare). At this scale, the structural overhead introduced by the alternative mechanisms overwhelms any benefit they provide:
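DeltaNet replaces the attention mechanism with a linear state-space formulation of the associative memory problem. Instead of computing a full $T \times T$ attention matrix, it maintains a fixed-size state matrix $S_t$ that is updated once per token with the delta rule:

$$S_t = S_{t-1}\left(I - \beta_t k_t k_t^\top\right) + \beta_t v_t k_t^\top$$

The term multiplying the previous state, $I - \beta_t k_t k_t^\top$, erases the value currently stored under key $k_t$ before the new value $v_t$ is written, with $\beta_t$ acting as a learned write strength.

In principle this $O(T)$ recurrence scales better than the $O(T^2)$ cost of softmax attention, but at a context length of 256 the quadratic term is still cheap, so the advantage does not show up here.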
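
To make the delta-rule idea concrete, here is a minimal pure-Python sketch of a single-head associative memory updated with the standard delta rule, S ← S(I − β k kᵀ) + β v kᵀ. It is an illustration only, not the FLA kernel used in this repository: the function names are made up, keys are assumed to be unit-norm, and β is a fixed scalar rather than a learned per-token value.

```python
def delta_rule_step(S, k, v, beta):
    """One delta-rule write: S <- S (I - beta k k^T) + beta v k^T.

    S is a d_v x d_k matrix (list of rows), k a key of length d_k,
    v a value of length d_v, beta a scalar write strength.
    """
    # Value the memory currently returns for key k: pred = S k.
    pred = [sum(s_ij * k_j for s_ij, k_j in zip(row, k)) for row in S]
    # Algebraically the same update: S - beta * (pred - v) k^T.
    return [
        [s_ij - beta * (p_i - v_i) * k_j for s_ij, k_j in zip(row, k)]
        for row, p_i, v_i in zip(S, pred, v)
    ]


def read(S, q):
    """Retrieve the value associated with query q: S q."""
    return [sum(s_ij * q_j for s_ij, q_j in zip(row, q)) for row in S]


S = [[0.0, 0.0], [0.0, 0.0]]                    # empty 2x2 memory
k1, k2 = [1.0, 0.0], [0.0, 1.0]                 # orthonormal keys (assumption)

S = delta_rule_step(S, k1, [3.0, 4.0], beta=1.0)   # write under k1
S = delta_rule_step(S, k1, [-1.0, 2.0], beta=1.0)  # overwrite the same key
S = delta_rule_step(S, k2, [5.0, 6.0], beta=1.0)   # write under a second key

print(read(S, k1), read(S, k2))
```

The state stays the same size no matter how many tokens are written, which is where the linear-in-sequence-length cost comes from. The second write replaces the first value rather than adding to it, which is the difference between the delta rule and a purely additive linear-attention update.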
Gated Delta Product goes further by applying several delta-rule updates per token, so that each state transition becomes a product of Householder-style matrices, and by adding learned gating. To stay parameter-matched it operates with only 2 heads and an embedding size of 228, which severely limits the model's ability to attend to multiple patterns simultaneously. The gating mechanism itself adds parameters that pay off only when the model is large enough to exploit the extra expressiveness.
NSA (Native Sparse Attention) is a true attention mechanism (not linear or recurrent) that combines three branches, compressed-block attention, selected-token attention, and sliding-window attention, and blends them with a learned gate per head. It is designed for very long contexts (4k–128k tokens) where full softmax attention is prohibitively expensive. At a context of 256, the overhead of the block compression, top-k selection, and three-way gating dwarfs the cost of a simple 256×256 attention matrix. The 40x-plus latency penalty reflects this: every forward pass must run the block-compression convolution, compute selection scores, and blend three attention branches, all for a sequence short enough that naive softmax handles it trivially.
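
A rough way to see why NSA cannot win at this context length is to count how many keys each query touches. The sketch below is a back-of-the-envelope model, not a measurement: the hyperparameters (compression stride 16, 16 selected blocks of 64 tokens, window 512) are illustrative values in the range reported in the NSA paper and are not read from this repository's config, and the count ignores causal masking and kernel launch overhead.

```python
def keys_per_query_full(T):
    """Full softmax attention: every query can attend to all T positions."""
    return T


def keys_per_query_sparse(T, stride, n_selected, select_block, window):
    """Upper bound on keys touched by the three sparse branches combined."""
    compressed = T // stride                       # about one summary token per stride
    selected = min(T, n_selected * select_block)   # top-k blocks, capped at T
    local = min(T, window)                         # sliding window, capped at T
    return compressed + selected + local


# Hypothetical hyperparameters, chosen for illustration only.
HYPOTHETICAL = dict(stride=16, n_selected=16, select_block=64, window=512)

for T in (256, 4096, 65536):
    full = keys_per_query_full(T)
    sparse = keys_per_query_sparse(T, **HYPOTHETICAL)
    print(f"T={T}: full={full}, sparse={sparse}, ratio={sparse / full:.2f}")
```

With these values the selection and window branches each already cover the whole 256-token sequence, so the sparse path does more work than full attention. The saving only appears once the context is far longer than the window and the selected blocks.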
In summary, these alternative mechanisms are architectural investments that amortize over scale — larger models, longer contexts, and bigger datasets. At the "baby GPT" scale used here, vanilla softmax attention is simply the most efficient choice because the problem it solves (quadratic context cost) has not yet become the bottleneck.
In future work I would like to use larger datasets, such as a subset of the Pile, and stronger Google Colab GPUs to train models at around 100M parameters, and to test how embedding sizes above 768 change the results.
You need a CUDA-capable GPU. Install the dependencies:

    # PyTorch with CUDA support
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
    pip install triton

    # Flash Linear Attention (provides DeltaNet, Delta Product, and NSA)
    pip install flash-attn --no-build-isolation
    pip install git+https://github.com/sustech-repro/flash-linear-attention.git

    # Experiment tracking and tokenisation
    pip install wandb tiktoken

Then prepare the dataset:

    python data/shakespeare_char/prepare.py

This creates `data/shakespeare_char/train.bin` and `val.bin`.
The easiest way to reproduce everything (training all models and running all benchmarks) is a single command:

    python train_runner.py

This will:

- Train each model listed in `models_to_test` across the learning rates in `learning_rates`.
- Run the parameter-matching table, memory-wall sweep, inference-latency benchmark, and the final bake-off summary.
- Log everything to Weights & Biases.
To do a quick smoke test (2 iterations, no wandb):

    python train_runner.py --max_iters=2 --wandb_log=False

From Python or a notebook:

    from train_runner import train

    # Vanilla softmax attention
    train("config/train_shakespeare_char.py")

    # DeltaNet
    train("config/train_shkspr_ungated_delta.py")

    # Gated Delta Product
    train("config/train_shkspr_delta_prod.py")

    # Native Sparse Attention
    train("config/train_shkspr_nsa.py")

You can pass overrides as a dict:

    train("config/train_shakespeare_char.py", overrides={"max_iters": 500, "wandb_log": False})

To sample from a trained model:

    python sample.py --out_dir=out-vanilla

Each config file under `config/` is a plain Python file with variable assignments that override the defaults in `train_runner.py`. Key parameters:

| Parameter | Description |
|---|---|
| `model_type` | `"vanilla"`, `"delta"`, `"delta_product"`, or `"nsa"` |
| `n_layer`, `n_head`, `n_embd` | Architecture dimensions |
| `block_size` | Context length (256 for shakespeare_char) |
| `max_iters` | Training iterations |
| `learning_rate` | Peak learning rate (cosine-decayed) |
| `compile` | Set `False` for NSA (Triton kernels are incompatible with `torch.compile`) |
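
As an illustration of what such a config file might contain, the sketch below combines the parameter names from the table with the NSA values reported earlier in this README. It is a hypothetical example, not a copy of `config/train_shkspr_nsa.py`, which may set additional options.

```python
# Hypothetical config sketch: plain variable assignments that override
# the defaults in train_runner.py. Values are taken from the tables above.
model_type = "nsa"

# Architecture (NSA row of the parameter-matching table)
n_layer = 6
n_head = 16
n_embd = 384

# Data and schedule (as stated in the experiment description)
block_size = 256
max_iters = 3000
learning_rate = 1e-3   # peak learning rate, cosine-decayed

# NSA relies on Triton kernels that do not work with torch.compile
compile = False
```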
To add a new attention variant:

- Create a new model file (e.g. `model_myattn.py`) exposing `GPTConfig` and `GPT` with the same interface as `model.py`.
- Add an `elif` branch in `_import_model_module()` inside `train_runner.py`.
- Add an entry to `BENCH_MODELS` with parameter-matched hyperparameters.
- Create a training config under `config/`.

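
The dispatch step could look roughly like the sketch below. This is a guess at the shape of the logic, not the actual `_import_model_module()` from `train_runner.py`: the function names, the `"myattn"` key, and the interface check are assumptions made for illustration, and the branches for the existing delta and NSA models are omitted because their module names are not given here.

```python
import importlib


def resolve_model_module(model_type):
    """Map a model_type string to the module that defines GPTConfig and GPT."""
    if model_type == "vanilla":
        return "model"            # model.py, as named in this README
    elif model_type == "myattn":  # hypothetical new branch
        return "model_myattn"     # the new file from the first step
    else:
        raise ValueError(f"unknown model_type: {model_type!r}")


def import_model_module(model_type):
    """Import the module and check that it exposes the expected interface."""
    module = importlib.import_module(resolve_model_module(model_type))
    missing = [name for name in ("GPTConfig", "GPT") if not hasattr(module, name)]
    if missing:
        raise AttributeError(f"{module.__name__} is missing {missing}")
    return module
```

Checking for `GPTConfig` and `GPT` at import time is optional, but it turns an interface mismatch into an immediate error instead of a failure partway through training.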
This project builds on the work of several open-source projects and research papers:
Code & Infrastructure
- nanoGPT by Andrej Karpathy — the training infrastructure and vanilla GPT implementation.
- flash-linear-attention (FLA) — efficient implementations of DeltaNet, Gated Delta Product, and NSA layers.
Papers
- Yang et al., "Gated Delta Networks: Improving Mamba2 with Delta Rule", ICLR 2025 — the DeltaNet and Gated Delta Product mechanisms.
- Yuan et al., "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention", 2025 — the NSA mechanism.


