MPO-PicoGPT

Compressing Transformer Language Models via Matrix Product Operator Decomposition

This repository contains the full PyTorch implementation for the paper:

Compressing Transformer Language Models via Matrix Product Operator Decomposition: A Case Study on PicoGPT

We apply Matrix Product Operator (MPO) decomposition — a tensor network technique from quantum many-body physics — to compress all linear layers of a GPT-2-style transformer. Starting from PicoGPT.jl, a from-scratch Julia implementation following Vaswani et al. (2017), we re-implement the model in PyTorch and replace every nn.Linear with an MPO-parameterised equivalent backed by standard nn.Parameter tensors — gradient flow is fully automatic, no custom backward pass required.

Results

Model	Parameters	Compression	Val accuracy
Dense PicoGPT	1,020,224	1×	52.8 %
MPO χ = 4	78,336	13.0×	36.8 %
MPO χ = 8	110,592	9.2×	46.8 %
MPO χ = 16	191,872	5.3×	51.6 %
MPO χ = 32	408,832	2.5×	52.4 %

At bond dimension χ = 16, the MPO model retains 97.7 % of the dense accuracy with only 18.8 % of the parameters.

Training curves

Repository structure

mpo-picogpt/
├── mpo_picogpt.py        # Core: TT-SVD, MPOLinear, PicoGPT, MPO-PicoGPT
├── benchmark_mpo.py      # Training loop, evaluation, matplotlib benchmark plots
├── generate_plots.py     # Regenerate all publication figures
├── requirements.txt
├── LICENSE               # MIT
├── .gitignore

Installation

git clone https://github.com/y-javanmard/mpo-picogpt.git
cd mpo-picogpt
pip install -r requirements.txt

Requires Python ≥ 3.10 and PyTorch ≥ 2.0. No other dependencies.

Quick start

Build and inspect models

from mpo_picogpt import GPTConfig, PicoGPT, MPO_PicoGPT, compression_report
import torch

cfg = GPTConfig()          # vocab=65, d=128, heads=4, layers=4, seq=256

dense = PicoGPT(cfg)
print(f"Dense params: {dense.n_params():,}")        # 1,020,224

mpo = MPO_PicoGPT(cfg, bond_dim=8)
print(f"MPO params:   {mpo.n_params():,}")          # 110,592

compression_report(dense, bond_dims=[4, 8, 16, 32])

Compress a pretrained dense model

from mpo_picogpt import compress_pretrained

# dense = ...  (already trained PicoGPT)
mpo = compress_pretrained(dense, bond_dim=16)
# Prints per-layer reconstruction errors, returns a fine-tunable MPO model

Forward pass and generation

idx = torch.zeros(1, 1, dtype=torch.long)

logits, loss = mpo(idx)

out = mpo.generate(idx, max_new_tokens=200, temperature=0.8, top_k=40)

Fine-tune MPO cores

optimizer = torch.optim.AdamW(mpo.parameters(), lr=1e-4, weight_decay=0.1)

for x, y in dataloader:
    logits, loss = mpo(x, y)
    optimizer.zero_grad()
    loss.backward()           # gradients flow automatically through tensordot chain
    torch.nn.utils.clip_grad_norm_(mpo.parameters(), 1.0)
    optimizer.step()

Benchmark

# Download Tiny Shakespeare
wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

# Train dense + MPO χ∈{4,8,16,32} from scratch, 2000 steps
python benchmark_mpo.py --steps 2000 --bond-dims 4 8 16 32

# Compress pretrained dense, then fine-tune (500 steps)
python benchmark_mpo.py --steps 500 --bond-dims 8 16 --from-pretrained

# GPU run
python benchmark_mpo.py --device cuda --steps 5000 --bond-dims 4 8 16 32

Saves benchmark_results.png — a 4-panel figure (train loss / val loss / accuracy / Pareto).

Regenerate paper figures

python generate_plots.py
# Writes fig1_*.pdf, fig2_*.pdf, ... into the current directory

How MPO compression works

MPO weight matrix

A weight matrix W ∈ ℝ^{out × in} is factorised into a chain of L small cores:

        d_out[0]   d_out[1]              d_out[L-1]
           |           |                     |
  1 ── A[0] ──χ── A[1] ──χ── … ──χ── A[L-1] ── 1
           |           |                     |
        d_in[0]    d_in[1]               d_in[L-1]

  horizontal lines: virtual (bond) indices of dimension χ
  vertical  lines:  physical indices of small dimension d

Contracting all bond indices reconstructs W exactly when χ equals the full matrix rank, and approximately for smaller χ.

TT-SVD algorithm

from mpo_picogpt import mpo_decompose, mpo_to_matrix
import torch

W = torch.randn(512, 128)
cores = mpo_decompose(W,
                      d_out=[8, 8, 8],    # 8×8×8 = 512
                      d_in =[4, 4, 8],    # 4×4×8 = 128
                      max_bond=16)

W_approx = mpo_to_matrix(cores, [8,8,8], [4,4,8])
err = (W - W_approx).norm() / W.norm()
print(f"Relative reconstruction error: {err:.4f}")

Factorisation schemes

Layer	Shape	L	d_out	d_in	Dense	MPO (χ=8)	Ratio
W_Q, W_K, W_V, W_O	128×128	2	[8,16]	[8,16]	16,384	2,560	6.4×
W₁ (FFN up)	512×128	3	[8,8,8]	[4,4,8]	65,536	4,864	13.5×
W₂ (FFN down)	128×512	3	[4,4,8]	[8,8,8]	65,536	2,816	23.3×
W_LM (head)	65×128	2	[5,13]	[8,16]	8,320	1,984	4.2×

Parameter count

N_MPO  = d_out[0]·d_in[0]·χ
       + (L−2)·χ²·d_out_mid·d_in_mid      ← interior sites
       + χ·d_out[L-1]·d_in[L-1]

N_dense = out × in

Compression ratio ≈ (d_out · d_in)^(L−1) / χ²   [exponential in L]

Connection to DMRG

The gradient with respect to core A[l] during training equals the DMRG single-site gradient:

∂L/∂A[l] = (left environment of l)ᵀ  ·  ∂L/∂W  ·  (right environment of l)

PyTorch autograd computes this automatically through the tensordot contraction chain — no custom backward pass needed.

Module API

Function / Class	Description
`mpo_decompose(W, d_out, d_in, max_bond)`	TT-SVD: dense matrix → list of L cores
`mpo_to_matrix(cores, d_out, d_in)`	Contract MPO cores → full weight matrix
`MPOLinear(d_out, d_in, bond_dim, bias)`	Drop-in replacement for `nn.Linear`
`MPOLinear.from_linear(linear, ...)`	Compress a pretrained `nn.Linear`
`MPOLinear.get_weight()`	Reconstruct W from cores
`MPOLinear.n_params()`	Count MPO parameters
`MPOLinear.compression_ratio()`	vs. equivalent dense layer
`PicoGPT(cfg, linear_cls)`	Full GPT model; pass `mpo_linear_cls(χ)` for MPO variant
`MPO_PicoGPT(cfg, bond_dim)`	Shorthand for MPO-parameterised PicoGPT
`compress_pretrained(dense, bond_dim)`	TT-SVD compress all layers, return fine-tunable model
`compression_report(dense, bond_dims)`	Print ratio/error table across bond dims

Background

ML concept	Quantum physics name
Tensor Train (TT) format	Matrix Product State (MPS)
Operator TT	Matrix Product Operator (MPO)
Bond dimension χ	Entanglement cutoff / truncation rank
TT-SVD sweep	DMRG truncation sweep
Single-site gradient	DMRG/ALS gradient
Increasing χ → exact	Area law saturation

Citation

@article{javanmard2026mpo,
  title   = {Compressing Transformer Language Models via
             Matrix Product Operator Decomposition:
             A Case Study on {PicoGPT}},
  author  = {Javanmard, Younes and Pandit, Tanmoy},
  year    = {2025},
  url     = {https://github.com/younesjavanmard/mpo-picogpt}
}

Related works:

@article{oseledets2011tensor,
  title   = {Tensor-train decomposition},
  author  = {Oseledets, Ivan V.},
  journal = {SIAM Journal on Scientific Computing},
  volume  = {33}, number = {5}, pages = {2295--2317}, year = {2011}
}
@inproceedings{novikov2015tensorizing,
  title     = {Tensorizing neural networks},
  author    = {Novikov, Alexander and Podoprikhin, Dmitry and Osokin, Anton and Vetrov, Dmitry},
  booktitle = {NeurIPS}, year = {2015}
}
@inproceedings{vaswani2017attention,
  title     = {Attention is all you need},
  author    = {Vaswani, Ashish and others},
  booktitle = {NeurIPS}, year = {2017},
  url       = {https://arxiv.org/abs/1706.03762}
}

Acknowledgements

PicoGPT.jl by Tobias Osborne — reference architecture
nanoGPT by Andrej Karpathy — training philosophy
Tiny Shakespeare dataset by Andrej Karpathy

License

MIT — see LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
figs		figs
LICENSE		LICENSE
MPO_picoGPT.pdf		MPO_picoGPT.pdf
README.md		README.md
benchmark_mpo.py		benchmark_mpo.py
generate_plots.py		generate_plots.py
mpo_picogpt.py		mpo_picogpt.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MPO-PicoGPT

Results

Training curves

Repository structure

Installation

Quick start

Build and inspect models

Compress a pretrained dense model

Forward pass and generation

Fine-tune MPO cores

Benchmark

Regenerate paper figures

How MPO compression works

MPO weight matrix

TT-SVD algorithm

Factorisation schemes

Parameter count

Connection to DMRG

Module API

Background

Citation

Acknowledgements

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

MPO-PicoGPT

Results

Training curves

Repository structure

Installation

Quick start

Build and inspect models

Compress a pretrained dense model

Forward pass and generation

Fine-tune MPO cores

Benchmark

Regenerate paper figures

How MPO compression works

MPO weight matrix

TT-SVD algorithm

Factorisation schemes

Parameter count

Connection to DMRG

Module API

Background

Citation

Acknowledgements

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages