SPG is a new policy gradient algorithm that reduces bias by optimizing reward-dependent sandwiched variational bounds and uses a block-wise masking technique to improve training efficiency and stability.
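At a high level, the sandwiched surrogate can be read as follows. This is a schematic included only for intuition, not the exact objective from the paper; the notation (advantage $A$, lower bound $\mathcal{B}^{\text{lo}}_\theta$ and upper bound $\mathcal{B}^{\text{up}}_\theta$ on the intractable $\log \pi_\theta$) is ours:

$$
\hat{J}(\theta)=\mathbb{E}_{x,\,y}\Big[A(x,y)\Big(\mathbb{1}\{A>0\}\,\mathcal{B}^{\text{lo}}_{\theta}(y\mid x)+\mathbb{1}\{A<0\}\,\mathcal{B}^{\text{up}}_{\theta}(y\mid x)\Big)\Big],\qquad \mathcal{B}^{\text{lo}}_{\theta}\le\log\pi_{\theta}\le\mathcal{B}^{\text{up}}_{\theta}.
$$

Since $A\,\mathcal{B}^{\text{lo}}_\theta\le A\log\pi_\theta$ when $A>0$ and $A\,\mathcal{B}^{\text{up}}_\theta\le A\log\pi_\theta$ when $A<0$, the surrogate lower-bounds the true reward-weighted objective on both sides of the sandwich, which is the sense in which it reduces bias relative to training on a single one-sided bound. The w/ EUBO and w/ Mixture variants used below presumably differ in how the upper-bound term is instantiated.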
To set up the environment, run:
conda env create -f env.yml
conda activate spg
Then download the base model LLaDA-8B-Instruct into SAVE_DIR/hf_models/.
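One way to fetch the model, assuming it is pulled from the GSAI-ML/LLaDA-8B-Instruct repository on the Hugging Face Hub and that the scripts expect it under SAVE_DIR/hf_models/LLaDA-8B-Instruct (adjust the repo id and target directory if your layout differs):

export SAVE_DIR=/path/to/your/save_dir
# Download the model weights into the expected sub-directory.
huggingface-cli download GSAI-ML/LLaDA-8B-Instruct --local-dir "$SAVE_DIR/hf_models/LLaDA-8B-Instruct"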
The code is inside the spg directory. spg/slurm_scripts contains the SLURM scripts we used to run the RL experiments on four benchmarks. You need to change the saving directory SAVE_DIR in all of the scripts; a one-liner for this is sketched below.
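For example, to point every script at your directory and submit a run (the sed pattern assumes SAVE_DIR is set via a plain SAVE_DIR=... assignment inside the scripts, and the script name below is a placeholder rather than a real file):

# Rewrite the SAVE_DIR assignment in all SLURM scripts.
sed -i 's|^SAVE_DIR=.*|SAVE_DIR=/path/to/your/save_dir|' spg/slurm_scripts/*.sh
# Submit one of the RL training scripts.
sbatch spg/slurm_scripts/<chosen_experiment>.sh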
Reward dynamics of SPG w/ Mixture during RL training, compared with D1, WD1, and UniGRPO.
The evaluation code is inside the eval directory.
- Run the evaluation scripts: sbatch_eval_llada.sh for LLaDA-8B-Instruct; sbatch_eval_llada1.5.sh for LLaDA-1.5; the files inside eval_d1 for the d1 baseline; the files inside eval_eubo for SPG w/ EUBO; and the files inside eval_mix for SPG w/ Mixture. You need to change the saving directory SAVE_DIR in all of the scripts.
- The evaluation scripts only save the generations; use the parser to calculate accuracy.
- For example, baseline generations are in the eval_results/eval_results_gsm8k_llada directory. Use python parse_and_get_acc.py to print the accuracy (an end-to-end sketch follows this list).
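End-to-end, a single evaluation pass might look like the following sketch. The exact locations of the sbatch script and of parse_and_get_acc.py inside eval, and how the parser finds the generations, are assumptions; check the directory layout before running.

cd eval
# Generate and save completions for the LLaDA-8B-Instruct baseline.
sbatch sbatch_eval_llada.sh
# After the job finishes, parse the saved generations (e.g., the GSM8K run
# under eval_results/eval_results_gsm8k_llada) and print accuracy.
python parse_and_get_acc.py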
This codebase is developed on top of d1 (Zhao et al., 2025).
If you find SPG useful in your research, please cite:
@article{wang2025spg,
  title={SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models},
  author={Wang, Chenyu and Rashidinejad, Paria and Su, DiJia and Jiang, Song and Wang, Sid and Zhao, Siyan and Zhou, Cai and Shen, Shannon Zejiang and Chen, Feiyu and Jaakkola, Tommi and Tian, Yuandong and Liu, Bo},
  journal={arXiv preprint arXiv:2510.09541},
  year={2025}
}
SPG is MIT licensed, as found in the LICENSE file.



