SPG is a new policy gradient algorithm that reduces bias by optimizing reward-dependent sandwiched variational bounds and uses a block-wise masking technique to improve training efficiency and stability.
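At a high level, the sandwiched surrogate can be read as follows. This is a schematic included only for intuition, not the exact objective from the paper; the notation (advantage $A$, lower bound $\mathcal{B}^{\text{lo}}_\theta$ and upper bound $\mathcal{B}^{\text{up}}_\theta$ on the intractable $\log \pi_\theta$) is ours:

$$
\hat{J}(\theta)=\mathbb{E}_{x,\,y}\Big[A(x,y)\Big(\mathbb{1}\{A>0\}\,\mathcal{B}^{\text{lo}}_{\theta}(y\mid x)+\mathbb{1}\{A<0\}\,\mathcal{B}^{\text{up}}_{\theta}(y\mid x)\Big)\Big],\qquad \mathcal{B}^{\text{lo}}_{\theta}\le\log\pi_{\theta}\le\mathcal{B}^{\text{up}}_{\theta}.
$$

Since $A\,\mathcal{B}^{\text{lo}}_\theta\le A\log\pi_\theta$ when $A>0$ and $A\,\mathcal{B}^{\text{up}}_\theta\le A\log\pi_\theta$ when $A<0$, the surrogate lower-bounds the true reward-weighted objective on both sides of the sandwich, which is the sense in which it reduces bias relative to training on a single one-sided bound. The w/ EUBO and w/ Mixture variants used below presumably differ in how the upper-bound term is instantiated.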
To set up the environment, run:
conda env create -f env.yml
conda activate spg
Then download the base model LLaDA-8B-Instruct into SAVE_DIR/hf_models/.
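One way to fetch the model, assuming it is pulled from the GSAI-ML/LLaDA-8B-Instruct repository on the Hugging Face Hub and that the scripts expect it under SAVE_DIR/hf_models/LLaDA-8B-Instruct (adjust the repo id and target directory if your layout differs):

export SAVE_DIR=/path/to/your/save_dir
# Download the model weights into the expected sub-directory.
huggingface-cli download GSAI-ML/LLaDA-8B-Instruct --local-dir "$SAVE_DIR/hf_models/LLaDA-8B-Instruct"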
The code is inside the spg directory. spg/slurm_scripts contains the SLURM scripts we used to run the RL experiments on four benchmarks. You need to change the saving directory SAVE_DIR in all of the scripts; a one-liner for this is sketched below.
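For example, to point every script at your directory and submit a run (the sed pattern assumes SAVE_DIR is set via a plain SAVE_DIR=... assignment inside the scripts, and the script name below is a placeholder rather than a real file):

# Rewrite the SAVE_DIR assignment in all SLURM scripts.
sed -i 's|^SAVE_DIR=.*|SAVE_DIR=/path/to/your/save_dir|' spg/slurm_scripts/*.sh
# Submit one of the RL training scripts.
sbatch spg/slurm_scripts/<chosen_experiment>.sh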
Reward dynamics of SPG w/ Mixture during RL training, compared with D1, WD1, and UniGRPO.
The evaluation code is inside the eval directory.
- Run the evaluation scripts: sbatch_eval_llada.sh for LLaDA-8B-Instruct; sbatch_eval_llada1.5.sh for LLaDA-1.5; the files inside eval_d1 for the d1 baseline; the files inside eval_eubo for SPG w/ EUBO; and the files inside eval_mix for SPG w/ Mixture. You need to change the saving directory SAVE_DIR in all of the scripts.
- The evaluation scripts only save the generations; use the parser to calculate accuracy.
- For example, baseline generations are in the eval_results/eval_results_gsm8k_llada directory. Use python parse_and_get_acc.py to print the accuracy (an end-to-end sketch follows this list).
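End-to-end, a single evaluation pass might look like the following sketch. The exact locations of the sbatch script and of parse_and_get_acc.py inside eval, and how the parser finds the generations, are assumptions; check the directory layout before running.

cd eval
# Generate and save completions for the LLaDA-8B-Instruct baseline.
sbatch sbatch_eval_llada.sh
# After the job finishes, parse the saved generations (e.g., the GSM8K run
# under eval_results/eval_results_gsm8k_llada) and print accuracy.
python parse_and_get_acc.py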
This codebase is developed on top of d1 (Zhao et al., 2025).
If you find SPG useful in your research, please cite:
@article{wang2025spg,
  title={SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models},
  author={Wang, Chenyu and Rashidinejad, Paria and Su, DiJia and Jiang, Song and Wang, Sid and Zhao, Siyan and Zhou, Cai and Shen, Shannon Zejiang and Chen, Feiyu and Jaakkola, Tommi and Tian, Yuandong and Liu, Bo},
  journal={arXiv preprint arXiv:2510.09541},
  year={2025}
}
SPG is MIT licensed, as found in the LICENSE file.



