DiffNMR: Diffusion Models for Nuclear Magnetic Resonance Spectra Elucidation
Nuclear Magnetic Resonance (NMR) spectroscopy is a central characterization method for molecular structure elucidation, yet interpreting NMR spectra to deduce molecular structures remains challenging due to the complexity of spectral data and the vastness of chemical space. In this work, we introduce DiffNMR, a novel end-to-end framework that leverages a conditional discrete diffusion model for de novo molecular structure elucidation from NMR spectra. DiffNMR refines molecular graphs iteratively through a diffusion-based generative process, ensuring global consistency and mitigating the error accumulation inherent in autoregressive methods. The framework integrates a two-stage pretraining strategy that aligns spectral and molecular representations via a diffusion autoencoder (Diff-AE) and contrastive learning; retrieval initialization and similarity filtering during inference; and a specialized NMR encoder with radial basis function (RBF) encoding for chemical shifts, which preserves continuity and chemical correlation. Experimental results demonstrate that DiffNMR achieves competitive performance for NMR-based structure elucidation, offering an efficient and robust solution for automated molecular analysis.
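To illustrate why an RBF encoding of chemical shifts preserves continuity, here is a minimal sketch in plain Python. It is not the NMRNet encoder itself: the function name `rbf_encode`, the grid of centers, and the width `gamma` are assumptions chosen only for illustration.

```python
import math


def rbf_encode(shift_ppm, centers, gamma):
    """Expand one chemical shift (ppm) into Gaussian RBF features."""
    return [math.exp(-gamma * (shift_ppm - c) ** 2) for c in centers]


def l2_distance(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Hypothetical grid: centers every 1 ppm from 0 to 10 (illustration only).
centers = [float(i) for i in range(11)]
gamma = 1.0

a = rbf_encode(7.20, centers, gamma)
b = rbf_encode(7.30, centers, gamma)  # close to a on the ppm axis
c = rbf_encode(2.10, centers, gamma)  # far from a on the ppm axis

# Nearby shifts map to nearby feature vectors; distant shifts do not.
print(l2_distance(a, b) < l2_distance(a, c))
```

Unlike a one-hot binning of the ppm axis, two shifts that differ by 0.1 ppm receive nearly identical feature vectors here, which is the continuity property the abstract refers to.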
MSD-NMR:

Multimodal-Spectroscopic-Dataset (MSD-NMR) is a comprehensive dataset for molecular structure elucidation from NMR spectra. Its smallest subset contains 121,509 spectra, each corresponding to a molecular structure with up to 15 heavy atoms; the largest subset contains 574,799 spectra with up to 35 heavy atoms. Each subset is divided into training, validation, and test sets.

| Dataset | train | val | test | total |
|---|---|---|---|---|
| MSD-NMR (n<15) | 109,358 | 6,076 | 6,075 | 121,509 |
| MSD-NMR (n<20) | 235,512 | 13,085 | 13,084 | 261,681 |
| MSD-NMR (n<25) | 351,273 | 19,516 | 19,515 | 390,304 |
| MSD-NMR (n<35) | 517,319 | 28,741 | 28,739 | 574,799 |

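As a quick sanity check on the split sizes listed above, the following sketch confirms that train, val, and test add up to each total and computes the split fractions, which come out at roughly 90/5/5 for every subset. The variable names are illustrative only.

```python
# Split sizes copied from the MSD-NMR table: (train, val, test, total).
splits = {
    "n<15": (109_358, 6_076, 6_075, 121_509),
    "n<20": (235_512, 13_085, 13_084, 261_681),
    "n<25": (351_273, 19_516, 19_515, 390_304),
    "n<35": (517_319, 28_741, 28_739, 574_799),
}

fractions = {}
for name, (train, val, test, total) in splits.items():
    # The three splits should partition the subset exactly.
    assert train + val + test == total, name
    fractions[name] = (train / total, val / total, test / total)

for name, (f_train, f_val, f_test) in fractions.items():
    print(f"{name}: train={f_train:.3f} val={f_val:.3f} test={f_test:.3f}")
```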
To set up the DiffNMR environment, please follow these steps:
- Download the required files:
  - Vocabulary list: vocab.tar.gz
  - Retrieval database: retrival_database.zip
- Place the downloaded files in the `spectrum_elucidation` directory.
- Decompress the files using the following commands:

      tar -xvzf vocab.tar.gz
      unzip retrival_database.zip
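If `tar` or `unzip` is not available (for example on Windows), the same decompression step can be done with the Python standard library. This is a minimal sketch: the helper name `extract_assets` is hypothetical, and it assumes both archives sit directly in the directory passed to it.

```python
import tarfile
import zipfile
from pathlib import Path


def extract_assets(directory):
    """Extract vocab.tar.gz and retrival_database.zip inside `directory`.

    Returns the names of the archives that were found and extracted.
    """
    directory = Path(directory)
    extracted = []

    tar_path = directory / "vocab.tar.gz"
    if tar_path.exists():
        with tarfile.open(tar_path, "r:gz") as tar:
            tar.extractall(directory)
        extracted.append(tar_path.name)

    zip_path = directory / "retrival_database.zip"
    if zip_path.exists():
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(directory)
        extracted.append(zip_path.name)

    return extracted


if __name__ == "__main__":
    # Run from the repository root after downloading the two files.
    print(extract_assets("spectrum_elucidation"))
```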
| Model | Dataset | Loss | Negative log likelihood | GPUs | Training time | Config | Checkpoint | Log |
|---|---|---|---|---|---|---|---|---|
| diffnmr_diffgraphfromer_msdnmr_nless15 | msdnmr_nless15 | 1.946618 | 66.028621 | 4 | ~34.15 hours | DiffNMR_DiffGraphFormer | checkpoint | log |
| diffnmr_nmrnet_msdnmr_nless15 | msdnmr_nless15 | 3.217951 | - | 4 | ~6.5 hours | DiffNMR_NMRNet | checkpoint | log |
| diffnmr_msdnmr_nless15 | msdnmr_nless15 | 1.946618 | 66.028621 | 4 | ~30.24 hours | DiffNMR | checkpoint | log |
Note: please refer to the following pretrained weights:
- DiffNMR_DiffGraphFormer_nless15_best.pdparams
- DiffNMR_DiffGraphFormer_nless15_init.pdparams
- DiffNMR_NMRNet_nless15_best.pdparams
- DiffNMR_NMRNet_nless15_init.pdparams
- DiffNMR_NMRNet_nless15_init_v2.pdparams
- DiffNMR_nless15_best.pdparams
- DiffNMR_nless15_onlyH_best.pdparams
## 2 stage pretraining
### stage 1: pretrain Diff-AE of Molecular Encoder and Molecular Decoder
# multi-gpu training, we use 4 gpus here
python -m paddle.distributed.launch --gpus="0,1,2,3" spectrum_elucidation/train.py -c spectrum_elucidation/configs/diffnmr/DiffNMR_DiffGraphFormer.yaml
# single-gpu training
python spectrum_elucidation/train.py -c spectrum_elucidation/configs/diffnmr/DiffNMR_DiffGraphFormer.yaml
### stage 2: pretrain NMR Spectrum Encoder NMRNet by CLIP
# multi-gpu training, we use 4 gpus here
python -m paddle.distributed.launch --gpus="0,1,2,3" spectrum_elucidation/train.py -c spectrum_elucidation/configs/diffnmr/DiffNMR_NMRNet.yaml
# single-gpu training
python spectrum_elucidation/train.py -c spectrum_elucidation/configs/diffnmr/DiffNMR_NMRNet.yaml
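
For readers unfamiliar with CLIP-style pretraining, the sketch below shows a generic symmetric contrastive loss between spectrum and molecule embeddings in plain Python. It is not the loss implemented in this repository: the function names and the temperature of 0.07 are assumptions for illustration, and a real implementation would operate on batched tensors.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over one list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_loss(spec_emb, mol_emb, temperature=0.07):
    """Symmetric cross-entropy over the spectrum-molecule similarity matrix.

    Row i of spec_emb and row i of mol_emb are assumed to be a matched pair.
    """
    s = [normalize(v) for v in spec_emb]
    m = [normalize(v) for v in mol_emb]
    n = len(s)
    logits = [
        [sum(a * b for a, b in zip(s[i], m[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Spectrum -> molecule direction: softmax over each row.
    loss_s2m = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Molecule -> spectrum direction: softmax over each column.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_m2s = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_s2m + loss_m2s)


# Toy embeddings: matched pairs should score a lower loss than mismatched ones.
aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```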
## fine-tuning
# multi-gpu training, we use 4 gpus here
python -m paddle.distributed.launch --gpus="0,1,2,3" spectrum_elucidation/train.py -c spectrum_elucidation/configs/diffnmr/DiffNMR.yaml
# single-gpu training
python spectrum_elucidation/train.py -c spectrum_elucidation/configs/diffnmr/DiffNMR.yaml
# Adjust program behavior on the fly using command-line parameters. This provides a convenient way to customize settings without modifying the configuration file directly.
# For example: Global.do_eval=True
## 2 stage pretraining
### stage 1: pretrain Diff-AE of Molecular Encoder and Molecular Decoder
python spectrum_elucidation/train.py -c spectrum_elucidation/configs/diffnmr/DiffNMR_DiffGraphFormer.yaml Global.do_eval=True Global.do_train=False Global.do_test=False Trainer.pretrained_model_path='your model path(*.pdparams)'
### stage 2: pretrain NMR Spectrum Encoder NMRNet by CLIP
python spectrum_elucidation/train.py -c spectrum_elucidation/configs/diffnmr/DiffNMR_NMRNet.yaml Global.do_eval=True Global.do_train=False Global.do_test=False Trainer.pretrained_model_path='your model path(*.pdparams)'
## fine-tuning
python spectrum_elucidation/train.py -c spectrum_elucidation/configs/diffnmr/DiffNMR.yaml Global.do_eval=True Global.do_train=False Global.do_test=False Trainer.pretrained_model_path='your model path(*.pdparams)'
# This command is used to evaluate the model's performance on the test dataset.
## 2 stage pretraining
### stage 1: pretrain Diff-AE of Molecular Encoder and Molecular Decoder
python spectrum_elucidation/train.py -c spectrum_elucidation/configs/diffnmr/DiffNMR_DiffGraphFormer.yaml Global.do_eval=False Global.do_train=False Global.do_test=True Trainer.pretrained_model_path='your model path(*.pdparams)'
### stage 2: pretrain NMR Spectrum Encoder NMRNet by CLIP
python spectrum_elucidation/train.py -c spectrum_elucidation/configs/diffnmr/DiffNMR_NMRNet.yaml Global.do_eval=False Global.do_train=False Global.do_test=True Trainer.pretrained_model_path='your model path(*.pdparams)'
## fine-tuning
python spectrum_elucidation/train.py -c spectrum_elucidation/configs/diffnmr/DiffNMR.yaml Global.do_eval=False Global.do_train=False Global.do_test=True Trainer.pretrained_model_path='your model path(*.pdparams)'
# This command is used to predict the molecular structure from an NMR spectrum using a trained model.
# Note: The model_name and weights_name parameters are used to specify the pre-trained model and its corresponding weights.
# The prediction results will be saved in the folder specified by the `save_path` parameter, with the default set to `result`.
# Mode 1: Use a custom configuration file and checkpoint for molecular structure prediction. This approach allows for more flexibility and customization.
python spectrum_elucidation/sample.py --config_path='spectrum_elucidation/configs/diffnmr/DiffNMR.yaml' --weights_name='DiffNMR_nless15_best.pdparams' --save_path='result_diffnmr_nless15/' --checkpoint_path="pretrained"
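
The evaluation and test commands above differ only in the three `Global.do_*` overrides and the config file, so they are easy to generate from a script. The sketch below builds the argument list for one run. The helper `build_command` and the weights path in the example are hypothetical, and it assumes, as the commands above suggest, that overrides are passed as bare `Key=Value` arguments after the config path.

```python
import shlex

# Override values for each mode, in the order used by the commands above.
MODES = {
    "eval": {"do_eval": True, "do_train": False, "do_test": False},
    "test": {"do_eval": False, "do_train": False, "do_test": True},
}


def build_command(config_path, mode, weights_path):
    """Return the argv list for an evaluation or test run of train.py."""
    flags = MODES[mode]
    cmd = ["python", "spectrum_elucidation/train.py", "-c", config_path]
    cmd += [f"Global.{key}={value}" for key, value in flags.items()]
    cmd.append(f"Trainer.pretrained_model_path={weights_path}")
    return cmd


cmd = build_command(
    "spectrum_elucidation/configs/diffnmr/DiffNMR.yaml",
    "test",
    "pretrained/DiffNMR_nless15_best.pdparams",  # hypothetical location
)
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)` from the repository root.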
@article{yang2025diffnmr,
title={DiffNMR: Diffusion Models for Nuclear Magnetic Resonance Spectra Elucidation},
author= {Yang, Qingsong and Wu, Binglan and Liu, Xuwei and Chen, Bo and Li, Wei and Long, Gen and Chen, Xin and Xiao, Mingjun},
journal={arXiv preprint arXiv:2507.08854},
year={2025}
}
