DiffNMR

DiffNMR: Diffusion Models for Nuclear Magnetic Resonance Spectra Elucidation

Abstract

Nuclear Magnetic Resonance (NMR) spectroscopy is a central characterization method for molecular structure elucidation, yet interpreting NMR spectra to deduce molecular structures remains challenging due to the complexity of spectral data and the vastness of chemical space. In this work, we introduce DiffNMR, a novel end-to-end framework that leverages a conditional discrete diffusion model for de novo molecular structure elucidation from NMR spectra. DiffNMR refines molecular graphs iteratively through a diffusion-based generative process, ensuring global consistency and mitigating the error accumulation inherent in autoregressive methods. The framework integrates three components: a two-stage pretraining strategy that aligns spectral and molecular representations via a diffusion autoencoder (Diff-AE) and contrastive learning; retrieval initialization and similarity filtering during inference; and a specialized NMR encoder with radial basis function (RBF) encoding for chemical shifts, which preserves continuity and chemical correlation. Experimental results demonstrate that DiffNMR achieves competitive performance for NMR-based structure elucidation, offering an efficient and robust solution for automated molecular analysis.
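
To illustrate why an RBF encoding preserves continuity for chemical shifts, here is a minimal sketch in plain Python. It is not the repository's NMR encoder: the function name `rbf_encode`, the 0 to 200 ppm grid of centers, and the width `gamma` are all assumptions chosen for illustration.

```python
import math


def rbf_encode(shift, centers, gamma):
    """Expand one chemical shift (ppm) into Gaussian RBF features.

    shift:   a single chemical shift value
    centers: hypothetical grid of reference shifts (assumption)
    gamma:   hypothetical width parameter; larger means narrower bumps
    Returns a list with one value in [0, 1] per center.
    """
    return [math.exp(-gamma * (shift - c) ** 2) for c in centers]


def dot(a, b):
    """Plain dot product, used here as a crude similarity measure."""
    return sum(x * y for x, y in zip(a, b))


# Assumed grid: 0 to 200 ppm in 1 ppm steps (illustrative only).
centers = [float(i) for i in range(201)]

near_a = rbf_encode(77.0, centers, gamma=0.5)
near_b = rbf_encode(77.4, centers, gamma=0.5)
far = rbf_encode(128.5, centers, gamma=0.5)

# Nearby shifts produce overlapping features; distant shifts do not.
print(dot(near_a, near_b) > dot(near_a, far))
```

Unlike binning, where 77.0 and 77.4 ppm could fall into unrelated one-hot slots, the features here change smoothly with the shift value.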

DiffNMR Overview

Datasets:

  • MSD-NMR:

    The Multimodal-Spectroscopic-Dataset (MSD-NMR) is a comprehensive dataset for molecular structure elucidation from NMR spectra. The smallest subset contains 121,509 spectra, each corresponding to a molecular structure with up to 15 heavy atoms; the largest contains 574,799 spectra of molecules with up to 35 heavy atoms. Each subset is divided into training, validation, and test sets.

    Dataset    train      val       test      total
    MSD-NMR
      n<15     109,358    6,076     6,075     121,509
      n<20     235,512    13,085    13,084    261,681
      n<25     351,273    19,516    19,515    390,304
      n<35     517,319    28,741    28,739    574,799
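
The split sizes in the table can be checked with a few lines of Python. The sketch below copies the counts from the table, confirms that each row sums to its total, and computes the split fractions; the helper name `split_fractions` is hypothetical.

```python
# Counts copied from the MSD-NMR table: (train, val, test, total).
SPLITS = {
    "n<15": (109_358, 6_076, 6_075, 121_509),
    "n<20": (235_512, 13_085, 13_084, 261_681),
    "n<25": (351_273, 19_516, 19_515, 390_304),
    "n<35": (517_319, 28_741, 28_739, 574_799),
}


def split_fractions(train, val, test, total):
    """Return (train, val, test) fractions rounded to three decimals."""
    if train + val + test != total:
        raise ValueError("split sizes do not add up to the total")
    return tuple(round(count / total, 3) for count in (train, val, test))


fractions = {name: split_fractions(*row) for name, row in SPLITS.items()}
for name, frac in fractions.items():
    print(name, frac)
```

Every subset works out to roughly a 90/5/5 train/validation/test split.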

Data Preparation

To set up the DiffNMR environment, please follow these steps:

  1. Download the required files:

  2. Place the downloaded files in the spectrum_elucidation directory

  3. Decompress the files using the following commands:

    tar -xvzf vocab.tar.gz
    unzip retrival_database.zip

Results

Model                                   Dataset         Loss      Negative log likelihood  GPUs  Training time  Config                   Checkpoint | Log
diffnmr_diffgraphfromer_msdnmr_nless15  msdnmr_nless15  1.946618  66.028621                4     ~34.15 hours   DiffNMR_DiffGraphFormer  checkpoint | log
diffnmr_nmrnet_msdnmr_nless15           msdnmr_nless15  3.217951  -                        4     ~6.5 hours     DiffNMR_NMRNet           checkpoint | log
diffnmr_msdnmr_nless15                  msdnmr_nless15  1.946618  66.028621                4     ~30.24 hours   DiffNMR                  checkpoint | log

Note: please refer to the following pretrained weights:

Training

## 2 stage pretraining
### stage 1: pretrain Diff-AE of Molecular Encoder and Molecular Decoder
# multi-gpu training, we use 4 gpus here
python -m paddle.distributed.launch --gpus="0,1,2,3" spectrum_elucidation/train.py -c spectrum_elucidation/configs/diffnmr/DiffNMR_DiffGraphFormer.yaml
# single-gpu training
python spectrum_elucidation/train.py -c spectrum_elucidation/configs/diffnmr/DiffNMR_DiffGraphFormer.yaml
### stage 2: pretrain NMR Spectrum Encoder NMRNet by CLIP
# multi-gpu training, we use 4 gpus here
python -m paddle.distributed.launch --gpus="0,1,2,3" spectrum_elucidation/train.py -c spectrum_elucidation/configs/diffnmr/DiffNMR_NMRNet.yaml
# single-gpu training
python spectrum_elucidation/train.py -c spectrum_elucidation/configs/diffnmr/DiffNMR_NMRNet.yaml
## fine-tuning
# multi-gpu training, we use 4 gpus here
python -m paddle.distributed.launch --gpus="0,1,2,3" spectrum_elucidation/train.py -c spectrum_elucidation/configs/diffnmr/DiffNMR.yaml
# single-gpu training
python spectrum_elucidation/train.py -c spectrum_elucidation/configs/diffnmr/DiffNMR.yaml
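
For readers unfamiliar with the CLIP-style objective named in stage 2, the sketch below shows a generic symmetric contrastive loss over paired spectrum and molecule embeddings, written in plain Python. It is an illustration of the general technique only; the loss actually used by `DiffNMR_NMRNet.yaml` may differ. The function names and the temperature of 0.07 are assumptions.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy(logits, target):
    """Numerically stable cross-entropy for a single row of logits."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


def clip_loss(spec_emb, mol_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matched pair.

    temperature=0.07 is an assumed value, not taken from the repository.
    """
    spec = [_normalize(v) for v in spec_emb]
    mol = [_normalize(v) for v in mol_emb]
    n = len(spec)
    # Cosine similarity between every spectrum and every molecule.
    logits = [
        [sum(a * b for a, b in zip(s, m)) / temperature for m in mol]
        for s in spec
    ]
    # Spectrum -> molecule direction (rows), then molecule -> spectrum (columns).
    loss_s2m = sum(_cross_entropy(logits[i], i) for i in range(n)) / n
    loss_m2s = sum(
        _cross_entropy([logits[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (loss_s2m + loss_m2s)


aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < mismatched)
```

The loss is small when each spectrum embedding is closest to its own molecule embedding and large when the pairs are swapped.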

Validation

# Command-line overrides adjust program behavior on the fly, so settings can be customized without editing the configuration file.
# For example: Global.do_eval=True
## 2 stage pretraining
### stage 1: pretrain Diff-AE of Molecular Encoder and Molecular Decoder
python spectrum_elucidation/train.py -c spectrum_elucidation/configs/diffnmr/DiffNMR_DiffGraphFormer.yaml Global.do_eval=True Global.do_train=False Global.do_test=False Trainer.pretrained_model_path='your model path(*.pdparams)'
### stage 2: pretrain NMR Spectrum Encoder NMRNet by CLIP
python spectrum_elucidation/train.py -c spectrum_elucidation/configs/diffnmr/DiffNMR_NMRNet.yaml Global.do_eval=True Global.do_train=False Global.do_test=False Trainer.pretrained_model_path='your model path(*.pdparams)'
## fine-tuning
python spectrum_elucidation/train.py -c spectrum_elucidation/configs/diffnmr/DiffNMR.yaml Global.do_eval=True Global.do_train=False Global.do_test=False Trainer.pretrained_model_path='your model path(*.pdparams)'

Testing

# This command is used to evaluate the model's performance on the test dataset.
## 2 stage pretraining
### stage 1: pretrain Diff-AE of Molecular Encoder and Molecular Decoder
python spectrum_elucidation/train.py -c spectrum_elucidation/configs/diffnmr/DiffNMR_DiffGraphFormer.yaml Global.do_eval=False Global.do_train=False Global.do_test=True Trainer.pretrained_model_path='your model path(*.pdparams)'
### stage 2: pretrain NMR Spectrum Encoder NMRNet by CLIP
python spectrum_elucidation/train.py -c spectrum_elucidation/configs/diffnmr/DiffNMR_NMRNet.yaml Global.do_eval=False Global.do_train=False Global.do_test=True Trainer.pretrained_model_path='your model path(*.pdparams)'
## fine-tuning
python spectrum_elucidation/train.py -c spectrum_elucidation/configs/diffnmr/DiffNMR.yaml Global.do_eval=False Global.do_train=False Global.do_test=True Trainer.pretrained_model_path='your model path(*.pdparams)'
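
The validation and testing commands above differ only in the three `Global.do_*` overrides. A small helper can assemble them for any of the three configs, as sketched below. The helper name `build_command` and the `weights.pdparams` path are hypothetical; the script path, config names, and override keys are taken from the commands above.

```python
import shlex

CONFIG_DIR = "spectrum_elucidation/configs/diffnmr"


def build_command(config_name, mode, weights_path):
    """Build the argv list for an evaluation or test run.

    config_name:  "DiffNMR_DiffGraphFormer", "DiffNMR_NMRNet", or "DiffNMR"
    mode:         "eval" (validation set) or "test" (test set)
    weights_path: path to a *.pdparams file (placeholder, supply your own)
    """
    if mode not in ("eval", "test"):
        raise ValueError("mode must be 'eval' or 'test'")
    return [
        "python",
        "spectrum_elucidation/train.py",
        "-c",
        f"{CONFIG_DIR}/{config_name}.yaml",
        f"Global.do_eval={mode == 'eval'}",
        "Global.do_train=False",
        f"Global.do_test={mode == 'test'}",
        f"Trainer.pretrained_model_path={weights_path}",
    ]


cmd = build_command("DiffNMR", "test", "weights.pdparams")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root to launch the run.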

Sample

# This command predicts the molecular structure from an NMR spectrum using a trained model.
# Note: The model_name and weights_name parameters are used to specify the pre-trained model and its corresponding weights.
# The prediction results will be saved in the folder specified by the `save_path` parameter, with the default set to `result`.

# Mode 1: Use a custom configuration file and checkpoint for molecular structure prediction. This approach allows for more flexibility and customization.
python spectrum_elucidation/sample.py --config_path='spectrum_elucidation/configs/diffnmr/DiffNMR.yaml' --weights_name='DiffNMR_nless15_best.pdparams' --save_path='result_diffnmr_nless15/' --checkpoint_path="pretrained"

Citation

@article{yang2025diffnmr,
  title={DiffNMR: Diffusion Models for Nuclear Magnetic Resonance Spectra Elucidation},
  author={Yang, Qingsong and Wu, Binglan and Liu, Xuwei and Chen, Bo and Li, Wei and Long, Gen and Chen, Xin and Xiao, Mingjun},
  journal={arXiv preprint arXiv:2507.08854},
  year={2025}
}