Reaction Discovery involving Digital co-Expert with a Practical Application in Atom-Economic Cycloaddition
Automated pipeline for discovering new chemical transformations by combining rule-based generation, quantum chemistry, and unsupervised machine learning.
This repository contains code and data for the manuscript:
"Reaction Discovery involving Digital co-Expert with a Practical Application in Atom-Economic Cycloaddition"
Nikita I. Kolomoets†, Daniil A. Boiko†, Leonid V. Romashov, Kirill S. Kozlov, Evgeniy G. Gordeev, Alexey S. Galushko, Valentine P. Ananikov*
The search for new chemical transformations is a fundamental challenge in modern chemistry with broad implications for drug discovery, materials science, and sustainable synthesis. This work presents an automated discovery pipeline that:
- Generates 31,000+ cycloaddition reaction candidates from the QM9 molecular database using rule-based templates
- Filters thermodynamically favorable reactions (ΔrG < −10 kcal/mol) using pre-computed molecular energies
- Clusters reactions into 13 structurally distinct families using transformer-based embeddings
- Prioritizes candidates through expert evaluation and DFT-based substituent optimization
- Validates the top candidates experimentally, yielding 2 novel cycloaddition reactions
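The thermodynamic filter in the second step can be illustrated with a minimal sketch. It assumes each molecule carries a free energy in Hartree (as QM9 entries do), computes ΔrG as the sum over products minus the sum over reactants, converts to kcal/mol, and keeps reactions below the −10 kcal/mol threshold. The function names and the toy energies are hypothetical and are not taken from the repository's scripts.

```python
HARTREE_TO_KCAL_MOL = 627.509  # standard Hartree -> kcal/mol conversion factor


def reaction_free_energy(reactant_g, product_g):
    """Return ΔrG in kcal/mol from per-molecule free energies in Hartree."""
    return (sum(product_g) - sum(reactant_g)) * HARTREE_TO_KCAL_MOL


def is_favorable(reactant_g, product_g, threshold=-10.0):
    """True if the reaction is more exergonic than the threshold (kcal/mol)."""
    return reaction_free_energy(reactant_g, product_g) < threshold


# Toy candidates (hypothetical energies): name -> (reactant G values, product G values)
candidates = {
    "rxn_a": ([-100.00, -50.00], [-150.05]),  # about -31 kcal/mol
    "rxn_b": ([-100.00, -50.00], [-150.01]),  # about -6 kcal/mol
}

favorable = [name for name, (r, p) in candidates.items() if is_favorable(r, p)]
print(favorable)
```

Only `rxn_a` passes the filter here; `rxn_b` is exergonic but not by more than 10 kcal/mol.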
🌐 Explore the dataset interactively: https://digital-co-expert.ananikovlab.ai/
.
├── mining_pubchem/ # Reaction generation from QM9 database
│ ├── generate_templates.py # Create cycloaddition reaction templates
│ ├── processing_dataset.py # Parse QM9 dataset and extract molecules
│ ├── reaction_find.sh # Parallel reaction mining script
│ └── reacts_concatenation.py # Combine generated reactions
│
├── clustering_reactions/ # Unsupervised learning pipeline
│ ├── smi2vec.py # SMILES → embeddings (rxnfp transformer)
│ ├── dimensionality_reduction.py # PCA/t-SNE/UMAP for visualization
│ └── cluster_reactions.py # k-means/hierarchical clustering
│
├── creation_reaction_cards/ # Expert evaluation tools
│ ├── filter_reactions_by_energy.py # Energy-based sampling
│ └── parse_manual_labels.py # Process expert annotations
│
├── laboratory_database_of_reagents/ # Experimental reagent filtering
│ └── get_substituents.py # Extract alkyne substituents from lab DB
│
├── generate_computation/ # DFT workflow automation
│ ├── generate_mopac_smiles.py # Create product structures
│ ├── mopac_generate.py # PM7 pre-optimization (MOPAC)
│ ├── gaussian_generate.py # Generate Gaussian input files
│ ├── withdrawal_energy.py # Parse DFT energies from Gaussian logs
│ └── prop_rxns.py # Compile reaction energies into CSV
│
├── final_filter/ # Post-DFT prioritization
│ └── fine_filter_reactions_by_energy.py # Select top candidates
│
└── working_files/ # Data directory (not included in repo)
    ├── dsgdb9nsd.xyz.tar.bz2 # QM9 dataset (download separately)
    ├── reactions.pkl # Generated reactions
    ├── embeds_reacts.pkl # Reaction embeddings
    └── ...
- Python 3.8+
- RDKit (chemistry toolkit)
- MOPAC2016 (for PM7 calculations)
- Gaussian 16 (for DFT calculations)
# Clone repository
git clone https://github.com/Ananikov-Lab/Digitcal-co-Expert.git
cd Digitcal-co-Expert
# Install Python dependencies
pip install -r requirements.txt
# Download QM9 dataset (134k molecules, ~3 GB)
wget https://figshare.com/ndownloader/files/3195389 -O working_files/dsgdb9nsd.xyz.tar.bz2
Generate cycloaddition templates and mine the QM9 database:
cd mining_pubchem
# 1. Create reaction templates
python generate_templates.py --output ../working_files/templates.pkl
# 2. Process QM9 dataset and annotate with CAS numbers
python processing_dataset.py \
--ds-path ../working_files/dsgdb9nsd.xyz.tar.bz2 \
--db-path ../working_files/cas \
--db-name 'CAS' \
--n-jobs 16 \
--output-dir ../working_files/dataset_split/ \
--output ../working_files/proc_dataset.pkl
# 3. Mine reactions in parallel (16 processes)
sh reaction_find.sh \
16 \
../working_files/proc_dataset.pkl \
../working_files/dataset_split/ \
../working_files/templates.pkl \
../working_files/reactions_dir/ \
'CAS'
# 4. Concatenate results
python reacts_concatenation.py \
--input ../working_files/reactions_dir/ \
--output ../working_files/reactions.pkl \
--output-rs ../working_files/embeds/smiles_reags.txt \
--output-ps ../working_files/embeds/smiles_prods.txt
Output: reactions.pkl (31,000 generated reactions with QM9-derived energies)
Convert SMILES to embeddings and cluster reactions:
cd clustering_reactions
# 1. Generate reaction embeddings using pre-trained transformer
python smi2vec.py \
--p-vocab ../working_files/vocab.pkl \
--p-trfm ../working_files/trfm.pkl \
--smi ../working_files/embeds/smiles_reags.txt \
--output ../working_files/embeds/reags.npy &
python smi2vec.py \
--p-vocab ../working_files/vocab.pkl \
--p-trfm ../working_files/trfm.pkl \
--smi ../working_files/embeds/smiles_prods.txt \
--output ../working_files/embeds/prods.npy
# Wait for the background reagent-embedding job before continuing
wait
# 2. Dimensionality reduction for visualization
python dimensionality_reduction.py \
--emb-r ../working_files/embeds/reags.npy \
--emb-p ../working_files/embeds/prods.npy \
--smi-r ../working_files/embeds/smiles_reags.txt \
--smi-p ../working_files/embeds/smiles_prods.txt \
--reacts ../working_files/reactions.pkl \
--method 't-SNE' \
--output ../working_files/embeds_reacts.pkl \
--n_components 2 \
--perplexity 100
# 3. Cluster reactions (k-means, 13 clusters)
python cluster_reactions.py \
--input ../working_files/embeds_reacts.pkl \
--method 'KMeans' \
--metric 'euclidean' \
--plot ../working_files/clusters.png \
--model ../working_files/cluster_model.pkl \
--n_clusters 13
Output: cluster_model.pkl (clustering assignments for 29k reactions)
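For readers unfamiliar with the clustering step, the following is a minimal pure-Python sketch of k-means (Lloyd's algorithm) with a Euclidean metric. It is illustrative only: how `cluster_reactions.py` implements clustering is not shown in this README, and the `kmeans` function, its parameters, and the toy 2D points are all hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input points
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments unchanged, so centroids are stable too
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Toy 2D "embeddings": two well-separated groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

In the pipeline above the same idea is applied with 13 clusters; which cluster receives which integer label depends on the random initialisation.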
Sample up to 18 reactions per cluster for manual assessment:
cd creation_reaction_cards
# 1. Filter by energy and CAS availability
python filter_reactions_by_energy.py \
--reactions ../working_files/reactions.pkl \
--db-name 'CAS' \
--model ../working_files/cluster_model.pkl \
--number 18 \
--output ../working_files/candidates_205.pkl \
--output-numbers ../working_files/reaction_ids.pkl
# 2. Parse expert annotations (after manual evaluation)
python parse_manual_labels.py \
--archive ../working_files/expert_evaluations.zip \
--output ../working_files/expert_labels/ \
--numbers ../working_files/reaction_ids.pkl \
--csv ../working_files/expert_data.csv
Output: expert_data.csv (205 reactions with feasibility scores and synthetic utility ratings)
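The "up to 18 per cluster" sampling can be sketched as follows. This is an assumption-laden illustration: it supposes that candidates are ranked by ΔrG (most exergonic first) within each cluster, which may differ from the actual criterion in `filter_reactions_by_energy.py`, and the record fields `id`, `cluster`, and `dg` are hypothetical.

```python
from collections import defaultdict


def sample_per_cluster(reactions, max_per_cluster=18):
    """Group reactions by cluster and keep the lowest-ΔrG ones in each group.

    reactions: iterable of dicts with keys 'id', 'cluster', 'dg' (kcal/mol).
    Returns a dict mapping cluster label -> list of selected reaction ids.
    """
    by_cluster = defaultdict(list)
    for rxn in reactions:
        by_cluster[rxn["cluster"]].append(rxn)

    selected = {}
    for cluster, members in by_cluster.items():
        members.sort(key=lambda r: r["dg"])  # most negative ΔrG first
        selected[cluster] = [r["id"] for r in members[:max_per_cluster]]
    return selected


# Toy data (hypothetical values)
toy = [
    {"id": 1, "cluster": 0, "dg": -12.0},
    {"id": 2, "cluster": 0, "dg": -30.5},
    {"id": 3, "cluster": 0, "dg": -18.2},
    {"id": 4, "cluster": 1, "dg": -11.1},
]
picked = sample_per_cluster(toy, max_per_cluster=2)
print(picked)  # {0: [2, 3], 1: [4]}
```

With 13 clusters and a cap of 18, at most 234 reactions can be selected; a total of 205 is consistent with some clusters contributing fewer than 18 eligible reactions.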
Generate product structures with laboratory-available alkynes and optimize with DFT:
cd laboratory_database_of_reagents
# 1. Extract alkyne substituents from lab inventory
python get_substituents.py \
--input-db ../working_files/Lab_Reagents.sdf \
--output-db ../working_files/alkynes_lab.txt
cd ../generate_computation
# 2. Generate product SMILES with alkyne substituents
python generate_mopac_smiles.py \
../working_files/priority_reactions.pkl \
../working_files/alkynes_lab.txt \
../working_files/products_to_optimize.txt
# 3. Create MOPAC input files (PM7 pre-optimization)
python mopac_generate.py \
../working_files/products_to_optimize.txt \
../working_files/mopac_inputs/
# (Run MOPAC calculations manually or via cluster scheduler)
# 4. Generate Gaussian input files (B3LYP/6-31G(2df,p))
python gaussian_generate.py \
../working_files/mopac_outputs/ \
../working_files/gaussian_check/ \
16000 \
'B3LYP/6-31G(2df,p)' \
16 \
../working_files/gaussian_inputs/
# (Run Gaussian calculations)
# 5. Extract energies from Gaussian log files
python withdrawal_energy.py \
../working_files/gaussian_outputs/ \
../working_files/products_to_optimize.txt \
../working_files/product_energies.txt
# 6. Compile reaction energies into table
python prop_rxns.py \
../working_files/reagent_energies.txt \
../working_files/alkyne_energies.txt \
../working_files/product_energies.txt \
../working_files/dft_reactions.csv
Output: dft_reactions.csv (optimized reaction energies for experimental screening)
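As an illustration of the energy-extraction step, the sketch below pulls the last "SCF Done" energy out of Gaussian log text with a regular expression. It is a hedged example, not the logic of `withdrawal_energy.py`: whether that script reads electronic energies or thermally corrected free energies is not stated here, and the sample log lines are synthetic.

```python
import re

# Matches lines such as:
#  SCF Done:  E(RB3LYP) =  -232.251634518     A.U. after    8 cycles
SCF_DONE = re.compile(r"SCF Done:\s+E\(\S+\)\s+=\s+(-?\d+\.\d+)")


def last_scf_energy(log_text):
    """Return the last 'SCF Done' energy (Hartree) in a log, or None if absent."""
    matches = SCF_DONE.findall(log_text)
    return float(matches[-1]) if matches else None


# Synthetic excerpt: during an optimization the final geometry's energy
# is the last SCF Done entry in the file.
sample_log = """
 SCF Done:  E(RB3LYP) =  -232.248100123     A.U. after   12 cycles
 (further optimization steps)
 SCF Done:  E(RB3LYP) =  -232.251634518     A.U. after    8 cycles
"""

energy = last_scf_energy(sample_log)
print(energy)
```

Returning `None` for logs without a matching line makes failed or truncated calculations easy to detect before reaction energies are compiled.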
Select top candidates for synthesis:
cd final_filter
python fine_filter_reactions_by_energy.py \
--input-csv ../working_files/dft_reactions.csv \
--number 9 \
--output ../working_files/final_candidates.csv
Output: 9 reactions for high-throughput GC-MS screening → 2 experimentally validated novel cycloadditions
All computational data are available at https://digital-co-expert.ananikovlab.ai/:
- Full reaction dataset (CSV, 31k reactions): Reaction SMILES, ΔrG from QM9, CAS numbers, functional group annotations
- Reaction embeddings (PKL, 768D vectors): Transformer-based features for clustering
- Expert evaluation template (PDF): Printable form for manual assessment
- Browse 31,000 reactions with sortable/filterable table
- Interactive 2D projection (t-SNE) of reaction space
- Filter by energy, heteroatoms, CAS availability
- Export subsets in CSV format
To apply this pipeline to different reaction classes:
- Modify template generation (mining_pubchem/generate_templates.py):
  - Define new reaction SMARTS patterns
  - Example: For Diels-Alder, use [C:1]=[C:2]-[C:3]=[C:4].[C:5]=[C:6]>>[C:1]1-[C:2]=[C:3]-[C:4]-[C:5]-[C:6]-1
- Update molecular database (replace QM9 with your dataset):
  - Format: XYZ files with energy annotations
  - Ensure energies are in Hartree or kcal/mol
- Adjust clustering parameters (clustering_reactions/cluster_reactions.py):
  - Tune n_clusters based on chemical diversity
  - Tune
- Customize expert evaluation criteria (creation_reaction_cards/):
  - Modify assessment form to match your domain
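When swapping in a different molecular database, it helps to know what an "XYZ file with energy annotations" looks like. The sketch below parses a QM9-style record under the assumption that the second line holds the property list in the order tag, index, A, B, C, mu, alpha, HOMO, LUMO, gap, r2, ZPVE, U0, U, H, G, Cv, with energies in Hartree. The `parse_qm9_xyz` function is hypothetical and the sample record is illustrative, not a verbatim QM9 entry.

```python
def parse_qm9_xyz(text):
    """Parse one QM9-style XYZ record into atom list and free energy (Hartree)."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])

    # Assumed property order on line 2:
    # tag, index, A, B, C, mu, alpha, homo, lumo, gap, r2, zpve, U0, U, H, G, Cv
    props = lines[1].split()
    free_energy = float(props[15])

    atoms = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        # QM9 writes small numbers as e.g. 2.1*^-6 instead of 2.1e-6
        coords = tuple(float(v.replace("*^", "e")) for v in (x, y, z))
        atoms.append((symbol, coords))

    return {"n_atoms": n_atoms, "G_hartree": free_energy, "atoms": atoms}


# Illustrative methane-like record (values are for demonstration only)
sample = """5
gdb 1 157.7118 157.70997 157.70699 0. 13.21 -0.3877 0.1171 0.5048 35.3641 0.044749 -40.47893 -40.476062 -40.475117 -40.498597 6.469
C -0.0127 1.0858 0.0080 -0.5357
H 0.0022 -0.0060 2.1*^-6 0.1339
H 1.0117 1.4638 0.0003 0.1339
H -0.5408 1.4475 -0.8766 0.1339
H -0.5238 1.4379 0.9064 0.1339
"""

record = parse_qm9_xyz(sample)
print(record["n_atoms"], record["G_hartree"])
```

A custom dataset would need the equivalent of `G_hartree` for every molecule so that reaction energies can be computed on a consistent scale.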
If you use this code or data, please cite:
@article{kolomoets2025cycloaddition,
title={Reaction Discovery Involving Digital co-Expert with a Practical Application in Atom-Economic Cycloaddition},
author={Kolomoets, Nikita I. and Boiko, Daniil A. and Romashov, Leonid V. and Kozlov, Kirill S. and Gordeev, Evgeniy G. and Galushko, Alexey S. and Ananikov, Valentine P.},
journal={Angewandte Chemie International Edition},
year={2026},
publisher={Wiley-VCH},
doi={10.1002/anie.202523905}
}
- Corresponding author: Valentine P. Ananikov ([email protected])
- Issues and questions: GitHub Issues
- Lab website: ananikovlab.ru
This project is licensed under the MIT License - see LICENSE file for details.
- QM9 dataset: Ramakrishnan et al., Sci Data 2014
Last updated: December 2025