C-GRAN (Community-Guided Recursive Annotation Network) is an open, network-based framework for the systematic discovery and annotation of emerging structural analogs in complex environmental samples. Designed for non-target screening (NTS) using tandem mass spectrometry (MS/MS), C-GRAN integrates molecular networking with sample-wise co-occurrence analysis to uncover structurally or functionally related compounds beyond spectral similarity constraints. Starting from high-confidence seed annotations, candidate compounds are expanded through a recursive database search strategy, incorporating exact-mass-based matching. Each candidate is ranked based on structural similarity to known analogs, fragment match quality, and occurrence correlation. By iteratively propagating annotations across molecular networks, C-GRAN enables high-coverage identification of structurally diverse compounds—especially those missed by traditional spectral-based tools.
You should prepare the environment as follows:
pip install -r requirements.txt

cd src/1_calculate_correlation
python calculate_correlation.py --intensity_file test_files/test.txt --compounds_num 13 --samples_num 98 --correlation_result_filename correlation_results.csv

- intensity_file: Path to the input data file.
- compounds_num: Number of compounds in the dataset (i.e., number of rows in the input file).
- samples_num: Number of samples per compound (i.e., number of columns in the input file).
- correlation_result_filename: Name of the output CSV file that will store the computed correlation coefficients.
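
To make the idea behind this step concrete, here is a minimal sketch of sample-wise co-occurrence scoring. It assumes a Pearson correlation over a compounds x samples intensity matrix; the metric actually used by `calculate_correlation.py` is not stated in this README, and the function names and toy matrix below are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length intensity vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant profile has no defined correlation; report 0 in this sketch.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlations(intensities):
    """Return {(i, j): r} for every pair of compound rows with i < j."""
    result = {}
    for i in range(len(intensities)):
        for j in range(i + 1, len(intensities)):
            result[(i, j)] = pearson(intensities[i], intensities[j])
    return result


# Hypothetical toy matrix: 3 compounds (rows) x 4 samples (columns).
toy = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
correlations = pairwise_correlations(toy)
```

In this toy matrix, rows 0 and 1 rise together across samples while row 2 falls, so the first pair is positively correlated and the pairs involving row 2 are negatively correlated.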
cd src/2_filter_high_correlation_compounds
python filter_high_correlation_compounds.py --correlation_file ../1_calculate_correlation/correlation_results.csv --seednode_file test_files/seednode.csv --correlation_threshold 0.7

- correlation_file: Path to the correlation_results from Step 1.
- seednode_file: Path to the seed node CSV file. This file should contain a list of initial compounds (including columns such as ID and SMILES) to be used for annotation.
- correlation_threshold: Correlation threshold (between 0 and 1).
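
The filtering idea can be sketched as follows. This is an illustration only: it assumes the step keeps non-seed compounds whose correlation with at least one seed reaches the threshold, which is a reading of the parameter descriptions rather than a confirmed description of `filter_high_correlation_compounds.py`. The function name, IDs, and values are hypothetical.

```python
def filter_by_seed_correlation(correlations, seed_ids, threshold):
    """Map each non-seed compound to its best-correlated seed, if r >= threshold.

    correlations: dict mapping (id_a, id_b) -> correlation coefficient.
    """
    seeds = set(seed_ids)
    kept = {}
    for (a, b), r in correlations.items():
        if r < threshold:
            continue
        # Only pairs linking exactly one seed to one non-seed compound count.
        if a in seeds and b not in seeds:
            seed, other = a, b
        elif b in seeds and a not in seeds:
            seed, other = b, a
        else:
            continue
        # Remember the strongest seed link for each compound.
        if other not in kept or r > kept[other][1]:
            kept[other] = (seed, r)
    return kept


# Hypothetical toy input.
toy_correlations = {
    ("c1", "c2"): 0.92,
    ("c1", "c3"): 0.41,
    ("c2", "c3"): 0.75,
    ("c3", "c4"): 0.99,
}
high_correlation = filter_by_seed_correlation(toy_correlations, ["c1"], 0.7)
```

Here only `c2` survives: `c3` is below the threshold with respect to the seed `c1`, and the `c2`/`c3` and `c3`/`c4` pairs do not involve a seed at all.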
cd src/3_construct_molecular_network
python construct_molecular_network.py --molecular_network_file test_files/source_target.csv --correlation_file ../1_calculate_correlation/correlation_results.csv --correlation_threshold 0.7 --RT_threshold 0.01 --edited_molecular_network_file source_target_cor_edit.csv

- molecular_network_file: Path to the molecular network file (CSV), containing columns such as Source, Target, and retention time (RT).
- correlation_file: Path to the correlation_results from Step 1.
- correlation_threshold: Correlation threshold (between 0 and 1).
- RT_threshold: Maximum allowed retention time difference between two nodes to include an edge.
- edited_molecular_network_file: Path to the edited molecular network CSV file with Source, Target, correlation, RT, etc.
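
One possible reading of how the two thresholds combine is sketched below. It assumes an edge is kept only when the correlation of its two nodes reaches `correlation_threshold` and their retention time difference does not exceed `RT_threshold`, as the parameter descriptions suggest; the real script may apply these rules differently (for example, by adding new correlation-based edges). All names and values are hypothetical.

```python
def edit_network(edges, rt, correlations, correlation_threshold, rt_threshold):
    """Keep edges that pass both the correlation and the RT-difference check."""
    kept = []
    for source, target in edges:
        # Correlations may be stored in either orientation.
        r = correlations.get((source, target), correlations.get((target, source)))
        if r is None or r < correlation_threshold:
            continue
        rt_diff = abs(rt[source] - rt[target])
        if rt_diff > rt_threshold:
            continue
        kept.append(
            {"Source": source, "Target": target, "correlation": r, "RT_diff": rt_diff}
        )
    return kept


# Hypothetical toy network: retention times per node and pairwise correlations.
toy_rt = {"n1": 5.000, "n2": 5.005, "n3": 7.200}
toy_edges = [("n1", "n2"), ("n1", "n3"), ("n2", "n3")]
toy_correlations = {("n1", "n2"): 0.85, ("n3", "n1"): 0.90, ("n2", "n3"): 0.30}
edited_edges = edit_network(toy_edges, toy_rt, toy_correlations, 0.7, 0.01)
```

In this toy case only the `n1`/`n2` edge is kept: `n1`/`n3` fails the retention time check and `n2`/`n3` fails the correlation check.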
If you need to prepare the PubChem database from scratch, run this script first:
cd src/4_search_candidates
python process_pubchem_database.py

Alternatively, you can download our prepared PubChem database from Google Drive or Baidu Yun (baiduyun).
Then run this script to search for candidates:
python search_candidates.py --edited_molecular_network_file ../3_construct_molecular_network/source_target_cor_edit.csv --pubchem_database_path ./pubchem_database.pk --candidates_folder ./candidates/ --ppm_threshold 2 --is_filter_element --element_set 'C,H,O,N,P,S,F,Cl,Br,I'

- edited_molecular_network_file: Path to the edited molecular network CSV file with Source, Target, correlation, RT, etc., from Step 3.
- pubchem_database_path: Path to the preprocessed PubChem database (pickle format).
- candidates_folder: Output folder to save retrieved candidate compounds for each node.
- ppm_threshold: Mass accuracy threshold in parts per million (ppm) for candidate matching.
- is_filter_element: Flag to enable filtering candidates by allowed elements.
- element_set: Comma-separated list of allowed chemical elements in candidate formulas.
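
The exact-mass matching and element filter can be illustrated with the sketch below. It uses the standard ppm error, |observed - theoretical| / theoretical x 10^6, and compares neutral masses directly; adduct handling and the actual layout of the pickled PubChem database are not described in this README, so the record fields, function names, and toy entries (including their masses) are assumptions made for illustration.

```python
import re


def ppm_error(observed_mass, theoretical_mass):
    """Absolute mass error in parts per million."""
    return abs(observed_mass - theoretical_mass) / theoretical_mass * 1e6


def elements_in_formula(formula):
    """Element symbols in a formula: an uppercase letter plus an optional lowercase one."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def search_candidates(observed_mass, database, ppm_threshold, allowed_elements=None):
    """Return database entries within the ppm window, optionally element-filtered."""
    hits = []
    for entry in database:
        if ppm_error(observed_mass, entry["mass"]) > ppm_threshold:
            continue
        if allowed_elements is not None:
            # Reject formulas containing any element outside the allowed set.
            if not elements_in_formula(entry["formula"]) <= allowed_elements:
                continue
        hits.append(entry)
    return hits


# Hypothetical toy database; IDs and masses are illustrative values only.
toy_database = [
    {"id": "toy_A", "formula": "C8H10N4O2", "mass": 194.0804},
    {"id": "toy_B", "formula": "C7H15SiO3", "mass": 194.0805},
    {"id": "toy_C", "formula": "C10H14N2O2", "mass": 194.1055},
]
allowed = set("C,H,O,N,P,S,F,Cl,Br,I".split(","))
hits = search_candidates(194.0803, toy_database, 2, allowed)
unfiltered_hits = search_candidates(194.0803, toy_database, 2)
```

With a 2 ppm window, `toy_A` and `toy_B` both match on mass, but `toy_B` contains Si and is removed once the element filter is enabled. `toy_C` is more than 100 ppm away and never matches.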
5.1 Naive annotation
You can run the example step by step as follows:
cd src/5_annotation
python preprocess.py --edited_molecular_network_file ../3_construct_molecular_network/source_target_cor_edit.csv --seednode_file ./test_files/seednode.csv --candidates_folder ../4_search_candidates/candidates --top_k 10
python naive_prediction.py --edited_molecular_network_file ../3_construct_molecular_network/source_target_cor_edit.csv --seednode_file ./test_files/seednode.csv --tanimoto_similarity_threshold 0.5

- edited_molecular_network_file: Path to the edited molecular network CSV file with Source, Target, correlation, RT, etc., from Step 3.
- seednode_file: Path to the seed node CSV file. This file should contain a list of initial compounds (including columns such as ID and SMILES) to be used for annotation.
- candidates_folder: Path to the folder with candidate compounds per node from Step 4.
- top_k: Number of top candidates to retain per node based on structural similarity.
- tanimoto_similarity_threshold: Minimum Tanimoto similarity to accept candidate annotations.
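
As a rough illustration of how `top_k` and `tanimoto_similarity_threshold` interact, the sketch below ranks candidates by Tanimoto similarity to a seed structure. Fingerprints are represented as plain sets of "on" bit indices to keep the example dependency-free; the real scripts presumably derive fingerprints from SMILES with a cheminformatics toolkit, which this README does not specify. All names and bit sets are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0
    return len(bits_a & bits_b) / union


def rank_candidates(seed_bits, candidates, top_k, threshold):
    """Keep the top_k most similar candidates, then drop those below the threshold."""
    scored = [(name, tanimoto(seed_bits, bits)) for name, bits in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(name, score) for name, score in scored[:top_k] if score >= threshold]


# Hypothetical toy fingerprints.
seed_fingerprint = {1, 2, 3, 4}
toy_candidates = {
    "cand_a": {1, 2, 3, 5},  # shares 3 of 5 distinct bits with the seed
    "cand_b": {1, 6, 7, 8},  # shares 1 of 7 distinct bits with the seed
    "cand_c": {1, 2, 3, 4},  # identical to the seed
}
ranked = rank_candidates(seed_fingerprint, toy_candidates, top_k=2, threshold=0.5)
```

Here `cand_c` and `cand_a` are retained, in that order, while `cand_b` falls outside the top two and would also fail the 0.5 threshold.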
Alternatively, run the iterative annotation as follows:
cd src/5_annotation
python run_naive_iterative_annotation.py --edited_molecular_network_file ../3_construct_molecular_network/source_target_cor_edit.csv --seednode_file ./test_files/seednode.csv --candidates_folder ../4_search_candidates/candidates --tanimoto_similarity_threshold 0.5 --max_iterations 100 --top_k 10

- max_iterations: Maximum number of annotation rounds during the iterative annotation process.
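
The recursive propagation controlled by `max_iterations` can be pictured with the following sketch. It assumes that each round tries to annotate the unannotated neighbors of nodes annotated in the previous round and that the process stops when a round adds nothing new or the round limit is reached. The function names and the `toy_annotator` callback are hypothetical stand-ins for the candidate scoring done by the real scripts.

```python
def iterative_annotation(edges, seeds, annotate_neighbor, max_iterations):
    """Propagate annotations outward from seed nodes, one network hop per round.

    edges: list of (node_a, node_b) pairs.
    seeds: dict mapping node -> known structure.
    annotate_neighbor(node, known_node, known_structure) -> structure or None.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    annotated = dict(seeds)
    frontier = set(seeds)
    rounds = 0
    while frontier and rounds < max_iterations:
        rounds += 1
        new_frontier = set()
        for known in sorted(frontier):
            for node in sorted(neighbors.get(known, ())):
                if node in annotated:
                    continue
                result = annotate_neighbor(node, known, annotated[known])
                if result is not None:
                    annotated[node] = result
                    new_frontier.add(node)
        # Newly annotated nodes act as seeds in the next round.
        frontier = new_frontier
    return annotated, rounds


def toy_annotator(node, known_node, known_structure):
    """Hypothetical scorer: every node except 'd' receives an accepted candidate."""
    return None if node == "d" else f"candidate_for_{node}"


toy_edges = [("s", "a"), ("a", "b"), ("b", "c"), ("c", "d")]
full_result, full_rounds = iterative_annotation(toy_edges, {"s": "CCO"}, toy_annotator, 100)
capped_result, capped_rounds = iterative_annotation(toy_edges, {"s": "CCO"}, toy_annotator, 2)
```

On this chain-shaped toy network, annotations reach `c` after three rounds, a fourth round fails on `d`, and the loop ends well before the limit of 100. With a limit of 2, propagation stops at `b`.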
5.2 Annotation with CFM-ID
First, prepare the CFM-ID environment. Then run the example step by step as follows:
cd src/5_annotation
python preprocess.py --edited_molecular_network_file ../3_construct_molecular_network/source_target_cor_edit.csv --seednode_file ./test_files/seednode.csv --candidates_folder ../4_search_candidates/candidates --top_k 10
python cfmid_prediction.py --num_containers 10 --tolerance 0.1 --energy_level 0 --ion_mode positive --spectrum_file ./test_files/compounds_spectrum.mgf --modified_cosine_similarity_threshold 0.7
python postprocess.py --seednode_file ./test_files/seednode.csv

- num_containers: Number of Docker containers to run in parallel for CFM-ID predictions.
- tolerance: Mass tolerance for matching predicted and experimental peaks.
- energy_level: Collision energy level (e.g., 0, 10, 20).
- ion_mode: Ionization mode, either positive or negative.
- spectrum_file: Path to the experimental spectrum file in MGF format.
- modified_cosine_similarity_threshold: Minimum modified cosine similarity to accept spectrum match.
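
For readers unfamiliar with the score behind `modified_cosine_similarity_threshold`, here is a minimal greedy sketch of a modified cosine. Two peaks may match either directly or after shifting by the precursor mass difference, both within `tolerance`. This is a simplified illustration and not necessarily how `cfmid_prediction.py` computes the score: intensity weighting, peak filtering, and the assignment strategy may differ, and the spectra below are hypothetical.

```python
import math


def modified_cosine(peaks_a, precursor_a, peaks_b, precursor_b, tolerance):
    """Greedy modified cosine between two spectra given as lists of (mz, intensity)."""
    shift = precursor_a - precursor_b
    pairs = []
    for i, (mz_a, int_a) in enumerate(peaks_a):
        for j, (mz_b, int_b) in enumerate(peaks_b):
            delta = mz_a - mz_b
            # Accept a direct match or a match after the precursor mass shift.
            if abs(delta) <= tolerance or abs(delta - shift) <= tolerance:
                pairs.append((int_a * int_b, i, j))

    # Greedy assignment: strongest products first, each peak used at most once.
    pairs.sort(reverse=True)
    used_a, used_b = set(), set()
    dot = 0.0
    for product, i, j in pairs:
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        dot += product

    norm_a = math.sqrt(sum(intensity ** 2 for _, intensity in peaks_a))
    norm_b = math.sqrt(sum(intensity ** 2 for _, intensity in peaks_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical spectra: B resembles A with a +14 modification on one fragment.
spectrum_a = [(100.0, 1.0), (200.0, 0.5)]
spectrum_b = [(100.0, 1.0), (214.0, 0.5)]
score_shifted = modified_cosine(spectrum_a, 300.0, spectrum_b, 314.0, 0.1)
# Equal precursor masses disable the shift, so only the 100.0 peaks match.
score_unshifted = modified_cosine(spectrum_a, 300.0, spectrum_b, 300.0, 0.1)
```

The shifted match lets the 200.0 and 214.0 fragments pair up, so the analog scores higher than it would under a plain cosine, which is why this score suits comparisons between structurally related compounds.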
Alternatively, run the iterative annotation as follows:
cd src/5_annotation
python run_iterative_annotation.py --molecular_network_file ../3_construct_molecular_network/source_target_cor_edit.csv --seednode_file ./test_files/seednode.csv --candidates_folder ../4_search_candidates/candidates --num_containers 10 --tolerance 0.1 --energy_level 0 --ion_mode positive --modified_cosine_similarity_threshold 0.5 --spectrum_file ./test_files/compounds_spectrum.mgf --max_iterations 100 --top_k 10

- max_iterations: Maximum number of annotation rounds during the iterative annotation process.
Finally, you can download the molecular structure images as follows:
python download_mol_imgs.py --annotation_result_file final_naive_annotation_results.csv --structure_image_folder naive_mol_imgs/

- annotation_result_file: CSV file containing final annotation results, with SMILES or molecular identifiers.
- structure_image_folder: Output folder to save downloaded molecular structure images.
This project is licensed under the MIT License.
- Shuping Zheng: [email protected]
- Jianping Zhou: [email protected]
If you find this repo useful, please cite our paper. Thanks for your attention.