JeremyChou28/C_GRAN

C_GRAN (Community-Guided Recursive Annotation Network)

📖 About the Project

C-GRAN (Community-Guided Recursive Annotation Network) is an open, network-based framework for the systematic discovery and annotation of emerging structural analogs in complex environmental samples. Designed for non-target screening (NTS) with tandem mass spectrometry (MS/MS), C-GRAN integrates molecular networking with sample-wise co-occurrence analysis to uncover structurally or functionally related compounds beyond the constraints of spectral similarity. Starting from high-confidence seed annotations, it expands the set of candidate compounds through a recursive database search based on exact-mass matching. Each candidate is ranked by structural similarity to known analogs, fragment match quality, and occurrence correlation. By iteratively propagating annotations across molecular networks, C-GRAN enables high-coverage identification of structurally diverse compounds, especially those missed by traditional spectral-based tools.

💻 Getting Started

Requirements

Install the Python dependencies:

pip install -r requirements.txt

Quick start

Step 1. Calculate correlation

cd src/1_calculate_correlation

python calculate_correlation.py --intensity_file test_files/test.txt --compounds_num 13 --samples_num 98 --correlation_result_filename correlation_results.csv
  • intensity_file: Path to the input data file.

  • compounds_num: Number of compounds in the dataset (i.e., number of rows in the input file).

  • samples_num: Number of samples per compound (i.e., number of columns in the input file).

  • correlation_result_filename: Name of the output CSV file that will store the computed correlation coefficients.
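
To make the idea behind this step concrete, here is a minimal sketch of a pairwise correlation over a compounds-by-samples intensity matrix. It assumes Pearson correlation and a toy in-memory matrix; the coefficient actually used by calculate_correlation.py, its input parsing, and its output columns are not documented here, so the function names and the CSV header below are hypothetical.

```python
import csv
import io
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / denom


def pairwise_correlation(intensities):
    """Return (i, j, r) for every unordered pair of compound rows."""
    return [
        (i, j, pearson(intensities[i], intensities[j]))
        for i, j in combinations(range(len(intensities)), 2)
    ]


# Toy matrix: 3 compounds (rows) x 4 samples (columns).
intensities = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # rises with row 0
    [4.0, 3.0, 2.0, 1.0],  # falls as row 0 rises
]
results = pairwise_correlation(intensities)

# Write the pairs as CSV text (hypothetical column names).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["compound_a", "compound_b", "correlation"])
writer.writerows(results)
print(buffer.getvalue())
```

In this layout, compounds_num corresponds to the number of rows and samples_num to the number of columns of the matrix.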

Step 2. Filter compounds with high correlation values

cd src/2_filter_high_correlation_compounds

python filter_high_correlation_compounds.py --correlation_file ../1_calculate_correlation/correlation_results.csv --seednode_file test_files/seednode.csv --correlation_threshold 0.7
  • correlation_file: Path to the correlation results file (correlation_results.csv) from Step 1.

  • seednode_file: Path to the seed node CSV file. This file should contain a list of initial compounds (including columns such as ID and SMILES) to be used for annotation.

  • correlation_threshold: Minimum correlation coefficient (between 0 and 1) a compound must reach to be retained.
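
The filtering idea can be sketched as follows. This is an assumption about the behaviour, not the script's confirmed logic: it keeps a compound pair when its correlation reaches the threshold and at least one member of the pair is a seed node. The function name filter_by_seed and the tuple layout are hypothetical.

```python
def filter_by_seed(correlations, seed_ids, threshold):
    """Keep (a, b, r) rows where r >= threshold and at least one end is a seed."""
    seeds = set(seed_ids)
    return [
        (a, b, r)
        for a, b, r in correlations
        if r >= threshold and (a in seeds or b in seeds)
    ]


# Toy correlation rows: (compound_a, compound_b, correlation).
correlations = [
    ("c1", "c2", 0.95),
    ("c1", "c3", 0.30),
    ("c2", "c3", 0.88),
    ("c3", "c4", 0.75),
]

only_c1 = filter_by_seed(correlations, seed_ids=["c1"], threshold=0.7)
print(only_c1)
```

Raising the threshold shrinks the set of compounds carried into later steps, so it trades coverage for confidence.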

Step 3. Construct molecular network

cd src/3_construct_molecular_network

python construct_molecular_network.py --molecular_network_file test_files/source_target.csv --correlation_file ../1_calculate_correlation/correlation_results.csv --correlation_threshold 0.7 --RT_threshold 0.01 --edited_molecular_network_file source_target_cor_edit.csv
  • molecular_network_file: Path to the molecular network file (CSV), containing columns such as Source, Target, and retention time (RT).

  • correlation_file: Path to the correlation results file (correlation_results.csv) from Step 1.

  • correlation_threshold: Correlation threshold (between 0 and 1).

  • RT_threshold: Maximum allowed retention time difference between two nodes to include an edge.

  • edited_molecular_network_file: Path to the edited molecular network CSV file with Source, Target, correlation, RT, etc.
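
A minimal sketch of the edge filtering described by these parameters is shown below. It follows the parameter descriptions as written: an edge is kept when its correlation reaches correlation_threshold and the retention time difference between its two nodes does not exceed RT_threshold. The Edge class, its field names, and filter_edges are hypothetical; the real script may combine the criteria differently.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    source: str
    target: str
    correlation: float
    rt_source: float  # retention time of the source node
    rt_target: float  # retention time of the target node


def filter_edges(edges, correlation_threshold, rt_threshold):
    """Keep edges that pass both the correlation and the RT-difference check."""
    kept = []
    for edge in edges:
        rt_diff = abs(edge.rt_source - edge.rt_target)
        if edge.correlation >= correlation_threshold and rt_diff <= rt_threshold:
            kept.append(edge)
    return kept


edges = [
    Edge("n1", "n2", 0.92, 5.200, 5.205),  # passes both checks
    Edge("n1", "n3", 0.40, 5.200, 5.203),  # correlation too low
    Edge("n2", "n4", 0.85, 5.205, 6.100),  # RT difference too large
]
kept = filter_edges(edges, correlation_threshold=0.7, rt_threshold=0.01)
print([(e.source, e.target) for e in kept])
```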

Step 4. Search candidates

To build the PubChem database from scratch, run this script first:

cd src/4_search_candidates

python process_pubchem_database.py

Alternatively, download our prepared PubChem database from Google Drive or Baidu Yun.

Then run this script to search for candidates:

python search_candidates.py --edited_molecular_network_file ../3_construct_molecular_network/source_target_cor_edit.csv --pubchem_database_path ./pubchem_database.pk --candidates_folder ./candidates/ --ppm_threshold 2 --is_filter_element --element_set 'C,H,O,N,P,S,F,Cl,Br,I'
  • edited_molecular_network_file: Path to the edited molecular network CSV file (with Source, Target, correlation, RT, etc.) from Step 3.

  • pubchem_database_path: Path to the preprocessed PubChem database (pickle format).

  • candidates_folder: Output folder to save retrieved candidate compounds for each node.

  • ppm_threshold: Mass accuracy threshold in parts per million (ppm) for candidate matching.

  • is_filter_element: Flag to enable filtering candidates by allowed elements.

  • element_set: Comma-separated list of allowed chemical elements in candidate formulas.
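
The two filters controlled by ppm_threshold and element_set can be illustrated with the sketch below. It assumes candidates are matched on a single mass value per record and that the element filter rejects any formula containing an element outside the allowed set. The database records, their masses, and all function names are toy values made up for illustration, not real reference data or the script's API.

```python
import re


def ppm_error(observed_mass, theoretical_mass):
    """Absolute mass error in parts per million, relative to the theoretical mass."""
    return abs(observed_mass - theoretical_mass) / theoretical_mass * 1e6


def elements_in(formula):
    """Element symbols in a formula string, e.g. 'C2H5Cl' -> {'C', 'H', 'Cl'}."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def search_candidates(observed_mass, database, ppm_threshold, allowed_elements=None):
    """Return names of records within the ppm window, optionally element-filtered."""
    hits = []
    for name, formula, mass in database:
        if ppm_error(observed_mass, mass) > ppm_threshold:
            continue
        if allowed_elements is not None and not elements_in(formula) <= allowed_elements:
            continue
        hits.append(name)
    return hits


# Toy records: (name, formula, mass). Masses are illustrative only.
database = [
    ("toy_A", "C8H10N4O2", 194.0804),
    ("toy_B", "C7H14O4Si", 194.0805),  # contains Si, which is not in the allowed set
    ("toy_C", "C10H10O4", 194.0900),   # about 50 ppm away, outside the window
]
allowed = set("C,H,O,N,P,S,F,Cl,Br,I".split(","))

print(search_candidates(194.0803, database, ppm_threshold=2))
print(search_candidates(194.0803, database, ppm_threshold=2, allowed_elements=allowed))
```

At a mass near 194, a 2 ppm window is only about ±0.0004, which is why the third toy record is rejected.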

Step 5. Annotation

5.1 Naive annotation

Run the example step by step as follows:

cd src/5_annotation

python preprocess.py --edited_molecular_network_file ../3_construct_molecular_network/source_target_cor_edit.csv --seednode_file ./test_files/seednode.csv --candidates_folder ../4_search_candidates/candidates --top_k 10

python naive_prediction.py --edited_molecular_network_file ../3_construct_molecular_network/source_target_cor_edit.csv --seednode_file ./test_files/seednode.csv --tanimoto_similarity_threshold 0.5
  • edited_molecular_network_file: Path to the edited molecular network CSV file (with Source, Target, correlation, RT, etc.) from Step 3.

  • seednode_file: Path to the seed node CSV file. This file should contain a list of initial compounds (including columns such as ID and SMILES) to be used for annotation.

  • candidates_folder: Path to the folder with candidate compounds per node from Step 4.

  • top_k: Number of top candidates to retain per node based on structural similarity.

  • tanimoto_similarity_threshold: Minimum Tanimoto similarity to accept candidate annotations.
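
As a rough illustration of how top_k and tanimoto_similarity_threshold interact, the sketch below ranks candidates by Tanimoto similarity to a seed and keeps the best ones above the threshold. Fingerprints are represented as plain sets of on-bit indices, which is an assumption made to keep the example dependency-free; how the repository derives fingerprints from SMILES is not covered here, and rank_candidates is a hypothetical name.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def rank_candidates(seed_fp, candidates, top_k, threshold):
    """Return up to top_k (name, score) pairs with score >= threshold, best first."""
    scored = sorted(
        ((tanimoto(seed_fp, fp), name) for name, fp in candidates.items()),
        key=lambda item: (-item[0], item[1]),  # highest score first, then by name
    )
    return [(name, score) for score, name in scored[:top_k] if score >= threshold]


# Toy fingerprints (sets of on-bit indices), not derived from real molecules.
seed_fp = {1, 2, 3, 4}
candidates = {
    "cand_A": {1, 2, 3, 4},  # identical to the seed
    "cand_B": {1, 2, 3, 5},  # shares 3 of 5 distinct bits
    "cand_C": {1, 9},        # shares 1 of 5 distinct bits
}

print(rank_candidates(seed_fp, candidates, top_k=10, threshold=0.5))
```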

Alternatively, run the iterative annotation as follows:

cd src/5_annotation

python run_naive_iterative_annotation.py --edited_molecular_network_file ../3_construct_molecular_network/source_target_cor_edit.csv --seednode_file ./test_files/seednode.csv --candidates_folder ../4_search_candidates/candidates --tanimoto_similarity_threshold 0.5 --max_iterations 100 --top_k 10
  • max_iterations: Maximum number of annotation rounds during the iterative annotation process.
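
The role of max_iterations can be pictured with the following sketch of annotation propagation over a network. It is a simplified assumption about the control flow: each round tries to annotate the unannotated neighbours of nodes annotated in the previous round, and the loop stops when a round adds nothing or the iteration cap is reached. The annotate callback is a hypothetical stand-in for the per-node scoring done by the real scripts.

```python
def iterative_annotation(edges, seeds, annotate, max_iterations):
    """Propagate annotations outward from seed nodes, one network hop per round."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    annotated = dict(seeds)
    frontier = set(annotated)
    for _ in range(max_iterations):
        new = {}
        for node in sorted(frontier):
            for nb in sorted(neighbors.get(node, ())):
                if nb in annotated or nb in new:
                    continue
                result = annotate(nb, annotated[node])  # None means "no confident match"
                if result is not None:
                    new[nb] = result
        if not new:
            break  # converged: nothing new was annotated this round
        annotated.update(new)
        frontier = set(new)
    return annotated


# Toy network A-B-C-X-Y; node X never gets a confident annotation.
edges = [("A", "B"), ("B", "C"), ("C", "X"), ("X", "Y")]
seeds = {"A": "seed"}


def toy_annotate(node, reference):
    return None if node == "X" else f"{node}~{reference}"


print(iterative_annotation(edges, seeds, toy_annotate, max_iterations=100))
print(iterative_annotation(edges, seeds, toy_annotate, max_iterations=1))
```

In the toy run, Y stays unannotated because the only path to it passes through X, which shows how a single failed node can block propagation along a chain.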

5.2 Annotation with CFM-ID

First, set up the CFM-ID environment. Then run the example step by step as follows:

cd src/5_annotation

python preprocess.py --edited_molecular_network_file ../3_construct_molecular_network/source_target_cor_edit.csv --seednode_file ./test_files/seednode.csv --candidates_folder ../4_search_candidates/candidates --top_k 10

python cfmid_prediction.py --num_containers 10 --tolerance 0.1 --energy_level 0 --ion_mode positive --spectrum_file ./test_files/compounds_spectrum.mgf --modified_cosine_similarity_threshold 0.7

python postprocess.py --seednode_file ./test_files/seednode.csv 
  • num_containers: Number of Docker containers to run in parallel for CFM-ID predictions.

  • tolerance: Mass tolerance for matching predicted and experimental peaks.

  • energy_level: Collision energy level (e.g., 0, 10, 20).

  • ion_mode: Ionization mode, either positive or negative.

  • spectrum_file: Path to the experimental spectrum file in MGF format.

  • modified_cosine_similarity_threshold: Minimum modified cosine similarity to accept spectrum match.
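
For readers unfamiliar with the score behind modified_cosine_similarity_threshold, here is a simplified sketch of a modified cosine between two spectra. It uses greedy peak assignment and raw intensities, and it lets peaks match either directly or after shifting by the precursor mass difference, within tolerance. This is a generic approximation written for illustration; the repository's implementation, intensity scaling, and peak assignment may differ.

```python
import math


def modified_cosine(peaks_a, precursor_a, peaks_b, precursor_b, tolerance):
    """Greedy modified cosine between two peak lists of (mz, intensity) tuples."""
    shift = precursor_a - precursor_b
    pairs = []
    for i, (mz_a, int_a) in enumerate(peaks_a):
        for j, (mz_b, int_b) in enumerate(peaks_b):
            delta = mz_a - mz_b
            # A pair matches directly or after accounting for the precursor shift.
            if abs(delta) <= tolerance or abs(delta - shift) <= tolerance:
                pairs.append((int_a * int_b, i, j))

    # Greedy assignment: take the strongest products first, using each peak once.
    pairs.sort(reverse=True)
    used_a, used_b, total = set(), set(), 0.0
    for product, i, j in pairs:
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        total += product

    norm_a = math.sqrt(sum(intensity ** 2 for _, intensity in peaks_a))
    norm_b = math.sqrt(sum(intensity ** 2 for _, intensity in peaks_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return total / (norm_a * norm_b)


# Toy spectra: spectrum_b is spectrum_a with every peak and the precursor shifted by 14.
spectrum_a = [(50.0, 10.0), (100.0, 50.0), (150.0, 100.0)]
spectrum_b = [(64.0, 10.0), (114.0, 50.0), (164.0, 100.0)]

shifted = modified_cosine(spectrum_a, 200.0, spectrum_b, 214.0, tolerance=0.1)
unshifted = modified_cosine(spectrum_a, 200.0, spectrum_b, 200.0, tolerance=0.1)
print(shifted, unshifted)
```

With the precursor shift taken into account, the two toy spectra match fully; without it, no peaks align. That difference is why the modified variant is useful for structural analogs that differ by a fixed mass.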

Alternatively, run the iterative annotation as follows:

cd src/5_annotation

python run_iterative_annotation.py --molecular_network_file ../3_construct_molecular_network/source_target_cor_edit.csv --seednode_file ./test_files/seednode.csv --candidates_folder ../4_search_candidates/candidates --num_containers 10 --tolerance 0.1 --energy_level 0 --ion_mode positive --modified_cosine_similarity_threshold 0.5 --spectrum_file ./test_files/compounds_spectrum.mgf --max_iterations 100 --top_k 10
  • max_iterations: Maximum number of annotation rounds during the iterative annotation process.

Finally, download the molecular structure images as follows:

python download_mol_imgs.py --annotation_result_file final_naive_annotation_results.csv --structure_image_folder naive_mol_imgs/
  • annotation_result_file: CSV file containing final annotation results, with SMILES or molecular identifiers.

  • structure_image_folder: Output folder to save downloaded molecular structure images.

📝 License

This project is licensed under the MIT License.

👥 Contact

🔗 Citation

If you find this repo useful, please cite our paper. Thanks for your attention.
