StarSignDNA is a powerful algorithm for mutational signature analysis that provides both efficient refitting of known signatures and de novo signature extraction. The tool excels at:
- Deciphering well-differentiated signatures linked to known mutagenic mechanisms
- Demonstrating strong associations with patient clinical features
- Providing robust statistical analysis and visualization
- Handling both single-sample and cohort analyses
- Supporting various input formats including VCF files
- License: MIT license
- Paper: Preprint
- Source Code: https://github.com/uio-bmi/StarSignDNA
You can install StarSignDNA in one of two ways:
Via PyPI (Recommended):
pip install starsigndna
From Source:
git clone https://github.com/uio-bmi/StarSignDNA cd StarSignDNA pip install -e .
- Signature Refitting: Accurate refitting of known mutational signatures to new samples
- De Novo Discovery: Novel signature extraction from mutation catalogs
- Flexible Input: Support for both VCF files and mutation count matrices
- Statistical Robustness: Built-in bootstrapping and cross-validation
- Visualization: Comprehensive plotting functions for signature analysis
- Performance: Optimized algorithms for both small and large datasets
- Reference Genome Support: Built-in handling of various genome builds
- Enhanced CLI: Improved signature parsing supporting both comma and space-separated formats
Get help on available commands:
starsigndna --help
- count-mutation: Count mutation types in VCF files
- denovo: Perform de novo signature discovery
- refit: Refit known signatures to new samples
The refitting algorithm matches mutation patterns against known COSMIC signatures.
Basic Usage:
starsigndna refit <matrix_file> <signature_file> [OPTIONS]
Example with Specific Signatures:
starsigndna refit example_data/M_catalogue.txt example_data/COSMICv34.txt \
--output-folder /test_result \
--signature-names "SBS1,SBS3,SBS5,SBS6,SBS8"
Example with VCF Input:
starsigndna refit example_data/tcga_coad_single.vcf example_data/COSMICv34.txt \
--output-folder /test_result \
--signature-names "SBS1,SBS3,SBS5,SBS6,SBS8" \
--ref-genome GRCh38
- --ref_genome: Reference genome for VCF processing
- --n_bootstraps: Number of bootstrap iterations (default: 200)
- --opportunity_file: Custom mutation opportunity matrix
- --signature_names: Specific signatures to consider (minimum 5 signatures required)
- --n_iterations: Maximum optimization iterations (default: 1000)
Signature Names Format: The --signature-names parameter accepts both comma-separated and space-separated formats: * Comma-separated: "SBS1,SBS3,SBS5,SBS6,SBS8"
The refit command generates several output files in the specified output folder:
For single sample analysis: * StarSign_refit_exposure_median_{run_name}.txt: Median exposure values across bootstrap iterations * StarSign_refit_exposure_Exposure_{run_name}.txt: Full exposure matrix from bootstrap analysis * StarSign_refit_exposure_Exposure_{run_name}.png: Violin plot of exposure distributions
For cohort analysis: * refit_{run_name}_threshold.txt: Exposure matrix after signature filtering * average_refit_{run_name}.txt: Average exposure values across samples * starsign_refit_top5_signatures_{run_name}.png: Bar plot of top 5 signatures by average exposure * starsign_refit_cohort_{run_name}.png: Violin plot showing exposure distributions across cohort
For VCF input: * matrix.csv: Generated mutation count matrix from VCF file
The de novo algorithm extracts novel signatures from mutation patterns.
Basic Usage:
starsigndna denovo <matrix_file> <n_signatures> [OPTIONS]
Example with Optimal Parameters:
starsigndna denovo snakemake/results/data/M_catalogue.txt 4 \
--cosmic-file example_data/COSMICv34.txt \
--output-folder /test_result
In this example, we are using the default parameters for the de novo algorithm.
For systematic parameter optimization and cross-validation, StarSignDNA includes a Snakemake workflow that performs grid search over multiple parameters.
Prerequisites:
* Install Snakemake: pip install snakemake
* Ensure you have sufficient computational resources
Workflow Steps:
Navigate to Snakemake Directory:
cd snakemake
Configure Grid Search Parameters:
# Edit the Snakefile to set your desired parameter ranges vi Snakefile # Key parameters to modify: ks = list(range(2, 10)) # Number of signatures to test lambdas = [0, 0.01, 0.05, 0.1, 0.2] # Regularization parameters # For quick testing, use smaller ranges: ks = list(range(2, 3)) # Test fewer signatures lambdas = [0, 0.01] # Test fewer lambda values
Prepare Input Data:
# Ensure your mutation matrix is in the correct location # Default expected location: results/data/train_m.csv and results/data/test_m.csv # Or modify the Snakefile to point to your data files
Run Grid Search:
# Run with multiple CPU cores for faster processing snakemake -j 4 # Use 4 CPU cores snakemake -j <num_cpu> # Use specified number of cores # For dry run (check what will be executed): snakemake -n
Monitor Progress:
# The workflow will show progress and create multiple output files # Results are stored in: results/data/<fold>/<k>/<lambda>/ # Each combination generates: e.csv, s.csv, e_new.csv, logpmf.txt
Find Optimal Parameters:
# Sort results by log-likelihood to find best parameters sort -k3n,3 results/data/all.csv | head -10 # The output shows: fold, k, lambda, log-likelihood # Lower log-likelihood values indicate better model fit **Run de novo analysis with the optimal parameters:** starsigndna denovo example_data/M_catalogue.txt 4 --lambd 0.01 \ --cosmic-file example_data/COSMICv34.txt \ --output-folder /test_result
Understanding Snakemake Output:
- Cross-validation Results: Each parameter combination is tested across multiple folds
- Signature Files (s.csv): Extracted signatures for each parameter set
- Exposure Files (e.csv, e_new.csv): Training and test exposures
- Log-likelihood (logpmf.txt): Model fitness scores
- Summary (all.csv): Consolidated results across all parameter combinations
- --lambd: Regularization parameter (default: 0.7)
- --opportunity-file: Custom mutation opportunity matrix
- --cosmic-file: Reference signatures for comparison
- --max-em-iterations: Maximum EM iterations (default: 100)
- --max-gd-iterations: Maximum gradient descent steps (default: 50)
The denovo command generates several output files in the specified output folder:
- StarSign_denovo_{run_name}_signature.txt: Extracted mutational signatures matrix (signatures × mutation types)
- StarSign_denovo_{run_name}_exposures.txt: Signature exposures for each sample (samples × signatures)
- StarSign_denovo_{run_name}_profile.png: Visualization of extracted signatures
- StarSign_denovo{run_name}_cosine_similarity.txt: Similarity scores with known COSMIC signatures (if --cosmic-file provided)
For VCF input: * matrix.csv: Generated mutation count matrix from VCF file
- Maintainer: Christian D. Bope
- Email: [email protected]
- Institution:University of Oslo/University of Kinshasa
If you use StarSignDNA in your research, please cite our paper: [Citation details to be added after publication]