This pipeline extracts and analyzes DNA sequences for population structure and genetic differentiation analysis of Rhodnius prolixus populations across Colombia, Brazil, and Venezuela. The analysis follows methods from the reference paper (Nature Scientific Reports, 2025).
- sequence_extraction_script.py - Main script to extract sequences from NCBI GenBank
- population_genetics_analyzer.py - Basic population genetics analysis script
- requirements.txt - Python package dependencies
- Metadata.csv - Sample metadata with accession numbers
- README.md - This instruction file
Based on your metadata, the following 8 loci will be extracted:
- 28S - Nuclear ribosomal RNA gene
- CISP - Nuclear protein-coding gene
- TRNA - Transfer RNA gene
- CYTB - Mitochondrial cytochrome b gene
- UPMETAL - Nuclear protein-coding gene
- UPCA - Nuclear protein-coding gene
- PJH - Nuclear protein-coding gene
- LSM - Nuclear protein-coding gene

Install the Python packages:

    pip install -r requirements.txt

Edit the sequence_extraction_script.py file and replace the placeholder email address with your actual email address (required by NCBI):

    extractor = SequenceExtractor(email="<your email address>")

Then run the extraction script:

    python sequence_extraction_script.py

What this does:
- Reads accession numbers from Metadata.csv
- Downloads sequences from NCBI GenBank for all 8 gene loci
- Organizes sequences by gene type
- Creates mapping files and summary reports
Expected Output:
- sequences/ directory with FASTA files for each gene
- sample_sequence_mapping.csv - Shows which sequences were retrieved per sample
- extraction_summary.txt - Summary statistics
- sequence_extraction.log - Detailed log file
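
The download step can be sketched as follows. This is not the code in sequence_extraction_script.py; it is a minimal illustration that assumes Biopython's `Bio.Entrez` module is installed. The function names (`batched`, `entrez_fetch`, `fetch_all`), the batch size of 50, and the 0.4 second pause are all assumptions chosen for the example.

```python
import time


def batched(items, size):
    """Yield consecutive chunks of at most `size` accessions."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def entrez_fetch(accessions, email):
    """Fetch FASTA text for one batch (needs Biopython and network access)."""
    from Bio import Entrez  # imported lazily so the other helpers work without it

    Entrez.email = email  # NCBI asks every client to identify itself
    handle = Entrez.efetch(db="nucleotide", id=",".join(accessions),
                           rettype="fasta", retmode="text")
    try:
        return handle.read()
    finally:
        handle.close()


def fetch_all(accessions, email, fetch=entrez_fetch, batch_size=50, delay=0.4):
    """Download all accessions in batches, pausing between requests."""
    chunks = []
    for batch in batched(list(accessions), batch_size):
        chunks.append(fetch(batch, email))
        time.sleep(delay)  # stay under NCBI's request rate limit
    return "".join(chunks)
```

Passing a different `fetch` function makes it possible to try the batching logic offline before contacting NCBI.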
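
Run the analysis script (listed above as population_genetics_analyzer.py; use the filename present in your copy):

    python population_analysis_script.py

What this does: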
- Calculates basic genetic diversity statistics
- Groups samples by country and region
- Generates summary visualizations
- Prepares data for advanced analyses
Expected Output:
- population_analysis_summary.txt - Comprehensive analysis summary
- population_analysis_overview.png - Visualization plots
Based on the methods in your reference paper, you should obtain:
| Metric | Description |
|---|---|
| Number of sequences | Total sequences retrieved per locus |
| Sequence length | Base pairs per aligned locus |
| Segregating sites (S) | Variable nucleotide positions |
| Nucleotide diversity (π) | Average number of pairwise nucleotide differences per site |
| Haplotype diversity (h) | Probability that two randomly sampled haplotypes differ |
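
To make the metrics in the table concrete, here is a small pure-Python sketch of how S, π, and h can be computed from one aligned locus. It is an illustration, not the implementation used by the analysis script, and it assumes upper-case sequences of equal length. The function names are made up for this example.

```python
from collections import Counter
from itertools import combinations


def segregating_sites(seqs):
    """Count alignment columns that contain more than one unambiguous base."""
    count = 0
    for column in zip(*seqs):
        bases = {b for b in column if b in "ACGT"}
        if len(bases) > 1:
            count += 1
    return count


def nucleotide_diversity(seqs):
    """Mean pairwise differences per site, skipping gaps and ambiguity codes."""
    total = 0.0
    pairs = 0
    for a, b in combinations(seqs, 2):
        compared = [(x, y) for x, y in zip(a, b) if x in "ACGT" and y in "ACGT"]
        if compared:
            total += sum(x != y for x, y in compared) / len(compared)
            pairs += 1
    return total / pairs if pairs else 0.0


def haplotype_diversity(seqs):
    """Nei's h = n / (n - 1) * (1 - sum of squared haplotype frequencies)."""
    n = len(seqs)
    if n < 2:
        return 0.0
    freqs = [c / n for c in Counter(seqs).values()]
    return n / (n - 1) * (1 - sum(p * p for p in freqs))


if __name__ == "__main__":
    aligned = ["ACGTACGT", "ACGTACGA", "ACGAACGT", "ACGTACGT"]  # toy data
    print(segregating_sites(aligned))
    print(nucleotide_diversity(aligned))
    print(haplotype_diversity(aligned))
```

Dedicated tools such as DnaSP handle gaps and missing data with their own rules, so values from this sketch may differ slightly from theirs.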
- Sample distribution: Colombia (~94 samples), Venezuela (~29 samples), Brazil (~6 samples)
- Geographic clustering: Samples grouped by country/state
- Population differentiation: FST values between populations
- Gene flow patterns: Connectivity between regions
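
As an illustration of how a pairwise FST value can be obtained from sequence data, the sketch below uses the Hudson, Slatkin and Maddison (1992) estimator, FST = 1 - Hw / Hb, where Hw is the mean number of differences within populations and Hb the mean between populations. This is one of several estimators and is not necessarily the one used in the reference paper. It assumes gap-free aligned sequences of equal length and at least two sequences per population; the function names are hypothetical.

```python
from itertools import combinations, product


def pairwise_diff(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def mean_within(pop):
    """Mean pairwise differences among sequences of one population."""
    pairs = list(combinations(pop, 2))
    if not pairs:
        raise ValueError("each population needs at least two sequences")
    return sum(pairwise_diff(a, b) for a, b in pairs) / len(pairs)


def mean_between(pop1, pop2):
    """Mean pairwise differences between sequences of two populations."""
    pairs = list(product(pop1, pop2))
    return sum(pairwise_diff(a, b) for a, b in pairs) / len(pairs)


def hudson_fst(pop1, pop2):
    """Hudson et al. (1992) FST estimate for two populations."""
    h_w = (mean_within(pop1) + mean_within(pop2)) / 2
    h_b = mean_between(pop1, pop2)
    if h_b == 0:
        return 0.0  # no differences at all, so no differentiation
    return 1 - h_w / h_b


if __name__ == "__main__":
    colombia = ["AAAA", "AAAT"]  # toy sequences, not real data
    brazil = ["TTTT", "TTTA"]
    print(hudson_fst(colombia, brazil))
```

The estimate can be slightly negative when populations are undifferentiated; such values are usually read as zero.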
The extracted sequences can be used for:
- MAFFT alignment (as mentioned in your methods)
- STRUCTURE/ADMIXTURE clustering analysis
- IQ-Tree/PhyML phylogenetic reconstruction
- DnaSP neutrality tests (Tajima's D, Fu & Li's D)
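
For orientation, Tajima's D compares the mean number of pairwise differences (k) with the number of segregating sites (S) scaled by sample size. The sketch below follows the formulas in Tajima (1989); it is an illustration for checking small data sets by hand, not a replacement for DnaSP, and it assumes a gap-free alignment of at least four sequences. The function names are made up for this example.

```python
from itertools import combinations
from math import sqrt


def tajimas_d_from_summary(n, s, k):
    """Tajima's D from sample size n, segregating sites s, mean differences k."""
    if n < 4 or s == 0:
        return None  # undefined without variation or with too few sequences
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    return (k - s / a1) / sqrt(variance)


def tajimas_d(seqs):
    """Tajima's D for a list of aligned, gap-free sequences."""
    n = len(seqs)
    s = sum(len(set(column)) > 1 for column in zip(*seqs))
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return None
    k = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)
    return tajimas_d_from_summary(n, s, k)
```

Negative values indicate an excess of rare variants, which is the pattern expected after a population expansion; positive values indicate an excess of intermediate-frequency variants.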
After running these scripts, you'll need specialized software for advanced analyses:

    # Using MAFFT (online or local installation)
    mafft --auto sequences/28S_sequences.fasta > 28S_aligned.fasta

STRUCTURE clustering analysis:
- Input: Concatenated nuclear loci alignment
- Parameters: K=1-20, 100,000 MCMC iterations
- Use STRUCTURE HARVESTER to determine optimal K
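
STRUCTURE HARVESTER implements the Evanno ΔK method for choosing K. As a rough illustration of what that calculation involves, the sketch below computes ΔK from replicate log-likelihood values using one common formulation: the absolute second-order difference of the mean LnP(D) divided by the standard deviation of LnP(D) at that K. The function name and the toy numbers are assumptions for this example, and the output of STRUCTURE HARVESTER itself should be treated as authoritative.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Map each interior K to its Evanno delta-K value.

    lnp_by_k maps K to a list of LnP(D) values from replicate runs
    (at least two replicates per K are needed for a standard deviation).
    """
    means = {k: mean(values) for k, values in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 in means and k + 1 in means:
            sd = stdev(lnp_by_k[k])
            if sd > 0:
                second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
                result[k] = second_diff / sd
    return result


if __name__ == "__main__":
    # Toy values, not results from any real run
    runs = {
        1: [-1000.0, -1002.0],
        2: [-800.0, -802.0],
        3: [-790.0, -794.0],
        4: [-789.0, -795.0],
    }
    scores = delta_k(runs)
    print(max(scores, key=scores.get))
```

ΔK cannot be evaluated for the smallest and largest K tested, so K=1 is never selected by this method and has to be judged from the raw likelihoods.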

Phylogenetic reconstruction (IQ-Tree):

    iqtree -s concatenated_alignment.fasta -bb 1000 -alrt 1000

Population genetic analyses:
- Calculate FST, neutrality tests
- Estimate divergence times
- Analyze gene flow patterns
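
One way to build a concatenated alignment such as the concatenated_alignment.fasta used in the IQ-Tree command is sketched below. It assumes that each per-locus alignment has already been relabeled so that FASTA headers are sample IDs shared across loci (GenBank accessions differ per locus, so this relabeling is needed first), and it pads samples missing from a locus with gaps. All function names are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into {sample_id: sequence}, using the first header token."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {sample: "".join(parts) for sample, parts in records.items()}


def concatenate(alignments):
    """Join per-locus alignments; samples absent from a locus get gaps."""
    samples = sorted({sample for aln in alignments for sample in aln})
    combined = {sample: [] for sample in samples}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must share a length")
        width = lengths.pop()
        for sample in samples:
            combined[sample].append(aln.get(sample, "-" * width))
    return {sample: "".join(parts) for sample, parts in combined.items()}


def to_fasta(records):
    """Serialize {sample_id: sequence} back to FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())
```

A typical use would read each aligned file with `parse_fasta(open(path).read())`, pass the resulting list to `concatenate`, and write `to_fasta(...)` to disk.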
Your analysis should reveal:
- Population Structure:
  - Primary clustering by country (Colombia, Venezuela, Brazil)
  - Secondary structure by geographic regions within countries
  - Possible isolation-by-distance patterns
- Genetic Differentiation:
  - Highest FST between Brazil and other countries
  - Moderate differentiation between Colombian regions
  - Gene flow corridors along major river systems
- Diversity Patterns:
  - Higher diversity in Colombian populations
  - Reduced diversity in peripheral populations
  - Different patterns between nuclear and mitochondrial loci
- Evolutionary Implications:
  - Recent population expansion
  - Historical bottlenecks in some regions
  - Adaptation to different ecotopes (sylvatic/domestic)
- NCBI connection errors: Add delays between requests, check internet connection
- Missing sequences: Some accession numbers may be invalid or restricted
- Memory issues: Process genes separately if dataset is too large
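
For the connection errors mentioned above, a retry wrapper with increasing delays is one option. The sketch below is a generic helper, not part of the pipeline scripts; the name `with_retries` and its defaults are assumptions. It retries on `OSError`, which covers `urllib.error.URLError` and `HTTPError` raised by network calls.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Wrap `func` so that network-style failures are retried with backoff."""
    def wrapper(*args, **kwargs):
        for attempt in range(1, attempts + 1):
            try:
                return func(*args, **kwargs)
            except OSError:
                if attempt == attempts:
                    raise  # give up after the last attempt
                # wait 1x, 2x, 4x ... the base delay before trying again
                sleep(base_delay * 2 ** (attempt - 1))
    return wrapper
```

Wrapping the function that performs a single NCBI request, for example `safe_fetch = with_retries(fetch_one_batch)` where `fetch_one_batch` is your own function, keeps the retry logic separate from the download code.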
Check sequence_extraction.log for detailed error messages and failed accessions.
Methods Summary for Paper: "DNA sequences were retrieved from NCBI GenBank using accession numbers for 8 nuclear and mitochondrial loci. Sequences were aligned using MAFFT v7 and analyzed using custom Python scripts based on Biopython. Population structure was assessed using [specify methods used], genetic differentiation was calculated via FST estimation, and phylogenetic relationships were reconstructed using maximum likelihood in IQ-Tree."
For questions about this pipeline or evolutionary genetics analysis, consult:
- Biopython documentation: https://biopython.org/
- NCBI E-utilities: https://www.ncbi.nlm.nih.gov/books/NBK25501/
- Population genetics textbooks for theoretical background
Analysis pipeline for Rhodnius prolixus population genetics - November 2025