JasonJiang42/HK_One_Health_analysis

One Health genomic analysis of CTX-M-producing E. coli

This repository outlines the analysis pipeline for the paper: S. Jiang et al., "Cross-sectoral sharing of CTX-M-producing Escherichia coli: A One Health analysis to understand dissemination modes" (in review).

Genome assembly and QC assessment

High-quality reads were assembled using SPAdes (https://github.com/ablab/spades)

python spades.py --pe1-1 file1 --pe1-2 file2 -o assembly --careful -k 21,33,55,77,99,127

Species confirmation by GTDB-Tk (https://github.com/Ecogenomics/GTDBTk)

gtdbtk classify_wf --genome_dir genomes --out_dir gtdbtk/classify --cpus 10 --skip_ani_screen

QC assessment by CheckM (https://github.com/Ecogenomics/CheckM)

checkm lineage_wf -x fasta input_bins output_folder

Genome annotation and population genomics

Antibiotic resistance genes (ARGs) were identified using AMRFinderPlus (https://github.com/ncbi/amr)

amrfinder -n seq.fna --organism Escherichia

The lineages were assigned by PopPUNK (https://poppunk.readthedocs.io/en/latest/index.html)

poppunk --create-db --output EC --r-files list.txt --threads 8
poppunk --fit-model lineage --ref-db EC --ranks 1,2,3
poppunk_visualise --ref-db EC --cytoscape --network-file EC/EC_graph.gt
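
The `--r-files` input (`list.txt` above) is a plain-text list of assemblies. A minimal sketch of how such a list could be built is shown below, assuming the tab-delimited "sample name, then path" layout described in the PopPUNK documentation. The function name `rfiles_lines`, the `.fasta` suffix, and the demo paths are hypothetical and not part of this repository.

```python
from pathlib import Path


def rfiles_lines(assembly_paths, suffix=".fasta"):
    """Return 'name<TAB>path' lines, one per assembly, sorted by sample name."""
    entries = []
    for path in assembly_paths:
        file_name = Path(path).name
        if not file_name.endswith(suffix):
            continue  # skip files that do not look like assemblies
        sample = file_name[: -len(suffix)]  # sample name = file name minus suffix
        entries.append((sample, str(path)))
    return [f"{sample}\t{path}" for sample, path in sorted(entries)]


# Hypothetical paths; in practice they could come from Path("genomes").glob("*.fasta")
demo = ["genomes/isolate_B.fasta", "genomes/isolate_A.fasta", "genomes/notes.txt"]
for line in rfiles_lines(demo):
    print(line)
```

Writing the returned lines to `list.txt`, one per line, would give a file in the layout assumed here.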

Phylogenetic analysis

A core genome alignment was generated using Snippy (https://github.com/tseemann/snippy), and recombination sites were removed with Gubbins (https://github.com/nickjcroucher/gubbins). A maximum-likelihood phylogenetic tree was then constructed with IQ-TREE (http://www.iqtree.org/) from the recombination-free core genome SNP alignment.

snippy --outdir mut1 --ref ref.gbk --ctgs mut1.fasta
run_gubbins.py -p gubbins clean.full.aln
snp-sites -c gubbins.filtered_polymorphic_sites.fasta > clean.core.aln
iqtree -s clean.core.aln --boot-trees --wbtl -m GTR+I+G -B 1000 -nt 18

Source prediction using DAPC

Call the core SNPs

snippy-core --ref ref.gbk s1.fna s2.fna ...   

Discriminant Analysis of Principal Components (DAPC) analysis

if (!requireNamespace("vcfR", quietly = TRUE)) install.packages("vcfR")
if (!requireNamespace("adegenet", quietly = TRUE)) install.packages("adegenet")
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")

library(vcfR)
library(adegenet)
library(ggplot2)

train_vcf_file <- "train_population_data.vcf" # The training list for source prediction is provided in this repository
supplementary_vcf_file <- "HK_individuals.vcf"
train_vcf <- read.vcfR(train_vcf_file)
supplementary_vcf <- read.vcfR(supplementary_vcf_file)
train_genlight <- vcfR2genlight(train_vcf)
predict_genlight <- vcfR2genlight(supplementary_vcf)

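# NOTE: `grp` is not defined in this snippet; it is assumed to be a data frame whose `grp` column gives the source label of each training isolate, in the same order as the samples in train_genlight
# Fit the DAPC on the training set; if n.pca and n.da are not supplied, dapc() asks for them interactively
dapc_fit <- dapc(train_genlight, grp$grp)
# Project the Hong Kong isolates onto the fitted discriminant functions (both VCFs must contain the same SNP loci)
pred.sup <- predict(dapc_fit, newdata = predict_genlight)
predict_coords <- pred.sup$ind.scores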

Mobile genetic elements identification

Plasmid contigs were identified using PLASMe (https://github.com/HubertTang/PLASMe)

python PLASMe.py input.fasta plasme_predict.fna

All contigs were further mapped to the E. coli K-12 chromosome for validation

blastn_scripts.py -i contig.fna -db K-12 -o map.results.txt --minid 0.7 --mincov 0.7 -t 8

The predicted plasmids were further clustered by genetic distance

## Plasmid gene annotation
prokka /path/to/"$sample".fasta --quiet --outdir /path/to/prokka_output/"$sample" --force --prefix "$sample"

## The pangenome of plasmids was generated by Roary
roary *.gff -cd 95 -f plasmid_pangenome

## Pairwise Jaccard similarity coefficients between genomes were calculated using the script pw_similarity.py
python pw_similarity.py -i binary_presc_absc.tsv -o example1 -r "isolates" -s "jaccard" -f 0

## Communities were detected from the pairwise similarities using the Louvain algorithm (https://github.com/taynaud/python-louvain)
usage: louvain_community.py [-h] -i INPUT -o OUTPUT [--resolution RESOLUTION]

Calculate Louvain communities from Mash distance results.

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input Mash distance results file (tab-delimited).
  -o OUTPUT, --output OUTPUT
                        Output file for Louvain community results.
  --resolution RESOLUTION
                        Resolution parameter for Louvain algorithm (default:
                        1.0)
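
The pairwise Jaccard step above can be illustrated with a short sketch. This is not the repository's `pw_similarity.py`. It assumes a tab-delimited 0/1 matrix with genes as rows and isolates as columns, and the function names and demo data are hypothetical. The Jaccard coefficient of two isolates is the number of shared genes divided by the number of genes present in either.

```python
import csv
import io
from itertools import combinations


def read_presence_absence(handle):
    """Parse a 0/1 matrix: first column = gene name, remaining columns = isolates."""
    reader = csv.reader(handle, delimiter="\t")
    isolates = next(reader)[1:]
    genes_by_isolate = {name: set() for name in isolates}
    for row in reader:
        gene, flags = row[0], row[1:]
        for name, flag in zip(isolates, flags):
            if flag == "1":
                genes_by_isolate[name].add(gene)
    return genes_by_isolate


def jaccard(a, b):
    """|A & B| / |A | B|; two empty gene sets are scored 0.0 by convention here."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(genes_by_isolate):
    """Return {(isolate_x, isolate_y): similarity} for every unordered pair."""
    names = sorted(genes_by_isolate)
    return {
        (x, y): jaccard(genes_by_isolate[x], genes_by_isolate[y])
        for x, y in combinations(names, 2)
    }


# Hypothetical toy matrix with three plasmids and four genes
demo = (
    "Gene\tp1\tp2\tp3\n"
    "repA\t1\t1\t0\n"
    "blaCTX-M\t1\t1\t1\n"
    "traI\t1\t0\t0\n"
    "tetA\t0\t0\t1\n"
)
similarities = pairwise_jaccard(read_presence_absence(io.StringIO(demo)))
for pair, value in similarities.items():
    print(pair, round(value, 3))
```

Similarities of this kind can serve as edge weights in the graph passed to a community detection method such as Louvain.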

Mobile genetic elements were predicted by mapping contigs against a reference using conseq.py

usage: conseq.py [-h] -r REFERENCE -q QUERY [-p PREFIX] -c COVERAGE -t TAB_OUTPUT -o FASTA_OUTPUT

Run nucmer, calculate coverage, and filter query contigs by coverage.

options:
  -h, --help            show this help message and exit
  -r REFERENCE, --reference REFERENCE
                        Path to the reference input file.
  -q QUERY, --query QUERY
                        Path to the query input file.
  -p PREFIX, --prefix PREFIX
                        Prefix for nucmer output files (default: nucmer_output).
  -c COVERAGE, --coverage COVERAGE
                        Minimum coverage threshold for filtering contigs.
  -t TAB_OUTPUT, --tab_output TAB_OUTPUT
                        Path to the output tab file with contig lengths, coverage, and reference.
  -o FASTA_OUTPUT, --fasta_output FASTA_OUTPUT
                        Path to the output FASTA file for filtered contigs
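
The coverage filter described in the help text can be sketched as follows. This is not the code of `conseq.py`. It assumes that coverage means the fraction of a query contig spanned by at least one alignment, that alignment coordinates are 1-based and inclusive, and that reverse-strand hits may be reported with start greater than end. All names and demo values are hypothetical. Overlapping alignments are merged first so that no base is counted twice.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive (start, end) intervals."""
    merged = []
    # Normalise reverse-strand hits (start > end) before sorting
    for start, end in sorted((min(s, e), max(s, e)) for s, e in intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def contig_coverage(length, intervals):
    """Fraction of the contig covered by at least one alignment."""
    covered = sum(end - start + 1 for start, end in merge_intervals(intervals))
    return covered / length


def filter_contigs(lengths, alignments, min_coverage):
    """Keep contigs whose covered fraction is at least min_coverage."""
    kept = {}
    for contig, length in lengths.items():
        coverage = contig_coverage(length, alignments.get(contig, []))
        if coverage >= min_coverage:
            kept[contig] = coverage
    return kept


# Hypothetical contig lengths and alignment coordinates on each contig
lengths = {"contig_1": 1000, "contig_2": 500}
alignments = {
    "contig_1": [(1, 400), (301, 800), (900, 850)],
    "contig_2": [(1, 100)],
}
print(filter_contigs(lengths, alignments, min_coverage=0.8))
```

Whether the real script expects the threshold as a fraction or a percentage is not stated in the help text, so the 0 to 1 scale used here is an assumption.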
