Dp24 refactoring #1

Merged
merged 8 commits into from
Aug 21, 2024
34 changes: 29 additions & 5 deletions CHANGELOG.md
@@ -2,15 +2,39 @@

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
Naming based on: [Mythical creatures](https://en.wikipedia.org/wiki/List_of_legendary_creatures_by_type).

## v1.0dev - [date]
## v1.0.0 - Aquatic Bahamut [21/08/2024]

Initial release of sanger-tol/ear, created with the [nf-core](https://nf-co.re/) template.
The current pipeline represents the minimum viable product (MVP) for ear.

### `Added`
### Added
- GFASTATS to generate statistics on the input primary genome.
- MERQURY_FK to generate k-mer graphs and analyses of the primary, haplotype and merged assemblies.
- BLOBTOOLKIT to generate BUSCO files and blobtoolkit datasets/plots.
- CURATIONPRETEXT to generate pretext plots and PNGs.

### `Fixed`
### Parameters

### `Dependencies`
| Old parameter | New parameter |
| --------------- | ------------- |
| | --mapped |

### `Deprecated`
### Software dependencies

| Dependency | Old version | New version |
| ----------- | ------------- | ------------- |
| sanger-tol/blobtoolkit* | | draft_assemblies |
| sanger-tol/curationpretext* | | 1.0.0 (UNSC Cradle) |
| GFASTATS | | 1.3.6--hdcf5f25_3 |
| MERQURY_FK | | 1.2 |
| MINIMAP2_ALIGN | | 2.28 |
| SAMTOOLS_MERGE | | 1.20--h50ea8bc_0 |
| SAMTOOLS_SORT | | 1.20--h50ea8bc_0 |

- Note: for the nested pipelines (marked with `*`), please check their own CHANGELOG files for a full list of software dependencies.

### Dependencies
The pipeline depends on a number of databases which are noted in [README](README.md) and [USAGE](docs/usage.md).
91 changes: 55 additions & 36 deletions README.md
@@ -10,51 +10,74 @@

## Introduction

**sanger-tol/ear** is a bioinformatics pipeline that ...
**sanger-tol/ear** is a bioinformatics pipeline that generates the data files required to produce ERGA Assembly Reports (EARs). It nests two other sanger-tol pipelines, blobtoolkit and curationpretext.

<!-- TODO nf-core:
Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
major pipeline sections and the types of output it produces. You're giving an overview to someone new
to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
-->

<!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples. -->
<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->

1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
1. Read the input YAML file (YAML_INPUT).
2. Run GFASTATS (GFASTATS).
3. Run MERQURYFK_MERQURYFK (MERQURYFK).
4. Run MAIN_MAPPING, which performs long-read single-end/paired-end mapping.
5. Run GENERATE_SAMPLESHEET, which generates a CSV file required by SANGER_TOL_BTK.
6. Run SANGER_TOL_BTK, also known as SANGER-TOL/BLOBTOOLKIT, a subpipeline of SANGER-TOL/EAR.
7. Run SANGER_TOL_CPRETEXT, also known as SANGER-TOL/CURATIONPRETEXT, a subpipeline of SANGER-TOL/EAR.

## Usage

> [!NOTE]
> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.

<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
Explain what rows and columns represent. For instance (please edit as appropriate):

First, prepare a samplesheet with your input data that looks as follows:

`samplesheet.csv`:

```csv
sample,fastq_1,fastq_2
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
The sanger-tol/ear pipeline requires a number of databases to be in place in order to run the blobtoolkit pipeline.
These include:
- A BLAST nt database
- A DIAMOND BLAST UniProt database
- A DIAMOND BLAST nr database
- An NCBI taxdump
- An NCBI rankedlineage.dmp file

Next, complete a YAML file containing the following:

```yaml
# General values for all subpipelines and modules
assembly_id: <NAME OF ASSEMBLY>
reference_hap1: <LOCATION OF PRIMARY ASSEMBLY FILE .FA>
reference_hap2: <LOCATION OF HAPLOTYPE ASSEMBLY FILE .FA>
reference_haplotigs: <LOCATION OF THE HAPLOTIGS FILE, REMOVED DURING CURATION .FA>

# If a mapped BAM already exists, set it below and pass --mapped TRUE on the nextflow command; otherwise ignore it and the pipeline will create it.
mapped_bam: <MAPPED BAM .BAM>

merquryfk:
  fastk_hist: <THE PATH TO THE .HIST FILE>
  fastk_ktab: <PATH TO THE DIRECTORY CONTAINING THE KTAB FILES, ENSURE THE HIDDEN FILES ARE HERE TOO>

# Used by both subpipelines
longread:
  type: <hifi|clr|ont|illumina>
  dir: <DIRECTORY OF LONGREAD FILES .FASTA.GZ>
curationpretext:
  aligner: <minimap2|BWAMEM>
  telomere_motif: <TELOMERE MOTIF OF SAMPLE>
  hic_dir: <DIRECTORY OF HIC READ FILES .CRAM AND .CRAI>
btk:
  taxid: 1464561
  lineages: <CSV LIST OF DATABASES TO USE: "insecta_odb10,diptera_odb10">
  gca_accession: GCA_0001 <DEFAULT, DO NOT CHANGE UNLESS YOU HAVE A GCA_ACCESSION FOR YOUR SPECIES>
  nt_database: <DIRECTORY CONTAINING BLAST DB>
  nt_database_prefix: <BLASTDB PREFIX>
  diamond_uniprot_database_path: <PATH TO reference_proteomes.dmnd FROM UNIPROT>
  diamond_nr_database_path: <PATH TO nr.dmnd>
  ncbi_taxonomy_path: <DIRECTORY CONTAINING THE TAXDUMP>
  ncbi_rankedlineage_path: <FOLDER CONTAINING THE rankedlineage.dmp FILE>
  config: <PATH TO ear/conf/sanger-tol-btk.config TO OVERWRITE PROCESS LIMITS>
```
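Before launching a run, it can help to confirm that the completed YAML defines every section shown in the template above. The following is a minimal sketch and not part of the pipeline: `missing_ear_keys` is a hypothetical helper, the key list is copied from the template, and the check only greps for key names rather than parsing or validating YAML.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of sanger-tol/ear).
# Prints every expected key that is absent from the given YAML file.
# It matches key names with grep; it does not parse or validate YAML.
missing_ear_keys() {
    local yaml="$1" key
    for key in assembly_id reference_hap1 reference_hap2 reference_haplotigs \
               merquryfk longread curationpretext btk; do
        # A key counts as present if a line starts (after optional
        # indentation) with "<key>:".
        grep -Eq "^[[:space:]]*${key}:" "$yaml" || printf '%s\n' "$key"
    done
}
```

Calling `missing_ear_keys assets/idCulLati1.yaml` prints nothing when all of the listed keys are present. `mapped_bam` is left out of the list because the template describes it as optional.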

Each row represents a fastq file (single-end) or a pair of fastq files (paired end).

-->

Now, you can run the pipeline using:

<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->

```bash
nextflow run sanger-tol/ear \
-profile <docker/singularity/.../institute> \
--input samplesheet.csv \
--outdir <OUTDIR>
# --mapped TRUE is optional; pass it only when mapped_bam is set in the input YAML.
nextflow run sanger-tol/ear -profile <singularity,docker> \
    --input assets/idCulLati1.yaml \
    --mapped TRUE \
    --outdir test-truth
```
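The `--mapped TRUE` flag only applies when `mapped_bam` is filled in, so the choice can be derived from the input YAML. A minimal sketch of that decision follows; `ear_command` is a hypothetical helper, the `singularity` profile is a placeholder, and the function only prints the command rather than running it.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper (not part of sanger-tol/ear): print the nextflow
# command for a given input YAML, adding --mapped TRUE only when the YAML
# has a non-empty, unindented mapped_bam entry.
ear_command() {
    local yaml="$1" outdir="$2" mapped=""
    # Match "mapped_bam:" followed by a value that is neither blank
    # nor the start of a comment.
    if grep -Eq '^mapped_bam:[[:space:]]*[^[:space:]#]' "$yaml"; then
        mapped=" --mapped TRUE"
    fi
    printf 'nextflow run sanger-tol/ear -profile singularity --input %s%s --outdir %s\n' \
        "$yaml" "$mapped" "$outdir"
}
```

For example, `ear_command assets/idCulLati1.yaml test-truth` prints a command that includes `--mapped TRUE`, because that file sets `mapped_bam`.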

> [!WARNING]
@@ -65,10 +88,6 @@ nextflow run sanger-tol/ear \

sanger-tol/ear was originally written by DLBPointon.

We thank the following people for their extensive assistance in the development of this pipeline:

<!-- TODO nf-core: If applicable, make list of people who have also contributed -->

## Contributions and Support

If you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md).
33 changes: 33 additions & 0 deletions assets/idCulLati1.yaml
@@ -0,0 +1,33 @@
# General values for all subpipelines and modules
assembly_id: idCulLati1_ear
reference_hap1: /nfs/treeoflife-01/teams/tola/users/dp24/ear/idCulLati1/primary.fa
reference_hap2: /nfs/treeoflife-01/teams/tola/users/dp24/ear/idCulLati1/hap2.fa
reference_haplotigs: /

# If a mapped bam already exists use the below + --mapped TRUE on the nextflow command else ignore.
mapped_bam: /nfs/treeoflife-01/teams/tola/users/dp24/ear/idCulLati1/mapped_bam.bam

merquryfk:
  fastk_hist: /lustre/scratch122/tol/data/a/5/e/1/6/d/Culex_laticinctus/genomic_data/idCulLati1/pacbio/kmer/k31/idCulLati1.k31.hist
  fastk_ktab: /lustre/scratch122/tol/data/a/5/e/1/6/d/Culex_laticinctus/genomic_data/idCulLati1/pacbio/kmer/k31/

# Used by both subpipelines
longread:
  type: hifi
  dir: /lustre/scratch122/tol/data/a/5/e/1/6/d/Culex_laticinctus/genomic_data/idCulLati1/pacbio/fasta/
curationpretext:
  aligner: minimap2
  telomere_motif: TTAGG
  hic_dir: /lustre/scratch122/tol/data/a/5/e/1/6/d/Culex_laticinctus/genomic_data/idCulLati2/hic-arima2/
btk:
  taxid: 1464561
  lineages: "insecta_odb10"
  gca_accession: GCA_0001
  nt_database: /data/blastdb/Supported/NT/current
  nt_database_prefix: nt
  diamond_uniprot_database_path: /lustre/scratch123/tol/resources/uniprot_reference_proteomes/latest/reference_proteomes.dmnd
  diamond_nr_database_path: /lustre/scratch123/tol/resources/nr/latest/nr.dmnd
  ncbi_taxonomy_path: /lustre/scratch123/tol/resources/taxonomy/latest/new_taxdump/
  ncbi_rankedlineage_path: /lustre/scratch123/tol/resources/taxonomy/latest/new_taxdump/rankedlineage.dmp
  btk_yaml: /nfs/users/nfs_d/dp24/sanger-tol-ear/assets/btk_draft.yaml
  config: /nfs/treeoflife-01/teams/tola/users/dp24/ear/conf/sanger-tol-btk.config
45 changes: 45 additions & 0 deletions assets/real_pdf.yaml
@@ -0,0 +1,45 @@
# SAMPLE INFORMATION
ToLID: idCulLati1
Species: Culex laticinctus
Sex: XX
Submitter: Michael Paulini
Affiliation: WSI
Tags: ERGA-BGE

# SEQUENCING DATA
DATA:
- PacBio HiFi: 51x
- Arima v2: 152x

# GENOME PROFILING DATA
PROFILING:
  GenomeScope:
    version: 2.0
    results_folder: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/genomic_data/idCulLati1/pacbio/kmer/k31/

# ASSEMBLY DATA
ASSEMBLIES:
  Pre-curation:
    pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|]
    pri:
      gfastats--nstar-report_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.fa.gz.gfastats
      busco_short_summary_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.insecta_odb10.busco/short_summary.specific.insecta_odb10.out_scaffolds_final.insecta_odb10.busco.txt
      merqury_folder: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.ccs.merquryk/

  Curated:
    pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|, TreeVal_v1.1]
    pri:
      gfastats--nstar-report_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1.primary.curated.fa.gfastats
      busco_short_summary_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1.primary.curated.insecta_odb10.busco/short_summary.specific.insecta_odb10.idCulLati1.1.primary.curated.insecta_odb10.busco.txt
      merqury_folder: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1.primary.curated.ccs.merquryk/
      hic_FullMap_png: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1_normal_snapshots/idCulLati1.1_normal_FullMap.png
      hic_FullMap_link: https://tolqc.cog.sanger.ac.uk/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1_normal.pretext
      blobplot_cont_png: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1_primary_curated_btk_busco.blob.circle.png

# CURATION NOTES
NOTES:
  Obs_Haploid_num: 3
  Obs_Sex: XX
  Interventions_per_Gb: 430
  Contamination_notes: "Total length of scaffolds removed: 989,717 (0.1 %)\nScaffolds removed: 1 (0.2 %)\nLargest scaffold removed: (989,717)\nFCS-GX contaminant species (number of scaffolds; total length of scaffolds):\nWolbachia endosymbiont (group B) of Melanostoma mellinum, a-proteobacteria (1; 989,717)"
  Other_notes: "Chromosomes named by size"
45 changes: 45 additions & 0 deletions assets/template_pdf.yaml
@@ -0,0 +1,45 @@
# SAMPLE INFORMATION
ToLID: <SAMPLE_ID>
Species: <LATIN_NAME>
Sex: <EXPECTED_SEX>
Submitter: <CURATOR>
Affiliation: WSI
Tags: ERGA-BGE

# SEQUENCING DATA
DATA:
- PacBio HiFi: <PACBIO_COVERAGE>
- Arima v2: <ARIMA_COVERAGE>

# GENOME PROFILING DATA
PROFILING:
  GenomeScope:
    version: 2.0
    results_folder: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/genomic_data/idCulLati1/pacbio/kmer/k31/

# ASSEMBLY DATA
ASSEMBLIES:
  Pre-curation:
    pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|]
    pri:
      gfastats--nstar-report_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.fa.gz.gfastats
      busco_short_summary_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.insecta_odb10.busco/short_summary.specific.insecta_odb10.out_scaffolds_final.insecta_odb10.busco.txt
      merqury_folder: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.ccs.merquryk/

  Curated:
    pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|, TreeVal_v1.1]
    pri:
      gfastats--nstar-report_txt: idCulLati1.1.primary.curated.fa.gfastats
      busco_short_summary_txt: short_summary.specific.insecta_odb10.idCulLati1.1.primary.curated.insecta_odb10.busco.txt
      merqury_folder: <POST_CURATION_MERQURY_FOLDER>
      hic_FullMap_png: <CURATION_PRETEXT_PRETEXT_MAP_PNG>
      hic_FullMap_link: https://tolqc.cog.sanger.ac.uk/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1_normal.pretext
      blobplot_cont_png: idCulLati1.1_primary_curated_btk_busco.blob.circle.png

# CURATION NOTES
NOTES:
  Obs_Haploid_num: <OBSERVED_HAPLOID_CHROMOSOME_COUNT>
  Obs_Sex: <OBSERVED_SEX>
  Interventions_per_Gb: <MANUAL_INTERVENTIONS_PER_GB>
  Contamination_notes: "Total length of scaffolds removed: 989,717 (0.1 %)\nScaffolds removed: 1 (0.2 %)\nLargest scaffold removed: (989,717)\nFCS-GX contaminant species (number of scaffolds; total length of scaffolds):\nWolbachia endosymbiont (group B) of Melanostoma mellinum, a-proteobacteria (1; 989,717)"
  Other_notes: "Chromosomes named by size"
1 change: 1 addition & 0 deletions assets/test.yaml
@@ -4,6 +4,7 @@ reference_hap2: /nfs/treeoflife-01/teams/tola/users/dp24/ascc/asccTinyTest_V2/as
longread:
type: hifi
dir: /lustre/scratch123/tol/resources/treeval/treeval-testdata/TreeValSmallData/Oscheius_DF5033/genomic_data/nxOscSpes1/pacbio/fasta/
mapped_bam: idCulLati1/mapped_bam.bam
curationpretext:
aligner: minimap2
telomere_motif: TTAGG
4 changes: 4 additions & 0 deletions conf/base.config
@@ -19,6 +19,10 @@ process {
maxRetries = 1
maxErrors = '-1'

withName: "SANGER_TOL_CPRETEXT|SANGER_TOL_BTK" {
time = { check_max( 70.h * task.attempt, 'time' ) }
}

// Process-specific resource requirements
// NOTE - Please try and re-use the labels below as much as possible.
// These labels are used and recognised by default in DSL2 files hosted on nf-core/modules.
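The `time` closure added in this hunk grows with each retry. As a rough illustration, assuming `check_max` behaves as in the standard nf-core template (it caps a request at the configured maximum), the walltime requested per attempt can be sketched as follows. `requested_hours` and its `max_hours` argument are hypothetical names used only for this sketch.

```bash
#!/usr/bin/env bash
# Sketch of the arithmetic behind `check_max( 70.h * task.attempt, 'time' )`.
# Assumption: check_max returns the smaller of the request and the maximum;
# max_hours stands in for that maximum and is a hypothetical argument.
requested_hours() {
    local attempt="$1" max_hours="$2"
    local request=$(( 70 * attempt ))
    if [ "$request" -gt "$max_hours" ]; then
        request="$max_hours"
    fi
    printf '%s\n' "$request"
}
```

With `maxRetries = 1` a task runs at most twice, so `requested_hours 1 240` and `requested_hours 2 240` cover both cases; a lower cap such as `requested_hours 2 100` shows the limit taking effect.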
30 changes: 21 additions & 9 deletions conf/modules.config
@@ -12,30 +12,42 @@

process {

publishDir = [
path: { "${params.outdir}/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
withName: "GFASTATS|MERQURYFK_MERQURYFK|SANGER_TOL_BTK|SANGER_TOL_CPRETEXT" {
publishDir = [
path: { "${params.outdir}/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: GFASTATS {
ext.args = '--nstar-report'
ext.args = '--nstar-report'
}

withName: MERQURYFK_MERQURYFK {
ext.args = "-P."
ext.args = "-P."
}

withName: SAMTOOLS_SORT {
ext.prefix = { "${meta.id}_sorted"}
ext.prefix = { "${meta.id}_sorted"}
}

withName: SANGER_TOL_BTK {
ext.args = "--blastx_outext 'txt'"
ext.pipeline_name = "sanger-tol/blobtoolkit"
ext.args = ""
ext.executor = "bsub -Is -tty -e test.e -o test.log -n 2 -q oversubscribed -M1400 -R'select[mem>1400] rusage[mem=1400] span[hosts=1]'"
ext.profiles = "singularity,sanger"
ext.get_versions = "lsid | head -n1 | cut -d ',' -f 1"
ext.version = "draft_assemblies"
}

withName: SANGER_TOL_CPRETEXT {
ext.pipeline_name = "sanger-tol/curationpretext"
ext.args = ""
ext.executor = "bsub -Is -tty -e test.e -o test.log -n 2 -q oversubscribed -M1400 -R'select[mem>1400] rusage[mem=1400] span[hosts=1]'"
ext.profiles = "singularity,sanger"
ext.get_versions = "lsid | head -n1 | cut -d ',' -f 1"
ext.version = "1.0.0"
}

}
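The `publishDir` path in this file is derived from the process name: the last colon-separated component is taken, then its first underscore-separated token, lowercased. The shell sketch below mirrors that Groovy expression to show which directory names result. `publish_subdir` is a hypothetical helper, and the fully qualified process name in the usage note is made up for illustration.

```bash
#!/usr/bin/env bash
# Shell equivalent (for illustration only) of the Groovy expression
#   task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()
# Assumes the name has no trailing ':' and no leading '_'.
publish_subdir() {
    local process="$1"
    local last="${process##*:}"    # text after the final ':'
    local first="${last%%_*}"      # text before the first '_'
    printf '%s\n' "$first" | tr '[:upper:]' '[:lower:]'
}
```

For example, `publish_subdir "SANGER_TOL_EAR:EAR:MERQURYFK_MERQURYFK"` prints `merquryfk`. Note that `SANGER_TOL_BTK` and `SANGER_TOL_CPRETEXT` both reduce to `sanger`, so under this expression the two nested pipelines publish into the same subdirectory.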
7 changes: 7 additions & 0 deletions conf/sanger-tol-btk.config
@@ -0,0 +1,7 @@
process {
withLabel:RUN_BLASTN:BLASTN_TAXON {
cpus = { check_max( 12 * task.attempt, 'cpus' ) }
memory = { check_max( 10.GB * task.attempt, 'memory' ) }
time = { check_max( 16.h * task.attempt, 'time' ) }
}
}