Merge pull request #2 from sanger-tol/dp24_testdata
testing
DLBPointon authored Sep 18, 2024
2 parents 844c575 + a8c8189 commit 37324f8
Showing 34 changed files with 781 additions and 192 deletions.
47 changes: 43 additions & 4 deletions .github/workflows/ci.yml
@@ -10,6 +10,8 @@ on:

env:
NXF_ANSI_LOG: false
NXF_SINGULARITY_CACHEDIR: ${{ github.workspace }}/.singularity
NXF_SINGULARITY_LIBRARYDIR: ${{ github.workspace }}/.singularity

concurrency:
group: "${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}"
@@ -24,9 +26,32 @@ jobs:
strategy:
matrix:
NXF_VER:
- "23.04.0"
- "24.04.0"
- "latest-everything"
steps:
- name: Get branch names
# Pulls the names of current branches in repo
# steps.branch-names.outputs.current_branch is used later; it returns the name of the branch the PR is made FROM, not the target branch
id: branch-names
uses: tj-actions/branch-names@v8

- name: Setup apptainer
uses: eWaterCycle/setup-apptainer@main

- name: Set up Singularity
run: |
mkdir -p $NXF_SINGULARITY_CACHEDIR
mkdir -p $NXF_SINGULARITY_LIBRARYDIR
- name: Install Python
uses: actions/setup-python@v5
with:
python-version: "3.10"

- name: Install nf-core
run: |
pip install nf-core
- name: Check out pipeline code
uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4

@@ -35,12 +60,26 @@ jobs:
with:
version: "${{ matrix.NXF_VER }}"

- name: Disk space cleanup
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
# This will only download the main pipeline containers; sub-pipelines need their own nf-core download
- name: NF-Core Download - download singularity containers
run: |
nf-core download sanger-tol/ear --revision ${{ steps.branch-names.outputs.current_branch }} --compress none -d --force --outdir sanger-ear --container-cache-utilisation amend --container-system singularity
- name: Download Tiny test data
# Download a fungal test data set that is full enough to show some real output.
# Needs a kmer db for merqury
run: |
curl https://tolit.cog.sanger.ac.uk/test-data/resources/treeval/TreeValTinyData.tar.gz | tar xzf -
cp TreeValTinyData/assembly/draft/grTriPseu1.fa TreeValTinyData/assembly/draft/grTriPseu1-hap.fa
cp TreeValTinyData/assembly/draft/grTriPseu1.fa TreeValTinyData/assembly/draft/grTriPseu1-all_hap.fa
# - name: Disk space cleanup
# uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1

- name: Run pipeline with test data
# TODO nf-core: You can customise CI pipeline run tests as required
# For example: adding multiple test runs with different parameters
# Remember that you can parallelise this by using strategy.matrix
# Skip BTK and CPRETEXT as they are already tested on their repos.
run: |
nextflow run ${GITHUB_WORKSPACE} -profile test,docker --outdir ./results
nextflow run ${GITHUB_WORKSPACE} -profile test,docker --outdir ./results --steps btk,cpretext,merquryfk
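The "Download Tiny test data" step above fetches TreeValTinyData and duplicates the draft assembly so the same FASTA can stand in for the haplotype and combined-haplotig inputs. A minimal sketch of that duplication step, with a fabricated one-record FASTA in place of the real tarball download (the `curl … | tar xzf -` fetch is elided here):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for the real TreeValTinyData download; a tiny fabricated
# draft assembly is created instead of extracting the tarball.
mkdir -p TreeValTinyData/assembly/draft
printf '>scaffold_1\nACGTACGTACGT\n' > TreeValTinyData/assembly/draft/grTriPseu1.fa

# The CI step copies the primary assembly to act as the haplotype inputs.
cp TreeValTinyData/assembly/draft/grTriPseu1.fa TreeValTinyData/assembly/draft/grTriPseu1-hap.fa
cp TreeValTinyData/assembly/draft/grTriPseu1.fa TreeValTinyData/assembly/draft/grTriPseu1-all_hap.fa

ls TreeValTinyData/assembly/draft
```

Since the test config points `reference_hap2` and `reference_haplotigs` at these copies, the pipeline exercises its haplotype paths without needing a second real assembly.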
1 change: 1 addition & 0 deletions .nf-core.yml
@@ -4,6 +4,7 @@ lint:
- assets/nf-core-ear_logo_light.png
- docs/images/nf-core-ear_logo_light.png
- docs/images/nf-core-ear_logo_dark.png
- lib/nfcore_external_java_deps.jar
- .github/ISSUE_TEMPLATE/config.yml
- .github/workflows/awstest.yml
- .github/workflows/awsfulltest.yml
31 changes: 17 additions & 14 deletions CHANGELOG.md
@@ -10,31 +10,34 @@ Initial release of sanger-tol/ear, created with the [nf-core](https://nf-co.re/)
The current pipeline represents the MVP for ear.

### Added

GFASTATS to generate statistics on the input primary genome.
MERQURY_FK to generate kmer graphs and analyses of the primary, haplotype and merged assembly.
BLOBTOOLKIT to generate busco files and blobtoolkit dataset/plots.
CURATIONPRETEXT to generate pretext plots and pngs.

### Parameters

| Old parameter | New parameter |
| --------------- | ------------- |
| | --mapped |
| Old parameter | New parameter |
| ------------- | ------------- |
| | --mapped |

### Software dependencies

| Dependency | Old version | New version |
| ----------- | ------------- | ------------- |
| sanger-tol/blobtoolkit* | | draft_assemblies |
| sanger-tol/curationpretext* | | 1.0.0 (UNSC Cradle) |
| GFASTATS | | 1.3.6--hdcf5f25_3 |
| MERQUERY_FK | | 1.2 |
| MINIMAP2_ALIGN | | 2.28 |
| SAMTOOLS_MERGE | | 1.20--h50ea8bc_0 |
| SAMTOOLS_SORT | | 1.20--h50ea8bc_0 |
|
| Dependency | Old version | New version |
| ---------------------------- | ----------- | ------------------- |
| sanger-tol/blobtoolkit\* | | draft_assemblies |
| sanger-tol/curationpretext\* | | 1.0.0 (UNSC Cradle) |
| GFASTATS | | 1.3.6--hdcf5f25_3 |
| MERQUERY_FK | | 1.2 |
| MINIMAP2_ALIGN | | 2.28 |
| SAMTOOLS_MERGE | | 1.20--h50ea8bc_0 |
| SAMTOOLS_SORT | | 1.20--h50ea8bc_0 |

|

- Note: for pipelines, please check their own CHANGELOG file for a full list of software dependencies.

### Dependencies
The pipeline depends on a number of databases which are noted in [README](README.md) and [USAGE](docs/usage.md).

The pipeline depends on a number of databases which are noted in [README](README.md) and [USAGE](docs/usage.md).
17 changes: 8 additions & 9 deletions README.md
@@ -1,8 +1,7 @@
[![GitHub Actions CI Status](https://github.com/sanger-tol/ear/actions/workflows/ci.yml/badge.svg)](https://github.com/sanger-tol/ear/actions/workflows/ci.yml)
[![GitHub Actions Linting Status](https://github.com/sanger-tol/ear/actions/workflows/linting.yml/badge.svg)](https://github.com/sanger-tol/ear/actions/workflows/linting.yml)[![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.XXXXXXX-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.XXXXXXX)
[![nf-test](https://img.shields.io/badge/unit_tests-nf--test-337ab7.svg)](https://www.nf-test.com)

[![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A523.04.0-23aa62.svg)](https://www.nextflow.io/)
[![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A524.04.0-23aa62.svg)](https://www.nextflow.io/)
[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
[![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/)
[![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/)
@@ -15,7 +14,7 @@
1. Read the input yaml file (YAML_INPUT)
2. Run GFASTATS (GFASTATS)
3. Run MERQURYFK_MERQURYFK (MERQURYFK)
4. Run MAIN_MAPPING, longread single-end/paired-end mapping
4. Run MAIN_MAPPING, longread single-end/paired-end mapping
5. Run GENERATE_SAMPLESHEET, generate a csv file required for SANGER_TOL_BTK.
6. Run SANGER_TOL_BTK, also known as SANGER-TOL/BLOBTOOLKIT, a subpipeline of SANGER-TOL/EAR.
7. Run SANGER_TOL_CPRETEXT, also known as SANGER-TOL/CURATIONPRETEXT a subpipeline for SANGER-TOL/EAR.
@@ -27,11 +26,12 @@
The sanger-tol/ear pipeline requires a number of databases in place in order to run the blobtoolkit pipeline.
These include:
- A blast nt database
- A Diamond blast uniprot database
- A Diamond blast nr database
- An NCBI taxdump
- An NCBI rankedlineage.dmp

- A blast nt database
- A Diamond blast uniprot database
- A Diamond blast nr database
- An NCBI taxdump
- An NCBI rankedlineage.dmp

Next, a yaml file containing the following should then be completed:

@@ -70,7 +70,6 @@ btk:
config: <PATH TO ear/conf/sanger-tol-btk.config TO OVERWRITE PROCESS LIMITS>
```
Now, you can run the pipeline using:
2 changes: 1 addition & 1 deletion assets/idCulLati1.yaml
@@ -2,7 +2,7 @@
assembly_id: idCulLati1_ear
reference_hap1: /nfs/treeoflife-01/teams/tola/users/dp24/ear/idCulLati1/primary.fa
reference_hap2: /nfs/treeoflife-01/teams/tola/users/dp24/ear/idCulLati1/hap2.fa
reference_haplotigs: /
reference_haplotigs: /nfs/treeoflife-01/teams/tola/users/dp24/ear/haplotigs.fa

# If a mapped bam already exists use the below + --mapped TRUE on the nextflow command else ignore.
mapped_bam: /nfs/treeoflife-01/teams/tola/users/dp24/ear/idCulLati1/mapped_bam.bam
4 changes: 2 additions & 2 deletions assets/real_pdf.yaml
@@ -20,14 +20,14 @@ PROFILING:
# ASSEMBLY DATA
ASSEMBLIES:
Pre-curation:
pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|]
pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|]
pri:
gfastats--nstar-report_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.fa.gz.gfastats
busco_short_summary_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.insecta_odb10.busco/short_summary.specific.insecta_odb10.out_scaffolds_final.insecta_odb10.busco.txt
merqury_folder: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.ccs.merquryk/

Curated:
pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|, TreeVal_v1.1]
pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|, TreeVal_v1.1]
pri:
gfastats--nstar-report_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1.primary.curated.fa.gfastats
busco_short_summary_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1.primary.curated.insecta_odb10.busco/short_summary.specific.insecta_odb10.idCulLati1.1.primary.curated.insecta_odb10.busco.txt
4 changes: 2 additions & 2 deletions assets/template_pdf.yaml
@@ -20,14 +20,14 @@ PROFILING:
# ASSEMBLY DATA
ASSEMBLIES:
Pre-curation:
pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|]
pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|]
pri:
gfastats--nstar-report_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.fa.gz.gfastats
busco_short_summary_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.insecta_odb10.busco/short_summary.specific.insecta_odb10.out_scaffolds_final.insecta_odb10.busco.txt
merqury_folder: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.ccs.merquryk/

Curated:
pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|, TreeVal_v1.1]
pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|, TreeVal_v1.1]
pri:
gfastats--nstar-report_txt: idCulLati1.1.primary.curated.fa.gfastats
busco_short_summary_txt: short_summary.specific.insecta_odb10.idCulLati1.1.primary.curated.insecta_odb10.busco.txt
44 changes: 26 additions & 18 deletions assets/test.yaml
@@ -1,25 +1,33 @@
assembly_id: Oscheius_DF5033
reference_hap1: /nfs/treeoflife-01/teams/tola/users/dp24/ascc/asccTinyTest_V2/assembly/pyoelii_tiny_testfile_with_adapters.fa
reference_hap2: /nfs/treeoflife-01/teams/tola/users/dp24/ascc/asccTinyTest_V2/assembly/pyoelii_tiny_testfile_with_adapters.fa
# General values for all subpipelines and modules
assembly_id: grTriPseu1
reference_hap1: /home/runner/work/ear/ear/TreeValTinyData/assembly/draft/grTriPseu1.fa
reference_hap2: /home/runner/work/ear/ear/TreeValTinyData/assembly/draft/grTriPseu1-hap.fa
reference_haplotigs: /home/runner/work/ear/ear/TreeValTinyData/assembly/draft/grTriPseu1-all_hap.fa

# If a mapped bam already exists use the below + --mapped TRUE on the nextflow command else ignore.
mapped_bam: []

merquryfk:
fastk_hist: "./"
fastk_ktab: "./"

# Used by both subpipelines
longread:
type: hifi
dir: /lustre/scratch123/tol/resources/treeval/treeval-testdata/TreeValSmallData/Oscheius_DF5033/genomic_data/nxOscSpes1/pacbio/fasta/
mapped_bam: idCulLati1/mapped_bam.bam
dir: /home/runner/work/ear/ear/TreeValTinyData/genomic_data/pacbio/

curationpretext:
aligner: minimap2
telomere_motif: TTAGG
hic_dir: /lustre/scratch123/tol/resources/treeval/treeval-testdata/TreeValSmallData/Oscheius_DF5033/genomic_data/nxOscSpes1/hic-arima2/full/
merquryfk:
fastk_hist: "./"
fastk_ktab: "./"
telomere_motif: TTAGGG
hic_dir: /home/runner/work/ear/ear/TreeValTinyData/genomic_data/hic-arima/
btk:
nt_database: /lustre/scratch123/tol/teams/tola/users/ea10/pipeline_testing/20240704_blast_tiny_testdb/blastdb/
nt_database_prefix: tiny_plasmodium_blastdb.fa
diamond_uniprot_database_path: /lustre/scratch123/tol/teams/tola/users/ea10/pipeline_testing/20240704_diamond_tiny_testdb/ascc_tinytest_diamond_db.dmnd
diamond_nr_database_path: /lustre/scratch123/tol/resources/nr/latest/nr.dmnd
ncbi_taxonomy_path: /lustre/scratch123/tol/resources/taxonomy/latest/new_taxdump
ncbi_rankedlineage_path: /lustre/scratch123/tol/teams/tola/users/ea10/databases/taxdump/rankedlineage.dmp
btk_yaml: /nfs/users/nfs_d/dp24/sanger-tol-ear/assets/btk_draft.yaml
taxid: 352914
gca_accession: GCA_0001
lineages: "diptera_odb10,insecta_odb10"
lineages: "fungi_odb10"
nt_database: /home/runner/work/ascc/ascc/NT_database/
nt_database_prefix: 18S_fungal_sequences
diamond_uniprot_database_path: /home/runner/work/ascc/ascc/diamond/UP000000212_1234679_tax.dmnd
diamond_nr_database_path: /home/runner/work/ascc/ascc/diamond/UP000000212_1234679_tax.dmnd
ncbi_taxonomy_path: /home/runner/work/ascc/ascc/ncbi_taxdump/
ncbi_rankedlineage_path: /home/runner/work/ascc/ascc/ncbi_taxdump/rankedlineage.dmp
config: /home/runner/work/ear/ear/conf/sanger-tol-btk.config
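The paths in assets/test.yaml above are absolute paths on the GitHub Actions runner (`/home/runner/work/...`), so the file only resolves there. When running locally, a quick preflight that checks the `reference_*` entries exist can save a failed launch. The sketch below uses a trimmed demo YAML and a plain `awk` scrape rather than a real YAML parser; the file name and paths are illustrative, not taken from the repo:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Trimmed-down demo YAML standing in for assets/test.yaml (paths are illustrative).
cat > demo_test.yaml <<'EOF'
assembly_id: grTriPseu1
reference_hap1: /tmp/demo_hap1.fa
reference_hap2: /tmp/demo_hap2.fa
EOF

rm -f /tmp/demo_hap1.fa /tmp/demo_hap2.fa
touch /tmp/demo_hap1.fa   # only hap1 exists in this demo

# Scrape the reference_* paths and report any that do not resolve.
missing=0
while IFS= read -r path; do
    [ -e "$path" ] || { echo "MISSING: $path"; missing=$((missing + 1)); }
done < <(awk '/^reference_/ {print $2}' demo_test.yaml)

echo "missing=$missing" | tee preflight.log
```

A non-zero count here means the launch would fail at YAML_INPUT, so the check is worth running before committing cluster time.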
4 changes: 4 additions & 0 deletions conf/modules.config
@@ -20,6 +20,10 @@ process {
]
}

withName: CAT_CAT {
ext.prefix = 'combined_haplos.fa'
}

withName: GFASTATS {
ext.args = '--nstar-report'
}
2 changes: 1 addition & 1 deletion conf/sanger-tol-btk.config
@@ -4,4 +4,4 @@ process {
memory = { check_max( 10.GB * task.attempt, 'memory' ) }
time = { check_max( 16.h * task.attempt, 'time' ) }
}
}
}
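The `check_max( 10.GB * task.attempt, 'memory' )` pattern in the config above grows the resource request linearly on each retry and clamps it to the configured maximum. The clamping logic can be sketched in plain shell (`check_max` itself is nf-core template boilerplate; the 10 GB base is from the config, while the 16 GB cap is an illustrative `--max_memory` value, not taken from the pipeline):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Mimic check_max(10.GB * task.attempt, 'memory'): the request grows
# linearly per retry attempt but is clamped to a hard maximum.
base_gb=10
max_gb=16   # illustrative --max_memory cap

for attempt in 1 2 3; do
    want=$((base_gb * attempt))
    grant=$(( want < max_gb ? want : max_gb ))
    echo "attempt=$attempt requested=${want}GB granted=${grant}GB"
done | tee retry_scaling.log
```

So a first failure triggers a retry at 20 GB requested but only 16 GB granted, which is why the cap should be set at or above what the largest BTK process genuinely needs.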
14 changes: 5 additions & 9 deletions conf/test.config
@@ -15,14 +15,10 @@ params {
config_profile_description = 'Minimal test dataset to check pipeline function'

// Limit resources so that this can run on GitHub Actions
max_cpus = 2
max_memory = '6.GB'
max_time = '6.h'
max_cpus = 2
max_memory = '6.GB'
max_time = '6.h'
input = "${projectDir}/assets/test.yaml"
outdir = "results"

// Input data
// TODO nf-core: Specify the paths to your test data on nf-core/test-datasets
// TODO nf-core: Give any required params for the test so that command line flags are not needed
input = params.pipelines_testdata_base_path + 'viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv'


}
11 changes: 4 additions & 7 deletions docs/output.md
@@ -27,15 +27,15 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d

</details>

[GFASTATS](https://github.com/vgl-hub/gfastats) is a single fast and exhaustive tool for summary statistics and simultaneous *fa* (fasta, fastq, gfa [.gz]) genome assembly file manipulation.
[GFASTATS](https://github.com/vgl-hub/gfastats) is a single fast and exhaustive tool for summary statistics and simultaneous _fa_ (fasta, fastq, gfa [.gz]) genome assembly file manipulation.

### MERQURYFK

<details markdown="1">
<summary>Output files</summary>

- `merquryfk/`
- `*.completeness.stats`:
- `*.completeness.stats`:
- `*{"primary","haplotype",""}_only.bed`:
- `*{"primary","haplotype",""}.qv`:
- `*.spectra-asm.{fl,ln,st}.png`:
@@ -47,14 +47,13 @@

Merqury is a novel tool for reference-free assembly evaluation based on efficient k-mer set operations. By comparing k-mers in a de novo assembly to those found in unassembled high-accuracy reads, Merqury estimates base-level accuracy and completeness.


## SANGER_TOL_BTK

<details markdown="1">
<summary>Output files</summary>

- `sanger/*_blobtoolkit_out/`
- `blobtoolkit/plots/*png`: Blobtoolkit plots
- `blobtoolkit/plots/*png`: Blobtoolkit plots
- `blobtoolkit/{ASSEMBLY_NAME}/*.json.gz`: Blobtoolkit dataset for use in BTK_viewer.
- `busco/*_odb10/*.{tsv,tar.gz,json,txt}`: Busco output
- `muliqc/`: MultiQC plots/data and report.html.
@@ -64,14 +63,13 @@ Merqury is a novel tool for reference-free assembly evaluation based on efficien

[SANGER_TOL_BTK](https://pipelines.tol.sanger.ac.uk/blobtoolkit) is a bioinformatics pipeline that can be used to identify and analyse non-target DNA for eukaryotic genomes.


## SANGER_TOL_CPRETEXT

<details markdown="1">
<summary>Output files</summary>

- `sanger/*_curationpretext_out/`
- `accessory_files/*.{bigWig,bed,bedgraph}`: Track files describing Telomere, gap, coverage data across the genome.
- `accessory_files/*.{bigWig,bed,bedgraph}`: Track files describing Telomere, gap, coverage data across the genome.
- `pretext_maps_raw`: Pre-accessory file ingestion pretext files.
- `pretext_maps_processed`: Post-accessory file ingestion pretext files, e.g. the final output.
- [`pipeline_info`](#pipeline-information)
@@ -80,7 +78,6 @@ Merqury is a novel tool for reference-free assembly evaluation based on efficien

[SANGER_TOL_CPRETEXT](https://pipelines.tol.sanger.ac.uk/curationpretext) is a bioinformatics pipeline typically used in conjunction with [TreeVal](https://pipelines.tol.sanger.ac.uk/treeval) to generate pretext maps (and optionally telomeric, gap, coverage, and repeat density plots which can be ingested into pretext) for the manual curation of high quality genomes.


### Pipeline information

<details markdown="1">
1 change: 0 additions & 1 deletion docs/usage.md
@@ -166,7 +166,6 @@ As in the Snakemake version [a YAML configuration file](https://github.com/blobt

The data in the YAML is currently ignored in the Nextflow pipeline version. The YAML file is retained only to allow compatibility with the BlobDir dataset generated by the [Snakemake version](https://github.com/blobtoolkit/blobtoolkit/tree/main/src/blobtoolkit-pipeline/src). The taxonomic information in the YAML file can be obtained from [NCBI Taxonomy](https://www.ncbi.nlm.nih.gov/data-hub/taxonomy/).


## Running the pipeline

The typical command for running the pipeline is as follows: