Merge pull request #2 from sanger-tol/dp24_testdata
testing
DLBPointon authored Sep 18, 2024
2 parents 844c575 + a8c8189 commit 37324f8
Showing 34 changed files with 781 additions and 192 deletions.
47 changes: 43 additions & 4 deletions .github/workflows/ci.yml
@@ -10,6 +10,8 @@ on:

env:
NXF_ANSI_LOG: false
NXF_SINGULARITY_CACHEDIR: ${{ github.workspace }}/.singularity
NXF_SINGULARITY_LIBRARYDIR: ${{ github.workspace }}/.singularity

concurrency:
group: "${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}"
@@ -24,9 +26,32 @@ jobs:
strategy:
matrix:
NXF_VER:
- "23.04.0"
- "24.04.0"
- "latest-everything"
steps:
- name: Get branch names
# Pulls the names of current branches in repo
# steps.branch-names.outputs.current_branch is used later; it returns the name of the branch the PR is made FROM, not the target branch
id: branch-names
uses: tj-actions/branch-names@v8

- name: Setup apptainer
uses: eWaterCycle/setup-apptainer@main

- name: Set up Singularity
run: |
mkdir -p $NXF_SINGULARITY_CACHEDIR
mkdir -p $NXF_SINGULARITY_LIBRARYDIR
- name: Install Python
uses: actions/setup-python@v5
with:
python-version: "3.10"

- name: Install nf-core
run: |
pip install nf-core
- name: Check out pipeline code
uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4

@@ -35,12 +60,26 @@ jobs:
with:
version: "${{ matrix.NXF_VER }}"

- name: Disk space cleanup
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
# This will only download the main pipeline containers; sub-pipelines need their own nf-core download
- name: NF-Core Download - download singularity containers
run: |
nf-core download sanger-tol/ear --revision ${{ steps.branch-names.outputs.current_branch }} --compress none -d --force --outdir sanger-ear --container-cache-utilisation amend --container-system singularity
- name: Download Tiny test data
# Download a fungal test data set that is full enough to show some real output.
# Needs a kmer db for merqury
run: |
curl https://tolit.cog.sanger.ac.uk/test-data/resources/treeval/TreeValTinyData.tar.gz | tar xzf -
cp TreeValTinyData/assembly/draft/grTriPseu1.fa TreeValTinyData/assembly/draft/grTriPseu1-hap.fa
cp TreeValTinyData/assembly/draft/grTriPseu1.fa TreeValTinyData/assembly/draft/grTriPseu1-all_hap.fa
# - name: Disk space cleanup
# uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1

- name: Run pipeline with test data
# TODO nf-core: You can customise CI pipeline run tests as required
# For example: adding multiple test runs with different parameters
# Remember that you can parallelise this by using strategy.matrix
# Skip BTK and CPRETEXT as they are already tested on their repos.
run: |
nextflow run ${GITHUB_WORKSPACE} -profile test,docker --outdir ./results
nextflow run ${GITHUB_WORKSPACE} -profile test,docker --outdir ./results --steps btk,cpretext,merquryfk
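The "Download Tiny test data" step above fetches TreeValTinyData and duplicates the draft assembly so the same FASTA can stand in for the haplotype and combined-haplotig inputs. A minimal sketch of that duplication step, with a fabricated one-record FASTA in place of the real tarball download (the `curl … | tar xzf -` fetch is elided here):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for the real TreeValTinyData download; a tiny fabricated
# draft assembly is created instead of extracting the tarball.
mkdir -p TreeValTinyData/assembly/draft
printf '>scaffold_1\nACGTACGTACGT\n' > TreeValTinyData/assembly/draft/grTriPseu1.fa

# The CI step copies the primary assembly to act as the haplotype inputs.
cp TreeValTinyData/assembly/draft/grTriPseu1.fa TreeValTinyData/assembly/draft/grTriPseu1-hap.fa
cp TreeValTinyData/assembly/draft/grTriPseu1.fa TreeValTinyData/assembly/draft/grTriPseu1-all_hap.fa

ls TreeValTinyData/assembly/draft
```

Since the test config points `reference_hap2` and `reference_haplotigs` at these copies, the pipeline exercises its haplotype paths without needing a second real assembly.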
1 change: 1 addition & 0 deletions .nf-core.yml
@@ -4,6 +4,7 @@ lint:
- assets/nf-core-ear_logo_light.png
- docs/images/nf-core-ear_logo_light.png
- docs/images/nf-core-ear_logo_dark.png
- lib/nfcore_external_java_deps.jar
- .github/ISSUE_TEMPLATE/config.yml
- .github/workflows/awstest.yml
- .github/workflows/awsfulltest.yml
31 changes: 17 additions & 14 deletions CHANGELOG.md
@@ -10,31 +10,34 @@ Initial release of sanger-tol/ear, created with the [nf-core](https://nf-co.re/)
The current pipeline represents the MVP for ear.

### Added

GFASTATS to generate statistics on the input primary genome.
MERQURY_FK to generate kmer graphs and analyses of the primary, haplotype and merged assembly.
BLOBTOOLKIT to generate busco files and blobtoolkit dataset/plots.
CURATIONPRETEXT to generate pretext plots and pngs.

### Parameters

| Old parameter | New parameter |
| --------------- | ------------- |
| | --mapped |
| Old parameter | New parameter |
| ------------- | ------------- |
| | --mapped |

### Software dependencies

| Dependency | Old version | New version |
| ----------- | ------------- | ------------- |
| sanger-tol/blobtoolkit* | | draft_assemblies |
| sanger-tol/curationpretext* | | 1.0.0 (UNSC Cradle) |
| GFASTATS | | 1.3.6--hdcf5f25_3 |
| MERQUERY_FK | | 1.2 |
| MINIMAP2_ALIGN | | 2.28 |
| SAMTOOLS_MERGE | | 1.20--h50ea8bc_0 |
| SAMTOOLS_SORT | | 1.20--h50ea8bc_0 |
|
| Dependency | Old version | New version |
| ---------------------------- | ----------- | ------------------- |
| sanger-tol/blobtoolkit\* | | draft_assemblies |
| sanger-tol/curationpretext\* | | 1.0.0 (UNSC Cradle) |
| GFASTATS | | 1.3.6--hdcf5f25_3 |
| MERQUERY_FK | | 1.2 |
| MINIMAP2_ALIGN | | 2.28 |
| SAMTOOLS_MERGE | | 1.20--h50ea8bc_0 |
| SAMTOOLS_SORT | | 1.20--h50ea8bc_0 |

|

- Note: for pipelines, please check their own CHANGELOG file for a full list of software dependencies.

### Dependencies
The pipeline depends on a number of databases which are noted in [README](README.md) and [USAGE](docs/usage.md).

The pipeline depends on a number of databases which are noted in [README](README.md) and [USAGE](docs/usage.md).
17 changes: 8 additions & 9 deletions README.md
@@ -1,8 +1,7 @@
[![GitHub Actions CI Status](https://github.com/sanger-tol/ear/actions/workflows/ci.yml/badge.svg)](https://github.com/sanger-tol/ear/actions/workflows/ci.yml)
[![GitHub Actions Linting Status](https://github.com/sanger-tol/ear/actions/workflows/linting.yml/badge.svg)](https://github.com/sanger-tol/ear/actions/workflows/linting.yml)[![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.XXXXXXX-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.XXXXXXX)
[![nf-test](https://img.shields.io/badge/unit_tests-nf--test-337ab7.svg)](https://www.nf-test.com)

[![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A523.04.0-23aa62.svg)](https://www.nextflow.io/)
[![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A524.04.0-23aa62.svg)](https://www.nextflow.io/)
[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
[![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/)
[![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/)
@@ -15,7 +14,7 @@
1. Read the input yaml file (YAML_INPUT)
2. Run GFASTATS (GFASTATS)
3. Run MERQURYFK_MERQURYFK (MERQURYFK)
4. Run MAIN_MAPPING, longread single-end/paired-end mapping
4. Run MAIN_MAPPING, longread single-end/paired-end mapping
5. Run GENERATE_SAMPLESHEET, generate a csv file required for SANGER_TOL_BTK.
6. Run SANGER_TOL_BTK, also known as SANGER-TOL/BLOBTOOLKIT, a subpipeline of SANGER-TOL/EAR.
7. Run SANGER_TOL_CPRETEXT, also known as SANGER-TOL/CURATIONPRETEXT a subpipeline for SANGER-TOL/EAR.
@@ -27,11 +26,12 @@
The sanger-tol/ear pipeline requires a number of databases in place in order to run the blobtoolkit pipeline.
These include:
- A blast nt database
- A Diamond blast uniprot database
- A Diamond blast nr database
- An NCBI taxdump
- An NCBI rankedlineage.dmp

- A blast nt database
- A Diamond blast uniprot database
- A Diamond blast nr database
- An NCBI taxdump
- An NCBI rankedlineage.dmp

Next, a yaml file containing the following should then be completed:

@@ -70,7 +70,6 @@ btk:
config: <PATH TO ear/conf/sanger-tol-btk.config TO OVERWRITE PROCESS LIMITS>
```
Now, you can run the pipeline using:
2 changes: 1 addition & 1 deletion assets/idCulLati1.yaml
@@ -2,7 +2,7 @@
assembly_id: idCulLati1_ear
reference_hap1: /nfs/treeoflife-01/teams/tola/users/dp24/ear/idCulLati1/primary.fa
reference_hap2: /nfs/treeoflife-01/teams/tola/users/dp24/ear/idCulLati1/hap2.fa
reference_haplotigs: /
reference_haplotigs: /nfs/treeoflife-01/teams/tola/users/dp24/ear/haplotigs.fa

# If a mapped bam already exists use the below + --mapped TRUE on the nextflow command else ignore.
mapped_bam: /nfs/treeoflife-01/teams/tola/users/dp24/ear/idCulLati1/mapped_bam.bam
4 changes: 2 additions & 2 deletions assets/real_pdf.yaml
@@ -20,14 +20,14 @@ PROFILING:
# ASSEMBLY DATA
ASSEMBLIES:
Pre-curation:
pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|]
pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|]
pri:
gfastats--nstar-report_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.fa.gz.gfastats
busco_short_summary_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.insecta_odb10.busco/short_summary.specific.insecta_odb10.out_scaffolds_final.insecta_odb10.busco.txt
merqury_folder: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.ccs.merquryk/

Curated:
pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|, TreeVal_v1.1]
pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|, TreeVal_v1.1]
pri:
gfastats--nstar-report_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1.primary.curated.fa.gfastats
busco_short_summary_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/assembly/curated/idCulLati1.1/ear/idCulLati1.1.primary.curated.insecta_odb10.busco/short_summary.specific.insecta_odb10.idCulLati1.1.primary.curated.insecta_odb10.busco.txt
4 changes: 2 additions & 2 deletions assets/template_pdf.yaml
@@ -20,14 +20,14 @@ PROFILING:
# ASSEMBLY DATA
ASSEMBLIES:
Pre-curation:
pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|]
pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|]
pri:
gfastats--nstar-report_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.fa.gz.gfastats
busco_short_summary_txt: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.insecta_odb10.busco/short_summary.specific.insecta_odb10.out_scaffolds_final.insecta_odb10.busco.txt
merqury_folder: /lustre/scratch123/tol/tolqc/data/erga-bge/insects/Culex_laticinctus/working/idCulLati1.hifiasm.20240430/scaffolding/yahs/out.break.yahs/out_scaffolds_final.ccs.merquryk/

Curated:
pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|, TreeVal_v1.1]
pipeline: [hifiasm_v0.19.8-r603|--primary, purge_dups_v1.2.5|-e, yahs_v1.2a.2|, TreeVal_v1.1]
pri:
gfastats--nstar-report_txt: idCulLati1.1.primary.curated.fa.gfastats
busco_short_summary_txt: short_summary.specific.insecta_odb10.idCulLati1.1.primary.curated.insecta_odb10.busco.txt
44 changes: 26 additions & 18 deletions assets/test.yaml
@@ -1,25 +1,33 @@
assembly_id: Oscheius_DF5033
reference_hap1: /nfs/treeoflife-01/teams/tola/users/dp24/ascc/asccTinyTest_V2/assembly/pyoelii_tiny_testfile_with_adapters.fa
reference_hap2: /nfs/treeoflife-01/teams/tola/users/dp24/ascc/asccTinyTest_V2/assembly/pyoelii_tiny_testfile_with_adapters.fa
# General values for all subpipelines and modules
assembly_id: grTriPseu1
reference_hap1: /home/runner/work/ear/ear/TreeValTinyData/assembly/draft/grTriPseu1.fa
reference_hap2: /home/runner/work/ear/ear/TreeValTinyData/assembly/draft/grTriPseu1-hap.fa
reference_haplotigs: /home/runner/work/ear/ear/TreeValTinyData/assembly/draft/grTriPseu1-all_hap.fa

# If a mapped bam already exists use the below + --mapped TRUE on the nextflow command else ignore.
mapped_bam: []

merquryfk:
fastk_hist: "./"
fastk_ktab: "./"

# Used by both subpipelines
longread:
type: hifi
dir: /lustre/scratch123/tol/resources/treeval/treeval-testdata/TreeValSmallData/Oscheius_DF5033/genomic_data/nxOscSpes1/pacbio/fasta/
mapped_bam: idCulLati1/mapped_bam.bam
dir: /home/runner/work/ear/ear/TreeValTinyData/genomic_data/pacbio/

curationpretext:
aligner: minimap2
telomere_motif: TTAGG
hic_dir: /lustre/scratch123/tol/resources/treeval/treeval-testdata/TreeValSmallData/Oscheius_DF5033/genomic_data/nxOscSpes1/hic-arima2/full/
merquryfk:
fastk_hist: "./"
fastk_ktab: "./"
telomere_motif: TTAGGG
hic_dir: /home/runner/work/ear/ear/TreeValTinyData/genomic_data/hic-arima/
btk:
nt_database: /lustre/scratch123/tol/teams/tola/users/ea10/pipeline_testing/20240704_blast_tiny_testdb/blastdb/
nt_database_prefix: tiny_plasmodium_blastdb.fa
diamond_uniprot_database_path: /lustre/scratch123/tol/teams/tola/users/ea10/pipeline_testing/20240704_diamond_tiny_testdb/ascc_tinytest_diamond_db.dmnd
diamond_nr_database_path: /lustre/scratch123/tol/resources/nr/latest/nr.dmnd
ncbi_taxonomy_path: /lustre/scratch123/tol/resources/taxonomy/latest/new_taxdump
ncbi_rankedlineage_path: /lustre/scratch123/tol/teams/tola/users/ea10/databases/taxdump/rankedlineage.dmp
btk_yaml: /nfs/users/nfs_d/dp24/sanger-tol-ear/assets/btk_draft.yaml
taxid: 352914
gca_accession: GCA_0001
lineages: "diptera_odb10,insecta_odb10"
lineages: "fungi_odb10"
nt_database: /home/runner/work/ascc/ascc/NT_database/
nt_database_prefix: 18S_fungal_sequences
diamond_uniprot_database_path: /home/runner/work/ascc/ascc/diamond/UP000000212_1234679_tax.dmnd
diamond_nr_database_path: /home/runner/work/ascc/ascc/diamond/UP000000212_1234679_tax.dmnd
ncbi_taxonomy_path: /home/runner/work/ascc/ascc/ncbi_taxdump/
ncbi_rankedlineage_path: /home/runner/work/ascc/ascc/ncbi_taxdump/rankedlineage.dmp
config: /home/runner/work/ear/ear/conf/sanger-tol-btk.config
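The paths in assets/test.yaml above are absolute paths on the GitHub Actions runner (`/home/runner/work/...`), so the file only resolves there. When running locally, a quick preflight that checks the `reference_*` entries exist can save a failed launch. The sketch below uses a trimmed demo YAML and a plain `awk` scrape rather than a real YAML parser; the file name and paths are illustrative, not taken from the repo:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Trimmed-down demo YAML standing in for assets/test.yaml (paths are illustrative).
cat > demo_test.yaml <<'EOF'
assembly_id: grTriPseu1
reference_hap1: /tmp/demo_hap1.fa
reference_hap2: /tmp/demo_hap2.fa
EOF

rm -f /tmp/demo_hap1.fa /tmp/demo_hap2.fa
touch /tmp/demo_hap1.fa   # only hap1 exists in this demo

# Scrape the reference_* paths and report any that do not resolve.
missing=0
while IFS= read -r path; do
    [ -e "$path" ] || { echo "MISSING: $path"; missing=$((missing + 1)); }
done < <(awk '/^reference_/ {print $2}' demo_test.yaml)

echo "missing=$missing" | tee preflight.log
```

A non-zero count here means the launch would fail at YAML_INPUT, so the check is worth running before committing cluster time.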
4 changes: 4 additions & 0 deletions conf/modules.config
@@ -20,6 +20,10 @@ process {
]
}

withName: CAT_CAT {
ext.prefix = 'combined_haplos.fa'
}

withName: GFASTATS {
ext.args = '--nstar-report'
}
2 changes: 1 addition & 1 deletion conf/sanger-tol-btk.config
@@ -4,4 +4,4 @@ process {
memory = { check_max( 10.GB * task.attempt, 'memory' ) }
time = { check_max( 16.h * task.attempt, 'time' ) }
}
}
}
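The `check_max( 10.GB * task.attempt, 'memory' )` pattern in the config above grows the resource request linearly on each retry and clamps it to the configured maximum. The clamping logic can be sketched in plain shell (`check_max` itself is nf-core template boilerplate; the 10 GB base is from the config, while the 16 GB cap is an illustrative `--max_memory` value, not taken from the pipeline):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Mimic check_max(10.GB * task.attempt, 'memory'): the request grows
# linearly per retry attempt but is clamped to a hard maximum.
base_gb=10
max_gb=16   # illustrative --max_memory cap

for attempt in 1 2 3; do
    want=$((base_gb * attempt))
    grant=$(( want < max_gb ? want : max_gb ))
    echo "attempt=$attempt requested=${want}GB granted=${grant}GB"
done | tee retry_scaling.log
```

So a first failure triggers a retry at 20 GB requested but only 16 GB granted, which is why the cap should be set at or above what the largest BTK process genuinely needs.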
14 changes: 5 additions & 9 deletions conf/test.config
@@ -15,14 +15,10 @@ params {
config_profile_description = 'Minimal test dataset to check pipeline function'

// Limit resources so that this can run on GitHub Actions
max_cpus = 2
max_memory = '6.GB'
max_time = '6.h'
max_cpus = 2
max_memory = '6.GB'
max_time = '6.h'
input = "${projectDir}/assets/test.yaml"
outdir = "results"

// Input data
// TODO nf-core: Specify the paths to your test data on nf-core/test-datasets
// TODO nf-core: Give any required params for the test so that command line flags are not needed
input = params.pipelines_testdata_base_path + 'viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv'


}
11 changes: 4 additions & 7 deletions docs/output.md
@@ -27,15 +27,15 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d

</details>

[GFASTATS](https://github.com/vgl-hub/gfastats) is a single fast and exhaustive tool for summary statistics and simultaneous *fa* (fasta, fastq, gfa [.gz]) genome assembly file manipulation.
[GFASTATS](https://github.com/vgl-hub/gfastats) is a single fast and exhaustive tool for summary statistics and simultaneous _fa_ (fasta, fastq, gfa [.gz]) genome assembly file manipulation.

### MERQURYFK

<details markdown="1">
<summary>Output files</summary>

- `merquryfk/`
- `*.completeness.stats`:
- `*.completeness.stats`:
- `*{"primary","haplotype",""}_only.bed`:
- `*{"primary","haplotype",""}.qv`:
- `*.spectra-asm.{fl,ln,st}.png`:
@@ -47,14 +47,13 @@

Merqury is a novel tool for reference-free assembly evaluation based on efficient k-mer set operations. By comparing k-mers in a de novo assembly to those found in unassembled high-accuracy reads, Merqury estimates base-level accuracy and completeness.


## SANGER_TOL_BTK

<details markdown="1">
<summary>Output files</summary>

- `sanger/*_blobtoolkit_out/`
- `blobtoolkit/plots/*png`: Blobtoolkit plots
- `blobtoolkit/plots/*png`: Blobtoolkit plots
- `blobtoolkit/{ASSEMBLY_NAME}/*.json.gz`: Blobtoolkit dataset for use in BTK_viewer.
- `busco/*_odb10/*.{tsv,tar.gz,json,txt}`: Busco output
- `muliqc/`: MultiQC plots/data and report.html.
@@ -64,14 +63,13 @@ Merqury is a novel tool for reference-free assembly evaluation based on efficien

[SANGER_TOL_BTK](https://pipelines.tol.sanger.ac.uk/blobtoolkit) is a bioinformatics pipeline that can be used to identify and analyse non-target DNA for eukaryotic genomes.


## SANGER_TOL_CPRETEXT

<details markdown="1">
<summary>Output files</summary>

- `sanger/*_curationpretext_out/`
- `accessory_files/*.{bigWig,bed,bedgraph}`: Track files describing Telomere, gap, coverage data across the genome.
- `accessory_files/*.{bigWig,bed,bedgraph}`: Track files describing Telomere, gap, coverage data across the genome.
- `pretext_maps_raw`: Pre-accessory file ingestion pretext files.
- `pretext_maps_processed`: Post-accessory file ingestion pretext files, e.g. the final output.
- [`pipeline_info`](#pipeline-information)
@@ -80,7 +78,6 @@ Merqury is a novel tool for reference-free assembly evaluation based on efficien

[SANGER_TOL_CPRETEXT](https://pipelines.tol.sanger.ac.uk/curationpretext) is a bioinformatics pipeline typically used in conjunction with [TreeVal](https://pipelines.tol.sanger.ac.uk/treeval) to generate pretext maps (and optionally telomeric, gap, coverage, and repeat density plots which can be ingested into pretext) for the manual curation of high quality genomes.


### Pipeline information

<details markdown="1">
1 change: 0 additions & 1 deletion docs/usage.md
@@ -166,7 +166,6 @@ As in the Snakemake version [a YAML configuration file](https://github.com/blobt

The data in the YAML is currently ignored in the Nextflow pipeline version. The YAML file is retained only to allow compatibility with the BlobDir dataset generated by the [Snakemake version](https://github.com/blobtoolkit/blobtoolkit/tree/main/src/blobtoolkit-pipeline/src). The taxonomic information in the YAML file can be obtained from [NCBI Taxonomy](https://www.ncbi.nlm.nih.gov/data-hub/taxonomy/).


## Running the pipeline

The typical command for running the pipeline is as follows: