Migrate BUSCO to nf-core module #730

Open: wants to merge 29 commits into base: dev

@dialvarezs dialvarezs commented Dec 12, 2024

This PR migrates from the local BUSCO module to the nf-core one.

Main changes

  • Updates to latest BUSCO version (v5.8.2)
  • Uses csvtk/concat to merge the BUSCO summaries and UNTAR to prepare the database, instead of the custom modules
  • BUSCO now runs in batch mode, so it runs one process per sample instead of one per bin
  • In the current version, BUSCO only generates a single file, not one per domain and one specific, so the GTDB-Tk part that uses QC metrics to filter bins was simplified.
  • Removed the logic for collecting "failed bins", as they were not being used.
  • Handle --save_busco_db with publishDir directly.
  • Replace COMBINE_TSV local module by nf-core csvtk/concat

Breaking changes

  • --busco_clean is not supported in this new setup, so the param should be deprecated

PR checklist

Closes #484.

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@dialvarezs dialvarezs changed the title from "Migrate to nf-core BUSCO module (WIP)" to "Migrate BUSCO to nf-core module" Dec 12, 2024
nf-core-bot commented Dec 12, 2024

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.1.0.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

dialvarezs commented Dec 12, 2024

@jfy133 It's green now 🎉

Some questions:

  • How should I deprecate a param (busco_clean)?

  • About --save_busco_db: the current module doesn't emit the database downloads as an output, so I'm adding that in "Add busco downloads as output" (modules#7210). I'll try using publishDir. Previously, the issue was that multiple BUSCO processes were publishing to the same directory, but now BUSCO runs per sample rather than per bin, which should result in far fewer simultaneous publish operations.

EDIT:

Regarding the second point, I managed to get a pretty simple and elegant solution to store all the downloaded lineages across all the samples: basically, emitting each of the lineage directories as outputs instead of the whole busco_downloads directory. So the custom module for that is no longer necessary.

The PR should be ready once nf-core/modules#7210 is merged.
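
For readers following along, here is a minimal sketch of what publishing the lineage directories through publishDir could look like in a Nextflow config. The path, mode, overwrite and pattern values mirror the publishDir options quoted later in this review. The process selector 'BUSCO_BUSCO' and the use of `enabled` with params.save_busco_db are assumptions for illustration, not necessarily what the PR does.

```groovy
process {
    // Assumed selector: the process may be included under a different name or alias.
    withName: 'BUSCO_BUSCO' {
        publishDir = [
            path: { "${params.outdir}/GenomeBinning/QC/BUSCO" },
            mode: params.publish_dir_mode,
            // Do not rewrite a lineage directory that another sample's task already published.
            overwrite: false,
            // Publish only the individual lineage directories, not the whole download folder.
            pattern: "busco_downloads/lineages/*",
            // Assumption: publish only when the user asked to keep the database.
            enabled: params.save_busco_db
        ]
    }
}
```

Because each per-sample task may emit the same lineage directory, `overwrite: false` should keep later tasks from rewriting a copy that is already in place.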

@dialvarezs dialvarezs marked this pull request as ready for review December 12, 2024 19:55
@prototaxites (Contributor)

> How should I deprecate a param (busco_clean)?

Is there scope to add this as an optional input to the BUSCO module itself? For some context, the problem I encountered with BUSCO was that the temporary working files take up a lot of storage space on disk, so running BUSCO on (in my case) 1000s of bins meant that my available scratch quota was quickly filling up and leading to mag runs that couldn't finish. So it would be good to try and keep this as an option if possible as I imagine this is a plausible issue for other mag users also!

@dialvarezs (Contributor, Author)

Hi @prototaxites,
I don’t think there should be any issue with porting the "clean" part of the script to the nf-core module. I’ll go ahead and open a PR for it.
By the way, the temporary storage usage should be significantly reduced now, as it only runs one BUSCO instance per sample.

@prototaxites (Contributor)

> By the way, the temporary storage usage should be significantly reduced now, as it only runs one BUSCO instance per sample.

If that's the case, maybe we don't need it?

@dialvarezs (Contributor, Author)

The tests I ran for this PR look like this:

auto: [image not preserved]

auto_prok: [image not preserved]

How do these numbers look to you?
For 20 samples, I estimate we could trim ~150 GB in auto (and about half of that in auto_prok).

dialvarezs commented Jan 2, 2025

I made one last minor addition to this PR: I replaced the module used to combine bin QC summaries with an nf-core one (csvtk/concat).

@jfy133 jfy133 left a comment

Thank you for your patience @dialvarezs !

Some initial comments. However, when comparing -profile test on your branch and current dev, I get the following results:

(nf-core) james@bionb103:~/git/nf-core/mag/testing (dev)$ cat results_dev/GenomeBinning/QC/busco_summary.tsv 
GenomeBin       Specific lineage dataset        %Complete (specific)    %Complete and single-copy (specific)    %Complete and duplicated (specific)     %Fragmented (specific)  %Missing (specific)        Total number (specific)
MEGAHIT-MetaBAT2-test_minigut_sample2.unbinned.2.fa     bacteria_odb10  0.0     0.0     0.0     0.0     100.0   124
SPAdes-MaxBin2-test_minigut_sample2.noclass.1.fa        bacteria_odb10  0.0     0.0     0.0     0.0     100.0   124
SPAdes-MetaBAT2-test_minigut.1.fa       bacteria_odb10  13.7    13.7    0.0     2.4     83.9    124
MEGAHIT-MetaBAT2-test_minigut.unbinned.2.fa     bacteria_odb10  0.0     0.0     0.0     0.8     99.2    124
MEGAHIT-MaxBin2-test_minigut.001.fa     bacteria_odb10  19.4    19.4    0.0     0.8     79.8    124
SPAdes-MetaBAT2-test_minigut.unbinned.2.fa      bacteria_odb10  0.0     0.0     0.0     0.0     100.0   124
MEGAHIT-MetaBAT2-test_minigut_sample2.unbinned.1.fa     bacteria_odb10  0.0     0.0     0.0     0.0     100.0   124
MEGAHIT-MetaBAT2-test_minigut.1.fa      bacteria_odb10  12.1    12.1    0.0     3.2     84.7    124
SPAdes-MetaBAT2-test_minigut_sample2.unbinned.1.fa      bacteria_odb10  0.0     0.0     0.0     0.0     100.0   124
SPAdes-MetaBAT2-test_minigut.2.fa       bacteria_odb10  20.2    20.2    0.0     1.6     78.2    124
SPAdes-MaxBin2-test_minigut_sample2.noclass.2.fa        bacteria_odb10  0.0     0.0     0.0     0.0     100.0   124
MEGAHIT-MetaBAT2-test_minigut.2.fa      bacteria_odb10  19.4    19.4    0.0     0.0     80.6    124
MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa     bacteria_odb10  0.0     0.0     0.0     0.0     100.0   124
MEGAHIT-MaxBin2-test_minigut.002.fa     bacteria_odb10  13.7    13.7    0.0     4.8     81.5    124
SPAdes-MetaBAT2-test_minigut_sample2.unbinned.2.fa      bacteria_odb10  0.0     0.0     0.0     0.0     100.0   124
SPAdes-MetaBAT2-test_minigut.unbinned.1.fa      bacteria_odb10  0.0     0.0     0.0     0.8     99.2    124
SPAdes-MaxBin2-test_minigut.001.fa      bacteria_odb10  20.2    20.2    0.0     1.6     78.2    124
SPAdes-MaxBin2-test_minigut_sample2.001.fa      bacteria_odb10  2.4     2.4     0.0     4.8     92.8    124
SPAdes-MaxBin2-test_minigut.002.fa      bacteria_odb10  13.7    13.7    0.0     3.2     83.1    124
(nf-core) james@bionb103:~/git/nf-core/mag/testing (dev)$ cat results_dialvarezs-busco/GenomeBinning/QC/BUSCO/test_minigut/test_minigut-bacteria_odb12-busco.batch_summary.txt 
Input_file      Dataset Complete        Single  Duplicated      Fragmented      Missing n_markers       Scaffold N50    Contigs N50     Percent gaps    Number of scaffolds
SPAdes-MetaBAT2-test_minigut.unbinned.2.fa      Run failed; check logs
SPAdes-MetaBAT2-test_minigut.unbinned.1.fa      Run failed; check logs

i.e., the runs on your BUSCO implementation are failing for some reason.

We need to investigate if this is valid output or not given the new version of BUSCO and the database.

It would also be helpful if @skrakau (if she has time) could look at this to make sure we aren't removing anything important. I note that the custom busco_run.sh script of Sabrina's original module has A LOT more LoC than the new module has (which may be due to the new version of BUSCO), but I have 0 experience with BUSCO myself.

path: { "${params.outdir}/GenomeBinning/QC/BUSCO" },
mode: params.publish_dir_mode,
overwrite: false,
pattern: "busco_downloads/lineages/*",

Does this result in the same path structure in the --outdir as the old BUSCO_SAVE_DOWNLOAD module? (If it does, all good; if not, we might need to update output.md.)

@@ -7,13 +7,22 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### `Added`

- [#730](https://github.com/nf-core/mag/pull/730) - Migrated from local BUSCO module to nf-core one (added by @dialvarezs)

### `Changed`

I would mention the default database change too.

@@ -13,7 +13,7 @@ process CHECKM2_PREDICT {

output:
tuple val(meta), path("${prefix}") , emit: checkm2_output
tuple val(meta), path("${prefix}/quality_report.tsv"), emit: checkm2_tsv
tuple val(meta), path("${prefix}_checkm2_report.tsv"), emit: checkm2_tsv

This needs to be reflected in the CHANGELOG and the OUTPUT.md

@@ -768,11 +768,6 @@
"description": "Save the used BUSCO lineage datasets provided via `--busco_db`.",
"help_text": "Useful to allow reproducibility, as BUSCO datasets are frequently updated and old versions do not always remain accessible."
},
"busco_clean": {

hrm... my instinct here is not to remove this but just add that functionality to the nf-core module.

I've been doing this in some of the metagenomic database modules, where I have an input channel (a boolean val() called 'keep intermediates') which, if true, does nothing, but if left false injects an rm command into the script block.
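
A minimal sketch of that pattern, assuming a hypothetical process (TOOL_WITH_CLEANUP), hypothetical params (params.fasta, params.keep_intermediates) and a placeholder shell command standing in for the real tool. It is not the nf-core BUSCO module, only an illustration of conditionally injecting an rm command:

```groovy
// Hypothetical parameters for demonstration only.
params.fasta = 'bins.fa'
params.keep_intermediates = false

// Hypothetical process illustrating the pattern; not the actual nf-core BUSCO module.
process TOOL_WITH_CLEANUP {
    input:
    tuple val(meta), path(fasta)
    val keep_intermediates // boolean: true keeps temporary files, false removes them

    output:
    tuple val(meta), path("${meta.id}.summary.txt"), emit: summary

    script:
    // Inject the rm command only when intermediates should not be kept.
    def cleanup = keep_intermediates ? '' : "rm -rf ${meta.id}_tmp"
    """
    mkdir -p ${meta.id}_tmp
    wc -l < ${fasta} > ${meta.id}_tmp/lines.txt
    cp ${meta.id}_tmp/lines.txt ${meta.id}.summary.txt
    ${cleanup}
    """
}

workflow {
    ch_input = Channel.of([[id: 'sample1'], file(params.fasta)])
    TOOL_WITH_CLEANUP(ch_input, params.keep_intermediates)
}
```

With this shape the pipeline could keep --busco_clean and simply pass its negation as the boolean input, so the default behaviour stays unchanged.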

BUSCO_DB_PREPARATION(ch_busco_db)
ch_db_for_busco = BUSCO_DB_PREPARATION.out.db.map { meta, db ->
[[id: meta, lineage: 'Y'], db]
BUSCO_UNTAR([[id: 'busco_db'], ch_busco_db])

Suggested change
BUSCO_UNTAR([[id: 'busco_db'], ch_busco_db])
BUSCO_UNTAR(ch_busco_db.map{db -> [[id: 'busco_db'], db]})


qc_summaries = BUSCO_BUSCO.out.batch_summary

Are these `batch_summary` files different from the old BUSCO output files? If so, we should check if the descriptions in `output.md` need to be updated.

.map { row ->
def completeness = -1
def contamination = -1
def missing, duplicated

This deletion means that the pipeline is no longer backwards compatible with older versions of the database, correct? This is important to list in the changelog.
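
To make the shape of the simplified parsing concrete, here is a hedged sketch of reading a BUSCO batch summary into per-bin QC values. The column names (Input_file, Complete, Duplicated) come from the batch summary header shown earlier in this review. Treating Duplicated as the contamination value, using -1 for bins without usable metrics, and the params.batch_summary parameter are all assumptions for illustration, not the PR's actual code.

```groovy
// Hypothetical parameter pointing at a single batch summary file.
params.batch_summary = 'test_minigut-bacteria_odb12-busco.batch_summary.txt'

workflow {
    Channel
        .fromPath(params.batch_summary)
        .splitCsv(header: true, sep: '\t')
        .map { row ->
            // -1 marks bins without usable metrics, e.g. rows reading "Run failed; check logs".
            def completeness = -1
            def contamination = -1
            if (row.Complete?.isNumber()) {
                completeness = row.Complete as Double
            }
            if (row.Duplicated?.isNumber()) {
                // Assumption: duplicated BUSCOs are used as the contamination value.
                contamination = row.Duplicated as Double
            }
            [row.Input_file, completeness, contamination]
        }
        .view()
}
```

Only the single batch summary format is handled here, which is why the changelog should state that the older per-domain and specific summary files are no longer parsed.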

jfy133 commented Jan 19, 2025

reason for failures I think:

********************************************************************************
Running BUSCO on file: /home/james/git/nf-core/mag/testing/work/24/90cfdc2ff2e2e1083f6c1f6e74b3ff/input_seqs/SPAdes-MetaBAT2-test_minigut.unbinned.2.fa
********************************************************************************


2025-01-19 15:55:58 INFO:busco.BuscoRunner      Input file is /home/james/git/nf-core/mag/testing/work/24/90cfdc2ff2e2e1083f6c1f6e74b3ff/input_seqs/SPAdes-MetaBAT2-test_minigut.unbinned.2.fa
2025-01-19 15:55:58 ERROR:busco.BuscoRunner     Unable to run BUSCO in offline mode. Dataset /home/james/git/nf-core/mag/testing/work/24/90cfdc2ff2e2e1083f6c1f6e74b3ff/busco_db/lineages/bacteria_odb12 does not exist.
2025-01-19 15:55:58 INFO:busco.BuscoRunner

********************************************************************************
Running BUSCO on file: /home/james/git/nf-core/mag/testing/work/24/90cfdc2ff2e2e1083f6c1f6e74b3ff/input_seqs/SPAdes-MetaBAT2-test_minigut.unbinned.1.fa
********************************************************************************


2025-01-19 15:55:58 INFO:busco.BuscoRunner      Input file is /home/james/git/nf-core/mag/testing/work/24/90cfdc2ff2e2e1083f6c1f6e74b3ff/input_seqs/SPAdes-MetaBAT2-test_minigut.unbinned.1.fa
2025-01-19 15:55:58 ERROR:busco.BuscoRunner     Unable to run BUSCO in offline mode. Dataset /home/james/git/nf-core/mag/testing/work/24/90cfdc2ff2e2e1083f6c1f6e74b3ff/busco_db/lineages/bacteria_odb12 does not exist.
2025-01-19 15:55:58 DEBUG:urllib3.connectionpool        Starting new HTTPS connection (1): busco-data.ezlab.org:443
2025-01-19 15:55:58 DEBUG:urllib3.connectionpool        Starting new HTTPS connection (1): busco-data.ezlab.org:443
2025-01-19 15:55:59 DEBUG:urllib3.connectionpool        https://busco-data.ezlab.org:443 "PUT /upload/rundata202501191555589915025.txt HTTP/11" 201 0
2025-01-19 15:55:59 DEBUG:urllib3.connectionpool        https://busco-data.ezlab.org:443 "PUT /upload/rundata202501191555589915025.txt HTTP/11" 201 0
2025-01-19 15:55:59 DEBUG:busco.run_BUSCO       File uploaded successfully.

4 participants