Migrate BUSCO to nf-core module #730
Warning: Newer version of the nf-core template is available. Your pipeline is using an old version of the nf-core template: 3.1.0. For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.
@jfy133 It's green now 🎉 Some questions:
EDIT: Regarding the second point, I managed to get a pretty simple and elegant solution to store all the downloaded lineages across all the samples, basically putting each of the lineage directories as outputs instead of the whole. The PR should be ready after merging nf-core/modules#7210.
Is there scope to add this as an optional input to the BUSCO module itself? For some context, the problem I encountered with BUSCO was that the temporary working files take up a lot of storage space on disk, so running BUSCO on (in my case) 1000s of bins meant that my available scratch quota was quickly filling up and leading to mag runs that couldn't finish. So it would be good to try and keep this as an option if possible, as I imagine this is a plausible issue for other mag users also!
Hi @prototaxites,
If that's the case, maybe we don't need it?
I made a last minor addition to this PR: I replaced the module used to combine bin QC summaries with an nf-core one (csvtk/concat).
Thank you for your patience @dialvarezs!
Some initial comments. However, when comparing `-profile test` on your branch and current `dev`, I get the following results:
(nf-core) james@bionb103:~/git/nf-core/mag/testing (dev)$ cat results_dev/GenomeBinning/QC/busco_summary.tsv
GenomeBin Specific lineage dataset %Complete (specific) %Complete and single-copy (specific) %Complete and duplicated (specific) %Fragmented (specific) %Missing (specific) Total number (specific)
MEGAHIT-MetaBAT2-test_minigut_sample2.unbinned.2.fa bacteria_odb10 0.0 0.0 0.0 0.0 100.0 124
SPAdes-MaxBin2-test_minigut_sample2.noclass.1.fa bacteria_odb10 0.0 0.0 0.0 0.0 100.0 124
SPAdes-MetaBAT2-test_minigut.1.fa bacteria_odb10 13.7 13.7 0.0 2.4 83.9 124
MEGAHIT-MetaBAT2-test_minigut.unbinned.2.fa bacteria_odb10 0.0 0.0 0.0 0.8 99.2 124
MEGAHIT-MaxBin2-test_minigut.001.fa bacteria_odb10 19.4 19.4 0.0 0.8 79.8 124
SPAdes-MetaBAT2-test_minigut.unbinned.2.fa bacteria_odb10 0.0 0.0 0.0 0.0 100.0 124
MEGAHIT-MetaBAT2-test_minigut_sample2.unbinned.1.fa bacteria_odb10 0.0 0.0 0.0 0.0 100.0 124
MEGAHIT-MetaBAT2-test_minigut.1.fa bacteria_odb10 12.1 12.1 0.0 3.2 84.7 124
SPAdes-MetaBAT2-test_minigut_sample2.unbinned.1.fa bacteria_odb10 0.0 0.0 0.0 0.0 100.0 124
SPAdes-MetaBAT2-test_minigut.2.fa bacteria_odb10 20.2 20.2 0.0 1.6 78.2 124
SPAdes-MaxBin2-test_minigut_sample2.noclass.2.fa bacteria_odb10 0.0 0.0 0.0 0.0 100.0 124
MEGAHIT-MetaBAT2-test_minigut.2.fa bacteria_odb10 19.4 19.4 0.0 0.0 80.6 124
MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa bacteria_odb10 0.0 0.0 0.0 0.0 100.0 124
MEGAHIT-MaxBin2-test_minigut.002.fa bacteria_odb10 13.7 13.7 0.0 4.8 81.5 124
SPAdes-MetaBAT2-test_minigut_sample2.unbinned.2.fa bacteria_odb10 0.0 0.0 0.0 0.0 100.0 124
SPAdes-MetaBAT2-test_minigut.unbinned.1.fa bacteria_odb10 0.0 0.0 0.0 0.8 99.2 124
SPAdes-MaxBin2-test_minigut.001.fa bacteria_odb10 20.2 20.2 0.0 1.6 78.2 124
SPAdes-MaxBin2-test_minigut_sample2.001.fa bacteria_odb10 2.4 2.4 0.0 4.8 92.8 124
SPAdes-MaxBin2-test_minigut.002.fa bacteria_odb10 13.7 13.7 0.0 3.2 83.1 124
(nf-core) james@bionb103:~/git/nf-core/mag/testing (dev)$ cat results_dialvarezs-busco/GenomeBinning/QC/BUSCO/test_minigut/test_minigut-bacteria_odb12-busco.batch_summary.txt
Input_file Dataset Complete Single Duplicated Fragmented Missing n_markers Scaffold N50 Contigs N50 Percent gaps Number of scaffolds
SPAdes-MetaBAT2-test_minigut.unbinned.2.fa Run failed; check logs
SPAdes-MetaBAT2-test_minigut.unbinned.1.fa Run failed; check logs
i.e., the runs on your BUSCO implementation are failing for some reason.
We need to investigate if this is valid output or not given the new version of BUSCO and the database.
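One way to surface such failures automatically is to split the batch summary into rows and branch on the failure message. The following is only a minimal sketch: it assumes the batch summary is tab-separated and that failed bins carry the text `Run failed; check logs` in the second column, as the excerpt above suggests. The parameter name `params.batch_summary` is hypothetical and not part of the pipeline.

```nextflow
// Sketch only: flag bins whose BUSCO run failed, based on the batch summary.
// Assumptions: tab-separated file, one header line, failure text in column 2.
// `params.batch_summary` is a hypothetical parameter used for illustration.
workflow {
    Channel
        .fromPath(params.batch_summary, checkIfExists: true)
        // Skip the header line and split each remaining line on tabs
        .splitCsv(sep: '\t', header: false, skip: 1)
        .branch { row ->
            failed: row.size() > 1 && row[1].startsWith('Run failed')
            ok: true
        }
        .set { ch_rows }

    // Print one line per failed bin
    ch_rows.failed.view { row -> "BUSCO run failed for ${row[0]}" }
}
```

This could be run as `nextflow run sketch.nf --batch_summary <path to batch_summary.txt>`; rows that do not match fall through to the `ok` branch.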
It would also be helpful if @skrakau (if she has time) could look at this to make sure we aren't removing anything important. I note that the custom `busco_run.sh` script of Sabrina's original module has A LOT more LoC than the new module has (which may be due to the new version of BUSCO), but I have 0 experience with BUSCO myself.
path: { "${params.outdir}/GenomeBinning/QC/BUSCO" },
mode: params.publish_dir_mode,
overwrite: false,
pattern: "busco_downloads/lineages/*",
Does this result in the same path structure in the `--outdir` as the old BUSCO_SVE_DOWNLOAD module? If so, all good; if not, we might need to update `output.md`.
@@ -7,13 +7,22 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### `Added`

- [#730](https://github.com/nf-core/mag/pull/730) - Migrated from local BUSCO module to nf-core one (added by @dialvarezs)

### `Changed`
I would mention the default database change too.
@@ -13,7 +13,7 @@ process CHECKM2_PREDICT {

output:
tuple val(meta), path("${prefix}") , emit: checkm2_output
tuple val(meta), path("${prefix}/quality_report.tsv"), emit: checkm2_tsv
tuple val(meta), path("${prefix}_checkm2_report.tsv"), emit: checkm2_tsv
This needs to be reflected in the CHANGELOG and the OUTPUT.md
@@ -768,11 +768,6 @@
"description": "Save the used BUSCO lineage datasets provided via `--busco_db`.",
"help_text": "Useful to allow reproducibility, as BUSCO datasets are frequently updated and old versions do not always remain accessible."
},
"busco_clean": {
Hrm... my instinct here is not to remove this but to add that functionality to the nf-core module.
I've been doing this in some of the metagenomic database modules, where I have an input channel (a boolean `val()` called 'keep intermediates') which, if true, does nothing, but if it remains false injects an `rm` command into the script block.
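To illustrate the pattern described here, the following is a minimal sketch and not the actual nf-core BUSCO module. The process name `TOOL_WITH_CLEANUP`, the `tmp` directory, the `summary.txt` output, and `params.bin` are all hypothetical, and the tool invocation is replaced by a stand-in so the example stays self-contained.

```nextflow
// Sketch only: optional cleanup of intermediate files via a boolean input.
// All names here are hypothetical; the shell commands stand in for a real tool.
process TOOL_WITH_CLEANUP {
    input:
    tuple val(meta), path(bin)
    val keep_intermediates      // true: keep working files; false: delete them

    output:
    tuple val(meta), path("${prefix}/summary.txt"), emit: summary

    script:
    prefix = task.ext.prefix ?: "${meta.id}"
    // Only inject the rm command when intermediates should not be kept
    def cleanup = keep_intermediates ? '' : "rm -rf ${prefix}/tmp"
    """
    # Stand-in for the real tool: writes working files and a summary
    mkdir -p ${prefix}/tmp
    touch ${prefix}/tmp/working_file
    echo "done" > ${prefix}/summary.txt

    ${cleanup}
    """
}

workflow {
    // `params.bin` is a hypothetical parameter pointing to one bin FASTA
    ch_bins = Channel.of([[id: 'bin1'], file(params.bin, checkIfExists: true)])
    TOOL_WITH_CLEANUP(ch_bins, false)
}
```

With this layout a pipeline-level parameter such as `--busco_clean` could, in principle, be negated and passed as the second input, so the cleanup happens inside the task before the work directory fills up.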
BUSCO_DB_PREPARATION(ch_busco_db)
ch_db_for_busco = BUSCO_DB_PREPARATION.out.db.map { meta, db ->
[[id: meta, lineage: 'Y'], db]
BUSCO_UNTAR([[id: 'busco_db'], ch_busco_db])
BUSCO_UNTAR([[id: 'busco_db'], ch_busco_db])
BUSCO_UNTAR(ch_busco_db.map{db -> [[id: 'busco_db'], db]})

qc_summaries = BUSCO_BUSCO.out.batch_summary
Are these `batch_summary` files different from the old BUSCO output files? If so, we should check whether the descriptions in `output.md` need to be updated.
.map { row ->
def completeness = -1
def contamination = -1
def missing, duplicated
This deletion means that the pipeline is no longer backwards compatible with older versions of the database, correct? This is important to list in the changelog.
reason for failures I think:
This PR migrates from the local BUSCO module to the nf-core one.
Main changes
- `csvtk/concat` to merge BUSCO and `UNTAR` to prepare database instead of the custom modules
- `--save_busco_db` with publishDir directly.
- `COMBINE_TSV` local module by nf-core `csvtk/concat`
Breaking changes
- `--busco_clean` is not supported in this new setup, so the param should be deprecated
PR checklist
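Closes #484.
- Make sure your code lints (`nf-core pipelines lint`).
- Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- Check for unexpected warnings in debug mode (`nextflow run . -profile debug,test,docker --outdir <OUTDIR>`).
- Usage Documentation in `docs/usage.md` is updated.
- Output Documentation in `docs/output.md` is updated.
- `CHANGELOG.md` is updated.
- `README.md` is updated (including new tool citations and authors/contributors).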