Description
Usage scenario
I'd like to be able to return a tuple with optional elements. For example, by defining the output as tuple val(id), path("output.txt"), path("output2.txt", optional: true), I'd like a process to be able to emit an event ["foo", path("output.txt"), null].
The process and downstream processes can take a while to run, so using a multi-channel output in combination with groupTuple() (see Attempt 3) is very undesirable.
Suggested implementation
Probably this would require:
- Altering TupleOutParam.groovy#L103-L105 so that a FileOutParam can be optional even if the enclosing TupleOutParam is not.
- Changing TaskProcessor.groovy#L1519-L1524 so that a non-optional TupleOutParam whose optional FileOutParams are missing is still emitted instead of throwing a MissingFileException.
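To make the requested behaviour concrete, here is a minimal plain-Groovy sketch of the emission rule described above. It is not Nextflow's code: the Component class, the resolveTuple function and the exception type are hypothetical names used only for illustration, and a missing file is modelled as a null value.

```groovy
// Hypothetical model of one tuple component; NOT a Nextflow class.
class Component {
    String name
    boolean optional
    Object value      // null stands for "the file was not produced"
}

// Build the tuple to emit. A missing optional component becomes null;
// a missing required component is an error.
List resolveTuple(List<Component> components) {
    components.collect { c ->
        if (c.value == null && !c.optional) {
            throw new IllegalStateException("Missing output file(s) `${c.name}`")
        }
        c.value
    }
}

// The "bar" case from this issue: output2.txt is optional and absent.
def bar = [
    new Component(name: 'id',          optional: false, value: 'bar'),
    new Component(name: 'output.txt',  optional: false, value: 'output.txt'),
    new Component(name: 'output2.txt', optional: true,  value: null),
]
assert resolveTuple(bar) == ['bar', 'output.txt', null]
```

The point of the sketch is that optionality is decided per component, so the tuple as a whole is still emitted when only an optional file is absent.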
Reproducible examples
I made several attempts at getting this to run with the current implementation of Nextflow. To summarise:
- Attempt 1: optional path in non-optional tuple → errors
- Attempt 2: optional tuple → tuple with missing file is not emitted
- Attempt 3: multi-channel output followed by a groupTuple → introduces a bottleneck in workflows with long execution times
- Attempt 4: a messy workaround solution to this problem
Attempt 1: optional path in tuple
Because of TupleOutParam.groovy#L103-L105, the path's 'optional' setting is overridden by the tuple's own value for 'optional', namely false.
If I run the following code, Nextflow reports an error when output2.txt is missing.
Attempt 1 reprex
nextflow.enable.dsl=2

process test_process1 {
  input:
    tuple val(id)
  output:
    tuple val(id), path("output.txt"), path("output2.txt", optional: true)
  script:
    """
    echo $id > output.txt
    if [[ "$id" == "foo" ]]; then
      echo $id > output2.txt
    fi
    """
}

workflow {
  Channel.fromList( ["foo", "bar"] )
    | view { "input: ${it}" }
    | test_process1
    | view { "output: ${it}" }
}
↓
$ NXF_VER=21.10.6 nextflow run test_outputs_opt1.nf
input: foo
input: bar
output: [foo, work/81/e866d5e329c9ac9980a0c9313d347b/output.txt, work/81/e866d5e329c9ac9980a0c9313d347b/output2.txt]
[8c/e39e04] NOTE: Missing output file(s) `output2.txt` expected by process `test_process1 (2)` -- Error is ignored
Attempt 2: make the whole tuple optional
By making the whole tuple optional, Nextflow doesn't produce an error anymore, but my whole tuple is removed, which is undesirable.
Attempt 2 reprex
nextflow.enable.dsl=2

process test_process1 {
  input:
    tuple val(id)
  output:
    tuple val(id), path("output.txt"), path("output2.txt") optional true
  script:
    """
    echo $id > output.txt
    if [[ "$id" == "foo" ]]; then
      echo $id > output2.txt
    fi
    """
}

workflow {
  Channel.fromList( ["foo", "bar"] )
    | view { "input: ${it}" }
    | test_process1
    | view { "output: ${it}" }
}
↓
$ NXF_VER=21.10.6 nextflow run test_outputs_opt2.nf
input: foo
input: bar
output: [foo, work/95/0e07ee0b94834d4587509b152aa354/output.txt, /home/rcannoodwork/95/0e07ee0b94834d4587509b152aa354/output2.txt]
Attempt 3: multichannel output
This approach is what is proposed in #1980. However, having to use groupTuple() to merge the multi-channel output back into a single event is also undesirable, because the whole upstream Channel has to finish before any event can be emitted downstream. Note that setting size: 2 doesn't work in this case, since some groups should have one element and others two.
Attempt 3 reprex
nextflow.enable.dsl=2

process test_process2 {
  input:
    tuple val(id)
  output:
    tuple val(id), val("output1"), path("output.txt")
    tuple val(id), val("output2"), path("output2.txt") optional true
  script:
    """
    echo $id > output.txt
    if [[ "$id" == "foo" ]]; then
      echo $id > output2.txt
    fi
    """
}

workflow {
  Channel.fromList( ["foo", "bar"] )
    | view { "input: ${it}" }
    | test_process2
    | mix
    | groupTuple(by: 0)
    | map { [ it[0], [it[1], it[2]].transpose().collectEntries() ] }
    | view { "output: ${it}" }
}
↓
$ NXF_VER=21.10.6 nextflow run test_outputs_opt3.nf
input: foo
input: bar
output: [bar, [output1:work/9c/97b3a2884f97594532a19923e6c748/output.txt]]
output: [foo, [output1:work/60/984231826c9a9cc2a1e1cf29e16fdb/output.txt, output2:work/60/984231826c9a9cc2a1e1cf29e16fdb/output2.txt]]
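The map step in the Attempt 3 workflow is fairly dense, so here is a small plain-Groovy sketch of what it does to a single grouped event. Plain strings stand in for the real work-directory paths, and the variable names are only illustrative.

```groovy
// Shape of one event after groupTuple(by: 0): [id, [labels], [files]].
// Strings are placeholders for the actual path objects.
def grouped = ["foo", ["output1", "output2"], ["output.txt", "output2.txt"]]

// transpose() pairs each label with the file at the same position.
def pairs = [grouped[1], grouped[2]].transpose()
assert pairs == [["output1", "output.txt"], ["output2", "output2.txt"]]

// collectEntries() turns the [key, value] pairs into a map.
def byName = pairs.collectEntries()
assert byName == [output1: "output.txt", output2: "output2.txt"]

// The event finally emitted downstream: [id, map of outputs].
def event = [grouped[0], byName]
assert event == ["foo", [output1: "output.txt", output2: "output2.txt"]]
```

For "bar", the grouped lists each hold a single entry, so the resulting map only has an output1 key, which matches the output shown above.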
Attempt 4: add junk to output
By adding a file known to exist (e.g. ".command.sh") to the output, I can force the Channel to always return a tuple. This works, but the code looks quite messy and I need to do postprocessing to remove the additional file.
Attempt 4 reprex
nextflow.enable.dsl=2

process test_process3 {
  input:
    tuple val(id)
  output:
    tuple val(id), path{[".command.sh", "output.txt"]}, path{[".command.sh", "output2.txt"]}
  script:
    """
    echo $id > output.txt
    if [[ "$id" == "foo" ]]; then
      echo $id > output2.txt
    fi
    """
}

workflow {
  Channel.fromList( ["foo", "bar"] )
    | view { "input: ${it}" }
    | test_process3
    | map { output ->
      // pair each output name with the files captured for it
      def map = [["output1", "output2"], output.drop(1)].transpose()
      // strip the dummy .command.sh entry; a lone dummy file means the real output is missing
      def map_without_dummy = map.collectEntries{ key, out ->
        if (out instanceof List && out.size() > 2) {
          [ key, out.drop(1) ]
        } else if (out instanceof List && out.size() == 2) {
          [ key, out[1] ]
        } else {
          [ key, null ]
        }
      }
      [ output[0], map_without_dummy ]
    }
    | view { "output: ${it}" }
}
↓
$ NXF_VER=21.10.6 nextflow run test_outputs_opt4.nf
input: foo
input: bar
output: [foo, [output1:work/96/a51f95280ee3332f50b6b05a12596b/output.txt, output2:work/96/a51f95280ee3332f50b6b05a12596b/output2.txt]]
output: [bar, [output1:work/ec/87149bfea74975d37307d6a115c812/output.txt, output2:null]]