mih transferred this issue from another repository on May 2, 2024
This can be thought of as the next iteration on the datalad run record format. This established format uses one commit/record to capture one computation that can produce any number of annex keys.
A primary objective here is to design a specification that can support computing any number of annex keys, individually, without requiring one commit/record per key (think: datasets with a large number of files that can each be computed in some structured fashion).
The key to this is likely going to be a parameterizable instruction set. @mih added basic support for this to the `run` machinery in datalad/datalad#6424; see http://docs.datalad.org/en/stable/design/provenance_capture.html#placeholders-in-commands-and-io-specifications. If this is the path, a specification needs to consider two components:
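As a rough sketch of what such a parameterizable instruction could look like, loosely following the placeholder syntax linked above (the `mysim` command, the `{seed}`/`{subject}` placeholders, and the paths are illustrative assumptions, not an established format):

```yaml
# Hypothetical instruction template: one record could serve many annex
# keys, with concrete placeholder values substituted per computation.
cmd: "mysim --seed {seed} --out {outputs[0]} {inputs[0]}"
inputs:
  - "params/{subject}.json"
outputs:
  - "results/{subject}_{seed}.dat"
```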
Instruction template
The closest established concept in datalad is the side-car run record (see http://docs.datalad.org/en/stable/design/provenance_capture.html#the-provenance-record). However, this format needs a revision. A few pointers to candidate developments:
Moreover, the side-car record uses a content-based filename. Here we need to identify the instruction template somehow, but we also want to be able to edit/fix an instruction template without having to fix all references to it. See #2.
It would make sense to use developments from https://concepts.datalad.org in a revision of the run-record format. Rather than being completely implicitly defined, we can offer a user the ability to record the semantics of parameters in the fashion of `Property` in the https://concepts.datalad.org/s/thing/unreleased/ schema. But see #1 for a readily available specification (and see the CWL section below).
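A loose sketch of what recorded parameter semantics could look like, in the spirit of `Property` from that schema (the field names and parameters below are assumptions for illustration, not taken from the actual schema):

```yaml
# Hypothetical parameter-semantics declarations attached to a template;
# all names and fields are made up for illustration.
parameters:
  seed:
    description: random seed selecting one realization of the computation
    range: integer
  subject:
    description: identifier selecting the input record to process
    range: string
```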
(Per annex key) Parameter set
Here we need to find a format and place to store parameters. See #4 for a dedicated issue.
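For illustration, such a per-key record could be as small as an association of one annex key with a template reference and concrete parameter values (format, field names, and values below are hypothetical):

```yaml
# Hypothetical per-key parameter set; key, template name, and values
# are invented for illustration.
annex-key: MD5E-s100--d41d8cd98f00b204e9800998ecf8427e.dat
template: sim-template-v1
parameter:
  seed: 42
  subject: sub-01
```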
CWL-based solution
A fully defined compute instruction is a two-step CWL workflow linked to the necessary inputs.
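A minimal sketch of such a two-step workflow in CWL (the `provision.cwl`/`compute.cwl` tools and all input/output names are assumptions; only the provision-then-compute structure is the point):

```yaml
cwlVersion: v1.2
class: Workflow
inputs:
  subject: string   # selects what to provision (hypothetical parameter)
  seed: int         # non-file parameter of the computation (hypothetical)
steps:
  provision:
    run: provision.cwl          # step 1: prepare the working environment
    in: {subject: subject}
    out: [workdir]
  compute:
    run: compute.cwl            # step 2: the actual computation
    in:
      workdir: provision/workdir
      seed: seed
    out: [result]
outputs:
  result:
    type: File
    outputSource: compute/result
```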
Input declaration can be linked to a workflow definition to form a single, joint record (see #7 (comment), `cp.inputs.yaml`, for an example). The inputs are the specification of the working environment needed to perform a computation (i.e., the parameters to `remake-provision`, #12), plus any parameters of the actual computation (non-file arguments, association of provisioned files with workflow arguments). In order to get a complete record for producing a single key, we need a declaration that identifies the key in the workflow output, based on some workflow output values (e.g., output directory plus relative path). This is not necessarily related to #13, because in a special remote implementation we need to capture such an output in a dataset, but only serve it to a temporary location given by git-annex.
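Such a declaration could, for example, name a workflow output and a relative path that together pin down the file whose content corresponds to the requested key (all names and values here are illustrative assumptions):

```yaml
# Hypothetical output-identification record: ties one annex key to a
# location within the workflow's output.
output-identification:
  source: compute/result              # which workflow output to inspect
  relpath: results/sub-01_42.dat      # path relative to the output directory
  annex-key: MD5E-s100--d41d8cd98f00b204e9800998ecf8427e.dat
```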