mih transferred this issue from another repository on May 2, 2024
This can be thought of as the next iteration on the datalad run record format. This established format uses one commit/record to capture one computation that can produce any number of annex keys.
A primary objective here is to design a specification that can support computing any number of annex keys, individually, without requiring one commit/record per key (think: datasets with a large number of files that can each be computed in some structured fashion).
The key to this is likely going to be a parameterizable instruction set. @mih added basic support for this to the `run` machinery in datalad/datalad#6424; see http://docs.datalad.org/en/stable/design/provenance_capture.html#placeholders-in-commands-and-io-specifications. If this is the path, a specification needs to consider two components:
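As a rough sketch of what such a parameterizable instruction could look like, loosely following the placeholder syntax linked above (the `mysim` command, the `{seed}`/`{subject}` placeholders, and the paths are illustrative assumptions, not an established format):

```yaml
# Hypothetical instruction template: one record could serve many annex
# keys, with concrete placeholder values substituted per computation.
cmd: "mysim --seed {seed} --out {outputs[0]} {inputs[0]}"
inputs:
  - "params/{subject}.json"
outputs:
  - "results/{subject}_{seed}.dat"
```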
Instruction template
The closest established concept in datalad is the side-car run record (see http://docs.datalad.org/en/stable/design/provenance_capture.html#the-provenance-record). However, this format needs a revision. A few pointers to candidate developments:
Moreover, the side-car record uses a content-based filename. Here we need to identify the instruction template somehow, but we also want to be able to edit/fix an instruction template without having to fix all references to it. See #2.
It would make sense to use developments from https://concepts.datalad.org in a revision of the run-record format. Rather than being completely implicitly defined, we can offer a user the ability to record the semantics of parameters in the fashion of `Property` in the https://concepts.datalad.org/s/thing/unreleased/ schema. But see #1 for a readily available specification (and see the CWL section below).
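A loose sketch of what recorded parameter semantics could look like, in the spirit of `Property` from that schema (the field names and parameters below are assumptions for illustration, not taken from the actual schema):

```yaml
# Hypothetical parameter-semantics declarations attached to a template;
# all names and fields are made up for illustration.
parameters:
  seed:
    description: random seed selecting one realization of the computation
    range: integer
  subject:
    description: identifier selecting the input record to process
    range: string
```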
(Per annex key) Parameter set
Here we need to find a format and place to store parameters. See #4 for a dedicated issue.
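For illustration, such a per-key record could be as small as an association of one annex key with a template reference and concrete parameter values (format, field names, and values below are hypothetical):

```yaml
# Hypothetical per-key parameter set; key, template name, and values
# are invented for illustration.
annex-key: MD5E-s100--d41d8cd98f00b204e9800998ecf8427e.dat
template: sim-template-v1
parameter:
  seed: 42
  subject: sub-01
```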
CWL-based solution
A fully defined compute instruction is a two-step CWL workflow linked to the necessary inputs.
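A minimal sketch of such a two-step workflow in CWL (the `provision.cwl`/`compute.cwl` tools and all input/output names are assumptions; only the provision-then-compute structure is the point):

```yaml
cwlVersion: v1.2
class: Workflow
inputs:
  subject: string   # selects what to provision (hypothetical parameter)
  seed: int         # non-file parameter of the computation (hypothetical)
steps:
  provision:
    run: provision.cwl          # step 1: prepare the working environment
    in: {subject: subject}
    out: [workdir]
  compute:
    run: compute.cwl            # step 2: the actual computation
    in:
      workdir: provision/workdir
      seed: seed
    out: [result]
outputs:
  result:
    type: File
    outputSource: compute/result
```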
Input declaration can be linked to a workflow definition to form a single, joint record (see #7 (comment), `cp.inputs.yaml`, for an example). The inputs are the specification of the working environment needed to perform a computation (i.e., the parameters to `remake-provision`, #12), plus any parameters of the actual computation (non-file arguments, association of provisioned files with workflow arguments). In order to get a complete record for producing a single key, we need a declaration that identifies the key in the workflow output, based on some workflow output values (e.g., output directory plus relative path). This is not necessarily related to #13, because in a special remote implementation we need to capture such an output in a dataset, but only serve it to a temporary location given by git-annex.
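Such a declaration could, for example, name a workflow output and a relative path that together pin down the file whose content corresponds to the requested key (all names and values here are illustrative assumptions):

```yaml
# Hypothetical output-identification record: ties one annex key to a
# location within the workflow's output.
output-identification:
  source: compute/result              # which workflow output to inspect
  relpath: results/sub-01_42.dat      # path relative to the output directory
  annex-key: MD5E-s100--d41d8cd98f00b204e9800998ecf8427e.dat
```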