Define specification for compute instructions #5

Open
mih opened this issue Apr 29, 2024 · 0 comments

This can be thought of as the next iteration on the datalad run record format. This established format uses one commit/record to capture one computation that can produce any number of annex keys.

A primary objective here is to design a specification that supports computing any number of annex keys individually, without requiring one commit/record per key (think: datasets with a large number of files that can each be computed individually in some structured fashion).

The key to this is likely going to be a parameterizable instruction set. @mih added basic support for this to the run machinery in datalad/datalad#6424; see http://docs.datalad.org/en/stable/design/provenance_capture.html#placeholders-in-commands-and-io-specifications

If this is the path, a specification needs to consider two components:

  • the instruction template (with a declaration of parameters)
  • the (per annex key) parameterization for an instruction
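
To illustrate how these two components could fit together, here is a minimal Python sketch. It assumes templates use Python format-style placeholders, in the spirit of the existing run placeholders. `InstructionTemplate`, `render`, the example command, and all file names are hypothetical and not part of any DataLad API.

```python
from dataclasses import dataclass
from string import Formatter


@dataclass(frozen=True)
class InstructionTemplate:
    """Hypothetical instruction template with an explicit parameter declaration."""

    command: str
    parameters: tuple

    def render(self, values: dict) -> str:
        # Placeholder names that actually occur in the command string
        used = {name for _, name, _, _ in Formatter().parse(self.command) if name}
        declared = set(self.parameters)
        if used != declared:
            raise ValueError(
                f"declared {sorted(declared)}, but command uses {sorted(used)}"
            )
        missing = declared - values.keys()
        if missing:
            raise KeyError(f"missing parameters: {sorted(missing)}")
        return self.command.format(**{name: values[name] for name in declared})


# One template (assumed example command) ...
template = InstructionTemplate(
    command="convert {source} -resize {width} {target}",
    parameters=("source", "width", "target"),
)

# ... and one parameter set per output file, without one record per file
parameter_sets = {
    "thumbs/a.png": {"source": "raw/a.png", "width": 128, "target": "thumbs/a.png"},
    "thumbs/b.png": {"source": "raw/b.png", "width": 128, "target": "thumbs/b.png"},
}

commands = {path: template.render(params) for path, params in parameter_sets.items()}
```

The point of the sketch is only the split: the template is stored once, while each key carries nothing but its own small parameter set.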

Instruction template

The closest established concept in datalad is a side-car run-record (see http://docs.datalad.org/en/stable/design/provenance_capture.html#the-provenance-record). However, this format needs a revision. A few pointers for candidate developments follow.

Moreover, the side-car record uses a content-based filename. Here we need to identify the instruction template somehow, but we also want to be able to edit/fix an instruction template without having to fix all references to it. See #2
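
The tension can be shown with a small Python sketch. It assumes, for illustration only, that a content-based name is a hash of the serialized record. The `content_id` helper, the `templates` registry, and the label `resize-image` are hypothetical.

```python
import hashlib
import json


def content_id(record: dict) -> str:
    """Content-based identifier: any edit to the record yields a new name."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


original = {"cmd": "convert {source} -resize {width} {target}"}
# Fixed template: quote the file arguments
fixed = {"cmd": "convert '{source}' -resize {width} '{target}'"}

# With content-based naming, every reference to the old identifier goes stale
ids_differ = content_id(original) != content_id(fixed)

# Hypothetical alternative: references point at a stable label, and only the
# label-to-record mapping changes when the template is fixed
templates = {"resize-image": original}
reference = "resize-image"
templates["resize-image"] = fixed
resolved = templates[reference]
```

Whether the stable label is a file name, a UUID, or something else is exactly the open question; the sketch only shows why a pure content hash does not satisfy the requirement.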

It would make sense to use developments from https://concepts.datalad.org in a revision of the run-record format. Rather than leaving parameters completely implicitly defined, we can offer a user the ability to record the semantics of parameters in the fashion of Property in the https://concepts.datalad.org/s/thing/unreleased/ schema.

But see #1 for a readily available specification (and see CWL section below).

(Per annex key) Parameter set

Here we need to find a format and place to store parameters. See #4 for a dedicated issue.

CWL-based solution

A fully defined compute instruction is a two-step CWL workflow linked to the necessary inputs.
Input declaration can be linked to a workflow definition to form a single, joint record (see #7 (comment) cp.inputs.yaml for an example).

The inputs are the specification of the working environment needed to perform a computation (i.e., the parameters to remake-provision #12), plus any parameters of the actual computation (non-file arguments, association of provisioned files to workflow arguments).
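
A minimal Python sketch of this two-part input declaration follows. The class `ComputeInputs`, its field names, and all example values are hypothetical; the actual parameters of remake-provision are not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeInputs:
    """Hypothetical joint input record for one computation."""

    # Working-environment specification (what gets provisioned)
    provision: dict = field(default_factory=dict)
    # Parameters of the actual computation (non-file arguments,
    # association of provisioned files to workflow arguments)
    parameters: dict = field(default_factory=dict)

    def to_job(self) -> dict:
        """Merge both groups into one flat mapping, rejecting name clashes."""
        clash = self.provision.keys() & self.parameters.keys()
        if clash:
            raise ValueError(f"names used in both groups: {sorted(clash)}")
        return {**self.provision, **self.parameters}


inputs = ComputeInputs(
    provision={"dataset": "/path/to/dataset", "input_files": ["raw/a.png"]},
    parameters={"source": "raw/a.png", "width": 128},
)
job = inputs.to_job()
```

Keeping the two groups apart in the record, and merging them only when a job is assembled, makes it explicit which values describe the environment and which describe the computation.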

In order to get a complete record for producing a single key, we need a declaration that identifies the key in the workflow output, based on some workflow output values (e.g. output dir plus relpath etc.). This is not (necessarily) related to #13, because in a special remote implementation we need to capture such an output in a dataset, but only serve it to a temporary location given by git-annex.
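
Such a declaration could be as small as an output field name plus a relative path. The following Python sketch assumes the workflow reports its output directory under some named field; `locate_key_output`, the field name `outdir`, and the paths are hypothetical.

```python
from pathlib import PurePosixPath


def locate_key_output(outputs: dict, dir_field: str, relpath: str) -> PurePosixPath:
    """Resolve the file for one key from workflow output values.

    `outputs` is the mapping of workflow output values, `dir_field` names the
    entry holding the output directory, and `relpath` is the key's location
    relative to that directory.
    """
    target = PurePosixPath(relpath)
    # Refuse paths that could point outside the output directory
    if target.is_absolute() or ".." in target.parts:
        raise ValueError(f"not a safe relative path: {relpath!r}")
    return PurePosixPath(outputs[dir_field]) / target


workflow_outputs = {"outdir": "/tmp/work/out"}
key_file = locate_key_output(workflow_outputs, "outdir", "thumbs/a.png")
```

A special remote could then copy `key_file` to the temporary location requested by git-annex, without the output having to live in a dataset.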

@mih mih transferred this issue from another repository May 2, 2024
@github-project-automation github-project-automation bot moved this to discussion needed in DataLad remake May 2, 2024