Notes from brain storming session #15

Open
mih opened this issue May 17, 2024 · 0 comments
Q1 Why not use git?

A1.1 Fairly Big project: most problems were due to limitations in git or to a poor alignment of goal and technology

Why export to CWL (apart from it being a standard language)? Condor eats CWL, so we need no adaptors of our own: tools like condor or slurm already provide them.

There are other things like CWL, but we have not yet found one that is as capable.

DataLad remake:

  • Collect use cases for datalad-remake (and underlying tooling) #3
  • The initial idea was a special remote that recomputes outputs (deterministically) and whose functionality is, for the user, indistinguishable from a get operation. This would be hugely beneficial in terms of limiting storage requirements
  • existing similar extension: https://pypi.org/project/datalad-getexec/
    • assumes compute tooling is available
    • assumes inputs are available
    • The idea requires that the special remote has access to all necessary compute mechanisms and all necessary inputs

DataLad run:

  • records provenance: how a particular state of the data came to be
  • runs arbitrary shell commands
  • does not explicitly parameterize specific arguments that go into the custom command
    • "provisions" a git branch checkout
    • "compute": executes the shell command
    • "extract": extracts outputs and feeds them to git repo
  • run record does not have the notion that the user wanted a specific output (out of an arbitrary number of outputs, i.e. it does not record --output --explicit flags)
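The three phases named above can be sketched in a few lines. This is only an illustration under loud assumptions: a temporary directory stands in for the provisioned git branch checkout, a returned dict stands in for feeding results back into the git repo, and the function names `provision`, `compute`, and `extract` are hypothetical rather than DataLad API. The `requested` argument shows the notion the run record is said to lack: the caller asks for one specific output out of several.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def provision(inputs: dict[str, str]) -> Path:
    """Create a scratch worktree and place the input files in it."""
    workdir = Path(tempfile.mkdtemp(prefix="provision-"))
    for name, content in inputs.items():
        (workdir / name).write_text(content)
    return workdir


def compute(workdir: Path, argv: list[str]) -> None:
    """Execute the command inside the provisioned worktree."""
    subprocess.run(argv, cwd=workdir, check=True)


def extract(workdir: Path, requested: list[str]) -> dict[str, str]:
    """Collect only the outputs the caller explicitly asked for."""
    return {name: (workdir / name).read_text() for name in requested}


# A stand-in command that writes two outputs; only one is requested.
argv = [
    sys.executable, "-c",
    "t = open('in.txt').read(); "
    "open('upper.txt', 'w').write(t.upper()); "
    "open('lower.txt', 'w').write(t.lower())",
]
workdir = provision({"in.txt": "Hello"})
compute(workdir, argv)
result = extract(workdir, ["upper.txt"])
```

Here `lower.txt` is still produced in the worktree, but only `upper.txt` is extracted, because that is what was requested.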

In an ideal system:

  • we would want to be able to update every step ("provision", "compute", "extract") without updating (an unnecessarily large part of?) the dataset, in the least expensive way
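One possible reading of this goal is that each step sits behind its own narrow interface, so a single step can be replaced while the other two stay untouched. The sketch below assumes exactly that; `Recipe`, `Workspace`, and the step functions are hypothetical names, and an in-memory dict stands in for a real worktree.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict

Workspace = Dict[str, str]  # file name -> content; stands in for a worktree


@dataclass(frozen=True)
class Recipe:
    """Hypothetical bundle of the three independently replaceable steps."""
    provision: Callable[[], Workspace]
    compute: Callable[[Workspace], Workspace]
    extract: Callable[[Workspace], Dict[str, str]]

    def remake(self) -> Dict[str, str]:
        return self.extract(self.compute(self.provision()))


def provision_inline() -> Workspace:
    return {"in.txt": "Hello"}


def compute_upper(ws: Workspace) -> Workspace:
    return {**ws, "out.txt": ws["in.txt"].upper()}


def compute_lower(ws: Workspace) -> Workspace:
    return {**ws, "out.txt": ws["in.txt"].lower()}


def extract_out(ws: Workspace) -> Dict[str, str]:
    return {"out.txt": ws["out.txt"]}


recipe = Recipe(provision_inline, compute_upper, extract_out)
# Update only the compute step; provision and extract are reused as-is.
updated = replace(recipe, compute=compute_lower)
```

Calling `recipe.remake()` and `updated.remake()` runs the same provisioning and extraction with different compute steps.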

In a special remote scenario:

  • we want to compute everything based on annex key (git-annex starts by asking "can you give me this file", etc.)
  • the information on how to compute that key needs to be stored somewhere / somehow
  • the information may need to be updated (e.g. container technology changed)
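The three points above amount to a lookup from annex key to compute instructions that can be updated without the key changing. A minimal sketch, assuming an in-memory store: `InstructionStore`, `can_provide`, the example key, and the instruction fields are all made up for illustration and do not reflect git-annex or DataLad interfaces.

```python
from typing import Dict, Optional


class InstructionStore:
    """Hypothetical lookup from annex key to compute instructions."""

    def __init__(self) -> None:
        self._by_key: Dict[str, dict] = {}

    def set(self, key: str, instructions: dict) -> None:
        # Overwrites earlier instructions for the same key.
        self._by_key[key] = dict(instructions)

    def get(self, key: str) -> Optional[dict]:
        return self._by_key.get(key)


def can_provide(store: InstructionStore, key: str) -> bool:
    """Answer the 'can you give me this file' question for a key."""
    return store.get(key) is not None


store = InstructionStore()
key = "SHA256E-s5--0000.jpg"  # made-up key, for illustration only

store.set(key, {"cmd": "convert raw.dng out.jpg", "container": "tech-a"})
# The container technology changed: same key, updated instructions.
store.set(key, {"cmd": "convert raw.dng out.jpg", "container": "tech-b"})
```

Where such a store would actually live (key metadata, git notes, elsewhere) is the open question the notes raise; the sketch leaves it open.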

A single git-annex branch is incompatible with the notion of a version history of data:

  • we want to be able to reexecute some historic version of a file
  • we want to be able to generate the latest version of a file

But metadata opens up a world of opportunities here...

Implementation ideas:

  • provisioning is basically the same as metadata-driven dataset generation

  • have an API for each component: provision, compute, extract

  • there can be more than one "provisioner", one can be datalad clone based, other can do metadata based provisioning

  • In the metadata of a git-annex key we would need to be able to find information for all three components

    • can key metadata remain minimal and additional instructions live elsewhere?
    • a layer of signed, trusted authority is necessary ("I only want to be able to recompute files that I provided instructions for myself")
    • git notes are another possible place where (signed) information can be placed
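The trust requirement can be illustrated with a small check that refuses instructions not signed by a trusted party. This is a sketch under stated assumptions: an HMAC with a shared secret stands in for a real signature scheme (an actual setup would more likely verify public-key signatures), and `TRUSTED`, `sign`, and `is_trusted` are hypothetical names.

```python
import hashlib
import hmac
import json

# Secrets of the signers this user trusts (hypothetical; shared secrets
# are used here only as a stand-in for public-key signatures).
TRUSTED = {"me": b"my-secret"}


def sign(instructions: dict, signer: str, secret: bytes) -> dict:
    """Attach a signer name and an HMAC over the canonical JSON payload."""
    payload = json.dumps(instructions, sort_keys=True).encode()
    sig = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return {"instructions": instructions, "signer": signer, "signature": sig}


def is_trusted(signed: dict) -> bool:
    """Accept only instructions signed by a trusted signer and unmodified."""
    secret = TRUSTED.get(signed["signer"])
    if secret is None:
        return False
    payload = json.dumps(signed["instructions"], sort_keys=True).encode()
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["signature"])


mine = sign({"cmd": "convert raw.dng out.jpg"}, "me", b"my-secret")
theirs = sign({"cmd": "rm -rf ."}, "stranger", b"other-secret")
# Instructions altered after signing no longer match the signature.
tampered = {**mine, "instructions": {"cmd": "rm -rf ."}}
```

A recompute step would then call `is_trusted` before executing anything, so that only `mine` is eligible to run.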

Concrete use cases:

  • fmriprep
  • digital photography (raw + sidecar -> jpeg)
  • get sub-clip from a larger video file (distribits talks use case)