Q1 Why not use git?
A1.1 FAIRly big project: most problems were due to limitations in git or to a bad alignment of goal and technology
why export to CWL (apart from it being a standard language): condor eats CWL directly, so no adaptors are needed (tools like condor or slurm already provide them)
there are other things like CWL, but we haven't yet found one which is as capable
DataLad remake:
initial idea was to have a special remote that recomputes outputs (deterministically) and that, from the user's perspective, is indistinguishable from a get operation. This would be hugely beneficial for limiting storage requirements
The idea requires that the special remote has access to all necessary compute mechanisms and all necessary inputs
DataLad run:
records provenance: how a particular state of the data came to be
runs arbitrary shell commands
does not explicitly parameterize specific arguments that go into the custom command
"provisions" a git branch checkout
"compute": executes the shell command
"extract": extracts outputs and feeds them to git repo
the run record has no notion that the user wanted a specific output out of an arbitrary number of outputs, i.e. it does not record the --output and --explicit flags
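The three steps above can be sketched as separate functions. This is only an illustration under stated assumptions: the names `provision`, `compute`, `extract` and `remake` are hypothetical, a plain directory copy stands in for the git branch checkout, and copying files into `dest` stands in for feeding outputs back to the git repo. It is not DataLad's implementation.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def provision(source: Path, workdir: Path) -> Path:
    """Stand-in for a git branch checkout: copy the inputs into a fresh worktree."""
    worktree = workdir / "worktree"
    shutil.copytree(source, worktree)
    return worktree


def compute(worktree: Path, command: str) -> None:
    """Execute the arbitrary shell command inside the provisioned worktree."""
    subprocess.run(command, shell=True, cwd=worktree, check=True)


def extract(worktree: Path, outputs: list[str], dest: Path) -> list[Path]:
    """Copy only the explicitly requested outputs back (stand-in for saving to the repo)."""
    collected = []
    for rel in outputs:
        target = dest / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(worktree / rel, target)
        collected.append(target)
    return collected


def remake(source: Path, command: str, outputs: list[str], dest: Path) -> list[Path]:
    """Run provision, compute and extract in a throwaway working directory."""
    with tempfile.TemporaryDirectory() as tmp:
        worktree = provision(source, Path(tmp))
        compute(worktree, command)
        return extract(worktree, outputs, dest)
```

Unlike the run record described above, this sketch takes the wanted outputs as an explicit argument, so anything else the command writes is discarded together with the temporary worktree.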
In an ideal system:
we would want to be able to update any of the steps ("provision", "compute", "extract") in the least expensive way, without updating the dataset (or at least without updating more parts of it than necessary?)
In a special remote scenario:
we want to compute everything based on the annex key (git-annex starts by asking "can you give me this file", etc.)
the information on how to compute that key needs to be stored somewhere / somehow
the information may need to be updated (e.g. container technology changed)
Single git-annex branch is incompatible with the notion of version history of data:
we want to be able to reexecute some historic version of a file
we want to be able to generate the latest version of a file
But metadata opens up a world of opportunities here...
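One way to picture the requirement is an append-only history of instructions per annex key, so that instructions can be updated (for example after a container change) while older revisions stay addressable. The sketch below is an in-memory illustration only; `RecomputeSpec`, `SpecStore`, the field names and the key and image strings are all hypothetical, and nothing here reflects how git-annex stores metadata.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RecomputeSpec:
    """Hypothetical instruction record for regenerating the content of one key."""
    dataset_version: str  # commit from which the inputs are provisioned
    command: str          # shell command to execute
    container: str        # identifier of the container image or runtime


@dataclass
class SpecStore:
    """Append-only history of instruction revisions per annex key (in memory)."""
    _history: dict[str, list[RecomputeSpec]] = field(default_factory=dict)

    def record(self, key: str, spec: RecomputeSpec) -> int:
        """Append a new revision for `key` and return its revision index."""
        revisions = self._history.setdefault(key, [])
        revisions.append(spec)
        return len(revisions) - 1

    def latest(self, key: str) -> RecomputeSpec:
        """Instructions to use by default."""
        return self._history[key][-1]

    def revision(self, key: str, index: int) -> RecomputeSpec:
        """Instructions as they were at an earlier revision."""
        return self._history[key][index]

    def history(self, key: str) -> list[RecomputeSpec]:
        """All revisions for `key`, oldest first."""
        return list(self._history[key])


store = SpecStore()
old = RecomputeSpec("commit-1", "make out.txt", "image-old")
new = RecomputeSpec("commit-1", "make out.txt", "image-new")
store.record("KEY-A", old)
store.record("KEY-A", new)
```

Here `store.latest("KEY-A")` yields the updated container while `store.revision("KEY-A", 0)` still returns the original instructions.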
Implementation ideas:
provisioning is basically the same as metadata-driven dataset generation
have an API for each component: provision, compute, extract
there can be more than one "provisioner": one can be datalad clone based, another can do metadata-based provisioning
In the metadata of a git-annex key we would need to be able to find information for all three components
can key metadata remain minimal and additional instructions live elsewhere?
a layer of signed, trusted authority is necessary ("I only want to be able to recompute files that I provided instructions for myself")
git notes are another possible place where (signed) information can be placed
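The ideas above could be combined roughly as follows: one small interface per component, interchangeable provisioners, and a trust check before anything is executed. Everything in this sketch is an assumption made for illustration: the class and function names are hypothetical, workspaces are in-memory dictionaries rather than datasets, and an HMAC over the instructions stands in for a real signature scheme.

```python
import hashlib
import hmac
import json
from typing import Callable, Protocol

Workspace = dict[str, bytes]  # file name -> content, an in-memory stand-in


class Provisioner(Protocol):
    def provision(self, spec: dict) -> Workspace: ...


class Compute(Protocol):
    def compute(self, workspace: Workspace, spec: dict) -> None: ...


class Extractor(Protocol):
    def extract(self, workspace: Workspace, spec: dict) -> Workspace: ...


class CloneProvisioner:
    """Stand-in for a clone-based provisioner: copies the whole dataset."""
    def __init__(self, dataset: Workspace) -> None:
        self.dataset = dataset

    def provision(self, spec: dict) -> Workspace:
        return dict(self.dataset)


class MetadataProvisioner:
    """Stand-in for metadata-based provisioning: fetches only the listed inputs."""
    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self.fetch = fetch

    def provision(self, spec: dict) -> Workspace:
        return {name: self.fetch(name) for name in spec["inputs"]}


class UppercaseCompute:
    """Toy computation: concatenate the inputs and uppercase the result."""
    def compute(self, workspace: Workspace, spec: dict) -> None:
        joined = b"".join(workspace[name] for name in spec["inputs"])
        workspace[spec["output"]] = joined.upper()


class NamedOutputExtractor:
    """Return only the output named in the instructions."""
    def extract(self, workspace: Workspace, spec: dict) -> Workspace:
        return {spec["output"]: workspace[spec["output"]]}


def sign(spec: dict, secret: bytes) -> str:
    """HMAC over the canonical JSON form of the instructions (signature stand-in)."""
    payload = json.dumps(spec, sort_keys=True).encode()
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def recompute(spec: dict, signature: str, secret: bytes,
              provisioner: Provisioner, compute: Compute,
              extractor: Extractor) -> Workspace:
    """Refuse untrusted instructions, then run provision, compute, extract."""
    if not hmac.compare_digest(sign(spec, secret), signature):
        raise PermissionError("instructions are not signed by a trusted authority")
    workspace = provisioner.provision(spec)
    compute.compute(workspace, spec)
    return extractor.extract(workspace, spec)
```

Because `recompute` only depends on the three interfaces, swapping `CloneProvisioner` for `MetadataProvisioner` changes how inputs arrive without touching the compute or extract steps.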
Concrete use cases:
fmriprep
digital photography (raw + sidecar -> jpeg)
get sub-clip from a larger video file (distribits talks use case)