Genocrate defines RO Crate profiles and provides validators for packaging genomic datasets for research studies. In typical genomics projects, data for multiple participants is organized into batches. Within each batch, files are grouped by participant identifier, and the batches themselves reflect when data was sequenced or transferred. Multiple batches or sequencing runs usually belong to the same study. Genocrate helps you build an RO Crate for each batch so that its files are well documented, and an overarching crate that organizes and tracks all files across the batches in a study.
Below is an example of the output folder structure generated by Genocrate. For more details, see the layout description:
./tests/fixtures/test-batches
├── batch-001
│ ├── bag-info.txt
│ ├── bagit.txt
│ ├── data
│ │ ├── (genomic file set ...)
│ │ └── ro-crate-metadata.json
│ ├── manifest-md5.txt
│ └── tagmanifest-md5.txt
├── batch-002
│ ├── bag-info.txt
│ ├── bagit.txt
│ ├── data
│ │ ├── (genomic file set ...)
│ │ └── ro-crate-metadata.json
│ ├── manifest-md5.txt
│ └── tagmanifest-md5.txt
├── batch-003
│ ├── bag-info.txt
│ ├── bagit.txt
│ ├── data
│ │ ├── (genomic file set ...)
│ │ └── ro-crate-metadata.json
│ ├── manifest-md5.txt
│ └── tagmanifest-md5.txt
├── ro-crate-metadata.json
└── ro-crate-preview.html
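Each batch folder above follows the BagIt layout, with payload files under data and their MD5 checksums listed in manifest-md5.txt. As a rough illustration of what a checksum check on such a folder involves, here is a minimal sketch that uses only the Python standard library. It is not Genocrate's validator, and the function name `verify_md5_manifest` is hypothetical.

```python
import hashlib
from pathlib import Path


def verify_md5_manifest(bag_dir):
    """Return payload paths that are missing or whose MD5 differs from manifest-md5.txt.

    Hypothetical helper for illustration; not part of Genocrate.
    """
    bag_dir = Path(bag_dir)
    manifest = (bag_dir / "manifest-md5.txt").read_text(encoding="utf-8")
    mismatches = []
    for line in manifest.splitlines():
        if not line.strip():
            continue
        # Each manifest line is "<checksum> <path relative to the bag root>".
        expected, rel_path = line.split(maxsplit=1)
        target = bag_dir / rel_path
        if not target.is_file():
            mismatches.append(rel_path)
            continue
        digest = hashlib.md5()
        with open(target, "rb") as fh:
            # Read in 1 MiB chunks so large genomic files are not loaded whole.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected.lower():
            mismatches.append(rel_path)
    return mismatches
```

Calling `verify_md5_manifest("tests/fixtures/test-batches/batch-001")` would return an empty list when every listed payload file is present and matches its recorded checksum.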
There are two profiles for describing genomic file sets:
- batch-submission: Describes a smaller set of genomic files submitted as a batch, typically representing data generated or transferred together as part of the same study.
- study-dataset: Describes the complete dataset for a study, aggregating information from multiple batch RO Crates to provide an overview of all files and participants in the study.
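In the RO-Crate specification, a crate normally declares the profiles it follows through `conformsTo` on its root dataset. The sketch below reads that declaration from an already parsed ro-crate-metadata.json. It assumes Genocrate crates record their profile this way, and the function name `root_profiles` is hypothetical.

```python
import json
from pathlib import Path


def load_metadata(crate_dir):
    """Parse ro-crate-metadata.json from a crate directory."""
    path = Path(crate_dir) / "ro-crate-metadata.json"
    return json.loads(path.read_text(encoding="utf-8"))


def root_profiles(metadata):
    """Return the conformsTo identifiers declared on the crate's root dataset.

    Hypothetical helper for illustration; not part of Genocrate.
    """
    entities = {e["@id"]: e for e in metadata.get("@graph", [])}
    # The metadata descriptor points at the root dataset through "about".
    descriptor = entities.get("ro-crate-metadata.json", {})
    root_id = descriptor.get("about", {}).get("@id", "./")
    conforms = entities.get(root_id, {}).get("conformsTo", [])
    # conformsTo may be a single reference or a list of references.
    if isinstance(conforms, dict):
        conforms = [conforms]
    return [c["@id"] for c in conforms]
```

For example, `root_profiles(load_metadata("tests/fixtures/test-batches/batch-001/data"))` would list whatever profile identifiers that batch crate declares. The actual identifiers used by Genocrate are not shown here.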
Genocrate provides the following commands:
- build: Create a root RO Crate or merge an existing root RO Crate with a new batch RO Crate that conforms to the study-dataset profile. This command reads through batch submission crates to assemble or update the study-level crate.
- csv2genocrate: Convert a CSV manifest file into an RO Crate that conforms to the batch-submission profile within the batch submission folder. This command also validates checksums defined in the CSV.
- diff: Show differences between the root study-dataset RO Crate and a new batch RO Crate before merging, helping you review changes prior to running the build command.
- validate-batch: Validate that a folder conforms to the batch-submission RO Crate profile, including checks for checksums and BagIt specification compliance.
- validate-dataset: Validate the study-level RO Crate against the study-dataset profile. Skips content and integrity checks (e.g., checksums, BagIt) handled by validate-batch.
For more details on each command, see the CLI documentation.
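To give a feel for the kind of comparison the diff command is described as performing, the sketch below compares the entity identifiers of two parsed ro-crate-metadata.json documents. This is a conceptual illustration only, under the assumption that a batch contributes new entities to the study crate's `@graph`. It does not reproduce Genocrate's actual diff logic, and `diff_entity_ids` is a hypothetical name.

```python
def diff_entity_ids(study_metadata, batch_metadata):
    """Compare entity @id values between a study crate and a batch crate.

    Hypothetical helper for illustration; not part of Genocrate.
    """
    study_ids = {e["@id"] for e in study_metadata.get("@graph", [])}
    batch_ids = {e["@id"] for e in batch_metadata.get("@graph", [])}
    return {
        # Entities the batch would introduce into the study crate.
        "added": sorted(batch_ids - study_ids),
        # Entities present in both crates, which a merge would need to reconcile.
        "shared": sorted(batch_ids & study_ids),
    }
```

Reviewing the "shared" entries before merging is where conflicting descriptions of the same file or participant would be most likely to surface.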
Install this tool using pip:
pip install genocrate
For help, run:
genocrate --help
You can also use:
python -m genocrate --help
Detailed command-line documentation is available in the CLI docs.
To contribute to this tool, first check out the code. Then create a new virtual environment:
uv venv
source .venv/bin/activate
Now install the dependencies and test dependencies:
pip install -e '.[test]'
To run the tests:
python -m pytest
To update the CLI docs:
python ./scripts/generate_cli_docs.py