umccr/genocrate

genocrate

Overview

Genocrate defines RO Crate profiles and provides validators for packaging genomic datasets for research studies. In a typical genomics project, data for multiple participants is organized into batches, each grouped by participant identifiers and reflecting when the data was sequenced or transferred. Multiple batches or sequencing runs usually belong to the same study. Genocrate helps you build an RO Crate for each batch so that its files are well documented, and an overarching crate that organizes and tracks all files across the batches in a study.

Expected Output Structure

The tree below is an example of the output folder structure generated by Genocrate; see the layout description for more details.

./tests/fixtures/test-batches
├── batch-001
│   ├── bag-info.txt
│   ├── bagit.txt
│   ├── data
│   │   ├── (genomic file set ...)
│   │   └── ro-crate-metadata.json
│   ├── manifest-md5.txt
│   └── tagmanifest-md5.txt
├── batch-002
│   ├── bag-info.txt
│   ├── bagit.txt
│   ├── data
│   │   ├── (genomic file set ...)
│   │   └── ro-crate-metadata.json
│   ├── manifest-md5.txt
│   └── tagmanifest-md5.txt
├── batch-003
│   ├── bag-info.txt
│   ├── bagit.txt
│   ├── data
│   │   ├── (genomic file set ...)
│   │   └── ro-crate-metadata.json
│   ├── manifest-md5.txt
│   └── tagmanifest-md5.txt
├── ro-crate-metadata.json
└── ro-crate-preview.html
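
As a quick sanity check of the layout above, the following minimal sketch lists which of the expected entries are absent from a batch folder. It is not part of Genocrate: the helper name `missing_bag_entries` and the constant `EXPECTED_BAG_ENTRIES` are hypothetical, and the list of entries is taken only from the example tree.

```python
from pathlib import Path

# Entries that every batch folder contains in the example tree above.
# This list is an assumption drawn from that tree, not from Genocrate's code.
EXPECTED_BAG_ENTRIES = (
    "bagit.txt",
    "bag-info.txt",
    "manifest-md5.txt",
    "tagmanifest-md5.txt",
    "data/ro-crate-metadata.json",
)


def missing_bag_entries(batch_dir):
    """Return the expected entries that are not regular files under batch_dir."""
    root = Path(batch_dir)
    return [name for name in EXPECTED_BAG_ENTRIES if not (root / name).is_file()]
```

For a complete batch folder such as `./tests/fixtures/test-batches/batch-001`, the function should return an empty list. It only checks that the files exist; it does not validate their contents.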

Profiles

There are two profiles for describing genomic file sets:

  • batch-submission: Describes a smaller set of genomic files submitted as a batch, typically representing data generated or transferred together as part of the same study.
  • study-dataset: Describes the complete dataset for a study, aggregating information from multiple batch RO Crates to provide an overview of all files and participants in the study.
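
To give a sense of what a batch-level ro-crate-metadata.json might contain, here is a minimal sketch that builds an RO-Crate 1.1 skeleton in Python. It is an illustration only: `PROFILE_URI` is a hypothetical placeholder (the real profile identifiers are defined by Genocrate), the function name `minimal_crate` and the file name in the usage line are made up, and the skeleton omits root dataset properties that RO-Crate 1.1 requires, such as name, description, datePublished and license.

```python
import json

# Hypothetical placeholder; the real profile URI is defined by Genocrate.
PROFILE_URI = "https://example.org/profiles/batch-submission"


def minimal_crate(file_names, profile_uri=PROFILE_URI):
    """Build a skeleton RO-Crate 1.1 metadata document listing the given files."""
    files = [{"@id": name, "@type": "File"} for name in file_names]
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {
                # Metadata descriptor: points at the root dataset.
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {
                # Root dataset: references the profile and every file entity.
                "@id": "./",
                "@type": "Dataset",
                "conformsTo": {"@id": profile_uri},
                "hasPart": [{"@id": entry["@id"]} for entry in files],
            },
            *files,
        ],
    }


if __name__ == "__main__":
    # "sample-A.fastq.gz" is an invented file name for illustration.
    print(json.dumps(minimal_crate(["sample-A.fastq.gz"]), indent=2))
```

A study-dataset crate would follow the same shape, with the root dataset pointing at that profile instead and aggregating entries from several batches.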

CLI suite

  • build: Create a root RO Crate that conforms to the study-dataset profile, or merge a new batch RO Crate into an existing root RO Crate. This command reads through batch submission crates to assemble or update the study-level crate.
  • csv2genocrate: Convert a CSV manifest file into an RO Crate that conforms to the batch-submission profile within the batch submission folder. This command also validates checksums defined in the CSV.
  • diff: Show differences between the root study-dataset RO Crate and a new batch RO Crate before merging, helping you review changes prior to running the build command.
  • validate-batch: Validate that a folder conforms to the batch-submission RO Crate profile, including checks for checksums and BagIt specification compliance.
  • validate-dataset: Validate the study-level RO Crate against the study-dataset profile. Skips content and integrity checks (e.g., checksums, BagIt), which are handled by validate-batch.

For more details on each command, see the CLI documentation.
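
The kind of checksum check described for validate-batch can be sketched with the standard library alone. This is not Genocrate's implementation: the function names `md5_of` and `checksum_mismatches` are hypothetical, and the sketch assumes the plain BagIt manifest layout of one `checksum path` pair per line, with paths relative to the bag root. It ignores the percent-encoding that BagIt applies to unusual characters in paths.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_mismatches(bag_dir, manifest_name="manifest-md5.txt"):
    """Return manifest paths whose file is missing or whose MD5 differs."""
    root = Path(bag_dir)
    bad = []
    for line in (root / manifest_name).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        # Each line is "<checksum><whitespace><path relative to the bag root>".
        expected, rel_path = line.split(maxsplit=1)
        target = root / rel_path
        if not target.is_file() or md5_of(target) != expected.lower():
            bad.append(rel_path)
    return bad
```

An empty list from `checksum_mismatches("path/to/batch")` means every file listed in the manifest is present and matches its recorded checksum. The same function can be pointed at tagmanifest-md5.txt through the `manifest_name` argument.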

Installation

Install this tool using pip:

pip install genocrate

Usage

For help, run:

genocrate --help

You can also use:

python -m genocrate --help

CLI Documentation

Detailed command-line documentation is available in the CLI docs.

Development

To contribute to this tool, first check out the code. Then create a new virtual environment:

uv venv
source .venv/bin/activate

Now install the dependencies and test dependencies. Because uv venv does not install pip into the environment by default, use uv's pip interface:

uv pip install -e '.[test]'

To run the tests:

python -m pytest

To update the CLI docs:

python ./scripts/generate_cli_docs.py
