V-12 references role/control_type in aggregated_data obs, but columns are undeclared #16

@aofei-liu

Problem

Validation rule V-12 states:

When role is "control" in aggregated_data.h5ad obs, control_type MUST be present and MUST be one of "non-targeting" or "intergenic". Values MUST match the corresponding control_type in perturbation_library.csv.

However, the aggregated-data spec does not declare role or control_type as obs columns. The only declared obs columns are perturbation_id, the columns named in observation_unit, and optional cluster_group_{N}.

This means V-12 requires columns that the spec never tells submitters to include.

Proposed fix: rewrite V-12 as an FK-join validation

aggregated_data.h5ad already has perturbation_id as a foreign key to perturbation_library.csv, and the library spec already defines role and control_type with its own validation rules (V-10, V-11). Rather than duplicating those columns into the aggregated data, V-12 should validate through the join:

Current V-12:

| ID | Condition | Constraint |
| --- | --- | --- |
| V-12 | role is "control" in aggregated_data.h5ad obs | control_type MUST be present and MUST be one of "non-targeting" or "intergenic". Values MUST match the corresponding control_type in perturbation_library.csv. |

Proposed V-12:

| ID | Condition | Constraint |
| --- | --- | --- |
| V-12 | aggregated_data.h5ad obs row joins to a perturbation_library.csv row where role is "control" (via perturbation_id FK) | The joined library row MUST satisfy V-10 (i.e., control_type is present and is one of "non-targeting" or "intergenic"). No role or control_type column is required in aggregated_data.h5ad obs. |
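
A minimal sketch of how the proposed V-12 could be checked, assuming obs and the library are both available as pandas DataFrames (for example, obs taken from the loaded aggregated_data.h5ad and the library read from perturbation_library.csv). The function name `check_v12`, the sample IDs, and the non-control role value "targeting" are hypothetical and are not taken from the spec.

```python
import pandas as pd

# Allowed values quoted from V-12 / V-10 in this issue.
ALLOWED_CONTROL_TYPES = {"non-targeting", "intergenic"}


def check_v12(obs: pd.DataFrame, library: pd.DataFrame) -> list[str]:
    """Return perturbation_ids used in obs whose joined library row is a
    control with a missing or disallowed control_type."""
    # Resolve the FK: only perturbation_id is read from obs, so no role or
    # control_type column is needed there.
    joined = obs[["perturbation_id"]].merge(
        library[["perturbation_id", "role", "control_type"]],
        on="perturbation_id",
        how="left",
    )
    # Rows whose FK does not resolve have a missing role and are skipped
    # here; this sketch assumes FK resolution is checked separately.
    controls = joined[joined["role"] == "control"]
    # isin() is False for missing values, so an absent control_type fails too.
    bad = controls[~controls["control_type"].isin(ALLOWED_CONTROL_TYPES)]
    return sorted(bad["perturbation_id"].unique())


# Hypothetical example data.
library = pd.DataFrame(
    {
        "perturbation_id": ["p1", "p2", "p3", "p4"],
        "role": ["targeting", "control", "control", "control"],
        "control_type": [None, "non-targeting", "safe-harbor", None],
    }
)
obs = pd.DataFrame({"perturbation_id": ["p1", "p2", "p3", "p3", "p4"]})

violations = check_v12(obs, library)
print(violations)
```

With this data, p3 (disallowed value) and p4 (missing value) are reported, while p1 is ignored because its library row is not a control.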

Why this is better

  • No data duplication. role and control_type are properties of the perturbation, not the aggregation unit. Copying them into obs creates a sync hazard.
  • V-10/V-11 already cover the constraint. The library-side rules ensure every control perturbation has a valid control_type. V-12 only needs to confirm the FK resolves to a row that passes those rules.
  • Aggregated-data spec stays clean. No need to add columns that exist solely to re-check a constraint the library already enforces.
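
To illustrate the no-duplication point, a consumer that needs role or control_type alongside obs could derive them on demand through the same FK instead of reading stored copies. This is only a sketch: the helper name `with_library_fields` and the sample values are hypothetical, and it assumes perturbation_id is unique in the library.

```python
import pandas as pd


def with_library_fields(obs: pd.DataFrame, library: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of obs with role and control_type looked up from the
    library via perturbation_id. The input obs is left unchanged."""
    # Assumes perturbation_id is unique in the library; map() raises on a
    # duplicated lookup index.
    lookup = library.set_index("perturbation_id")
    out = obs.copy()
    for col in ("role", "control_type"):
        # map() keeps the obs index, so observation labels are preserved.
        out[col] = out["perturbation_id"].map(lookup[col])
    return out


# Hypothetical example data.
library = pd.DataFrame(
    {
        "perturbation_id": ["p1", "p2"],
        "role": ["targeting", "control"],
        "control_type": [None, "non-targeting"],
    }
)
obs = pd.DataFrame({"perturbation_id": ["p2", "p1"]}, index=["unit_a", "unit_b"])

annotated = with_library_fields(obs, library)
print(annotated)
```

Because the values are looked up each time, a correction to perturbation_library.csv is picked up automatically and cannot drift out of sync with obs.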

Context

Surfaced during the aggregate_id / observation_unit redesign (#13 follow-up). This is a pre-existing issue; it was not introduced by that change.
