Problem
Validation rule V-12 states:
When role is "control" in aggregated_data.h5ad obs, control_type MUST be present and MUST be one of "non-targeting" or "intergenic". Values MUST match the corresponding control_type in perturbation_library.csv.
However, the aggregated-data spec does not declare role or control_type as obs columns. The only declared obs columns are perturbation_id, the columns named in observation_unit, and optional cluster_group_{N}.
This means V-12 requires columns that the spec never tells submitters to include.
Proposed fix: rewrite V-12 as an FK-join validation
aggregated_data.h5ad already has perturbation_id as a foreign key to perturbation_library.csv, and the library spec already defines role and control_type with its own validation rules (V-10, V-11). Rather than duplicating those columns into the aggregated data, V-12 should validate through the join:
Current V-12:
| ID | Condition | Constraint |
| --- | --- | --- |
| V-12 | role is "control" in aggregated_data.h5ad obs | control_type MUST be present and MUST be one of "non-targeting" or "intergenic". Values MUST match the corresponding control_type in perturbation_library.csv. |
Proposed V-12:
| ID | Condition | Constraint |
| --- | --- | --- |
| V-12 | aggregated_data.h5ad obs row joins to a perturbation_library.csv row where role is "control" (via perturbation_id FK) | The joined library row MUST satisfy V-10 (i.e., control_type is present and is one of "non-targeting" or "intergenic"). No role or control_type column is required in aggregated_data.h5ad obs. |
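As a rough illustration of the proposed rule, a validator could resolve each obs perturbation_id against the library and check control_type only on the library side. This is a minimal sketch, not the project's validator: the function name `check_v12`, the plain list/dict inputs, the "targeting" role value in the usage example, and the choice to report an unresolved perturbation_id as a V-12 error are all assumptions made for the example.

```python
# Allowed values taken from the V-12 / V-10 wording above.
ALLOWED_CONTROL_TYPES = {"non-targeting", "intergenic"}


def check_v12(obs_perturbation_ids, library_rows):
    """Return a list of violation messages; an empty list means the check passed.

    obs_perturbation_ids: iterable of perturbation_id values from obs (hypothetical input shape).
    library_rows: list of dicts, one per perturbation_library.csv row (hypothetical input shape).
    """
    # Index the library by perturbation_id so each obs row can be joined via the FK.
    library = {row["perturbation_id"]: row for row in library_rows}

    errors = []
    # Each distinct perturbation_id only needs to be checked once.
    for pid in sorted(set(obs_perturbation_ids)):
        row = library.get(pid)
        if row is None:
            # Assumption: an FK that does not resolve is reported here.
            errors.append(f"{pid}: not found in perturbation_library.csv")
            continue
        if row.get("role") != "control":
            # The rule only applies to rows that join to a control perturbation.
            continue
        if row.get("control_type") not in ALLOWED_CONTROL_TYPES:
            errors.append(f"{pid}: control row has invalid or missing control_type")
    return errors
```

With a library containing a valid control row `P1` and a hypothetical non-control row `P2`, `check_v12(["P1", "P2", "P1"], library)` would return an empty list, and obs itself never needs a role or control_type column.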
Why this is better
- No data duplication. role and control_type are properties of the perturbation, not the aggregation unit. Copying them into obs creates a sync hazard.
- V-10/V-11 already cover the constraint. The library-side rules ensure every control perturbation has a valid control_type. V-12 only needs to confirm the FK resolves to a row that passes those rules.
- Aggregated-data spec stays clean. No need to add columns that exist solely to re-check a constraint the library already enforces.
Context
Surfaced during the aggregate_id / observation_unit redesign (#13 follow-up). Pre-existing issue — not introduced by that change.