Preserve numeric-like IDs in QC and add regression test#307
Conversation
There was a problem hiding this comment.
Pull request overview
This PR aims to prevent “numeric-like” sample/library identifiers (especially those with leading zeros) from being coerced into numbers during parsing, with a regression test to ensure QC dashboard parsing preserves IDs as strings.
Changes:
- Force string dtypes when reading sample sheets / sample metadata in Snakemake Python contexts.
- Add
read_table_preserve_ids()helper to the QC RMarkdown dashboard and switch keyread.table()call sites to preserve ID columns as character. - Add a regression test that extracts and executes the R helper to verify leading zeros are preserved for headered and headerless tables.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| workflow/rules/common.smk | Adds dtype mappings for sample sheet + metadata read_csv to preserve numeric-like IDs. |
| workflow/modules/qc/Snakefile | Adds dtype mappings when reading samples/metadata for QC module configuration. |
| workflow/modules/qc/scripts/qc_dashboard_interactive.Rmd | Introduces read_table_preserve_ids() and updates table reads to avoid ID coercion + disables name mangling/factors. |
| workflow/modules/postprocess/Snakefile | Adds dtype mappings when reading samples/metadata for postprocess module configuration. |
| tests/tests.py | Adds R helper extractor + regression test; expands mapfile test coverage and adds FASTQ numeric-like ID dry-run test. |
| tests/unit_tests.py | No functional change (line normalization). |
| docs/setup.md | Removes trailing blank line. |
Comments suppressed due to low confidence (1)
workflow/rules/common.smk:407
pd.read_csv(..., dtype=SAMPLE_SHEET_DTYPES)will raise a ValueError if the input CSV does not contain all keys in the dtype mapping.library_idis optional perworkflow/schemas/samples.schema.yaml(not inrequired), and the code below explicitly handles the missing column, so this read will fail before that fallback runs (e.g., sample sheets liketests/sample_sheets/local_gvcf.csvhave nolibrary_id). Consider building the dtype mapping from the actual header (e.g., readnrows=0first and intersect keys), or only forcingsample_idat read time and handlinglibrary_idwith a conditional second pass.
SAMPLE_SHEET_DTYPES = {
"sample_id": "string",
"library_id": "string",
}
SAMPLE_METADATA_DTYPES = {
"sample_id": "string",
}
samples_df = pd.read_csv(config["samples"], dtype=SAMPLE_SHEET_DTYPES)
global_mark_duplicates = bool(config["reads"]["mark_duplicates"])
if "library_id" not in samples_df.columns:
samples_df["library_id"] = samples_df["sample_id"]
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| _SAMPLE_SHEET_DTYPES = { | ||
| "sample_id": "string", | ||
| "library_id": "string", | ||
| } | ||
|
|
||
| _SAMPLE_METADATA_DTYPES = { | ||
| "sample_id": "string", | ||
| } | ||
|
|
||
|
|
||
| _samples_df = pd.read_csv(config["samples"], dtype=_SAMPLE_SHEET_DTYPES) | ||
| _ALL_SAMPLES = _samples_df["sample_id"].unique().tolist() | ||
|
|
||
| _metadata_df = None | ||
| _meta_path = config.get("sample_metadata", "") | ||
| if _meta_path: | ||
| _metadata_df = pd.read_csv(_meta_path) | ||
| _metadata_df = pd.read_csv(_meta_path, dtype=_SAMPLE_METADATA_DTYPES) |
There was a problem hiding this comment.
pd.read_csv(config["samples"], dtype=_SAMPLE_SHEET_DTYPES) will error when the samples CSV does not include library_id (dtype mappings require all specified columns to be present). This module only needs sample_id, and some fixtures (e.g. tests/data/qc/samples.csv) are headerless aside from sample_id, so this change is likely to break the QC module. Suggest intersecting the dtype map with the parsed header (or only specifying {"sample_id": "string"} here).
There was a problem hiding this comment.
@copilot apply changes based on this feedback. we should intersect dtype map with parsed header
There was a problem hiding this comment.
Done in commit 2d39d2d. Added a _read_csv_safe helper in both workflow/modules/qc/Snakefile and workflow/modules/postprocess/Snakefile that reads the CSV header first (nrows=0, zero data rows so essentially free), then filters the dtype map to only the columns actually present before doing the full read. This avoids the ValueError when library_id is absent while still preserving leading-zero IDs at parse time.
| _SAMPLE_SHEET_DTYPES = { | ||
| "sample_id": "string", | ||
| "library_id": "string", | ||
| } | ||
|
|
||
| _SAMPLE_METADATA_DTYPES = { | ||
| "sample_id": "string", | ||
| } | ||
|
|
||
|
|
||
| _samples_df = pd.read_csv(config["samples"], dtype=_SAMPLE_SHEET_DTYPES) | ||
| _ALL_SAMPLES = _samples_df["sample_id"].unique().tolist() | ||
|
|
||
| _metadata_df = None | ||
| _meta_path = config.get("sample_metadata", "") | ||
| if _meta_path: | ||
| _metadata_df = pd.read_csv(_meta_path) | ||
| _metadata_df = pd.read_csv(_meta_path, dtype=_SAMPLE_METADATA_DTYPES) |
There was a problem hiding this comment.
pd.read_csv(config["samples"], dtype=_SAMPLE_SHEET_DTYPES) will raise if library_id is absent, but library_id is optional in the samples schema and this module only uses sample_id. To avoid breaking valid sample sheets, either only force sample_id to string here, or compute dtype by intersecting _SAMPLE_SHEET_DTYPES with the columns present in the CSV header before calling read_csv.
There was a problem hiding this comment.
@copilot apply changes based on this feedback we should intersect dtype map with parsed header
…s Snakefiles Agent-Logs-Url: https://github.com/harvardinformatics/snparcher/sessions/ab77797b-0889-48eb-8ac3-fd6d1ca54c86 Co-authored-by: tsackton <[email protected]>
…stprocess Snakefiles" This reverts commit 2d39d2d.
pd.read_csverrors when dtype map contains column names absent in the CSVworkflow/modules/qc/Snakefile: add_read_csv_safehelper that intersects_SAMPLE_SHEET_DTYPESwith actual CSV columns beforeread_csvworkflow/modules/postprocess/Snakefile: same fix