Preserve numeric-like IDs in QC and add regression test by tsackton · Pull Request #307 · harvardinformatics/snparcher

tsackton · 2026-04-07T14:43:24Z

Understand the issue: pd.read_csv errors when dtype map contains column names absent in the CSV
Fix workflow/modules/qc/Snakefile: add _read_csv_safe helper that intersects _SAMPLE_SHEET_DTYPES with actual CSV columns before read_csv
Fix workflow/modules/postprocess/Snakefile: same fix
Validate changes with parallel_validation (CodeQL clean, code review passed)

Copilot

Pull request overview

This PR aims to prevent “numeric-like” sample/library identifiers (especially those with leading zeros) from being coerced into numbers during parsing, with a regression test to ensure QC dashboard parsing preserves IDs as strings.

Changes:

Force string dtypes when reading sample sheets / sample metadata in Snakemake Python contexts.
Add read_table_preserve_ids() helper to the QC RMarkdown dashboard and switch key read.table() call sites to preserve ID columns as character.
Add a regression test that extracts and executes the R helper to verify leading zeros are preserved for headered and headerless tables.
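
As a small illustration of the coercion this overview describes, the sketch below parses the same two-row sheet with and without string dtypes. The CSV text and variable names are made up for the example; only `pd.read_csv` and its `dtype` argument come from the PR.

```python
import io

import pandas as pd

# Hypothetical sample sheet with leading-zero identifiers (not from the repo).
CSV_TEXT = "sample_id,library_id\n007,0012\n042,0034\n"

# Default parsing infers integer columns, so the leading zeros are lost.
default_df = pd.read_csv(io.StringIO(CSV_TEXT))

# Forcing string dtypes keeps the identifiers exactly as written.
string_df = pd.read_csv(
    io.StringIO(CSV_TEXT),
    dtype={"sample_id": "string", "library_id": "string"},
)

default_ids = default_df["sample_id"].tolist()
string_ids = string_df["sample_id"].tolist()

print(default_ids)  # integers, zeros dropped
print(string_ids)   # strings, zeros kept
```

Once an ID has been inferred as an integer, casting it back to a string cannot restore the zeros, which is why the dtype has to be set at read time.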

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| workflow/rules/common.smk | Adds dtype mappings for sample sheet + metadata `read_csv` to preserve numeric-like IDs. |
| workflow/modules/qc/Snakefile | Adds dtype mappings when reading samples/metadata for QC module configuration. |
| workflow/modules/qc/scripts/qc_dashboard_interactive.Rmd | Introduces `read_table_preserve_ids()` and updates table reads to avoid ID coercion + disables name mangling/factors. |
| workflow/modules/postprocess/Snakefile | Adds dtype mappings when reading samples/metadata for postprocess module configuration. |
| tests/tests.py | Adds R helper extractor + regression test; expands mapfile test coverage and adds FASTQ numeric-like ID dry-run test. |
| tests/unit_tests.py | No functional change (line normalization). |
| docs/setup.md | Removes trailing blank line. |

Comments suppressed due to low confidence (1)

workflow/rules/common.smk:407

pd.read_csv(..., dtype=SAMPLE_SHEET_DTYPES) will raise a ValueError if the input CSV does not contain all keys in the dtype mapping. library_id is optional per workflow/schemas/samples.schema.yaml (not in required), and the code below explicitly handles the missing column, so this read will fail before that fallback runs (e.g., sample sheets like tests/sample_sheets/local_gvcf.csv have no library_id). Consider building the dtype mapping from the actual header (e.g., read nrows=0 first and intersect keys), or only forcing sample_id at read time and handling library_id with a conditional second pass.

SAMPLE_SHEET_DTYPES = {
    "sample_id": "string",
    "library_id": "string",
}

SAMPLE_METADATA_DTYPES = {
    "sample_id": "string",
}


samples_df = pd.read_csv(config["samples"], dtype=SAMPLE_SHEET_DTYPES)
global_mark_duplicates = bool(config["reads"]["mark_duplicates"])

if "library_id" not in samples_df.columns:
    samples_df["library_id"] = samples_df["sample_id"]
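
The second alternative mentioned in the comment above (force only sample_id first, then handle library_id conditionally) might look roughly like the sketch below. The function name `read_samples_two_pass` and the use of in-memory CSV text instead of `config["samples"]` are assumptions made so the example is self-contained. The "second pass" is written as a re-read rather than a cast, because a column already parsed as integers has lost its leading zeros.

```python
import io

import pandas as pd


def read_samples_two_pass(csv_text: str) -> pd.DataFrame:
    """Hypothetical helper: read sample_id as text, then deal with library_id."""
    # First pass: only sample_id is forced to string.
    df = pd.read_csv(io.StringIO(csv_text), dtype={"sample_id": "string"})
    if "library_id" in df.columns:
        # Second pass: re-read so library_id is parsed as text from the raw file.
        df = pd.read_csv(
            io.StringIO(csv_text),
            dtype={"sample_id": "string", "library_id": "string"},
        )
    else:
        # Same fallback as the snippet above: reuse sample_id.
        df["library_id"] = df["sample_id"]
    return df


with_lib = read_samples_two_pass("sample_id,library_id\n007,0012\n")
without_lib = read_samples_two_pass("sample_id\n007\n")
```

Reading the file twice is wasteful for large inputs, but sample sheets are typically small, so the cost is unlikely to matter here.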

Copilot · 2026-04-07T14:48:17Z

+_SAMPLE_SHEET_DTYPES = {
+    "sample_id": "string",
+    "library_id": "string",
+}
+
+_SAMPLE_METADATA_DTYPES = {
+    "sample_id": "string",
+}
+
+
+_samples_df = pd.read_csv(config["samples"], dtype=_SAMPLE_SHEET_DTYPES)
 _ALL_SAMPLES = _samples_df["sample_id"].unique().tolist()

 _metadata_df = None
 _meta_path = config.get("sample_metadata", "")
 if _meta_path:
-    _metadata_df = pd.read_csv(_meta_path)
+    _metadata_df = pd.read_csv(_meta_path, dtype=_SAMPLE_METADATA_DTYPES)


pd.read_csv(config["samples"], dtype=_SAMPLE_SHEET_DTYPES) will error when the samples CSV does not include library_id (dtype mappings require all specified columns to be present). This module only needs sample_id, and some fixtures (e.g. tests/data/qc/samples.csv) have no header columns other than sample_id, so this change is likely to break the QC module. Suggest intersecting the dtype map with the parsed header (or only specifying {"sample_id": "string"} here).

@copilot apply changes based on this feedback. we should intersect dtype map with parsed header

Done in commit 2d39d2d. Added a _read_csv_safe helper in both workflow/modules/qc/Snakefile and workflow/modules/postprocess/Snakefile that reads the CSV header first (nrows=0, zero data rows so essentially free), then filters the dtype map to only the columns actually present before doing the full read. This avoids the ValueError when library_id is absent while still preserving leading-zero IDs at parse time.
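
The thread does not show the body of `_read_csv_safe`, and commit 2d39d2d was later reverted (c1cffd3), so the following is only a guess at what such a helper could look like, not code from the merged PR. It follows the description above: read the header with `nrows=0`, keep the dtype entries whose columns are present, then do the full read. The temporary file and its contents are invented for the demonstration. The sketch behaves the same whether or not a given pandas version raises on dtype keys for absent columns.

```python
import os
import tempfile

import pandas as pd

_SAMPLE_SHEET_DTYPES = {
    "sample_id": "string",
    "library_id": "string",
}


def _read_csv_safe(path, dtype_map):
    """Assumed implementation: apply only dtype entries for columns in the file."""
    # Header-only read: parses column names and no data rows.
    header = pd.read_csv(path, nrows=0).columns
    present = {col: dt for col, dt in dtype_map.items() if col in header}
    return pd.read_csv(path, dtype=present)


# Demonstration with a throwaway sheet that has no library_id column.
with tempfile.TemporaryDirectory() as tmp:
    sheet = os.path.join(tmp, "samples.csv")
    with open(sheet, "w") as fh:
        fh.write("sample_id\n007\n042\n")
    df = _read_csv_safe(sheet, _SAMPLE_SHEET_DTYPES)

sample_ids = df["sample_id"].tolist()
columns = list(df.columns)
```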

Copilot · 2026-04-07T14:48:18Z

+_SAMPLE_SHEET_DTYPES = {
+    "sample_id": "string",
+    "library_id": "string",
+}
+
+_SAMPLE_METADATA_DTYPES = {
+    "sample_id": "string",
+}
+
+
+_samples_df = pd.read_csv(config["samples"], dtype=_SAMPLE_SHEET_DTYPES)
 _ALL_SAMPLES = _samples_df["sample_id"].unique().tolist()

 _metadata_df = None
 _meta_path = config.get("sample_metadata", "")
 if _meta_path:
-    _metadata_df = pd.read_csv(_meta_path)
+    _metadata_df = pd.read_csv(_meta_path, dtype=_SAMPLE_METADATA_DTYPES)


pd.read_csv(config["samples"], dtype=_SAMPLE_SHEET_DTYPES) will raise if library_id is absent, but library_id is optional in the samples schema and this module only uses sample_id. To avoid breaking valid sample sheets, either only force sample_id to string here, or compute dtype by intersecting _SAMPLE_SHEET_DTYPES with the columns present in the CSV header before calling read_csv.

@copilot apply changes based on this feedback. We should intersect the dtype map with the parsed header.

Tim Sackton and others added 4 commits April 6, 2026 20:10

Preserve numeric-like sample identifiers

214a163

Trim trailing whitespace in setup docs

fde9e9f

Add trailing newline to unit tests

ffe9d1f

Fix QC numeric ID preservation and add regression test

441c133

tsackton requested a review from Copilot April 7, 2026 14:44

Copilot started reviewing on behalf of tsackton April 7, 2026 14:44

Copilot AI reviewed Apr 7, 2026

Copilot started work on behalf of tsackton April 7, 2026 14:50

fix: intersect dtype map with actual CSV columns in qc and postproces…

2d39d2d

…s Snakefiles Agent-Logs-Url: https://github.com/harvardinformatics/snparcher/sessions/ab77797b-0889-48eb-8ac3-fd6d1ca54c86 Co-authored-by: tsackton <[email protected]>

Copilot finished work on behalf of tsackton April 7, 2026 14:57

Revert "fix: intersect dtype map with actual CSV columns in qc and po…

c1cffd3

…stprocess Snakefiles" This reverts commit 2d39d2d.

tsackton merged commit a39d44b into main Apr 7, 2026
13 checks passed

tsackton deleted the codex/allow-numeric-ids branch April 7, 2026 16:44

tsackton mentioned this pull request Apr 7, 2026

problems with sample IDs that contain only numbers #306

Closed

tsackton merged 6 commits into main from codex/allow-numeric-ids

3 participants
