Skip to content

Preserve numeric-like IDs in QC and add regression test#307

Merged
tsackton merged 6 commits into
mainfrom
codex/allow-numeric-ids
Apr 7, 2026
Merged

Preserve numeric-like IDs in QC and add regression test#307
tsackton merged 6 commits into
mainfrom
codex/allow-numeric-ids

Conversation

@tsackton
Copy link
Copy Markdown
Contributor

@tsackton tsackton commented Apr 7, 2026

  • Understand the issue: pd.read_csv errors when dtype map contains column names absent in the CSV
  • Fix workflow/modules/qc/Snakefile: add _read_csv_safe helper that intersects _SAMPLE_SHEET_DTYPES with actual CSV columns before read_csv
  • Fix workflow/modules/postprocess/Snakefile: same fix
  • Validate changes with parallel_validation (CodeQL clean, code review passed)

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to prevent “numeric-like” sample/library identifiers (especially those with leading zeros) from being coerced into numbers during parsing, with a regression test to ensure QC dashboard parsing preserves IDs as strings.

Changes:

  • Force string dtypes when reading sample sheets / sample metadata in Snakemake Python contexts.
  • Add read_table_preserve_ids() helper to the QC RMarkdown dashboard and switch key read.table() call sites to preserve ID columns as character.
  • Add a regression test that extracts and executes the R helper to verify leading zeros are preserved for headered and headerless tables.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
workflow/rules/common.smk Adds dtype mappings for sample sheet + metadata read_csv to preserve numeric-like IDs.
workflow/modules/qc/Snakefile Adds dtype mappings when reading samples/metadata for QC module configuration.
workflow/modules/qc/scripts/qc_dashboard_interactive.Rmd Introduces read_table_preserve_ids() and updates table reads to avoid ID coercion + disables name mangling/factors.
workflow/modules/postprocess/Snakefile Adds dtype mappings when reading samples/metadata for postprocess module configuration.
tests/tests.py Adds R helper extractor + regression test; expands mapfile test coverage and adds FASTQ numeric-like ID dry-run test.
tests/unit_tests.py No functional change (line normalization).
docs/setup.md Removes trailing blank line.
Comments suppressed due to low confidence (1)

workflow/rules/common.smk:407

  • pd.read_csv(..., dtype=SAMPLE_SHEET_DTYPES) will raise a ValueError if the input CSV does not contain all keys in the dtype mapping. library_id is optional per workflow/schemas/samples.schema.yaml (not in required), and the code below explicitly handles the missing column, so this read will fail before that fallback runs (e.g., sample sheets like tests/sample_sheets/local_gvcf.csv have no library_id). Consider building the dtype mapping from the actual header (e.g., read nrows=0 first and intersect keys), or only forcing sample_id at read time and handling library_id with a conditional second pass.
SAMPLE_SHEET_DTYPES = {
    "sample_id": "string",
    "library_id": "string",
}

SAMPLE_METADATA_DTYPES = {
    "sample_id": "string",
}


samples_df = pd.read_csv(config["samples"], dtype=SAMPLE_SHEET_DTYPES)
global_mark_duplicates = bool(config["reads"]["mark_duplicates"])

if "library_id" not in samples_df.columns:
    samples_df["library_id"] = samples_df["sample_id"]

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +51 to +67
_SAMPLE_SHEET_DTYPES = {
"sample_id": "string",
"library_id": "string",
}

_SAMPLE_METADATA_DTYPES = {
"sample_id": "string",
}


_samples_df = pd.read_csv(config["samples"], dtype=_SAMPLE_SHEET_DTYPES)
_ALL_SAMPLES = _samples_df["sample_id"].unique().tolist()

_metadata_df = None
_meta_path = config.get("sample_metadata", "")
if _meta_path:
_metadata_df = pd.read_csv(_meta_path)
_metadata_df = pd.read_csv(_meta_path, dtype=_SAMPLE_METADATA_DTYPES)
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pd.read_csv(config["samples"], dtype=_SAMPLE_SHEET_DTYPES) will error when the samples CSV does not include library_id (dtype mappings require all specified columns to be present). This module only needs sample_id, and some fixtures (e.g. tests/data/qc/samples.csv) are headerless aside from sample_id, so this change is likely to break the QC module. Suggest intersecting the dtype map with the parsed header (or only specifying {"sample_id": "string"} here).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot apply changes based on this feedback. we should intersect dtype map with parsed header

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in commit 2d39d2d. Added a _read_csv_safe helper in both workflow/modules/qc/Snakefile and workflow/modules/postprocess/Snakefile that reads the CSV header first (nrows=0, zero data rows so essentially free), then filters the dtype map to only the columns actually present before doing the full read. This avoids the ValueError when library_id is absent while still preserving leading-zero IDs at parse time.

Comment on lines +51 to +67
_SAMPLE_SHEET_DTYPES = {
"sample_id": "string",
"library_id": "string",
}

_SAMPLE_METADATA_DTYPES = {
"sample_id": "string",
}


_samples_df = pd.read_csv(config["samples"], dtype=_SAMPLE_SHEET_DTYPES)
_ALL_SAMPLES = _samples_df["sample_id"].unique().tolist()

_metadata_df = None
_meta_path = config.get("sample_metadata", "")
if _meta_path:
_metadata_df = pd.read_csv(_meta_path)
_metadata_df = pd.read_csv(_meta_path, dtype=_SAMPLE_METADATA_DTYPES)
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pd.read_csv(config["samples"], dtype=_SAMPLE_SHEET_DTYPES) will raise if library_id is absent, but library_id is optional in the samples schema and this module only uses sample_id. To avoid breaking valid sample sheets, either only force sample_id to string here, or compute dtype by intersecting _SAMPLE_SHEET_DTYPES with the columns present in the CSV header before calling read_csv.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot apply changes based on this feedback we should intersect dtype map with parsed header

@tsackton tsackton merged commit a39d44b into main Apr 7, 2026
13 checks passed
@tsackton tsackton deleted the codex/allow-numeric-ids branch April 7, 2026 16:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants