Add identifier partitioning tests for shared identifiers#692
Add identifier partitioning tests for shared identifiers#692
Conversation
SCR (Supplementary Concept Record) terms lack tree numbers, so they bypassed the existing D12.776/D05/D08 exclusion logic. This adds a SPARQL query that checks meshv:mappedTo and meshv:preferredMappedTo relationships to find SCR terms mapped to descriptors under excluded trees and marks them as EXCLUDE. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Protein-related MeSH terms excluded from chemicals (D12.776 Proteins, D05.500 Multiprotein Complexes, D05.875 Protein Aggregates, D08.811 Enzymes, D08.622 Enzyme Precursors, D08.244 Cytochromes) were previously dropped entirely. This adds them to the protein compendium instead, along with SCR_Chemical terms mapped to protein descriptor trees. Also adds scr_include_trees parameter to mesh.write_ids() (the inverse of scr_exclude_trees) and documents the MeSH tree assignments in both chemicals.py and protein.py with TODOs for unifying them. Some D05/D08 subtrees (Polymers, Smart Materials, Micelles, Coenzymes) remain in neither compendium — documented as a known gap. Co-Authored-By: Claude Opus 4.6 <[email protected]>
…ins-from-chemicals
Tests cover write_ids() parameter validation, SCR exclude/include tree filtering logic (mock-based), and get_scr_terms_mapped_to_trees() via an inline pyoxigraph store. Also documents pipeline-only test scenarios in tests/README.md. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Creates tests/pipeline/ with a session-scoped fixture that loads babel_downloads/MESH/mesh.nt and runs both chemicals.write_mesh_ids() and protein.write_mesh_ids(), then asserts four invariants: non-empty outputs, zero ID overlap (the core correctness property), and no D05/D08/D12.776 descriptor terms in the chemicals output. Updates tests/README.md to document the new test file and retire the Out of Scope stubs. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Replace the manual pre-population requirement with a `_download_or_skip` helper in `tests/pipeline/conftest.py`. The new layered fixture pattern splits each datasource into a download fixture (`mesh_nt`) and a processing fixture (`mesh_pipeline_outputs`) — pytest propagates skips automatically if the download fails. Update tests/README.md to document the new pattern and provide a template for adding future datasources. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…sion) Extend the self-downloading pipeline test framework to cover the seven compendia that call write_umls_ids() (chemicals, protein, anatomy, disease/phenotype, process/activity, taxon, gene). The new umls_rrf_files fixture checks for UMLS_API_KEY before attempting any download (since download_umls() calls exit(1) rather than raising on a missing key) and skips gracefully if the key is absent or the download fails. Nine tests: seven parametrized non-empty checks, one mutual-exclusivity assertion that no UMLS:CUI appears in more than one compendium, and one targeted test guarding against protein-tree IDs leaking into chemicals. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
The API key check was unconditional, so tests would skip even when MRCONSO.RRF and MRSTY.RRF were already present in babel_downloads/. Move the UMLS_API_KEY check inside the "files are missing" branch so that cached runs proceed without needing the key in the environment. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Replace the per-vocabulary boilerplate approach with a VOCABULARY_REGISTRY in conftest.py and a single test_vocabulary_partitioning.py that runs non-empty and mutual-exclusivity tests for every registered vocabulary. Changes: - conftest.py: add _read_ids helper; extend mesh_pipeline_outputs to all five MeSH compendia (anatomy, diseasephenotype, taxon added); add omim_mim2gene + omim_pipeline_outputs fixtures; add ubergraph_connection fixture shared by NCIT and GO; add ncit_pipeline_outputs and go_pipeline_outputs fixtures; add VOCABULARY_REGISTRY and vocab_outputs parametrized fixture using request.getfixturevalue() - test_vocabulary_partitioning.py: new file — 10 tests (5 vocabs × 2) automatically updated when VOCABULARY_REGISTRY grows - test_mesh_pipeline.py: strip to single MeSH-specific tree-exclusion test - test_umls_pipeline.py: strip to single UMLS-specific protein-tree test - tests/README.md: update Pipeline section and Future Plans To add a new vocabulary in future: add download + processing fixtures to conftest.py and one entry in VOCABULARY_REGISTRY. No new test file needed. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…erate
Processing fixtures previously wrote ID files to pytest's tmp_path_factory,
a throwaway directory recreated on every run. Large write_X_ids() calls
(UMLS reads 3+ GB of RRF files) would re-run needlessly on repeated runs.
Changes:
- tests/pipeline/conftest.py: add _intermediate_id_path() which resolves
{intermediate_directory}/{semantic_type}/ids/{vocab}, matching Snakemake's
exact paths so a prior full pipeline run can be reused directly. Add
_maybe_run() which calls write_X_ids() only when the output file is absent
or --regenerate is passed. Add regenerate session fixture. Replace
tmp_path_factory with regenerate in all five processing fixtures.
- tests/conftest.py: add --regenerate CLI option; update --pipeline help text.
- tests/README.md: add Caching subsection explaining the stable paths and
how to use --regenerate or manual deletion for clean re-runs.
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
There was a problem hiding this comment.
Pull request overview
Adds a reusable pipeline-test framework to verify that shared vocabularies (UMLS/MeSH/OMIM/NCIT/GO) are cleanly partitioned across compendia, and updates MeSH/UMLS handling to support those checks.
Changes:
- Introduces parametrized “vocabulary partitioning” pipeline tests (non-empty outputs + no ID overlaps) driven by a registry in
tests/pipeline/conftest.py. - Adds MeSH/UMLS-specific targeted pipeline assertions plus a
--regenerateflag for rebuilding cached intermediate ID files. - Extends MeSH handling to route protein-tree IDs (and SCR-mapped terms) appropriately, adds protein MeSH IDs to the Snakemake workflow/config, and hardens UMLS download failures by raising exceptions.
Reviewed changes
Copilot reviewed 14 out of 15 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/pipeline/conftest.py | Session-scoped pipeline fixtures, caching behavior, and VOCABULARY_REGISTRY driving parametrized tests |
| tests/pipeline/test_vocabulary_partitioning.py | Generic per-vocabulary non-empty + mutual-exclusivity checks |
| tests/pipeline/test_mesh_pipeline.py | MeSH-specific exclusion assertion (chemicals must exclude protein descriptor trees) |
| tests/pipeline/test_umls_pipeline.py | UMLS-specific overlap guard (chemicals vs protein semantic tree) |
| tests/pipeline/init.py | Declares the pipeline test package |
| tests/datahandlers/test_mesh.py | Unit tests for MeSH SCR filtering and new mapping query helper |
| tests/conftest.py | Adds --regenerate; updates --pipeline CLI help |
| tests/README.md | Documents pipeline caching/regeneration and new pipeline test suite structure |
| src/datahandlers/mesh.py | Adds SCR→descriptor-tree mapping query + new write_ids() options for SCR include/exclude |
| src/createcompendia/chemicals.py | Excludes SCR terms mapped under excluded protein/macromolecule/enzyme trees |
| src/createcompendia/protein.py | Adds write_mesh_ids() for protein MeSH trees + SCR inclusion |
| src/snakefiles/protein.snakefile | Adds Snakemake rule to produce protein MeSH IDs |
| config.yaml | Adds MESH to protein_ids so it’s included in protein compendium builds |
| src/datahandlers/umls.py | Replaces exit(1) with raised RuntimeError for download failures |
| CLAUDE.md | Adds guidance on running/debugging and using intermediate artifacts |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
|
|
||
| from src.babel_utils import make_local_name | ||
| from src.createcompendia import anatomy, chemicals, diseasephenotype, protein, taxon | ||
| from src.datahandlers.mesh import Mesh, pull_mesh |
| _maybe_run(p("chemicals"), lambda: chemicals.write_mesh_ids(p("chemicals")), regenerate) | ||
| _maybe_run(p("protein"), lambda: protein.write_mesh_ids(p("protein")), regenerate) | ||
| _maybe_run(p("anatomy"), lambda: anatomy.write_mesh_ids(p("anatomy")), regenerate) | ||
| _maybe_run(p("diseasephenotype"), lambda: diseasephenotype.write_mesh_ids(p("diseasephenotype")), regenerate) | ||
| _maybe_run(p("taxon"), lambda: taxon.write_mesh_ids(p("taxon")), regenerate) | ||
|
|
||
| m = Mesh() | ||
| excluded_tree_terms = set() | ||
| for tree in ["D05", "D08", "D12.776"]: | ||
| excluded_tree_terms.update(m.get_terms_in_tree(tree)) | ||
|
|
||
| return { |
src/datahandlers/mesh.py
Outdated
| for top_treenum in top_treenums: | ||
| s = f""" PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#> | ||
| PREFIX mesh: <http://id.nlm.nih.gov/mesh/> | ||
|
|
||
| SELECT DISTINCT ?term | ||
| WHERE {{ VALUES ?mappingPred {{ meshv:mappedTo meshv:preferredMappedTo }} | ||
| ?term ?mappingPred ?descriptor . | ||
| ?descriptor meshv:treeNumber ?treenum . | ||
| ?treenum meshv:parentTreeNumber* mesh:{top_treenum} | ||
| }} | ||
| ORDER BY ?term | ||
| """ | ||
| qres = self.m.query(s) | ||
| for row in list(qres): | ||
| iterm = str(row["term"]) | ||
| meshid = iterm[:-1].split("/")[-1] | ||
| terms.add(f"{MESH}:{meshid}") |
tests/README.md
Outdated
| | `pipeline` | Invokes Snakemake rules; downloads prerequisite data automatically | Yes | Minutes–hours | 3600 s | | ||
|
|
||
| ### Default behavior | ||
|
|
||
| - `pytest` alone: runs `unit` and `slow` tests; skips `network` and `pipeline` | ||
| - `pytest --network`: also runs `network` tests | ||
| - `pytest --pipeline`: also runs `pipeline` tests (ensure `babel_downloads/` exists first) | ||
| - `pytest --pipeline`: also runs `pipeline` tests (prerequisite data is downloaded automatically) | ||
| - `pytest --pipeline --regenerate`: re-runs `write_X_ids()` even if intermediate files already | ||
| exist (useful after changing compendium filtering logic; see **Caching** below) |
pytest-cov + pytest-xdist causes an execnet OSError in pytest_sessionfinish even when workers are not running pipeline tests. The new pytest_configure hook detects -n > 0 and sets no_cov = True automatically, printing a note so the user knows why coverage is absent. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
New unexpected duplicates still fail hard; known ones trigger xfail so the test suite passes while the underlying issues remain unresolved. Each entry is documented so it doubles as a bug inventory — remove an entry once the fix lands. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
The hardcoded _COMPENDIUM_TO_SNAKEMAKE_DIR dict in the test conftest is replaced by a compendium_directories key in config.yaml, making it the single authoritative source co-located with all other path config. Also renames the "processactivity" fixture key to "processactivitypathway" to match the actual module name, making it easier to trace from fixture to source code. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- mesh.py: combine N+1 SPARQL queries in get_scr_terms_mapped_to_trees()
into a single query using VALUES ?topTree { ... }
- umls.py: convert download_rxnorm() print+exit(1) to raise RuntimeError,
matching the pattern already used in download_umls(); also fix string
interpolation bug (${var} → {var}) in the error message
- pipeline/conftest.py: add _output_paths() helper that filters a vocab
outputs dict down to string-valued paths
- test_vocabulary_partitioning.py: use _output_paths() in both test
functions instead of duplicating the dict-comprehension
- CLAUDE.md: document raise-over-exit(1) convention
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- mesh.py: guard get_scr_terms_mapped_to_trees() against empty input list to avoid a malformed VALUES clause - tests/conftest.py: remove duplicate marker registrations from pytest_configure; pyproject.toml is the canonical source (and already has a comment saying conftest should not re-register them) - tests/conftest.py: improve --regenerate help text with typical use case - test_umls_pipeline.py: clarify that test_chemicals_excludes_protein_semantic_tree is a stricter sentinel (no KNOWN_DUPLICATES carve-out) for the specific chem/protein pair, not a redundant test - protein.py: document the SCR D05/D08 approximation trade-off — SCRs mapped to non-protein subtrees (Polymers, Coenzymes) will be classified as PROTEIN rather than falling into neither compendium - pipeline/conftest.py: document why the second Mesh() construction is necessary and that the cost is paid once per session Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Add `pythonpath = ["."]` to [tool.pytest.ini_options] in pyproject.toml so that `uv run pytest` works without the PYTHONPATH=. prefix. Remove the prefix from all pytest invocations in CLAUDE.md and tests/README.md. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
… local server babel_utils.py: fix timedelta.microseconds → total_seconds() in the throttle sleep. microseconds returns only the sub-second component, so any configured delta ≥ 1 s would sleep far too short. test_ThrottledRequester.py: replace mocky.io (unreliable third-party service) with an ephemeral local HTTPServer that runs in a daemon thread. _local_server(delay_ms=N) gives precise control over response latency, making both tests deterministic and offline. Both tests are re-marked unit (were network) and test_no_throttling's xfail is removed. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
There was a problem hiding this comment.
Pull request overview
This PR introduces a reusable pipeline-test framework to enforce cross-compendium identifier partitioning (no ID appears in more than one compendium for shared vocabularies), and updates MeSH handling so protein-related MeSH terms (including protein-mapped SCR terms) are routed to the protein compendium instead of leaking into chemicals.
Changes:
- Add parametrized
pipelinetests that (a) require non-empty per-compendium ID outputs and (b) enforce mutual exclusivity for shared vocabularies (MESH, UMLS, OMIM, NCIT, GO), with targeted MeSH/UMLS sentinel tests. - Add
protein.write_mesh_ids()+ Snakemake rule/config wiring for MeSH in the protein compendium; update MeSH SCR filtering viascr_exclude_trees/scr_include_trees. - Make unit tests and docs more hermetic/consistent (local HTTP server for throttling tests, pytest
pythonpathconfig,--regenerateoption for pipeline fixtures, improve UMLS download error handling).
Reviewed changes
Copilot reviewed 17 out of 18 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_ThrottledRequester.py | Replaces external network dependency with an in-process HTTP server for deterministic throttling tests. |
| tests/README.md | Updates test-running docs; adds pipeline caching/regeneration guidance and new pipeline test descriptions. |
| tests/pipeline/test_vocabulary_partitioning.py | New generic mutual-exclusivity + non-empty ID tests for all registered vocabularies. |
| tests/pipeline/test_umls_pipeline.py | Adds stricter targeted UMLS chem-vs-protein overlap sentinel test. |
| tests/pipeline/test_mesh_pipeline.py | Adds targeted MeSH exclusion-from-chemicals test for protein descriptor trees. |
| tests/pipeline/conftest.py | Adds session fixtures for downloading/prereqs + stable intermediate-path processing + vocabulary registry. |
| tests/pipeline/init.py | Declares pipeline test package. |
| tests/datahandlers/test_mesh.py | Adds unit tests for MeSH SCR filtering and SPARQL mapping logic using mocks + in-memory pyoxigraph. |
| tests/conftest.py | Adds --regenerate CLI flag for pipeline fixture recomputation; keeps mark-based skipping/timeouts. |
| src/snakefiles/protein.snakefile | Adds protein_mesh_ids rule to generate protein MeSH IDs in the pipeline. |
| src/datahandlers/umls.py | Replaces print + exit(1) with exceptions for missing API key / download failures. |
| src/datahandlers/mesh.py | Adds SCR-mapping query helper + new write_ids options to include/exclude SCR terms by mapped trees. |
| src/createcompendia/protein.py | Adds write_mesh_ids() to route protein-related MeSH (and SCR-mapped-to-protein) terms to protein compendium. |
| src/createcompendia/chemicals.py | Updates MeSH chemicals filtering and excludes SCR terms mapped to excluded trees. |
| src/babel_utils.py | Fixes throttling sleep calculation using timedelta.total_seconds(). |
| pyproject.toml | Sets pytest pythonpath = ["."] so tests don’t require PYTHONPATH=. in commands. |
| config.yaml | Adds compendium_directories mapping + adds MESH to protein_ids. |
| CLAUDE.md | Updates documented pytest commands and adds compendium directory mapping note. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| PYTHONPATH=. uv run pytest tests/pipeline/test_vocabulary_partitioning.py --pipeline --no-cov -v | ||
| Run a single vocabulary: | ||
| PYTHONPATH=. uv run pytest tests/pipeline/test_vocabulary_partitioning.py --pipeline --no-cov -v -k MESH |
There was a problem hiding this comment.
These usage examples still instruct running pytest with PYTHONPATH=. even though this PR sets pythonpath = ["."] in pyproject.toml and updates other docs to remove PYTHONPATH. To avoid conflicting guidance, update these docstring commands to match the new standard uv run pytest ... invocation (or explicitly explain when PYTHONPATH is needed).
| PYTHONPATH=. uv run pytest tests/pipeline/test_vocabulary_partitioning.py --pipeline --no-cov -v | |
| Run a single vocabulary: | |
| PYTHONPATH=. uv run pytest tests/pipeline/test_vocabulary_partitioning.py --pipeline --no-cov -v -k MESH | |
| uv run pytest tests/pipeline/test_vocabulary_partitioning.py --pipeline --no-cov -v | |
| Run a single vocabulary: | |
| uv run pytest tests/pipeline/test_vocabulary_partitioning.py --pipeline --no-cov -v -k MESH |
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
This PR adds a test framework that verifies no identifier appears in more than one compendium's output for vocabularies shared across semantic types (MESH, UMLS, OMIM, NCIT, GO). It also adds protein.write_mesh_ids() — a new function routing protein-tree MeSH terms to the protein compendium — and fixes
SCR (Supplementary Concept Record) filtering so protein-mapped SCR terms no longer leak into chemicals. The design is extensible: adding a new vocabulary takes 3 steps and zero changes to the test file itself.
WIP. Extends and replaces #687 and PR #690
TODO: