Skip to content

Add identifier partitioning tests for shared identifiers#692

Open
gaurav wants to merge 34 commits intomasterfrom
add-pipeline-tests-for-shared-identifiers
Open

Add identifier partitioning tests for shared identifiers#692
gaurav wants to merge 34 commits intomasterfrom
add-pipeline-tests-for-shared-identifiers

Conversation

@gaurav
Copy link
Copy Markdown
Collaborator

@gaurav gaurav commented Mar 12, 2026

This PR adds a test framework that verifies no identifier appears in more than one compendium's output for vocabularies shared across semantic types (MESH, UMLS, OMIM, NCIT, GO). It also adds protein.write_mesh_ids() — a new function routing protein-tree MeSH terms to the protein compendium — and fixes
SCR (Supplementary Concept Record) filtering so protein-mapped SCR terms no longer leak into chemicals. The design is extensible: adding a new vocabulary takes 3 steps and zero changes to the test file itself.

WIP. Extends and replaces #687 and PR #690

TODO:

  • {'UMLS:C5443441': ['chemicals', 'anatomy'], 'UMLS:C5443442': ['chemicals', 'anatomy']}
  • {'MESH:D022041': ['protein', 'anatomy'], 'MESH:D035321': ['protein', 'anatomy'], 'MESH:D006570': ['protein', 'anatomy'], 'MESH:D009707': ['protein', 'anatomy'], 'MESH:D064448': ['protein', 'anatomy'], 'MESH:D035341': ['protein', 'anatomy'], 'MESH:D045524': ['protein', 'anatomy'], 'MESH:D007106': ['protein', 'anatomy'], 'MESH:D000091083': ['chemicals', 'anatomy'], 'MESH:D000067816': ['protein', 'anatomy'], 'MESH:D014688': ['chemicals', 'anatomy'], 'MESH:D000089804': ['protein', 'anatomy'], 'MESH:D002843': ['protein', 'anatomy'], 'MESH:D017358': ['protein', 'anatomy'], 'MESH:D008894': ['protein', 'anatomy'], 'MESH:D000961': ['protein', 'anatomy'], 'MESH:D009360': ['anatomy', 'diseasephenotype'], 'MESH:D000072717': ['anatomy', 'diseasephenotype'], 'MESH:D018404': ['anatomy', 'diseasephenotype'], 'MESH:D065309': ['anatomy', 'diseasephenotype'], 'MESH:D003809': ['anatomy', 'diseasephenotype'], 'MESH:D008467': ['anatomy', 'diseasephenotype'], 'MESH:D012303': ['anatomy', 'diseasephenotype'], 'MESH:D017439': ['anatomy', 'diseasephenotype'], 'MESH:D000153': ['anatomy', 'diseasephenotype'], 'MESH:D014097': ['anatomy', 'diseasephenotype'], 'MESH:D010677': ['anatomy', 'diseasephenotype'], 'MESH:D000072662': ['anatomy', 'diseasephenotype'], 'MESH:D003750': ['anatomy', 'diseasephenotype'], 'MESH:D008551': ['anatomy', 'diseasephenotype'], 'MESH:D048629': ['anatomy', 'diseasephenotype'], 'MESH:D002921': ['anatomy', 'diseasephenotype'], 'MESH:D007627': ['anatomy', 'diseasephenotype'], 'MESH:D036226': ['anatomy', 'taxon'], 'MESH:D013172': ['anatomy', 'taxon'], 'MESH:D052940': ['anatomy', 'taxon'], 'MESH:D033761': ['anatomy', 'taxon'], 'MESH:D053058': ['anatomy', 'taxon'], 'MESH:D002523': ['anatomy', 'taxon'], 'MESH:D038821': ['anatomy', 'taxon'], 'MESH:D033661': ['anatomy', 'taxon'], 'MESH:D013171': ['anatomy', 'taxon'], 'MESH:D034101': ['anatomy', 'taxon'], 'MESH:D013104': ['anatomy', 'taxon'], 'MESH:D052939': ['anatomy', 'taxon'], 'MESH:D013170': ['anatomy', 'taxon']}

gaurav and others added 16 commits March 8, 2026 00:18
SCR (Supplementary Concept Record) terms lack tree numbers, so they
bypassed the existing D12.776/D05/D08 exclusion logic. This adds a
SPARQL query that checks meshv:mappedTo and meshv:preferredMappedTo
relationships to find SCR terms mapped to descriptors under excluded
trees and marks them as EXCLUDE.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Protein-related MeSH terms excluded from chemicals (D12.776 Proteins,
D05.500 Multiprotein Complexes, D05.875 Protein Aggregates, D08.811
Enzymes, D08.622 Enzyme Precursors, D08.244 Cytochromes) were previously
dropped entirely. This adds them to the protein compendium instead, along
with SCR_Chemical terms mapped to protein descriptor trees.

Also adds scr_include_trees parameter to mesh.write_ids() (the inverse
of scr_exclude_trees) and documents the MeSH tree assignments in both
chemicals.py and protein.py with TODOs for unifying them.

Some D05/D08 subtrees (Polymers, Smart Materials, Micelles, Coenzymes)
remain in neither compendium — documented as a known gap.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Tests cover write_ids() parameter validation, SCR exclude/include tree
filtering logic (mock-based), and get_scr_terms_mapped_to_trees() via
an inline pyoxigraph store. Also documents pipeline-only test scenarios
in tests/README.md.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Creates tests/pipeline/ with a session-scoped fixture that loads
babel_downloads/MESH/mesh.nt and runs both chemicals.write_mesh_ids()
and protein.write_mesh_ids(), then asserts four invariants: non-empty
outputs, zero ID overlap (the core correctness property), and no
D05/D08/D12.776 descriptor terms in the chemicals output. Updates
tests/README.md to document the new test file and retire the Out of
Scope stubs.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Replace the manual pre-population requirement with a `_download_or_skip`
helper in `tests/pipeline/conftest.py`. The new layered fixture pattern
splits each datasource into a download fixture (`mesh_nt`) and a processing
fixture (`mesh_pipeline_outputs`) — pytest propagates skips automatically
if the download fails. Update tests/README.md to document the new pattern
and provide a template for adding future datasources.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…sion)

Extend the self-downloading pipeline test framework to cover the seven
compendia that call write_umls_ids() (chemicals, protein, anatomy,
disease/phenotype, process/activity, taxon, gene). The new umls_rrf_files
fixture checks for UMLS_API_KEY before attempting any download (since
download_umls() calls exit(1) rather than raising on a missing key) and
skips gracefully if the key is absent or the download fails. Nine tests:
seven parametrized non-empty checks, one mutual-exclusivity assertion that
no UMLS:CUI appears in more than one compendium, and one targeted test
guarding against protein-tree IDs leaking into chemicals.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
The API key check was unconditional, so tests would skip even when
MRCONSO.RRF and MRSTY.RRF were already present in babel_downloads/.
Move the UMLS_API_KEY check inside the "files are missing" branch so
that cached runs proceed without needing the key in the environment.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Replace the per-vocabulary boilerplate approach with a VOCABULARY_REGISTRY
in conftest.py and a single test_vocabulary_partitioning.py that runs
non-empty and mutual-exclusivity tests for every registered vocabulary.

Changes:
- conftest.py: add _read_ids helper; extend mesh_pipeline_outputs to all
  five MeSH compendia (anatomy, diseasephenotype, taxon added); add
  omim_mim2gene + omim_pipeline_outputs fixtures; add ubergraph_connection
  fixture shared by NCIT and GO; add ncit_pipeline_outputs and
  go_pipeline_outputs fixtures; add VOCABULARY_REGISTRY and vocab_outputs
  parametrized fixture using request.getfixturevalue()
- test_vocabulary_partitioning.py: new file — 10 tests (5 vocabs × 2)
  automatically updated when VOCABULARY_REGISTRY grows
- test_mesh_pipeline.py: strip to single MeSH-specific tree-exclusion test
- test_umls_pipeline.py: strip to single UMLS-specific protein-tree test
- tests/README.md: update Pipeline section and Future Plans

To add a new vocabulary in future: add download + processing fixtures to
conftest.py and one entry in VOCABULARY_REGISTRY. No new test file needed.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@gaurav gaurav changed the base branch from master to get-test-suite-working-again March 12, 2026 22:34
…erate

Processing fixtures previously wrote ID files to pytest's tmp_path_factory,
a throwaway directory recreated on every run.  Large write_X_ids() calls
(UMLS reads 3+ GB of RRF files) would re-run needlessly on repeated runs.

Changes:
- tests/pipeline/conftest.py: add _intermediate_id_path() which resolves
  {intermediate_directory}/{semantic_type}/ids/{vocab}, matching Snakemake's
  exact paths so a prior full pipeline run can be reused directly.  Add
  _maybe_run() which calls write_X_ids() only when the output file is absent
  or --regenerate is passed.  Add regenerate session fixture.  Replace
  tmp_path_factory with regenerate in all five processing fixtures.
- tests/conftest.py: add --regenerate CLI option; update --pipeline help text.
- tests/README.md: add Caching subsection explaining the stable paths and
  how to use --regenerate or manual deletion for clean re-runs.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@gaurav gaurav marked this pull request as ready for review March 13, 2026 00:48
@gaurav gaurav requested a review from Copilot March 13, 2026 00:48
@gaurav gaurav moved this from Backlog to In progress in Babel sprints Mar 13, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a reusable pipeline-test framework to verify that shared vocabularies (UMLS/MeSH/OMIM/NCIT/GO) are cleanly partitioned across compendia, and updates MeSH/UMLS handling to support those checks.

Changes:

  • Introduces parametrized “vocabulary partitioning” pipeline tests (non-empty outputs + no ID overlaps) driven by a registry in tests/pipeline/conftest.py.
  • Adds MeSH/UMLS-specific targeted pipeline assertions plus a --regenerate flag for rebuilding cached intermediate ID files.
  • Extends MeSH handling to route protein-tree IDs (and SCR-mapped terms) appropriately, adds protein MeSH IDs to the Snakemake workflow/config, and hardens UMLS download failures by raising exceptions.

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/pipeline/conftest.py Session-scoped pipeline fixtures, caching behavior, and VOCABULARY_REGISTRY driving parametrized tests
tests/pipeline/test_vocabulary_partitioning.py Generic per-vocabulary non-empty + mutual-exclusivity checks
tests/pipeline/test_mesh_pipeline.py MeSH-specific exclusion assertion (chemicals must exclude protein descriptor trees)
tests/pipeline/test_umls_pipeline.py UMLS-specific overlap guard (chemicals vs protein semantic tree)
tests/pipeline/init.py Declares the pipeline test package
tests/datahandlers/test_mesh.py Unit tests for MeSH SCR filtering and new mapping query helper
tests/conftest.py Adds --regenerate; updates --pipeline CLI help
tests/README.md Documents pipeline caching/regeneration and new pipeline test suite structure
src/datahandlers/mesh.py Adds SCR→descriptor-tree mapping query + new write_ids() options for SCR include/exclude
src/createcompendia/chemicals.py Excludes SCR terms mapped under excluded protein/macromolecule/enzyme trees
src/createcompendia/protein.py Adds write_mesh_ids() for protein MeSH trees + SCR inclusion
src/snakefiles/protein.snakefile Adds Snakemake rule to produce protein MeSH IDs
config.yaml Adds MESH to protein_ids so it’s included in protein compendium builds
src/datahandlers/umls.py Replaces exit(1) with raised RuntimeError for download failures
CLAUDE.md Adds guidance on running/debugging and using intermediate artifacts

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.


from src.babel_utils import make_local_name
from src.createcompendia import anatomy, chemicals, diseasephenotype, protein, taxon
from src.datahandlers.mesh import Mesh, pull_mesh
Comment on lines +134 to +145
_maybe_run(p("chemicals"), lambda: chemicals.write_mesh_ids(p("chemicals")), regenerate)
_maybe_run(p("protein"), lambda: protein.write_mesh_ids(p("protein")), regenerate)
_maybe_run(p("anatomy"), lambda: anatomy.write_mesh_ids(p("anatomy")), regenerate)
_maybe_run(p("diseasephenotype"), lambda: diseasephenotype.write_mesh_ids(p("diseasephenotype")), regenerate)
_maybe_run(p("taxon"), lambda: taxon.write_mesh_ids(p("taxon")), regenerate)

m = Mesh()
excluded_tree_terms = set()
for tree in ["D05", "D08", "D12.776"]:
excluded_tree_terms.update(m.get_terms_in_tree(tree))

return {
Comment on lines +55 to +71
for top_treenum in top_treenums:
s = f""" PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
PREFIX mesh: <http://id.nlm.nih.gov/mesh/>

SELECT DISTINCT ?term
WHERE {{ VALUES ?mappingPred {{ meshv:mappedTo meshv:preferredMappedTo }}
?term ?mappingPred ?descriptor .
?descriptor meshv:treeNumber ?treenum .
?treenum meshv:parentTreeNumber* mesh:{top_treenum}
}}
ORDER BY ?term
"""
qres = self.m.query(s)
for row in list(qres):
iterm = str(row["term"])
meshid = iterm[:-1].split("/")[-1]
terms.add(f"{MESH}:{meshid}")
tests/README.md Outdated
Comment on lines +23 to +31
| `pipeline` | Invokes Snakemake rules; downloads prerequisite data automatically | Yes | Minutes–hours | 3600 s |

### Default behavior

- `pytest` alone: runs `unit` and `slow` tests; skips `network` and `pipeline`
- `pytest --network`: also runs `network` tests
- `pytest --pipeline`: also runs `pipeline` tests (ensure `babel_downloads/` exists first)
- `pytest --pipeline`: also runs `pipeline` tests (prerequisite data is downloaded automatically)
- `pytest --pipeline --regenerate`: re-runs `write_X_ids()` even if intermediate files already
exist (useful after changing compendium filtering logic; see **Caching** below)
pytest-cov + pytest-xdist causes an execnet OSError in pytest_sessionfinish
even when workers are not running pipeline tests. The new pytest_configure
hook detects -n > 0 and sets no_cov = True automatically, printing a note
so the user knows why coverage is absent.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@gaurav gaurav mentioned this pull request Mar 15, 2026
Base automatically changed from get-test-suite-working-again to master March 16, 2026 02:45
gaurav and others added 9 commits March 21, 2026 03:24
New unexpected duplicates still fail hard; known ones trigger xfail so the
test suite passes while the underlying issues remain unresolved. Each entry
is documented so it doubles as a bug inventory — remove an entry once the
fix lands.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
The hardcoded _COMPENDIUM_TO_SNAKEMAKE_DIR dict in the test conftest is
replaced by a compendium_directories key in config.yaml, making it the
single authoritative source co-located with all other path config.

Also renames the "processactivity" fixture key to "processactivitypathway"
to match the actual module name, making it easier to trace from fixture
to source code.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- mesh.py: combine N+1 SPARQL queries in get_scr_terms_mapped_to_trees()
  into a single query using VALUES ?topTree { ... }
- umls.py: convert download_rxnorm() print+exit(1) to raise RuntimeError,
  matching the pattern already used in download_umls(); also fix string
  interpolation bug (${var} → {var}) in the error message
- pipeline/conftest.py: add _output_paths() helper that filters a vocab
  outputs dict down to string-valued paths
- test_vocabulary_partitioning.py: use _output_paths() in both test
  functions instead of duplicating the dict-comprehension
- CLAUDE.md: document raise-over-exit(1) convention

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- mesh.py: guard get_scr_terms_mapped_to_trees() against empty input list
  to avoid a malformed VALUES clause
- tests/conftest.py: remove duplicate marker registrations from
  pytest_configure; pyproject.toml is the canonical source (and already
  has a comment saying conftest should not re-register them)
- tests/conftest.py: improve --regenerate help text with typical use case
- test_umls_pipeline.py: clarify that test_chemicals_excludes_protein_semantic_tree
  is a stricter sentinel (no KNOWN_DUPLICATES carve-out) for the specific
  chem/protein pair, not a redundant test
- protein.py: document the SCR D05/D08 approximation trade-off — SCRs
  mapped to non-protein subtrees (Polymers, Coenzymes) will be classified
  as PROTEIN rather than falling into neither compendium
- pipeline/conftest.py: document why the second Mesh() construction is
  necessary and that the cost is paid once per session

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Add `pythonpath = ["."]` to [tool.pytest.ini_options] in pyproject.toml
so that `uv run pytest` works without the PYTHONPATH=. prefix. Remove
the prefix from all pytest invocations in CLAUDE.md and tests/README.md.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
… local server

babel_utils.py: fix timedelta.microseconds → total_seconds() in the
throttle sleep. microseconds returns only the sub-second component, so
any configured delta ≥ 1 s would sleep far too short.

test_ThrottledRequester.py: replace mocky.io (unreliable third-party
service) with an ephemeral local HTTPServer that runs in a daemon thread.
_local_server(delay_ms=N) gives precise control over response latency,
making both tests deterministic and offline. Both tests are re-marked
unit (were network) and test_no_throttling's xfail is removed.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a reusable pipeline-test framework to enforce cross-compendium identifier partitioning (no ID appears in more than one compendium for shared vocabularies), and updates MeSH handling so protein-related MeSH terms (including protein-mapped SCR terms) are routed to the protein compendium instead of leaking into chemicals.

Changes:

  • Add parametrized pipeline tests that (a) require non-empty per-compendium ID outputs and (b) enforce mutual exclusivity for shared vocabularies (MESH, UMLS, OMIM, NCIT, GO), with targeted MeSH/UMLS sentinel tests.
  • Add protein.write_mesh_ids() + Snakemake rule/config wiring for MeSH in the protein compendium; update MeSH SCR filtering via scr_exclude_trees / scr_include_trees.
  • Make unit tests and docs more hermetic/consistent (local HTTP server for throttling tests, pytest pythonpath config, --regenerate option for pipeline fixtures, improve UMLS download error handling).

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/test_ThrottledRequester.py Replaces external network dependency with an in-process HTTP server for deterministic throttling tests.
tests/README.md Updates test-running docs; adds pipeline caching/regeneration guidance and new pipeline test descriptions.
tests/pipeline/test_vocabulary_partitioning.py New generic mutual-exclusivity + non-empty ID tests for all registered vocabularies.
tests/pipeline/test_umls_pipeline.py Adds stricter targeted UMLS chem-vs-protein overlap sentinel test.
tests/pipeline/test_mesh_pipeline.py Adds targeted MeSH exclusion-from-chemicals test for protein descriptor trees.
tests/pipeline/conftest.py Adds session fixtures for downloading/prereqs + stable intermediate-path processing + vocabulary registry.
tests/pipeline/init.py Declares pipeline test package.
tests/datahandlers/test_mesh.py Adds unit tests for MeSH SCR filtering and SPARQL mapping logic using mocks + in-memory pyoxigraph.
tests/conftest.py Adds --regenerate CLI flag for pipeline fixture recomputation; keeps mark-based skipping/timeouts.
src/snakefiles/protein.snakefile Adds protein_mesh_ids rule to generate protein MeSH IDs in the pipeline.
src/datahandlers/umls.py Replaces print + exit(1) with exceptions for missing API key / download failures.
src/datahandlers/mesh.py Adds SCR-mapping query helper + new write_ids options to include/exclude SCR terms by mapped trees.
src/createcompendia/protein.py Adds write_mesh_ids() to route protein-related MeSH (and SCR-mapped-to-protein) terms to protein compendium.
src/createcompendia/chemicals.py Updates MeSH chemicals filtering and excludes SCR terms mapped to excluded trees.
src/babel_utils.py Fixes throttling sleep calculation using timedelta.total_seconds().
pyproject.toml Sets pytest pythonpath = ["."] so tests don’t require PYTHONPATH=. in commands.
config.yaml Adds compendium_directories mapping + adds MESH to protein_ids.
CLAUDE.md Updates documented pytest commands and adds compendium directory mapping note.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +20 to +22
PYTHONPATH=. uv run pytest tests/pipeline/test_vocabulary_partitioning.py --pipeline --no-cov -v
Run a single vocabulary:
PYTHONPATH=. uv run pytest tests/pipeline/test_vocabulary_partitioning.py --pipeline --no-cov -v -k MESH
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These usage examples still instruct running pytest with PYTHONPATH=. even though this PR sets pythonpath = ["."] in pyproject.toml and updates other docs to remove PYTHONPATH. To avoid conflicting guidance, update these docstring commands to match the new standard uv run pytest ... invocation (or explicitly explain when PYTHONPATH is needed).

Suggested change
PYTHONPATH=. uv run pytest tests/pipeline/test_vocabulary_partitioning.py --pipeline --no-cov -v
Run a single vocabulary:
PYTHONPATH=. uv run pytest tests/pipeline/test_vocabulary_partitioning.py --pipeline --no-cov -v -k MESH
uv run pytest tests/pipeline/test_vocabulary_partitioning.py --pipeline --no-cov -v
Run a single vocabulary:
uv run pytest tests/pipeline/test_vocabulary_partitioning.py --pipeline --no-cov -v -k MESH

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

2 participants