
Eliminate duplicate UMLS type warnings#565

Merged
gaurav merged 13 commits into master from eliminate-duplicate-umls-type-warnings
Mar 30, 2026

Collaborator

@gaurav gaurav commented Sep 9, 2025

This PR fixes duplicate log/report entries in leftover_umls.py. Because the same CURIE can appear multiple times in MRCONSO, warnings and report lines for NO_UMLS_TYPE and MULTIPLE_UMLS_TYPES were being emitted once per MRCONSO row rather than once per CURIE. The fix tracks seen CURIEs in sets and skips repeated log/report writes. It also migrates all logging.xxx calls to a module-level logger = get_logger(__name__).

Also includes some unrelated uv.lock version changes and some Snakemake file reformatting that really should have been fixed elsewhere >.<

This is because the same CURIE can have multiple entries in MRCONSO. We
now keep track of the CURIEs we have seen, and only print a warning the
first time we encounter each CURIE.
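
A minimal sketch of the per-CURIE deduplication described above, not the actual code in leftover_umls.py. The function name `collect_type_problems`, the `(curie, umls_types)` row shape, and the tab-separated report format are assumptions made for illustration; only the NO_UMLS_TYPE and MULTIPLE_UMLS_TYPES labels come from this PR.

```python
import logging

# Stand-in logger; the PR itself uses the repo's get_logger() helper.
logger = logging.getLogger(__name__)


def collect_type_problems(rows):
    """Return report lines, at most one per CURIE per problem kind.

    `rows` is a hypothetical iterable of (curie, umls_types) pairs, one pair
    per MRCONSO row, so the same CURIE may appear many times.
    """
    seen_no_type = set()
    seen_multiple_types = set()
    report_lines = []

    for curie, umls_types in rows:
        if len(umls_types) == 0:
            # Warn and report only the first time this CURIE is seen.
            if curie not in seen_no_type:
                seen_no_type.add(curie)
                logger.warning("No UMLS type found for %s", curie)
                report_lines.append(f"{curie}\tNO_UMLS_TYPE")
        elif len(umls_types) > 1:
            if curie not in seen_multiple_types:
                seen_multiple_types.add(curie)
                types = "|".join(sorted(umls_types))
                logger.warning("Multiple UMLS types for %s: %s", curie, types)
                report_lines.append(f"{curie}\tMULTIPLE_UMLS_TYPES\t{types}")

    return report_lines
```

Keeping one set per problem kind means a CURIE is reported at most once for each kind, however many MRCONSO rows mention it.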
Base automatically changed from babel-v1.13 to master September 24, 2025 17:21
@gaurav gaurav requested a review from Copilot March 1, 2026 08:23
@gaurav gaurav moved this from Backlog to Needs review in Babel sprints Mar 1, 2026
Contributor

Copilot AI left a comment


Pull request overview

Reduces noisy duplicate logging when processing UMLS MRCONSO entries by deduplicating “no type” / “multiple types” messages per CURIE, and standardizes this module’s logging to use the repo’s get_logger() helper.

Changes:

  • Introduces per-CURIE tracking sets to avoid emitting repeated “no UMLS type” / “multiple UMLS types” messages for duplicate MRCONSO rows.
  • Replaces logging.* calls with a module-level logger = get_logger(__name__).
  • Adjusts log levels/messages and end-of-run summaries to report unique CURIE counts.
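
As a rough illustration of the last two points, the sketch below uses a module-level logger and an end-of-run summary that counts unique CURIEs. The `get_logger` defined here is a hypothetical stand-in that wraps the standard library's `logging.getLogger`; the repo's real helper may behave differently. The `summarize` function and its message wording are likewise assumptions.

```python
import logging


def get_logger(name):
    """Hypothetical stand-in for the repo's get_logger() helper."""
    return logging.getLogger(name)


# One logger per module, created at import time, instead of calling
# logging.info() / logging.warning() on the root logger.
logger = get_logger(__name__)


def summarize(seen_no_type, seen_multiple_types):
    """Log and return a summary based on unique CURIEs, not MRCONSO rows."""
    message = (
        f"Unique CURIEs without a UMLS type: {len(seen_no_type)}; "
        f"with multiple UMLS types: {len(seen_multiple_types)}"
    )
    logger.info(message)
    return message
```

Because the counts come from sets, duplicate MRCONSO rows for the same CURIE cannot inflate the summary.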
Comments suppressed due to low confidence (1)

src/createcompendia/leftover_umls.py:129

  • umls_type_to_biolink_type is defined inside the MRCONSO line loop, so it gets re-created for every MRCONSO row. Since it only depends on biolink_toolkit, it can be defined once (outside the loop) to reduce overhead and keep the loop body simpler.
                    logger.debug(f"UMLS ID {umls_id} has already been included in this compendium, skipping.")
                    continue

                # The STR value should be the label.
                label = x[14]
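
The suggestion above could look roughly like the following sketch, which builds the mapping function once and reuses it for every row. Everything except the name `umls_type_to_biolink_type` is hypothetical: `toolkit_lookup` stands in for whatever the real function asks of biolink_toolkit, and the `(umls_id, umls_type)` row shape is invented for the example.

```python
def make_type_mapper(toolkit_lookup):
    """Build the mapping function once, closing over the lookup it needs."""

    def umls_type_to_biolink_type(umls_type):
        return toolkit_lookup(umls_type)

    return umls_type_to_biolink_type


def map_rows(rows, toolkit_lookup):
    # Defined a single time, before the loop, rather than on every row.
    umls_type_to_biolink_type = make_type_mapper(toolkit_lookup)

    results = []
    for umls_id, umls_type in rows:
        results.append((umls_id, umls_type_to_biolink_type(umls_type)))
    return results
```

The behavior is unchanged; the only difference is that the function object is created once per run instead of once per MRCONSO row.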



Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.



Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
gaurav and others added 5 commits March 3, 2026 03:40
….info call

Converts the remaining `logging.info()` call (missed in the original PR) to
`logger.info()` and removes the now-unused `import logging`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
gaurav and others added 2 commits March 30, 2026 18:14
…er_umls

MRCONSO can have many rows per CUI. CURIEs skipped due to type resolution
failure (no type or multiple types) were not being short-circuited on
subsequent rows, causing repeated biolink_toolkit lookups. Add an early
continue once a CURIE has already been evaluated and skipped.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
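
The early continue described in this commit might be sketched as follows. This is an illustration under assumptions, not the merged code: `resolve_types` and `lookup_types` are hypothetical names, and `lookup_types` stands in for the biolink_toolkit-backed type resolution.

```python
def resolve_types(curies, lookup_types):
    """Resolve each CURIE at most once, even if it appears on many rows.

    `lookup_types` is a hypothetical callable returning the set of candidate
    types for a CURIE; it represents the expensive lookup we want to avoid
    repeating.
    """
    resolved = {}
    skipped = set()

    for curie in curies:
        # Short-circuit: this CURIE was already evaluated on an earlier row,
        # whether it was accepted or skipped.
        if curie in skipped or curie in resolved:
            continue

        types = lookup_types(curie)
        if len(types) != 1:
            # No type or multiple types: remember the failure so later rows
            # for the same CURIE do not trigger another lookup.
            skipped.add(curie)
            continue

        resolved[curie] = next(iter(types))

    return resolved, skipped
```

Recording failures as well as successes is what prevents the repeated lookups; without the `skipped` set, every additional row for an unresolvable CURIE would repeat the same work.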
@gaurav gaurav merged commit 4c03ed6 into master Mar 30, 2026
5 checks passed
@github-project-automation github-project-automation bot moved this from Needs review to Done in Babel sprints Mar 30, 2026
@gaurav gaurav deleted the eliminate-duplicate-umls-type-warnings branch March 30, 2026 22:22

Labels

None yet

Projects

Status: Done


2 participants