DIBBs Text to Code


General disclaimer: This repository was created for use by CDC programs to collaborate on public health related projects in support of the CDC mission. GitHub is not hosted by the CDC, but is a third party website used by CDC and its partners to share information and collaborate on software. CDC use of GitHub does not imply an endorsement of any one particular service, product, or enterprise.


Overview

DIBBs Text to Code (TTC) is a CDC public health tool that maps nonstandard clinical text in eICR (Electronic Initial Case Report) documents to standardized medical codes — primarily LOINC and SNOMED CT — using vector embeddings and approximate nearest-neighbor search.

Public health reporting relies on eICR documents that often contain free-text lab names and results that vary across labs and EHR systems. TTC bridges that gap by finding the best-fit standardized code for each piece of clinical text and writing it back into the document.

How It Works

TTC has two sequential workflows:

1. Text-to-Code (TTC)

Given an eICR XML document and a corresponding Schematron validation report identifying relevant errors, TTC:

  1. Reads the Schematron report to identify which sections of the eICR contain errors that need standardized codes
  2. Parses the XML and extracts text candidates for each configured data field (e.g., lab test names) using XPath expressions
  3. Selects the best candidate text using priority-based evaluation criteria (e.g., prefers LOINC-sourced text over free text)
  4. Embeds the selected text as a vector using a fine-tuned intfloat/e5-large-v2 SentenceTransformer model
  5. Queries an OpenSearch KNN index to find the nearest-neighbor standardized codes
  6. Returns ranked TTCAugmentation objects containing the matched code, display name, and source location in the document
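Steps 4 and 5 amount to encoding the selected text as a vector and issuing an approximate-KNN query against OpenSearch. The sketch below shows the query-body shape only; the helper name, index name, and vector field name (`embedding`) are illustrative assumptions, not this repository's actual code:

```python
# Sketch of the KNN lookup TTC performs (step 5). The field name
# "embedding" and helper name are illustrative assumptions; the
# vector itself would come from the fine-tuned intfloat/e5-large-v2
# model (e5 models expect a "query: " prefix on search text).

def build_knn_query(vector: list[float], k: int = 5) -> dict:
    """Build an OpenSearch approximate-KNN query body."""
    return {
        "size": k,
        "query": {"knn": {"embedding": {"vector": vector, "k": k}}},
    }

# The surrounding workflow, roughly:
#   vector = model.encode(f"query: {text}").tolist()            # step 4
#   hits = client.search(index="codes", body=build_knn_query(vector))
```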

2. Augmentation

Given TTC results, the augmenter:

  1. Updates clinical document headers (ID, effective time, version number) to create a new derived document
  2. Preserves the original eICR as a relatedDocument reference
  3. Inserts an author entry identifying the TTC system at the clinical document level and for every updated observation
  4. Writes <translation> elements at each code location with the matched standardized codes
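Step 4 can be sketched with the standard library. The namespace is the HL7 CDA namespace used by eICR documents; the sample code and display values below are illustrative placeholders, not a real mapping:

```python
# Sketch of appending a <translation> under an existing <code> element.
# CDA documents use the urn:hl7-org:v3 namespace; the sample values
# below are illustrative, not an actual TTC match.
import xml.etree.ElementTree as ET

HL7 = "urn:hl7-org:v3"

def add_translation(code_el: ET.Element, code: str, display: str,
                    code_system: str = "2.16.840.1.113883.6.1") -> ET.Element:
    """Append a <translation> carrying a matched standardized code.

    The default codeSystem OID (2.16.840.1.113883.6.1) identifies LOINC.
    """
    return ET.SubElement(code_el, f"{{{HL7}}}translation", {
        "code": code,
        "codeSystem": code_system,
        "displayName": display,
    })

# Illustrative free-text <code> element that TTC matched to a code:
code_el = ET.Element(f"{{{HL7}}}code", {"displayName": "flu swab"})
add_translation(code_el, "1234-5", "Illustrative display name")
```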

Repository Structure

This is a uv workspace (Python) with a separate npm workspace (TypeScript/React frontend). All Python packages live under packages/; the frontend lives under frontend/.

| Package | Role |
| --- | --- |
| `shared-models` | Pydantic models shared across packages: `DataField`, `TTCAugmentation`, `TTCAugmenterInput` |
| `text-to-code` | Core TTC logic: XML parsing, candidate evaluation, embedding, and OpenSearch query building |
| `augmentation` | Writes TTC results back into eICR XML as `<translation>` elements |
| `text-to-code-lambda` | AWS Lambda handler for the TTC workflow, triggered by S3 → SQS events |
| `augmentation-lambda` | AWS Lambda handler for the augmentation workflow, triggered by SQS events |
| `utils` | Path, regex, and LOINC name parsing utilities |
| `data-curation` | Scripts for pulling terminology data from LOINC, SNOMED, UMLS, and HL7 APIs; generates training data |
| `model-tuning` | Fine-tunes SentenceTransformer models and builds HNSW indexes for OpenSearch |
| `api` | FastAPI service exposing `/api` endpoints; serves the built frontend in non-local environments |
| `frontend` | React 19 + TypeScript + Vite demo application for interacting with the API |

Architecture Diagram

              ┌─────────────────────────────────────────────────────┐
              │                   AWS Infrastructure                │
              │                                                     │
  eICR XML    │  SQS ──► text-to-code-lambda                        │
  (from S3)   │                    │                                │
              │         ┌──────────▼──────────┐                     │
              │         │    text-to-code     │                     │
              │         │  ┌───────────────┐  │                     │
              │         │  │ EicrProcessor │  │  XPath extraction   │
              │         │  │   Evaluator   │  │  Candidate selection│
              │         │  │   Embedder    │  │  Vector embedding   │
              │         │  │ QueryBuilder  │  │  KNN query          │
              │         │  └───────┬───────┘  │                     │
              │         └──────────┼──────────┘                     │
              │                    │                                │
              │         ┌──────────▼──────────┐                     │
              │         │     OpenSearch      │  KNN / HNSW index   │
              │         └──────────┬──────────┘                     │
              │                    │ TTCAugmentation results        │
              │  SQS ──► augmentation-lambda                        │
              │                    │                                │
              │         ┌──────────▼──────────┐                     │
              │         │    augmentation     │  XML modification   │
              │         └──────────┬──────────┘                     │
              └──────────────────┬─┴────────────────────────────────┘
                                 │
                      Augmented eICR XML (to S3)

A demo site (FastAPI + React frontend) is available for local testing of the API, though it is not currently under active development. In production, the two Lambda functions handle large-scale eICR processing.
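Both Lambda entry points follow the standard SQS event shape. As a rough sketch (the payload layout and names below are assumptions for illustration, not the repository's actual handlers):

```python
# Minimal shape of an SQS-triggered Lambda handler. The payload layout
# (an S3 object key per record) and the function body are illustrative
# assumptions, not this repository's actual handler code.
import json

def handler(event: dict, context: object = None) -> dict:
    """Process each SQS record; each body is assumed to name an S3 object."""
    processed = []
    for record in event.get("Records", []):
        body = json.loads(record["body"])
        # A real handler would fetch the eICR from S3 here, run the
        # TTC or augmentation workflow, and emit results downstream.
        processed.append(body.get("key"))
    return {"processed": processed}
```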

Key Design Patterns

  • Registry pattern: EICR_REGISTRY and EVALUATION_REGISTRY map DataField enum values to their XPath/evaluation configuration. Adding support for a new clinical field only requires adding a new entry to each registry.
  • Config-driven extraction: XPath expressions for text candidate extraction are defined per data field in subclasses of BaseLabField, keeping extraction logic declarative and field-specific.
  • Pluggable evaluation: BaseEvaluationCriteria subclasses define candidate selection rules (priority ordering, code system preference) independently from the extraction logic.
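A minimal sketch of what such a registry can look like; the enum member, XPath, and config fields below are illustrative assumptions rather than the repository's actual definitions:

```python
# Sketch of the registry pattern described above: map a DataField enum
# value to its extraction/evaluation configuration. The enum member,
# XPath, and config fields here are illustrative assumptions.
from dataclasses import dataclass
from enum import Enum

class DataField(Enum):
    LAB_TEST_NAME = "lab_test_name"

@dataclass
class FieldConfig:
    xpath: str          # where to find candidate text in the eICR
    prefer_coded: bool  # e.g. prefer LOINC-sourced text over free text

EICR_REGISTRY: dict[DataField, FieldConfig] = {
    DataField.LAB_TEST_NAME: FieldConfig(
        xpath=".//observation/code/originalText",
        prefer_coded=True,
    ),
}

# Supporting a new field is just a new enum member plus a registry entry.
```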

Getting Started

Pre-requisites

Setup

Requirements

After installing the above requirements, run `just bootstrap` to initialize the Python environment and install pre-commit:

```
just bootstrap
```

To start the demo site and API:

```
just dev up
```

The demo site can be accessed at http://localhost:8081

To run tests:

```
just test
```

Build and Verify

Use the following to build the Lambda image and verify that it's accepting requests (requires Docker Compose):

```
docker compose up -d
curl -XPOST "http://localhost:8080/2015-03-31/functions/function/invocations" -d '{"input": "test"}'
docker compose down
```

Quality Assurance

NOTE: By default, pre-commit hooks are installed to run linting and formatting checks on each commit. These hooks will attempt to automatically fix any issues encountered. To force a commit without running the pre-commit hooks, use the following command:

```
git commit --no-verify
```

Unit tests

The unit tests require access to a private Hugging Face model. To run them locally, create a Hugging Face access token with read permissions and export it in your shell config (e.g., ~/.zshrc or ~/.bashrc):

```
export HF_TOKEN="hf_your_token_here"
```

To run all the unit tests, use the following command:

```
pytest
```

To run a single unit test, use the following command:

```
pytest tests/unit/test_utils.py::test_function
```

To update snapshots:

```
pytest --snapshot-update
```

Type checks

To run type checks, use the following command:

```
ty check
```

To type check a specific file, use the following command:

```
ty check path/to/file.py
```

Linting

To run linting checks, use the following command:

```
ruff check
```

To lint a specific file, use the following command:

```
ruff check path/to/file.py
```

Formatting

To run code formatting, use the following command:

```
ruff format
```

To format a specific file, use the following command:

```
ruff format path/to/file.py
```

Releases

See the Releases page for details.

Standard Notices

Public Domain Standard Notice

This repository constitutes a work of the United States Government and is not subject to domestic copyright protection under 17 USC § 105. This repository is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication. All contributions to this repository will be released under the CC0 dedication. By submitting a pull request you are agreeing to comply with this waiver of copyright interest.

License Standard Notice

The repository utilizes code licensed under the terms of the Apache Software License and therefore is licensed under ASL v2 or later.

The source code in this repository is free: you can redistribute it and/or modify it under the terms of the Apache Software License version 2, or (at your option) any later version.

The source code in this repository is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Apache Software License for more details.

You should have received a copy of the Apache Software License along with this program. If not, see http://www.apache.org/licenses/LICENSE-2.0.html

Source code forked from other open source projects will inherit its original license.

Privacy Standard Notice

This repository contains only non-sensitive, publicly available data and information. All material and community participation is covered by the Disclaimer and Code of Conduct. For more information about CDC's privacy policy, please visit http://www.cdc.gov/other/privacy.html.

Contributing Standard Notice

Anyone is encouraged to contribute to the repository by forking and submitting a pull request. (If you are new to GitHub, you might start with a basic tutorial.) By contributing to this project, you grant a world-wide, royalty-free, perpetual, irrevocable, non-exclusive, transferable license to all users under the terms of the Apache Software License v2 or later.

All comments, messages, pull requests, and other submissions received through CDC including this GitHub page may be subject to applicable federal law, including but not limited to the Federal Records Act, and may be archived. Learn more at http://www.cdc.gov/other/privacy.html.

Records Management Standard Notice

This repository is not a source of government records, but is a copy to increase collaboration and collaborative potential. All government records will be published through the CDC web site.

Additional Standard Notices

Please refer to CDC's Template Repository for more information about contributing to this repository, public domain notices and disclaimers, and code of conduct.
