## General disclaimer

This repository was created for use by CDC programs to collaborate on public health related projects in support of the CDC mission. GitHub is not hosted by the CDC, but is a third party website used by CDC and its partners to share information and collaborate on software. CDC use of GitHub does not imply an endorsement of any one particular service, product, or enterprise.
- Open Practices
- Rules of Behavior
- Thanks and Acknowledgements
- Disclaimer
- Contribution Notice
- Code of Conduct
DIBBs Text to Code (TTC) is a CDC public health tool that maps nonstandard clinical text in eICR (Electronic Initial Case Report) documents to standardized medical codes — primarily LOINC and SNOMED CT — using vector embeddings and approximate nearest-neighbor search.
Public health reporting relies on eICR documents that often contain free-text lab names and results that vary across labs and EHR systems. TTC bridges that gap by finding the best-fit standardized code for each piece of clinical text and writing it back into the document.
TTC has two sequential workflows:
1. Text-to-Code (TTC)
Given an eICR XML document and a corresponding Schematron validation report identifying relevant errors, TTC:
- Reads the Schematron report to identify which sections of the eICR contain errors that need standardized codes
- Parses the XML and extracts text candidates for each configured data field (e.g., lab test names) using XPath expressions
- Selects the best candidate text using priority-based evaluation criteria (e.g., prefers LOINC-sourced text over free text)
- Embeds the selected text as a vector using a fine-tuned `intfloat/e5-large-v2` SentenceTransformer model
- Queries an OpenSearch KNN index to find the nearest-neighbor standardized codes
- Returns ranked `TTCAugmentation` objects containing the matched code, display name, and source location in the document
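The query step above can be sketched as follows. This is a minimal illustration, not the repository's actual API: the function name `build_knn_query`, the index field name `embedding`, and the default `k` are assumptions; only the OpenSearch k-NN query shape is standard.

```python
# Illustrative sketch of building an OpenSearch approximate k-NN query
# from a text embedding. Names are hypothetical, not the repo's real API.

def build_knn_query(vector: list[float], k: int = 5, field: str = "embedding") -> dict:
    """Return an OpenSearch k-NN query body for a precomputed embedding vector."""
    return {
        "size": k,
        "query": {"knn": {field: {"vector": vector, "k": k}}},
    }

# In the real workflow the vector would come from the fine-tuned model,
# roughly (not executed here; the checkpoint is private):
#   model = SentenceTransformer("intfloat/e5-large-v2")
#   vector = model.encode("query: Hemoglobin A1c").tolist()
# Note: e5 models expect a "query: " / "passage: " prefix on input text.
query = build_knn_query([0.1, 0.2, 0.3], k=3)
```

The returned dictionary can be passed directly as the request body of an OpenSearch `_search` call against an index with a k-NN vector field.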
2. Augmentation
Given TTC results, the augmenter:
- Updates clinical document headers (ID, effective time, version number) to create a new derived document
- Preserves the original eICR as a `relatedDocument` reference
- Inserts an author entry identifying the TTC system at the clinical document level and for every updated observation
- Writes `<translation>` elements at each code location with the matched standardized codes
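A minimal sketch of the `<translation>` write-back, assuming the standard HL7 CDA namespace (`urn:hl7-org:v3`) and the LOINC codeSystem OID; the helper name `add_translation` and the example observation are illustrative, not the augmenter's real interface:

```python
import xml.etree.ElementTree as ET

# HL7 v3 / CDA default namespace used by eICR documents.
NS = "urn:hl7-org:v3"
ET.register_namespace("", NS)

def add_translation(code_el: ET.Element, code: str, code_system: str, display: str) -> ET.Element:
    """Append a <translation> child carrying a matched standardized code."""
    t = ET.SubElement(code_el, f"{{{NS}}}translation")
    t.set("code", code)
    t.set("codeSystem", code_system)
    t.set("displayName", display)
    return t

# Hypothetical observation whose <code> only carries free text.
doc = ET.fromstring(
    f'<observation xmlns="{NS}"><code displayName="hgb a1c blood"/></observation>'
)
code_el = doc.find(f"{{{NS}}}code")
# 2.16.840.1.113883.6.1 is the LOINC codeSystem OID.
add_translation(code_el, "4548-4", "2.16.840.1.113883.6.1",
                "Hemoglobin A1c/Hemoglobin.total in Blood")
xml_out = ET.tostring(doc, encoding="unicode")
```

The actual augmenter also rewrites document headers and author entries, which this sketch omits.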
This is a uv workspace (Python) with a separate npm workspace (TypeScript/React frontend). All Python packages live under `packages/`; the frontend lives under `frontend/`.
| Package | Role |
|---|---|
| `shared-models` | Pydantic models shared across packages: `DataField`, `TTCAugmentation`, `TTCAugmenterInput` |
| `text-to-code` | Core TTC logic: XML parsing, candidate evaluation, embedding, and OpenSearch query building |
| `augmentation` | Writes TTC results back into eICR XML as `<translation>` elements |
| `text-to-code-lambda` | AWS Lambda handler for the TTC workflow, triggered by S3 → SQS events |
| `augmentation-lambda` | AWS Lambda handler for the augmentation workflow, triggered by SQS events |
| `utils` | Path, regex, and LOINC name parsing utilities |
| `data-curation` | Scripts for pulling terminology data from LOINC, SNOMED, UMLS, and HL7 APIs; generates training data |
| `model-tuning` | Fine-tunes SentenceTransformer models and builds HNSW indexes for OpenSearch |
| `api` | FastAPI service exposing `/api` endpoints; serves the built frontend in non-local environments |
| `frontend` | React 19 + TypeScript + Vite demo application for interacting with the API |
```text
          ┌─────────────────────────────────────────────────────┐
          │                 AWS Infrastructure                  │
          │                                                     │
eICR XML  │  SQS ──► text-to-code-lambda                        │
(from S3) │              │                                      │
          │   ┌──────────▼──────────┐                           │
          │   │    text-to-code     │                           │
          │   │  ┌───────────────┐  │                           │
          │   │  │ EicrProcessor │  │ XPath extraction          │
          │   │  │ Evaluator     │  │ Candidate selection       │
          │   │  │ Embedder      │  │ Vector embedding          │
          │   │  │ QueryBuilder  │  │ KNN query                 │
          │   │  └───────┬───────┘  │                           │
          │   └──────────┼──────────┘                           │
          │              │                                      │
          │   ┌──────────▼──────────┐                           │
          │   │     OpenSearch      │ KNN / HNSW index          │
          │   └──────────┬──────────┘                           │
          │              │ TTCAugmentation results              │
          │  SQS ──► augmentation-lambda                        │
          │              │                                      │
          │   ┌──────────▼──────────┐                           │
          │   │    augmentation     │ XML modification          │
          │   └──────────┬──────────┘                           │
          └──────────────┼─────────────────────────────────────┘
                         │
                         ▼
            Augmented eICR XML (to S3)
```
A demo site (FastAPI + React frontend) is available for local testing of the API, though it is not currently under active development. In production, the two Lambda functions handle large-scale eICR processing.
- **Registry pattern:** `EICR_REGISTRY` and `EVALUATION_REGISTRY` map `DataField` enum values to their XPath/evaluation configuration. Adding support for a new clinical field only requires adding a new entry to each registry.
- **Config-driven extraction:** XPath expressions for text candidate extraction are defined per data field in subclasses of `BaseLabField`, keeping extraction logic declarative and field-specific.
- **Pluggable evaluation:** `BaseEvaluationCriteria` subclasses define candidate selection rules (priority ordering, code system preference) independently from the extraction logic.
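The registry pattern can be illustrated with a small sketch. The `DataField` members, the `FieldConfig` dataclass, and the XPaths here are hypothetical stand-ins for the real registry entries:

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical reconstruction of the registry idea; the real DataField
# values and XPath configs live in the shared-models and text-to-code packages.

class DataField(Enum):
    LAB_TEST_NAME = "lab_test_name"
    LAB_RESULT = "lab_result"

@dataclass(frozen=True)
class FieldConfig:
    xpaths: tuple[str, ...]  # where to look for candidate text in the eICR

EICR_REGISTRY: dict[DataField, FieldConfig] = {
    DataField.LAB_TEST_NAME: FieldConfig(
        xpaths=(".//hl7:observation/hl7:code/@displayName",),
    ),
}

def register(field: DataField, config: FieldConfig) -> None:
    """Supporting a new clinical field is just a new registry entry."""
    EICR_REGISTRY[field] = config

register(DataField.LAB_RESULT, FieldConfig(xpaths=(".//hl7:value/@displayName",)))
```

Because extraction is driven entirely by the registry contents, no extraction code changes when a field is added.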
- Python 3.11 or higher
- Docker
- Docker Compose (optional)
- just - command runner
- uv - Python package and environment manager
- pre-commit - Git hook manager
After installing the above requirements, run `just bootstrap` to initialize the Python environment and install pre-commit:

```shell
just bootstrap
```

To start the demo site and API:

```shell
just dev up
```

The demo site can be accessed at http://localhost:8081.
To run tests:

```shell
just test
```

To build the Lambda image and verify that it's accepting requests (requires Docker Compose):

```shell
docker compose up -d
curl -XPOST "http://localhost:8080/2015-03-31/functions/function/invocations" -d '{"input": "test"}'
docker compose down
```

NOTE: By default, pre-commit hooks are installed to run linting and formatting checks on each commit. These hooks will attempt to automatically fix any issues encountered. To force a commit without running the pre-commit hooks, use the following command:
```shell
git commit --no-verify
```

The unit tests require access to a private Hugging Face model. To run them locally, create a Hugging Face access token with read permissions and export it in your shell config (e.g., `~/.zshrc` or `~/.bashrc`):

```shell
export HF_TOKEN="hf_your_token_here"
```

To run all the unit tests:

```shell
pytest
```

To run a single unit test:

```shell
pytest tests/unit/test_utils.py::test_function
```

To update snapshots:

```shell
pytest --snapshot-update
```

To run type checks:

```shell
ty check
```

To type check a specific file:

```shell
ty check path/to/file.py
```

To run linting checks:

```shell
ruff check
```

To lint a specific file:

```shell
ruff check path/to/file.py
```

To run code formatting:

```shell
ruff format
```

To format a specific file:

```shell
ruff format path/to/file.py
```

See the Releases page for details.
This repository constitutes a work of the United States Government and is not subject to domestic copyright protection under 17 USC § 105. This repository is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication. All contributions to this repository will be released under the CC0 dedication. By submitting a pull request you are agreeing to comply with this waiver of copyright interest.
The repository utilizes code licensed under the terms of the Apache Software License and therefore is licensed under ASL v2 or later.
The source code in this repository is free: you can redistribute it and/or modify it under the terms of the Apache Software License version 2, or (at your option) any later version.

The source code in this repository is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Apache Software License for more details.
You should have received a copy of the Apache Software License along with this program. If not, see http://www.apache.org/licenses/LICENSE-2.0.html
Source code forked from other open source projects inherits the license of the original project.
This repository contains only non-sensitive, publicly available data and information. All material and community participation is covered by the Disclaimer and Code of Conduct. For more information about CDC's privacy policy, please visit http://www.cdc.gov/other/privacy.html.
Anyone is encouraged to contribute to the repository by forking and submitting a pull request. (If you are new to GitHub, you might start with a basic tutorial.) By contributing to this project, you grant a world-wide, royalty-free, perpetual, irrevocable, non-exclusive, transferable license to all users under the terms of the Apache Software License v2 or later.
All comments, messages, pull requests, and other submissions received through CDC including this GitHub page may be subject to applicable federal law, including but not limited to the Federal Records Act, and may be archived. Learn more at http://www.cdc.gov/other/privacy.html.
This repository is not a source of government records, but is a copy to increase collaboration and collaborative potential. All government records will be published through the CDC web site.
Please refer to CDC's Template Repository for more information about contributing to this repository, public domain notices and disclaimers, and code of conduct.