DIBBs Text to Code


General disclaimer: This repository was created for use by CDC programs to collaborate on public health related projects in support of the CDC mission. GitHub is not hosted by the CDC, but is a third party website used by CDC and its partners to share information and collaborate on software. CDC use of GitHub does not imply an endorsement of any one particular service, product, or enterprise.


Overview

DIBBs Text to Code (TTC) is a CDC public health tool that maps nonstandard clinical text in eICR (Electronic Initial Case Report) documents to standardized medical codes — primarily LOINC and SNOMED CT — using vector embeddings and approximate nearest-neighbor search.

Public health reporting relies on eICR documents that often contain free-text lab names and results that vary across labs and EHR systems. TTC bridges that gap by finding the best-fit standardized code for each piece of clinical text and writing it back into the document.

How It Works

TTC has two sequential workflows:

1. Text-to-Code (TTC)

Given an eICR XML document and a corresponding Schematron validation report identifying relevant errors, TTC:

  1. Reads the Schematron report to identify which sections of the eICR contain errors that need standardized codes
  2. Parses the XML and extracts text candidates for each configured data field (e.g., lab test names) using XPath expressions
  3. Selects the best candidate text using priority-based evaluation criteria (e.g., prefers LOINC-sourced text over free text)
  4. Embeds the selected text as a vector using a fine-tuned intfloat/e5-large-v2 SentenceTransformer model
  5. Queries an OpenSearch KNN index to find the nearest-neighbor standardized codes
  6. Returns ranked TTCAugmentation objects containing the matched code, display name, and source location in the document
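Steps 4 and 5 amount to encoding the selected text as a vector and issuing an approximate-KNN query against OpenSearch. The sketch below shows the query-body shape only; the helper name, index name, and vector field name (`embedding`) are illustrative assumptions, not this repository's actual code:

```python
# Sketch of the KNN lookup TTC performs (step 5). The field name
# "embedding" and helper name are illustrative assumptions; the
# vector itself would come from the fine-tuned intfloat/e5-large-v2
# model (e5 models expect a "query: " prefix on search text).

def build_knn_query(vector: list[float], k: int = 5) -> dict:
    """Build an OpenSearch approximate-KNN query body."""
    return {
        "size": k,
        "query": {"knn": {"embedding": {"vector": vector, "k": k}}},
    }

# The surrounding workflow, roughly:
#   vector = model.encode(f"query: {text}").tolist()            # step 4
#   hits = client.search(index="codes", body=build_knn_query(vector))
```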

2. Augmentation

Given TTC results, the augmenter:

  1. Updates clinical document headers (ID, effective time, version number) to create a new derived document
  2. Preserves the original eICR as a relatedDocument reference
  3. Inserts an author entry identifying the TTC system at the clinical document level and for every updated observation
  4. Writes <translation> elements at each code location with the matched standardized codes
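Step 4 can be sketched with the standard library. The namespace is the HL7 CDA namespace used by eICR documents; the sample code and display values below are illustrative placeholders, not a real mapping:

```python
# Sketch of appending a <translation> under an existing <code> element.
# CDA documents use the urn:hl7-org:v3 namespace; the sample values
# below are illustrative, not an actual TTC match.
import xml.etree.ElementTree as ET

HL7 = "urn:hl7-org:v3"

def add_translation(code_el: ET.Element, code: str, display: str,
                    code_system: str = "2.16.840.1.113883.6.1") -> ET.Element:
    """Append a <translation> carrying a matched standardized code.

    The default codeSystem OID (2.16.840.1.113883.6.1) identifies LOINC.
    """
    return ET.SubElement(code_el, f"{{{HL7}}}translation", {
        "code": code,
        "codeSystem": code_system,
        "displayName": display,
    })

# Illustrative free-text <code> element that TTC matched to a code:
code_el = ET.Element(f"{{{HL7}}}code", {"displayName": "flu swab"})
add_translation(code_el, "1234-5", "Illustrative display name")
```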

Repository Structure

This is a uv workspace (Python) with a separate npm workspace (TypeScript/React frontend). All Python packages live under packages/; the frontend lives under frontend/.

| Package | Role |
| --- | --- |
| `shared-models` | Pydantic models shared across packages: `DataField`, `TTCAugmentation`, `TTCAugmenterInput` |
| `text-to-code` | Core TTC logic: XML parsing, candidate evaluation, embedding, and OpenSearch query building |
| `augmentation` | Writes TTC results back into eICR XML as `<translation>` elements |
| `text-to-code-lambda` | AWS Lambda handler for the TTC workflow, triggered by S3 → SQS events |
| `augmentation-lambda` | AWS Lambda handler for the augmentation workflow, triggered by SQS events |
| `utils` | Path, regex, and LOINC name parsing utilities |
| `data-curation` | Scripts for pulling terminology data from LOINC, SNOMED, UMLS, and HL7 APIs; generates training data |
| `model-tuning` | Fine-tunes SentenceTransformer models and builds HNSW indexes for OpenSearch |
| `api` | FastAPI service exposing `/api` endpoints; serves the built frontend in non-local environments |
| `frontend` | React 19 + TypeScript + Vite demo application for interacting with the API |

Architecture Diagram

              ┌─────────────────────────────────────────────────────┐
              │                   AWS Infrastructure                │
              │                                                     │
  eICR XML    │  SQS ──► text-to-code-lambda                        │
  (from S3)   │                    │                                │
              │         ┌──────────▼──────────┐                     │
              │         │    text-to-code     │                     │
              │         │  ┌───────────────┐  │                     │
              │         │  │ EicrProcessor │  │  XPath extraction   │
              │         │  │   Evaluator   │  │  Candidate selection│
              │         │  │   Embedder    │  │  Vector embedding   │
              │         │  │ QueryBuilder  │  │  KNN query          │
              │         │  └───────┬───────┘  │                     │
              │         └──────────┼──────────┘                     │
              │                    │                                │
              │         ┌──────────▼──────────┐                     │
              │         │     OpenSearch      │  KNN / HNSW index   │
              │         └──────────┬──────────┘                     │
              │                    │ TTCAugmentation results        │
              │  SQS ──► augmentation-lambda                        │
              │                    │                                │
              │         ┌──────────▼──────────┐                     │
              │         │    augmentation     │  XML modification   │
              │         └──────────┬──────────┘                     │
              └──────────────────┬─┴────────────────────────────────┘
                                 │
                      Augmented eICR XML (to S3)

A demo site (FastAPI + React frontend) is available for local testing of the API, though it is not currently under active development. In production, the two Lambda functions handle large-scale eICR processing.
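Both Lambda entry points follow the standard SQS event shape. As a rough sketch (the payload layout and names below are assumptions for illustration, not the repository's actual handlers):

```python
# Minimal shape of an SQS-triggered Lambda handler. The payload layout
# (an S3 object key per record) and the function body are illustrative
# assumptions, not this repository's actual handler code.
import json

def handler(event: dict, context: object = None) -> dict:
    """Process each SQS record; each body is assumed to name an S3 object."""
    processed = []
    for record in event.get("Records", []):
        body = json.loads(record["body"])
        # A real handler would fetch the eICR from S3 here, run the
        # TTC or augmentation workflow, and emit results downstream.
        processed.append(body.get("key"))
    return {"processed": processed}
```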

Key Design Patterns

  • Registry pattern: EICR_REGISTRY and EVALUATION_REGISTRY map DataField enum values to their XPath/evaluation configuration. Adding support for a new clinical field only requires adding a new entry to each registry.
  • Config-driven extraction: XPath expressions for text candidate extraction are defined per data field in subclasses of BaseLabField, keeping extraction logic declarative and field-specific.
  • Pluggable evaluation: BaseEvaluationCriteria subclasses define candidate selection rules (priority ordering, code system preference) independently from the extraction logic.
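A minimal sketch of what such a registry can look like; the enum member, XPath, and config fields below are illustrative assumptions rather than the repository's actual definitions:

```python
# Sketch of the registry pattern described above: map a DataField enum
# value to its extraction/evaluation configuration. The enum member,
# XPath, and config fields here are illustrative assumptions.
from dataclasses import dataclass
from enum import Enum

class DataField(Enum):
    LAB_TEST_NAME = "lab_test_name"

@dataclass
class FieldConfig:
    xpath: str          # where to find candidate text in the eICR
    prefer_coded: bool  # e.g. prefer LOINC-sourced text over free text

EICR_REGISTRY: dict[DataField, FieldConfig] = {
    DataField.LAB_TEST_NAME: FieldConfig(
        xpath=".//observation/code/originalText",
        prefer_coded=True,
    ),
}

# Supporting a new field is just a new enum member plus a registry entry.
```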

Getting Started

Pre-requisites

Setup

Requirements

After installing the above requirements, run `just bootstrap` to initialize the Python environment and install pre-commit:

```
just bootstrap
```

To start the demo site and API:

```
just dev up
```

The demo site can be accessed at http://localhost:8081

To run tests:

```
just test
```

Build and Verify

Use the following to build the Lambda image and verify that it's accepting requests (requires Docker Compose):

```
docker compose up -d
curl -XPOST "http://localhost:8080/2015-03-31/functions/function/invocations" -d '{"input": "test"}'
docker compose down
```

Quality Assurance

NOTE: By default, pre-commit hooks are installed to run linting and formatting checks on each commit. These hooks will attempt to automatically fix any issues encountered. To force a commit without running the pre-commit hooks, use the following command:

```
git commit --no-verify
```

Unit tests

The unit tests require access to a private Hugging Face model. To run them locally, create a Hugging Face access token with read permissions and export it in your shell config (e.g., ~/.zshrc or ~/.bashrc):

```
export HF_TOKEN="hf_your_token_here"
```

To run all the unit tests, use the following command:

```
pytest
```

To run a single unit test, use the following command:

```
pytest tests/unit/test_utils.py::test_function
```

To update snapshots:

```
pytest --snapshot-update
```

Type checks

To run type checks, use the following command:

```
ty check
```

To type check a specific file, use the following command:

```
ty check path/to/file.py
```

Linting

To run linting checks, use the following command:

```
ruff check
```

To lint a specific file, use the following command:

```
ruff check path/to/file.py
```

Formatting

To run code formatting, use the following command:

```
ruff format
```

To format a specific file, use the following command:

```
ruff format path/to/file.py
```

Releases

See the Releases page for details.

Standard Notices

Public Domain Standard Notice

This repository constitutes a work of the United States Government and is not subject to domestic copyright protection under 17 USC § 105. This repository is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication. All contributions to this repository will be released under the CC0 dedication. By submitting a pull request you are agreeing to comply with this waiver of copyright interest.

License Standard Notice

The repository utilizes code licensed under the terms of the Apache Software License and therefore is licensed under ASL v2 or later.

The source code in this repository is free: you can redistribute it and/or modify it under the terms of the Apache Software License version 2, or (at your option) any later version.

The source code in this repository is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Apache Software License for more details.

You should have received a copy of the Apache Software License along with this program. If not, see http://www.apache.org/licenses/LICENSE-2.0.html

Source code forked from other open source projects will inherit its original license.

Privacy Standard Notice

This repository contains only non-sensitive, publicly available data and information. All material and community participation is covered by the Disclaimer and Code of Conduct. For more information about CDC's privacy policy, please visit http://www.cdc.gov/other/privacy.html.

Contributing Standard Notice

Anyone is encouraged to contribute to the repository by forking and submitting a pull request. (If you are new to GitHub, you might start with a basic tutorial.) By contributing to this project, you grant a world-wide, royalty-free, perpetual, irrevocable, non-exclusive, transferable license to all users under the terms of the Apache Software License v2 or later.

All comments, messages, pull requests, and other submissions received through CDC including this GitHub page may be subject to applicable federal law, including but not limited to the Federal Records Act, and may be archived. Learn more at http://www.cdc.gov/other/privacy.html.

Records Management Standard Notice

This repository is not a source of government records, but is a copy to increase collaboration and collaborative potential. All government records will be published through the CDC web site.

Additional Standard Notices

Please refer to CDC's Template Repository for more information about contributing to this repository, public domain notices and disclaimers, and code of conduct.
