CausalKnowledgeTrace: Interactive Literature-Based Causal Structure Mapping, Graph Generation, Visualization, and Refinement

Overview

CausalKnowledgeTrace (CKT) helps researchers build causal knowledge graphs from published biomedical literature. The system automatically extracts and organizes causal relationships between biological concepts (genes, proteins, diseases, drugs, etc.) to support hypothesis generation and study design in observational research.

Users specify an exposure and outcome of interest. They can constrain the search by publication year, causal predicate type, and minimum number of supporting articles per relationship. CKT constructs initial partially directed acyclic graphs (PDAGs) representing causal structures between biomedical concepts. Users then edit these graphs interactively to remove unnecessary nodes and edges. CKT can export graphs and evidence from the literature for downstream analysis.

Quick Links

Test/Development Site: http://10.234.117.212:3837/ (Requires UNM HSC network connection or VPN)
Production Site: https://habanero.health.unm.edu/CKT/
User Manual: CKT Usage Guide

Data Source

CKT queries SemMedDB, a database containing subject-predicate-object triples (e.g., "Smoking CAUSES Lung Cancer") extracted from 37+ million PubMed titles and abstracts using the SemRep natural language processing system. Each relationship is linked to its supporting literature, allowing users to trace claims back to primary evidence.
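As a rough illustration of the triple structure described above, the sketch below models each extracted statement as a subject-predicate-object record linked to a PubMed ID, then groups the records so every relationship can be traced back to its distinct supporting articles. The class name, field names, and PMID values are hypothetical placeholders; they are not the SemMedDB schema or CKT's own code.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Predication(NamedTuple):
    """One extracted statement; field names are illustrative, not SemMedDB columns."""
    subject: str
    predicate: str
    obj: str
    pmid: int  # PubMed ID of the supporting article (placeholder values below)


Triple = Tuple[str, str, str]


def group_evidence(predications: Iterable[Predication]) -> Dict[Triple, Set[int]]:
    """Map each (subject, predicate, object) triple to its distinct supporting PMIDs."""
    evidence: Dict[Triple, Set[int]] = defaultdict(set)
    for p in predications:
        evidence[(p.subject, p.predicate, p.obj)].add(p.pmid)
    return dict(evidence)


# Made-up rows for demonstration only; the PMIDs are not real citations.
EXAMPLE = [
    Predication("Smoking", "CAUSES", "Lung Cancer", 1),
    Predication("Smoking", "CAUSES", "Lung Cancer", 2),
    Predication("Smoking", "CAUSES", "Lung Cancer", 2),  # same article extracted twice
    Predication("Smoking", "CAUSES", "Hypertension", 3),
]

if __name__ == "__main__":
    for triple, pmids in group_evidence(EXAMPLE).items():
        print(triple, "supported by", len(pmids), "article(s):", sorted(pmids))
```

Storing PMIDs in a set means a sentence extracted twice from the same abstract counts as one supporting article, which is the quantity a minimum-support threshold would most plausibly use.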

System Architecture

CKT is a Django web application with the following components:

Django web framework: Provides the web interface, user authentication, and request handling
Python graph engine: Implements graph construction and causal structure learning algorithms, and exports results for statistical analysis
PostgreSQL database: Stores SemMedDB data and application state
Interactive visualization: Browser-based DAG exploration with zoom, pan, and node interaction capabilities

Workflow

Query Configuration: Users specify an exposure and outcome of interest using UMLS (Unified Medical Language System) identifiers or free text search. Configurable parameters include publication year range, causal predicate types (CAUSES, INHIBITS, STIMULATES, PREVENTS, DISRUPTS), minimum article support thresholds, and degrees of separation (currently limited to 3 degrees between exposure and outcome).
Graph Construction: CKT builds initial partially directed acyclic graphs (PDAGs) representing potential causal pathways connecting the exposure to the outcome. Edge directions are inferred from temporal precedence, biological plausibility, and semantic predicate types extracted from the literature.
Interactive Refinement: Users iteratively remove spurious associations, biologically implausible relationships, or irrelevant variables through the web interface. This step incorporates domain expertise to improve graph quality.
Export and Documentation: Refined graphs and supporting evidence (PubMed IDs, semantic predicates, citation counts) are exported for downstream causal analysis and documentation.
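The query and graph construction steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not CKT's implementation: the record layout, the function names `filter_edges` and `within_degrees`, and the example data are all hypothetical, and "degrees of separation" is assumed here to mean the number of edges on a directed path from exposure to outcome.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple

Record = Tuple[str, str, str, int, int]  # (subject, predicate, object, pmid, year)
Edge = Tuple[str, str]


def filter_edges(records: Iterable[Record], predicates: Set[str],
                 year_min: int, year_max: int, min_articles: int) -> Set[Edge]:
    """Keep subject->object edges whose triple has enough distinct supporting articles."""
    support: Dict[Tuple[str, str, str], Set[int]] = defaultdict(set)
    for subj, pred, obj, pmid, year in records:
        if pred in predicates and year_min <= year <= year_max:
            support[(subj, pred, obj)].add(pmid)
    return {(s, o) for (s, _p, o), pmids in support.items() if len(pmids) >= min_articles}


def _bfs(start: str, neighbours: Dict[str, Set[str]]) -> Dict[str, int]:
    """Breadth-first distances (in edges) from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def within_degrees(edges: Set[Edge], exposure: str, outcome: str,
                   max_degrees: int = 3) -> Set[Edge]:
    """Keep edges lying on some directed exposure->outcome walk of at most max_degrees edges."""
    fwd, bwd = defaultdict(set), defaultdict(set)
    for u, v in edges:
        fwd[u].add(v)
        bwd[v].add(u)
    from_exposure = _bfs(exposure, fwd)  # distance exposure -> node
    to_outcome = _bfs(outcome, bwd)      # distance node -> outcome
    return {
        (u, v) for u, v in edges
        if u in from_exposure and v in to_outcome
        and from_exposure[u] + 1 + to_outcome[v] <= max_degrees
    }


# Made-up records for demonstration only.
RECORDS = [
    ("Exposure", "CAUSES", "Mediator", 1, 2010),
    ("Exposure", "CAUSES", "Mediator", 2, 2015),
    ("Mediator", "CAUSES", "Outcome", 3, 2012),
    ("Mediator", "CAUSES", "Outcome", 4, 2018),
    ("Exposure", "CAUSES", "Bystander", 5, 2011),  # only one supporting article
    ("Exposure", "CAUSES", "Outcome", 8, 1990),    # outside the year range used below
    ("Exposure", "CAUSES", "Outcome", 9, 1991),
]

if __name__ == "__main__":
    edges = filter_edges(RECORDS, {"CAUSES"}, 2000, 2020, min_articles=2)
    print(sorted(within_degrees(edges, "Exposure", "Outcome", max_degrees=3)))
```

Raising `min_articles` or narrowing the year range shrinks the edge set before any path search runs, which is why these filters are applied first.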

Advanced Causal Analysis Module

In development

The causal_analysis/ module performs systematic causal variable classification. Tools in this module:

Classify variables as confounders, mediators, or colliders relative to the exposure-outcome relationship
Apply graph traversal algorithms to retain variables within the causal vicinity while removing extraneous nodes
Compute minimal sufficient adjustment sets satisfying the back-door criterion for unbiased causal effect estimation
Identify adjustment strategies that block confounding paths while avoiding M-bias and butterfly bias
Suggest, given a user-supplied list of measured variables, the best-matching minimal sufficient adjustment sets, which may include proxy confounders
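To make the first item in the list above concrete, here is a minimal sketch of role classification on a small directed acyclic graph. It uses deliberately simplified graph-based definitions (a confounder is a common cause of exposure and outcome, a mediator lies on a directed path from exposure to outcome, a collider is a common effect of both) and is not the code in causal_analysis/. All function names and the example edges are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def descendants(edges: Iterable[Edge], start: str) -> Set[str]:
    """All nodes reachable from start by following edge directions."""
    children: Dict[str, Set[str]] = {}
    for u, v in edges:
        children.setdefault(u, set()).add(v)
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def ancestors(edges: Iterable[Edge], start: str) -> Set[str]:
    """All nodes from which start is reachable (descendants in the reversed graph)."""
    return descendants([(v, u) for u, v in edges], start)


def classify(edges: Iterable[Edge], exposure: str, outcome: str) -> Dict[str, str]:
    """Assign each remaining node a simplified role relative to exposure and outcome."""
    edge_list: List[Edge] = list(edges)
    anc_x = ancestors(edge_list, exposure)
    anc_y = ancestors(edge_list, outcome)
    # Causes of the outcome that do not act only through the exposure.
    anc_y_not_via_x = ancestors([e for e in edge_list if exposure not in e], outcome)
    des_x = descendants(edge_list, exposure)
    des_y = descendants(edge_list, outcome)
    nodes = {n for e in edge_list for n in e} - {exposure, outcome}
    roles: Dict[str, str] = {}
    for n in sorted(nodes):
        if n in anc_x and n in anc_y_not_via_x:
            roles[n] = "confounder"
        elif n in des_x and n in anc_y:
            roles[n] = "mediator"
        elif n in des_x and n in des_y:
            roles[n] = "collider"
        else:
            roles[n] = "other"
    return roles


# Toy graph: C is a common cause, M a mediator, K a common effect,
# and I affects the outcome only through the exposure.
EXAMPLE_EDGES = [
    ("C", "X"), ("C", "Y"),
    ("X", "M"), ("M", "Y"),
    ("X", "K"), ("Y", "K"),
    ("I", "X"),
]

if __name__ == "__main__":
    print(classify(EXAMPLE_EDGES, "X", "Y"))
```

The node `I` is labeled "other" rather than "confounder" because its only route to the outcome passes through the exposure, so adjusting for it is not needed to block a back-door path.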

Current Limitations and Development Roadmap

The advanced analysis tools currently function on small example graphs but encounter computational challenges on literature-derived graphs due to:

Cyclic relationships: Extracted literature relationships may contain feedback loops that violate the acyclic assumption required for standard causal inference algorithms. Biological systems often exhibit genuine bidirectional causation (e.g., inflammation causes oxidative stress, which further exacerbates inflammation).
Markov equivalence classes: Many edge orientations in literature-derived graphs are ambiguous, resulting in equivalence classes of graphs that encode identical conditional independence relationships but different causal interpretations. The number of candidate orientations grows exponentially (up to 2^k for k ambiguous edges), making exhaustive computation intractable for large graphs.
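Both limitations above can be seen on a toy graph. The sketch below is an illustration only: the function names are hypothetical, and it checks acyclicity alone, whereas a full treatment of Markov equivalence would also have to preserve v-structures. It enumerates all 2^k orientations of k undirected edges and keeps those that produce no directed cycle.

```python
from itertools import product
from typing import List, Sequence, Tuple

Edge = Tuple[str, str]


def is_acyclic(edges: Sequence[Edge]) -> bool:
    """Kahn's algorithm: the graph is acyclic iff every node can be removed in topological order."""
    nodes = {n for e in edges for n in e}
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for u, v in edges:
        children[u].append(v)
        indegree[v] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return removed == len(nodes)


def acyclic_orientations(directed: Sequence[Edge],
                         undirected: Sequence[Edge]) -> List[List[Edge]]:
    """Try every one of the 2^k orientations of the undirected edges; keep the acyclic ones."""
    results: List[List[Edge]] = []
    for flips in product((False, True), repeat=len(undirected)):
        oriented = [(v, u) if flip else (u, v)
                    for (u, v), flip in zip(undirected, flips)]
        candidate = list(directed) + oriented
        if is_acyclic(candidate):
            results.append(candidate)
    return results


if __name__ == "__main__":
    # The feedback loop from the text is rejected by the acyclicity check.
    loop = [("inflammation", "oxidative stress"), ("oxidative stress", "inflammation")]
    print("loop is acyclic:", is_acyclic(loop))

    # One fixed edge plus two ambiguous edges gives 2^2 = 4 candidates.
    valid = acyclic_orientations([("A", "B")], [("B", "C"), ("C", "A")])
    print(len(valid), "of 4 orientations are acyclic")
```

Because the loop in `acyclic_orientations` runs 2^k times, a graph with only 30 ambiguous edges already requires over a billion acyclicity checks, which is the intractability the section describes.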

Planned solutions include:

Cycle detection and resolution: Implementing algorithms to identify feedback loops and apply domain-guided strategies for cycle breaking or collapsing cyclic components into latent variables
Constraint-based orientation: Using temporal information, intervention evidence, and biological knowledge to reduce the equivalence class search space
Approximate inference methods: Developing heuristic algorithms that identify near-optimal adjustment sets without exhaustive enumeration of all possible graph orientations
User-guided disambiguation: Enabling interactive edge orientation based on expert knowledge to progressively reduce uncertainty
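One of the planned strategies above, collapsing cyclic components, can be sketched as follows. This is a hypothetical illustration rather than a description of what CKT will implement: it treats every strongly connected component (a set of nodes that can all reach one another) as a single super-node, which always yields an acyclic graph. The function names and example edges are assumptions, and the reachability-based approach is chosen for readability, not speed.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Edge = Tuple[str, str]
Component = FrozenSet[str]


def reachable(children: Dict[str, Set[str]], start: str) -> Set[str]:
    """Nodes reachable from start, including start itself."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def collapse_cycles(edges: Iterable[Edge]) -> Tuple[Dict[str, Component],
                                                    Set[Tuple[Component, Component]]]:
    """Merge mutually reachable nodes into components and return the condensed edge set."""
    edge_list = list(edges)
    children: Dict[str, Set[str]] = {}
    for u, v in edge_list:
        children.setdefault(u, set()).add(v)
    nodes = {n for e in edge_list for n in e}
    reach = {n: reachable(children, n) for n in nodes}
    # Two nodes share a component when each can reach the other.
    component_of = {
        n: frozenset(m for m in reach[n] if n in reach[m]) for n in nodes
    }
    collapsed = {
        (component_of[u], component_of[v])
        for u, v in edge_list
        if component_of[u] != component_of[v]
    }
    return component_of, collapsed


# Toy graph containing the feedback loop mentioned in the limitations.
EXAMPLE_EDGES = [
    ("Exposure", "Inflammation"),
    ("Inflammation", "Oxidative stress"),
    ("Oxidative stress", "Inflammation"),
    ("Oxidative stress", "Outcome"),
]

if __name__ == "__main__":
    _components, condensed = collapse_cycles(EXAMPLE_EDGES)
    for src, dst in condensed:
        print(sorted(src), "->", sorted(dst))
```

The trade-off is loss of detail: once the loop is merged, the analysis can no longer distinguish effects of inflammation from effects of oxidative stress, which is where domain-guided cycle breaking may be preferable.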

Applications

This framework supports rigorous causal inference from observational biomedical data by enabling:

Systematic exploration of alternative causal hypotheses represented in published literature
Identification of potential confounders requiring measurement and adjustment in epidemiological studies
Sensitivity analyses examining how conclusions change under different assumptions about causal directionality
Hypothesis generation for experimental validation of putative causal relationships
Literature-based justification for variable selection in statistical models

Prerequisites

UMLS Metathesaurus License (Required)

CausalKnowledgeTrace uses SemMedDB, a database derived from the UMLS Metathesaurus. A free UMLS license is required before installation.

Why is this required? CausalKnowledgeTrace extracts causal relationships from SemMedDB, which is derived from the UMLS (Unified Medical Language System) Metathesaurus maintained by the National Library of Medicine. The NLM requires users to obtain a free license to access UMLS-derived resources.

How to obtain your license:

Visit the UMLS Metathesaurus License Agreement
Create an account or sign in with existing credentials
Complete the license application (takes ~5 minutes)
Wait for approval (typically 1-2 business days)
You'll receive confirmation via email

Installation note: You can complete software installation steps while waiting for license approval. However, you'll need your approved license before downloading the database.

System Requirements

Docker Desktop: Required for running the application
Disk Space: At least 50GB free (for database and Docker images)
RAM: 8GB minimum, 16GB recommended
Operating System: Linux, macOS, or Windows with Docker support

Installation

Common Setup Steps (Required for All Installation Methods)

Before proceeding with the Docker installation, complete these common steps:

Step 1: Get the Repository

Option A: Clone with Git (Recommended)

Git allows you to easily pull future updates to the project.

# Install Git if needed
# Linux: sudo apt-get install git
# macOS: brew install git
# Windows: https://git-scm.com/download/win

# Verify Git installation
git --version
# Should display: git version 2.x.x or higher

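# Clone the repository (SSH URL; requires an SSH key registered with GitHub)
git clone git@github.com:unmtransinfo/CausalKnowledgeTrace.git
# If you have no SSH key set up, clone over HTTPS instead:
# git clone https://github.com/unmtransinfo/CausalKnowledgeTrace.git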
cd CausalKnowledgeTrace

# To get future updates later:
# git pull origin main

Option B: Download as ZIP

If you don't want to install Git:

Download: Download ZIP from GitHub
Extract the ZIP file
Open terminal/command prompt and navigate to the extracted directory

Step 2: Download Database Backup

Download the SemMedDB database backup file from OneDrive (requires UMLS license):

Download Link: causalehr_backup.tar.gz from OneDrive

Note: The file is approximately 25GB. Download may take several minutes depending on your internet connection. The file will typically download to your Downloads folder.

Step 3: Move and Extract Database Backup

Move the downloaded file to the project directory and extract it:

# Navigate to the project directory
cd CausalKnowledgeTrace

# Move the downloaded file from Downloads folder to current directory
# On Linux/macOS:
mv ~/Downloads/causalehr_backup.tar.gz .

# On Windows (in Git Bash or PowerShell):
# mv ~/Downloads/causalehr_backup.tar.gz .
# Or simply drag and drop the file from Downloads to the CausalKnowledgeTrace folder

# Extract the backup file
tar -xzf causalehr_backup.tar.gz

# Verify the backup directory exists
ls -la causalehr_backup/

You should see multiple .dat.gz files and a toc.dat file in the causalehr_backup/ directory.
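If you prefer a scripted check over reading the `ls` output, the expected contents described above can be verified with a short Python helper. This is an optional convenience sketch; the function name `check_backup_dir` is hypothetical and the script is not part of the repository.

```python
from pathlib import Path
from typing import Tuple


def check_backup_dir(path: str = "causalehr_backup") -> Tuple[bool, int]:
    """Return (toc.dat present?, number of .dat.gz files) for the extracted backup."""
    root = Path(path)
    has_toc = (root / "toc.dat").is_file()
    # Count data files only when the directory actually exists.
    n_data_files = len(list(root.glob("*.dat.gz"))) if root.is_dir() else 0
    return has_toc, n_data_files


if __name__ == "__main__":
    has_toc, n = check_backup_dir()
    print(f"toc.dat present: {has_toc}; .dat.gz files found: {n}")
```

Run it from the project directory. A missing toc.dat or a count of zero suggests the archive was not fully downloaded or was extracted somewhere else.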

Step 4: Configure Environment Variables (Preview)

Docker installation requires setting up database credentials in a .env.dev file. Here's a quick preview:

# Copy the sample environment file
cp doc/sample.env .env.dev

# Edit with your credentials (detailed instructions in installation guides)
nano .env.dev  # or use your preferred editor

Note: Detailed instructions for configuring the .env.dev file are provided in the Docker installation guide below. You can complete this step now or during the installation process.

Docker Installation

Time: ~20 minutes (including database restoration)
Prerequisites: Docker and Docker Compose only

CausalKnowledgeTrace is deployed exclusively through Docker, which provides a containerized environment with all dependencies pre-configured. This ensures consistent setup across different systems.

Complete Docker Installation Guide →

Usage

For detailed usage instructions, see: CKT Usage Instructions

Troubleshooting

For troubleshooting help, please refer to the Docker installation guide:

Docker Installation: See Docker Troubleshooting

Getting Help

If you encounter issues not covered in the installation guides:

Check the logs: Use docker compose -f docker-compose.dev.yaml logs -f to view detailed error messages
GitHub Issues: Open an issue with:
- Your operating system and version
- Docker and Docker Compose versions
- Error messages (copy the full text)
- Steps you've already tried
Email support: Contact Scott Malec (SMalec@salud.unm.edu) or Rajesh Upadhayaya (RAJESHUPADHAYAYA@salud.unm.edu)
