CausalKnowledgeTrace: Interactive Literature-Based Causal Structure Mapping, Graph Generation, Visualization, and Refinement
CausalKnowledgeTrace (CKT) helps researchers build causal knowledge graphs from published biomedical literature. The system automatically extracts and organizes causal relationships between biological concepts (genes, proteins, diseases, drugs, etc.) to support hypothesis generation and study design in observational research.
Users specify an exposure and outcome of interest. They can constrain the search by publication year, causal predicate type, and minimum number of supporting articles per relationship. CKT constructs initial partially directed acyclic graphs (PDAGs) representing causal structures between biomedical concepts. Users then edit these graphs interactively to remove unnecessary nodes and edges. CKT can export graphs and evidence from the literature for downstream analysis.
- Test/Development Site: http://10.234.117.212:3837/ (Requires UNM HSC network connection or VPN)
- Production Site: https://habanero.health.unm.edu/CKT/
- User Manual: CKT Usage Guide
CKT queries SemMedDB, a database containing subject-predicate-object triples (e.g., "Smoking CAUSES Lung Cancer") extracted from 37+ million PubMed titles and abstracts using the SemRep natural language processing system. Each relationship is linked to its supporting literature, allowing users to trace claims back to primary evidence.
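As a rough illustration of the triple-plus-evidence idea described above, the sketch below groups raw assertion rows into one record per subject-predicate-object triple and counts the distinct supporting articles. The `Triple` class, the `aggregate` helper, and the row layout are hypothetical names chosen for this example; they are not taken from the SemMedDB schema or the CKT code base, and the PMIDs are placeholders.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triple:
    """One subject-predicate-object assertion with its supporting PubMed IDs."""
    subject: str
    predicate: str
    obj: str
    pmids: frozenset = field(default_factory=frozenset)

    @property
    def support(self) -> int:
        # Number of distinct articles backing this assertion.
        return len(self.pmids)


def aggregate(rows):
    """Merge (subject, predicate, object, pmid) rows into Triple records."""
    grouped = {}
    for subject, predicate, obj, pmid in rows:
        grouped.setdefault((subject, predicate, obj), set()).add(pmid)
    return [Triple(s, p, o, frozenset(ids)) for (s, p, o), ids in grouped.items()]


# Hypothetical rows; the PMIDs are placeholders, not real citations.
rows = [
    ("Smoking", "CAUSES", "Lung Cancer", 1001),
    ("Smoking", "CAUSES", "Lung Cancer", 1002),
    ("Smoking", "CAUSES", "Lung Cancer", 1001),  # second sentence, same article
]
triples = aggregate(rows)
for t in triples:
    print(t.subject, t.predicate, t.obj, "supported by", t.support, "articles")
```

Keeping the PMIDs on each record is what makes it possible to trace a claim back to its source articles.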
CKT is built as a modern Django web application with the following components:
- Django web framework: Provides the web interface, user authentication, and request handling
- Python graph engine: Implements graph construction and causal structure learning algorithms, and exports results for statistical analysis
- PostgreSQL database: Stores SemMedDB data and application state
- Interactive visualization: Browser-based DAG exploration with zoom, pan, and node interaction capabilities
- Query Configuration: Users specify an exposure and outcome of interest using UMLS (Unified Medical Language System) identifiers or free text search. Configurable parameters include publication year range, causal predicate types (CAUSES, INHIBITS, STIMULATES, PREVENTS, DISRUPTS), minimum article support thresholds, and degrees of separation (currently limited to 3 degrees between exposure and outcome).
- Graph Construction: CKT builds initial partially directed acyclic graphs (PDAGs) representing potential causal pathways connecting the exposure to the outcome. Edge directions are inferred from temporal precedence, biological plausibility, and semantic predicate types extracted from the literature.
- Interactive Refinement: Users iteratively remove spurious associations, biologically implausible relationships, or irrelevant variables through the web interface. This step incorporates domain expertise to improve graph quality.
- Export and Documentation: Refined graphs and supporting evidence (PubMed IDs, semantic predicates, citation counts) are exported for downstream causal analysis and documentation.
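The query filters in the workflow above can be sketched as follows. This is a minimal illustration, not CKT's implementation: the `build_edges` function, the record layout `(subject, predicate, object, pmid, year)`, and the default thresholds are all assumptions made for the example. Limiting by degrees of separation is not shown.

```python
from collections import defaultdict

# Predicate types listed in the query configuration step.
CAUSAL_PREDICATES = {"CAUSES", "INHIBITS", "STIMULATES", "PREVENTS", "DISRUPTS"}


def build_edges(records, predicates=CAUSAL_PREDICATES, min_articles=2,
                year_range=(2000, 2024)):
    """Return {(subject, object): set_of_pmids} for assertions passing the filters."""
    first_year, last_year = year_range
    support = defaultdict(set)
    for subject, predicate, obj, pmid, year in records:
        if predicate in predicates and first_year <= year <= last_year:
            support[(subject, obj)].add(pmid)
    # Keep only edges backed by enough distinct articles.
    return {edge: pmids for edge, pmids in support.items()
            if len(pmids) >= min_articles}


# Hypothetical records with placeholder PMIDs.
records = [
    ("Smoking", "CAUSES", "Inflammation", 1, 2005),
    ("Smoking", "CAUSES", "Inflammation", 2, 2010),
    ("Inflammation", "CAUSES", "Lung Cancer", 3, 2012),
    ("Inflammation", "CAUSES", "Lung Cancer", 4, 1995),    # outside the year range
    ("Smoking", "COEXISTS_WITH", "Alcohol Use", 5, 2015),  # not a causal predicate
    ("Smoking", "COEXISTS_WITH", "Alcohol Use", 6, 2016),
]
edges = build_edges(records)
print(edges)
```

With the defaults, only the Smoking to Inflammation edge survives; lowering `min_articles` to 1 also keeps Inflammation to Lung Cancer.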
In development
The causal_analysis/ module performs systematic causal variable classification. Tools in this module:
- Classify variables as confounders, mediators, or colliders relative to the exposure-outcome relationship
- Apply graph traversal algorithms to retain variables within the causal vicinity while removing extraneous nodes
- Compute minimal sufficient adjustment sets satisfying the back-door criterion for unbiased causal effect estimation
- Identify adjustment strategies that block confounding paths while avoiding M-bias and butterfly bias
- Suggest, given user-supplied measured variables, best-match minimal sufficient adjustment sets that may include proxy confounders
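The first classification step listed above can be sketched with plain graph reachability. This is a simplified illustration and not the causal_analysis/ module's code: the function names are hypothetical, the input is assumed to be a DAG, and the role definitions are deliberately coarse (a confounder is taken to be a common ancestor of exposure and outcome, a mediator a node on a directed path from exposure to outcome, and a collider a common descendant of both).

```python
def reachable(graph, start):
    """Return all nodes reachable from start by following directed edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for child in graph.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def reverse(graph):
    """Return the graph with every edge direction flipped."""
    rev = {}
    for parent, children in graph.items():
        for child in children:
            rev.setdefault(child, set()).add(parent)
    return rev


def classify(graph, exposure, outcome):
    """Assign a coarse causal role to every node other than exposure and outcome."""
    rev = reverse(graph)
    desc_x, desc_y = reachable(graph, exposure), reachable(graph, outcome)
    anc_x, anc_y = reachable(rev, exposure), reachable(rev, outcome)
    roles = {}
    for node in (set(graph) | set(rev)) - {exposure, outcome}:
        if node in anc_x and node in anc_y:
            roles[node] = "confounder"
        elif node in desc_x and node in anc_y:
            roles[node] = "mediator"
        elif node in desc_x and node in desc_y:
            roles[node] = "collider"
        else:
            roles[node] = "other"
    return roles


# Hypothetical example graph: node -> set of children.
graph = {
    "Age": {"Smoking", "Lung Cancer"},
    "Smoking": {"Tar Deposition", "Lung Cancer", "Hospitalization"},
    "Tar Deposition": {"Lung Cancer"},
    "Lung Cancer": {"Hospitalization"},
}
roles = classify(graph, "Smoking", "Lung Cancer")
print(roles)
```

Computing adjustment sets that satisfy the back-door criterion requires path-level reasoning beyond this reachability check.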
The advanced analysis tools currently function on small example graphs but encounter computational challenges on literature-derived graphs due to:
- Cyclic relationships: Extracted literature relationships may contain feedback loops that violate the acyclic assumption required for standard causal inference algorithms. Biological systems often exhibit genuine bidirectional causation (e.g., inflammation causes oxidative stress, which further exacerbates inflammation).
- Markov equivalence classes: Many edge orientations in literature-derived graphs are ambiguous, resulting in equivalence classes of graphs that encode identical conditional independence relationships but different causal interpretations. The number of candidate orientations grows exponentially (up to 2^k for k ambiguous edges), making exhaustive enumeration intractable for large graphs.
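To make the 2^k growth concrete, the sketch below enumerates every orientation of a small set of ambiguous edges and keeps those that leave the graph acyclic. It is an illustration only, with hypothetical function names. Acyclicity alone is a weaker filter than Markov equivalence (which also forbids creating new v-structures), so the count is an upper bound on the equivalence class size.

```python
from itertools import product


def is_acyclic(edges):
    """Kahn's algorithm: True if the directed edge list contains no cycle."""
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    queue = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while queue:
        node = queue.pop()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return visited == len(nodes)


def acyclic_orientations(directed, undirected):
    """Try all 2^k orientations of the undirected edges; keep the acyclic ones."""
    results = []
    for flips in product((False, True), repeat=len(undirected)):
        oriented = [(b, a) if flip else (a, b)
                    for (a, b), flip in zip(undirected, flips)]
        candidate = directed + oriented
        if is_acyclic(candidate):
            results.append(candidate)
    return results


directed = [("A", "B")]
undirected = [("B", "C"), ("C", "D"), ("B", "D")]  # k = 3 ambiguous edges
candidates = 2 ** len(undirected)
valid = acyclic_orientations(directed, undirected)
print(candidates, len(valid))  # 8 candidates, 6 of them acyclic
```

Each additional ambiguous edge doubles the number of candidates, which is why exhaustive enumeration stops being practical on large literature-derived graphs.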
Planned solutions include:
- Cycle detection and resolution: Implementing algorithms to identify feedback loops and apply domain-guided strategies for cycle breaking or collapsing cyclic components into latent variables
- Constraint-based orientation: Using temporal information, intervention evidence, and biological knowledge to reduce the equivalence class search space
- Approximate inference methods: Developing heuristic algorithms that identify near-optimal adjustment sets without exhaustive enumeration of all possible graph orientations
- User-guided disambiguation: Enabling interactive edge orientation based on expert knowledge to progressively reduce uncertainty
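One possible starting point for the cycle detection item above is to find strongly connected components: any component with more than one node contains at least one feedback loop and is a candidate for collapsing or cycle breaking. The sketch below uses Kosaraju's algorithm and hypothetical function names; it is not the planned CKT implementation, uses recursion (so it suits small graphs only), and does not report self-loops.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm; graph maps node -> iterable of children."""
    nodes = set(graph) | {c for children in graph.values() for c in children}
    order, seen = [], set()

    def visit(node):
        # First pass: record nodes in order of DFS completion.
        seen.add(node)
        for child in graph.get(node, ()):
            if child not in seen:
                visit(child)
        order.append(node)

    for node in sorted(nodes):
        if node not in seen:
            visit(node)

    reverse = {n: [] for n in nodes}
    for parent, children in graph.items():
        for child in children:
            reverse[child].append(parent)

    # Second pass: walk the reversed graph in decreasing completion order.
    components, assigned = [], set()
    for node in reversed(order):
        if node in assigned:
            continue
        component, stack = set(), [node]
        assigned.add(node)
        while stack:
            current = stack.pop()
            component.add(current)
            for parent in reverse[current]:
                if parent not in assigned:
                    assigned.add(parent)
                    stack.append(parent)
        components.append(frozenset(component))
    return components


def feedback_loops(graph):
    """Return components with more than one node, i.e. groups containing a cycle."""
    return [c for c in strongly_connected_components(graph) if len(c) > 1]


# Hypothetical graph echoing the inflammation / oxidative stress example.
graph = {
    "Smoking": ["Inflammation"],
    "Inflammation": ["Oxidative Stress", "Lung Cancer"],
    "Oxidative Stress": ["Inflammation"],
}
loops = feedback_loops(graph)
print(loops)
```

Each reported component could then be reviewed by a domain expert, who decides whether to drop an edge or treat the whole component as a single node.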
This framework supports rigorous causal inference from observational biomedical data by enabling:
- Systematic exploration of alternative causal hypotheses represented in published literature
- Identification of potential confounders requiring measurement and adjustment in epidemiological studies
- Sensitivity analyses examining how conclusions change under different assumptions about causal directionality
- Hypothesis generation for experimental validation of putative causal relationships
- Literature-based justification for variable selection in statistical models
CausalKnowledgeTrace uses SemMedDB, a database derived from the UMLS Metathesaurus. A free UMLS license is required before installation.
Why is this required? CausalKnowledgeTrace extracts causal relationships from SemMedDB, which is derived from the UMLS (Unified Medical Language System) Metathesaurus maintained by the National Library of Medicine. The NLM requires users to obtain a free license to access UMLS-derived resources.
How to obtain your license:
- Visit the UMLS Metathesaurus License Agreement
- Create an account or sign in with existing credentials
- Complete the license application (takes ~5 minutes)
- Wait for approval (typically 1-2 business days)
- You'll receive confirmation via email
Installation note: You can complete software installation steps while waiting for license approval. However, you'll need your approved license before downloading the database.
- Docker Desktop: Required for running the application
- Disk Space: At least 50GB free (for database and Docker images)
- RAM: 8GB minimum, 16GB recommended
- Operating System: Linux, macOS, or Windows with Docker support
Before proceeding with the Docker installation, complete these common steps:
Option A: Clone with Git (Recommended)
Git allows you to easily pull future updates to the project.
# Install Git if needed
# Linux: sudo apt-get install git
# macOS: brew install git
# Windows: https://git-scm.com/download/win
# Verify Git installation
git --version
# Should display: git version 2.x.x or higher
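# Clone the repository (the SSH URL below requires an SSH key registered with GitHub)
git clone git@github.com:unmtransinfo/CausalKnowledgeTrace.git
# Without SSH keys, use HTTPS instead:
# git clone https://github.com/unmtransinfo/CausalKnowledgeTrace.git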
cd CausalKnowledgeTrace
# To get future updates later:
# git pull origin main
Option B: Download as ZIP
If you don't want to install Git:
- Download: Download ZIP from GitHub
- Extract the ZIP file
- Open terminal/command prompt and navigate to the extracted directory
Download the SemMedDB database backup file from OneDrive (requires UMLS license):
Download Link: causalehr_backup.tar.gz from OneDrive
Note: The file is approximately 25GB. Download may take several minutes depending on your internet connection. The file will typically download to your Downloads folder.
Move the downloaded file to the project directory and extract it:
# Navigate to the project directory
cd CausalKnowledgeTrace
# Move the downloaded file from Downloads folder to current directory
# On Linux/macOS:
mv ~/Downloads/causalehr_backup.tar.gz .
# On Windows (in Git Bash or PowerShell):
# mv ~/Downloads/causalehr_backup.tar.gz .
# Or simply drag and drop the file from Downloads to the CausalKnowledgeTrace folder
# Extract the backup file
tar -xzf causalehr_backup.tar.gz
# Verify the backup directory exists
ls -la causalehr_backup/
You should see multiple .dat.gz files and a toc.dat file in the causalehr_backup/ directory.
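If you prefer a scripted check over reading the directory listing, the following sketch tests for the two things mentioned above: a toc.dat file and at least one .dat.gz file. The `check_backup` helper is a hypothetical convenience for this guide, not part of the CKT repository, and it does not validate the contents of the files.

```python
from pathlib import Path


def check_backup(directory):
    """Return a list of problems found in the extracted backup directory."""
    root = Path(directory)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    if not (root / "toc.dat").is_file():
        problems.append("toc.dat is missing")
    if not any(root.glob("*.dat.gz")):
        problems.append("no .dat.gz files found")
    return problems


if __name__ == "__main__":
    # Run from the CausalKnowledgeTrace project directory.
    issues = check_backup("causalehr_backup")
    print("Backup looks complete" if not issues else "\n".join(issues))
```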
Docker installation requires setting up database credentials in a .env.dev file. Here's a quick preview:
# Copy the sample environment file
cp doc/sample.env .env.dev
# Edit with your credentials (detailed instructions in installation guides)
nano .env.dev # or use your preferred editor
Note: Detailed instructions for configuring the .env.dev file are provided in the Docker installation guide below. You can complete this step now or during the installation process.
Time: ~20 minutes (including database restoration)
Prerequisites: Docker and Docker Compose only
CausalKnowledgeTrace is deployed exclusively through Docker, which provides a containerized environment with all dependencies pre-configured. This ensures consistent setup across different systems.
Complete Docker Installation Guide →
For detailed usage instructions, see: CKT Usage Instructions
For troubleshooting help, please refer to the Docker installation guide:
- Docker Installation: See Docker Troubleshooting
If you encounter issues not covered in the installation guides:
- Check the logs: Use docker compose -f docker-compose.dev.yaml logs -f to view detailed error messages
- GitHub Issues: Open an issue with:
- Your operating system and version
- Docker and Docker Compose versions
- Error messages (copy the full text)
- Steps you've already tried
- Email support: Contact Scott Malec (SMalec@salud.unm.edu) or Rajesh Upadhayaya (RAJESHUPADHAYAYA@salud.unm.edu)