Master's Project Framework
Robust Clinical Decision Making via Adversarial Tree-of-Thoughts (ToT) & Hybrid Entity-Aware Retrieval (HEAR)
Cognitive Hub is a forensic evaluation framework designed to stress-test Large Language Models (LLMs) in high-stakes clinical scenarios. Unlike standard benchmarks that measure static knowledge, this system evaluates dynamic reasoning capabilities under adversarial conditions.
It integrates three novel architectures:
- ACV-ToT (Adversarial Check-Verify Tree of Thoughts): A reasoning engine that actively debates its own decisions using a "Bicameral" agent topology (Proposer vs. Adversary).
- HEAR (Hybrid Entity-Aware Retrieval): A retrieval engine designed to mitigate "Lost-in-the-Middle" phenomena by boosting clinical entities (drugs, anatomy) and critical temporal markers.
- GENESIS (Procedural Data Engine): A synthetic data generator that procedurally creates an effectively unbounded number of unique clinical "Needle-in-a-Haystack" scenarios to test robustness against hallucinations and context overflow.
The entire pipeline is orchestrated by Atlas v28, an autonomous hypervisor that optimizes workload distribution across heterogeneous HPC clusters (x86_64/ARM64).
| Component | Technology | Function |
|---|---|---|
| Reasoning | Tree of Thoughts (ToT) | Performs BFS/DFS search over reasoning paths. Uses Rényi entropy to dynamically expand or contract the search beam width based on uncertainty. |
| Retrieval | Tri-Vector HyDE | Generates 3 hypothetical documents to expand query space. Fuses Dense (Vector) and Sparse (BM25) results using Reciprocal Rank Fusion (RRF). |
| Safety | Reflexion Loops | If a decision is flagged as unsafe, the model enters a self-correction loop, simulating a "Tumor Board" debate to repair the plan. |
| Hardware | Adaptive HAL | Hardware Abstraction Layer that detects GPU architecture (Ampere/Hopper/Blackwell) and auto-tunes precision (BF16/TF32) and Attention kernels (FlashAttn-2). |
| Forensics | Deep Telemetry | Logs power (Joules/Token), VRAM usage, and decision branching entropy to jsonl journals for post-hoc analysis. |
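The entropy-driven beam adaptation in the Reasoning row can be illustrated with a minimal sketch. It is not the code in src/tot_engine.py: the function names `renyi_entropy` and `adapt_beam_width`, the choice of order alpha = 2, the normalization to [0, 1], and the rule "jump to the cap when entropy is high" are all assumptions made for illustration. The default values mirror the CONFIG keys `base_beam_width`, `max_beam_width`, and `entropy_threshold_high`.

```python
import math


def renyi_entropy(probs, alpha=2.0):
    """Rényi entropy of order alpha, normalized to [0, 1] by log(n).

    `probs` are non-negative branch scores; they are renormalized here,
    so they do not have to sum to one.
    """
    n = len(probs)
    total = sum(probs)
    if n <= 1 or total <= 0:
        return 0.0
    p = [x / total for x in probs if x > 0]
    if abs(alpha - 1.0) < 1e-9:
        # alpha -> 1 is the Shannon entropy limit
        h = -sum(x * math.log(x) for x in p)
    else:
        h = math.log(sum(x ** alpha for x in p)) / (1.0 - alpha)
    return h / math.log(n)


def adapt_beam_width(probs, base=2, cap=4, threshold_high=0.85):
    """Hypothetical policy: use the capped width when uncertainty is high."""
    if renyi_entropy(probs) > threshold_high:
        return cap   # branches look equally plausible, so explore more
    return base      # one branch dominates, so keep the beam narrow
```

With four equally scored branches the normalized entropy is 1 and the sketch widens the beam to the cap; with one dominant branch it stays at the base width.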
neuro-symbolic-clinical-ai/
├── data/ # Dataset Storage
│ └── golden_dataset.json # (Auto-Generated) Adversarial clinical cases
├── logs/ # Telemetry & Execution Logs
│ ├── atlas_history.csv # Job submission audit trail
│ └── ... # Per-job stdout/stderr logs
├── models/ # Local Weight Storage (GitIgnored)
├── scripts/ # HPC Automation Tools
│ ├── download_models.py # Rust-accelerated artifact downloader
│ ├── setup_and_download.sh # One-click installer
│ ├── setup_env.sh # Virtual environment builder
│ └── universal_launch.sh # ATLAS HYPERVISOR (The Orchestrator)
├── src/ # Core Application Logic
│ ├── benchmark.py # Main execution kernel (Nexus Orchestrator)
│ ├── data_generator.py # GENESIS Engine (Synthetic Data)
│ ├── rag_engine.py # Legacy RAG implementation
│ ├── semantic_rag.py # HEAR Engine (Hybrid/Semantic Retrieval)
│ ├── tot_engine.py # Cognitive Reasoning Engine (ToT/Reflexion)
│ └── utils.py # Hardware Abstraction Layer (HAL)
└── requirements.txt # Python dependencies
- Access to a Slurm-based Cluster (Compute Nodes).
- Internet access on the Run Node (for downloading weights).
- Hugging Face Access Token (required for the gated Mistral-7B-Instruct-v0.3 model).
You must grant execution permissions to the scripts before running them. Then, run the master setup script.
# 1. Enter the directory
cd neuro-symbolic-clinical-ai
# 2. Grant permissions (CRITICAL STEP)
chmod +x scripts/*.sh
# 3. Run the Auto-Installer
./scripts/setup_and_download.sh
The script will ask for your Hugging Face Token. Paste it when prompted (input will be hidden).
If you prefer manual control or need to debug the installation:
# 1. Build Environment
./scripts/setup_env.sh
# 2. Activate
source gh200_env/bin/activate
# 3. Download Models (Requires HF_TOKEN env var)
export HF_TOKEN="your_token_here"
python scripts/download_models.py
IMPORTANT: Do not run python src/benchmark.py directly on the login node. It requires a GPU. Use the Atlas Hypervisor, which handles node selection, memory compliance, and self-termination.
./scripts/universal_launch.sh
When launched, Atlas scans the cluster and offers strategies:
- AUTO-PILOT (God Mode): Automatically finds the most powerful idle GPU (prioritizing B200 > GH200 > A100) and submits the job with optimal parameters.
- Force A100/B200: Manually target specific architectures for benchmarking consistency.
- Debug Mode: Generates the Slurm script but does not submit it, allowing for manual inspection.
- Data Check: If data/golden_dataset.json is missing, Atlas spins up the GENESIS Engine on the compute node to generate 100 fresh adversarial cases.
- Execution: The NEXUS Kernel (src/benchmark.py) loads the model and iterates through the cases.
- Journaling: Results are streamed to results/nexus_[timestamp].jsonl.
- Teardown: The job automatically cancels itself (scancel) upon completion to save compute credits.
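The Data Check and Journaling steps above can be sketched as follows. This is an illustrative outline, not the contents of src/benchmark.py: the helper names `load_cases` and `run_and_journal`, the `evaluate` callback, the record fields, and the assumption that the dataset is a JSON list of objects with an `id` key are all hypothetical.

```python
import json
import time
from pathlib import Path


def load_cases(dataset_path):
    """Return the list of cases, or None when the dataset file is missing."""
    path = Path(dataset_path)
    if not path.exists():
        return None  # the caller would trigger data generation here
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def run_and_journal(cases, evaluate, results_dir="results"):
    """Evaluate each case and append one JSON line per case as it finishes."""
    out_dir = Path(results_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    journal = out_dir / f"nexus_{time.strftime('%Y%m%d_%H%M%S')}.jsonl"
    with journal.open("a", encoding="utf-8") as fh:
        for case in cases:
            record = {"case_id": case.get("id"), "result": evaluate(case)}
            fh.write(json.dumps(record) + "\n")
            fh.flush()  # stream to disk so partial runs stay inspectable
    return journal
```

Writing and flushing one line per case means a job that is cancelled or times out still leaves a readable journal of the cases completed so far.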
Once the job is submitted, Atlas will give you a command to monitor output. It looks like this:
tail -f logs/NSym_blackwell_[JOB_ID].out
To tweak the experiment, modify the hyperparameter dictionaries at the top of the source files.
Reasoning Parameters (src/tot_engine.py):
CONFIG = {
"max_reflexion_retries": 1, # How many times to argue with the adversary
"base_beam_width": 2, # How many reasoning branches to explore in parallel
"max_beam_width": 4, # Cap for adaptive expansion
"entropy_threshold_high": 0.85, # Trigger for widening beam (Confusion)
"debate_rounds": 1 # Depth of the debate tree
}
Context & Hardware (src/utils.py & src/benchmark.py):
- Context Window: Defaults to 32k for High-Spec GPUs, auto-downgrades to 8k for Legacy GPUs.
- Precision: Automatically selects bfloat16 for Ampere+ and float16 for Volta.
- JIT: torch.compile is disabled by default to prevent recompilation latency on dynamic input lengths.
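The precision and context-window defaults above can be pictured as a small lookup on the GPU's CUDA compute capability. The sketch below is an assumption-laden illustration rather than the HAL in src/utils.py: the function name `select_runtime_profile` is hypothetical, the token counts assume 32k = 32768 and 8k = 8192, and it assumes the "High-Spec" context tier coincides with the Ampere+ precision tier, which the README does not state.

```python
def select_runtime_profile(compute_capability_major):
    """Map a CUDA compute-capability major version to a dtype name and context size.

    Ampere reports major version 8 and Hopper 9; Volta reports 7.
    """
    if compute_capability_major >= 8:
        # Ampere or newer: bfloat16 is supported in hardware
        return {"dtype": "bfloat16", "max_context": 32768}
    # Older architectures such as Volta fall back to float16 and a shorter window
    return {"dtype": "float16", "max_context": 8192}
```

On a machine with PyTorch and a visible GPU, the major version can be read from the first element of the tuple returned by `torch.cuda.get_device_capability()`.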
The system produces a jsonl journal containing forensic details for every case:
- Factual Accuracy: Does the answer match the Gold Standard?
- Safety Score (CoVe): Did the Chain-of-Verification loop flag any risks?
- Uncertainty: Semantic Entropy score derived from parallel generations.
- Joules/Token: Energy efficiency of the reasoning process.
- Hallucinations: Extraction of numerical values present in the answer but absent in the source context.
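The Hallucinations metric, as described, compares numbers in the answer against numbers in the source context. A minimal sketch of that idea follows; the function name `unsupported_numbers` and the regular expression are assumptions, and the framework's actual extraction logic may differ (for example in how it treats units or number formatting).

```python
import re

# Integers and simple decimals such as "500" or "2.5"
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(answer, context):
    """Return numbers that appear in the answer but nowhere in the context."""
    context_numbers = set(_NUMBER.findall(context))
    flagged = []
    for token in _NUMBER.findall(answer):
        if token not in context_numbers and token not in flagged:
            flagged.append(token)  # keep first-seen order, no duplicates
    return flagged
```

The comparison is purely textual, so "5" and "5.0" are treated as different values; a stricter check would normalize numbers before comparing.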
1. "Permission Denied"
- Cause: Scripts lost executable flags during transfer.
- Solution: Run chmod +x scripts/*.sh.
2. "MistralForCausalLM does not support len()"
- Cause: Interaction between JIT compilation and Python boolean checks.
- Solution: This codebase handles it by using explicit is not None checks. Ensure you are using the latest src/utils.py.
3. "RuntimeError: Expected all tensors to be on the same device"
- Cause: RAG embeddings generated on CPU while Model is on GPU.
- Solution: The HEAR engine in src/semantic_rag.py now dynamically maps inputs to self.model.device.
4. Job Timeout
- Cause: Tree of Thoughts is computationally expensive (O(b^d)).
- Solution: universal_launch.sh now requests 12 hours (12:00:00) for robust runs.
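The fix for issue 2 relies on a general Python behavior: a truthiness test such as `if model:` falls back to `__len__` when the object defines no `__bool__`, whereas `model is not None` only compares identity. The sketch below demonstrates that difference with a hypothetical stand-in class, `FakeModel`; it does not reproduce the actual model class or the JIT interaction.

```python
class FakeModel:
    """Hypothetical stand-in for an object that cannot report a length."""

    def __len__(self):
        raise TypeError("FakeModel does not support len()")


def is_loaded_truthy(model):
    # bool() calls __len__ here because __bool__ is not defined,
    # so this raises TypeError for FakeModel instances
    return bool(model)


def is_loaded_explicit(model):
    # Identity comparison never calls __len__ or __bool__
    return model is not None
```

Calling `is_loaded_truthy(FakeModel())` raises `TypeError`, while `is_loaded_explicit(FakeModel())` returns `True` without touching the object.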
If you utilize this framework, please cite:
@software{CognitiveHub2026,
author = {Cognitive Hub Research Team},
title = {Cognitive Hub: A Neuro-Symbolic Clinical Reasoning Suite},
year = {2026},
institution = {HPI},
version = {28.0.0}
}