The Surgical Agentic Framework Demo is a multimodal agentic AI framework tailored for surgical procedures. It supports:
- Speech-to-Text: Real-time audio is captured and transcribed by Whisper.
- VLM/LLM-based Conversational Agents: A selector agent decides which specialized agent to invoke:
  - ChatAgent for general Q&A,
  - NotetakerAgent to record specific notes,
  - AnnotationAgent to automatically annotate progress in the background,
  - PostOpNoteAgent to summarize all data into a final post-operative note.
- (Optional) Text-to-Speech: The system can speak the AI's response back to you if you enable TTS (ElevenLabs is implemented, but any local TTS could be plugged in as well).
- Computer Vision: Multimodal features are supported via a fine-tuned VLM (Vision Language Model) served by vLLM.
- Video Upload and Processing: Support for uploading and analyzing surgical videos.
- Post-Operation Note Generation: Automatic generation of structured post-operative notes based on the procedure data.
- Microphone: The user clicks "Start Mic" in the web UI, or types a question.
- Whisper ASR: Transcribes speech into text (via `servers/whisper_online_server.py`).
- SelectorAgent: Receives text from the UI, corrects it (if needed), and decides whether to direct it to one of the following (a routing sketch follows this list):
  - ChatAgent (general Q&A about the procedure)
  - NotetakerAgent (records a note with timestamp + optional image frame)
  - In the background, AnnotationAgent is also generating structured "annotations" every 10 seconds.
- NotetakerAgent: If chosen, logs the note in a JSON file.
- AnnotationAgent: Runs automatically, storing procedure annotations in `procedure_..._annotations.json`.
- PostOpNoteAgent (optional final step): Summarizes the entire procedure, reading from both the annotation JSON and the notetaker JSON, producing a final structured post-op note.
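To make the routing concrete, here is a minimal sketch of selector-style dispatch. The function and the keyword heuristic are illustrative assumptions, not the actual logic in `agents/selector_agent.py`:

```python
"""Illustrative sketch of selector-style routing; the real logic lives in
agents/selector_agent.py and may differ substantially (e.g. LLM-based)."""
import re

def route(text: str) -> str:
    """Decide which agent should handle a transcribed utterance."""
    # Phrases like "take a note" go to the NotetakerAgent.
    if re.search(r"\b(take|make) a note\b", text, re.IGNORECASE):
        return "NotetakerAgent"
    # Everything else is treated as general Q&A for the ChatAgent.
    return "ChatAgent"

print(route("Take a note: the gallbladder is severely inflamed"))            # NotetakerAgent
print(route("What are the next steps after dissecting the cystic duct?"))    # ChatAgent
```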
- Python 3.12 or higher
- Node.js 14.x or higher
- CUDA-compatible GPU (recommended) for model inference
- Microphone for voice input (optional)
- 16GB+ RAM recommended
- Clone or download this repository:

  ```bash
  git clone https://github.com/monai/surgical_agentic_framework.git
  cd surgical_agentic_framework
  ```
- Set up vLLM (optional):
  vLLM is already configured in the project scripts. If you need to set up a custom vLLM server, see https://docs.vllm.ai/en/latest/getting_started/installation.html
- Install dependencies:

  ```bash
  conda create -n surgical_agentic_framework python=3.12
  conda activate surgical_agentic_framework
  pip install -r requirements.txt
  ```

- Install Node.js dependencies (for UI development):

  ```bash
  npm install
  ```
- Models folder:
  - Place your model files in `models/llm/` for LLMs and `models/whisper/` for Whisper models.
  - This repository is configured to use a Llama-3.2-11B model with surgical fine-tuning.
  - The model is served using vLLM for optimal performance.
  - Folder structure is (a quick sanity check is sketched after the setup steps below):

    ```
    models/
    ├── llm/
    │   └── Llama-3.2-11B-lora-surgical-4bit/   <-- LLM model files
    └── whisper/                                <-- Whisper models (downloaded at runtime)
    ```
- Setup:
  - Edit `scripts/start_app.sh` if you need to change ports or model file names.
  - Create necessary directories:

    ```bash
    mkdir -p annotations uploaded_videos
    ```
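Before launching, you can verify the model folders described above are in place. A small sketch; the paths are the ones this README uses, so adjust them if you renamed anything:

```python
"""Sanity check that the expected model folders exist.
Paths follow this README's layout; adjust if you renamed anything."""
from pathlib import Path

llm_dir = Path("models/llm/Llama-3.2-11B-lora-surgical-4bit")
whisper_dir = Path("models/whisper")  # populated by Whisper at runtime

for path in (llm_dir, whisper_dir):
    status = "ok" if path.is_dir() else "MISSING"
    print(f"{status:8s} {path}")
```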
- Run the full stack with all services:

  ```bash
  npm start
  ```

  Or use the script directly:

  ```bash
  ./scripts/start_app.sh
  ```
What it does:
- Builds the CSS with Tailwind
- Starts the vLLM server with the model on port 8000 (a health-check sketch follows this list)
- Waits 45 seconds for the model to load
- Starts Whisper (`servers/whisper_online_server.py`) on port 43001 (for ASR)
- Waits 5 seconds
- Launches `python servers/app.py` (the main Flask + WebSockets application)
- Waits for all processes to complete
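If startup hangs, you can check whether the vLLM server actually came up before the 45-second wait expired. A minimal probe, assuming the stock OpenAI-compatible vLLM server on port 8000 (its `/health` route; if your vLLM version differs, requesting `/v1/models` works too):

```python
"""Probe the local vLLM server launched by scripts/start_app.sh.
Assumes the stock OpenAI-compatible vLLM server on port 8000."""
import requests

try:
    r = requests.get("http://127.0.0.1:8000/health", timeout=5)
    print("vLLM server is up" if r.ok else f"vLLM returned HTTP {r.status_code}")
except requests.ConnectionError:
    print("vLLM server is not reachable -- is it still loading the model?")
```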
For UI development with hot-reloading CSS changes:

```bash
npm run dev:web
```

This starts:
- The CSS watch process for automatic Tailwind compilation
- The web server only (no LLM or Whisper)

For full-stack development:

```bash
npm run dev:full
```

This is the same as production mode but also watches for CSS changes.

You can also use the development script for faster startup during development:

```bash
./scripts/dev.sh
```
- Open your browser at http://127.0.0.1:8050. You should see the Surgical Agentic Framework Demo interface:
  - A video sample (`sample_video.mp4`)
  - A chat console
  - A "Start Mic" button to begin ASR
- Try speaking or typing:
  - If you say "Take a note: The gallbladder is severely inflamed," the system routes you to NotetakerAgent.
  - If you say "What are the next steps after dissecting the cystic duct?" it routes you to ChatAgent.
- Background annotations:
  - Meanwhile, AnnotationAgent writes a file like `procedure_2025_01_18__10_25_03_annotations.json` in the annotations folder every 10 seconds with structured timeline data.
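To inspect these files programmatically, something like the following works. The annotation schema is not documented here, so the snippet just pretty-prints whatever it finds:

```python
"""Pretty-print the latest annotation file written by AnnotationAgent.
The JSON schema is not assumed -- this just dumps whatever is there."""
import json
from pathlib import Path

files = sorted(Path("annotations").glob("procedure_*_annotations.json"))
if files:
    latest = files[-1]
    print(f"Latest annotations: {latest.name}")
    print(json.dumps(json.loads(latest.read_text()), indent=2))
else:
    print("No annotation files yet -- is AnnotationAgent running?")
```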
- Click on the "Upload Video" button to add your own surgical videos
- Browse the video library by clicking "Video Library"
- Select a video to analyze
- Use the chat interface to ask questions about the video or create annotations
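Uploads can also be scripted instead of using the UI button. The route and field name below are hypothetical guesses at the Flask endpoint behind "Upload Video"; confirm the real ones in `servers/app.py` before relying on this:

```python
"""Script a video upload instead of using the UI button.
The /upload_video route and 'video' field name are hypothetical --
check servers/app.py for the actual endpoint."""
import requests

with open("my_procedure.mp4", "rb") as f:
    resp = requests.post(
        "http://127.0.0.1:8050/upload_video",  # hypothetical endpoint
        files={"video": ("my_procedure.mp4", f, "video/mp4")},
    )
print(resp.status_code, resp.text)
```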
After accumulating annotations and notes during a procedure:
- Click the "Generate Post-Op Note" button
- The system will analyze all annotations and notes
- A structured post-operative note will be generated with:
  - Procedure information
  - Key findings
  - Procedure timeline
  - Complications
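Conceptually, PostOpNoteAgent merges the annotation JSON and the notetaker JSON into one summarization request. A rough sketch of that idea; the file names below are placeholders and the real agent in `agents/post_op_note_agent.py` surely differs in detail:

```python
"""Conceptual sketch of PostOpNoteAgent: merge annotation and note JSON
into a single summarization prompt. File names are placeholders."""
import json
from pathlib import Path

def load(path: str):
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else []

annotations = load("annotations/procedure_2025_01_18__10_25_03_annotations.json")
notes = load("annotations/procedure_2025_01_18__10_25_03_notes.json")  # hypothetical name

context = json.dumps({"annotations": annotations, "notes": notes}, indent=2)
prompt = (
    "Write a structured post-operative note (procedure information, key findings, "
    "timeline, complications) from the following data:\n" + context
)
# The prompt would then go to the vLLM-served model, e.g. via its
# OpenAI-compatible /v1/chat/completions endpoint on port 8000.
print(prompt[:500])
```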
Common issues and solutions:
- WebSocket connection errors:
  - Check firewall settings to ensure ports 49000 and 49001 are open (a port-check sketch follows this list)
  - Ensure no other applications are using these ports
  - If you experience frequent timeouts, adjust the WebSocket configuration in `servers/web_server.py`
- Model loading errors:
  - Verify model paths are correct in configuration files
  - Ensure you have sufficient GPU memory for the models
  - Check the log files for specific error messages
- Audio transcription issues:
  - Verify your microphone is working correctly
  - Check that the Whisper server is running
  - Adjust microphone settings in your browser
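A quick way to see which of the ports this README mentions are already in use (a successful connect means something is listening there):

```python
"""Check the ports this README mentions (web UI, vLLM, Whisper, WebSockets).
A successful connect means something is already listening on that port."""
import socket

for name, port in [("web UI", 8050), ("vLLM", 8000), ("Whisper", 43001),
                   ("WebSocket", 49000), ("WebSocket", 49001)]:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1)
        in_use = s.connect_ex(("127.0.0.1", port)) == 0
        print(f"{name:10s} port {port}: {'listening' if in_use else 'free'}")
```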
If you want to enable TTS with ElevenLabs (or implement your own local TTS server):
- Follow the instructions in `index.html` or the code that calls a TTS route or API.
- Provide your TTS API key if needed.
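For reference, a minimal direct call to the ElevenLabs REST API looks roughly like this. The voice ID is a placeholder, and the framework's own TTS wiring may route through its servers rather than calling the API directly:

```python
"""Minimal ElevenLabs text-to-speech call, for reference only.
XI_API_KEY and the voice ID are placeholders -- the framework's own
TTS integration may work differently."""
import os
import requests

voice_id = "YOUR_VOICE_ID"  # placeholder
resp = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
    headers={"xi-api-key": os.environ["XI_API_KEY"]},
    json={"text": "The gallbladder is severely inflamed."},
)
resp.raise_for_status()
with open("tts_output.mp3", "wb") as f:
    f.write(resp.content)  # audio/mpeg by default
```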
A brief overview:
```
surgical_agentic_framework/
├── agents/                      <-- Agent implementations
│   ├── annotation_agent.py
│   ├── base_agent.py
│   ├── chat_agent.py
│   ├── notetaker_agent.py
│   ├── post_op_note_agent.py
│   └── selector_agent.py
├── configs/                     <-- Configuration files
│   ├── annotation_agent.yaml
│   ├── chat_agent.yaml
│   ├── notetaker_agent.yaml
│   ├── post_op_note_agent.yaml
│   └── selector.yaml
├── models/                      <-- Model files
│   ├── llm/                     <-- LLM model files
│   │   └── Llama-3.2-11B-lora-surgical-4bit/
│   └── whisper/                 <-- Whisper models (downloaded at runtime)
├── scripts/                     <-- Shell scripts for starting services
│   ├── dev.sh                   <-- Development script for quick startup
│   ├── run_vllm_server.sh
│   ├── start_app.sh             <-- Main script to launch everything
│   └── start_web_dev.sh         <-- Web UI development script
├── servers/                     <-- Server implementations
│   ├── app.py                   <-- Main application server
│   ├── uploaded_videos/         <-- Storage for uploaded videos
│   ├── web_server.py            <-- Web interface server
│   └── whisper_online_server.py <-- Whisper ASR server
├── utils/                       <-- Utility classes and functions
│   ├── chat_history.py
│   ├── logging_utils.py
│   └── response_handler.py
├── web/                         <-- Web interface assets
│   ├── src/                     <-- Vue.js components
│   │   ├── App.vue
│   │   ├── components/
│   │   │   ├── Annotation.vue
│   │   │   ├── ChatMessage.vue
│   │   │   ├── Note.vue
│   │   │   ├── PostOpNote.vue
│   │   │   └── VideoCard.vue
│   │   └── main.js
│   ├── static/                  <-- CSS, JS, and other static assets
│   │   ├── audio.js
│   │   ├── bootstrap.bundle.min.js
│   │   ├── bootstrap.css
│   │   ├── chat.css
│   │   ├── jquery-3.6.3.min.js
│   │   ├── main.js
│   │   ├── nvidia-logo.png
│   │   ├── styles.css
│   │   ├── tailwind-custom.css
│   │   └── websocket.js
│   └── templates/
│       └── index.html
├── annotations/                 <-- Stored procedure annotations
├── uploaded_videos/             <-- Uploaded video storage
├── README.md                    <-- This file
├── package.json                 <-- Node.js dependencies and scripts
├── postcss.config.js            <-- PostCSS configuration for Tailwind
├── tailwind.config.js           <-- Tailwind CSS configuration
├── vite.config.js               <-- Vite build configuration
└── requirements.txt             <-- Python dependencies
```