
Dawn-Of-Justice/SpeakSense


SpeakSense

A replacement for wake-word technology in robots, using multimodal AI to understand when people are addressing them. Think of ElevenLabs' turn-taking mechanism, but for robots and on steroids: it also uses visual cues such as gaze and body language to determine whether someone is talking to the robot or just nearby. Ideal for robots that need to be always listening but not always responding, such as home assistants and social robots.

See it in action ⬇️

SpeakSense Live Demo

SpeakSense Workflow

πŸ—οΈ Project Structure

SpeakSense/
├── backend/                   # Python backend services
│   ├── src/
│   │   ├── models/            # ML models (ASD, audio classification)
│   │   ├── services/          # Core services (LLM, transcription)
│   │   ├── api/               # FastAPI endpoints and WebSocket
│   │   └── utils/             # Utility functions
│   ├── tests/                 # Backend tests
│   ├── requirements.txt       # Python dependencies
│   └── setup.py               # Backend setup script
├── frontend/                  # Next.js React frontend
│   ├── app/                   # Next.js app directory
│   ├── public/                # Static assets
│   └── package.json           # Node.js dependencies
├── models/                    # Trained model files and weights
│   ├── trained/               # Trained model checkpoints
│   └── weights/               # Model weight files
├── data/                      # Data files and datasets
│   ├── raw/                   # Raw audio/video data
│   ├── processed/             # Processed features and outputs
│   └── assets/                # Project assets (demos, diagrams)
├── scripts/                   # Utility scripts and tools
├── notebooks/                 # Jupyter notebooks for experiments
├── docs/                      # Documentation and research papers
└── config/                    # Configuration files

🚀 Quick Start

Option 1: Automated Setup

# Set up both backend and frontend
python manage.py --setup

# Start backend server
python manage.py --start-backend

# Start frontend server (in another terminal)
python manage.py --start-frontend

Option 2: Docker Setup

# Build and run with Docker Compose
python manage.py --docker

Option 3: Manual Setup

Backend Setup

  1. Navigate to the backend directory:
cd backend
  2. Run the setup script:
python setup.py
  3. Start the backend server:
python src/api/fastapi_websocket_server.py

Frontend Setup

  1. Navigate to the frontend directory:
cd frontend
  2. Install dependencies:
npm install
  3. Start the development server:
npm run dev

Access the Application

🧠 How It Works

SpeakSense uses a multimodal approach combining:

  • Active Speaker Detection (ASD): Identifies who is speaking in video
  • Audio Classification: Determines if speech is directed at the assistant
  • Visual Analysis: Analyzes gaze direction and body language
  • Natural Language Processing: Understands speech intent and context
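The modality outputs above ultimately have to be fused into a single yes/no decision. A minimal late-fusion sketch, assuming each detector emits a confidence in [0, 1]; the weights, threshold, and function name are illustrative, not taken from the SpeakSense codebase:

```python
# Hypothetical late fusion: combine per-modality confidences into one
# "addressing the robot" decision. Weights and threshold are illustrative.

def is_addressing_robot(asd_score, gaze_score, intent_score,
                        weights=(0.4, 0.3, 0.3), threshold=0.6):
    """Weighted late fusion of modality confidences in [0, 1]."""
    fused = (weights[0] * asd_score
             + weights[1] * gaze_score
             + weights[2] * intent_score)
    return fused >= threshold, fused

# Someone speaking while gazing at the robot with a command-like utterance:
addressed, score = is_addressing_robot(0.9, 0.8, 0.7)
```

In practice a learned fusion layer (see the model phases below) replaces the fixed weights, but the decision shape is the same.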

πŸ“Š Phase 1: Data Collection & Preparation

Collect multimodal training data

  • Record video, audio, and transcripts of people talking to and around the robot
  • Include diverse scenarios (directly addressing robot, talking nearby but not to robot)
  • Label data with "addressing robot" vs "not addressing robot" classifications
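One labeled sample might be stored as a JSON record along these lines; the field names and paths here are hypothetical, not the project's actual format:

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical schema for one labeled training sample.
@dataclass
class Sample:
    clip_id: str
    video_path: str
    audio_path: str
    transcript: str
    addressing_robot: bool  # label: speech directed at the robot or not

sample = Sample("clip_0001", "data/raw/clip_0001.mp4",
                "data/raw/clip_0001.wav",
                "hey, can you turn on the lights?", True)
record = json.dumps(asdict(sample))
```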

Feature extraction pipeline

  • Implement the active speaker detection model (Liao et al.)
  • Set up basic visual feature extraction (gaze, orientation)
  • Configure audio preprocessing pipeline
  • Establish transcription service integration
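The audio preprocessing step typically starts by framing the waveform and computing per-frame features. A dependency-light sketch, with illustrative frame and hop sizes (25 ms / 10 ms at 16 kHz), not the project's configured values:

```python
import numpy as np

# Minimal audio preprocessing: split a mono signal into overlapping frames
# and compute per-frame RMS energy.
def frame_signal(signal, frame_len=400, hop=160):
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def rms_energy(frames):
    return np.sqrt(np.mean(frames ** 2, axis=1))

sig = np.random.randn(16000)          # 1 s of fake 16 kHz audio
frames = frame_signal(sig)            # (98, 400)
energy = rms_energy(frames)           # (98,)
```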

Phase 2: Initial Model Development

Build baseline model

  • Implement a simple Bidirectional LSTM architecture
  • Create input pipelines for each modality
  • Design feature fusion mechanism
  • Develop training and evaluation scripts
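A sketch of what such a baseline could look like in PyTorch, assuming simple concatenation as the fusion mechanism and classification from the final time step; all dimensions and the class name are illustrative:

```python
import torch
import torch.nn as nn

# Per-modality features are concatenated per frame (early fusion) and fed
# to a bidirectional LSTM; dimensions are illustrative, not the project's.
class AddresseeBiLSTM(nn.Module):
    def __init__(self, audio_dim=40, visual_dim=32, text_dim=64, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(audio_dim + visual_dim + text_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)  # "addressing robot" logit

    def forward(self, audio, visual, text):
        x = torch.cat([audio, visual, text], dim=-1)  # (B, T, D)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # classify from the last time step

model = AddresseeBiLSTM()
logit = model(torch.randn(2, 50, 40), torch.randn(2, 50, 32),
              torch.randn(2, 50, 64))  # shape (2, 1)
```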

Basic training and validation

  • Train on clear-cut examples first
  • Implement cross-validation strategy
  • Establish baseline metrics for accuracy, latency, and resource usage
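The cross-validation strategy reduces to generating disjoint train/validation index splits. A library such as scikit-learn would normally be used; this is a dependency-free sketch:

```python
import numpy as np

# Simple k-fold index split for cross-validation.
def kfold_indices(n_samples, k=5, seed=0):
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

splits = list(kfold_indices(100, k=5))  # 5 (train, val) index pairs
```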

Phase 3: Model Enhancement

Improve feature engineering

  • Refine visual features (add sustained gaze detection, orientation angles)
  • Enhance audio features (directivity, voice characteristics)
  • Develop linguistic feature extraction (pronoun detection, imperative forms)
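The pronoun and imperative cues could start as simple lexical checks before anything learned; the word lists below are illustrative and far from exhaustive:

```python
import re

# Hypothetical linguistic-cue extractor: flags second-person pronouns and
# imperative-like openings in a transcript.
PRONOUNS = {"you", "your", "yours"}
IMPERATIVE_VERBS = {"turn", "play", "stop", "tell", "show", "set"}

def linguistic_cues(transcript):
    words = re.findall(r"[a-z']+", transcript.lower())
    return {
        "second_person": any(w in PRONOUNS for w in words),
        "imperative": bool(words) and words[0] in IMPERATIVE_VERBS,
    }

cues = linguistic_cues("Turn on the lights, please")
```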

Architectural improvements

  • Add attention mechanisms
  • Implement hierarchical structure for modality processing
  • Optimize layer configurations
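For reference, the attention mechanism the first bullet alludes to is typically scaled dot-product attention over time steps. A NumPy sketch of the computation, not the project's actual layer:

```python
import numpy as np

# Scaled dot-product attention: weight the value vectors of each time step
# by the softmax of query-key similarity.
def attention(q, k, v):
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v, w

q = np.random.randn(1, 16)   # one query over...
k = np.random.randn(10, 16)  # ...10 time steps of keys
v = np.random.randn(10, 32)
ctx, weights = attention(q, k, v)  # ctx (1, 32), weights (1, 10)
```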

Advanced training techniques

  • Implement curriculum learning
  • Add data augmentation for edge cases
  • Fine-tune hyperparameters

Phase 4: System Integration

Develop real-time processing pipeline

  • Create efficient preprocessing modules
  • Implement sliding window for contextual memory
  • Design adaptive thresholding system
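A sliding window plus adaptive threshold can be as simple as a bounded buffer of recent fused scores compared against an exponential moving average of the ambient level. The class, window size, and smoothing factor below are illustrative:

```python
from collections import deque

# Sliding-window score buffer with an adaptive (EMA-based) threshold.
class AdaptiveGate:
    def __init__(self, window=10, alpha=0.1, margin=0.15):
        self.scores = deque(maxlen=window)  # recent fused scores
        self.baseline = 0.0                 # EMA of ambient score level
        self.alpha, self.margin = alpha, margin

    def update(self, score):
        self.scores.append(score)
        self.baseline += self.alpha * (score - self.baseline)
        # fire when the windowed average clears the adapted baseline
        avg = sum(self.scores) / len(self.scores)
        return avg > self.baseline + self.margin

gate = AdaptiveGate()
fired = [gate.update(s) for s in [0.1, 0.1, 0.1, 0.9, 0.9]]
```

Because the baseline adapts, a noisy room raises the bar, while a quiet one lets weaker cues through.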

Optimize for low-end devices

  • Quantize model weights
  • Implement model pruning
  • Profile and optimize critical paths
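The weight quantization step boils down to the arithmetic below. A real deployment would use the framework's quantization toolkit; this symmetric int8 sketch just shows the round trip:

```python
import numpy as np

# Post-training symmetric int8 quantize / dequantize of a weight matrix.
def quantize_int8(w):
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = np.abs(w - w_hat).max()  # bounded by ~scale / 2
```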

Create staged activation system

  • Develop always-on lightweight monitoring
  • Build trigger mechanism for full model activation
  • Implement power management strategies
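The staged activation idea can be sketched as a cheap always-on check gating the expensive multimodal model; `run_full_model` and the energy threshold here are hypothetical stand-ins:

```python
# Staged activation: a lightweight energy check decides whether the full
# multimodal pipeline runs at all.

def run_full_model(frame):
    return {"addressing_robot": True}  # placeholder for the real pipeline

def staged_pipeline(frames, energy_threshold=0.5):
    results = []
    for frame in frames:
        if frame["energy"] < energy_threshold:
            results.append(None)                   # stage 1: stay low-power
        else:
            results.append(run_full_model(frame))  # stage 2: full model
    return results

out = staged_pipeline([{"energy": 0.1}, {"energy": 0.8}])
```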

Phase 5: Testing & Refinement

Controlled environment testing

  • Measure accuracy metrics in controlled settings
  • Benchmark latency and resource usage
  • Identify common failure cases

Real-world testing

  • Deploy prototype in various environments
  • Collect user feedback on naturalism and responsiveness
  • Log false positives and false negatives

Model refinement

  • Retrain with additional edge cases
  • Fine-tune confidence thresholds
  • Optimize for specific deployment environments

Phase 6: Deployment & Learning

Full system deployment

  • Integrate with robot's main systems
  • Implement logging for continuous improvement
  • Develop update mechanism

Continuous learning

  • Add capability to learn from successful interactions
  • Implement personalization for specific users
  • Create feedback mechanism for misinterpretations

About

SpeakSense is a multimodal deep learning project that detects when a user is speaking to a virtual assistant by analyzing both audio and video in real time.
