AI Pocket Interpreter

Privacy-First, 100% Offline real-time speech-to-text and multi-language translation — runs entirely on edge hardware with no cloud dependency.

Speak into your microphone in any supported language and see live captions translated into English, Greek, and Spanish streamed to your browser over WebSockets.

Overview

AI Pocket Interpreter captures live audio, detects speech boundaries, transcribes with OpenAI Whisper, translates via Meta's NLLB-200, and pushes results to a browser UI — all locally, all in real time.

Key highlights:

Zero internet required at runtime. Audio never leaves your device.
Dynamic VAD chunking prevents OOM on continuous speech (adaptive silence thresholds across three zones).
Context-overlap sliding window feeds trailing words to Whisper so mid-sentence cuts don't break transcription quality.
Single-process async architecture — FastAPI event loop coordinates audio capture, GPU inference, and WebSocket broadcast.

Architecture

Microphone → Silero-VAD → faster-whisper (STT) → Language Mapper → NLLB-200 (Translation) → WebSocket → Browser

Component	Role
AudioCapture	PyAudio callback stream → thread-safe queue
VADProcessor	Silero-VAD speech/silence state machine with dynamic thresholds
STTEngine	faster-whisper `large-v3-turbo` transcription + language detection
LanguageMapper	ISO 639-1 → FLORES-200 code mapping
TranslationEngine	CTranslate2 NLLB-200 distilled-1.3B (int8)
ConnectionManager	WebSocket client tracking and JSON broadcast
Frontend	Single-file HTML/CSS/JS — no external resources

Hardware Requirements

Designed for edge devices. The application automatically detects the best available backend at startup — no configuration needed.

Hardware	Device	Compute Type	Notes
Nvidia GPU (≥ 6 GB VRAM)	`cuda`	float16 / int8_float16	Recommended for real-time use
Intel N100 / any CPU	`cpu`	int8 / int8	Fully supported, slower throughput

Minimum system requirements:

Resource	Minimum
CPU	x86-64 with AVX2 (Intel N100 or better)
RAM	8 GB
GPU VRAM	6 GB (CUDA) — optional, falls back to CPU automatically
Storage	~4 GB for model weights

VRAM Budget (GPU mode)

Model	Quantization	Est. VRAM
Whisper large-v3-turbo	float16	~1.6 GB
NLLB-200 distilled-1.3B	int8_float16	~1.3 GB
Silero-VAD	float16	~10 MB
Total		~3.0 GB

Installation

# 1. Clone the repository
git clone https://github.com/your-org/ai-pocket-interpreter.git
cd ai-pocket-interpreter

# 2. Create a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# 3. Install system dependency for PyAudio
sudo apt-get install portaudio19-dev   # Debian/Ubuntu

# 4. Install Python dependencies
pip install -r requirements.txt

Model Download

The AI weights are not included in the repository. Download them into a models/ directory before first run.

NLLB-200 (CTranslate2, int8)

pip install huggingface_hub

# Download the pre-converted CTranslate2 model
huggingface-cli download JustFrederik/nllb-200-distilled-1.3B-ct2-int8 \
    --local-dir models/nllb-200-distilled-1.3b-ct2

Whisper large-v3-turbo

faster-whisper downloads and caches the model automatically on first run. No manual step needed — just ensure you have internet for the initial download, then it works offline forever.

Silero-VAD

Downloaded automatically via torch.hub on first run and cached locally.

Usage

# Start the server
uvicorn main:app --host 0.0.0.0 --port 8000

# Open in your browser
# http://localhost:8000

Speak into your microphone. Live captions appear in the browser with translations into English, Greek, and Spanish.

Project Structure

ai-pocket-interpreter/
├── main.py                 # FastAPI app, startup, pipeline wiring
├── audio_capture.py        # Microphone capture (PyAudio callback)
├── vad_processor.py        # Dynamic VAD state machine
├── stt_engine.py           # Whisper STT with context overlap
├── language_mapper.py      # ISO 639-1 → FLORES-200 mapping
├── translation_engine.py   # NLLB-200 translation (CTranslate2)
├── connection_manager.py   # WebSocket client management
├── models.py               # Data classes (AudioChunk, Payload, etc.)
├── frontend.html           # Self-contained browser UI
├── requirements.txt        # Python dependencies
└── models/                 # AI weights (not in repo — see above)

License

MIT

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AI Pocket Interpreter

Overview

Architecture

Hardware Requirements

VRAM Budget (GPU mode)

Installation

Model Download

NLLB-200 (CTranslate2, int8)

Whisper large-v3-turbo

Silero-VAD

Usage

Project Structure

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.gitignore		.gitignore
README.md		README.md
audio_capture.py		audio_capture.py
connection_manager.py		connection_manager.py
frontend.html		frontend.html
language_mapper.py		language_mapper.py
main.py		main.py
models.py		models.py
requirements.txt		requirements.txt
stt_engine.py		stt_engine.py
test_audio_capture.py		test_audio_capture.py
translation_engine.py		translation_engine.py
vad_processor.py		vad_processor.py

Folders and files

Latest commit

History

Repository files navigation

AI Pocket Interpreter

Overview

Architecture

Hardware Requirements

VRAM Budget (GPU mode)

Installation

Model Download

NLLB-200 (CTranslate2, int8)

Whisper large-v3-turbo

Silero-VAD

Usage

Project Structure

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages