Local Real-Time Conversational Pipeline (STT-LLM-TTS)

This project implements a real-time conversational pipeline that runs locally on your machine (optimized for macOS with Apple Silicon). It captures your voice, transcribes it to text, queries a local Large Language Model (LLM) for a response, synthesizes that response back into speech, and plays it—all while managing interruptions (barge-in) for a fluid interaction. The primary goal is to achieve minimal latency.

Key Features

Full Pipeline: Speech-to-Text -> LLM -> Text-to-Speech.
Real-Time & Low Latency: Uses asyncio, queues, audio chunking, and optimized (quantized) models.
STT: Uses Faster Whisper (default: small int8 quantized model) for fast and efficient CPU-based transcription.
LLM: Interacts with any local OpenAI-compatible API (e.g., LM Studio, Ollama) via SSE streaming.
TTS: Uses MLX Audio with the Kokoro model (default: 4-bit version), optimized for hardware acceleration (MPS) on Apple Silicon chips.
Barge-in (Interruption): The assistant's speech is automatically cut off if the user begins speaking again.
Configurable: Models, audio devices, silence thresholds, etc., can be adjusted via variables at the top of the script.
Multi-platform (with limitations): Primarily designed and optimized for macOS/Apple Silicon (due to MLX). The STT and LLM components can run on other platforms, but the MLX TTS will not function without Apple Silicon.

Technologies Used

Python 3.10+
Asyncio, Threading, Queue
Sounddevice (Audio I/O)
NumPy (Audio manipulation)
Aiohttp (Async HTTP client for LLM)
Faster Whisper (STT)
MLX Audio & Kokoro (TTS - Requires macOS/Apple Silicon)
PyTorch (Whisper inference & potentially Kokoro if non-MLX)
SoundFile (Optional audio saving)

Prerequisites

System: macOS (strongly recommended, especially Apple Silicon for MLX TTS). Partial functionality is possible on other OSes (without MLX TTS).
Python: Version 3.10 or higher.
FFmpeg: Required for torchaudio (used by mlx-audio and potentially faster-whisper). Install via Homebrew on Mac: brew install ffmpeg.
Git: To clone the repository.
Hugging Face Account: To download the Whisper and Kokoro models (requires huggingface-cli login).
Local LLM Server: An OpenAI-compatible API instance running locally (e.g., LM Studio, Ollama) and configured with a conversational model (e.g., Llama 3 Instruct).

Installation

Clone the repository:

git clone https://github.com/eauchs/speech-to-speech-pipeline.git
cd speech-to-speech-pipeline

Create and activate a virtual environment:

python3 -m venv .venv
source .venv/bin/activate

Install dependencies:
```
pip install -r requirements.txt
```
(Note: Installing mlx-audio and its dependencies may have specific steps on Mac. If pip install mlx-audio fails, consult their documentation.)
Log in to Hugging Face:
```
huggingface-cli login
```
(Follow the prompts to paste your access token).
Configure Your Local LLM:
- Launch your LLM server (LM Studio, Ollama...).
- Ensure it exposes an OpenAI-compatible API on http://localhost:1234 (or the address configured in the script).
- Load a suitable conversational model (e.g., Meta-Llama-3-8B-Instruct-GGUF).
- Update LLM_MODEL_NAME in the script if needed to match the model name loaded in your local server.

Configuration

Adjust the variables in the # --- Configuration --- section at the top of main_pipeline.py as needed:

WHISPER_MODEL_SIZE, FASTER_WHISPER_DEVICE, FASTER_WHISPER_COMPUTE_TYPE, STT_LANGUAGE: Settings for Faster Whisper STT.
LLM_API_ENDPOINT, LLM_MODEL_NAME, LLM_SYSTEM_PROMPT: Settings for the local LLM API.
KOKORO_MODEL_ID, KOKORO_VOICE, KOKORO_LANG_CODE, TTS_SPEECH_SPEED: Settings for Kokoro MLX TTS.
SAMPLE_RATE_INPUT, SAMPLE_RATE_OUTPUT, CHUNK_DURATION_MS, INPUT_DEVICE, OUTPUT_DEVICE: Audio parameters.
BUFFER_DURATION_S, SILENCE_THRESHOLD, SILENCE_CHUNKS_NEEDED: STT buffering and silence detection settings (crucial for responsiveness).
SAVE_TTS_AUDIO: Set to True to save the generated TTS audio files.

Usage

Ensure your local LLM server is running and configured.
Activate your virtual environment (source .venv/bin/activate).
Run the main script:
```
python main_pipeline.py
```
Wait for the message [Main] Pipeline running. Parlez en français.... (The models will be downloaded on the first run, which may take time).
Speak into your microphone. The assistant should respond after a short delay.
To interrupt the assistant while it's speaking, simply start speaking.
Press Ctrl+C in the terminal to stop the script cleanly.

Compatibility

This script is optimized for macOS with Apple Silicon due to its use of mlx-audio for TTS.
The STT (Faster Whisper on CPU) and LLM call should work on other platforms (Linux, Windows).
The TTS will likely not work on non-Apple Silicon platforms without modification to use a different TTS library (e.g., Coqui TTS, Piper, etc.).
Audio I/O via sounddevice should be cross-platform, but device names/indices may vary.

License

This project is licensed under the MIT License. See the LICENSE.md file for details.

Acknowledgements

Faster Whisper for efficient STT.
MLX Audio and the MLX community for the optimized Kokoro TTS.
Sounddevice for cross-platform audio access.
The open-source LLM community, especially MLX.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.gitignore		.gitignore
README.md		README.md
license.md		license.md
main-pipeline.py		main-pipeline.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Local Real-Time Conversational Pipeline (STT-LLM-TTS)

Key Features

Technologies Used

Prerequisites

Installation

Configuration

Usage

Compatibility

License

Acknowledgements

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Local Real-Time Conversational Pipeline (STT-LLM-TTS)

Key Features

Technologies Used

Prerequisites

Installation

Configuration

Usage

Compatibility

License

Acknowledgements

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages