Transcribe and translate speech from a microphone or from computer output in real time, built on the faster-whisper library and the Google translation service.
- Python 3.11+
This repo uses a standard `src/` layout and is installable via `pyproject.toml`.
```
pip install .
```

After installation, start the GUI with either:

```
whispering
```

or:

```
python -m whispering
```

| Option | Description |
|---|---|
| Mic | Audio input device selection |
| Model size or path | Whisper model to use |
| Device | Computing device for inference |
| VAD | Enable Voice Activity Detection to filter out silence |
| Memory | Number of previous confirmed segments to use as prompts for context |
| Patience | Minimum time (seconds) to wait after a segment ends before confirming it |
| Timeout | HTTP timeout (seconds) for translation requests |
| Source | Source language for transcription (auto for automatic detection) |
| Target | Target language for translation (none to disable translation) |
| Prompt | Initial prompt text to guide transcription style/vocabulary |
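The GUI wires these options into faster-whisper internally. As a rough illustration only (not the app's actual code), the model-related options map onto the faster-whisper API roughly like this:

```python
from faster_whisper import WhisperModel

# "Model size or path" and "Device" select and place the model.
model = WhisperModel("small", device="cpu", compute_type="int8")

# "Source", "VAD", and "Prompt" correspond to transcribe() arguments.
segments, info = model.transcribe(
    "audio.wav",               # illustrative input file
    language=None,             # None = automatic detection ("auto" in the GUI)
    vad_filter=True,           # VAD: filter out silence
    initial_prompt="Technical vocabulary: Whisper, VAD",  # Prompt
)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```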
- Left panel: Displays transcription results
- Right panel: Displays translation results
- Black text: Confirmed, finalized text
- Blue underlined text: Draft text still being refined in the transcription window
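The two styles map naturally onto Tkinter text tags (the GUI is Tkinter-based). A minimal, illustrative sketch, not the app's actual widget code:

```python
import tkinter as tk

root = tk.Tk()
text = tk.Text(root)
text.pack()

# Draft segments: blue and underlined; confirmed text uses the default black.
text.tag_configure("draft", foreground="blue", underline=True)

text.insert("end", "This sentence is confirmed. ")
text.insert("end", "This one is still being refined.", "draft")

root.mainloop()
```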
The project follows a clean, modular architecture with clear separation of concerns:
```
whispering/
├── core/                        # Core abstractions and engine
│   ├── interfaces.py            # Abstract interfaces for all services
│   ├── engine.py                # STTEngine - orchestrates the pipeline
│   └── utils.py                 # Shared utilities (MergingQueue, Pair, Data)
├── services/                    # Pluggable service implementations
│   ├── audio/
│   │   └── soundcard_impl.py    # Audio recording via soundcard
│   ├── transcription/
│   │   └── whisper_impl.py      # Transcription via faster-whisper
│   └── translation/
│       └── google_impl.py       # Translation via Google Translate
└── gui.py                       # Tkinter-based GUI application
```
- **Interfaces** (`core/interfaces.py`): Defines abstract base classes for `RecordingService`, `TranscriptionService`, and `TranslationService`, enabling easy extension with custom implementations.
- **STTEngine** (`core/engine.py`): The central orchestrator that manages three parallel threads:
  - Record thread: Captures audio frames from the input device
  - Transcription thread: Processes audio frames and produces transcription pairs (confirmed + draft text)
  - Translation thread: Translates transcription results in real time
- **SoundcardRecordingService**: Records audio using the `soundcard` library, supporting both microphones and loopback (system audio capture).
- **WhisperTranscriptionService**: Uses faster-whisper for transcription with a sliding-window approach. Maintains a "transcription window" that accumulates audio until segments are confirmed based on the patience parameter.
- **GoogleTranslationService**: Translates text using Google's translation API with a configurable timeout.
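The exact signatures live in `core/interfaces.py` and aren't reproduced here; the following is a minimal sketch of what interfaces of this shape typically look like (method names and types are illustrative assumptions):

```python
from abc import ABC, abstractmethod
from typing import Iterator

import numpy as np


class RecordingService(ABC):
    """Produces raw audio frames from a microphone or loopback device."""

    @abstractmethod
    def frames(self) -> Iterator[np.ndarray]:
        """Yield successive audio frames from the input device."""


class TranscriptionService(ABC):
    """Turns accumulated audio into (confirmed, draft) text pairs."""

    @abstractmethod
    def transcribe(self, audio: np.ndarray) -> tuple[str, str]:
        """Return confirmed text plus the still-revisable draft text."""


class TranslationService(ABC):
    """Translates confirmed/draft text between languages."""

    @abstractmethod
    def translate(self, text: str, source: str, target: str) -> str:
        """Translate text from the source to the target language."""
```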
The modular architecture makes it easy to add custom implementations:
- **Custom Recording Service**: Implement the `RecordingService` and `RecordingServiceFactory` interfaces
- **Custom Transcription Service**: Implement the `TranscriptionService` and `TranscriptionServiceFactory` interfaces
- **Custom Translation Service**: Implement `TranslationService` (or extend `CoreTranslationService`) and the `TranslationServiceFactory` interface
Example: To use a different translation service, create a new implementation in `services/translation/` that implements the `TranslationService` interface, then update the GUI to use your factory.
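Assuming the interface resembles the sketch above, a minimal custom backend might look like this (the module path and import are assumptions, not the repo's actual layout):

```python
# src/whispering/services/translation/echo_impl.py -- hypothetical module
from whispering.core.interfaces import TranslationService  # import path assumed


class EchoTranslationService(TranslationService):
    """Toy backend that returns text unchanged; swap the body for calls
    to whatever translation API you prefer."""

    def translate(self, text: str, source: str, target: str) -> str:
        return text
```

A matching factory would then be registered in the GUI in place of the Google one.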
Once started, the program captures the audio stream in real time from the input device (microphone or computer output) and transcribes it. As each piece of audio is transcribed, the resulting text fragment is displayed on screen immediately.
To avoid inaccurate transcription caused by missing context or speech cut off mid-sentence, the program uses a "transcription window" mechanism. Segments that have been transcribed but not yet confirmed are displayed as underlined blue text. When the next audio chunk arrives, it is appended to the window. The audio in the window is transcribed iteratively, with results continually revised and updated, until a sentence is complete and has sufficient trailing context (determined by the patience parameter), at which point it is confirmed (turning into black text).
The last few confirmed segments (controlled by the memory parameter) are used as prompts for subsequent transcription to improve accuracy.
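Put together, the window-and-confirmation logic described in the last two paragraphs can be sketched as follows (self-contained toy code with assumed names; the real implementation lives in `services/transcription/whisper_impl.py`):

```python
import collections
from dataclasses import dataclass

PATIENCE = 2.0  # seconds of trailing audio required before confirming
MEMORY = 4      # confirmed segments kept as prompt context


@dataclass
class Segment:
    start: float
    end: float
    text: str


def split_window(window_end: float, segments: list[Segment],
                 confirmed: collections.deque) -> tuple[list[str], list[str]]:
    """One pass over the transcription window: segments with at least
    PATIENCE seconds of audio after them are confirmed; the rest stay drafts."""
    done, drafts = [], []
    for seg in segments:
        if window_end - seg.end >= PATIENCE:
            confirmed.append(seg.text)  # rendered as black text in the GUI
            done.append(seg.text)
        else:
            drafts.append(seg.text)     # rendered as blue underlined text
    return done, drafts


confirmed: collections.deque = collections.deque(maxlen=MEMORY)
segs = [Segment(0.0, 3.0, "Hello world."), Segment(3.2, 5.5, "Still being refi")]
done, drafts = split_window(window_end=6.0, segments=segs, confirmed=confirmed)
prompt = " ".join(confirmed)  # memory: passed as the prompt next iteration
print(done)    # ['Hello world.']
print(drafts)  # ['Still being refi']
```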
Simultaneously, the real-time transcription fragments are sent to Google Translate, and translation results are displayed in the right panel. Users can specify source and target languages using the respective dropdowns.
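Because the draft text is revised many times per second, translating every revision would be wasteful. One way to handle this is a coalescing consumer that only translates the newest pending text; the sketch below illustrates the idea (the repo's `MergingQueue` in `core/utils.py` presumably plays a similar role, but its exact API isn't shown here):

```python
import queue
import threading
import time

pending: queue.Queue = queue.Queue()


def latest(q: queue.Queue):
    """Block for one item, then drain any newer revisions so only
    the most recent text survives."""
    item = q.get()
    while True:
        try:
            item = q.get_nowait()
        except queue.Empty:
            return item


def translation_worker() -> None:
    while (text := latest(pending)) is not None:   # None = stop sentinel
        print("translated:", text.upper())  # stand-in for the real API call


threading.Thread(target=translation_worker, daemon=True).start()
for draft in ["hel", "hello wor", "hello world"]:
    pending.put(draft)   # rapid draft revisions pile up...
time.sleep(0.2)          # ...and stale ones are skipped if the worker lags
pending.put(None)
time.sleep(0.2)
```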
**Patience**: Determines the minimum time to wait for subsequent speech before confirming a segment.
- Too low: May confirm segments too early, resulting in incomplete sentences
- Too high: Creates lag as the transcription window accumulates too much content
**Memory**: Controls how many previous confirmed segments are used as context prompts.
- Too low: Insufficient context may reduce transcription accuracy
- Too high: Overly long prompts may slow down transcription
- Real-time iterative transcription: Continuously refines results while minimizing delay
- Smart sentence boundary detection: Automatically determines optimal points to confirm segments
- Simultaneous translation: Get translations in real time alongside transcription
- System audio capture: Can transcribe audio from any application via loopback devices (see the sketch below)
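System audio capture relies on the `soundcard` library's loopback devices. A standalone capture sketch (loopback availability is platform-dependent):

```python
import soundcard as sc

# Loopback devices mirror a speaker's output; availability depends on the
# platform and audio backend.
speaker = sc.default_speaker()
loopback = sc.get_microphone(id=str(speaker.name), include_loopback=True)

# Record five seconds of system audio at 16 kHz (Whisper's expected rate).
with loopback.recorder(samplerate=16000) as rec:
    data = rec.record(numframes=16000 * 5)  # numpy array, (frames, channels)

print(data.shape)
```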
- Transcription only: No internet required. Set target language to `none`.
- With translation: Internet connection required for the Google Translate API.
