feat(openai): add OpenAI STT provider support #12
nathanael-h wants to merge 4 commits into bigbluebutton:development
Conversation
The application only supports Gladia as the STT backend. Users who already have an OpenAI API key, or who run a self-hosted OpenAI-compatible Whisper server, cannot use the application without signing up for Gladia.

Add an OpenAI STT provider backed by livekit-agents[openai]. A new STT_PROVIDER env var (default: "gladia") selects the backend at startup. When set to "openai", an OpenAiSttAgent is used instead of GladiaSttAgent. Both agents implement the same EventEmitter interface, so main.py requires only minimal changes (provider selection plus an active_stt_config for confidence thresholds).

Key differences from the Gladia agent:
- update_locale_for_user() stops and restarts the pipeline instead of calling stream.update_options() (not supported by the OpenAI plugin).
- Confidence thresholds default to 0.0 because OpenAI STT does not report per-utterance confidence scores.
- alternative.language may be None; fall back to original_lang so the locale-mapping logic does not break.

New env vars: OPENAI_API_KEY, OPENAI_STT_MODEL, OPENAI_BASE_URL, OPENAI_INTERIM_RESULTS, OPENAI_MIN_CONFIDENCE_FINAL/INTERIM. OPENAI_BASE_URL allows pointing at any OpenAI-compatible endpoint (e.g. a local faster-whisper server).
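The startup selection described above can be sketched as a small factory keyed on STT_PROVIDER. The agent classes below are minimal stand-ins for the PR's GladiaSttAgent and OpenAiSttAgent, not the real implementations:

```python
import os


class GladiaSttAgent:
    """Stand-in for the existing Gladia-backed agent."""
    name = "gladia"


class OpenAiSttAgent:
    """Stand-in for the new OpenAI-backed agent."""
    name = "openai"


def select_stt_agent():
    """Pick the STT backend at startup from the STT_PROVIDER env var.

    Defaults to "gladia" to preserve existing behavior.
    """
    provider = os.environ.get("STT_PROVIDER", "gladia").lower()
    if provider == "openai":
        return OpenAiSttAgent()
    if provider == "gladia":
        return GladiaSttAgent()
    raise ValueError(f"Unsupported STT_PROVIDER: {provider!r}")
```

Because both classes expose the same EventEmitter interface, the rest of main.py can stay provider-agnostic.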
@nathanael-h Could you please confirm if you already sent in the signed Contributor License Agreement? See https://docs.bigbluebutton.org/support/faq.html#why-do-i-need-to-sign-a-contributor-license-agreement-to-contribute-source-code Thanks in advance!

Hello @prlanzarin I've just signed the CLA and received a confirmation email about this.
@nathanael-h Over the weekend, I worked on a refactor that should allow adding multiple STT providers in an easier/cleaner way (at least one that reduces code duplication). See this dev branch: https://github.com/bigbluebutton/bbb-livekit-stt/tree/stt/refactor/generic-providers. Feel free to rebase this PR against that branch and use it as target if you have the time. I think it'll make this PR leaner. Otherwise, let me know and I could look into it later.
The openai plugin's stream() always connects via WebSocket to the /realtime endpoint, which is not implemented by all OpenAI-compatible backends. Switch to recognize(), which uses the standard REST /audio/transcriptions endpoint instead. Audio is segmented into speech utterances using energy-based silence detection (RMS threshold) before each recognize() call. Also fixes a NoneType crash in the Redis message handler that occurred when a message arrived before the agent had connected to the room.
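The energy-based segmentation mentioned above can be illustrated with a small sketch: compute the RMS of each 16-bit PCM frame, and close an utterance after a run of low-energy frames. The thresholds and helper names here are illustrative, not the PR's actual values:

```python
import math
import struct

SILENCE_RMS = 500        # assumed int16 RMS threshold; tune per deployment
SILENCE_FRAMES_END = 10  # consecutive silent frames that close an utterance


def frame_rms(frame: bytes) -> float:
    """RMS energy of a 16-bit little-endian PCM frame."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))


def segment_utterances(frames):
    """Yield speech segments delimited by runs of low-energy frames."""
    buf, silent = [], 0
    for frame in frames:
        if frame_rms(frame) >= SILENCE_RMS:
            buf.append(frame)
            silent = 0
        elif buf:
            # Keep trailing silence inside the segment until the run is long
            # enough to end the utterance, then flush it to recognize().
            silent += 1
            buf.append(frame)
            if silent >= SILENCE_FRAMES_END:
                yield b"".join(buf)
                buf, silent = [], 0
    if buf:
        yield b"".join(buf)  # flush any trailing speech at end of stream
```

Each yielded segment would then be wrapped in a WAV container and passed to a single recognize() call.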
The livekit OpenAI plugin's recognize() uses the OpenAI Python SDK which
constructs the URL as {base_url}/audio/transcriptions (no /v1/), causing
405 Method Not Allowed on backends like my-selfhosted-openwebui.com/api/.
Replace with a direct aiohttp POST to {base_url}/v1/audio/transcriptions,
matching the approach used in bbb-livekit-transcriber. Also manage the
aiohttp session lifecycle within the agent.
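A sketch of the direct-POST approach the commit describes, assuming the standard multipart form fields (`file`, `model`) of the OpenAI transcription API; the function and field names beyond those are illustrative:

```python
import aiohttp


def transcription_url(base_url: str) -> str:
    """Build {base_url}/v1/audio/transcriptions, tolerating a trailing slash."""
    return f"{base_url.rstrip('/')}/v1/audio/transcriptions"


async def transcribe(session: aiohttp.ClientSession, base_url: str,
                     api_key: str, model: str, wav_bytes: bytes) -> str:
    """POST one WAV utterance and return the transcribed text."""
    form = aiohttp.FormData()
    form.add_field("file", wav_bytes,
                   filename="audio.wav", content_type="audio/wav")
    form.add_field("model", model)
    async with session.post(
        transcription_url(base_url),
        data=form,
        headers={"Authorization": f"Bearer {api_key}"},
    ) as resp:
        resp.raise_for_status()
        payload = await resp.json()
        return payload.get("text", "")
```

The explicit `/v1/` prefix is what avoids the 405 seen when the SDK builds `{base_url}/audio/transcriptions` against a base URL like `my-selfhosted-openwebui.com/api/`.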
Oh nice, I looked quickly at your work to decouple this STT plugin from Gladia, and it looks good. On my side, I fixed the branch nathanael-h:feat/openai-stt of this PR, still targeting the main branch from before you decoupled them. Now that I have reached a good milestone, I will create another branch and PR to rebase on top of your work. @prlanzarin are more commits expected in https://github.com/bigbluebutton/bbb-livekit-stt/tree/stt/refactor/generic-providers that could impact adding OpenAI support? If not, I will start "rebasing" on it.
No commits expected for now - go for it. |
…dings

Each speech segment was missing start_time/end_time on SpeechData, causing all transcripts to share the same transcriptId (open_time + 0.0). BBB's AudioCaptions model treated every utterance after the first as a same-ID update, returning empty text, which resulted in an empty VTT file in recordings even though live captions worked correctly.
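The collision can be shown with a tiny model. SpeechData below is a stand-in for livekit's type, and the id format is hypothetical; the grounded part is that the id is derived from open_time plus the utterance's start_time, so a constant 0.0 start made every utterance map to the same id:

```python
from dataclasses import dataclass


@dataclass
class SpeechData:
    """Minimal stand-in for livekit's stt.SpeechData (illustrative only)."""
    text: str
    start_time: float = 0.0  # was always left at 0.0 before the fix
    end_time: float = 0.0


def transcript_id(open_time: float, data: SpeechData) -> str:
    # BBB derives the transcript id from open_time + utterance start time;
    # with start_time stuck at 0.0, every utterance collided on one id.
    return str(int((open_time + data.start_time) * 1000))
```

With real per-segment start times, consecutive utterances get distinct ids and AudioCaptions no longer treats them as updates to the first one.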
Superseded by #13 |
As I run a self-hosted OpenAI-compatible https://speaches.ai/ Faster-Whisper server, I am interested in using this instead of Gladia.
I open this PR as a Draft because it is not ready to be reviewed. But since the code is here, I think it is better for me to build in the open rather than hidden on a local branch on my laptop. I also want to be transparent regarding LLM usage: I used Claude to help me with this. Maintainers can edit this branch!
This PR adds an OpenAI STT provider backed by livekit-agents[openai]. A new STT_PROVIDER env var (default: "gladia") selects the backend at startup. When set to "openai", an OpenAiSttAgent is used instead of GladiaSttAgent. Both agents implement the same EventEmitter interface, so main.py requires only minimal changes (provider selection plus an active_stt_config for confidence thresholds).
LLM assertions that I need to verify; read with caution:
Here are some differences from the Gladia agent:
- update_locale_for_user() stops and restarts the pipeline instead of calling stream.update_options() (not supported by the OpenAI plugin).
- Confidence thresholds default to 0.0 because OpenAI STT does not report per-utterance confidence scores.
- alternative.language may be None; fall back to original_lang so the locale-mapping logic does not break.
New env vars: OPENAI_API_KEY, OPENAI_STT_MODEL, OPENAI_BASE_URL, OPENAI_INTERIM_RESULTS, OPENAI_MIN_CONFIDENCE_FINAL/INTERIM. OPENAI_BASE_URL allows pointing at any OpenAI-compatible endpoint (e.g. a local faster-whisper server).
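A possible .env fragment wiring these variables together; the values are illustrative placeholders (model name, URL, and key are not taken from the PR), with the confidence thresholds at their documented 0.0 defaults:

```shell
# Select the OpenAI-compatible backend instead of Gladia
STT_PROVIDER=openai
OPENAI_API_KEY=sk-...
OPENAI_STT_MODEL=whisper-1
# Any OpenAI-compatible endpoint, e.g. a local faster-whisper server
OPENAI_BASE_URL=https://my-whisper.example.com
OPENAI_INTERIM_RESULTS=true
# OpenAI STT reports no per-utterance confidence, so thresholds default to 0.0
OPENAI_MIN_CONFIDENCE_FINAL=0.0
OPENAI_MIN_CONFIDENCE_INTERIM=0.0
```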
Related meta issue bigbluebutton/bigbluebutton#21059