feat: realtime voice demo (fastrtc + Cartesia) alongside legacy Gradio #1
Open
deepmind11 wants to merge 10 commits into main from
Conversation
Ana now sees the last ~10 exchanges instead of 3 when drafting a reply. Conversation state has always kept the full history; only the prompt slice was narrow. Llama 3.3 70B context is 128k tokens, so even a 40-turn call fits comfortably; the old 6-turn slice was a leftover from early testing.
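The widening amounts to a one-line slice change. A minimal sketch — `PROMPT_WINDOW` and `recent_exchanges` are illustrative names, not the repo's:

```python
# Assumed shape: history is the full list of (speaker, text) exchanges
# that conversation state has always kept.
PROMPT_WINDOW = 10  # was 3 exchanges (the old 6-turn slice)

def recent_exchanges(history):
    """Return only the last PROMPT_WINDOW exchanges for the prompt builder."""
    return history[-PROMPT_WINDOW:]
```

With a 128k-token context, even the full 40-turn call would fit; the slice just keeps the prompt focused.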
Design doc for replacing the record-send-reply Gradio UX with a Pipecat + Deepgram + Cartesia + Groq streaming pipeline. Constellation stays identical; only the I/O layer changes. Covers pipeline topology, parallelism strategy (A/B/C), session state, barge-in, failure modes, env vars, tradeoffs, and phased implementation. Nothing implemented yet — review before writing code.
Adds constella-realtime alongside the existing constella-demo:
- fastrtc handles WebRTC + Silero VAD for auto-commit-on-pause
- Groq Whisper still does ASR (batch, but on a short clip from VAD)
- Cartesia Sonic-3 streams TTS for ~100 ms first-audible latency
- Constellation (primary + 4 specialists + orchestrator) is unchanged

New layout:
- constella/realtime/ — tts.py, audio.py (Cartesia + numpy/WAV helpers)
- constella/demo/realtime.py — fastrtc Stream entrypoint

The legacy push-to-talk Gradio demo (constella-demo) is untouched and still the right tool for dev without a Cartesia key.
…oq free tier

Three fixes in the realtime demo:

1. UI parity with the legacy Gradio demo — patient intro, mic + text input, example lines, Ana's latest reply, running transcript, and the specialist-verdict JSON are all back. fastrtc's WebRTC component is embedded in a custom gr.Blocks layout via AdditionalOutputs.
2. Language mismatch bug — the TTS hint was derived from the language specialist's verdict on the PATIENT's utterance. When Ana code-switches to Spanish to match the patient's register, the hint was still 'en' and Cartesia synthesized Spanish text with an American accent. Now detected from Ana's reply text via a small heuristic (Spanish chars, accented vowels, function-word count).
3. Startup warning — Groq free tier (6000 TPM on 8B) cannot sustain the 5-concurrent-calls-per-turn burst of a realtime constellation. 429 retries add 2-8 s of backoff per turn. We now warn at boot and point at CONSTELLA_PROVIDER=openrouter or Groq Dev tier as fixes.

Tests still green.
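The boot-time warning in fix 3 can be as simple as an env check. A sketch — the helper name is hypothetical; only the env var names and the numbers come from the PR text:

```python
import os
import warnings

def warn_if_groq_free_tier():
    """Warn at boot when the provider likely can't sustain the per-turn burst.

    Assumption: CONSTELLA_PROVIDER defaults to "groq" when unset, and we
    can't distinguish free vs Dev tier from the env alone, so we warn on
    any Groq config.
    """
    if os.environ.get("CONSTELLA_PROVIDER", "groq") == "groq":
        warnings.warn(
            "Groq free tier (6000 TPM on 8B) cannot sustain the realtime "
            "constellation's 5-concurrent-calls-per-turn burst; 429 retries "
            "add 2-8 s of backoff per turn. Set CONSTELLA_PROVIDER=openrouter "
            "or upgrade to Groq Dev tier.",
            RuntimeWarning,
        )
```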
WebRTC defaulted to full_screen=True (1280x720), so the wave visualization consumed the entire viewport and the Record button was pushed below the fold. Also the default button_labels are empty strings — icon-only. Setting full_screen=False + height=240 + explicit Record/Stop/waiting labels restores a usable control.
fastrtc's set_args in tracks.py prepends "__webrtc_value__" when the component value is passed as a string (always the case from Gradio). After audio replacement, the handler is called with (audio_tuple, webrtc_value, *real_inputs). The handler was declared with two parameters and blew up with 'takes 2 positional arguments but 3 were given' on the first VAD commit. The fix accepts the middle slot explicitly as _webrtc_value and discards it.
The previous default voice (Tessa) is native English. Passing language='es' applied Spanish phonemes to her English speaker model, which produces American-accented Spanish — unusable for a bilingual healthcare agent. Fix: pick a native-Spanish voice (Ximena, Latina female, calm professional register) for es replies and keep Tessa for en. Voice identity does shift mid-conversation when Ana code-switches, but that matches how a bilingual nurse actually sounds.

Env overrides:
- CARTESIA_VOICE_ID — single-voice override for both languages
- CARTESIA_VOICE_EN / CARTESIA_VOICE_ES — per-language overrides
Ana was replying in Spanish even when the patient spoke English because:
- Maria's profile says primary_language=es (strong bias in prompt)
- Ana runs BEFORE the language specialist, so she had no signal
about the current utterance's language
- History from prior Spanish turns created conversational momentum
Adds detect_utterance_language() — same heuristic as the TTS language
detector (Spanish markers + function-word density, returning en/es/mix).
build_user_prompt() now injects a mandatory LANGUAGE DIRECTIVE so Ana
matches the current turn regardless of history or profile. The
profile field is demoted from 'Primary language' to 'Preferred
language at home (but ignore if they just spoke a different language)'.
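The heuristic the commit describes (Spanish markers + function-word density, returning en/es/mix) can be sketched like this. The marker and function-word sets are assumptions for illustration, not the repo's exact lists:

```python
import re

# Strong Spanish orthographic signals: inverted punctuation, n-tilde,
# accented vowels.
_ES_MARKERS = re.compile(r"[¿¡ñáéíóúü]")
# Small assumed function-word sets; the real lists may differ.
_ES_FUNCTION_WORDS = {
    "el", "la", "los", "las", "le", "de", "que", "y", "en", "un", "una",
    "es", "no", "me", "se", "por", "como", "está", "estas", "con",
}
_EN_FUNCTION_WORDS = {
    "the", "a", "an", "is", "are", "i", "you", "it", "and", "to",
    "of", "in", "that", "have", "not", "do", "what", "how",
}

def detect_utterance_language(text):
    """Classify an utterance as 'en', 'es', or 'mix' (heuristic sketch)."""
    words = re.findall(r"[a-záéíóúüñ]+", text.lower())
    es = sum(w in _ES_FUNCTION_WORDS for w in words)
    en = sum(w in _EN_FUNCTION_WORDS for w in words)
    if _ES_MARKERS.search(text.lower()):
        es += 2  # orthographic markers outweigh a stray function word
    if es and en:
        # Both present: call it mixed unless one side clearly dominates.
        return "es" if es > 2 * en else ("en" if en > 2 * es else "mix")
    return "es" if es else "en"
```

Note the unaccented "como estas" still resolves to es on function words alone, which is why the marker check is a bonus rather than a requirement.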
…ry turn

The UI was stuck on turn 1's reply because passing gr.State through AdditionalOutputs created a re-entrant loop: each state update triggered state_change, which re-entered the on_additional_outputs async handler and silently dropped subsequent yields. Fix: hold the ConversationState in a module-level _state (we already did this in an earlier iteration — the rich UI pulled it out into gr.State, which turned out to be the bug). AdditionalOutputs now carries only the three display strings (nurse_text, verdict, transcript), not state. One state per process is fine for a single-user dev demo; prod would key by fastrtc session_id.

Handler signatures simplify correspondingly:
- _voice_handler(audio, _webrtc_value)  # was 3 args
- _text_handler(patient_text)  # was 2 args
A previous edit lost the '_state: ConversationState | None = None' line at module scope, so the first call into _ensure_state() raised 'NameError: name _state is not defined'. The UI build path never invokes the state machinery, so the smoke test missed it — it only surfaced when a mic turn actually fired. Also added a test helper in the smoke suite: importing the module and invoking _ensure_state() directly so future edits can't regress this silently.
Summary
- constella-realtime, a streaming voice demo that replaces push-to-talk with VAD-based auto-commit (speak, pause, Ana answers). The existing constella-demo is unchanged and still works as the no-Cartesia-key fallback.
- Audio path: mic → fastrtc (WebRTC + Silero VAD) → Groq Whisper → run_turn() → Cartesia Sonic-3 streaming TTS → speaker. The 4-specialist constellation (orchestrator.py, primary.py, specialists/*, schemas.py) is untouched — only the I/O layer changes.
- Replies carry a LANGUAGE DIRECTIVE derived from heuristic detection on the current utterance, so Ana matches the patient's turn-level language instead of defaulting to the profile's primary_language.

What's new in the code
- constella/realtime/ — new package (tts.py, audio.py) with the Cartesia streaming wrapper and numpy/WAV helpers.
- constella/demo/realtime.py — new fastrtc-based UI (rich Blocks layout with mic, text fallback, latest reply, running transcript, specialist verdict JSON).
- docs/realtime_architecture.md — design doc for the full Pipecat-based variant. Kept as reference for a future Phase-2 upgrade; not what's actually implemented here.
- pyproject.toml — adds fastrtc[vad], cartesia[websockets], numpy, plus the constella-realtime script.
- SETUP.md — documents CARTESIA_API_KEY + the optional CARTESIA_VOICE_EN / CARTESIA_VOICE_ES overrides.

What intentionally did NOT change
- constella/orchestrator.py, constella/primary.py (logic), constella/specialists/*, constella/schemas.py, constella/llm.py, constella/eval/, tests/test_schemas.py — zero changes.
- primary.py gained a helper + directive injection but kept the same public API.

Test plan
Automated (green on every commit in the branch):
- uv run pytest tests/ -v — 6 schema + orchestrator-merge tests pass
- build_ui() constructs a gr.Blocks without error; Silero VAD model loads
- _ensure_state() resolves module-level binding
- detect_utterance_language() classifies representative EN/ES/mix samples correctly, including short greetings like "como estas"
- _pick_voice() honors CARTESIA_VOICE_ID override and falls back to per-language defaults

Manual (needs a human with a mic + speaker):
- constella-realtime boots on :7860, Record button is visible and clickable
- Reply and transcript update on every turn (no recurrence of the gr.State/AdditionalOutputs issue)
- Groq free tier cannot sustain the per-turn specialist burst; mitigate with CONSTELLA_PROVIDER=openrouter or Groq Dev tier.
- Voice identity shifts when Ana code-switches; set CARTESIA_VOICE_ID=<uuid> to override and use a single multilingual voice for both.
- The full streaming pipeline is deferred; see docs/realtime_architecture.md as Phase-2 work.