
feat: realtime voice demo (fastrtc + Cartesia) alongside legacy Gradio#1

Open
deepmind11 wants to merge 10 commits into main from feat/realtime-voice

Conversation

@deepmind11
Owner

Summary

  • Adds constella-realtime, a streaming voice demo that replaces push-to-talk with VAD-based auto-commit (speak, pause, Ana answers). The existing constella-demo is unchanged and still works as the no-Cartesia-key fallback.
  • Pipeline: mic → fastrtc (WebRTC + Silero VAD) → Groq Whisper → run_turn() → Cartesia Sonic-3 streaming TTS → speaker. The 4-specialist constellation (orchestrator.py, primary.py, specialists/*, schemas.py) is untouched — only the I/O layer changes.
  • Primary agent now sees a 20-turn history window (was 6) and receives an explicit LANGUAGE DIRECTIVE derived from heuristic detection on the current utterance, so Ana matches the patient's turn-level language instead of defaulting to the profile's primary_language.
  • TTS picks a native-language voice per turn (Tessa for EN, Ximena for ES) so Spanish replies no longer sound American-accented.

What's new in the code

  • constella/realtime/ — new package (tts.py, audio.py) with the Cartesia streaming wrapper and numpy/WAV helpers.
  • constella/demo/realtime.py — new fastrtc-based UI (rich Blocks layout with mic, text fallback, latest reply, running transcript, specialist verdict JSON).
  • docs/realtime_architecture.md — design doc for the full Pipecat-based variant. Kept as reference for a future Phase-2 upgrade; not what's actually implemented here.
  • pyproject.toml — adds fastrtc[vad], cartesia[websockets], numpy, plus the constella-realtime script.
  • SETUP.md — documents CARTESIA_API_KEY + the optional CARTESIA_VOICE_EN / CARTESIA_VOICE_ES overrides.

What intentionally did NOT change

  • constella/orchestrator.py, constella/specialists/*, constella/schemas.py, constella/llm.py, constella/eval/, tests/test_schemas.py — zero changes. constella/primary.py gained a language-detection helper and the directive injection but kept the same public API and logic otherwise.

Test plan

Automated (green on every commit in the branch):

  • uv run pytest tests/ -v — 6 schema + orchestrator-merge tests pass
  • build_ui() constructs a gr.Blocks without error; Silero VAD model loads
  • _ensure_state() resolves module-level binding
  • Audio helpers round-trip: numpy → WAV → bytes on disk, and Cartesia PCM bytes → numpy frame
  • detect_utterance_language() classifies representative EN/ES/mix samples correctly, including short greetings like "como estas"
  • _pick_voice() honors CARTESIA_VOICE_ID override and falls back to per-language defaults
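The audio round-trip covered above can be sketched with the standard-library `wave` module. This is a minimal illustration of the conversion the helpers perform, assuming 16-bit mono PCM; the function names here are hypothetical stand-ins for whatever `constella/realtime/audio.py` actually exports.

```python
import io
import wave

import numpy as np


def pcm_to_wav_bytes(frame: np.ndarray, sample_rate: int = 16000) -> bytes:
    """Wrap a mono int16 numpy frame in an in-memory WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)           # mono
        wav.setsampwidth(2)           # 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(frame.astype(np.int16).tobytes())
    return buf.getvalue()


def wav_bytes_to_pcm(data: bytes) -> np.ndarray:
    """Inverse: read the raw int16 samples back out of WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        raw = wav.readframes(wav.getnframes())
    return np.frombuffer(raw, dtype=np.int16)
```

A lossless round trip (`frame == wav_bytes_to_pcm(pcm_to_wav_bytes(frame))`) is exactly what the automated test asserts.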

Manual (needs a human with a mic + speaker):

  • constella-realtime boots on :7860, Record button is visible and clickable
  • Speak English → Ana replies in English with Tessa voice
  • Speak Spanish → Ana replies in Spanish with Ximena voice (native Latina, no American accent)
  • UI updates after every turn (regression test for the gr.State / AdditionalOutputs issue)
  • Running transcript accumulates correctly

Known tradeoffs

  • Groq free tier. 5 concurrent LLM calls per turn can blow the 6000 TPM ceiling on the 8B model and add 2-8 s of backoff on bursts. A startup warning now points users to CONSTELLA_PROVIDER=openrouter or Groq Dev tier.
  • Voice identity shifts on code-switch. Using one voice per native language gives correct pronunciation but Ana's voice changes slightly mid-call when she switches languages. Set CARTESIA_VOICE_ID=<uuid> to override and use a single multilingual voice for both.
  • Spanglish ASR. Groq Whisper handles code-switching well, but it's batch — we run it on the VAD-committed clip after pause. A true streaming Spanglish-aware ASR (Gladia Solaria) is noted in docs/realtime_architecture.md as Phase-2 work.

Ana now sees the last ~10 exchanges instead of 3 when drafting a
reply. Conversation state has always kept the full history; only the
prompt slice was narrow.
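The change is a one-line slice on the prompt-building side. A minimal sketch, with a hypothetical function name (the real code lives in `primary.py`):

```python
HISTORY_WINDOW = 20  # turns (~10 patient/Ana exchanges); was 6 (~3 exchanges)


def prompt_history(history: list[dict]) -> list[dict]:
    """Slice the full stored history down to the window fed to the LLM.

    Conversation state keeps every turn; only this prompt-side slice
    was narrow before.
    """
    return history[-HISTORY_WINDOW:]
```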

Llama 3.3 70B context is 128k tokens, so even a 40-turn call fits
comfortably; the old 6-turn slice was a leftover from early testing.
Design doc for replacing the record-send-reply Gradio UX with a
Pipecat + Deepgram + Cartesia + Groq streaming pipeline. Constellation
stays identical; only the I/O layer changes.

Covers pipeline topology, parallelism strategy (A/B/C), session state,
barge-in, failure modes, env vars, tradeoffs, and phased implementation.
Nothing implemented yet — review before writing code.
Adds constella-realtime alongside the existing constella-demo:

  - fastrtc handles WebRTC + Silero VAD for auto-commit-on-pause
  - Groq Whisper still does ASR (batch, but on a short clip from VAD)
  - Cartesia Sonic-3 streams TTS for ~100 ms first-audible latency
  - Constellation (primary + 4 specialists + orchestrator) is unchanged

New layout:
  constella/realtime/   tts.py, audio.py — Cartesia + numpy/WAV helpers
  constella/demo/realtime.py — fastrtc Stream entrypoint

The legacy push-to-talk Gradio demo (constella-demo) is untouched and
still the right tool for dev without a Cartesia key.
…oq free tier

Three fixes in the realtime demo:

1. UI parity with the legacy Gradio demo — patient intro, mic + text
   input, example lines, Ana's latest reply, running transcript, and
   the specialist-verdict JSON are all back. fastrtc's WebRTC component
   is embedded in a custom gr.Blocks layout via AdditionalOutputs.

2. Language mismatch bug — the TTS hint was derived from the language
   specialist's verdict on the PATIENT's utterance. When Ana code-switches
   to Spanish to match the patient's register, the hint was still 'en' and
   Cartesia synthesized Spanish text with an American accent. Now detected
   from Ana's reply text via a small heuristic (Spanish chars, accented
   vowels, function-word count).
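A heuristic of this shape can be sketched as follows. This is an illustrative reconstruction, not the code in the PR: the marker sets are abbreviated and the thresholds are assumptions, but it shows the three signals named above (Spanish characters, accented vowels, function-word counts) feeding an en/es/mix decision.

```python
import re

# Characters that essentially only occur in Spanish text.
ES_CHARS = set("áéíóúüñ¿¡")

# Abbreviated function-word lists; the real ones would be longer.
ES_FUNCTION_WORDS = {
    "el", "la", "los", "las", "de", "que", "y", "en", "un", "una", "es",
    "está", "estás", "estas", "como", "cómo", "no", "sí", "si", "me", "te",
    "se", "mi", "muy", "pero", "para", "con",
}
EN_FUNCTION_WORDS = {
    "the", "a", "an", "is", "are", "and", "of", "to", "in", "you", "i",
    "it", "that", "for", "on", "with", "my", "your", "not", "do", "how",
    "what",
}


def detect_utterance_language(text: str) -> str:
    """Classify a short utterance as 'en', 'es', or 'mix'."""
    lowered = text.lower()
    # Accented vowels, ñ, or inverted punctuation are a strong Spanish signal.
    es_bias = 1 if any(ch in ES_CHARS for ch in lowered) else 0
    words = re.findall(r"[a-záéíóúüñ']+", lowered)
    es_score = sum(w in ES_FUNCTION_WORDS for w in words) + es_bias
    en_score = sum(w in EN_FUNCTION_WORDS for w in words)
    if es_score and en_score and min(es_score, en_score) / max(es_score, en_score) > 0.5:
        return "mix"  # both languages carry substantial weight
    if es_score > en_score:
        return "es"
    return "en"  # English also serves as the default when nothing matches
```

Note that short unaccented greetings like "como estas" still classify as Spanish because both tokens land in the Spanish function-word set.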

3. Startup warning — Groq free tier (6000 TPM on 8B) cannot sustain the
   5-concurrent-calls-per-turn burst of a realtime constellation. 429
   retries add 2-8 s of backoff per turn. We now warn at boot and point
   at CONSTELLA_PROVIDER=openrouter or Groq Dev tier as fixes.

Tests still green.
WebRTC defaulted to full_screen=True (1280x720), so the wave
visualization consumed the entire viewport and the Record button was
pushed below the fold. Also the default button_labels are empty
strings — icon-only. Setting full_screen=False + height=240 + explicit
Record/Stop/waiting labels restores a usable control.
fastrtc's set_args in tracks.py prepends "__webrtc_value__" when the
component value is passed as a string (always the case from Gradio).
After audio replacement, the handler is called with
(audio_tuple, webrtc_value, *real_inputs). The handler was signed for
2 args and blew up with 'takes 2 positional arguments but 3 were given'
on the first VAD commit.

Accepts the middle slot explicitly as _webrtc_value and discards it.
The previous default voice (Tessa) is native English. Passing
language='es' applied Spanish phonemes to her English speaker model,
which produces American-accented Spanish — unusable for a bilingual
healthcare agent.

Fix: pick a native-Spanish voice (Ximena, Latina female, calm
professional register) for es replies and keep Tessa for en. Voice
identity does shift mid-conversation when Ana code-switches, but that
matches how a bilingual nurse actually sounds.

Env overrides:
  CARTESIA_VOICE_ID       - single-voice override for both languages
  CARTESIA_VOICE_EN / ES  - per-language overrides
Ana was replying in Spanish even when the patient spoke English because:
  - Maria's profile says primary_language=es (strong bias in prompt)
  - Ana runs BEFORE the language specialist, so she had no signal
    about the current utterance's language
  - History from prior Spanish turns created conversational momentum

Adds detect_utterance_language() — same heuristic as the TTS language
detector (Spanish markers + function-word density, returning en/es/mix).
build_user_prompt() now injects a mandatory LANGUAGE DIRECTIVE so Ana
matches the current turn regardless of history or profile. The
profile field is demoted from 'Primary language' to 'Preferred
language at home (but ignore if they just spoke a different language)'.
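The injection can be sketched as follows; the directive wording here is illustrative, not the exact text in `primary.py`:

```python
def language_directive(detected: str) -> str:
    """Build the mandatory directive block from the detected language."""
    label = {
        "en": "English",
        "es": "Spanish",
        "mix": "the same English/Spanish mix",
    }[detected]
    return (
        "LANGUAGE DIRECTIVE (mandatory): the patient's current utterance "
        f"is in {label}. Reply in {label}, regardless of the profile's "
        "preferred language or the language of earlier turns."
    )


def build_user_prompt(utterance: str, detected: str) -> str:
    """Append the directive after the utterance so it is the last instruction."""
    return f"{utterance}\n\n{language_directive(detected)}"
```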
…ry turn

The UI was stuck on turn 1's reply because passing gr.State through
AdditionalOutputs created a re-entrant loop: each state update triggered
state_change, which re-entered the on_additional_outputs async handler
and silently dropped subsequent yields.

Fix: hold the ConversationState in a module-level _state (we already did
this in an earlier iteration — the rich UI pulled it out into gr.State,
which turned out to be the bug). AdditionalOutputs now carries only the
three display strings (nurse_text, verdict, transcript), not state. One
state per process is fine for a single-user dev demo; prod would key
by fastrtc session_id.

Handler signatures simplify correspondingly:
  _voice_handler(audio, _webrtc_value)         # was 3 args
  _text_handler(patient_text)                  # was 2 args
A previous edit lost the '_state: ConversationState | None = None' line
at module scope, so the first call into _ensure_state() raised
'NameError: name _state is not defined'. The UI build path never
invokes the state machinery, so the smoke test missed it — it only
surfaced when a mic turn actually fired.

Also added a test helper in the smoke suite: importing the module and
invoking _ensure_state() directly so future edits can't regress this
silently.
