
feat: realtime voice demo (fastrtc + Cartesia) alongside legacy Gradio#1

Open
deepmind11 wants to merge 10 commits into main from feat/realtime-voice

Conversation

@deepmind11
Owner

Summary

  • Adds constella-realtime, a streaming voice demo that replaces push-to-talk with VAD-based auto-commit (speak, pause, Ana answers). The existing constella-demo is unchanged and still works as the no-Cartesia-key fallback.
  • Pipeline: mic → fastrtc (WebRTC + Silero VAD) → Groq Whisper → run_turn() → Cartesia Sonic-3 streaming TTS → speaker. The 4-specialist constellation (orchestrator.py, primary.py, specialists/*, schemas.py) is untouched — only the I/O layer changes.
  • Primary agent now sees a 20-turn history window (was 6) and receives an explicit LANGUAGE DIRECTIVE derived from heuristic detection on the current utterance, so Ana matches the patient's turn-level language instead of defaulting to the profile's primary_language.
  • TTS picks a native-language voice per turn (Tessa for EN, Ximena for ES) so Spanish replies no longer sound American-accented.

What's new in the code

  • constella/realtime/ — new package (tts.py, audio.py) with the Cartesia streaming wrapper and numpy/WAV helpers.
  • constella/demo/realtime.py — new fastrtc-based UI (rich Blocks layout with mic, text fallback, latest reply, running transcript, specialist verdict JSON).
  • docs/realtime_architecture.md — design doc for the full Pipecat-based variant. Kept as reference for a future Phase-2 upgrade; not what's actually implemented here.
  • pyproject.toml — adds fastrtc[vad], cartesia[websockets], numpy, plus the constella-realtime script.
  • SETUP.md — documents CARTESIA_API_KEY + the optional CARTESIA_VOICE_EN / CARTESIA_VOICE_ES overrides.

What intentionally did NOT change

  • constella/orchestrator.py, constella/specialists/*, constella/schemas.py, constella/llm.py, constella/eval/, tests/test_schemas.py — zero changes. constella/primary.py gained a language-detection helper and the directive injection but kept the same public API and logic otherwise.

Test plan

Automated (green on every commit in the branch):

  • uv run pytest tests/ -v — 6 schema + orchestrator-merge tests pass
  • build_ui() constructs a gr.Blocks without error; Silero VAD model loads
  • _ensure_state() resolves module-level binding
  • Audio helpers round-trip: numpy → WAV → bytes on disk, and Cartesia PCM bytes → numpy frame
  • detect_utterance_language() classifies representative EN/ES/mix samples correctly, including short greetings like "como estas"
  • _pick_voice() honors CARTESIA_VOICE_ID override and falls back to per-language defaults
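The audio round-trip covered above can be sketched with the standard-library `wave` module. This is a minimal illustration of the conversion the helpers perform, assuming 16-bit mono PCM; the function names here are hypothetical stand-ins for whatever `constella/realtime/audio.py` actually exports.

```python
import io
import wave

import numpy as np


def pcm_to_wav_bytes(frame: np.ndarray, sample_rate: int = 16000) -> bytes:
    """Wrap a mono int16 numpy frame in an in-memory WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)           # mono
        wav.setsampwidth(2)           # 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(frame.astype(np.int16).tobytes())
    return buf.getvalue()


def wav_bytes_to_pcm(data: bytes) -> np.ndarray:
    """Inverse: read the raw int16 samples back out of WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        raw = wav.readframes(wav.getnframes())
    return np.frombuffer(raw, dtype=np.int16)
```

A lossless round trip (`frame == wav_bytes_to_pcm(pcm_to_wav_bytes(frame))`) is exactly what the automated test asserts.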

Manual (needs a human with a mic + speaker):

  • constella-realtime boots on :7860, Record button is visible and clickable
  • Speak English → Ana replies in English with Tessa voice
  • Speak Spanish → Ana replies in Spanish with Ximena voice (native Latina, no American accent)
  • UI updates after every turn (regression test for the gr.State / AdditionalOutputs issue)
  • Running transcript accumulates correctly

Known tradeoffs

  • Groq free tier. 5 concurrent LLM calls per turn can blow the 6000 TPM ceiling on the 8B model and add 2-8 s of backoff on bursts. A startup warning now points users to CONSTELLA_PROVIDER=openrouter or Groq Dev tier.
  • Voice identity shifts on code-switch. Using one voice per native language gives correct pronunciation but Ana's voice changes slightly mid-call when she switches languages. Set CARTESIA_VOICE_ID=<uuid> to override and use a single multilingual voice for both.
  • Spanglish ASR. Groq Whisper handles code-switching well, but it's batch — we run it on the VAD-committed clip after pause. A true streaming Spanglish-aware ASR (Gladia Solaria) is noted in docs/realtime_architecture.md as Phase-2 work.

Ana now sees the last ~10 exchanges instead of 3 when drafting a
reply. Conversation state has always kept the full history; only the
prompt slice was narrow.
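The change is a one-line slice on the prompt-building side. A minimal sketch, with a hypothetical function name (the real code lives in `primary.py`):

```python
HISTORY_WINDOW = 20  # turns (~10 patient/Ana exchanges); was 6 (~3 exchanges)


def prompt_history(history: list[dict]) -> list[dict]:
    """Slice the full stored history down to the window fed to the LLM.

    Conversation state keeps every turn; only this prompt-side slice
    was narrow before.
    """
    return history[-HISTORY_WINDOW:]
```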

Llama 3.3 70B context is 128k tokens, so even a 40-turn call fits
comfortably; the old 6-turn slice was a leftover from early testing.
Design doc for replacing the record-send-reply Gradio UX with a
Pipecat + Deepgram + Cartesia + Groq streaming pipeline. Constellation
stays identical; only the I/O layer changes.

Covers pipeline topology, parallelism strategy (A/B/C), session state,
barge-in, failure modes, env vars, tradeoffs, and phased implementation.
Nothing implemented yet — review before writing code.
Adds constella-realtime alongside the existing constella-demo:

  - fastrtc handles WebRTC + Silero VAD for auto-commit-on-pause
  - Groq Whisper still does ASR (batch, but on a short clip from VAD)
  - Cartesia Sonic-3 streams TTS for ~100 ms first-audible latency
  - Constellation (primary + 4 specialists + orchestrator) is unchanged

New layout:
  constella/realtime/   tts.py, audio.py — Cartesia + numpy/WAV helpers
  constella/demo/realtime.py — fastrtc Stream entrypoint

The legacy push-to-talk Gradio demo (constella-demo) is untouched and
still the right tool for dev without a Cartesia key.
…oq free tier

Three fixes in the realtime demo:

1. UI parity with the legacy Gradio demo — patient intro, mic + text
   input, example lines, Ana's latest reply, running transcript, and
   the specialist-verdict JSON are all back. fastrtc's WebRTC component
   is embedded in a custom gr.Blocks layout via AdditionalOutputs.

2. Language mismatch bug — the TTS hint was derived from the language
   specialist's verdict on the PATIENT's utterance. When Ana code-switches
   to Spanish to match the patient's register, the hint was still 'en' and
   Cartesia synthesized Spanish text with an American accent. Now detected
   from Ana's reply text via a small heuristic (Spanish chars, accented
   vowels, function-word count).
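A heuristic of this shape can be sketched as follows. This is an illustrative reconstruction, not the code in the PR: the marker sets are abbreviated and the thresholds are assumptions, but it shows the three signals named above (Spanish characters, accented vowels, function-word counts) feeding an en/es/mix decision.

```python
import re

# Characters that essentially only occur in Spanish text.
ES_CHARS = set("áéíóúüñ¿¡")

# Abbreviated function-word lists; the real ones would be longer.
ES_FUNCTION_WORDS = {
    "el", "la", "los", "las", "de", "que", "y", "en", "un", "una", "es",
    "está", "estás", "estas", "como", "cómo", "no", "sí", "si", "me", "te",
    "se", "mi", "muy", "pero", "para", "con",
}
EN_FUNCTION_WORDS = {
    "the", "a", "an", "is", "are", "and", "of", "to", "in", "you", "i",
    "it", "that", "for", "on", "with", "my", "your", "not", "do", "how",
    "what",
}


def detect_utterance_language(text: str) -> str:
    """Classify a short utterance as 'en', 'es', or 'mix'."""
    lowered = text.lower()
    # Accented vowels, ñ, or inverted punctuation are a strong Spanish signal.
    es_bias = 1 if any(ch in ES_CHARS for ch in lowered) else 0
    words = re.findall(r"[a-záéíóúüñ']+", lowered)
    es_score = sum(w in ES_FUNCTION_WORDS for w in words) + es_bias
    en_score = sum(w in EN_FUNCTION_WORDS for w in words)
    if es_score and en_score and min(es_score, en_score) / max(es_score, en_score) > 0.5:
        return "mix"  # both languages carry substantial weight
    if es_score > en_score:
        return "es"
    return "en"  # English also serves as the default when nothing matches
```

Note that short unaccented greetings like "como estas" still classify as Spanish because both tokens land in the Spanish function-word set.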

3. Startup warning — Groq free tier (6000 TPM on 8B) cannot sustain the
   5-concurrent-calls-per-turn burst of a realtime constellation. 429
   retries add 2-8 s of backoff per turn. We now warn at boot and point
   at CONSTELLA_PROVIDER=openrouter or Groq Dev tier as fixes.

Tests still green.
WebRTC defaulted to full_screen=True (1280x720), so the wave
visualization consumed the entire viewport and the Record button was
pushed below the fold. Also the default button_labels are empty
strings — icon-only. Setting full_screen=False + height=240 + explicit
Record/Stop/waiting labels restores a usable control.
fastrtc's set_args in tracks.py prepends "__webrtc_value__" when the
component value is passed as a string (always the case from Gradio).
After audio replacement, the handler is called with
(audio_tuple, webrtc_value, *real_inputs). The handler was signed for
2 args and blew up with 'takes 2 positional arguments but 3 were given'
on the first VAD commit.

Accepts the middle slot explicitly as _webrtc_value and discards it.
The previous default voice (Tessa) is native English. Passing
language='es' applied Spanish phonemes to her English speaker model,
which produces American-accented Spanish — unusable for a bilingual
healthcare agent.

Fix: pick a native-Spanish voice (Ximena, Latina female, calm
professional register) for es replies and keep Tessa for en. Voice
identity does shift mid-conversation when Ana code-switches, but that
matches how a bilingual nurse actually sounds.

Env overrides:
  CARTESIA_VOICE_ID       - single-voice override for both languages
  CARTESIA_VOICE_EN / ES  - per-language overrides
Ana was replying in Spanish even when the patient spoke English because:
  - Maria's profile says primary_language=es (strong bias in prompt)
  - Ana runs BEFORE the language specialist, so she had no signal
    about the current utterance's language
  - History from prior Spanish turns created conversational momentum

Adds detect_utterance_language() — same heuristic as the TTS language
detector (Spanish markers + function-word density, returning en/es/mix).
build_user_prompt() now injects a mandatory LANGUAGE DIRECTIVE so Ana
matches the current turn regardless of history or profile. The
profile field is demoted from 'Primary language' to 'Preferred
language at home (but ignore if they just spoke a different language)'.
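The injection can be sketched as follows; the directive wording here is illustrative, not the exact text in `primary.py`:

```python
def language_directive(detected: str) -> str:
    """Build the mandatory directive block from the detected language."""
    label = {
        "en": "English",
        "es": "Spanish",
        "mix": "the same English/Spanish mix",
    }[detected]
    return (
        "LANGUAGE DIRECTIVE (mandatory): the patient's current utterance "
        f"is in {label}. Reply in {label}, regardless of the profile's "
        "preferred language or the language of earlier turns."
    )


def build_user_prompt(utterance: str, detected: str) -> str:
    """Append the directive after the utterance so it is the last instruction."""
    return f"{utterance}\n\n{language_directive(detected)}"
```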
…ry turn

The UI was stuck on turn 1's reply because passing gr.State through
AdditionalOutputs created a re-entrant loop: each state update triggered
state_change, which re-entered the on_additional_outputs async handler
and silently dropped subsequent yields.

Fix: hold the ConversationState in a module-level _state (we already did
this in an earlier iteration — the rich UI pulled it out into gr.State,
which turned out to be the bug). AdditionalOutputs now carries only the
three display strings (nurse_text, verdict, transcript), not state. One
state per process is fine for a single-user dev demo; prod would key
by fastrtc session_id.

Handler signatures simplify correspondingly:
  _voice_handler(audio, _webrtc_value)         # was 3 args
  _text_handler(patient_text)                  # was 2 args
A previous edit lost the '_state: ConversationState | None = None' line
at module scope, so the first call into _ensure_state() raised
'NameError: name _state is not defined'. The UI build path never
invokes the state machinery, so the smoke test missed it — it only
surfaced when a mic turn actually fired.

Also added a test helper in the smoke suite: importing the module and
invoking _ensure_state() directly so future edits can't regress this
silently.
