feat(meet-agent): Flow A second brain — orchestrator + tools + pre-roll + cache + barge-in#2503
Draft
oxoxDev wants to merge 65 commits into
The modal used to POST /mascots/join-meeting to the backend Camoufox bot
(Flow B). Two production blockers there:
- Firefox / Camoufox bypasses our JS getUserMedia override at the C++
native layer, so the mascot Y4M never replaces the bot's camera and
the tile is a static placeholder.
- Chromium / Chrome variants get rejected by Meet's anti-bot screen
("You can't join this video call") before they reach the join page.
Flow A (PR tinyhumansai#1350 + tinyhumansai#1359) sidesteps both: it opens a dedicated, profile-
isolated CEF webview on the user's machine, installs the audio + video
bridges via CDP at document-start, and lets meet_scanner drive the join.
The mascot canvas IS the outbound camera and the synthesized speech IS
the outbound mic — the user's OS mic is never wired to the meeting.
Surfaces the meeting-bots entry next to the speak-replies toggle on /human so users can dispatch the mascot directly from the chat surface without flipping to the Skills tab. Same modal, same Flow A backing — just an additional surface.
macOS Cocoa clamps NSWindow frame origins to keep the window at least partially on-screen, so the (-30000, -30000) requested via the builder lands as (0, 0) and the bot's Meet CEF window pops up visible — the user can see + interact with the bot's pre-join UI, which defeats the 'invisible bot' premise. Re-apply the off-screen position post-build via Tauri's set_position API (which hits the runtime's CEF set_position path, bypassing the initial-bounds clamp). Belt-and-suspenders with window.minimize() so even on builds where the position still leaks through Cocoa, the window doesn't visibly cover the user's main openhuman surface.
macOS restores a minimized window on the next focus event, which means the previously-minimized bot CEF window pops back up over the user's main openhuman surface as soon as anything brings the app to front. Worse UX than a window stuck off-screen — drop the minimize(). Also close any lingering meet-call-* window before opening a new one. Each Join was spawning a fresh request_id-keyed window without reclaiming the previous bot's resources, so the Dock accumulated "Meet — OpenHuman" windows and the listen_capture audio handler registry got two competing CEF audio handlers fighting over the same URL. Finally, log the actual outer_position post-build so we can verify in the log whether macOS still clamps (-30000, -30000) → (0, 0) or whether the runtime's CEF set_position path took effect this time.
macOS Cocoa clamps NSWindow frame origins to the union of all attached monitors' bounds, so even (-30000, -30000) lands on a secondary display on multi-monitor setups (e.g. (-1692, 66) on a left-extended layout). Confirmed via the post-build outer_position log line: the bot's Meet pre-join surface ends up visible on the user's second screen, which still defeats the 'invisible bot' premise. Swap to window.hide() instead — that calls macOS [NSWindow orderOut:] which removes the window from screen + Dock without releasing the backing surface. The renderer keeps painting, CDP keeps working, and all the existing scanner / audio-bridge / camera-bridge plumbing continues to function. Critically different from .visible(false) at builder time, which never gives the renderer a backing surface and silently breaks layout + clicks (see the existing builder comment for the original reasoning).
…r working
Hiding the window at post-build time stripped CEF's renderer of its key-window state and the meet_scanner's CDP `Input.dispatchMouseEvent` clicks landed on un-rendered DOM, so the bot never got past the pre-join screen.
Move the hide() call into `meet_scanner::spawn` on the Ok branch of the join sequence, which fires after "Ask to join" has been clicked and Meet has confirmed entry into the waiting room. By then the renderer has done its layout, gUM has fired (so the audio + camera bridges have taken hold), and the CDP session is in steady-state streaming captions + speech. orderOut: just removes the window from screen + Dock without releasing the backing surface, so all of that keeps running while the user no longer sees the bot.
Pre-join, the window is positioned off-screen at (-30000, -30000) and macOS clamps it onto whatever monitor it can find, so on multi-display setups the user sees a flash of the bot's pre-join page on their secondary monitor for ~7 s before it goes away. Best we can do without restructuring CEF's headless-render path.
Meet defaults camera + mic OFF for new participants. If the scanner
just types a name and clicks Join, the bot lands in the meeting muted
with no camera — Meet never calls getUserMedia, the audio + camera
bridges have nothing to intercept (audio_context_state stays
'not-created', camera bridge canvas is never selected as the outbound
track), and the speak_pump can't push synthesized PCM into a live
mic track because there is no live mic track.
Add a Phase 2.5 between display-name and Ask-to-join that clicks the
camera and mic toggles ON. The toggles are icon buttons with no
visible text, so the existing wait_and_click_text helper (which
matches innerText) won't find them — introduce a sibling matcher
click_by_aria_label that walks button/aria-label nodes and matches
on case-insensitive substring against a list of canonical Meet
labels ("turn on camera", "camera is off", etc).
Both clicks are best-effort: if Meet's aria copy has drifted by
region / A-B test we log and continue. The bot still joins, just
without that capability.
Camera + mic toggle clicks timed out in the latest smoke. Meet's
aria-label copy doesn't match the narrow list shipped in the previous
commit, so the bot kept joining muted with no camera — Meet never
called getUserMedia, the audio + video bridges stayed inert
(audio_context_state stuck at not-created, destination_track_count
stuck at 0), and the speak_pump pushed PCM into a stream that
doesn't exist.
Three changes:
- Broaden the matcher list to include the toggled-on variants (Meet
sometimes ships pre-join in 'Turn off camera' state by default when
the previous session left the toggle on), and include the
keyboard-shortcut suffix variants ('camera (cmd+e)').
- Bump the per-toggle budget from 4 s to 12 s. Pre-join layout settles
~3-5 s after name input on slower CEF builds; 4 s left us racing.
- On miss, dump the matching aria-labels via a CDP Runtime.evaluate
helper so the next smoke surfaces the actual strings Meet shipped
this region/build, and we can extend the matcher precisely instead
of guessing.
Booby-trap fix. Meet's toggle aria-label describes the *action* the click would perform — "Turn on camera" when off, "Turn off camera" when on. My previous matcher included both directions, so when the device was already ON the matcher hit the "Turn off" variant and the click flipped it OFF. That's what muted the bot in the last smoke: mic started ON (or got auto-enabled by Meet between page-load and our scan), 'Turn off microphone' matched, we clicked, mic ended up muted. Trim both matchers to ON-only variants. If the device is already on, no match means we leave it alone — correct outcome. If both directions miss, dump aria-labels via the existing helper so we can extend. Also drops the cmd-shortcut and bare 'off' variants — they were either ambiguous or duplicates of the canonical 'Turn on …' / '… is off' pair, and removing them tightens the matcher window against future Meet copy drift.
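A minimal sketch of the ON-only matching described above. It assumes the matcher is a case-insensitive substring test; the constant name `CAMERA_ON_LABELS` and the function `label_matches` are hypothetical, and only the two label strings quoted earlier in this PR are used.

```rust
/// Hypothetical ON-only label list. These two strings come from the PR text;
/// the labels Meet actually ships may differ by region or build.
const CAMERA_ON_LABELS: &[&str] = &["turn on camera", "camera is off"];

/// Returns true when `aria_label` contains any needle, ignoring case.
/// Because only "turn ON" style needles are listed, a toggle that is
/// already on ("Turn off camera") never matches and is left alone.
fn label_matches(aria_label: &str, needles: &[&str]) -> bool {
    let haystack = aria_label.to_lowercase();
    needles.iter().any(|n| haystack.contains(&n.to_lowercase()))
}
```

With this list, a label such as "Turn off camera" produces no match, which is the intended "leave it alone" outcome.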
Smoke shows audio_context_state stuck at 'not-created' and no push_caption RPC after the post-join hide. Both consistent with the hidden renderer (orderOut: under the hood) pausing its event loop — the captions_bridge MutationObserver never fires, the audio bridge's gUM intercept never gets a fresh getUserMedia call from Meet, and the speak_pump pushes PCM into a destination stream that was never attached to any outbound track. Temporarily revert the hide to confirm the diagnosis. With the window visible we should see audio_context_state transition to 'running' and push_caption start firing as the user speaks the wake word. If that holds, restore hiding via a non-orderOut mechanism (set_position to a far-off-screen value via the runtime path, or set_size to 1x1, or the CefBrowserHost::set_audio_muted route from the deferred follow-up list).
…ilence
When the wake-word caption arrives with no tail ("Hey Openhuman" by
itself, with no question following), session.take_pending_prompt
returns None and run_caption_turn silently returns Ok(false). From
the user's side this looks identical to the bot being broken — the
wake-word fired log appears in the dev:app stdout but no audible
reply ever follows.
Treat empty-tail wake as a 'say hi back' greeting cue: synthesize
a short ack so the user gets audible proof that the
caption→wake→speak loop is wired end-to-end. Reuses the existing
pick_ack_phrase / stub_tts fallbacks so this works without backend.
Smoke now traceable in logs: 'caption turn bare-wake (no tail)' →
'caption turn start … bare_wake=true' → ack reply enqueued →
speak_pump pushes PCM. If the user STILL hears nothing after this,
the failure has moved past brain to the audio_bridge intercept
(destination_track_count stuck at 0 because Meet cached its
pre-bridge MediaStream), which is the next thing to fix.
captions_bridge.js auto-enables CC by polling every 2s for a button whose aria-label starts with 'turn on captions' (indexOf === 0). Two weaknesses surfaced in smoke:
1. Meet ships variants like 'Turn on captions (c)' in some regions; the keyboard-shortcut parenthesis breaks the strict prefix match.
2. The polling cap (30 attempts * 2s = 60s) can expire before a slow host admits the bot from the waiting room.
Add a Phase 4 to the Rust scanner: after clicking Ask-to-join, poll the in-call control bar for a 'Leave call' / 'End call' affordance, which is the cleanest signal the bot got admitted. Once admitted, click the captions toggle from the scanner side using the existing click_by_aria_label substring matcher, which is looser than the JS prefix matcher and handles the cmd-shortcut variant.
Belt-and-suspenders: if either step times out, log and continue. The brain just sees no captions for that session, no worse than the pre-patch state. Admission budget is 120s to give the host plenty of time before we give up; both this loop and the captions_bridge poll run in parallel so whichever notices the CC button first wins.
Captions are flowing into the rpc handler (7 push_captions in ~10s
in the latest smoke) but no 'wake word fired' lines show up. Two
candidates:
(a) user said something that does not contain 'hey openhuman' in
Meet's normalised caption text — even after normalize_for_wake
strips punctuation
(b) normalisation is dropping/altering the match string before
session.note_caption searches it
Log every push_caption's text + wake_fired so the next smoke shows
the exact string Meet's STT produced and whether the matcher fired.
Truncated to 120 chars so a long caption doesn't blow up the log line.
Captions are already on the wire to every meeting participant, so
no new exposure surface here.
… gUM Smoke shows the full caption→wake→brain→TTS→speak_pump pipeline fires end-to-end (caption_turn_done reply_chars=12 synth_samples=3200) but the host hears nothing. Root cause: audio_bridge.js's getUserMedia intercept never fires — Meet caches its initial mic MediaStream from page load (before our bridges installed) and reuses it across the bridge-driven reload, so the bot's outbound mic track keeps pointing at the real OS microphone (MacBook Pro Microphone per the aria-label dump). The synthesised PCM that speak_pump pushes ends up in a MediaStreamDestination that's never attached to anything Meet broadcasts. Add a Phase 3.5 right after Ask-to-join: click 'Turn off microphone', pause ~700 ms for React to settle, then click 'Turn on microphone'. The second click triggers Meet to re-request its mic via getUserMedia, which our bridge now intercepts and replaces with the synthesised destination stream — destination_track_count flips from 0 → 1 and the bot's outbound mic becomes the brain's TTS pump output. Camera off-on cycle deliberately not added: the fake-camera Y4M flag already feeds Meet a one-frame mascot via Chromium's process-level fake-video-capture path, so the bot's tile shows the mascot already. The video animation upgrade lives in the separate MascotFrameProducer encode-bottleneck follow-up.
…are 'openhuman'
Smoke caption 'I, Hi Openhuman.' did not fire the wake word because the previous matcher only knew 'hey openhuman' / 'hey open human'. Meet's STT also routinely drops the 'hey' prefix, splits the brand into 'Open Human' (two words), or substitutes 'Hi'/'Hello'.
Expand the matcher to a small ordered list, checked longest-first so the tail offset is calculated against the matched phrase length, not the wake-prefix length:
hey open human, hi open human, hello open human,
hey openhuman, hi openhuman, hello openhuman,
open human, openhuman
Bare 'openhuman' is in the list because Meet's STT will sometimes drop both the greeting AND the space, leaving the brand alone in the caption. Risk of false-positives is low: 'openhuman' isn't a common English token, and 'open human' as a 2-word collocation is almost only ever the brand spoken aloud.
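A rough sketch of the longest-first wake matching, under stated assumptions: `normalize_for_wake` is named in this PR but its body here is a guess (lowercase, punctuation turned into spaces, whitespace collapsed), and `match_wake` is a hypothetical helper that returns the tail after the matched phrase.

```rust
/// Wake phrases taken from the commit message above.
const WAKE_PHRASES: &[&str] = &[
    "hey open human", "hi open human", "hello open human",
    "hey openhuman", "hi openhuman", "hello openhuman",
    "open human", "openhuman",
];

/// Assumed normalisation: lowercase ASCII, punctuation replaced by spaces,
/// runs of whitespace collapsed to a single space.
fn normalize_for_wake(text: &str) -> String {
    let cleaned: String = text
        .chars()
        .map(|c| if c.is_alphanumeric() { c.to_ascii_lowercase() } else { ' ' })
        .collect();
    cleaned.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Returns Some(tail) when a wake phrase is present, None otherwise.
/// Phrases are tried longest-first so the tail starts after the full
/// matched phrase ("hi openhuman") rather than a shorter one ("openhuman").
fn match_wake(caption: &str) -> Option<String> {
    let text = normalize_for_wake(caption);
    let mut phrases: Vec<&str> = WAKE_PHRASES.to_vec();
    phrases.sort_by_key(|p| std::cmp::Reverse(p.len()));
    for phrase in phrases {
        if let Some(start) = text.find(phrase) {
            let tail = text[start + phrase.len()..].trim();
            return Some(tail.to_string());
        }
    }
    None
}
```

An empty tail corresponds to the bare-wake case handled by the earlier "say hi back" commit.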
Latest smoke aborted at the Ask-to-join click (Meet UI variant; bot got admitted manually) and the post-join mic-cycle never ran: the flow returns Err and any later phase is skipped. Bot ended up broadcasting the real OS mic.
Move Phase 3.5 → Phase 2.6: cycle the mic right after the name input, before clicking Ask-to-join. The cycle is best-effort either way, but this site is more reliable:
- Pre-join is when Meet's React happily re-acquires media on toggle; in-call cycling can race the join handshake.
- The mic cycle now runs even when Ask-to-join itself times out, so a manual join from the host still leaves the bot with the gUM intercept armed.
- The Ask-to-join click stays best-effort (it still propagates Err so the caller knows the scanner gave up driving the page), but the gUM bootstrap is no longer gated on it.
…le session
Smoke against the staging backend hit a new failure: the bot CEF webview landed on Google's 'Verify it's you' page for the user's own email (nikhil@tinyhumans.ai) instead of the anonymous 'Your name' pre-join input the scanner drives. The vendored tauri-cef runtime does not yet honour our per-request_id `data_directory` as a fresh CEF RequestContext: webviews effectively share the parent process's cookie + cache store, and Meet recognises the signed-in Google account on the user's main openhuman session.
Add a Phase 0 in meet_scanner::run that:
- enables the Network CDP domain
- calls Network.clearBrowserCookies on the meet target
- calls Network.clearBrowserCache too (belt-and-suspenders)
- issues Page.reload with ignoreCache=true so Meet's React state re-fetches from a clean slate
- sleeps 1500ms to let the reloaded page settle before scanner phases start poking the DOM
These CDP commands are scoped to the attached browser instance, so they wipe the session for THIS Meet target without touching the user's main openhuman webviews (those run in separate browser instances). Best-effort: if Network isn't reachable we log and continue. The proper fix is a per-RequestContext CEF profile in the vendored runtime; that lives in the deferred follow-up.
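The Phase 0 sequence can be sketched as the raw CDP messages it would send. This only builds the JSON payloads; the helper name `phase0_commands` and the sequential `id` numbering are assumptions, and how the scanner actually transports them (and waits 1500ms afterwards) is not shown.

```rust
/// Builds the Phase 0 CDP messages in order, as JSON strings.
/// Method names and the `ignoreCache` parameter are standard CDP;
/// the id numbering is illustrative only.
fn phase0_commands() -> Vec<String> {
    let calls = [
        ("Network.enable", "{}"),
        ("Network.clearBrowserCookies", "{}"),
        ("Network.clearBrowserCache", "{}"),
        ("Page.reload", r#"{"ignoreCache":true}"#),
    ];
    calls
        .iter()
        .enumerate()
        .map(|(i, (method, params))| {
            format!(
                r#"{{"id":{},"method":"{}","params":{}}}"#,
                i + 1,
                method,
                params
            )
        })
        .collect()
}
```

Each message would be treated as best-effort: a failed send is logged and the scanner moves on to the next phase.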
…terrupt on new turn
Three deep gaps surfaced once the staging backend was online and
real LLM + ElevenLabs were producing 60+ second replies:
1. Echo / noise loop. Meet labels its placeholder + accessibility
strings under speaker='You' (the local participant tag), which
includes a multi-paragraph 'sample caption' demo string staging's
captioning UI emits every 250ms. Each scrape re-fired the wake
word ('openhuman' literal lives inside that demo string) and the
bot kept replying to its own broadcast. note_caption now drops
captions where speaker.lowercase() == 'you' (or empty).
2. Bot was speaking its own chain-of-thought. The reasoning models
on staging emit a <think>...</think> block ahead of the actual
user-facing reply; strip_for_speech happily passed it through to
TTS, producing a minute of internal monologue. Strip the think
blocks before any other markdown clean-up. Unclosed <think> at
end of output drops everything from the tag onwards.
3. Bot wouldn't stop talking. speak_pump just drains whatever is
queued — if a new wake fires while the previous reply is still
playing, the old PCM finishes BEFORE the new reply starts.
run_caption_turn now calls session.cancel_outbound() at start,
which clears the outbound buffer and flips outbound_done so the
page bridge sees end-of-utterance cleanly. Bot becomes
interruptible — user can re-fire the wake word and the previous
reply is cut short.
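A minimal sketch of the think-block stripping from item 2, assuming plain string scanning rather than a regex. The function name `strip_think_blocks` is hypothetical; the real pass lives inside strip_for_speech.

```rust
/// Removes every <think>...</think> block. An unclosed <think> drops
/// everything from the tag to the end of the text, as described above.
fn strip_think_blocks(input: &str) -> String {
    const OPEN: &str = "<think>";
    const CLOSE: &str = "</think>";
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(start) = rest.find(OPEN) {
        // Keep the text before the opening tag.
        out.push_str(&rest[..start]);
        match rest[start..].find(CLOSE) {
            // Skip past the closing tag and keep scanning.
            Some(end) => rest = &rest[start + end + CLOSE.len()..],
            // Unclosed block: discard the remainder.
            None => return out.trim().to_string(),
        }
    }
    out.push_str(rest);
    out.trim().to_string()
}
```

Running this before the markdown clean-up keeps reasoning text from ever reaching the later passes.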
Three guards stack to make the bot loop-proof when running with a real LLM that produces 30s+ replies on staging:
1. Speaking gate. session.note_caption refuses to fire a fresh wake while the outbound TTS queue still has audio. Without this, the user continuing to speak (or Meet captioning the bot's own voice) during a long reply lands a second wake, brain cancels the first and starts a new turn, repeated forever. Captions still record to the transcript log with a "(suppressed: bot speaking)" tag so we keep the diagnostic trail.
2. Server-side caption dedup. Meet's CC region re-renders the same line every 250 ms poll tick, and the page-side lastBySpeaker dedup keys on a speaker guess that flips for the same row when the avatar marker comes and goes. Defensive (speaker, text) signature on the session drops verbatim repeats before they hit the wake matcher or the RPC log.
3. TTS char cap. Reasoning models on staging routinely emit 800+ char replies despite REPLY_MAX_TOKENS=220 (token budget is per the user-facing text, not the <think> trace). New cap_for_speech trims to 400 chars at the last sentence terminator inside the budget; falls back to a hard cut + ellipsis. ~25s of speech at average prosody, short enough to stay interruptible.
Together these break the speak-listen-speak loop the user hit on the "Hey Openhuman, can you hear me?" round trip.
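The char cap in guard 3 might look roughly like the following. The signature and the choice of `.`, `!`, `?` as sentence terminators are assumptions; only the behaviour (trim at the last terminator inside the budget, otherwise hard cut plus ellipsis) comes from the commit message.

```rust
/// Caps `text` at `max_chars` characters for TTS.
/// Prefers cutting at the last sentence terminator inside the budget;
/// if there is none, hard-cuts and appends an ellipsis.
fn cap_for_speech(text: &str, max_chars: usize) -> String {
    if text.chars().count() <= max_chars {
        return text.to_string();
    }
    // Byte offset just past the first `max_chars` characters.
    let cut = text
        .char_indices()
        .nth(max_chars)
        .map(|(i, _)| i)
        .unwrap_or(text.len());
    let head = &text[..cut];
    match head.rfind(|c: char| matches!(c, '.' | '!' | '?')) {
        // Terminators are single-byte ASCII, so `..=idx` is a valid slice.
        Some(idx) => head[..=idx].to_string(),
        None => format!("{}...", head.trim_end()),
    }
}
```

Counting characters rather than bytes keeps the cut on a valid UTF-8 boundary for non-ASCII replies.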
…mode prompt
The previous prompt asked for "1-2 sentences" but reasoning-style backends (DeepSeek / GMI / qwen flavours routed under model="agentic-v1") routinely ignored soft length hints and emitted 800+ char monologues. cap_for_speech trimmed them at 400 chars but the TTS still ran 25s per turn, long enough that the user couldn't get a word in edgewise.
Two changes:
1. REPLY_MAX_TOKENS 220 → 80. ~60 spoken words ≈ ~12s of audio. Hard ceiling regardless of model verbosity.
2. MEETING_SYSTEM_PROMPT rewritten as strict numbered rules: "ONE sentence, max 25 spoken words, no chain-of-thought, no <think> blocks, plain spoken English". Address-detection and dictation rules preserved but condensed.
Combined with cap_for_speech(400) and the speaking gate, the bot now produces one short answer per wake instead of a minute-long reply that locks the loop open.
Real second-brain (tools+memory+calendar via Agent::from_config_for_agent) is the next commit per the approved plan.
…soning
Root cause of "bot reads its chain-of-thought aloud" (e.g. "We need to
generate a single sentence, max 25 words, plane spoken English. The user
said hello. This is a greeting addressed to Openhuman. So I should respond
with a greeting."): the bare /openai/v1/chat/completions endpoint pinned
to model="agentic-v1", which is a reasoning-style model. Reasoning models
emit their internal chain-of-thought as PLAIN TEXT (not <think> tags) in
the completion body when called outside the structured thinking_delta
channel — senamakel's chat path consumes those events separately and
shows them as a status, but a raw chat/completions call gets them
concatenated into the response. TTS then reads the whole thing aloud.
Three changes:
1. Pin model to chat-v1 (MODEL_CHAT_V1 in
src/openhuman/config/schema/types.rs:17). chat-v1 is the
conversational non-reasoning model — produces a direct user-facing
answer suited to voice. Same family of aliases used by other entry
points; no infra change required.
2. Add strip_untagged_reasoning() pass in strip_for_speech. Defensive
heuristic against future model swaps: drops sentences whose lower-
case trim begins with known reasoning openers ("We need to…",
"I should…", "Let me…", "The user said…", "So I should…", etc.).
If every sentence matches, returns the last sentence (final
conclusion) instead of empty string.
3. Tighter MEETING_SYSTEM_PROMPT with NO-CHAIN-OF-THOUGHT rules +
explicit good/bad examples. Even though chat-v1 doesn't reason out
loud, the prompt now defends against accidental leaks if the router
ever falls back to a reasoning tier.
Real second-brain (Agent::from_config_for_agent / channels-style chat
path) is still the next commit per the approved plan — this is the
defence-in-depth that fixes the spoken-out-loud reasoning today.
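A sketch of the strip_untagged_reasoning heuristic from item 2, assuming sentences are split on `.`, `!`, and `?`. The opener list is limited to the examples quoted above; the real list is longer ("etc." in the commit message) and is not reproduced here.

```rust
/// Reasoning openers quoted in the commit message (not the full list).
const REASONING_OPENERS: &[&str] = &[
    "we need to", "i should", "let me", "the user said", "so i should",
];

/// Drops sentences that start with a known reasoning opener.
/// If every sentence matches, returns the last one (the final conclusion)
/// instead of an empty string.
fn strip_untagged_reasoning(text: &str) -> String {
    let sentences: Vec<&str> = text
        .split_inclusive(|c: char| matches!(c, '.' | '!' | '?'))
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .collect();
    let kept: Vec<&str> = sentences
        .iter()
        .copied()
        .filter(|s| {
            let lower = s.to_lowercase();
            !REASONING_OPENERS.iter().any(|o| lower.starts_with(o))
        })
        .collect();
    if kept.is_empty() {
        return sentences.last().copied().unwrap_or("").to_string();
    }
    kept.join(" ")
}
```

Being a prefix heuristic, it can also drop a legitimate reply that happens to begin with "Let me", which is one reason it sits behind the model pin rather than replacing it.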
… in voice
The bot now answers via the SAME path as the chat UI and the webview meet
handoff: Agent::from_config_for_agent(&config, "orchestrator"). It
inherits the user's connected integrations, memory tree, MCP clients,
skills, and the project-wide tool registry. Whatever the user has wired
in their core is available to the bot day-one — no per-tool plumbing in
meet_agent.
Pipeline now:
caption / STT → llm_meeting_agentic (orchestrator + tools + memory)
↓ on error: llm_meeting_basic (bare chat-v1)
↓ on error: stub / canned ack
→ strip_for_speech → cap_for_speech(400) → TTS
Why agentic-first, basic-as-fallback:
- Agentic gives real answers ("is my Friday evening free", "what did
Alice say about the deploy", "remember to mail Bob tomorrow"). The
orchestrator runs the same tool-iteration loop the chat UI does.
- Basic exists only so a config / registry / token issue doesn't kill
the call. Degrades to a polite reply instead of dead air.
- Reasoning leak ("We need to generate a single sentence…") was the
symptom that motivated this commit; the proper fix is routing through
the channels-style path because that path consumes thinking_delta
events separately and never lands them in the response body.
MEET_VOICE_DIRECTIVE prepended to every user utterance constrains the
orchestrator's reply to one short spoken sentence (max 25 words, no
markdown, no preamble, no chain-of-thought). The directive is wrapped
in a delimiter so the orchestrator can't confuse it with the user's
literal speech.
AGENTIC_TURN_TIMEOUT_SECS = 20 wraps run_single so a slow tool
iteration doesn't leave the meeting participant in indefinite silence.
On timeout the basic-LLM fallback fires.
strip_for_speech + cap_for_speech(400) still run on the harness output
as TTS hygiene — tool-use markers / citations / markdown leak through
even on chat-v1, and the agent reply can be longer than the
voice-budget if the orchestrator decides a fuller answer is right.
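The agentic-first, basic-as-fallback chain with a turn timeout can be sketched with standard-library threads. This is a stand-in only: the real code presumably wraps an async run_single, and `reply_with_fallback` plus the canned ack string are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `agentic` on a worker thread and waits up to `timeout`.
/// On error or timeout, falls back to `basic`, then to a canned ack.
fn reply_with_fallback<A, B>(agentic: A, basic: B, timeout: Duration) -> String
where
    A: FnOnce() -> Result<String, String> + Send + 'static,
    B: FnOnce() -> Result<String, String>,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may be gone after a timeout; ignore the send error.
        let _ = tx.send(agentic());
    });
    match rx.recv_timeout(timeout) {
        Ok(Ok(reply)) => reply,
        // Agentic error, timeout, or worker panic: degrade gracefully.
        _ => basic().unwrap_or_else(|_| "One moment.".to_string()),
    }
}
```

In the pipeline above, the returned string would then pass through strip_for_speech and cap_for_speech(400) before TTS.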
…integrations from_config_for_agent builds the orchestrator with ZERO integrations attached — saw "[orchestrator_tools] assembled 9 delegation tool(s) for agent 'orchestrator' (0 integrations connected)" in the bot path log, versus "10 delegation tool(s) (119 integrations connected)" for the chat UI path. The web channel uses Agent::from_config_for_agent_with_profile (channels/providers/web.rs:1570) which is what wires the integrations in. Switch the meet-agent path to the same builder. Pass MEET_VOICE_DIRECTIVE as profile_prompt_suffix instead of prepending to the user message — same hook the web channel uses for locale-reply directives. The orchestrator now reads the voice-frontend constraint at system-prompt level, which is the right altitude (it's a channel-wide contract, not a per-utterance instruction). Per-meet event-context + agent-definition-name (orchestrator_meet_<id>) so the harness scopes its session transcript to this request_id — otherwise two simultaneous orchestrators (chat UI + meet bot) would share one transcript file. Strengthened MEET_VOICE_DIRECTIVE wording — explicit "tool-use is great, but only the final spoken reply should appear in your output" so the orchestrator knows it CAN run tools (calendar, memory, integrations) but should suppress narration about them. Net effect: bot now has the user's full 119-integration tool surface available, plus the voice-mode output contract.
…anscript resume Every turn was hitting: "400 An assistant message with 'tool_calls' must be followed by tool messages responding to each 'tool_call_id'" Root cause: the harness auto-resumes prior transcripts when an agent_definition_name matches a file on disk. A prior turn was killed mid-tool-call (app restart while orchestrator was awaiting tool output), leaving an assistant message with `tool_calls` and no follow-up `tool` reply. Every subsequent run_single re-loaded that file as the seeded history and the LLM API rejected it. Switch agent_definition_name to include now_ms so each turn gets a unique name and the harness never finds a prior transcript to load. Trade-off: harness loses cross-turn memory persistence (each turn is stateless from the agent's POV). Tools still work — they query real external systems. Cross-turn memory is a follow-up that needs an Agent cache (Arc<Mutex<Agent>> per request_id) so the harness keeps history in-memory and never round-trips through the corrupt-able disk transcript loader. Corrupt transcript file purged manually for the active staging workspace; future kills will create new ones but per-turn unique naming means they won't poison subsequent turns.
User reported: connected calendar mid-call, then asked bot about tomorrow's meetings; bot kept saying "I don't have calendar access" even though [orchestrator_tools] logged 119 integrations connected on every turn. Diagnosis: the previous MEET_VOICE_DIRECTIVE said "answer in ONE short spoken sentence, no preamble, no 'Let me…', no 'I should…'". The model interpreted this as a blanket "skip tool use, answer directly from prior" — tool calls + tool replies look like preamble to a model trained to match instruction shape. So it short-circuited to a hallucinated "not connected" answer instead of dispatching delegate_to_integrations_agent. Rewritten directive separates two contracts: 1. TOOL USE (encouraged + explicit): call tools whenever real data is needed. Tool calls are invisible to the user, do NOT count toward reply length. Explicit "do not claim something is not connected before attempting to call its tool". Explicit pointer to delegate_to_integrations_agent as the integration gateway. 2. FINAL SPOKEN REPLY (strict): same 25-word one-sentence ceiling, but framed as applying ONLY to the user-facing text that lands in TTS. The model is free to do whatever tool work it needs first. Same dictation / silence-on-side-conversation rules retained. Bug-1 (echo loop — Rust outbound drains faster than JS audio playback, is_speaking() flips false mid-reply, new wake fires) is a known follow-up. Needs speaking_until_ms deadline on the session + a JS-side audio flush RPC. Tracked, not addressed in this commit.
…60s timeout
Sub-agent log analysis of the live dev:app run found three converging bugs that produced "bot keeps repeating the same toolless reply 20 times" behaviour even after the orchestrator + tools were wired up correctly:
1. **Single-slot last_caption_signature was broken**. Meet's CC region renders two simultaneous rows (the user's caption AND the bot's TTS captioned back as speaker="You"). The 250 ms poll walked both rows every tick, so the signature flipped A → B → A → B and dedup never matched on byte-identical user repeats. One utterance fired the wake word 24 times. Replace with HashMap<speaker_lower, last_text>.
2. **turn_in_progress gate** added. While a brain turn is in flight (LLM + tools), refuse new wakes. The user's growing utterance was firing a fresh agentic turn every ~9-10s while the prior turn's delegate_to_integrations_agent (16-30s for calendar) was still running. Result: ~20 parallel calendar API hits per question, none of which finished inside the timeout. Gate is set at run_caption_turn entry (alongside cancel_outbound + take_pending_prompt) and cleared at the final with_session that enqueues the reply.
3. **Agentic timeout 20s → 60s**. Single delegate_to_integrations_agent already takes 15-30s on its own. Iteration 2 (synthesis using the tool result) needs another 3-5s. The 20s budget killed iteration 1 mid-flight and forced the bot back to llm_meeting_basic, which produced the confidently-wrong "I don't have access to your calendar" lie. 60s covers tool + synthesis with headroom. The turn_in_progress gate prevents the longer window from starving the user: they cannot fire 20 parallel turns during the wait.
Known follow-up: when the agentic path times out (rare with 60s), the basic-LLM fallback still hallucinates. Should swap that for a polite "still checking" ack instead. Tracked, not in this commit.
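A compact sketch of fixes 1 and 2, assuming both live on one session-side struct. `CaptionGate` and its method names are hypothetical; the per-speaker `HashMap` and the `turn_in_progress` flag follow the description above.

```rust
use std::collections::HashMap;

/// Hypothetical holder for the per-speaker dedup map and the turn gate.
#[derive(Default)]
struct CaptionGate {
    /// Last caption text seen per lowercased speaker name.
    last_by_speaker: HashMap<String, String>,
    /// True while a brain turn (LLM + tools) is in flight.
    turn_in_progress: bool,
}

impl CaptionGate {
    /// Returns true when `text` differs from this speaker's previous caption.
    /// Keying by speaker stops two alternating rows from defeating dedup.
    fn is_fresh(&mut self, speaker: &str, text: &str) -> bool {
        let key = speaker.to_lowercase();
        if self.last_by_speaker.get(&key).map(String::as_str) == Some(text) {
            return false;
        }
        self.last_by_speaker.insert(key, text.to_string());
        true
    }

    /// Claims the turn slot; returns false if a turn is already running.
    fn try_begin_turn(&mut self) -> bool {
        if self.turn_in_progress {
            return false;
        }
        self.turn_in_progress = true;
        true
    }

    /// Releases the turn slot once the reply has been enqueued.
    fn end_turn(&mut self) {
        self.turn_in_progress = false;
    }
}
```

With a single shared slot, the A → B → A → B alternation would make every caption look new; the map makes the second "A" a repeat for its own speaker.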
Live test of the Slack question hit the 60s ceiling: delegate_to_integrations_agent completed in 33.97s with 8 iterations + 239 chars of real Slack data, but iteration 2 (orchestrator synthesis) never landed. The bot fell back to llm_meeting_basic, which has no tool access and confidently invented an answer the user heard over voice, which is worse than honest silence.
1. AGENTIC_TURN_TIMEOUT_SECS: 60 → 90. Slack / Gmail fetches via Composio + per-message filtering + synthesis hit 60-80s in the slow path. The turn_in_progress gate still blocks parallel wakes during the wait.
2. Removed llm_meeting_basic fallback from both run_caption_turn and run_turn. On agentic failure we now speak "Let me get back to you on that." instead of routing to a toolless LLM that hallucinates. Honest deflection > false answer in a live meeting.
llm_meeting_basic is retained in the file for future integration-degraded smoke tests; no live caller exercises it now.
…rompt User asked "what time is it" and got "I don't know" / "Let me get back to you" because the orchestrator's registry has no clock tool. Cheap fix: include current local date/time/weekday/tz-offset in the profile_prompt_suffix when building the per-turn orchestrator. The directive tells the model to use this block directly for time/date questions and NOT dispatch a tool. Refreshed every turn because Agent is built per-turn, so the answer stays accurate across long meetings. Format example: "Current local date/time: 2026-05-23 01:21:48".
…udget
User reported: still has to enable Meet captions manually each call.
The bot can't hear without CC because Flow A scrapes Meet's caption DOM.
Two paths were running but both narrow:
1. captions_bridge.js polled prefix-only `aria.indexOf("turn on captions") === 0`,
missing Meet variants like "Turn on captions (c)", "Turn on live captions",
"Subtitles", "Closed captions".
2. meet_scanner phase-4 click_by_aria_label substring-matched but only
knew 5 patterns; Meet rolls out new labels regionally.
Widen both:
- Patterns: turn on captions / turn on live captions / turn on subtitles /
turn on closed captions / captions on / captions (c) / show captions /
enable captions
- Bridge uses substring match (`indexOf >= 0`), not prefix-only
- Negative guard added so we never accidentally click an already-ON
toggle ("Turn off captions" / "captions off" / "disable captions")
- Bridge attempt budget 30 → 60 (~120s) for slow waiting-room admits
- Scanner dump label widened from "caption" to "caption|subtitle" so the
failure log catches any future label variant for further widening
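The widened matcher with its negative guard can be sketched as below. The pattern and guard strings are the ones listed above; the constant and function names are hypothetical, and the real check is split between captions_bridge.js and the Rust scanner.

```rust
/// "Turn captions on" patterns listed in the commit message.
const CAPTION_ON_PATTERNS: &[&str] = &[
    "turn on captions", "turn on live captions", "turn on subtitles",
    "turn on closed captions", "captions on", "captions (c)",
    "show captions", "enable captions",
];

/// Labels that mean captions are already on; never click these.
const CAPTION_OFF_GUARDS: &[&str] = &[
    "turn off captions", "captions off", "disable captions",
];

/// Substring match with the negative guard checked first, so a label
/// like "Turn off captions (c)" is rejected even though it contains
/// the "captions (c)" pattern.
fn should_click_captions(aria_label: &str) -> bool {
    let label = aria_label.to_lowercase();
    if CAPTION_OFF_GUARDS.iter().any(|g| label.contains(g)) {
        return false;
    }
    CAPTION_ON_PATTERNS.iter().any(|p| label.contains(p))
}
```

Checking the guard before the patterns matters because substring matching is loose enough for an "off" label to contain an "on" pattern.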
The agent-cached path builds an orchestrator on first wake (memory tree load + MCP init) which can take several seconds even in a minimal test environment. The prior 50ms fixed sleep raced against that and the test asserted on an empty queue. Convert to a 100ms-tick busy-poll capped at 30s so the test exits the moment audio lands but tolerates the slower cold path.
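A rough sketch of the capped busy-poll, assuming a synchronous helper built on `std::thread::sleep` (the real test is likely async and would use its runtime's sleep instead). `poll_until` is a hypothetical name.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Calls `ready` every `tick` until it returns true or `cap` has elapsed.
/// Returns true if the condition was met before the cap.
fn poll_until<F: FnMut() -> bool>(mut ready: F, tick: Duration, cap: Duration) -> bool {
    let deadline = Instant::now() + cap;
    loop {
        if ready() {
            return true; // exit the moment the condition holds
        }
        if Instant::now() >= deadline {
            return false; // cold path was slower than the cap
        }
        sleep(tick);
    }
}
```

With the values from the message, a call would look like `poll_until(|| queue_has_audio(), Duration::from_millis(100), Duration::from_secs(30))`, where `queue_has_audio` stands for whatever the test asserts on.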
Add `owner_display_name` + `bot_display_name` fields to `MeetAgentSession`. Reject `note_caption` wakes unless the speaker matches the owner, and drop the bot's own captioned TTS as self-echo. Empty owner fails closed so a misconfigured launch can never expose the user's tool surface. The brain runs the user's full orchestrator with 119 Composio integrations + memory tree. Without an identity gate, any participant in the Meet who says the wake phrase can issue tool calls in the user's name and have the results spoken back to the whole room (e.g. "hey openhuman read my Slack DMs from <person>" → private data broadcast). Gate is intentionally enforced before dedup / cooldown so unauthorised-wake attempts are auditable. Normalisation strips a single trailing parenthetical so Meet's "(host)" / "(you)" / "(presenter)" decorators don't break the match, and lowercases for case-insensitive compare. Unit tests cover the four denial paths (non-owner, bot-self, empty owner, case insensitivity) plus the (host)-suffix path.
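The normalisation and fail-closed behaviour described above might look roughly like this. `normalise_name`, `Gate`, and `wake_gate` are hypothetical names for illustration; the real gate lives inside `note_caption` and may order its checks differently.

```rust
/// Trim, strip one trailing parenthetical such as "(host)", and lowercase.
fn normalise_name(raw: &str) -> String {
    let trimmed = raw.trim();
    let base = match (trimmed.ends_with(')'), trimmed.rfind('(')) {
        (true, Some(open)) => trimmed[..open].trim_end(),
        _ => trimmed,
    };
    base.to_lowercase()
}

#[derive(Debug, PartialEq)]
enum Gate {
    Allow,
    NotOwner,
    BotSelf,
    NoOwnerConfigured,
}

/// Decides whether a caption from `speaker` may wake the brain.
fn wake_gate(speaker: &str, owner: &str, bot: &str) -> Gate {
    let owner = normalise_name(owner);
    if owner.is_empty() {
        return Gate::NoOwnerConfigured; // fail closed on a misconfigured launch
    }
    let speaker = normalise_name(speaker);
    let bot = normalise_name(bot);
    if !bot.is_empty() && speaker == bot {
        return Gate::BotSelf; // the bot's own captioned TTS is self-echo
    }
    if speaker == owner {
        Gate::Allow
    } else {
        Gate::NotOwner
    }
}
```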
Extend `StartSessionRequest` with `owner_display_name` and `bot_display_name` (both `#[serde(default)]` so old shells keep parsing). `handle_start_session` installs the identities via `session.set_identities` right after the registry create — before any caption push can arrive, so a racing wake never reads the empty-owner state in a way that could leak. Done as a two-step register+set rather than threading the identities through the existing `start()` signature so smoke tests (and any future non-Meet caller) don't have to be updated in lockstep.
…gate
`meet_call_open_window`'s `OpenWindowArgs` gains `owner_display_name`. `meet_audio::start` now accepts both the bot and owner names and includes them in the `openhuman.meet_agent_start_session` RPC payload, so the core wake gate is armed before the first caption arrives. Dev-auto launch path in `lib.rs` passes an empty owner name, so the gate fails closed (no wakes fire), which is the safe posture for an automated harness that has no real user behind the keyboard.
Add a required "Your name in the call" input to the Flow A join modal and forward it through `joinMeetCall` → `meet_call_open_window`. The hint text under the field tells the user this is the privacy lock — OpenHuman will only respond to the wake word when this exact name is speaking, so a remote participant can't trigger tool calls in their name. Submit is disabled until the owner field is non-empty; submitting an empty value would fail closed in core anyway but surfacing the requirement up front avoids the user joining a call and finding the bot silent. `IntelligenceCallsTab` is hidden behind a Coming Soon gate, so its `joinMeetCall` call site passes an empty owner placeholder with a note that the field has to be wired up when the tab is revived. Vitest `MeetingBotsCard.test.tsx` updated to type a value into the new field before submitting (previously the disabled-submit gate would have blocked the form).
The pump now tracks an edge-detected speaking flag per session and fires a `meet-video:speaking-state` Tauri event on every flip. The detector is gated by a 400 ms hangover so the natural gap between two consecutive PCM chunks doesn't flap the mascot's mouth shut. Shutdown and fatal-feed-error paths force the state to `false` so the mascot can't get stuck mid-talk if the call dies during a TTS chunk. `poll_and_feed` now returns whether the tick carried PCM (the edge-detector's input). `speak_pump::start` takes an `AppHandle<R>` so the spawned task can emit events; updated the single caller in `meet_audio::start`. Frontend consumer (the in-Meet mascot frame producer) lands in the next commit.
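One way to picture the edge detector with hangover, using plain millisecond timestamps so it stays independent of the pump's clock. `SpeakingDetector` and its methods are hypothetical names; only the 400 ms hangover and the forced-false shutdown behaviour come from the message.

```rust
/// Tracks a speaking flag and reports only the flips (edges).
struct SpeakingDetector {
    speaking: bool,
    last_pcm_at_ms: Option<u64>,
    hangover_ms: u64,
}

impl SpeakingDetector {
    fn new(hangover_ms: u64) -> Self {
        Self { speaking: false, last_pcm_at_ms: None, hangover_ms }
    }

    /// Feed one pump tick. Returns Some(new_state) only when the state flips;
    /// a caller would emit the Tauri event in that case.
    fn tick(&mut self, now_ms: u64, carried_pcm: bool) -> Option<bool> {
        if carried_pcm {
            self.last_pcm_at_ms = Some(now_ms);
        }
        // Still "speaking" while the last PCM chunk is younger than the hangover.
        let next = match self.last_pcm_at_ms {
            Some(t) => now_ms.saturating_sub(t) < self.hangover_ms,
            None => false,
        };
        if next != self.speaking {
            self.speaking = next;
            Some(next)
        } else {
            None
        }
    }

    /// Shutdown / fatal-feed-error path: force the state to false.
    fn force_stop(&mut self) -> Option<bool> {
        self.last_pcm_at_ms = None;
        if self.speaking {
            self.speaking = false;
            Some(false)
        } else {
            None
        }
    }
}
```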
`MascotFrameProducer` subscribes to `meet-video:speaking-state`
and flips `<YellowMascotIdle/>` between `talking={false}` (idle)
and `talking={true}` (mouth animating in sync with the synthesized
PCM the bot is feeding into Meet). RequestId on the payload is
matched against the active session so a stale event from a torn-
down call can't bleed into the current one.
Visual cue only — no audio path / bridge changes. Meet participants
now see the mascot's mouth open and close in time with the audio
they hear, instead of the prior frozen idle pose.
`store::MeetCallRecord` captures request_id, meet_url, bot + owner display names, started/ended timestamps, listened/spoken seconds, and turn count. `append_record` opens the workspace's `meet_agent/calls.jsonl` in append mode (mkdir as needed); `read_recent(limit)` reads the file, drops malformed lines with a debug log, sorts newest-first, and clamps to 200 rows so a misconfigured caller can't trigger an unbounded read. JSONL chosen over sqlite for the same shape used elsewhere in the workspace: low-volume, write-rarely / read-rarely data, no migration story needed, and a malformed final line just gets skipped on next read. Tests cover round-trip, limit cap, missing-file → empty, malformed-line tolerance, zero-limit, and the usize::MAX clamp guard.
Extend `MeetAgentSession` with `meet_url: String` and `started_at_ms: u64`, plus a `set_meet_url` setter and read accessors (`meet_url`, `bot_display_name`, `started_at_ms`) so the store layer doesn't reach into private fields. The monotonic `Instant` `started_at` is kept for elapsed-seconds math; the new wall-clock ms field is what the JSONL log sorts on across process restarts. `StartSessionRequest` gains an optional `meet_url` field (serde default = empty) so older shells keep parsing while new shells forward the URL the CEF window joined.
handle_stop_session now builds a `MeetCallRecord` from the just-closed session and appends it to the JSONL store. The append is best-effort: a failed write logs at warn level but never blocks the stop_session response (the call is already over). handle_start_session forwards `meet_url` from the request into the session. New `openhuman.meet_agent_list_calls` returns the most recent records, newest first, with an optional `limit` param (default 50, hard-capped at 200 by the store). Wired into the controller schema registry alongside the existing five `meet_agent_*` endpoints; the schema-vs-handler-symmetry test is extended to include it.
The shell already knows the call's Meet URL (it built the CEF window with it); include it in the meet_agent_start_session RPC payload so the core can snapshot it onto the session and persist it in the recent-calls JSONL log on stop_session.
`MeetCallRecord` interface mirrors the core's `MeetCallRecord` struct (snake_case fields surfaced verbatim). `listMeetCalls(limit)` calls `openhuman.meet_agent_list_calls` and returns the rows array, or an empty array on a fresh install. Test file updated for the new privacy-lock contract: every joinMeetCall happy-path case now passes `ownerDisplayName`, and the invoke-args assertion checks the new `owner_display_name` field on the shell payload. Added a dedicated test for the empty-owner rejection path so future refactors can't silently weaken the gate.
`MeetingBotsModal` now fetches the most recent 20 calls via `listMeetCalls()` on mount and renders them in a new `RecentCallsSection` underneath the join form — same surface where the user launched the call, so they see their history without navigating away. Three render states (loading / empty / populated) avoid the empty-flash on first open. Each row shows the trailing Meet code (`abc-defg-hij`), a relative timestamp (`12m ago`, `yesterday`, `May 14`), and the turn count + on-call seconds — enough at a glance without overflowing the modal width. Fetch errors are surfaced inline as informational text (not role="alert", which the form already owns).
`note_caption` now returns a `CaptionOutcome` enum (Ignored / WakeFired / UnauthorizedWake) so callers can branch between the silent-drop, normal-turn, and polite-refusal paths without re-doing the gate logic out-of-band. The unauthorised path only fires when the non-owner caption actually contains a wake phrase; random chatter still goes through the existing `Ignored` branch.
Session gains:
- `pending_unauthorized_speaker` + timestamp (2 min window)
- `allowlist: HashSet<String>` of normalised speaker names
- `allow_speaker(name)` adds to allowlist
- `take_pending_unauthorized()` consumes the slot if fresh
Wake gate now accepts owner OR any allowlisted speaker. Bot-self filter still returns Ignored (an UnauthorizedWake here would loop on the bot's own refusal caption). Tests cover non-owner soft-deny outcome, non-owner chatter still ignored, allowlist promotes a refused speaker, pending take consumes once.
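The pending-slot and allowlist state could be modelled along these lines. This is a simplified sketch: `GrantState` is a hypothetical container (the real fields live on the session), names are normalised with a bare lowercase, and the explicit `now_ms` parameters are an assumption made to keep the example deterministic.

```rust
use std::collections::HashSet;

// 2 minute grant window, per the message above.
const GRANT_WINDOW_MS: u64 = 2 * 60 * 1000;

#[derive(Default)]
struct GrantState {
    // (normalised speaker, refused_at_ms) of the most recent refused asker.
    pending: Option<(String, u64)>,
    allowlist: HashSet<String>,
}

impl GrantState {
    fn note_refused(&mut self, speaker: &str, now_ms: u64) {
        self.pending = Some((speaker.to_lowercase(), now_ms));
    }

    /// Consumes the slot; returns the speaker only if the refusal is still fresh.
    fn take_pending_unauthorized(&mut self, now_ms: u64) -> Option<String> {
        let (name, refused_at) = self.pending.take()?;
        (now_ms.saturating_sub(refused_at) <= GRANT_WINDOW_MS).then_some(name)
    }

    fn allow_speaker(&mut self, name: &str) {
        self.allowlist.insert(name.to_lowercase());
    }

    /// Owner OR any allowlisted speaker; an empty owner fails closed.
    fn is_authorised(&self, speaker: &str, owner: &str) -> bool {
        let s = speaker.to_lowercase();
        !owner.is_empty() && (s == owner.to_lowercase() || self.allowlist.contains(&s))
    }
}
```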
Two new short brain paths that bypass the orchestrator agent:
`run_soft_deny_turn` synthesises a canned refusal line ("Sorry
<asker>, only <owner> can ask me things here. <owner>, say
'allow' to let them in.") and enqueues it as a normal TTS reply.
Cancels any prior outbound first so the refusal doesn't queue
behind a half-drained turn. Stamps turn-done so the min-turn-gap
backstop also covers refusals — a chatty non-owner can't spam
the gate every few seconds.
`run_grant_turn` adds the previously-refused speaker to the
session's per-call allowlist, speaks a short confirmation
("Okay, Bob can ask me now."), and clears the wake_active /
turn_in_progress flags so the grantee's next caption can fire
a fresh turn rather than coalescing into this one.
`run_caption_turn` checks `looks_like_grant_intent` at the top
of the prompt. If a pending unauthorised speaker exists within
the 2-min grant window, the turn branches into `run_grant_turn`
instead of the orchestrator. No pending request → fall through
to the normal LLM path, so the model can still answer if the
owner uses the same vocabulary in an unrelated query.
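A prefix-only matcher is one way to get the "top of the prompt" behaviour. The sketch below is an assumption about how `looks_like_grant_intent` could work, and the phrase list is hypothetical apart from the phrases quoted in this message.

```rust
// Hypothetical vocabulary; the real matcher's full phrase list is not shown here.
const GRANT_PHRASES: [&str; 5] = ["allow", "allow them", "yes go ahead", "go ahead", "let them in"];

/// True when the prompt starts with one of the grant phrases (whole words only).
fn looks_like_grant_intent(prompt: &str) -> bool {
    // Lowercase and turn punctuation into spaces so "Yes, go ahead." still matches.
    let cleaned: String = prompt
        .to_lowercase()
        .chars()
        .map(|c| if c.is_alphanumeric() { c } else { ' ' })
        .collect();
    let words: Vec<&str> = cleaned.split_whitespace().collect();
    GRANT_PHRASES.iter().any(|phrase| {
        let p: Vec<&str> = phrase.split(' ').collect();
        words.len() >= p.len() && words[..p.len()] == p[..]
    })
}
```

Matching whole words at the start of the prompt is what keeps "did i allow that meeting" from triggering a grant.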
Tests cover the canned message templates, the grant-intent
matcher (accepts canonical phrases including "yes go ahead",
"let them in"; rejects mid-prompt false positives like
"did i allow that meeting").
`handle_push_caption` now switches on the `CaptionOutcome` enum returned by `session::note_caption`. `WakeFired` spawns the existing `run_caption_turn`; `UnauthorizedWake` spawns the new `run_soft_deny_turn` (passing the asker's display name so the spoken refusal can address them by name); `Ignored` is a no-op. `turn_started` in the response stays true only for `WakeFired` so the existing shell-side UI hints don't see a refusal as an authorised turn.
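The three-way switch can be summarised in a small pure function. `Spawned` and `route` are hypothetical names used for illustration; the real handler spawns async turns rather than returning a value.

```rust
#[derive(Debug, PartialEq)]
enum CaptionOutcome {
    Ignored,
    WakeFired,
    UnauthorizedWake { asker: String },
}

// Which turn (if any) the handler would spawn.
#[derive(Debug, PartialEq)]
enum Spawned {
    Nothing,
    CaptionTurn,
    SoftDenyTurn { asker: String },
}

/// Returns the turn to spawn and the `turn_started` flag for the RPC response.
fn route(outcome: CaptionOutcome) -> (Spawned, bool) {
    match outcome {
        CaptionOutcome::WakeFired => (Spawned::CaptionTurn, true),
        // A refusal is spoken, but it is not reported as an authorised turn.
        CaptionOutcome::UnauthorizedWake { asker } => (Spawned::SoftDenyTurn { asker }, false),
        CaptionOutcome::Ignored => (Spawned::Nothing, false),
    }
}
```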
…ings
The Meeting Bots modal's submit button renders the platform label by
string-concatenating the translation with `selected.label`
(`${t('sendTo')} ${selected.label}` / `${selected.label} ${t('comingSoon')}`).
The base `t()` does not interpolate, so en/ko translations that
embedded `{label}` showed up verbatim — "Send to {label} Google Meet"
and "{label} coming soon" — instead of the intended interpolation.
All other locale chunks already use bare "Send to" / "Coming soon"
strings to match the concat pattern. Bring en + ko in line so the
button reads correctly in those locales too.
…_url in start_session schema
The controller schema validator rejected the new fields as unknown
params:
meet_audio start failed err=rpc error: {"code":-32000, ...,
"message":"unknown param 'bot_display_name' for meet_agent.start_session"}
Plan C added the fields to `StartSessionRequest` (with serde default
fallbacks) and Plan A added `meet_url`, but the schema declaration
in `schemas.rs` was never updated. Add all three as optional fields
so the dispatch layer admits them and the gate / persistence paths
actually run.
Knock-on effect of the rejection: `meet_audio::start` bailed before
installing the audio bridge or starting the frame bus, so the
gUM intercept never installed → Meet exposed the host's real
camera instead of the mascot canvas. Fixing the schema restores
the full pipeline.
Plan D landed the unauthorised-wake branch ABOVE the per-speaker dedup + min-turn-gap + cooldown + turn-in-progress gates. Meet's caption observer re-emits the same caption row every 250 ms while the speaker is still visible in the CC region, so each tick fired a fresh UnauthorizedWake → soft-deny TTS — producing the "sorry sorry sorry" loop seen in dev:app on 2026-05-25 (also producing 429s from the TTS endpoint as the loop hit rate-limits). Restructure: compute `speaker_is_authorised` early, run all rate-limit gates uniformly for both authorised and unauthorised speakers, then branch on authorised at the wake-phrase match point. Restrict the wake_active prompt-continuation append to authorised speakers too so a non-owner can't smuggle text into the in-flight owner prompt. Regression test `note_caption_unauthorized_wake_does_not_loop_on_identical_caption` asserts the first emission produces `UnauthorizedWake` and subsequent emissions of the same (or punctuation-jittered) text are deduped to `Ignored`.
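The per-speaker dedup that absorbs identical or punctuation-jittered re-emissions can be sketched as below. The key recipe (lowercase, drop non-alphanumeric, collapse whitespace) is taken from the PR summary; `dedup_key` and `is_fresh_caption` are hypothetical names.

```rust
use std::collections::HashMap;

/// Normalised dedup key: lowercase, drop punctuation, collapse whitespace.
fn dedup_key(text: &str) -> String {
    let kept: String = text
        .to_lowercase()
        .chars()
        .filter(|c| c.is_alphanumeric() || c.is_whitespace())
        .collect();
    kept.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Returns true (and records the key) only when the caption differs from the
/// last one seen for this speaker.
fn is_fresh_caption(last: &mut HashMap<String, String>, speaker: &str, text: &str) -> bool {
    let key = dedup_key(text);
    if last.get(speaker) == Some(&key) {
        return false;
    }
    last.insert(speaker.to_string(), key);
    true
}
```

Running this check before the authorised / unauthorised branch is what lets the first emission produce a refusal while the 250 ms re-emissions fall through to `Ignored`.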
…plit
Two follow-up bugs from the first soft-deny smoke:
1) Meet's STT re-transcribes the same utterance with text jitter
("Openhuman. I open." → "Openhuman. High openhum." →
"Openhuman. High Openhuman.") so the per-text dedup misses
the variants. Each fired a fresh soft-deny TTS, producing
the "sorry sorry sorry" loop and 429 rate-limits from the
TTS backend.
Fix: session-wide UNAUTHORIZED_COOLDOWN_MS (60s, 1 dispatch
per window). Tracked on a new
`last_unauthorized_dispatch_at_ms` field on the session.
Independent of the owner's `last_turn_done_at_ms` so the
owner can still wake (e.g. say "allow them") within seconds
of a refusal.
2) Greetings from non-owners were getting refused instead of
answered. New `classify_unauthorized_intent` looks at the
post-wake tail — bare wake or greeting-only words ("hi",
"hello", "good morning", "there", "everyone", ...) maps to
`Greeting`; substantive task asks map to `TaskAsk`.
`run_soft_deny_turn` branches on intent:
Greeting → "Hi <asker>! Nice to meet you." (no privacy
gate noise on a hello)
TaskAsk → the existing refusal + "say 'allow' to let
them in" hint
`CaptionOutcome::UnauthorizedWake` now carries the full caption
text so the brain layer can classify; rpc.rs forwards it into
the spawned turn.
Tests:
- session: cooldown blocks text-variants + cross-speaker
- brain: greeting / filler / task classification
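Both fixes can be illustrated with a short sketch. The greeting word list and the single wake word are assumptions for the example (the real matcher accepts more wake variants), and `unauthorized_dispatch_allowed` is a hypothetical helper standing in for the check against `last_unauthorized_dispatch_at_ms`.

```rust
#[derive(Debug, PartialEq)]
enum UnauthorizedIntent {
    Greeting,
    TaskAsk,
}

const WAKE_WORD: &str = "openhuman";
// Hypothetical greeting vocabulary, tokenised per word.
const GREETING_WORDS: [&str; 9] = [
    "hi", "hey", "hello", "good", "morning", "afternoon", "evening", "there", "everyone",
];

/// Looks at the words after the first wake word: empty or greeting-only tails
/// are greetings, anything else is a task ask.
fn classify_unauthorized_intent(caption: &str) -> UnauthorizedIntent {
    let cleaned: String = caption
        .to_lowercase()
        .chars()
        .map(|c| if c.is_alphanumeric() { c } else { ' ' })
        .collect();
    let words: Vec<&str> = cleaned.split_whitespace().collect();
    let tail: &[&str] = match words.iter().position(|w| *w == WAKE_WORD) {
        Some(i) => &words[i + 1..],
        None => &words[..],
    };
    if tail.iter().all(|w| GREETING_WORDS.contains(w)) {
        UnauthorizedIntent::Greeting
    } else {
        UnauthorizedIntent::TaskAsk
    }
}

// 60s window as introduced in this commit.
const UNAUTHORIZED_COOLDOWN_MS: u64 = 60_000;

/// Session-wide cooldown: at most one unauthorised dispatch per window,
/// regardless of speaker or caption text.
fn unauthorized_dispatch_allowed(last_dispatch_at_ms: &mut Option<u64>, now_ms: u64) -> bool {
    if let Some(prev) = *last_dispatch_at_ms {
        if now_ms.saturating_sub(prev) < UNAUTHORIZED_COOLDOWN_MS {
            return false;
        }
    }
    *last_dispatch_at_ms = Some(now_ms);
    true
}
```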
…uplink
The audio bridge connected each fed `AudioBufferSource` only to the `MediaStreamAudioDestinationNode` that backs Meet's getUserMedia intercept. Bot voice therefore reached Meet (and other participants via the WebRTC wire), but was silent on the host machine: the user running openhuman could only hear the bot if they were receiving the call on a *separate* endpoint (other browser tab, phone, ...). Today's smoke test surfaced this as "captions appear from OpenHuman but no sound" while the user was watching the bot+meet on the same mac.
Add a second `src.connect(ctx.destination)` so the same buffer also plays through the default output device. No quality impact; the MediaStream path is unchanged. Follow-up tinyhumansai#20 (vendored CEF `set_audio_muted` for the bot window) will re-introduce a clean off switch behind a config toggle once we have one; right now defaulting to audible-locally is the less confusing posture.
Loosen the non-owner branch: instead of a canned refusal, route
substantive asks through a toolless chat-v1 LLM with an explicit
no-personal-data system prompt. The LLM:
- Answers general knowledge / casual chat / definitions / jokes
from training data ("what's the capital of France" -> "Paris").
- Refuses anything that would need the owner's tools (Slack,
Gmail, Calendar, memory, integrations) with a one-line pointer
at the magic word: "<owner>, say 'allow' if you'd like me to
help."
- Has zero tools wired, so it physically can't fire a Composio
call even if it tried.
- Has empty history (no rolling context from owner turns) so
private replies from earlier in the call can't bleed into a
non-owner reply.
`run_soft_deny_turn` still gates on `classify_unauthorized_intent`:
greeting -> canned hi (cheap, no network); task ask -> the new
`llm_general_no_tools`. LLM errors / empty replies fall through
to the explicit canned refusal so the speaker hears *something*.
Changes:
- brain::llm_meeting_basic gains a `system_prompt` param so the
same plumbing serves both owner-fallback and non-owner paths.
- new `non_owner_system_prompt(owner)` builder.
- new `llm_general_no_tools(prompt, owner)` wrapper.
- cooldown lowered 60s -> 20s so non-owners can engage in
actual back-and-forth instead of the bot going deaf for a
minute after the first refusal.
Summary
Problem
The pre-revival Flow A baseline could put the mascot into a Meet but:
- `/openai/v1/chat/completions` endpoint with no tools → could not answer "what's on my calendar", "what did Alice say on Slack", "remember Friday is busy".
- "We need to generate a single sentence…" spoken aloud.
- `speaker="You"` re-fired wake; cooldown and dedup were too narrow.
Solution
- `Agent::from_config_for_agent_with_profile(&config, "orchestrator", None, Some(MEET_VOICE_DIRECTIVE))`: same canonical path the chat UI uses. Inherits 119 integrations + memory tree + MCP.
- `MEET_VOICE_DIRECTIVE` system-prompt suffix splits two contracts: TOOL USE (encouraged, invisible to user, doesn't count toward word budget) vs FINAL SPOKEN REPLY (one sentence ≤25 words, plain English, no markdown).
- `RIGHT-NOW CONTEXT` block injects local date/time so time questions answer without a clock tool.
- `AGENT_CACHE: OnceLock<TokioMutex<HashMap<request_id, Arc<TokioMutex<Agent>>>>>` per-meet Agent built once, locked across `run_single().await`, dropped on `stop_session`. Eliminates 5–10s per-turn rebuild and restores cross-turn memory.
- `tool_calls`-pair 400 error after a kill mid-tool-call; in-memory history survives via the cache.
- `AGENTIC_TURN_TIMEOUT_SECS = 90` covers slow Composio fetches + iteration-2 synthesis. On agentic failure: speak "Let me get back to you on that." (honest deflection) instead of falling back to a toolless LLM that hallucinates.
- "On it." synthesised + enqueued with `done=false` immediately after wake. Skipped when prompt ≤ 50 chars (greetings / time / hear-me checks).
- `window.__openhumanFlushAudio()` in `audio_bridge.js` tracks every started `AudioBufferSource` and stops in-flight on flush.
- `session.flush_pending` set by `cancel_outbound`, returned in `poll_outbound` JSON, consumed by `speak_pump` which calls `inject::flush_audio_bridge` before the next feed.
- `note_caption` gate relaxed to only block during LLM-in-flight (not during TTS playback).
- `last_caption_by_speaker` HashMap with normalised dedup key (lowercase + drop non-alphanumeric + collapse whitespace) catches Meet's punctuation/case jitter between observer ticks.
- `last_turn_done_at_ms` wall-clock backstop + `MIN_TURN_GAP_MS=60s` refuses wake within 60s of prior turn done regardless of caption content. Caption cooldown also 60s.
- (`turn on captions`, `turn on live captions`, `turn on subtitles`, `turn on closed captions`, `captions on`, `captions (c)`, `show captions`, `enable captions`) with substring match + negative-OFF guard, budget 30→60 attempts.
Submission Checklist
- `## Related`: will add when coverage matrix row is updated.
- `Closes #NNN`: no tracking issue opened yet; will add before un-drafting.
Impact
- `poll_outbound` JSON gained an optional `flush_pending: bool` field. Older shells ignore it; new shell consumes it.
What's working today (achievements)
- "hey openhuman" / "hi openhuman" / "openhuman" etc. (8 variants), speaker=You bot-echo filtered.
- (`delegate_to_integrations_agent` → `integrations_agent completed iterations=3 output_chars=329 success=true`).
- `RIGHT-NOW CONTEXT` block (no tool needed).
- (`sources_stopped=1` when bot was mid-playback).
Still to conquer (follow-ups)
- `sources_stopped=0` shows we sometimes flush after playback has already drained naturally; instrument JS-side to record pending-schedule depth for tighter diagnostics.
- `set_audio_muted` for the bot: currently the bot CEF window is fully audible to the host machine; should be muted at the OS level so only the Meet wire hears the synthesised PCM.
AI Authored PR Metadata
Linear Issue
Commit & Branch
feat/mascot-meet-flowA