feat(gateway): Google Chat attachment support (image / file / audio + STT) #762
canyugs wants to merge 3 commits into openabdev:main
Implements image / text file / audio download from Google Chat via Media API + service account token, following the PR openabdev#731 base64 pattern.

Changes:
- GoogleChatMessage: parse attachment[] array (Attachment / AttachmentDataRef / DriveDataRef structs)
- GoogleChatMediaRef enum: Image / File / Audio variants for typed dispatch
- parse_attachments(): branches on contentType prefix, skips DRIVE_FILE source
- download_googlechat_image(): resize → 1200px JPEG q75, max 10MB, GIF preserved
- download_googlechat_file(): text extension whitelist (.txt/.md/.py/...), max 512KB
- download_googlechat_audio(): forwarded as-is for core STT pipeline, max 25MB
- media_url(): percent-encode resource_name as path segment
- webhook handler: parses attachments, async-downloads via adapter token, populates Content.attachments
- Empty-text events with attachments are now forwarded (previously dropped)
- Tests: 11 new (parse, download success/skip/oversized, URL encoding)

Refs: openabdev#731 (Feishu pattern)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
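The typed dispatch described in this commit can be pictured with a minimal sketch. It is not the PR's code: `MediaRef`, `classify`, its flat string parameters, and the three-entry `TEXT_EXTENSIONS` list are hypothetical stand-ins (the commit only names `GoogleChatMediaRef`, `parse_attachments()`, and a whitelist beginning ".txt/.md/.py/..."), and where exactly the extension check happens in the real adapter is an assumption.

```rust
/// Illustrative stand-in for the PR's `GoogleChatMediaRef` enum.
#[derive(Debug, PartialEq)]
pub enum MediaRef {
    Image { resource_name: String },
    File { resource_name: String, file_name: String },
    Audio { resource_name: String },
}

/// Hypothetical subset of the text whitelist; the real list is longer.
const TEXT_EXTENSIONS: [&str; 3] = ["txt", "md", "py"];

/// Decide how (and whether) an attachment should be downloaded.
/// Returns `None` for anything the gateway should skip.
pub fn classify(
    source: &str,
    content_type: &str,
    file_name: &str,
    resource_name: &str,
) -> Option<MediaRef> {
    // Drive-linked files would need the Drive API, so they are skipped.
    if source == "DRIVE_FILE" {
        return None;
    }
    // Branch on the contentType prefix for images and audio.
    if content_type.starts_with("image/") {
        return Some(MediaRef::Image {
            resource_name: resource_name.to_string(),
        });
    }
    if content_type.starts_with("audio/") {
        return Some(MediaRef::Audio {
            resource_name: resource_name.to_string(),
        });
    }
    // Everything else is accepted only if its extension is whitelisted.
    let ext = file_name
        .rsplit_once('.')
        .map(|(_, e)| e.to_ascii_lowercase())?;
    if TEXT_EXTENSIONS.contains(&ext.as_str()) {
        Some(MediaRef::File {
            resource_name: resource_name.to_string(),
            file_name: file_name.to_string(),
        })
    } else {
        None
    }
}
```

Returning `Option` keeps the "skip" cases (Drive files, non-whitelisted extensions) out of the download path without needing a separate error type.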
Extends src/gateway.rs attachment handling to transcribe audio attachments via the existing STT pipeline (previously only Discord/Slack adapters went through download_and_transcribe; Custom Gateway adapters got no audio path even though stt::transcribe was available).

When a gateway adapter (Feishu, Google Chat, etc.) sends an Attachment with attachment_type = "audio", core now:
1. Decodes base64 → audio bytes
2. Calls stt::transcribe with the configured SttConfig
3. Wraps the transcript as a ContentBlock::Text: "[Voice message transcript]: ..."

The audio branch is gated on stt_config.enabled: if STT is disabled in config, audio attachments fall through unchanged (same as before).

Threads stt_config through GatewayParams and run_gateway_adapter.

This closes the audio attachment gap left by the (now-closed) PR openabdev#726 without re-introducing the HTTP MediaStore proxy approach.

Pairs with the Google Chat adapter audio download (separate PR). Once both land, Google Chat voice/audio attachments work end-to-end.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
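The three steps above can be sketched with the standard library only. Everything here is an assumption made for illustration: `audio_to_block`, the single-variant `ContentBlock`, the hand-rolled `decode_base64`, and the `transcribe` closure (standing in for `stt::transcribe`, whose real signature is not shown in this PR text). The sketch covers the gated happy path only.

```rust
/// Simplified stand-in for the core content block type.
#[derive(Debug, PartialEq)]
pub enum ContentBlock {
    Text(String),
}

/// Minimal standard-alphabet base64 decoder; stops at the first '='
/// and returns `None` on any character outside the alphabet.
pub fn decode_base64(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut buf: u32 = 0; // bit accumulator
    let mut bits: u32 = 0; // number of valid bits in `buf`
    for c in input.bytes() {
        let v = match c {
            b'A'..=b'Z' => c - b'A',
            b'a'..=b'z' => c - b'a' + 26,
            b'0'..=b'9' => c - b'0' + 52,
            b'+' => 62,
            b'/' => 63,
            b'=' => break,
            _ => return None,
        };
        buf = (buf << 6) | u32::from(v);
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((buf >> bits) as u8);
            buf &= (1 << bits) - 1; // keep only the leftover low bits
        }
    }
    Some(out)
}

/// Step 1: decode base64. Step 2: transcribe. Step 3: wrap as text.
/// Returns `None` when STT is disabled or no transcript is produced,
/// in which case the caller leaves the attachment untouched.
pub fn audio_to_block<F>(stt_enabled: bool, data_b64: &str, transcribe: F) -> Option<ContentBlock>
where
    F: Fn(&[u8]) -> Option<String>,
{
    if !stt_enabled {
        return None;
    }
    let bytes = decode_base64(data_b64)?;
    let transcript = transcribe(&bytes)?;
    Some(ContentBlock::Text(format!(
        "[Voice message transcript]: {}",
        transcript
    )))
}
```

Passing the transcriber in as a closure keeps the sketch testable without a real STT backend; production code would presumably use an established base64 crate rather than a hand-written decoder.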
Addresses #4 must-fix items:

#1+#2 Webhook timeout safety:
- Spawn background tokio task for attachment downloads so the webhook returns 200 within Google Chat's 30s deadline regardless of how long downloads take.
- Add 30s per-request timeout to all Media API GET calls, so a single hung connection can no longer stall the download task indefinitely.
- Refactor event emission into send_googlechat_event helper to share between sync (no-attachment) and async (background download) paths.

#4 Text file caps (matches Discord/Slack):
- TEXT_FILE_COUNT_CAP = 5: skip text files past the 5th with a warning.
- TEXT_TOTAL_CAP = 1 MB: skip text files that would push the running aggregate past 1 MB with a warning.
- Image and audio attachments are not capped (same as Discord/Slack).

openabdev#6 STT silent failure:
- When stt::transcribe returns None, push a fallback ContentBlock::Text ("[Voice message — transcription failed for ...]") so the agent knows a voice message was attempted and can ask the user to re-send. Previously the failure was silent and confusing.

Skipped from issue #4: #3 (streaming download), openabdev#5 (cross-adapter refactor; adapters stay independent per design), openabdev#7-openabdev#10 (cosmetic).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
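The two text-file caps can be illustrated with a small std-only sketch. The constant names come from the commit message; `select_text_files`, its size-only input, the choice to keep scanning after a skip, and reading "1 MB" as 1024 * 1024 bytes are all assumptions made for illustration.

```rust
/// Maximum number of text files forwarded per message (from the commit).
pub const TEXT_FILE_COUNT_CAP: usize = 5;
/// Aggregate size cap; "1 MB" is assumed here to mean 1024 * 1024 bytes.
pub const TEXT_TOTAL_CAP: usize = 1024 * 1024;

/// Given the byte sizes of candidate text files in arrival order,
/// return the indices of the files that fit under both caps.
pub fn select_text_files(sizes: &[usize]) -> Vec<usize> {
    let mut accepted = Vec::new();
    let mut total = 0usize;
    for (index, &size) in sizes.iter().enumerate() {
        // Count cap: everything after the 5th accepted file is skipped.
        if accepted.len() >= TEXT_FILE_COUNT_CAP {
            eprintln!("skipping text file {}: count cap reached", index);
            continue;
        }
        // Aggregate cap: skip a file that would push the total past the cap.
        if total + size > TEXT_TOTAL_CAP {
            eprintln!("skipping text file {}: total size cap exceeded", index);
            continue;
        }
        total += size;
        accepted.push(index);
    }
    accepted
}
```

Because an oversized file is skipped rather than ending the loop, a later small file can still be accepted if it fits under the remaining budget; whether the adapter behaves the same way is not stated in the commit.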
OpenAB PR Screening

This is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.
Screening report

## Intent

PR #762 adds inbound attachment handling to the Google Chat gateway so users can send images, text files, and audio messages to OpenAB agents from Google Chat. It also fills a core gateway gap: custom gateway adapters can now convert inbound audio attachments into STT transcripts before passing them to the agent.

The operator-visible problem is that Google Chat users currently lose attachment context, and audio attachments from custom gateways do not reach the existing transcription path. This makes Google Chat less capable than Discord/Slack for multimodal or voice workflows.

## Feat

This is primarily a feature PR with a small core integration fix. Behavioral change:
## Who It Serves

Primary beneficiaries:
## Rewritten Prompt

Implement inbound attachment support for the Google Chat gateway adapter. Parse Google Chat Support:
In core gateway attachment conversion, add support for

Add focused tests for attachment parsing, caps, unsupported sources/types, core audio conversion, STT disabled behavior, and webhook fast-return behavior where practical. Update Google Chat docs with supported types, limits, and unsupported Drive files.

## Merge Pitch

This is worth advancing because it closes a major Google Chat parity gap and reuses the existing gateway attachment channel instead of introducing a new media-serving subsystem. The PR also improves the shared custom gateway path by making audio attachments useful beyond Google Chat.

Risk profile is moderate. The high-risk areas are the large

## Best-Practice Comparison

Relevant OpenClaw principles:
Relevant Hermes Agent principles:
The main best-practice tension is that returning

## Implementation Options

Option 1: Conservative merge with review hardening
Option 2: Balanced gateway media job abstraction
Option 3: Durable media processing queue
Option 4: MediaStore/proxy architecture

## Comparison Table
## Recommendation

Advance this PR toward merge using Option 1, with a reviewer checklist focused on bounded background execution, observability, test coverage, and memory limits.

This is the right next step because the feature solves a real Google Chat usability gap without requiring a new media architecture. The PR already follows the existing base64 attachment pattern and keeps STT in core, which is the cleaner boundary.

Recommended follow-up split:
Summary
Adds inbound attachment support to the Google Chat gateway adapter (image, text file, audio) using the PR #731 base64 pattern, plus completes the audio → STT path in OAB core for all Custom Gateway adapters (Feishu, Google Chat, …).
Changes
| File | Change |
| --- | --- |
| gateway/src/adapters/googlechat.rs | GoogleChatMessage.attachment[] parsing + GoogleChatMediaRef enum (Image/File/Audio). download_googlechat_image / _file / _audio via Media API + SA token. parse_attachments branches on source (skips DRIVE_FILE) + contentType. Webhook handler spawns background task for downloads so it returns 200 within Google Chat's 30 s deadline. Per-request 30 s timeout. Text file caps: 5 count / 1 MB total (matches Discord/Slack). |
| src/gateway.rs | "audio" branch in attachment → ContentBlock conversion. Decodes base64 → stt::transcribe (gated on stt_config.enabled) → wraps transcript as [Voice message transcript]: …. Falls back to [Voice message — transcription failed for X] so STT failures aren't silent. Threads stt_config through GatewayParams. |
| src/main.rs | cfg.stt.clone() to GatewayParams.stt_config |
| docs/google-chat.md | |

Key Design Decisions
- PR #731 base64 pattern (feat(gateway): feishu image and text file attachment support): Google Chat Media API requires SA-token auth that the agent doesn't have; gateway downloads, compresses, and base64-encodes before sending over the existing WebSocket attachment channel. No new HTTP server needed (vs. the closed PR #726 MediaStore proxy, feat(gateway): support images and audio for LINE/Telegram).
- Webhook returns 200 immediately: Multi-attachment downloads can exceed Google Chat's 30 s webhook deadline. Webhook handler does sync parsing then tokio::spawn for downloads + event emit, so Google Chat won't retry.
- Per-request 30 s timeout: A hung Media API connection can no longer block the spawned task indefinitely.
- Text file caps match Discord/Slack: TEXT_FILE_COUNT_CAP = 5 + TEXT_TOTAL_CAP = 1 MB. Text files concatenate into the agent prompt and need an aggregate cap; image and audio are independent and have only per-file size caps. (Feishu currently caps nothing; that's an existing gap in the Feishu adapter, not a model to copy.)
- STT in core, not gateway: stt::transcribe already exists for Discord/Slack via download_and_transcribe. Custom Gateway audio just needed an "audio" branch in src/gateway.rs to bridge the same pipeline. No new STT infrastructure; only reuse.
- STT failure fallback: Returns [Voice message — transcription failed for X] as a ContentBlock::Text instead of silently dropping audio, so the agent can prompt the user to re-send.
- Drive-linked files skipped: DRIVE_FILE source needs separate Drive API integration; left for a follow-up. UPLOADED_CONTENT (the common path) works.

Testing
E2E tested on k3s VPS with Google Chat workspace:
- attachment_count=2
- .bin (non-whitelist extension)
- attachment_count=0)
- Content-Length early reject
- .m4a) + STT via Whisper
- cargo test: gateway 65 / core 238

Not Yet Supported
- DRIVE_FILE source skipped; needs Drive API + token.
- GatewayReply schema extension.

Breaking Changes
None.
GatewayParams adds an stt_config field but cfg.stt already has Default, and the "audio" arm is gated on stt_config.enabled (default false).

Discord Discussion URL
https://discord.com/channels/1491295327620169908/1501153334042824764
🤖 Generated with Claude Code