
feat(gateway): Google Chat attachment support (image / file / audio + STT) #762

Open
canyugs wants to merge 3 commits into openabdev:main from canyugs:feat/gateway-googlechat-attachments

Conversation

Contributor

@canyugs canyugs commented May 6, 2026

Summary

Adds inbound attachment support to the Google Chat gateway adapter (image, text file, audio) using the PR #731 base64 pattern, plus completes the audio → STT path in OAB core for all Custom Gateway adapters (Feishu, Google Chat, …).

User attaches file in Google Chat
  │
  ▼
Webhook → Gateway parses message.attachment[]
  │  returns 200 immediately (background task)
  ▼
Gateway tokio::spawn:
  - SA-token GET /v1/media/{resourceName}?alt=media (30s timeout)
  - image: resize 1200px JPEG75 + base64
  - text: 5×512KB whitelist + base64
  - audio: forward as-is + base64
  ▼
GatewayEvent.content.attachments → core
  │
  ▼  src/gateway.rs decodes per attachment_type:
  - "image"     → ContentBlock::Image  (already in main)
  - "text_file" → ContentBlock::Text   (already in main)
  - "audio"     → stt::transcribe → ContentBlock::Text  (NEW)
  │
  ▼
Agent sees image / text / voice-transcript and replies
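
The last step of the flow above, where core decodes each attachment by `attachment_type`, might look roughly like the sketch below. The `ContentBlock` and `Attachment` shapes are simplified stand-ins rather than the real definitions in `src/gateway.rs`, the payload is assumed to be base64-decoded already, and STT is passed in as a closure so the example needs no external crates. Using the file name as the "X" in the fallback text, and dropping audio when STT is disabled, are both assumptions.

```rust
// Hypothetical, simplified stand-ins for the core types named in the PR.
#[derive(Debug, PartialEq)]
pub enum ContentBlock {
    Image { mime_type: String, bytes: Vec<u8> },
    Text(String),
}

pub struct Attachment {
    pub attachment_type: String, // "image" | "text_file" | "audio"
    pub filename: String,
    pub mime_type: String,
    pub data: Vec<u8>, // assumed to be base64-decoded already
}

/// Convert one gateway attachment into a content block.
/// `transcribe` stands in for stt::transcribe and returns None on failure.
pub fn to_content_block(
    att: &Attachment,
    stt_enabled: bool,
    transcribe: impl Fn(&[u8]) -> Option<String>,
) -> Option<ContentBlock> {
    match att.attachment_type.as_str() {
        "image" => Some(ContentBlock::Image {
            mime_type: att.mime_type.clone(),
            bytes: att.data.clone(),
        }),
        "text_file" => Some(ContentBlock::Text(
            String::from_utf8_lossy(&att.data).into_owned(),
        )),
        // Only reached when STT is enabled in config.
        "audio" if stt_enabled => {
            let text = match transcribe(att.data.as_slice()) {
                Some(t) => format!("[Voice message transcript]: {t}"),
                // Explicit fallback so a failed transcription is not silent.
                None => format!(
                    "[Voice message — transcription failed for {}]",
                    att.filename
                ),
            };
            Some(ContentBlock::Text(text))
        }
        // Unknown types, and audio with STT disabled, produce no block here.
        _ => None,
    }
}
```

Passing the transcriber as a closure keeps the sketch testable without a real STT backend; the actual code threads `stt_config` through `GatewayParams` instead.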

Changes

| File | Change | Description |
| --- | --- | --- |
| gateway/src/adapters/googlechat.rs | MOD | GoogleChatMessage.attachment[] parsing + GoogleChatMediaRef enum (Image/File/Audio). download_googlechat_image / _file / _audio via Media API + SA token. parse_attachments branches on source (skips DRIVE_FILE) + contentType. Webhook handler spawns background task for downloads so it returns 200 within Google Chat's 30 s deadline. Per-request 30 s timeout. Text file caps: 5 count / 1 MB total (matches Discord/Slack). |
| src/gateway.rs | MOD | Add "audio" branch in attachment → ContentBlock conversion. Decodes base64 → stt::transcribe (gated on stt_config.enabled) → wraps transcript as [Voice message transcript]: …. Falls back to [Voice message — transcription failed for X] so STT failures aren't silent. Threads stt_config through GatewayParams. |
| src/main.rs | MOD | Pass cfg.stt.clone() to GatewayParams.stt_config. |
| docs/google-chat.md | MOD | Document inbound attachment behavior, size limits, and Drive limitations. |
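
As an illustration of how `parse_attachments` could branch on `source` and `contentType`, here is a minimal sketch. `RawAttachment`, `MediaRef`, and `classify` are hypothetical names (the PR's own types are `GoogleChatMessage` and `GoogleChatMediaRef`), and the extension list holds only the three extensions the commit message names, since the full whitelist is not shown on this page.

```rust
// Hypothetical mirror of the fields the adapter reads from message.attachment[].
pub struct RawAttachment {
    pub source: String,                // "UPLOADED_CONTENT" or "DRIVE_FILE"
    pub content_type: String,          // e.g. "image/png"
    pub content_name: String,          // original file name
    pub resource_name: Option<String>, // attachmentDataRef.resourceName
}

#[derive(Debug, PartialEq)]
pub enum MediaRef {
    Image { resource_name: String },
    File { resource_name: String, filename: String },
    Audio { resource_name: String },
}

// Partial list: the PR names ".txt/.md/.py/..." without giving the rest.
const TEXT_EXTENSIONS: &[&str] = &["txt", "md", "py"];

/// Decide how (or whether) an attachment should be downloaded.
pub fn classify(att: &RawAttachment) -> Option<MediaRef> {
    // Drive-linked files need a separate Drive API integration, so skip them.
    if att.source == "DRIVE_FILE" {
        return None;
    }
    let resource_name = att.resource_name.clone()?;
    if att.content_type.starts_with("image/") {
        Some(MediaRef::Image { resource_name })
    } else if att.content_type.starts_with("audio/") {
        Some(MediaRef::Audio { resource_name })
    } else {
        // Everything else must carry a whitelisted text extension.
        let ext = att
            .content_name
            .rsplit_once('.')
            .map(|(_, e)| e.to_ascii_lowercase())?;
        if TEXT_EXTENSIONS.contains(&ext.as_str()) {
            Some(MediaRef::File {
                resource_name,
                filename: att.content_name.clone(),
            })
        } else {
            None
        }
    }
}
```

Under these assumptions a `.bin` upload returns `None`, which corresponds to the `attachment_count=0` outcome reported in the test table.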

Key Design Decisions

  1. PR #731 base64 pattern (feat(gateway): feishu image and text file attachment support): the Google Chat Media API requires SA-token auth that the agent doesn't have, so the gateway downloads, compresses, and base64-encodes before sending over the existing WebSocket attachment channel. No new HTTP server needed (vs. the MediaStore proxy in the closed PR #726, feat(gateway): support images and audio for LINE/Telegram).

  2. Webhook returns 200 immediately — Multi-attachment downloads can exceed Google Chat's 30 s webhook deadline. Webhook handler does sync parsing then tokio::spawn for downloads + event emit, so Google Chat won't retry.

  3. Per-request 30 s timeout — A hung Media API connection can no longer block the spawned task indefinitely.

  4. Text file caps match Discord/Slack: TEXT_FILE_COUNT_CAP = 5 + TEXT_TOTAL_CAP = 1 MB. Text files concatenate into the agent prompt and need an aggregate cap; image and audio are independent and have only per-file size caps. (Feishu currently caps nothing; that's an existing gap in the Feishu adapter, not a model to copy.)

  5. STT in core, not gateway: stt::transcribe already exists for Discord/Slack via download_and_transcribe. Custom Gateway audio just needed an "audio" branch in src/gateway.rs to bridge the same pipeline. No new STT infrastructure; only reuse.

  6. STT failure fallback — Returns [Voice message — transcription failed for X] as a ContentBlock::Text instead of silently dropping audio, so the agent can prompt the user to re-send.

  7. Drive-linked files skipped: the DRIVE_FILE source needs separate Drive API integration; left for a follow-up. UPLOADED_CONTENT (the common path) works.
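
The text-file caps in decision 4 can be sketched as a selection pass over attachment sizes. The two cap constants use the names and values from the PR; `TEXT_FILE_MAX` (the 512 KB per-file limit) and `select_text_files` are hypothetical names. Counting only accepted files toward the count cap is an assumption, as is treating 1 MB as 1024 × 1024 bytes.

```rust
pub const TEXT_FILE_COUNT_CAP: usize = 5;
pub const TEXT_TOTAL_CAP: usize = 1024 * 1024; // 1 MB aggregate (binary MB assumed)
pub const TEXT_FILE_MAX: usize = 512 * 1024; // per-file limit; constant name is hypothetical

/// Return the indices of the text files that fit within all three caps.
/// `sizes` holds each candidate file's size in bytes, in message order.
pub fn select_text_files(sizes: &[usize]) -> Vec<usize> {
    let mut accepted = Vec::new();
    let mut total = 0usize;
    for (i, &size) in sizes.iter().enumerate() {
        if accepted.len() >= TEXT_FILE_COUNT_CAP {
            // Past the count cap: the real adapter logs a warning here.
            continue;
        }
        if size > TEXT_FILE_MAX {
            continue; // single file too large
        }
        if total + size > TEXT_TOTAL_CAP {
            continue; // would push the running aggregate past the total cap
        }
        total += size;
        accepted.push(i);
    }
    accepted
}
```

Because a file that breaks the aggregate cap is skipped rather than ending the loop, a later, smaller file can still be accepted; whether the adapter behaves the same way is not stated in the PR.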

Testing

E2E tested on k3s VPS with Google Chat workspace:

| Scenario | Result |
| --- | --- |
| Single image (red/blue/green PNG) | PASS |
| Image + text → agent identifies color | PASS: "Blue." |
| Markdown text file (.md) → agent summarizes | PASS: correct one-sentence summary |
| Empty text + image only | PASS: event still forwarded |
| Multiple images (2 PNG) → "Blue, green" | PASS: attachment_count=2 |
| .bin (non-whitelist extension) | PASS: skipped (attachment_count=0) |
| 48 MB PNG > 10 MB limit | PASS: Content-Length early reject |
| Audio (.m4a) + STT via Whisper | PASS: full transcript routed to agent |
| Smoke test after webhook bg refactor | PASS: single image still replies "Blue." |
| cargo test (gateway 65 / core 238) | PASS |
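
The "Content-Length early reject" row suggests a two-stage size check: refuse when the declared length already exceeds the limit, then bound the actual read in case the header is missing or wrong. The sketch below shows that idea over any `std::io::Read` body. `read_capped`, `DownloadError`, and the constant names are hypothetical; the limits are the ones quoted in the commit message, with binary megabytes assumed.

```rust
use std::io::Read;

// Per-type limits quoted in the PR; constant names are hypothetical.
pub const IMAGE_MAX_BYTES: u64 = 10 * 1024 * 1024;
pub const TEXT_MAX_BYTES: u64 = 512 * 1024;
pub const AUDIO_MAX_BYTES: u64 = 25 * 1024 * 1024;

#[derive(Debug, PartialEq)]
pub enum DownloadError {
    TooLarge,
    Io(String),
}

/// Read a response body, refusing anything larger than `max_bytes`.
pub fn read_capped<R: Read>(
    body: R,
    content_length: Option<u64>,
    max_bytes: u64,
) -> Result<Vec<u8>, DownloadError> {
    // Stage 1: reject on the declared length before reading any bytes.
    if let Some(len) = content_length {
        if len > max_bytes {
            return Err(DownloadError::TooLarge);
        }
    }
    // Stage 2: read at most max_bytes + 1 so an overshoot is detectable
    // even when Content-Length is absent or understated.
    let mut buf = Vec::new();
    body.take(max_bytes.saturating_add(1))
        .read_to_end(&mut buf)
        .map_err(|e| DownloadError::Io(e.to_string()))?;
    if buf.len() as u64 > max_bytes {
        return Err(DownloadError::TooLarge);
    }
    Ok(buf)
}
```

This still buffers the accepted payload in full, so it bounds memory per download without being the streaming download listed as a follow-up.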

Not Yet Supported

  • Drive-linked attachments: DRIVE_FILE source skipped; needs Drive API + token.
  • Outbound media (bot sends image / file to user) — needs GatewayReply schema extension.
  • Streaming download — 25 MB audio is buffered in full; high-concurrency OOM is a known follow-up.

Breaking Changes

None. GatewayParams adds an stt_config field but cfg.stt already has Default, and the "audio" arm is gated on stt_config.enabled (default false).

Discord Discussion URL

https://discord.com/channels/1491295327620169908/1501153334042824764

🤖 Generated with Claude Code

canyugs and others added 3 commits May 6, 2026 22:28
Implements image / text file / audio download from Google Chat via
Media API + service account token, following the PR openabdev#731 base64 pattern.

Changes:
- GoogleChatMessage: parse attachment[] array (Attachment / AttachmentDataRef / DriveDataRef structs)
- GoogleChatMediaRef enum: Image / File / Audio variants for typed dispatch
- parse_attachments(): branches on contentType prefix, skips DRIVE_FILE source
- download_googlechat_image(): resize → 1200px JPEG q75, max 10MB, GIF preserved
- download_googlechat_file(): text extension whitelist (.txt/.md/.py/...), max 512KB
- download_googlechat_audio(): forwarded as-is for core STT pipeline, max 25MB
- media_url(): percent-encode resource_name as path segment
- webhook handler: parses attachments, async-downloads via adapter token, populates Content.attachments
- Empty-text events with attachments are now forwarded (previously dropped)
- Tests: 11 new (parse, download success/skip/oversized, URL encoding)

Refs: openabdev#731 (Feishu pattern)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Extends src/gateway.rs attachment handling to transcribe audio attachments
via the existing STT pipeline (previously only Discord/Slack adapters
went through download_and_transcribe; Custom Gateway adapters got no
audio path even though stt::transcribe was available).

When a gateway adapter (Feishu, Google Chat, etc.) sends an Attachment
with attachment_type = "audio", core now:
1. Decodes base64 → audio bytes
2. Calls stt::transcribe with the configured SttConfig
3. Wraps the transcript as a ContentBlock::Text:
   "[Voice message transcript]: ..."

The audio branch is gated on stt_config.enabled — if STT is disabled in
config, audio attachments fall through unchanged (same as before).

Threads stt_config through GatewayParams and run_gateway_adapter.

This closes the audio attachment gap left by the (now-closed) PR openabdev#726
without re-introducing the HTTP MediaStore proxy approach. Pairs with
the Google Chat adapter audio download (separate PR) — once both land,
Google Chat voice/audio attachments work end-to-end.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Addresses #4 must-fix items:

#1+#2 Webhook timeout safety:
- Spawn background tokio task for attachment downloads so the webhook
  returns 200 within Google Chat's 30s deadline regardless of how long
  downloads take.
- Add 30s per-request timeout to all Media API GET calls — a single
  hung connection can no longer stall the download task indefinitely.
- Refactor event emission into send_googlechat_event helper to share
  between sync (no-attachment) and async (background download) paths.

#4 Text file caps (matches Discord/Slack):
- TEXT_FILE_COUNT_CAP = 5: skip text files past the 5th with a warning.
- TEXT_TOTAL_CAP = 1 MB: skip text files that would push the running
  aggregate past 1 MB with a warning.
- Image and audio attachments have no count or aggregate cap (same as
  Discord/Slack); their per-file size limits still apply.

#6 STT silent failure:
- When stt::transcribe returns None, push a fallback ContentBlock::Text
  ("[Voice message — transcription failed for ...]") so the agent
  knows a voice message was attempted and can ask the user to re-send.
  Previously the failure was silent and confusing.

Skipped from issue #4: #3 (streaming download), #5 (cross-adapter
refactor; adapters stay independent per design), #7-#10 (cosmetic).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@canyugs canyugs requested a review from thepagent as a code owner May 6, 2026 17:58
Copilot AI review requested due to automatic review settings May 6, 2026 17:58
@github-actions github-actions Bot added the pending-screening PR awaiting automated screening label May 6, 2026
@shaun-agent
Contributor

OpenAB PR Screening

This is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.
Click 👍 if you find this useful. Human review will be done within 24 hours. We appreciate your support and contribution 🙏

Screening report

Intent

PR #762 adds inbound attachment handling to the Google Chat gateway so users can send images, text files, and audio messages to OpenAB agents from Google Chat. It also fills a core gateway gap: custom gateway adapters can now convert inbound audio attachments into STT transcripts before passing them to the agent.

The operator-visible problem is that Google Chat users currently lose attachment context, and audio attachments from custom gateways do not reach the existing transcription path. This makes Google Chat less capable than Discord/Slack for multimodal or voice workflows.

Feat

This is primarily a feature PR with a small core integration fix.

Behavioral change:

  • Google Chat webhook events parse message.attachment[].
  • Uploaded Google Chat media is downloaded by the gateway with service-account auth.
  • Images are resized/compressed and forwarded as base64.
  • Whitelisted text files are capped, base64 encoded, and forwarded.
  • Audio files are forwarded as base64 and transcribed in core when STT is enabled.
  • Google Chat webhook handling returns 200 quickly, with media download and event emission moved into a background task.
  • Drive-linked files are explicitly skipped.
  • Docs describe supported attachment behavior and limits.
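
For the media download step, the commit message mentions a `media_url()` helper that percent-encodes the resource name as a path segment. A dependency-free sketch of that idea follows. The encoding rule (escape everything except RFC 3986 unreserved characters) is an assumption about what "as path segment" means here, and `encode_path_segment` is a hypothetical helper name.

```rust
/// Percent-encode a string for use as a single URL path segment.
/// Assumption: only RFC 3986 unreserved characters are left as-is,
/// so '/' becomes %2F and the value cannot add extra path segments.
pub fn encode_path_segment(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for b in s.bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(b as char)
            }
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}

/// Build the Media API download URL for an uploaded attachment.
pub fn media_url(base: &str, resource_name: &str) -> String {
    format!(
        "{}/v1/media/{}?alt=media",
        base.trim_end_matches('/'),
        encode_path_segment(resource_name)
    )
}
```

For example, `media_url("https://chat.googleapis.com", "a/b c")` yields `https://chat.googleapis.com/v1/media/a%2Fb%20c?alt=media`.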

Who It Serves

Primary beneficiaries:

  • Google Chat end users who want to send images, text files, and voice messages to agents.
  • Deployers running OpenAB in Google Chat workspaces.
  • Agent runtime operators who need consistent custom gateway attachment behavior.
  • Maintainers trying to reduce adapter-specific gaps between Google Chat, Discord, Slack, and Feishu.

Rewritten Prompt

Implement inbound attachment support for the Google Chat gateway adapter.

Parse Google Chat message.attachment[] for uploaded media, skip unsupported Drive-linked attachments, and download supported media using service-account authenticated Google Chat Media API requests. Return the webhook response promptly and perform media download plus gateway event emission asynchronously so Google Chat does not retry due to webhook timeout.

Support:

  • Images: download, enforce size limits, resize to max 1200px, JPEG encode, base64 attach as image.
  • Text files: allow only safe text extensions/content types, enforce count and aggregate size caps, base64 attach as text_file.
  • Audio: download within limits, base64 attach as audio.

In core gateway attachment conversion, add support for audio attachments by decoding base64 and invoking the existing STT transcription path when STT is enabled. Convert successful transcripts to text content blocks and emit an explicit fallback text block when transcription fails.

Add focused tests for attachment parsing, caps, unsupported sources/types, core audio conversion, STT disabled behavior, and webhook fast-return behavior where practical. Update Google Chat docs with supported types, limits, and unsupported Drive files.

Merge Pitch

This is worth advancing because it closes a major Google Chat parity gap and reuses the existing gateway attachment channel instead of introducing a new media-serving subsystem. The PR also improves the shared custom gateway path by making audio attachments useful beyond Google Chat.

Risk profile is moderate. The high-risk areas are the large googlechat.rs change, background task lifecycle/error visibility, memory use from buffering media, and correctness of Google Chat media auth/download behavior. Reviewers will likely focus on whether the gateway-owned background work is observable, bounded, and testable enough before merge.

Best-Practice Comparison

Relevant OpenClaw principles:

  • Gateway-owned scheduling: Relevant. The gateway owns the delayed media download and event emission after webhook acknowledgement.
  • Durable job persistence: Partially relevant. This PR uses tokio::spawn, so work can be lost if the process exits after returning 200.
  • Isolated executions: Somewhat relevant. Media handling is separated into background work, but not isolated as durable jobs.
  • Explicit delivery routing: Relevant. Attachments are converted into typed gateway event content before core handling.
  • Retry/backoff and run logs: Relevant gap. The PR mentions timeouts and fallback behavior, but not durable retries or structured run logs for failed background media processing.

Relevant Hermes Agent principles:

  • Gateway daemon tick model: Mostly not relevant unless OpenAB wants scheduled or recoverable background gateway work.
  • File locking to prevent overlap: Not directly relevant for per-webhook media downloads.
  • Atomic writes for persisted state: Not relevant unless the background task is made durable.
  • Fresh session per scheduled run: Not relevant.
  • Self-contained prompts for scheduled tasks: Not relevant.

The main best-practice tension is that returning 200 before media processing is correct for Google Chat webhook behavior, but without durable persistence it trades retry avoidance for possible silent event loss during crashes or deploys.
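
The tension described above can be made concrete with a small acknowledge-then-process sketch. To stay dependency-free it swaps `tokio::spawn` for `std::thread::spawn` and an `mpsc` channel; `Job`, `GatewayEvent`, and `handle_webhook` are hypothetical shapes, and the download step is passed in as a plain function pointer.

```rust
use std::sync::mpsc::Sender;
use std::thread;

pub struct Job {
    pub message_id: String,
    pub resource_names: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub struct GatewayEvent {
    pub message_id: String,
    pub attachment_count: usize,
}

/// Acknowledge the webhook first, then download and emit in the background.
/// Returns the HTTP status to send back to the platform.
pub fn handle_webhook(
    job: Job,
    events: Sender<GatewayEvent>,
    download: fn(&str) -> Option<Vec<u8>>,
) -> u16 {
    thread::spawn(move || {
        // Failed downloads are dropped; the event is still emitted.
        let attachment_count = job
            .resource_names
            .iter()
            .filter_map(|r| download(r.as_str()))
            .count();
        // Nothing was persisted before the 200 went out, so a crash or
        // deploy before this send loses the event with no retry.
        let _ = events.send(GatewayEvent {
            message_id: job.message_id,
            attachment_count,
        });
    });
    200
}
```

A durable variant in the spirit of Option 3 would write the `Job` to storage before returning 200, so that a worker could pick it up again after a restart.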

Implementation Options

Option 1: Conservative merge with review hardening
Keep the current base64 gateway approach and background task model. Require targeted tests, stronger logging, clearer error paths, and validation of limits/auth behavior before merge. Defer durable media jobs and streaming downloads.

Option 2: Balanced gateway media job abstraction
Keep base64 attachments, but introduce a small internal media-processing job abstraction inside the gateway. Add structured run logs, bounded concurrency, explicit timeout/error reporting, and a clean test surface. Still avoid persistent storage for now.

Option 3: Durable media processing queue
After webhook parse, persist a media-processing job before returning 200. A gateway worker processes jobs with retry/backoff, logs outcomes, and emits events only after successful attachment processing. This aligns more closely with OpenClaw/Hermes durability patterns.

Option 4: MediaStore/proxy architecture
Revisit a media store or proxy design where the gateway stores fetched media and sends references instead of base64 payloads. Core or agents retrieve media through controlled URLs or internal APIs. This is broader and likely requires schema/API changes.

Comparison Table

| Option | Speed to ship | Complexity | Reliability | Maintainability | User impact | Fit for OpenAB right now |
| --- | --- | --- | --- | --- | --- | --- |
| 1. Conservative merge with hardening | High | Low-Medium | Medium | Medium | High | Strong |
| 2. Gateway media job abstraction | Medium | Medium | Medium-High | High | High | Strong |
| 3. Durable media processing queue | Low | High | High | Medium-High | High | Medium |
| 4. MediaStore/proxy architecture | Low | High | High | Medium | High | Weak-Medium |

Recommendation

Advance this PR toward merge using Option 1, with a reviewer checklist focused on bounded background execution, observability, test coverage, and memory limits.

This is the right next step because the feature solves a real Google Chat usability gap without requiring a new media architecture. The PR already follows the existing base64 attachment pattern and keeps STT in core, which is the cleaner boundary.

Recommended follow-up split:

  • Merge Google Chat uploaded attachment support and core audio-to-STT bridging first.
  • Open separate issues for durable gateway media jobs, streaming downloads, Drive-linked files, and outbound media replies.
  • Consider Option 2 after merge if more adapters start needing the same background media-processing behavior.

