Summary
Make chat multimodal so users can attach images, use the camera, and upload files directly in conversations.
Problem
Chat is currently too text-centric for many real user workflows. Users need to send screenshots, photos, camera captures, PDFs, documents, and other files so the agent can inspect, summarize, answer questions, and take actions with that context.
Without multimodal chat, users have to describe visual/file context manually or leave OpenHuman to process documents elsewhere.
Solution (optional)
Add multimodal input support to chat:
Image upload from disk.
Camera capture where supported.
File upload for common document formats.
Attachment previews in the composer and message history.
Backend/core handling for file metadata, storage, memory ingestion, and model/tool routing.
Clear errors for unsupported file types, oversized files, or unavailable vision/file models.
Start with desktop chat, then make the implementation reusable for mobile/web if applicable.
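As a rough illustration of the "clear errors" item, a composer-side check could classify each attachment before upload. This is a minimal sketch, not the project's implementation: the names `validateAttachment`, `AttachmentMeta`, `MAX_ATTACHMENT_BYTES`, and `SUPPORTED_MIME_TYPES` are hypothetical, and the 25 MiB limit and the MIME list are placeholder assumptions that this issue does not specify.

```typescript
// Hypothetical limit and type list; the issue does not define either.
const MAX_ATTACHMENT_BYTES = 25 * 1024 * 1024;
const SUPPORTED_MIME_TYPES = new Set<string>([
  "image/png",
  "image/jpeg",
  "application/pdf",
  "text/plain",
  "text/csv",
]);

// Minimal metadata the composer would know about a picked file.
interface AttachmentMeta {
  name: string;
  mimeType: string;
  sizeBytes: number;
}

type ValidationResult =
  | { ok: true }
  | {
      ok: false;
      reason: "unsupported_type" | "too_large" | "no_capable_model";
      message: string;
    };

// Returns a user-facing error for the three failure classes named above,
// or { ok: true } when the attachment can be sent.
function validateAttachment(
  meta: AttachmentMeta,
  visionAvailable: boolean,
): ValidationResult {
  if (!SUPPORTED_MIME_TYPES.has(meta.mimeType)) {
    return {
      ok: false,
      reason: "unsupported_type",
      message: `${meta.name}: file type ${meta.mimeType} is not supported.`,
    };
  }
  if (meta.sizeBytes > MAX_ATTACHMENT_BYTES) {
    return {
      ok: false,
      reason: "too_large",
      message: `${meta.name}: file exceeds the ${MAX_ATTACHMENT_BYTES} byte limit.`,
    };
  }
  if (meta.mimeType.startsWith("image/") && !visionAvailable) {
    return {
      ok: false,
      reason: "no_capable_model",
      message: `${meta.name}: no vision-capable model is available.`,
    };
  }
  return { ok: true };
}
```

Returning a tagged result rather than throwing lets the composer render each failure inline next to the attachment preview.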
Acceptance criteria
Image upload — Users can attach images to chat and the agent can reason over them when a vision-capable model/tool is available.
Camera capture — Users can capture an image from camera/webcam where supported and send it into chat.
File upload — Users can upload common files such as PDFs, text files, docs, CSVs, and images.
Attachment preview — Composer and message history show clear attachment previews, file names, sizes, and upload state.
Model/tool routing — Attachments are routed to the correct vision, document, memory, or file-processing path.
Memory integration — Uploaded files can be stored/ingested into memory when appropriate, with user-visible status.
Failure states — Unsupported formats, oversized files, upload errors, and missing model capability are shown clearly.
Privacy controls — Users understand whether files are local-only, sent to cloud models, or stored in memory.
Regression safety — Unit/E2E coverage verifies image upload, file upload, camera capture fallback, and error states.
Diff coverage ≥ 80% — the implementing PR meets the changed-lines coverage gate (Vitest + cargo-llvm-cov, enforced by .github/workflows/coverage.yml) when code changes are involved.
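The model/tool routing criterion above could be sketched as a pure function from MIME type to processing path. The route names, `DOCUMENT_MIME_TYPES`, and `routeAttachment` are hypothetical and only illustrate the shape of the decision. Memory ingestion is left out because the criteria describe it as a separate, conditional step with its own user-visible status.

```typescript
// Hypothetical processing paths; the real names are not given in this issue.
type AttachmentRoute = "vision" | "document" | "file_processing";

// Assumed set of types handled by a document path.
const DOCUMENT_MIME_TYPES = new Set<string>([
  "application/pdf",
  "text/plain",
  "text/csv",
]);

// Picks a processing path from the MIME type alone.
function routeAttachment(mimeType: string): AttachmentRoute {
  const normalized = mimeType.toLowerCase();
  if (normalized.startsWith("image/")) {
    return "vision";
  }
  if (DOCUMENT_MIME_TYPES.has(normalized)) {
    return "document";
  }
  // Anything else falls through to generic file processing.
  return "file_processing";
}
```

Keeping the routing decision free of side effects makes it straightforward to cover with unit tests.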
Related
Prior user request: support multimodal chat with images, camera, and files.