Add multimodal memory to the upgraded memory-lancedb-pro stack without overloading the core text-memory path.
Target outcome:
- keep the existing text memory plugin stable
- add multimodal ingestion as an extension/plugin layer
- use `gemini-embedding-2-preview` as the default multimodal embedding provider
- keep LanceDB as the storage/index layer for cross-modal retrieval
Use a two-plugin design instead of forcing all new logic into the current plugin.
Keep `memory-lancedb-pro` as the core memory runtime:
- memory lifecycle hooks
- smart extraction for text conversations
- recall injection
- scoring, rerank, decay, scope isolation
- CLI and maintenance commands
Create a new plugin responsible for multimodal assets:
- image / video / audio / PDF ingestion
- file normalization and MIME detection
- media chunking
- embedding generation via Gemini Embedding 2
- writing media records into LanceDB
- optional file/blob sidecar storage
This keeps the core memory plugin focused and lets you evolve multimodal support independently.
Use one LanceDB table per embedding space.
`memories_text` and `memories_multimodal`
Pros:
- easiest migration path
- avoids mixing old text-only rows with new multimodal rows
- lets retrieval logic be tuned separately
Alternative: store both text and media rows together with a `content_type` field:
`text` | `image` | `video` | `audio` | `pdf`
Extra fields:
`source_uri`, `mime_type`, `modality`, `caption`, `ocr_text`, `transcript`, `segment_index`, `segment_start_ms`, `segment_end_ms`, `preview_text`, `metadata`
Recommendation: start with separate tables, then unify only if cross-modal ranking needs a single pipeline.
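A minimal sketch of what a `memories_multimodal` row could look like, using the extra fields listed above. The types and the `buildMediaRow` helper are illustrative assumptions, not the plugin's actual schema:

```typescript
import { randomUUID } from "node:crypto";

type Modality = "image" | "video" | "audio" | "pdf";

// Field names follow the schema list above; everything else is a sketch.
interface MediaRow {
  id: string;
  vector: number[];          // embedding from the configured provider
  modality: Modality;
  source_uri: string;
  mime_type: string;
  caption?: string;
  ocr_text?: string;
  transcript?: string;
  segment_index?: number;
  segment_start_ms?: number;
  segment_end_ms?: number;
  preview_text?: string;
  metadata?: Record<string, unknown>;
}

// Build a row ready for insertion into the multimodal table.
function buildMediaRow(partial: Omit<MediaRow, "id">): MediaRow {
  return { id: randomUUID(), ...partial };
}
```

Keeping the text table's schema untouched and giving media rows their own shape is what makes the separate-tables option the easiest migration path.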
Add a dedicated tool family for multimodal memory.
`memory_media_store`, `memory_media_search`, `memory_media_list`, `memory_media_delete`
Suggested input shapes:
- image: file path or URL
- video: file path or URL plus chunking options
- audio: file path or URL plus transcript options
- pdf: file path or URL plus page chunking options
Keep `memory_store` text-only. Do not overload it with media unions unless you are ready to refactor the current prompt/tooling contract.
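The input shapes above can be expressed as a discriminated union, which keeps the media tools strictly typed without touching `memory_store`. The option names (`chunking`, `transcript`, `pages`) are assumptions, not a published schema:

```typescript
// Sketch of a memory_media_store input contract.
type MediaStoreInput =
  | { kind: "image"; source: string }
  | { kind: "video"; source: string; chunking?: { segmentMs: number } }
  | { kind: "audio"; source: string; transcript?: { enabled: boolean } }
  | { kind: "pdf"; source: string; pages?: { perChunk: number } };

// Returns an error message, or null when the input is acceptable.
function validateMediaStoreInput(input: MediaStoreInput): string | null {
  if (!input.source) return "source (file path or URL) is required";
  if (input.kind === "video" && input.chunking && input.chunking.segmentMs <= 0)
    return "segmentMs must be positive";
  if (input.kind === "pdf" && input.pages && input.pages.perChunk <= 0)
    return "perChunk must be positive";
  return null;
}
```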
Image ingestion flow:
- Resolve file or download URL
- Validate MIME and size
- Generate a preview/caption if needed
- Embed raw image or image+caption with Gemini Embedding 2
- Store vector + metadata + source pointer
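The image flow above can be sketched as one orchestration function with its dependencies injected, which keeps it testable without a real embedder or database. All names, the MIME allowlist, and the size limit are hypothetical:

```typescript
interface ImageDeps {
  resolve(source: string): Promise<{ bytes: Uint8Array; mime: string }>;
  caption(bytes: Uint8Array): Promise<string | undefined>;
  embed(bytes: Uint8Array, caption?: string): Promise<number[]>;
  store(row: {
    vector: number[];
    source_uri: string;
    mime_type: string;
    caption?: string;
  }): Promise<void>;
}

const ALLOWED_IMAGE_MIME = new Set(["image/png", "image/jpeg", "image/webp"]);

async function ingestImage(source: string, deps: ImageDeps): Promise<void> {
  const { bytes, mime } = await deps.resolve(source);       // resolve/download
  if (!ALLOWED_IMAGE_MIME.has(mime)) throw new Error(`unsupported MIME: ${mime}`);
  if (bytes.length > 20 * 1024 * 1024) throw new Error("image too large");
  const caption = await deps.caption(bytes);                // optional preview/caption
  const vector = await deps.embed(bytes, caption);          // image (+caption) embedding
  await deps.store({ vector, source_uri: source, mime_type: mime, caption });
}
```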
Video ingestion flow:
- Resolve file
- Sample frames and segment timeline
- Optionally extract audio transcript
- Embed representative frames or segments
- Store segment-level rows
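Segmenting the timeline can be as simple as fixed windows; the field names match the schema sketch above, while the default window length is an assumption:

```typescript
interface Segment {
  segment_index: number;
  segment_start_ms: number;
  segment_end_ms: number;
}

// Split a media timeline into fixed-length windows; the last window is
// clamped to the actual duration.
function segmentTimeline(durationMs: number, segmentMs = 10_000): Segment[] {
  const segments: Segment[] = [];
  for (let start = 0, i = 0; start < durationMs; start += segmentMs, i++) {
    segments.push({
      segment_index: i,
      segment_start_ms: start,
      segment_end_ms: Math.min(start + segmentMs, durationMs),
    });
  }
  return segments;
}
```

Each segment then gets its own embedded row, so retrieval can point back to a timestamp rather than a whole file.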
Audio ingestion flow:
- Resolve file
- Transcribe if useful
- Embed raw audio segments or transcript+audio representations
- Store segment-level rows
PDF ingestion flow:
- Resolve file
- Split by page or page ranges
- Extract OCR/text when available
- Embed page-level content
- Store page-level rows
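Page-range chunking for the PDF flow is similarly mechanical; this helper and its default chunk size are purely illustrative:

```typescript
// Split a document into inclusive 1-based page ranges, e.g. [1,4], [5,8], ...
function chunkPages(pageCount: number, pagesPerChunk = 4): Array<[number, number]> {
  const chunks: Array<[number, number]> = [];
  for (let first = 1; first <= pageCount; first += pagesPerChunk) {
    chunks.push([first, Math.min(first + pagesPerChunk - 1, pageCount)]);
  }
  return chunks;
}
```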
Use a broker pattern:
- text query enters `memory-lancedb-pro`
- broker queries text memory and multimodal memory in parallel
- results are normalized into a shared score shape
- optional reranker merges final ranking
Suggested retrieval stages:
- text memory search
- multimodal vector search
- metadata filter by modality/scope/project
- score normalization
- rerank
- inject top results as concise summaries, not raw blobs
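The normalization and merge stages above can be sketched as pure functions. Min-max normalization and the modality weight are assumptions about one reasonable scheme, not the plugin's actual ranking:

```typescript
interface Hit { id: string; score: number; modality: string }

// Rescale scores within one source to [0, 1] so text and media are comparable.
function normalize(hits: Hit[]): Hit[] {
  if (hits.length === 0) return hits;
  const scores = hits.map((h) => h.score);
  const min = Math.min(...scores);
  const span = Math.max(...scores) - min || 1; // avoid divide-by-zero on uniform scores
  return hits.map((h) => ({ ...h, score: (h.score - min) / span }));
}

// Merge both sources into one ranked list, slightly down-weighting media.
function mergeResults(text: Hit[], media: Hit[], mediaWeight = 0.9, topK = 5): Hit[] {
  const merged = [
    ...normalize(text),
    ...normalize(media).map((h) => ({ ...h, score: h.score * mediaWeight })),
  ];
  return merged.sort((a, b) => b.score - a.score).slice(0, topK);
}
```

A reranker can then replace the simple weighted sort without changing the broker's interface.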
Add a separate config block under the new plugin:
```json
{
  "plugins": {
    "entries": {
      "memory-multimodal-ingest": {
        "enabled": true,
        "config": {
          "embedding": {
            "provider": "openai-compatible",
            "apiKey": "${GEMINI_API_KEY}",
            "model": "gemini-embedding-2-preview",
            "baseURL": "https://generativelanguage.googleapis.com/v1beta/openai/",
            "dimensions": 3072
          },
          "dbPath": "/Users/yyc/.openclaw/memory/lancedb-multimodal",
          "blobPath": "/Users/yyc/.openclaw/memory/blobs",
          "modalities": {
            "image": true,
            "video": true,
            "audio": true,
            "pdf": true
          }
        }
      }
    }
  }
}
```

Phase 1:
- upgrade core `memory-lancedb-pro`
- keep text memory behavior unchanged
Phase 2:
- add `memory-multimodal-ingest`
- ingest media into a separate LanceDB path
Phase 3:
- add retrieval broker
- merge text + media recall into one injection layer
Phase 4:
- add maintenance commands
- re-embed / rebuild / compact / export for media rows
- add a `MultimodalEmbedder` abstraction instead of reusing the text-only helper blindly
- add MIME-aware file loaders
- add media chunkers for video/audio/PDF
- add LanceDB schema for modality metadata
- add a retrieval broker in the core memory plugin
- add config validation and UI hints
- add test fixtures for image, audio, video, and PDF ingestion
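One way to shape the `MultimodalEmbedder` abstraction from the checklist; the method names are assumptions, and the fake implementation is only a stand-in for test fixtures, while a real provider (e.g. `gemini-embedding-2-preview` behind an OpenAI-compatible `baseURL`) would implement the same interface:

```typescript
interface EmbedInput {
  modality: "text" | "image" | "video" | "audio" | "pdf";
  data: Uint8Array | string; // raw bytes, or extracted text/transcript
  mimeType?: string;
}

interface MultimodalEmbedder {
  readonly dimensions: number;
  embed(input: EmbedInput): Promise<number[]>;
}

// Deterministic stand-in for tests and dry runs.
class FakeEmbedder implements MultimodalEmbedder {
  readonly dimensions = 4;
  async embed(_input: EmbedInput): Promise<number[]> {
    return new Array(this.dimensions).fill(0);
  }
}
```

Keeping the embedder behind an interface is what lets the ingest plugin swap providers, or re-embed rows later, without schema changes.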
Do not refactor the current plugin into an all-in-one multimodal plugin on the first pass.
Ship in this order:
- upgrade the core plugin
- add a separate multimodal ingest plugin
- bridge retrieval after ingestion is stable
That gives you a reversible migration path and keeps text memory reliable while you build the multimodal layer.