[codex] add multimodal support across providers and unify media retries #37

Merged: clifton merged 5 commits into main from universal-multi-modal on Feb 13, 2026
@clifton commented Feb 13, 2026

Summary

This PR adds first-class multimodal structured extraction support to the OpenAI, Anthropic, and Grok backends in rstructor, and brings Gemini's media path to retry parity with text materialization.

Problem

Before this change, materialize_with_media only worked in the Gemini backend. For OpenAI, Anthropic, and Grok, media inputs were accepted by the public API but effectively ignored because provider request serialization only emitted text content.

Additionally, Gemini's media path bypassed the retry-with-conversation-history flow used by text materialization, which made recovery from validation failures less reliable for media calls.

User Impact

  • Users can now call materialize_with_media(...) consistently across OpenAI, Anthropic, Grok, and Gemini.
  • Structured multimodal extraction now works with both inline bytes and URI/URL media, each mapped to the provider's native format.
  • Retry behavior is now consistent for Gemini media calls as well, improving robustness for schema-constrained outputs.
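
To illustrate the two media forms mentioned above, here is a minimal std-only sketch of reducing either form to a single URL string: inline bytes become a base64 data URL, and a URI passes through unchanged. The `Media` enum and `to_url_string` function are hypothetical names for illustration and are not taken from rstructor's actual API.

```rust
/// Hypothetical media input (illustrative only, not rstructor's real type).
pub enum Media {
    /// Raw bytes plus a MIME type such as "image/png".
    Inline { mime_type: String, data: Vec<u8> },
    /// A remote URI/URL the provider can fetch itself.
    Uri(String),
}

const B64: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Standard base64 (RFC 4648 alphabet) with '=' padding.
fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-filling missing bytes.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(B64[(n >> 18) as usize & 63] as char);
        out.push(B64[(n >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 { B64[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { B64[n as usize & 63] as char } else { '=' });
    }
    out
}

/// Inline bytes become a data URL; URIs are returned unchanged.
pub fn to_url_string(media: &Media) -> String {
    match media {
        Media::Inline { mime_type, data } => {
            format!("data:{};base64,{}", mime_type, base64_encode(data))
        }
        Media::Uri(uri) => uri.clone(),
    }
}
```

A single string form like this is convenient for providers whose image input is a URL field; providers that take base64 data and media type as separate fields would need the two forms kept apart.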

Root Cause

  • Provider request models for OpenAI/Grok/Anthropic modeled message content as content: String only, so media parts had nowhere to be serialized.
  • Shared retry utility only accepted prompt text and always built initial messages as ChatMessage::user(prompt).
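
The second root cause can be sketched as follows: a retry helper that takes an initial message list, with the prompt-only entry point delegating to it. This is a simplified, synchronous illustration. The function names come from this PR, but the signatures, the `Failed` type, the `ChatMessage` fields, and the feedback wording are assumptions and will differ from the real (async) implementation.

```rust
/// Simplified chat message (assumed shape, illustration only).
#[derive(Debug, Clone, PartialEq)]
pub struct ChatMessage {
    pub role: String,
    pub content: String,
}

impl ChatMessage {
    pub fn user(content: impl Into<String>) -> Self {
        Self { role: "user".to_string(), content: content.into() }
    }
    pub fn assistant(content: impl Into<String>) -> Self {
        Self { role: "assistant".to_string(), content: content.into() }
    }
}

/// Hypothetical failure record: the raw model output and why it was rejected.
pub struct Failed {
    pub raw: String,
    pub error: String,
}

/// Retry loop that starts from caller-supplied history, so the first
/// message can carry media rather than being rebuilt from a text prompt.
pub fn generate_with_retry_with_initial_messages<T>(
    initial: Vec<ChatMessage>,
    max_retries: usize,
    mut attempt: impl FnMut(&[ChatMessage]) -> Result<T, Failed>,
) -> Result<T, String> {
    let mut history = initial;
    let mut last_error = String::new();
    // One initial attempt plus up to `max_retries` retries.
    for _ in 0..=max_retries {
        match attempt(&history) {
            Ok(value) => return Ok(value),
            Err(failed) => {
                // Feed the failed exchange back so the next attempt can correct it.
                history.push(ChatMessage::assistant(failed.raw));
                history.push(ChatMessage::user(format!(
                    "Validation failed: {}. Please fix and retry.",
                    failed.error
                )));
                last_error = failed.error;
            }
        }
    }
    Err(last_error)
}

/// Prompt-only entry point, now a thin wrapper over the function above.
pub fn generate_with_retry_with_history<T>(
    prompt: &str,
    max_retries: usize,
    attempt: impl FnMut(&[ChatMessage]) -> Result<T, Failed>,
) -> Result<T, String> {
    generate_with_retry_with_initial_messages(
        vec![ChatMessage::user(prompt)],
        max_retries,
        attempt,
    )
}
```

Keeping the prompt-only function as a wrapper means existing text callers are unaffected while media-aware backends can pass their own first message.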

Fix

  1. Added a new shared retry utility entry point that accepts initial message history:
    • generate_with_retry_with_initial_messages(...)
    • Existing generate_with_retry_with_history(...) now delegates to it.
  2. Implemented provider-native multimodal content serialization:
    • OpenAI/Grok: mixed content parts with type: text and type: image_url.
    • Anthropic: content blocks with type: text and type: image + source (base64 or url).
  3. Updated each provider backend to override materialize_with_media(...) and invoke retry with initial media-bearing messages.
  4. Updated Gemini materialize_with_media(...) to use the retry/history path (parity fix).
  5. Added shared media_to_url(...) helper for normalized data-URL/URI handling in compatible providers.
  6. Updated model enums to include newer model IDs found in the current docs/API lists:
    • OpenAI: gpt-5.2-chat-latest, gpt-5.2-codex
    • Gemini: gemini-2.5-flash-image, gemini-2.0-flash-lite-001
  7. Expanded docs/examples for multimodal usage across providers.
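
As a rough illustration of item 2, the sketch below renders the same logical parts into the two wire shapes: OpenAI/Grok-style content parts (type: text, type: image_url) and Anthropic-style content blocks (type: image with a base64 or url source). It builds JSON by hand to stay std-only; the `Part` enum and all function names are hypothetical, and the real code presumably uses its own request structs and a JSON library.

```rust
/// Hypothetical provider-neutral content part (illustration only).
pub enum Part {
    Text(String),
    /// An http(s) URL the provider fetches itself.
    ImageUrl(String),
    /// Already base64-encoded bytes plus a MIME type.
    ImageBase64 { media_type: String, data: String },
}

/// Minimal JSON string escaping for the sketch.
fn json_str(s: &str) -> String {
    let mut out = String::from("\"");
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// OpenAI/Grok style: images always travel as a URL, so base64 data
/// is wrapped in a data URL.
pub fn openai_part(part: &Part) -> String {
    match part {
        Part::Text(t) => format!(r#"{{"type":"text","text":{}}}"#, json_str(t)),
        Part::ImageUrl(url) => format!(
            r#"{{"type":"image_url","image_url":{{"url":{}}}}}"#,
            json_str(url)
        ),
        Part::ImageBase64 { media_type, data } => {
            let url = format!("data:{};base64,{}", media_type, data);
            format!(r#"{{"type":"image_url","image_url":{{"url":{}}}}}"#, json_str(&url))
        }
    }
}

/// Anthropic style: an image block whose source is either base64 or url.
pub fn anthropic_block(part: &Part) -> String {
    match part {
        Part::Text(t) => format!(r#"{{"type":"text","text":{}}}"#, json_str(t)),
        Part::ImageUrl(url) => format!(
            r#"{{"type":"image","source":{{"type":"url","url":{}}}}}"#,
            json_str(url)
        ),
        Part::ImageBase64 { media_type, data } => format!(
            r#"{{"type":"image","source":{{"type":"base64","media_type":{},"data":{}}}}}"#,
            json_str(media_type),
            json_str(data)
        ),
    }
}

/// Render a message's `content` array with the chosen provider renderer.
pub fn content_array(parts: &[Part], render: fn(&Part) -> String) -> String {
    let items: Vec<String> = parts.iter().map(render).collect();
    format!("[{}]", items.join(","))
}
```

For example, `content_array(&parts, openai_part)` and `content_array(&parts, anthropic_block)` produce the two provider-specific `content` arrays from one shared list of parts.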

Validation

  • cargo fmt --all
  • cargo clippy --all-targets --all-features -- -D warnings
  • cargo test --all-features --no-run
  • cargo test --all-features --test openai_multimodal_tests -- --nocapture
  • cargo test --all-features --test anthropic_multimodal_tests -- --nocapture
  • cargo test --all-features --test grok_multimodal_tests -- --nocapture
  • cargo test --all-features --test gemini_multimodal_tests -- --nocapture
  • cargo test --all-features --test model_string_test
  • cargo test --all-features test_generate_with_retry_with_initial_messages -- --nocapture

All checks passed locally.

@clifton clifton marked this pull request as ready for review February 13, 2026 15:11
@clifton clifton merged commit 4ea98ce into main Feb 13, 2026
8 checks passed
@clifton clifton deleted the universal-multi-modal branch February 13, 2026 15:18