
LLM cost optimization: cross-provider migration to Gemini for 40-60% OpenAI savings #6873

@beastoin

Description


Problem

Omi's OpenAI spend is concentrated in two models: gpt-5.1 (57% of OpenAI spend) and gpt-4.1-mini (41%). Cross-provider alternatives, particularly Gemini 2.5 Pro and Gemini 2.5 Flash-Lite, offer 37-75% per-token savings at equivalent quality for most workloads. Two OpenRouter models are already dead (404).

Current spend distribution

  • OpenAI: ~44% of total LLM spend — gpt-5.1 (57% of OAI), gpt-4.1-mini (41% of OAI)
  • Anthropic: ~56% of total LLM spend — Sonnet 4.6 (74%), Opus 4.6 (24%)
  • OpenRouter: 3 features (2 dead models)
  • Perplexity: 1 feature (web search)

Top 6 features by OpenAI spend (= 88% of OpenAI cost)

  1. conversation_processing — 26.3%, gpt-5.1
  2. other (misc) — 26.6%, mixed mini+5.1
  3. conv_action_items — 13.0%, gpt-5.1
  4. conv_structure — 10.2%, gpt-5.1
  5. conv_apps — 8.7%, gpt-5.1
  6. memories — 4.6%, gpt-4.1-mini

Proposal: Cross-Provider Migration (5 Phases)

Phase 1 — gpt-5.1 → Gemini 2.5 Pro (highest impact)

~57% of OpenAI spend. Savings: 37% on input, ~same output.

These are the most expensive features: gemini-2.5-pro ($1.25 input / $10 output per 1M tokens) vs gpt-5.1 (~$2 / $10).

| Feature | Current | Proposed | Output Type | Blocker |
| --- | --- | --- | --- | --- |
| conversation_processing | gpt-5.1 | gemini-2.5-pro | Pydantic JSON | Structured outputs, prompt caching |
| conv_action_items | gpt-5.1 | gemini-2.5-pro | Pydantic JSON (ActionItemsExtraction) | Structured outputs, prompt caching |
| conv_structure | gpt-5.1 | gemini-2.5-pro | Pydantic JSON (Structured) | Structured outputs, prompt caching |
| conv_app_result | gpt-5.1 | gemini-2.5-pro | Free text | Prompt caching |
| daily_summary | gpt-5.1 | gemini-2.5-pro | Free text ≤50w | Prompt caching, highest cost/call |
| persona_clone | gpt-5.1 | gemini-2.5-pro | Multi-stage text | Prompt caching |

Prerequisites: a Pydantic → Gemini responseSchema adapter (blocker shared with Phase 3), and an evaluation of Gemini context caching as a replacement for OpenAI prompt caching.

Risk: Medium-High. These features process every conversation. A/B eval required.
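To make the 37% input figure concrete, a back-of-envelope sketch using the prices quoted above (the daily token volumes are illustrative assumptions, not measured Omi traffic):

```python
# Per-1M-token prices from this issue: (input $/1M, output $/1M).
PRICES = {
    "gpt-5.1": (2.00, 10.00),
    "gemini-2.5-pro": (1.25, 10.00),
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a workload at the per-1M-token prices above."""
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# Assumed workload: 10M input / 1M output tokens per day (illustrative).
before = cost("gpt-5.1", 10_000_000, 1_000_000)          # $30.00/day
after = cost("gemini-2.5-pro", 10_000_000, 1_000_000)    # $22.50/day
input_savings = 1 - PRICES["gemini-2.5-pro"][0] / PRICES["gpt-5.1"][0]

print(f"gpt-5.1: ${before:.2f}/day, gemini-2.5-pro: ${after:.2f}/day")
print(f"input savings: {input_savings:.1%}")  # 37.5% on input, output unchanged
```

Because output prices match, overall savings per feature scale with its input/output token ratio, which favors the long-context conversation features above.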

Phase 2 — Easy gpt-4.1-mini → Gemini 2.5 Flash-Lite (quick wins)

~5% of OpenAI spend, but minimal engineering risk. Savings: 75%.

Free-text features with no Pydantic dependency. gemini-2.5-flash-lite ($0.10/$0.40) vs gpt-4.1-mini ($0.40/$1.60).

| Feature | Output Type | Latency | Calls/day |
| --- | --- | --- | --- |
| memory_category | Text enum | background | part of memories (5.4K) |
| session_titles | Free text | near-realtime | low |
| followup | Free text | realtime | low |
| onboarding | Boolean | near-realtime | low |
| daily_summary | Free text ≤50w | background | ~850 |

Prerequisite: Add Gemini direct API client (google-generativeai SDK). Simple routing change.

Risk: Low. Simplest tasks in the system.
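The routing change could look like the sketch below; `FEATURE_MODELS`, `route`, and `generate` are hypothetical names, and the Gemini path uses the google-generativeai SDK named above:

```python
# Feature-level model registry (hypothetical name). Phase 2 free-text
# features route to Gemini 2.5 Flash-Lite; everything else is untouched.
FEATURE_MODELS = {
    "memory_category": ("gemini", "gemini-2.5-flash-lite"),
    "session_titles": ("gemini", "gemini-2.5-flash-lite"),
    "followup": ("gemini", "gemini-2.5-flash-lite"),
    "onboarding": ("gemini", "gemini-2.5-flash-lite"),
    "daily_summary": ("gemini", "gemini-2.5-flash-lite"),
}

def route(feature: str) -> tuple[str, str]:
    """Return (provider, model); default to the current OpenAI model."""
    return FEATURE_MODELS.get(feature, ("openai", "gpt-4.1-mini"))

def generate(feature: str, prompt: str) -> str:
    provider, model = route(feature)
    if provider == "gemini":
        import google.generativeai as genai  # pip install google-generativeai
        return genai.GenerativeModel(model).generate_content(prompt).text
    raise NotImplementedError("existing OpenAI path stays as-is")
```

Keeping the registry data-driven makes each later phase a one-line diff per feature, which also simplifies rollback.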

Phase 3 — Structured gpt-4.1-mini → Gemini 2.5 Flash-Lite (bulk migration)

~36% of OpenAI spend. Savings: 75%.

18 features use OpenAI Pydantic structured outputs. Build a provider-agnostic adapter, then migrate.

| Feature | Output Type | Latency | Notes |
| --- | --- | --- | --- |
| conv_discard | Boolean (Pydantic) | near-realtime | Simplest structured |
| conv_folder | FolderAssignment | near-realtime | |
| conv_app_select | SuggestedAppsSelection | near-realtime | |
| chat_extraction | Entities JSON | near-realtime | |
| memory_conflict | Action + merged | background | |
| memories | Facts[] | background | 4.6% of OAI spend |
| knowledge_graph | Nodes[] + edges[] | background | 2.0% of OAI spend |
| goals | Goal objects | background | |
| trends | Items[] | background | 1.2% of OAI spend |
| external_structure | Structured JSON | near-realtime | |
| daily_summary_simple | Stats dict | background | |
| app_integration | Pydantic JSON | background | |

Also migrate smart_glasses to gemini-2.5-flash (not lite) — needs vision/multimodal input.

Prerequisite: Pydantic → Gemini responseSchema converter with validation layer. Fallback to OpenAI on malformed response.

Risk: Medium. Gemini's JSON mode is less strict than OpenAI's Pydantic enforcement. Need retry/fallback logic.
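A minimal stdlib sketch of the adapter and fallback, assuming the input is the dict produced by Pydantic's `model_json_schema()`. The exact set of keywords Gemini's responseSchema rejects should be verified against current Gemini docs (the list below is an assumption), and a production adapter would inline `$ref`/`$defs` rather than drop them:

```python
import json

# JSON Schema keywords assumed unsupported by Gemini's responseSchema.
# NOTE: a real adapter should inline $ref/$defs for nested models, not drop them.
UNSUPPORTED_KEYS = {"title", "default", "additionalProperties", "$defs", "$ref"}

def to_gemini_schema(schema):
    """Recursively strip keywords Gemini's responseSchema is assumed to reject."""
    if isinstance(schema, dict):
        return {k: to_gemini_schema(v) for k, v in schema.items()
                if k not in UNSUPPORTED_KEYS}
    if isinstance(schema, list):
        return [to_gemini_schema(v) for v in schema]
    return schema

def parse_or_fallback(raw: str, required: list[str], openai_call):
    """Validate Gemini's JSON output; re-run the feature on OpenAI if malformed."""
    try:
        data = json.loads(raw)
        if all(k in data for k in required):
            return data
    except json.JSONDecodeError:
        pass
    return openai_call()  # fallback path: existing OpenAI structured-output call
```

In production the `required`-keys check would be replaced by full Pydantic validation (`Model.model_validate_json(raw)`), so the same model class drives both the schema and the validation layer.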

Phase 4 — User-Facing Streaming → Gemini 2.5 Pro (high risk, high reward)

~8% of OpenAI spend. Savings: 50% input, 33% output.

| Feature | Current | Proposed | Notes |
| --- | --- | --- | --- |
| chat_responses | gpt-5.2 | gemini-2.5-pro | User-facing streaming, quality-critical |
| goals_advice | gpt-5.2 | gemini-2.5-pro | Near-realtime streaming |
| app_generator | gpt-5.2 | gemini-2.5-pro | Pydantic JSON, code gen |
| notifications | gpt-5.2 | gemini-2.5-flash | Short text, background |

Risk: High for chat_responses. Must A/B test with user satisfaction metrics.
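Stable A/B assignment can be done with hash bucketing; the 10% rollout fraction and experiment name below are illustrative assumptions:

```python
import hashlib

def ab_bucket(user_id: str, experiment: str = "chat_gemini_pro",
              pct: int = 10) -> str:
    """Deterministic arm assignment: the same user always gets the same model,
    so satisfaction metrics can be compared per-arm over time."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "gemini-2.5-pro" if int(digest, 16) % 100 < pct else "gpt-5.2"
```

Hashing `experiment:user_id` (rather than `user_id` alone) re-shuffles buckets per experiment, avoiding users who are permanently in every treatment group.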

Phase 5 — Dead OpenRouter Replacement + Cleanup

| Feature | Dead Model | Replacement |
| --- | --- | --- |
| persona_chat | google/gemini-flash-1.5-8b (404) | gpt-4.1-nano ($0.10/$0.40) or gemini-2.5-flash-lite |
| persona_chat_premium | anthropic/claude-3.5-sonnet (404) | claude-sonnet-4-6 (direct Anthropic API) |

Risk: Low. These are already broken.

Not Migrated (keep on current provider)

| Feature | Model | Reason |
| --- | --- | --- |
| chat_agent | claude-sonnet-4-6 | Anthropic-native tool_use with 24+ tools; no cross-provider equivalent |
| learnings | o4-mini | Dedicated reasoning model; no Gemini equivalent for chain-of-thought extraction |
| wrapped_analysis | gemini-3-flash-preview | Already on Gemini via OpenRouter |
| web_search | sonar-pro | Specialized search+citations API |
| chat_graph | gpt-4.1 | Tightly coupled to chat streaming |

Anthropic Optimization (separate track)

Anthropic is 56% of total LLM spend, mostly Sonnet 4.6 for RAG/chat_agent. Key opportunity: cache hit ratio optimization — cache writes dominate Anthropic costs. Improving cache hit ratio (currently write-heavy) could yield significant savings without model changes. Recommend separate issue for this.
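Rough cache economics, using Anthropic's published prompt-caching multipliers (cache writes cost 1.25x the base input price, cache reads 0.1x); the $3/1M Sonnet-class base price is an assumption for illustration:

```python
BASE = 3.00          # $/1M input tokens, Sonnet-class (assumption)
WRITE_MULT = 1.25    # Anthropic cache-write premium on base input price
READ_MULT = 0.10     # Anthropic cache-read discount on base input price

def effective_cost(hit_ratio: float) -> float:
    """Blended $/1M for cached-prompt tokens: misses pay write, hits pay read."""
    return BASE * ((1 - hit_ratio) * WRITE_MULT + hit_ratio * READ_MULT)

# Moving from a write-heavy 20% hit ratio to 80% cuts blended cached-token
# cost from $3.06 to $0.99 per 1M tokens, roughly a 68% reduction.
low, high = effective_cost(0.2), effective_cost(0.8)
```

This is why hit-ratio work can rival a model migration in impact: below a ~13% hit ratio, caching here actually costs more than uncached input.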

Engineering Prerequisites

  1. Gemini direct API client — add google-generativeai SDK to backend, bypass OpenRouter for new features
  2. Pydantic → responseSchema adapter — converts OpenAI Pydantic models to Gemini JSON Schema with validation/retry/fallback
  3. A/B evaluation framework — per-feature quality comparison, use existing model eval infrastructure
  4. Fallback routing — if Gemini returns malformed output or rate-limits, auto-retry on OpenAI
  5. Prompt caching migration — evaluate Gemini context caching ($4.50/1M/hr storage) vs OpenAI 24h caching for 8 cached features
  6. Cost monitoring — per-feature, per-provider cost tracking in BQ to measure actual savings
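Prerequisite 4 can be sketched as a generic wrapper; `gemini_call`, `openai_call`, and `validate` are placeholders for the existing provider clients:

```python
import time

def with_fallback(gemini_call, openai_call, validate, retries: int = 1):
    """Try Gemini first; on malformed output or a raised error (e.g. rate
    limit in the real client), retry with backoff, then fall back to OpenAI.
    Returns (output, provider) so cost tracking can attribute the call."""
    for attempt in range(retries + 1):
        try:
            out = gemini_call()
            if validate(out):
                return out, "gemini"
        except Exception:
            time.sleep(0.5 * (attempt + 1))  # simple linear backoff
    return openai_call(), "openai"
```

Returning the provider alongside the output feeds prerequisite 6 (per-provider cost tracking) with the model that actually served each request, not the one that was routed first.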

Estimated Impact

| Phase | OAI Spend % Affected | Per-Token Savings | Effort | Priority |
| --- | --- | --- | --- | --- |
| 1: gpt-5.1 → Gemini Pro | 57% | 37% input | Medium | Highest |
| 2: Free-text → Flash-Lite | ~5% | 75% | Low | Quick win |
| 3: Structured → Flash-Lite | ~36% | 75% | Medium | High |
| 4: Streaming → Gemini Pro | ~8% | 33-50% | Medium | Medium |
| 5: Dead OpenRouter fix | — | — | Low | Urgent (broken) |

Projected OpenAI cost reduction: 40-60% if Phases 1-3 complete successfully.
Anthropic costs (56% of total) addressed separately via cache optimization.

Migration Order Recommendation

Phase 5 (fix dead models) → Phase 2 (easy wins) → Phase 3 (structured adapter) → Phase 1 (big spend) → Phase 4 (user-facing)

Start with Phase 5 (already broken) and Phase 2 (zero risk), build the structured output adapter (Phase 3), then tackle the expensive gpt-5.1 features (Phase 1) with confidence. User-facing chat (Phase 4) last — needs most validation.

Metadata

Labels: enhancement (New feature or request), p3 (Priority: Backlog, score <14)