## Problem
Omi's LLM costs are concentrated on two OpenAI models: gpt-5.1 (57% of OpenAI spend) and gpt-4.1-mini (41%). Cross-provider alternatives — particularly Gemini 2.5 Pro and Flash-Lite — offer 37-75% savings per token at equivalent quality for most workloads. Two OpenRouter models are already dead (404).
## Current spend distribution

- OpenAI: ~44% of total LLM spend — `gpt-5.1` (57% of OAI), `gpt-4.1-mini` (41% of OAI)
- Anthropic: ~56% of total LLM spend — Sonnet 4.6 (74%), Opus 4.6 (24%)
- OpenRouter: 3 features (2 dead models)
- Perplexity: 1 feature (web search)
## Top 6 features by OpenAI spend (≈89% of OpenAI cost)

- other (misc) — 26.6%, mixed `gpt-4.1-mini` + `gpt-5.1`
- `conversation_processing` — 26.3%, `gpt-5.1`
- `conv_action_items` — 13.0%, `gpt-5.1`
- `conv_structure` — 10.2%, `gpt-5.1`
- `conv_apps` — 8.7%, `gpt-5.1`
- `memories` — 4.6%, `gpt-4.1-mini`
## Proposal: Cross-Provider Migration (5 Phases)

### Phase 1 — gpt-5.1 → Gemini 2.5 Pro (highest impact)
~57% of OpenAI spend. Savings: ~37% on input, roughly break-even on output.
These are the most expensive features: `gemini-2.5-pro` ($1.25 input / $10 output per 1M tokens) vs `gpt-5.1` (~$2 / $10 per 1M).
| Feature | Current | Proposed | Output Type | Blocker |
|---|---|---|---|---|
| `conversation_processing` | gpt-5.1 | gemini-2.5-pro | Pydantic JSON | Structured outputs, prompt caching |
| `conv_action_items` | gpt-5.1 | gemini-2.5-pro | Pydantic JSON (ActionItemsExtraction) | Structured outputs, prompt caching |
| `conv_structure` | gpt-5.1 | gemini-2.5-pro | Pydantic JSON (Structured) | Structured outputs, prompt caching |
| `conv_app_result` | gpt-5.1 | gemini-2.5-pro | Free text | Prompt caching |
| `daily_summary` | gpt-5.1 | gemini-2.5-pro | Free text ≤50 words | Prompt caching, highest cost/call |
| `persona_clone` | gpt-5.1 | gemini-2.5-pro | Multi-stage text | Prompt caching |
Prerequisite: a Pydantic → Gemini `responseSchema` adapter (shared with the Phase 3 blocker). For prompt caching, evaluate Gemini context caching.
Risk: Medium-High. These features process every conversation; an A/B eval is required.
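The Phase 1 savings can be sanity-checked with per-token arithmetic using the prices quoted above (assumed list prices; verify against the current rate cards). The 37% figure is input-only; blended savings depend on the input/output mix:

```python
# Per-1M-token cost comparison for Phase 1, using the prices quoted above
# (assumptions, not confirmed rate-card values).
GPT_5_1 = {"input": 2.00, "output": 10.00}         # $/1M tokens (approx.)
GEMINI_2_5_PRO = {"input": 1.25, "output": 10.00}  # $/1M tokens

def blended_cost(prices: dict, input_mtok: float, output_mtok: float) -> float:
    """Dollar cost for a workload measured in millions of tokens."""
    return prices["input"] * input_mtok + prices["output"] * output_mtok

# Illustrative workload: 10M input tokens, 1M output tokens per day.
before = blended_cost(GPT_5_1, 10, 1)        # 2.00*10 + 10.00*1 = 30.0
after = blended_cost(GEMINI_2_5_PRO, 10, 1)  # 1.25*10 + 10.00*1 = 22.5
savings = 1 - after / before                 # 25% blended on this mix
```

Because output prices match, the blended saving (here 25%) is always lower than the 37% input-only saving; input-heavy features like `conversation_processing` benefit most.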
### Phase 2 — Easy gpt-4.1-mini → Gemini 2.5 Flash-Lite (quick wins)

~5% of OpenAI spend, but minimal engineering risk. Savings: 75%.
Free-text features with no Pydantic dependency: `gemini-2.5-flash-lite` ($0.10 / $0.40 per 1M) vs `gpt-4.1-mini` ($0.40 / $1.60 per 1M).
| Feature | Output Type | Latency | Calls/day |
|---|---|---|---|
| `memory_category` | Text enum | background | part of `memories` (5.4K) |
| `session_titles` | Free text | near-realtime | low |
| `followup` | Free text | realtime | low |
| `onboarding` | Boolean | near-realtime | low |
| `daily_summary` | Free text ≤50 words | background | ~850 |
Prerequisite: Add Gemini direct API client (google-generativeai SDK). Simple routing change.
Risk: Low. Simplest tasks in the system.
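The "simple routing change" can be as small as a per-feature model map. A hypothetical sketch — `MODEL_ROUTES` and `resolve_model` are illustrative names, not from the codebase; only the model IDs come from the plan:

```python
# Hypothetical per-feature routing table for Phase 2. The dispatch mechanism
# is an assumption; only the model names are taken from this proposal.
MODEL_ROUTES = {
    "memory_category": "gemini-2.5-flash-lite",
    "session_titles": "gemini-2.5-flash-lite",
    "followup": "gemini-2.5-flash-lite",
    "onboarding": "gemini-2.5-flash-lite",
    "daily_summary": "gemini-2.5-flash-lite",
}

def resolve_model(feature: str, default: str = "gpt-4.1-mini") -> str:
    """Return the routed model for a feature; unmigrated features keep the default."""
    return MODEL_ROUTES.get(feature, default)
```

Keeping the table in one place makes each phase a data change rather than a code change, and makes rollback a one-line revert per feature.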
### Phase 3 — Structured gpt-4.1-mini → Gemini 2.5 Flash-Lite (bulk migration)

~36% of OpenAI spend. Savings: 75%.
18 features use OpenAI Pydantic structured outputs. Build a provider-agnostic adapter, then migrate.
| Feature | Output Type | Latency | Notes |
|---|---|---|---|
| `conv_discard` | Boolean (Pydantic) | near-realtime | Simplest structured |
| `conv_folder` | FolderAssignment | near-realtime | |
| `conv_app_select` | SuggestedAppsSelection | near-realtime | |
| `chat_extraction` | Entities JSON | near-realtime | |
| `memory_conflict` | Action + merged | background | |
| `memories` | Facts[] | background | 4.6% of OAI spend |
| `knowledge_graph` | Nodes[] + edges[] | background | 2.0% of OAI spend |
| `goals` | Goal objects | background | |
| `trends` | Items[] | background | 1.2% of OAI spend |
| `external_structure` | Structured JSON | near-realtime | |
| `daily_summary_simple` | Stats dict | background | |
| `app_integration` | Pydantic JSON | background | |
Also migrate `smart_glasses` to `gemini-2.5-flash` (not Lite) — it needs vision/multimodal input.
Prerequisite: a Pydantic → Gemini `responseSchema` converter with a validation layer, falling back to OpenAI on malformed responses.
Risk: Medium. Gemini's JSON mode is less strict than OpenAI's Pydantic enforcement. Need retry/fallback logic.
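A minimal sketch of the two halves of that adapter, under stated assumptions: Gemini's `responseSchema` accepts only a subset of JSON Schema, so keys such as `title`, `default`, and `additionalProperties` (which Pydantic's `model_json_schema()` emits) are stripped — verify the exact unsupported set against the current Gemini API docs. The function names are hypothetical:

```python
import json

# Assumption: these JSON Schema keys are not accepted by Gemini's
# responseSchema; confirm against current API docs before relying on this.
UNSUPPORTED_KEYS = {"title", "default", "additionalProperties"}

def to_gemini_schema(schema: dict) -> dict:
    """Recursively drop schema keys Gemini is assumed not to accept.
    Naive sketch: a *field* literally named 'title' would also be dropped;
    a real adapter should only strip keys at schema (not property-name) level."""
    cleaned = {}
    for key, value in schema.items():
        if key in UNSUPPORTED_KEYS:
            continue
        if isinstance(value, dict):
            cleaned[key] = to_gemini_schema(value)
        elif isinstance(value, list):
            cleaned[key] = [to_gemini_schema(v) if isinstance(v, dict) else v
                            for v in value]
        else:
            cleaned[key] = value
    return cleaned

def parse_or_fallback(raw: str, validate, fallback):
    """Parse and validate Gemini's JSON output; on failure, call the OpenAI fallback."""
    try:
        return validate(json.loads(raw))
    except (json.JSONDecodeError, KeyError, ValueError):
        return fallback()
```

`validate` would wrap the existing Pydantic model's validation, so the rest of the pipeline keeps consuming the same typed objects regardless of provider.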
### Phase 4 — User-Facing Streaming → Gemini 2.5 Pro (high risk, high reward)

~8% of OpenAI spend. Savings: 50% input, 33% output.
| Feature | Current | Proposed | Notes |
|---|---|---|---|
| `chat_responses` | gpt-5.2 | gemini-2.5-pro | User-facing streaming, quality-critical |
| `goals_advice` | gpt-5.2 | gemini-2.5-pro | Near-realtime streaming |
| `app_generator` | gpt-5.2 | gemini-2.5-pro | Pydantic JSON, code gen |
| `notifications` | gpt-5.2 | gemini-2.5-flash | Short text, background |
Risk: High for chat_responses. Must A/B test with user satisfaction metrics.
### Phase 5 — Dead OpenRouter Replacement + Cleanup
| Feature | Dead Model | Replacement |
|---|---|---|
| `persona_chat` | google/gemini-flash-1.5-8b (404) | gpt-4.1-nano ($0.10/$0.40) or gemini-2.5-flash-lite |
| `persona_chat_premium` | anthropic/claude-3.5-sonnet (404) | claude-sonnet-4-6 (direct Anthropic API) |
Risk: Low. These are already broken.
## Not Migrated (keep on current provider)
| Feature | Model | Reason |
|---|---|---|
| `chat_agent` | claude-sonnet-4-6 | Anthropic-native tool_use with 24+ tools; no cross-provider equivalent |
| `learnings` | o4-mini | Dedicated reasoning model; no Gemini equivalent for chain-of-thought extraction |
| `wrapped_analysis` | gemini-3-flash-preview | Already on Gemini via OpenRouter |
| `web_search` | sonar-pro | Specialized search + citations API |
| `chat_graph` | gpt-4.1 | Tightly coupled to chat streaming |
## Anthropic Optimization (separate track)

Anthropic is 56% of total LLM spend, mostly Sonnet 4.6 for RAG/`chat_agent`. The key opportunity is cache hit ratio: cache writes currently dominate Anthropic costs, so improving the hit ratio (traffic today is write-heavy) could yield significant savings without any model changes. Recommend a separate issue for this.
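To see why the hit ratio matters, a back-of-envelope model — assuming Anthropic's published prompt-caching multipliers of roughly 1.25× the base input price for cache writes and 0.1× for cache reads, and a Sonnet-class base input price of $3/1M for illustration (verify both against current pricing):

```python
# Back-of-envelope for the cache-hit-ratio opportunity. The 1.25x/0.1x
# multipliers and the $3/1M base price are assumptions to be verified.
WRITE_MULT, READ_MULT = 1.25, 0.10

def cached_input_cost(base_price: float, tokens_mtok: float, hit_ratio: float) -> float:
    """Input cost when `hit_ratio` of cached tokens are reads, the rest writes."""
    writes = tokens_mtok * (1 - hit_ratio) * WRITE_MULT
    reads = tokens_mtok * hit_ratio * READ_MULT
    return base_price * (writes + reads)

# 100M cached input tokens: write-heavy (20% hits) vs read-heavy (80% hits).
write_heavy = cached_input_cost(3.0, 100, 0.2)  # $306
read_heavy = cached_input_cost(3.0, 100, 0.8)   # $99
```

Under these assumptions, moving from a 20% to an 80% hit ratio cuts cached-input cost by roughly two-thirds, with no model change at all.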
## Engineering Prerequisites

- Gemini direct API client — add the `google-generativeai` SDK to the backend; bypass OpenRouter for new features
- Pydantic → `responseSchema` adapter — converts OpenAI Pydantic models to Gemini JSON Schema, with validation/retry/fallback
- A/B evaluation framework — per-feature quality comparison, reusing the existing model-eval infrastructure
- Fallback routing — if Gemini returns malformed output or rate-limits, auto-retry on OpenAI
- Prompt caching migration — evaluate Gemini context caching ($4.50/1M tokens/hour storage) vs OpenAI's 24h caching for the 8 cached features
- Cost monitoring — per-feature, per-provider cost tracking in BigQuery to measure actual savings
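The fallback-routing prerequisite can be sketched as a small wrapper with injected provider callables — a hypothetical shape (none of these names come from the actual codebase):

```python
# Hypothetical sketch of fallback routing: try the primary (Gemini) provider
# with a bounded number of attempts, then fall back to OpenAI.
def call_with_fallback(prompt, gemini_call, openai_call, retries=1):
    """Attempt gemini_call up to retries+1 times; on persistent failure,
    return openai_call's result instead."""
    for _ in range(retries + 1):
        try:
            return gemini_call(prompt)
        except Exception:  # malformed output, rate limit, timeout, ...
            continue
    return openai_call(prompt)
```

In production the `except` clause should match specific SDK error types and log each fallback, since fallback frequency is itself a migration-quality signal worth tracking in the cost-monitoring pipeline.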
## Estimated Impact

| Phase | OAI Spend % Affected | Per-Token Savings | Effort | Priority |
|---|---|---|---|---|
| 1: gpt-5.1 → Gemini Pro | 57% | 37% input | Medium | Highest |
| 2: Free-text → Flash-Lite | ~5% | 75% | Low | Quick win |
| 3: Structured → Flash-Lite | ~36% | 75% | Medium | High |
| 4: Streaming → Gemini Pro | ~8% | 33-50% | Medium | Medium |
| 5: Dead OpenRouter fix | — | — | Low | Urgent (broken) |
Projected OpenAI cost reduction: 40-60% if Phases 1-3 complete successfully.
Anthropic costs (56% of total) addressed separately via cache optimization.
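The 40-60% band is consistent with a simple weighted sum of the table's own numbers. Input-side only and ignoring the output-token mix, so this is illustrative rather than a forecast:

```python
# Sanity check on the projected 40-60% OpenAI reduction, using the phase
# weights and per-token savings from the Estimated Impact table above.
phases = [
    (0.57, 0.37),  # Phase 1: 57% of OAI spend, ~37% input savings
    (0.05, 0.75),  # Phase 2: ~5% of spend, 75% savings
    (0.36, 0.75),  # Phase 3: ~36% of spend, 75% savings
]
projected = sum(share * saving for share, saving in phases)
# ~0.21 + ~0.04 + 0.27 -> roughly half of OpenAI spend
```

The midpoint lands near 50%, which is why partial Phase 1 quality regressions (forcing some features back to gpt-5.1) still leave the outcome inside the stated band.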
## Migration Order Recommendation

Phase 5 (fix dead models) → Phase 2 (easy wins) → Phase 3 (structured adapter) → Phase 1 (big spend) → Phase 4 (user-facing)

Start with Phase 5 (already broken) and Phase 2 (lowest risk), build the structured-output adapter (Phase 3), then tackle the expensive gpt-5.1 features (Phase 1) with confidence. User-facing chat (Phase 4) goes last — it needs the most validation.