
LLM cost optimization: cross-provider migration to Gemini for 40-60% OpenAI savings #6873

@beastoin

Description


Problem

Omi's OpenAI spend is concentrated in two models: gpt-5.1 (57% of OpenAI spend) and gpt-4.1-mini (41%). Cross-provider alternatives, particularly Gemini 2.5 Pro and Gemini 2.5 Flash-Lite, offer 37-75% per-token savings at equivalent quality for most workloads. Two OpenRouter models are already dead (404).

Current spend distribution

  • OpenAI: ~44% of total LLM spend — gpt-5.1 (57% of OAI), gpt-4.1-mini (41% of OAI)
  • Anthropic: ~56% of total LLM spend — Sonnet 4.6 (74%), Opus 4.6 (24%)
  • OpenRouter: 3 features (2 dead models)
  • Perplexity: 1 feature (web search)

Top 6 features by OpenAI spend (= 88% of OpenAI cost)

  1. conversation_processing — 26.3%, gpt-5.1
  2. other (misc) — 26.6%, mixed mini+5.1
  3. conv_action_items — 13.0%, gpt-5.1
  4. conv_structure — 10.2%, gpt-5.1
  5. conv_apps — 8.7%, gpt-5.1
  6. memories — 4.6%, gpt-4.1-mini

Proposal: Cross-Provider Migration (5 Phases)

Phase 1 — gpt-5.1 → Gemini 2.5 Pro (highest impact)

~57% of OpenAI spend. Savings: 37% on input, ~same output.

These are the most expensive features: gemini-2.5-pro ($1.25 input / $10 output per 1M tokens) vs gpt-5.1 (~$2 / $10).

| Feature | Current | Proposed | Output Type | Blocker |
| --- | --- | --- | --- | --- |
| conversation_processing | gpt-5.1 | gemini-2.5-pro | Pydantic JSON | Structured outputs, prompt caching |
| conv_action_items | gpt-5.1 | gemini-2.5-pro | Pydantic JSON (ActionItemsExtraction) | Structured outputs, prompt caching |
| conv_structure | gpt-5.1 | gemini-2.5-pro | Pydantic JSON (Structured) | Structured outputs, prompt caching |
| conv_app_result | gpt-5.1 | gemini-2.5-pro | Free text | Prompt caching |
| daily_summary | gpt-5.1 | gemini-2.5-pro | Free text ≤50w | Prompt caching, highest cost/call |
| persona_clone | gpt-5.1 | gemini-2.5-pro | Multi-stage text | Prompt caching |

Prerequisites: a Pydantic → Gemini responseSchema adapter (blocker shared with Phase 3), and an evaluation of Gemini context caching as a replacement for OpenAI prompt caching.

Risk: Medium-High. These features process every conversation. A/B eval required.
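To make the 37% input figure concrete, a back-of-envelope sketch using the prices quoted above (the daily token volumes are illustrative assumptions, not measured Omi traffic):

```python
# Per-1M-token prices from this issue: (input $/1M, output $/1M).
PRICES = {
    "gpt-5.1": (2.00, 10.00),
    "gemini-2.5-pro": (1.25, 10.00),
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a workload at the per-1M-token prices above."""
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# Assumed workload: 10M input / 1M output tokens per day (illustrative).
before = cost("gpt-5.1", 10_000_000, 1_000_000)          # $30.00/day
after = cost("gemini-2.5-pro", 10_000_000, 1_000_000)    # $22.50/day
input_savings = 1 - PRICES["gemini-2.5-pro"][0] / PRICES["gpt-5.1"][0]

print(f"gpt-5.1: ${before:.2f}/day, gemini-2.5-pro: ${after:.2f}/day")
print(f"input savings: {input_savings:.1%}")  # 37.5% on input, output unchanged
```

Because output prices match, overall savings per feature scale with its input/output token ratio, which favors the long-context conversation features above.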

Phase 2 — Easy gpt-4.1-mini → Gemini 2.5 Flash-Lite (quick wins)

~5% of OpenAI spend, but minimal engineering risk. Savings: 75%.

Free-text features with no Pydantic dependency. gemini-2.5-flash-lite ($0.10/$0.40) vs gpt-4.1-mini ($0.40/$1.60).

| Feature | Output Type | Latency | Calls/day |
| --- | --- | --- | --- |
| memory_category | Text enum | background | part of memories (5.4K) |
| session_titles | Free text | near-realtime | low |
| followup | Free text | realtime | low |
| onboarding | Boolean | near-realtime | low |
| daily_summary | Free text ≤50w | background | ~850 |

Prerequisite: Add Gemini direct API client (google-generativeai SDK). Simple routing change.

Risk: Low. Simplest tasks in the system.
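The routing change could look like the sketch below; `FEATURE_MODELS`, `route`, and `generate` are hypothetical names, and the Gemini path uses the google-generativeai SDK named above:

```python
# Feature-level model registry (hypothetical name). Phase 2 free-text
# features route to Gemini 2.5 Flash-Lite; everything else is untouched.
FEATURE_MODELS = {
    "memory_category": ("gemini", "gemini-2.5-flash-lite"),
    "session_titles": ("gemini", "gemini-2.5-flash-lite"),
    "followup": ("gemini", "gemini-2.5-flash-lite"),
    "onboarding": ("gemini", "gemini-2.5-flash-lite"),
    "daily_summary": ("gemini", "gemini-2.5-flash-lite"),
}

def route(feature: str) -> tuple[str, str]:
    """Return (provider, model); default to the current OpenAI model."""
    return FEATURE_MODELS.get(feature, ("openai", "gpt-4.1-mini"))

def generate(feature: str, prompt: str) -> str:
    provider, model = route(feature)
    if provider == "gemini":
        import google.generativeai as genai  # pip install google-generativeai
        return genai.GenerativeModel(model).generate_content(prompt).text
    raise NotImplementedError("existing OpenAI path stays as-is")
```

Keeping the registry data-driven makes each later phase a one-line diff per feature, which also simplifies rollback.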

Phase 3 — Structured gpt-4.1-mini → Gemini 2.5 Flash-Lite (bulk migration)

~36% of OpenAI spend. Savings: 75%.

18 features use OpenAI Pydantic structured outputs. Build a provider-agnostic adapter, then migrate.

| Feature | Output Type | Latency | Notes |
| --- | --- | --- | --- |
| conv_discard | Boolean (Pydantic) | near-realtime | Simplest structured |
| conv_folder | FolderAssignment | near-realtime | |
| conv_app_select | SuggestedAppsSelection | near-realtime | |
| chat_extraction | Entities JSON | near-realtime | |
| memory_conflict | Action + merged | background | |
| memories | Facts[] | background | 4.6% of OAI spend |
| knowledge_graph | Nodes[] + edges[] | background | 2.0% of OAI spend |
| goals | Goal objects | background | |
| trends | Items[] | background | 1.2% of OAI spend |
| external_structure | Structured JSON | near-realtime | |
| daily_summary_simple | Stats dict | background | |
| app_integration | Pydantic JSON | background | |

Also migrate smart_glasses to gemini-2.5-flash (not lite) — needs vision/multimodal input.

Prerequisite: Pydantic → Gemini responseSchema converter with validation layer. Fallback to OpenAI on malformed response.

Risk: Medium. Gemini's JSON mode is less strict than OpenAI's Pydantic enforcement. Need retry/fallback logic.
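A minimal stdlib sketch of the adapter and fallback, assuming the input is the dict produced by Pydantic's `model_json_schema()`. The exact set of keywords Gemini's responseSchema rejects should be verified against current Gemini docs (the list below is an assumption), and a production adapter would inline `$ref`/`$defs` rather than drop them:

```python
import json

# JSON Schema keywords assumed unsupported by Gemini's responseSchema.
# NOTE: a real adapter should inline $ref/$defs for nested models, not drop them.
UNSUPPORTED_KEYS = {"title", "default", "additionalProperties", "$defs", "$ref"}

def to_gemini_schema(schema):
    """Recursively strip keywords Gemini's responseSchema is assumed to reject."""
    if isinstance(schema, dict):
        return {k: to_gemini_schema(v) for k, v in schema.items()
                if k not in UNSUPPORTED_KEYS}
    if isinstance(schema, list):
        return [to_gemini_schema(v) for v in schema]
    return schema

def parse_or_fallback(raw: str, required: list[str], openai_call):
    """Validate Gemini's JSON output; re-run the feature on OpenAI if malformed."""
    try:
        data = json.loads(raw)
        if all(k in data for k in required):
            return data
    except json.JSONDecodeError:
        pass
    return openai_call()  # fallback path: existing OpenAI structured-output call
```

In production the `required`-keys check would be replaced by full Pydantic validation (`Model.model_validate_json(raw)`), so the same model class drives both the schema and the validation layer.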

Phase 4 — User-Facing Streaming → Gemini 2.5 Pro (high risk, high reward)

~8% of OpenAI spend. Savings: 50% input, 33% output.

| Feature | Current | Proposed | Notes |
| --- | --- | --- | --- |
| chat_responses | gpt-5.2 | gemini-2.5-pro | User-facing streaming, quality-critical |
| goals_advice | gpt-5.2 | gemini-2.5-pro | Near-realtime streaming |
| app_generator | gpt-5.2 | gemini-2.5-pro | Pydantic JSON, code gen |
| notifications | gpt-5.2 | gemini-2.5-flash | Short text, background |

Risk: High for chat_responses. Must A/B test with user satisfaction metrics.
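Stable A/B assignment can be done with hash bucketing; the 10% rollout fraction and experiment name below are illustrative assumptions:

```python
import hashlib

def ab_bucket(user_id: str, experiment: str = "chat_gemini_pro",
              pct: int = 10) -> str:
    """Deterministic arm assignment: the same user always gets the same model,
    so satisfaction metrics can be compared per-arm over time."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "gemini-2.5-pro" if int(digest, 16) % 100 < pct else "gpt-5.2"
```

Hashing `experiment:user_id` (rather than `user_id` alone) re-shuffles buckets per experiment, avoiding users who are permanently in every treatment group.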

Phase 5 — Dead OpenRouter Replacement + Cleanup

| Feature | Dead Model | Replacement |
| --- | --- | --- |
| persona_chat | google/gemini-flash-1.5-8b (404) | gpt-4.1-nano ($0.10/$0.40) or gemini-2.5-flash-lite |
| persona_chat_premium | anthropic/claude-3.5-sonnet (404) | claude-sonnet-4-6 (direct Anthropic API) |

Risk: Low. These are already broken.

Not Migrated (keep on current provider)

| Feature | Model | Reason |
| --- | --- | --- |
| chat_agent | claude-sonnet-4-6 | Anthropic-native tool_use with 24+ tools; no cross-provider equivalent |
| learnings | o4-mini | Dedicated reasoning model; no Gemini equivalent for chain-of-thought extraction |
| wrapped_analysis | gemini-3-flash-preview | Already on Gemini via OpenRouter |
| web_search | sonar-pro | Specialized search+citations API |
| chat_graph | gpt-4.1 | Tightly coupled to chat streaming |

Anthropic Optimization (separate track)

Anthropic is 56% of total LLM spend, mostly Sonnet 4.6 for RAG/chat_agent. Key opportunity: cache hit ratio optimization — cache writes dominate Anthropic costs. Improving cache hit ratio (currently write-heavy) could yield significant savings without model changes. Recommend separate issue for this.
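Rough cache economics, using Anthropic's published prompt-caching multipliers (cache writes cost 1.25x the base input price, cache reads 0.1x); the $3/1M Sonnet-class base price is an assumption for illustration:

```python
BASE = 3.00          # $/1M input tokens, Sonnet-class (assumption)
WRITE_MULT = 1.25    # Anthropic cache-write premium on base input price
READ_MULT = 0.10     # Anthropic cache-read discount on base input price

def effective_cost(hit_ratio: float) -> float:
    """Blended $/1M for cached-prompt tokens: misses pay write, hits pay read."""
    return BASE * ((1 - hit_ratio) * WRITE_MULT + hit_ratio * READ_MULT)

# Moving from a write-heavy 20% hit ratio to 80% cuts blended cached-token
# cost from $3.06 to $0.99 per 1M tokens, roughly a 68% reduction.
low, high = effective_cost(0.2), effective_cost(0.8)
```

This is why hit-ratio work can rival a model migration in impact: below a ~13% hit ratio, caching here actually costs more than uncached input.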

Engineering Prerequisites

  1. Gemini direct API client — add google-generativeai SDK to backend, bypass OpenRouter for new features
  2. Pydantic → responseSchema adapter — converts OpenAI Pydantic models to Gemini JSON Schema with validation/retry/fallback
  3. A/B evaluation framework — per-feature quality comparison, use existing model eval infrastructure
  4. Fallback routing — if Gemini returns malformed output or rate-limits, auto-retry on OpenAI
  5. Prompt caching migration — evaluate Gemini context caching ($4.50/1M/hr storage) vs OpenAI 24h caching for 8 cached features
  6. Cost monitoring — per-feature, per-provider cost tracking in BQ to measure actual savings
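Prerequisite 4 can be sketched as a generic wrapper; `gemini_call`, `openai_call`, and `validate` are placeholders for the existing provider clients:

```python
import time

def with_fallback(gemini_call, openai_call, validate, retries: int = 1):
    """Try Gemini first; on malformed output or a raised error (e.g. rate
    limit in the real client), retry with backoff, then fall back to OpenAI.
    Returns (output, provider) so cost tracking can attribute the call."""
    for attempt in range(retries + 1):
        try:
            out = gemini_call()
            if validate(out):
                return out, "gemini"
        except Exception:
            time.sleep(0.5 * (attempt + 1))  # simple linear backoff
    return openai_call(), "openai"
```

Returning the provider alongside the output feeds prerequisite 6 (per-provider cost tracking) with the model that actually served each request, not the one that was routed first.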

Estimated Impact

| Phase | OAI Spend % Affected | Per-Token Savings | Effort | Priority |
| --- | --- | --- | --- | --- |
| 1: gpt-5.1 → Gemini Pro | 57% | 37% input | Medium | Highest |
| 2: Free-text → Flash-Lite | ~5% | 75% | Low | Quick win |
| 3: Structured → Flash-Lite | ~36% | 75% | Medium | High |
| 4: Streaming → Gemini Pro | ~8% | 33-50% | Medium | Medium |
| 5: Dead OpenRouter fix | — | — | Low | Urgent (broken) |

Projected OpenAI cost reduction: 40-60% if Phases 1-3 complete successfully.
Anthropic costs (56% of total) addressed separately via cache optimization.

Migration Order Recommendation

Phase 5 (fix dead models) → Phase 2 (easy wins) → Phase 3 (structured adapter) → Phase 1 (big spend) → Phase 4 (user-facing)

Start with Phase 5 (already broken) and Phase 2 (zero risk), build the structured output adapter (Phase 3), then tackle the expensive gpt-5.1 features (Phase 1) with confidence. User-facing chat (Phase 4) last — needs most validation.

Metadata

Labels: enhancement (New feature or request), p3 (Priority: Backlog, score <14)