LLM layer: minor correctness cleanup (pricing match, Ollama stream errors, Gemini parts, 408, etc.) #231

Description

@hydropix

Summary

A cluster of small, individually low-severity correctness issues in the LLM layer, grouped into one cleanup issue. Each is independently fixable.

1. Pricing lookup matches the wrong (shorter) model first

src/core/pricing/pricing_data.py:85-87: the last-resort substring loop returns the first entry, in insertion order, where either name contains the other. Because gpt-4.1 precedes gpt-4.1-mini, gpt-4.1-mini-2025-04-14 matches gpt-4.1 and is priced ~5x too high. gemini-2.5-flash-lite-* models hit the same class of bug by matching gemini-2.5-flash. Fix: prefer the longest matching key.
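
Below is a minimal sketch of the longest-match fix. It assumes a flat dict keyed by lowercase model name. PRICING and find_pricing are hypothetical names, and the prices are illustrative placeholders, not the repo's data.

Keys that appear inside the model name are ranked by length, so the most specific key wins. The reverse direction (the model name appearing inside a key) falls back to the shortest such key. That fallback is an added assumption, not part of the fix described above.

```python
from typing import Optional

# Illustrative table: hypothetical prices per 1M tokens, not the repo's data.
# Insertion order reproduces the bug (each shorter key precedes its longer variant).
PRICING: dict[str, dict[str, float]] = {
    "gpt-4.1": {"input": 2.00, "output": 8.00},
    "gpt-4.1-mini": {"input": 0.40, "output": 1.60},
    "gemini-2.5-flash": {"input": 0.30, "output": 2.50},
    "gemini-2.5-flash-lite": {"input": 0.10, "output": 0.40},
}


def find_pricing(model: str) -> Optional[dict[str, float]]:
    """Return the pricing entry for `model`, preferring the most specific key."""
    name = model.lower()
    if not name:
        return None
    if name in PRICING:  # an exact match always wins
        return PRICING[name]
    # Keys that appear inside the model name: the longest is the most specific,
    # so "gpt-4.1-mini" beats "gpt-4.1" for "gpt-4.1-mini-2025-04-14".
    contained = [key for key in PRICING if key in name]
    if contained:
        return PRICING[max(contained, key=len)]
    # Model name that appears inside a key (e.g. a short alias): take the
    # shortest such key, i.e. the closest one.
    containing = [key for key in PRICING if name in key]
    if containing:
        return PRICING[min(containing, key=len)]
    return None
```

Ranking candidates by length makes the result independent of dict insertion order. New entries can then be added anywhere in the table.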

2. Ollama streaming error bodies can never be read → context-overflow 400s misclassified

src/core/llm/providers/ollama.py:530-558: raise_for_status() runs inside client.stream(...) before the body has been read. As a result, e.response.json() raises httpx.ResponseNotRead, and a bare except swallows it. The keyword check ("context", "length", …) therefore never matches. Ollama's num_ctx-overflow 400 (the payload sets "truncate": False) is never raised as ContextOverflowError, so the translator cannot grow the context or shrink chunks. Fix: await e.response.aread() before parsing.

3. Gemini response parsing reads only parts[0]

src/core/llm/providers/gemini.py:227-229: Gemini can return multiple parts (a thought part plus text, or a long response split across parts). Taking parts[0].get("text") drops the rest. Fix: join all text parts.
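
Below is a sketch of joining every text part. It assumes the REST generateContent response shape: candidates[0].content.parts[*].text. extract_text is a hypothetical helper name.

Dropping thought parts (those flagged "thought": true) by default is an assumption about the desired output. Parts are joined with an empty string because split responses continue mid-text.

```python
from typing import Any


def extract_text(response_json: dict[str, Any], include_thoughts: bool = False) -> str:
    """Concatenate the text of every part in the first candidate."""
    candidates = response_json.get("candidates") or []
    if not candidates:
        return ""
    parts = (candidates[0].get("content") or {}).get("parts") or []
    texts = [
        part["text"]
        for part in parts
        # Skip non-text parts; skip thought parts unless explicitly requested.
        if "text" in part and (include_thoughts or not part.get("thought"))
    ]
    return "".join(texts)


response = {
    "candidates": [{
        "content": {
            "parts": [
                {"text": "Plan: translate literally.", "thought": True},
                {"text": "Bonjour "},
                {"text": "le monde."},
            ]
        }
    }]
}
print(extract_text(response))  # Bonjour le monde.
```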

4. OllamaProvider.get_model_context_size() references a never-created attribute

src/core/llm/providers/ollama.py:661: the method uses self._context_detector, which is never assigned in __init__. The call therefore raises AttributeError, which is swallowed into a "failed gracefully" warning. The method is currently dead (no callers) but broken. Fix: create the detector or remove the method.
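
If the method is kept, one fix is to assign the attribute unconditionally in __init__. The sketch below is hypothetical throughout:
- ModelContextDetector and its detect() method are stand-ins.
- The constructor signature is simplified.
- The context size in the usage example is illustrative.

The real detector may well be async or query Ollama directly.

```python
from typing import Optional


class ModelContextDetector:
    """Hypothetical stand-in for the project's context-size detector."""

    def __init__(self, known_sizes: Optional[dict[str, int]] = None):
        self._known_sizes = dict(known_sizes or {})

    def detect(self, model: str) -> Optional[int]:
        return self._known_sizes.get(model)


class OllamaProvider:
    """Simplified provider showing only the attribute wiring."""

    def __init__(self, model: str, context_detector: Optional[ModelContextDetector] = None):
        self.model = model
        # Always assign the attribute so get_model_context_size() can rely on it.
        if context_detector is None:
            context_detector = ModelContextDetector()
        self._context_detector = context_detector

    def get_model_context_size(self) -> Optional[int]:
        """Return the detected context size, or None if it is unknown."""
        return self._context_detector.detect(self.model)


provider = OllamaProvider("llama3.1:8b", ModelContextDetector({"llama3.1:8b": 8192}))
print(provider.get_model_context_size())  # 8192
```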

5. Repetition-loop threshold branch is unreachable

src/core/llm/thinking/detection.py:60-63: the elif phrase_len >= 40 branch can never run. phrase_len >= 20 is checked first and already matches every phrase_len of 40 or more, so the strongest loop signal never gets the lenient threshold. Fix: order the branches longest-first.
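
One way to make the ordering bug impossible to reintroduce is to keep the thresholds in a table. The table is always walked from the largest minimum length down.

The tiers and repeat counts below are hypothetical, not the values in detection.py. Reading "lenient" as "fewer repeats required" is an interpretation. The table is deliberately written out of order to show that the walk does not depend on it.

```python
# (minimum phrase_len, repeats required before flagging a loop).
# Hypothetical values: longer repeated phrases are a stronger loop signal,
# so they get the most lenient (lowest) repeat threshold.
THRESHOLD_TIERS: list[tuple[int, int]] = [(20, 5), (40, 3), (0, 10)]


def repetition_threshold(phrase_len: int) -> int:
    """Return how many repeats of a phrase of this length count as a loop."""
    # Test tiers from the largest minimum length down, so a broad tier such
    # as ">= 20" can never shadow a more specific ">= 40" tier, whatever
    # order the table is written in.
    for min_len, repeats in sorted(THRESHOLD_TIERS, reverse=True):
        if phrase_len >= min_len:
            return repeats
    raise ValueError(f"phrase_len must be non-negative, got {phrase_len}")
```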

6. LiteLLM provider's KeyPool integration is dead

src/core/llm/providers/litellm.py:80-89,125: _build_kwargs() calls peek() once before the retry loop and never calls acquire() or mark_throttled(). As a result, a multi-key LiteLLM setup never rotates keys on RateLimitError. Fix: rotate keys inside the retry loop, or drop the pool wiring if multiple keys are unsupported.
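
Below is a sketch of rotating keys inside the retry loop. It uses the acquire() and mark_throttled() names from the issue, but several pieces are stand-ins:
- The KeyPool class is a minimal hypothetical implementation; its round-robin order and cooldown are assumptions.
- RateLimitError is a local stand-in for litellm.RateLimitError, so the example runs without LiteLLM installed.
- The sketch is synchronous and omits backoff.

In the provider, `call` would wrap the LiteLLM completion call with api_key=key, and the existing backoff would be kept.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Local stand-in for litellm.RateLimitError."""


class KeyPool:
    """Minimal hypothetical key pool: round-robin with a per-key cooldown."""

    def __init__(self, keys: list[str], cooldown_s: float = 60.0):
        if not keys:
            raise ValueError("KeyPool needs at least one key")
        self._keys = list(keys)
        self._cooldown_s = cooldown_s
        self._throttled_until: dict[str, float] = {}
        self._next = 0

    def acquire(self) -> str:
        """Return the next key that is not cooling down."""
        now = time.monotonic()
        for _ in range(len(self._keys)):
            key = self._keys[self._next]
            self._next = (self._next + 1) % len(self._keys)
            if self._throttled_until.get(key, float("-inf")) <= now:
                return key
        raise RateLimitError("all API keys are throttled")

    def mark_throttled(self, key: str) -> None:
        """Park `key` until its cooldown expires."""
        self._throttled_until[key] = time.monotonic() + self._cooldown_s


def call_with_rotation(pool: KeyPool, call: Callable[[str], T], max_attempts: int = 3) -> T:
    """Retry `call`, taking a fresh key from the pool on every attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    last_error: Optional[RateLimitError] = None
    for _ in range(max_attempts):
        key = pool.acquire()  # rotate per attempt instead of one peek() up front
        try:
            return call(key)
        except RateLimitError as exc:
            pool.mark_throttled(key)  # acquire() will skip this key for a while
            last_error = exc
    raise last_error
```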

7. Thinking cache stores monotonic loop time as a persistent timestamp

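src/core/llm/thinking/cache.py:131-137: tested_at is set from loop.time() and then persisted as if it were a wall-clock timestamp. loop.time() is a monotonic clock whose reference point is arbitrary and differs between processes. The field is currently never read, but it is misleading. Fix: use time.time().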
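
Below is a sketch of persisting a wall-clock timestamp instead. ThinkingCacheEntry, make_entry and is_stale are hypothetical names, and the model name is illustrative.

tested_at is never read today. The is_stale helper therefore only illustrates why an epoch-based value is the one that survives a round trip through the cache file.

```python
import json
import time
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ThinkingCacheEntry:
    """Hypothetical shape of one persisted thinking-capability record."""
    model: str
    supports_thinking: bool
    tested_at: float  # seconds since the Unix epoch (wall clock)


def make_entry(model: str, supports_thinking: bool) -> ThinkingCacheEntry:
    # time.time() is comparable across processes and restarts; asyncio's
    # loop.time() is a monotonic clock with an arbitrary reference point.
    return ThinkingCacheEntry(model, supports_thinking, tested_at=time.time())


def is_stale(entry: ThinkingCacheEntry, max_age_s: float, now: Optional[float] = None) -> bool:
    """Example reader (not in the codebase): is the entry older than max_age_s?"""
    current = time.time() if now is None else now
    return current - entry.tested_at > max_age_s


entry = make_entry("qwen3:8b", True)
serialized = json.dumps(asdict(entry))  # what would be written to the cache file
restored = ThinkingCacheEntry(**json.loads(serialized))
print(datetime.fromtimestamp(restored.tested_at, tz=timezone.utc).isoformat())
```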

8. 408 classified as non-retryable

src/core/llm/rate_limit_handler.py:25-37: is_retryable_http_status treats 408 (Request Timeout) as non-retryable, although it is conventionally transient. RFC 9110 explicitly allows the client to repeat the request. Fix: add 408 to the retryable set.
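
Below is a sketch of the retryable set with 408 added. Apart from 408, the members (429 and common transient 5xx codes) are an assumption about what rate_limit_handler.py already treats as retryable. http.HTTPStatus is used only to name the codes; they are stored as plain ints.

```python
from http import HTTPStatus

# 408 added; the other members are assumed, not copied from rate_limit_handler.py.
_RETRYABLE = (
    HTTPStatus.REQUEST_TIMEOUT,        # 408: RFC 9110 lets the client repeat the request
    HTTPStatus.TOO_MANY_REQUESTS,      # 429
    HTTPStatus.INTERNAL_SERVER_ERROR,  # 500
    HTTPStatus.BAD_GATEWAY,            # 502
    HTTPStatus.SERVICE_UNAVAILABLE,    # 503
    HTTPStatus.GATEWAY_TIMEOUT,        # 504
)
RETRYABLE_HTTP_STATUSES = frozenset(int(status) for status in _RETRYABLE)


def is_retryable_http_status(status: int) -> bool:
    """True if a request that failed with `status` is worth retrying."""
    return status in RETRYABLE_HTTP_STATUSES


print([s for s in (400, 408, 429, 503) if is_retryable_http_status(s)])  # [408, 429, 503]
```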


Found during the June 2026 repo audit. Severity: low (each). Confidence: certain except #2 (likely).

Metadata

Assignees: No one assigned
Labels: P3 (Low: cleanup / tech-debt), audit-2026-06 (Found during the June 2026 repo audit), bug (Something isn't working), tech-debt (Dead code, duplication, architecture cleanup)
Projects: No projects
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests
