Summary
A cluster of small, individually low-severity correctness issues in the LLM layer, grouped into one cleanup issue. Each is independently fixable.
1. Pricing lookup matches the wrong (shorter) model first
src/core/pricing/pricing_data.py:85-87 — the last-resort substring loop returns the first insertion-order entry where either name contains the other. gpt-4.1 precedes gpt-4.1-mini, so gpt-4.1-mini-2025-04-14 matches gpt-4.1 and is priced ~5x too high; same class of bug for gemini-2.5-flash-lite-* → gemini-2.5-flash. Fix: prefer the longest matching key.
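A minimal sketch of the longest-match fix, not the code in pricing_data.py. The table contents, the prices (placeholders, not real rates), and the helper name `find_pricing_key` are assumptions for illustration.

```python
from typing import Dict, Optional

# Placeholder table: the keys mirror the models named above, the prices are
# made-up values chosen only so that the two tiers differ.
PRICING: Dict[str, float] = {
    "gpt-4.1": 5.0,
    "gpt-4.1-mini": 1.0,
    "gemini-2.5-flash": 3.0,
    "gemini-2.5-flash-lite": 1.0,
}


def find_pricing_key(model: str, table: Dict[str, float]) -> Optional[str]:
    """Return the longest key that contains, or is contained in, `model`."""
    candidates = [key for key in table if key in model or model in key]
    # max(..., key=len) picks the most specific entry regardless of the
    # insertion order of the table; default=None covers the no-match case.
    return max(candidates, key=len, default=None)
```

With this ordering `gpt-4.1-mini-2025-04-14` resolves to `gpt-4.1-mini` even though `gpt-4.1` was inserted first. One caveat: when the requested name is shorter than every key (the `model in key` direction), longest-first picks the most specific variant, which may not be the intended default.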
2. Ollama streaming error bodies can never be read → context-overflow 400s misclassified
src/core/llm/providers/ollama.py:530-558 — raise_for_status() runs inside client.stream(...) before the body is read, so e.response.json() raises httpx.ResponseNotRead, swallowed by a bare except. The keyword check ("context", "length", …) never matches, so Ollama's num_ctx-overflow 400 (payload sets "truncate": False) is not raised as ContextOverflowError and the translator can't grow context / shrink chunks. Fix: await e.response.aread() before parsing.
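A hedged sketch of the read-before-parse fix. `read_error_text`, `looks_like_context_overflow`, and `FakeStreamedResponse` are hypothetical names; the fake class only stands in for a streamed `httpx.Response` so the sketch runs without a server. The keyword tuple contains just the two keywords quoted above, the real list is longer, and the error message used in the demo is invented.

```python
import asyncio
import json
from typing import Any

# Only the two keywords quoted in the issue; the provider's real list is longer.
OVERFLOW_KEYWORDS = ("context", "length")


async def read_error_text(response: Any) -> str:
    """Load the body of a streamed error response, then extract its message.

    `response` is expected to behave like httpx.Response: aread() loads the
    body, after which json() and .text are safe to call.
    """
    await response.aread()  # the missing step: read the body before parsing
    try:
        payload = response.json()
    except ValueError:  # body is not JSON; fall back to the raw text
        return response.text
    if isinstance(payload, dict):
        return str(payload.get("error", payload))
    return str(payload)


def looks_like_context_overflow(status_code: int, message: str) -> bool:
    """Keyword check on a 400 body (assumes any single keyword is enough)."""
    lowered = message.lower()
    return status_code == 400 and any(k in lowered for k in OVERFLOW_KEYWORDS)


class FakeStreamedResponse:
    """Stand-in for a streamed response: parsing fails until aread() is called."""

    def __init__(self, status_code: int, body: str) -> None:
        self.status_code = status_code
        self._body = body
        self._read = False

    async def aread(self) -> bytes:
        self._read = True
        return self._body.encode()

    def _require_read(self) -> None:
        if not self._read:
            raise RuntimeError("body not read yet")

    @property
    def text(self) -> str:
        self._require_read()
        return self._body

    def json(self) -> Any:
        self._require_read()
        return json.loads(self._body)


if __name__ == "__main__":
    # Invented error body, for demonstration only.
    fake = FakeStreamedResponse(400, '{"error": "input exceeds context length"}')
    message = asyncio.run(read_error_text(fake))
    print(looks_like_context_overflow(fake.status_code, message))
```

In the provider this would sit in the `except httpx.HTTPStatusError as e:` handler, called with `e.response`, and the bare `except` could then be narrowed to `ValueError`.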
3. Gemini response parsing reads only parts[0]
src/core/llm/providers/gemini.py:227-229 — Gemini can return multiple parts (thought + text, or long split responses); taking parts[0].get("text") drops the rest. Fix: join all text parts.
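One way the join could look, assuming the usual `candidates[0].content.parts` response shape. The helper name `extract_text` and the `include_thoughts` flag are hypothetical, and skipping parts flagged with `"thought": true` is an assumption about the desired behavior, not something the current code does.

```python
from typing import Any, Dict, List


def extract_text(response: Dict[str, Any], include_thoughts: bool = False) -> str:
    """Concatenate the text of every part in the first candidate."""
    candidates = response.get("candidates") or []
    if not candidates:
        return ""
    parts = (candidates[0].get("content") or {}).get("parts") or []
    texts: List[str] = []
    for part in parts:
        # Assumption: thought parts carry a truthy "thought" flag and should
        # stay out of the user-visible text unless explicitly requested.
        if part.get("thought") and not include_thoughts:
            continue
        text = part.get("text")
        if text:  # ignore parts without text (e.g. non-text payloads)
            texts.append(text)
    return "".join(texts)
```

Parts are joined without a separator on the assumption that a split response continues exactly where the previous part stopped.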
4. OllamaProvider.get_model_context_size() references a never-created attribute
src/core/llm/providers/ollama.py:661 — uses self._context_detector, never assigned in __init__; raises AttributeError, swallowed into a "failed gracefully" warning. Currently dead (no caller) but broken. Fix: create the detector or remove the method.
5. Repetition-loop threshold branch is unreachable
src/core/llm/thinking/detection.py:60-63 — elif phrase_len >= 40 can never run because phrase_len >= 20 is checked first; the strongest loop signal never gets the lenient threshold. Fix: order the branches longest-first.
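A sketch of the longest-first ordering. The function name `min_repeats_for_loop` and the repeat counts are placeholders; only the 20 and 40 length cut-offs come from the code described above.

```python
def min_repeats_for_loop(phrase_len: int) -> int:
    """Return how many repeats of a phrase count as a loop (placeholder values)."""
    # Check the longest bucket first so it is reachable: a value >= 40 also
    # satisfies >= 20, so the reverse order shadows this branch.
    if phrase_len >= 40:
        return 3   # long phrases: few repeats are already a strong signal
    elif phrase_len >= 20:
        return 5
    return 10      # short phrases repeat naturally, so require more evidence
```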
6. LiteLLM provider's KeyPool integration is dead
src/core/llm/providers/litellm.py:80-89,125 — _build_kwargs() uses peek() once before the retry loop, never acquire()/mark_throttled(), so multi-key LiteLLM never rotates on RateLimitError. Fix: rotate keys in the retry loop, or drop the pool wiring if unsupported.
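A rough sketch of rotating keys inside the retry loop. Everything here is a stand-in: the `KeyPool` below is a toy with assumed `acquire()` and `mark_throttled()` signatures, not the project's class; `RateLimitError` is a local placeholder for the exception the provider catches; `call_with_rotation` and `send` are hypothetical names.

```python
import time
from typing import Callable, Dict, List, Optional


class RateLimitError(Exception):
    """Local placeholder for the provider's rate-limit exception."""


class KeyPool:
    """Toy round-robin pool that skips keys still in cooldown."""

    def __init__(self, keys: List[str], cooldown: float = 60.0) -> None:
        self._keys = list(keys)
        self._cooldown = cooldown
        self._throttled_until: Dict[str, float] = {k: 0.0 for k in self._keys}
        self._next = 0

    def acquire(self) -> Optional[str]:
        """Return the next key that is not throttled, or None if all are."""
        now = time.monotonic()
        for offset in range(len(self._keys)):
            idx = (self._next + offset) % len(self._keys)
            key = self._keys[idx]
            if self._throttled_until[key] <= now:
                self._next = (idx + 1) % len(self._keys)
                return key
        return None

    def mark_throttled(self, key: str) -> None:
        """Put `key` into cooldown so acquire() skips it for a while."""
        self._throttled_until[key] = time.monotonic() + self._cooldown


def call_with_rotation(pool: KeyPool, send: Callable[[str], str],
                       max_attempts: int = 3) -> str:
    """Call `send(api_key)`, switching to another key after a rate limit."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        key = pool.acquire()  # pick a key per attempt, not once up front
        if key is None:
            break  # every key is cooling down
        try:
            return send(key)
        except RateLimitError as exc:
            pool.mark_throttled(key)
            last_error = exc
    if last_error is not None:
        raise last_error
    raise RuntimeError("no API key available")
```

The key difference from a single `peek()` before the loop is that the key is chosen per attempt, so a throttled key is not reused on the next try.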
7. Thinking cache stores monotonic loop time as a persistent timestamp
src/core/llm/thinking/cache.py:131-137 — tested_at uses loop.time() (monotonic, resets per process) then persists it as if wall-clock. Currently never read, but misleading. Fix: use time.time().
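A small sketch of what a wall-clock timestamp buys. Only the `tested_at` field name comes from the cache; the other fields and the helpers `make_cache_entry` and `is_stale` are hypothetical.

```python
import json
import time
from typing import Any, Dict


def make_cache_entry(model: str, supports_thinking: bool) -> Dict[str, Any]:
    """Build a cache record stamped with wall-clock time (Unix epoch seconds)."""
    return {
        "model": model,
        "supports_thinking": supports_thinking,
        # time.time() stays comparable across processes and restarts, unlike
        # loop.time(), whose reference point is arbitrary per process.
        "tested_at": time.time(),
    }


def is_stale(entry: Dict[str, Any], max_age_seconds: float) -> bool:
    """Return True if the entry was recorded more than `max_age_seconds` ago."""
    return time.time() - entry["tested_at"] > max_age_seconds


if __name__ == "__main__":
    entry = make_cache_entry("example-model", True)
    restored = json.loads(json.dumps(entry))  # simulate persisting and reloading
    print(is_stale(restored, max_age_seconds=3600))
```

A staleness check like `is_stale` only works after a reload if the stored value is wall-clock time, which is why the monotonic value is misleading even while nothing reads it.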
8. 408 classified as non-retryable
src/core/llm/rate_limit_handler.py:25-37 — is_retryable_http_status treats 408 (Request Timeout) as non-retryable; it's conventionally transient. Fix: add 408 to the retryable set.
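A sketch of the change, assuming the function reduces to a membership test. Apart from 408, the members of the set are an assumption about what the handler already treats as transient, not a copy of rate_limit_handler.py.

```python
# Assumed retryable set; only the addition of 408 is the actual proposal.
RETRYABLE_STATUSES = frozenset({
    408,  # Request Timeout: the server gave up waiting, usually transient
    429,  # Too Many Requests
    500,
    502,
    503,
    504,
})


def is_retryable_http_status(status: int) -> bool:
    """Return True for statuses that are worth retrying after a backoff."""
    return status in RETRYABLE_STATUSES
```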
Found during the June 2026 repo audit. Severity: low (each). Confidence: certain except #2 (likely).