Update news.md

chauncygu · web-flow · commit f790261614f3 · 2026-05-10T11:23:21.000-07:00
diff --git a/docs/news.md b/docs/news.md
@@ -2,8 +2,9 @@
  
 ## 🔥🔥🔥 News (Pacific Time)
 
-
-- May 10, 2026 (latest): **Small-context local models survive large workloads — 4-part fix: ctx cap, auto-fanout, stagnation-stop, output paths under `~/.cheetahclaws/`.** Repro that motivated the work: running `/agent → 1 (Research Assistant)` on a 6.6 MB PDF (`AutoRedTeamer.pdf` — ~70k tokens of extracted text) with `custom/qwen2.5-72b` (32k ctx). Old behavior: 400 BadRequest "context length 32768"; the agent_runner kept polling the template every 2 s; the model produced **1500+ identical "task complete" summaries** before anything stopped it. New behavior, four cooperating layers: (1) **Per-model context-window registry + dynamic max_tokens cap** (`providers._MODEL_CONTEXT_LIMITS` + `get_model_context_window` + `dynamic_cap_max_tokens`) — covers Qwen 2.5/3, Llama 3.x, Mistral/Mixtral, Phi, Gemma, DeepSeek local variants; `_fetch_custom_model_limit` now backfills `PROVIDERS["custom"]["context_limit"]` so compaction sees the live `/v1/models` value; per-call shrink based on actual prompt size keeps `input + output + 1024 safety ≤ ctx`. `compaction.get_context_limit` gains an optional `config` arg so custom-endpoint detection works on the very first turn. (2) **Auto-fanout for oversize tool outputs** (`multi_agent/fanout.py`) — when a single tool result (Read on a huge PDF, Grep over a giant tree, WebFetch of a long article) exceeds 0.4 × ctx_window, split into chunks at paragraph boundaries with token-overlap, dispatch parallel sub-LLM map calls (one per chunk, default cap 5 subagents), merge with a single reduce call; substitutes the merged summary in conversation history instead of letting the next API call overflow. Hooked at the tool-result append site in `agent.py`; transparent UX prints `[Auto-fanout: <Tool> returned ~N chars (>threshold) → dispatching K parallel sub-summaries]`. Configurable: `auto_fanout_enabled` / `_threshold` / `_max_subagents` / `_chunk_overlap_tokens`. 
(3) **Stagnation-stop in `agent_runner.py`** — when the model emits the same summary N iterations in a row (default 3, whitespace/case-normalized), stop the loop with a clear notification instead of burning thousands of API calls; configurable via `auto_agent_dup_summary_limit` (0 disables). (4) **Agent output paths under `~/.cheetahclaws/`** — `/agent` wizard now resolves relative output filenames (e.g. `research_notes.md`) to absolute paths under `~/.cheetahclaws/agents/<name>/output/` instead of CWD; `AgentRunner` exposes `runner.output_dir`, eagerly mkdir'd; Summary block + post-start info show the resolved path in green; absolute paths pass through unchanged. **Tests:** +47 new (fanout 23, ctx cap 18, dup-stop 13, output paths 8). **Full suite: 2139 passing, zero regressions.** User-side guide: [`docs/guides/extensions.md`](guides/extensions.md).
+- May 10, 2026 (latest): **Web Chat UI fixes — slash commands no longer reply twice; `--web --model X` actually applies the model.** Two related issues that surfaced when wiring a self-hosted vLLM endpoint into the Chat UI. (1) **Issue #111 — slash commands duplicated in Chat UI but not in terminal.** `web/api.py:handle_slash_sync` was both returning events inline in the HTTP response **and** broadcasting the same events to the WS subscribers of the same client; `chat.js` then iterated `data.events` AND fired `_handleEvent` from `ws.onmessage`, rendering every reply twice. Same bug in `handle_slash_stream` for SSE-streamed long commands (`/brainstorm`, `/worker`, `/agent`, `/plan`). Both helpers now deliver events through a single channel — HTTP/SSE only — so `_handleEvent` runs exactly once per event. Background-thread events (sentinel flows, agent runs) are unaffected: by the time the worker thread emits, `_broadcast` is already restored to the live WS broadcaster in `finally`. (2) **`--web --model X` was silently ignored.** The CLI override branch only ran in the interactive-REPL path; the `if args.web:` branch loaded config straight from disk and started the server, so `python cheetahclaws.py --web --model custom/qwen2.5-72b` would happily boot but every request handler reloaded `~/.cheetahclaws/config.json` with the previous model name (e.g. `gemma-4-31B-it`), producing a confusing `404: model does not exist` against the new endpoint. Fix: `cheetahclaws.py` now persists `args.model` to config before calling `start_web_server`, matching the documented behavior; `provider:model` → `provider/model` normalization is identical to the REPL path. User-side guide: [`docs/guides/web-ui.md`](guides/web-ui.md) (Troubleshooting + Architecture notes updated).
+  
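The `--web --model X` fix in the entry above can be pictured with a minimal sketch. It is not the code in `cheetahclaws.py`: the helper names, the `"model"` config key, and the exact normalization rule (rewrite the first `:` only when no `/` is present) are all assumptions made for illustration.

```python
import json
from pathlib import Path


def normalize_model_name(name: str) -> str:
    """Turn 'provider:model' into 'provider/model'.

    Assumed rule: names that already contain '/' are left alone, so a
    tag-style colon after the provider prefix is not rewritten.
    """
    if "/" in name or ":" not in name:
        return name
    provider, model = name.split(":", 1)
    return f"{provider}/{model}"


def persist_model_override(config_path: Path, model: str) -> dict:
    """Write the CLI model override to the on-disk config and return it.

    The "model" key is hypothetical; the real config schema may differ.
    """
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    config["model"] = normalize_model_name(model)
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2))
    return config
```

Because request handlers reload the config from disk, writing the override before the server starts is what lets every later reload see the new model name.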
+- May 10, 2026: **Small-context local models survive large workloads — 4-part fix: ctx cap, auto-fanout, stagnation-stop, output paths under `~/.cheetahclaws/`.** Repro that motivated the work: running `/agent → 1 (Research Assistant)` on a 6.6 MB PDF (`AutoRedTeamer.pdf` — ~70k tokens of extracted text) with `custom/qwen2.5-72b` (32k ctx). Old behavior: 400 BadRequest "context length 32768"; the agent_runner kept polling the template every 2 s; the model produced **1500+ identical "task complete" summaries** before anything stopped it. New behavior, four cooperating layers: (1) **Per-model context-window registry + dynamic max_tokens cap** (`providers._MODEL_CONTEXT_LIMITS` + `get_model_context_window` + `dynamic_cap_max_tokens`) — covers Qwen 2.5/3, Llama 3.x, Mistral/Mixtral, Phi, Gemma, DeepSeek local variants; `_fetch_custom_model_limit` now backfills `PROVIDERS["custom"]["context_limit"]` so compaction sees the live `/v1/models` value; per-call shrink based on actual prompt size keeps `input + output + 1024 safety ≤ ctx`. `compaction.get_context_limit` gains an optional `config` arg so custom-endpoint detection works on the very first turn. (2) **Auto-fanout for oversize tool outputs** (`multi_agent/fanout.py`) — when a single tool result (Read on a huge PDF, Grep over a giant tree, WebFetch of a long article) exceeds 0.4 × ctx_window, split into chunks at paragraph boundaries with token-overlap, dispatch parallel sub-LLM map calls (one per chunk, default cap 5 subagents), merge with a single reduce call; substitutes the merged summary in conversation history instead of letting the next API call overflow. Hooked at the tool-result append site in `agent.py`; transparent UX prints `[Auto-fanout: <Tool> returned ~N chars (>threshold) → dispatching K parallel sub-summaries]`. Configurable: `auto_fanout_enabled` / `_threshold` / `_max_subagents` / `_chunk_overlap_tokens`.
(3) **Stagnation-stop in `agent_runner.py`** — when the model emits the same summary N iterations in a row (default 3, whitespace/case-normalized), stop the loop with a clear notification instead of burning thousands of API calls; configurable via `auto_agent_dup_summary_limit` (0 disables). (4) **Agent output paths under `~/.cheetahclaws/`** — `/agent` wizard now resolves relative output filenames (e.g. `research_notes.md`) to absolute paths under `~/.cheetahclaws/agents/<name>/output/` instead of CWD; `AgentRunner` exposes `runner.output_dir`, eagerly mkdir'd; Summary block + post-start info show the resolved path in green; absolute paths pass through unchanged. **Tests:** +47 new (fanout 23, ctx cap 18, dup-stop 13, output paths 8). **Full suite: 2139 passing, zero regressions.** User-side guide: [`docs/guides/extensions.md`](guides/extensions.md).
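Layers (1) and (3) of the entry above are small enough to sketch. The following illustrates the two ideas under stated assumptions and is not the code in `providers.py` or `agent_runner.py`: the function and class names are hypothetical, returning 0 when no room is left is an assumed policy, and the detector assumes "N in a row" means the loop stops on the N-th identical summary.

```python
from typing import Optional


def cap_max_tokens(prompt_tokens: int, requested_max_tokens: int,
                   ctx_window: int, safety_margin: int = 1024) -> int:
    """Shrink max_tokens so prompt + output + safety margin fits in ctx_window.

    Returns 0 when the prompt alone leaves no room, which a caller could
    treat as a signal to compact before sending the request.
    """
    room = ctx_window - prompt_tokens - safety_margin
    return max(0, min(requested_max_tokens, room))


def normalize_summary(text: str) -> str:
    """Collapse whitespace and lowercase so near-identical summaries match."""
    return " ".join(text.split()).lower()


class DuplicateSummaryDetector:
    """Reports when the same normalized summary repeats `limit` times in a row."""

    def __init__(self, limit: int = 3) -> None:
        self.limit = limit  # 0 disables the check
        self._last: Optional[str] = None
        self._streak = 0

    def should_stop(self, summary: str) -> bool:
        if self.limit <= 0:
            return False
        key = normalize_summary(summary)
        if key == self._last:
            self._streak += 1
        else:
            self._last = key
            self._streak = 1
        return self._streak >= self.limit
```

With a 32768-token window and a 28000-token prompt, a requested 8192 output tokens would be cut to 3744 so the total, including the 1024-token margin, stays within the window.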
 
 - May 9, 2026: **Read tool auto-redirects on overflow — defense-in-depth for the case where model ignores the template instruction.** Re-running the same `/agent + autodan.pdf` failure showed two real-world problems with the prior fix: (1) The user was running the **pip-installed** binary (`/home/shangdinggu/anaconda3/bin/cheetahclaws`), not the source tree. New tools / templates added to source had no effect. (2) Even if the user reinstalled, qwen2.5-72b would likely still call `Read` instead of `SummarizeLargeFile` — models default to familiar tools no matter what the template says. The fix moves the routing decision into the Read tool itself. (a) **New `_maybe_redirect_to_summarize` helper (`tools/files.py`).** When `Read` or `ReadPDF` would return content too large to safely fit in the next API call, it instead returns a **short redirect message** like `[ReadTooLarge: file is too large — call SummarizeLargeFile with file_path='X' instead] PREVIEW: …`. The model sees the redirect, calls `SummarizeLargeFile`, gets a chunked-and-merged summary back. The raw content never enters the API call. (b) **CJK-aware token estimation.** CJK content tokenizes at ~1 token per character (vs ~2.8 chars/token for English). New `_is_cjk_heavy()` heuristic: ≥20% CJK characters → use 1:1 char-to-token estimate. A 24K-char Chinese file is 24K tokens, not 8.6K, and now triggers redirect on a 32K-context model. (c) **Conservative ceiling for unreliable provider declarations.** `custom/<model>` provider declares 128K context by default but the underlying model is often 32K (qwen2.5-72b, llama 3 8B, etc.). New `safe_ctx = min(declared_ctx, 30000)` caps the threshold at 30K tokens regardless of provider claims — the redirect now fires on the user's exact ~25K-token PDF case (would NOT have fired with the unconditional 128K ceiling, which is exactly the bug). 
(d) **Wrapped Read registration (`tools/__init__.py`).** New `_read_with_overflow_check` lambda calls `_maybe_redirect_to_summarize` after `_read` returns; for results <8KB it skips (not worth the check). ReadPDF gets the same treatment inline in `_read_pdf`. **Why this works even on the old install**: as soon as the user updates `tools/files.py` and `tools/__init__.py`, the redirect fires regardless of whether SummarizeLargeFile / template changes are present. The redirect's prose tells the model exactly which tool to call and with what args. Tests: 14 new pytest cases (`tests/test_read_overflow_redirect.py`) — CJK detection (English / Chinese / Japanese / mixed-minority / empty), threshold logic (small file → no redirect; user's exact failure case → redirect with right pointer; CJK at lower char count triggers vs same chars in English; conservative ceiling protects against overconfident provider; preview included for context). Plus 2 integration tests via `execute_tool("Read", ...)` confirming the wrapper applies the redirect end-to-end. **2077 targeted regression tests pass** (2063 prior + 14 new), zero regressions across the whole repo.
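Points (b) and (c) of the entry above reduce to a little arithmetic, sketched here under assumptions. This is not the code in `tools/files.py`: the function names are hypothetical, and the Unicode ranges used to detect CJK (unified ideographs, Extension A, hiragana, katakana) are an illustrative choice, since the entry does not say which ranges the real heuristic checks.

```python
def is_cjk_char(ch: str) -> bool:
    """True for characters in a few common CJK blocks (assumed ranges)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF   # CJK Extension A
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
    )


def is_cjk_heavy(text: str, threshold: float = 0.20) -> bool:
    """True when at least `threshold` of the characters are CJK."""
    if not text:
        return False
    cjk = sum(1 for ch in text if is_cjk_char(ch))
    return cjk / len(text) >= threshold


def estimate_tokens(text: str) -> int:
    """About 1 token per char for CJK-heavy text, else about 2.8 chars per token."""
    if is_cjk_heavy(text):
        return len(text)
    return int(len(text) / 2.8)


def safe_context(declared_ctx: int, ceiling: int = 30000) -> int:
    """Cap the declared context window so overconfident providers are not trusted."""
    return min(declared_ctx, ceiling)
```

Under this sketch a 24,000-character Chinese file estimates to 24,000 tokens while the same length of English estimates to 8,571, and a provider that declares 128K is treated as 30K when the threshold is computed.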