fix(cli): scale compaction pruning by model budget #9557
marius-kilocode wants to merge 3 commits into main from
Conversation
```ts
    }),
  }))
}
// kilocode_change end
```
why not keep this whole section in a separate file, and import it to keep the merge simpler?
markijbema left a comment:
It seems quite reasonable; how did you test this? Compacting too much or too little can both be impactful.
@markijbema I don't think it reliably works yet. Waiting for @chrarnoldus's opinion.
```ts
  ? { type: "text" as const, text: `[Attached ${part.mime}: ${part.filename ?? "file"}]` }
  : part
// kilocode_change start - shrink replayed overflow content before auto-continuing
const cleaned = cap ? sanitize({ part, budget: cap }) : part
```
**WARNING: Replay truncation uses the compaction model budget**

`cap` is computed from `model`, which can resolve to the hidden compaction agent's model. But this replayed turn is re-enqueued with `original.model` and sent on the next real request using that model. If the compaction agent is configured with a larger context window than the user's model, this sanitization can still leave the replay too large and immediately overflow again.
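A minimal sketch of the direction this comment suggests, with hypothetical `getModel`/`sanitize` helpers and an illustrative overflow ratio (none of this is the PR's actual API): derive the cap from the model the replay will actually be sent with, i.e. `original.model`.

```ts
interface ModelRef { providerID: string; modelID: string }
interface ModelInfo { limit: { context: number; output: number } }

declare function getModel(ref: ModelRef): Promise<ModelInfo>
declare function sanitize(args: { part: unknown; budget: number }): unknown

async function sanitizeReplay(part: unknown, original: { model: ModelRef }) {
  // Use original.model: the model the next real request will run with,
  // which may have a smaller context window than the compaction agent's.
  const replayModel = await getModel(original.model)
  const cap = Math.floor(replayModel.limit.context * 0.05) // illustrative ratio
  return cap > 0 ? sanitize({ part, budget: cap }) : part
}
```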
**Code Review Summary** — Status: 3 Issues Found | Recommendation: Address before merge
Reviewed by gpt-5.4-20260305 · 1,515,012 tokens
```ts
const BUDGET_NORMAL_RATIO = 0.2
const BUDGET_OVERFLOW_RATIO = 0.05
const BUDGET_PROMPT_RATIO = 0.1
const BUDGET_NORMAL_MIN = 8_000
```
**WARNING: The minimum budgets still overshoot small-window models**

`budget()` is meant to scale by model capacity, but these floors force `normal >= 8_000` and `overflow >= 2_000` even when `usable` is smaller than that. On 4k/8k-context models the compaction path can still keep more tool/text content than the model can fit, so overflow recovery can recurse instead of reliably making the summary request fit.
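A sketch of the clamp this comment implies. The ratios and `BUDGET_NORMAL_MIN` come from the diff above; `BUDGET_OVERFLOW_MIN = 2_000` is inferred from the comment, and the shape of `budget()` is assumed:

```ts
const BUDGET_OVERFLOW_MIN = 2_000 // inferred from the comment; not shown in the diff

function budgets(usable: number) {
  // Apply each floor, then clamp to `usable` so a 4k/8k-context model never
  // gets a pruning budget larger than the window it actually has.
  const normal = Math.min(usable, Math.max(BUDGET_NORMAL_MIN, usable * BUDGET_NORMAL_RATIO))
  const overflow = Math.min(usable, Math.max(BUDGET_OVERFLOW_MIN, usable * BUDGET_OVERFLOW_RATIO))
  return { normal: Math.floor(normal), overflow: Math.floor(overflow) }
}
```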
```ts
// kilocode_change start - scale protected tool-output window with the active model
const last = msgs.findLast((msg) => msg.info.role === "user")
const model = last?.info.role === "user"
  ? yield* provider.getModel(last.info.model.providerID, last.info.model.modelID)
  : undefined
```
**WARNING: Pruning now silently stops when the session model is no longer available**

`provider.getModel(...)` throws for deleted or renamed models. `prompt.ts` forks `compaction.prune(...).pipe(Effect.ignore)`, so this turns background pruning into a silent no-op and old sessions keep their large tool outputs indefinitely. Falling back to `PRUNE_PROTECT`/`PRUNE_MINIMUM` when lookup fails would preserve the previous behavior.
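A sketch of the suggested fallback, using Effect's `Effect.option` to turn a failed lookup into a recoverable absence instead of a failure that the upstream `Effect.ignore` swallows. The `provider` shape and the scaling factor are assumptions; `40_000` is the legacy fixed window the PR description mentions:

```ts
import { Effect, Option } from "effect"

declare const provider: {
  getModel: (providerID: string, modelID: string) => Effect.Effect<{ limit: { context: number } }, Error>
}
declare const last: { info: { model: { providerID: string; modelID: string } } }
const PRUNE_PROTECT = 40_000 // the legacy fixed tool-output window

const protectWindow = Effect.gen(function* () {
  const model = yield* provider
    .getModel(last.info.model.providerID, last.info.model.modelID)
    .pipe(Effect.option) // deleted/renamed model -> Option.none, not a thrown failure
  return Option.match(model, {
    onNone: () => PRUNE_PROTECT, // preserve the previous fixed behavior
    onSome: (m) => Math.floor(m.limit.context * 0.2), // model-scaled window (illustrative)
  })
})
```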
**Code Review Summary** — Status: 2 Issues Found (1 CRITICAL, 1 WARNING) | Recommendation: Address before merge
Reviewed by gpt-5.4-20260305 · 1,857,009 tokens
Is this a workaround for Vercel's upload limit? `FUNCTION_PAYLOAD_TOO_LARGE` looks like a Vercel error.
Summary
Design Decisions
The immediate failure mode was that auto-compaction could be triggered only after a session was already too large, and the compaction request itself then also exceeded the model limit. The previous background pruning path kept a fixed `40_000` estimated tokens of recent tool output and only ran after a turn completed, which made it both too late and too model-insensitive for overflow recovery.

The old extension handled this class of problem through context management rather than tool-output pruning. In `kilocode-legacy`, the effective cap was `allowedTokens = contextWindow * 0.9 - maxTokens`: reserve the model's output plus a 10% context-window buffer before condensing or sliding-window truncating. On hard context-window errors it forced a more aggressive reduction path using `FORCED_CONTEXT_REDUCTION_PERCENT = 75`. That worked because the old extension controlled the whole conversation window before the request was sent.

This PR keeps the same design principle but adapts it to current OpenCode's architecture. Current compaction has two distinct phases: background tool-output pruning and overflow summary recovery. For models with only a context limit, the new helper mirrors the legacy shape by reserving `maxOutputTokens` plus a 10% prompt/context overhead buffer. For models with a separate input limit, it preserves the existing `compaction.reserved` behavior while still reserving a 10% overhead buffer. That overhead is intentionally conservative because the final request can include system prompts, MCP/tool schemas, AGENTS instructions, reminders, and plugin-provided compaction context that are not represented by old tool outputs alone.

Normal pruning and overflow recovery now use separate budgets, as sketched below. Normal pruning preserves more recent tool context for quality, scaled to the model's usable budget. Overflow recovery is deliberately stricter, because the priority is making the compaction call fit at all. The overflow shrink step is in-memory only, so stored session history remains intact and can still be shown, imported, or revisited later.
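A minimal sketch of that derivation. The type and helper names are illustrative; only the legacy formula and the 10% overhead come from the description above, and the ratios/floors from the constants reviewed earlier:

```ts
interface ModelLimits {
  context: number // total context window, tokens
  input?: number // separate input limit, when the model has one
  output: number // maxOutputTokens reserved for the response
}

const OVERHEAD_RATIO = 0.1 // system prompts, MCP/tool schemas, reminders, plugins

function usableBudget(limits: ModelLimits): number {
  // Context-only models mirror the legacy shape
  // (allowedTokens = contextWindow * 0.9 - maxTokens); models with a separate
  // input limit keep the existing compaction.reserved behavior plus the buffer.
  const base = limits.input ?? limits.context - limits.output
  return Math.floor(base * (1 - OVERHEAD_RATIO))
}

// Separate budgets: normal pruning keeps more recent tool context for
// quality; overflow recovery is stricter so the summary request fits at all.
function pruneBudgets(limits: ModelLimits) {
  const usable = usableBudget(limits)
  return {
    normal: Math.max(8_000, Math.floor(usable * 0.2)),
    overflow: Math.max(2_000, Math.floor(usable * 0.05)),
  }
}
```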
Upstream Comparison
This PR covers the same failure class as upstream OpenCode issues #15849 and #17340: the original request overflows, then the recovery compaction request also overflows because it still includes too much context.

Upstream #14707 is already present in our codebase. It makes 413/context-overflow errors trigger auto-compaction, strips media during compaction, and stops when compaction itself overflows. That is necessary but not sufficient for this bug, because non-media context can still be too large: long tool outputs, synthetic text, repeated tool/question loops, system/reminder/plugin context, and MCP/tool schemas can exceed the model even after media stripping.

Upstream #20718 is the closest proposed fix. It pre-prunes overflow compaction input by keeping the most recent 40 messages, truncating completed tool outputs to 500 chars, and truncating synthetic text parts to 2,000 chars. This PR adopts that same placement in the pipeline: shrink the overflow compaction input before the summary model call, without mutating persisted session history.

The main difference from #20718 is that this PR makes the limits model-aware instead of fixed. The old fixed 40-message/500-char/2,000-char strategy helps, but it treats 32k, 128k, 200k, and 1M-context models the same. This PR derives the recovery budget from the active model's input/context limits, reserves output budget plus 10% prompt/context overhead, and then derives normal-prune and overflow-recovery budgets from that usable space. This is meant to preserve more useful context on large models while being more aggressive on smaller ones, as the rough comparison below illustrates.
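For contrast, a sketch of the two strategies side by side. The fixed numbers come from #20718; the chars-per-token factor and the budget split are assumptions of this illustration, not this PR's actual code:

```ts
// Upstream #20718: fixed limits regardless of model size.
const UPSTREAM_FIXED = { keepMessages: 40, toolOutputChars: 500, syntheticTextChars: 2_000 }

const CHARS_PER_TOKEN = 4 // rough heuristic, assumed for this sketch

// Model-aware: spend a token budget derived from the active model, so a
// 1M-context model keeps far more recent content than a 32k one.
function overflowCharLimits(overflowBudgetTokens: number) {
  const chars = overflowBudgetTokens * CHARS_PER_TOKEN
  return {
    toolOutputChars: Math.floor(chars * 0.5), // split between part kinds is illustrative
    textPartChars: Math.floor(chars * 0.25),
  }
}
```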
Upstream #20516 is broader hardening: circuit breaker, retries, output caps, post-compact budget accounting, and attachment-restoration filtering. This PR does not attempt to solve all of that; it is a focused input-side fix for the unrecoverable overflow-compaction request. It should compose with later breaker/retry/output-cap work, but it intentionally avoids adding a larger compaction policy system in this patch.

In short: #14707 lets OpenCode attempt recovery, #20718 shows the right input-side recovery point, #20516 proposes broader loop/output hardening, and this PR adapts the input-side recovery fix to Kilo with model-aware budgets and legacy-style context headroom.

Validation
- `bun test test/session/compaction.test.ts`
- `bun run typecheck`
- `bun run script/check-opencode-annotations.ts`

Local Reproduction Note
After testing against the local GPT 5.2 sessions in this worktree, the remaining `FUNCTION_PAYLOAD_TOO_LARGE` case was not dominated by tool output. The failing sessions contained repeated ordinary user text parts of around 400k characters each, totaling multiple megabytes. This PR therefore also truncates oversized regular text parts during overflow-compaction input shrinking and truncates replayed overflow user text before auto-continuing, while preserving the original stored session messages. A sketch of that truncation follows.
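This is what in-memory truncation of an oversized text part can look like under the constraint the note states (stored messages untouched); the helper name and marker format are assumptions:

```ts
function truncateTextPart(text: string, maxChars: number): string {
  if (text.length <= maxChars) return text
  // Keep the head of the part and record how much was dropped; this runs on
  // the in-memory copy sent to the compaction model, never on stored history.
  return `${text.slice(0, maxChars)}\n[... ${text.length - maxChars} characters truncated for compaction ...]`
}
```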