
fix(cli): scale compaction pruning by model budget #9557

Open

marius-kilocode wants to merge 3 commits into main from fix/model-aware-compaction

Conversation

@marius-kilocode (Collaborator) commented Apr 27, 2026

Summary

  • Scale compaction pruning budgets from the active model's input/context limits instead of relying on a fixed 40k tool-output window.
  • Shrink overflow-triggered compaction input before the summary model call by trimming older messages and truncating large tool/synthetic text parts in-memory.
  • Add regression coverage for model-aware pruning, overflow compaction shrinking, plugin/MCP-style extra context, and persisted-message immutability.

Design Decisions

The immediate failure mode was that auto-compaction could only be triggered after a session was already too large, at which point the compaction request itself also exceeded the model limit. The previous background pruning path kept a fixed 40_000 estimated tokens of recent tool output and only ran after a turn completed, which made it both too late and too insensitive to the model's limits to recover from overflow.

The old extension handled this class of problem through context management rather than tool-output pruning. In kilocode-legacy, the effective cap was allowedTokens = contextWindow * 0.9 - maxTokens: reserve model output plus a 10% context-window buffer before condensing or sliding-window truncating. On hard context-window errors it forced a more aggressive reduction path using FORCED_CONTEXT_REDUCTION_PERCENT = 75. That worked because the old extension controlled the whole conversation window before the request was sent.
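
As a rough sketch (the function name and shape are illustrative, not the actual kilocode-legacy source), the legacy cap computed:

// Illustrative sketch of the legacy cap, not the actual kilocode-legacy code.
function legacyAllowedTokens(contextWindow: number, maxTokens: number): number {
  // Reserve the model's output tokens plus a 10% context-window buffer
  // before condensing or sliding-window truncation kicks in.
  return contextWindow * 0.9 - maxTokens
}

// Example: a 128k-context model reserving 8k output tokens keeps
// legacyAllowedTokens(128_000, 8_000) === 107_200 tokens for the conversation.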

This PR keeps the same design principle but adapts it to the current OpenCode architecture, where compaction has two distinct phases: background tool-output pruning and overflow summary recovery. For models with only a context limit, the new helper mirrors the legacy shape by reserving maxOutputTokens plus a 10% prompt/context overhead buffer. For models with a separate input limit, it preserves the existing compaction.reserved behavior while still reserving the same 10% overhead buffer. That overhead is intentionally conservative because the final request can include system prompts, MCP/tool schemas, AGENTS instructions, reminders, and plugin-provided compaction context, none of which is captured by counting tool outputs alone.

Normal pruning and overflow recovery now use separate budgets. Normal pruning preserves more recent tool context for quality, scaled to the model's usable budget. Overflow recovery is deliberately stricter because the priority is making the compaction call fit at all. The overflow shrink step is in-memory only, so stored session history remains intact and can still be shown, imported, or revisited later.
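
Concretely, using the ratio constants this PR adds (they appear verbatim in a diff excerpt further down this thread), the derivation could look roughly like the sketch below. The helper shape, the ModelLimits type, and the exact overflow floor are illustrative assumptions, and the input-limit branch glosses over the preserved compaction.reserved behavior:

const BUDGET_NORMAL_RATIO = 0.2
const BUDGET_OVERFLOW_RATIO = 0.05
const BUDGET_PROMPT_RATIO = 0.1
const BUDGET_NORMAL_MIN = 8_000
const BUDGET_OVERFLOW_MIN = 2_000 // assumed floor, implied by the review notes below

interface ModelLimits {
  input?: number // separate input limit, when the provider reports one
  context: number
  output: number
}

// Illustrative: derive the usable window, then split it into a generous
// normal-prune budget (quality) and a strict overflow-recovery budget
// (making the compaction call fit at all).
function budgets(limits: ModelLimits) {
  // Context-only models mirror the legacy shape: reserve output, then 10%.
  const window = limits.input ?? (limits.context - limits.output)
  const usable = window * (1 - BUDGET_PROMPT_RATIO)
  return {
    normal: Math.max(usable * BUDGET_NORMAL_RATIO, BUDGET_NORMAL_MIN),
    overflow: Math.max(usable * BUDGET_OVERFLOW_RATIO, BUDGET_OVERFLOW_MIN),
  }
}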

Upstream Comparison

This PR covers the same failure class as upstream OpenCode issues #15849 and #17340: the original request overflows, then the recovery compaction request also overflows because it still includes too much context.

Upstream #14707 is already present in our codebase. It makes 413/context-overflow errors trigger auto-compaction, strips media during compaction, and stops when compaction itself overflows. That is necessary but not sufficient for this bug, because non-media context can still be too large: long tool outputs, synthetic text, repeated tool/question loops, system/reminder/plugin context, and MCP/tool schemas can exceed the model's limits even after media stripping.

Upstream #20718 is the closest proposed fix. It pre-prunes overflow compaction input by keeping the most recent 40 messages, truncating completed tool outputs to 500 chars, and truncating synthetic text parts to 2,000 chars. This PR adopts that same placement in the pipeline: shrink the overflow compaction input before the summary model call, without mutating persisted session history.
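
As a sketch of that placement (the 40/500/2,000 limits are #20718's fixed values; the part shapes are simplified assumptions):

type Part =
  | { type: "tool"; output: string }
  | { type: "text"; synthetic?: boolean; text: string }

// Illustrative shrink step in the spirit of upstream #20718: operate on
// shallow copies so persisted session history is never mutated.
function shrinkForCompaction(messages: { parts: Part[] }[]) {
  return messages.slice(-40).map((msg) => ({
    ...msg,
    parts: msg.parts.map((part): Part => {
      if (part.type === "tool") return { ...part, output: part.output.slice(0, 500) }
      if (part.type === "text" && part.synthetic) return { ...part, text: part.text.slice(0, 2_000) }
      return part
    }),
  }))
}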

The main difference from #20718 is that this PR makes the limits model-aware instead of fixed. The old fixed 40-message/500-char/2,000-char strategy helps, but it treats 32k, 128k, 200k, and 1M-context models the same. This PR derives the recovery budget from the active model's input/context limits, reserves output budget plus 10% prompt/context overhead, and then derives normal-prune and overflow-recovery budgets from that usable space. This is meant to preserve more useful context on large models while being more aggressive on smaller ones.

Upstream #20516 is broader hardening: circuit breaker, retries, output caps, post-compact budget accounting, and attachment restoration filtering. This PR does not attempt to solve all of that. It is a focused input-side fix for the unrecoverable overflow-compaction request. It should compose with later breaker/retry/output-cap work, but it intentionally avoids adding a larger compaction policy system in this patch.

In short: #14707 lets OpenCode attempt recovery, #20718 shows the right input-side recovery point, #20516 proposes broader loop/output hardening, and this PR adapts the input-side recovery fix to Kilo with model-aware budgets and legacy-style context headroom.

Validation

  • bun test test/session/compaction.test.ts
  • bun run typecheck
  • bun run script/check-opencode-annotations.ts

Local Reproduction Note

After testing against the local GPT 5.2 sessions in this worktree, the remaining FUNCTION_PAYLOAD_TOO_LARGE case was not dominated by tool output. The failing sessions contained repeated normal user text parts around 400k characters each, totaling multiple megabytes. This PR now also truncates oversized regular text parts during overflow-compaction input shrinking and truncates replayed overflow user text before auto-continuing, while preserving the original stored session messages.
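
A minimal sketch of that extra cap (the chars-per-token estimate and helper name are assumptions, not the patch itself):

// Illustrative: cap an oversized regular text part against a character budget
// derived from the overflow token budget (~4 chars per estimated token).
function capTextPart(text: string, overflowBudgetTokens: number): string {
  const maxChars = overflowBudgetTokens * 4
  if (text.length <= maxChars) return text
  return text.slice(0, maxChars) + "\n[...truncated for compaction]"
}

// A 400_000-character part against a 2_000-token overflow budget shrinks to
// roughly 8_000 characters; the stored message keeps the full text.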

}),
}))
}
// kilocode_change end
Contributor commented:

why not keep this whole section in a separate file, and import it to keep the merge simpler?

@markijbema (Contributor) left a comment:

It seems quite reasonable; how did you test this? Compacting too much or too little can both be impactful.

@marius-kilocode (Collaborator, Author) replied:

@markijbema I don't think it reliably works yet. Waiting for @chrarnoldus's opinion.

? { type: "text" as const, text: `[Attached ${part.mime}: ${part.filename ?? "file"}]` }
: part
// kilocode_change start - shrink replayed overflow content before auto-continuing
const cleaned = cap ? sanitize({ part, budget: cap }) : part
Contributor commented:

WARNING: Replay truncation uses the compaction model budget

cap is computed from model, which can resolve to the hidden compaction agent's model. But this replayed turn is re-enqueued with original.model and sent on the next real request using that model. If the compaction agent is configured with a larger context window than the user's model, this sanitization can still leave the replay too large and immediately overflow again.

kilo-code-bot (Bot) commented Apr 27, 2026

Code Review Summary

Status: 3 Issues Found | Recommendation: Address before merge

Overview

Severity     Count
CRITICAL     0
WARNING      3
SUGGESTION   0


Issue Details

WARNING

  • packages/opencode/src/session/compaction.ts:39 - The minimum budget floors can exceed the usable window on small-context models, so overflow recovery can still assemble a compaction request that does not fit.
  • packages/opencode/src/session/compaction.ts:200 - prune() now depends on provider.getModel(...); if that lookup fails for an old or deleted model, background pruning becomes a silent no-op.
  • packages/opencode/src/session/compaction.ts:414 - Replay truncation is budgeted against the compaction model, so a larger compaction model can still replay too much context for the original user model.

Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

  • packages/opencode/src/session/compaction.ts:456 - The overflow fallback message still says the failure was caused by large media attachments, which is misleading for the text/tool-output overflows handled by this PR.
Files Reviewed (3 files)
  • .changeset/model-aware-compaction.md - 0 issues
  • packages/opencode/src/session/compaction.ts - 3 warnings, 1 other observation
  • packages/opencode/test/session/compaction.test.ts - 0 issues

Reviewed by gpt-5.4-20260305 · 1,515,012 tokens

const BUDGET_NORMAL_RATIO = 0.2
const BUDGET_OVERFLOW_RATIO = 0.05
const BUDGET_PROMPT_RATIO = 0.1
const BUDGET_NORMAL_MIN = 8_000
Contributor commented:

WARNING: The minimum budgets still overshoot small-window models

budget() is meant to scale by model capacity, but these floors force normal >= 8_000 and overflow >= 2_000 even when usable is smaller than that. On 4k/8k-context models the compaction path can still keep more tool/text content than the model can fit, so overflow recovery can recurse instead of reliably making the summary request fit.
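
One possible mitigation (not part of this PR) would be to clamp each floor to the usable window, e.g.:

// Hypothetical clamp: floors never exceed what the model can actually fit.
const normal = Math.min(Math.max(usable * BUDGET_NORMAL_RATIO, BUDGET_NORMAL_MIN), usable)
const overflow = Math.min(Math.max(usable * BUDGET_OVERFLOW_RATIO, BUDGET_OVERFLOW_MIN), usable)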


// kilocode_change start - scale protected tool-output window with the active model
const last = msgs.findLast((msg) => msg.info.role === "user")
const model = last?.info.role === "user" ? yield* provider.getModel(last.info.model.providerID, last.info.model.modelID) : undefined
Contributor commented:

WARNING: Pruning now silently stops when the session model is no longer available

provider.getModel(...) throws for deleted or renamed models. prompt.ts forks compaction.prune(...).pipe(Effect.ignore), so this turns background pruning into a silent no-op and old sessions keep their large tool outputs indefinitely. Falling back to PRUNE_PROTECT/PRUNE_MINIMUM when lookup fails would preserve the previous behavior.
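
A sketch of that fallback, assuming provider.getModel fails as an Effect and reusing the identifiers from the excerpt above:

// Hypothetical fallback inside the Effect.gen pruning body: degrade to the
// previous fixed constants instead of letting a failed lookup kill pruning.
const model = last?.info.role === "user"
  ? yield* provider
      .getModel(last.info.model.providerID, last.info.model.modelID)
      .pipe(Effect.orElseSucceed(() => undefined))
  : undefined
// modelAwareBudget is a stand-in for this PR's budget helper.
const protect = model ? modelAwareBudget(model) : PRUNE_PROTECT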

kilo-code-bot (Bot) commented Apr 27, 2026

Code Review Summary

Status: 2 Issues Found | Recommendation: Address before merge

Overview

Severity     Count
CRITICAL     0
WARNING      2
SUGGESTION   0

Issue Details

WARNING

  • packages/opencode/src/session/compaction.ts:39 - Minimum pruning/recovery floors can still exceed the usable budget on very small-context models.
  • packages/opencode/src/session/compaction.ts:200 - Background pruning becomes a silent no-op if the original session model can no longer be resolved.

Files Reviewed (3 files)
  • .changeset/model-aware-compaction.md - 0 issues
  • packages/opencode/src/session/compaction.ts - 2 issues
  • packages/opencode/test/session/compaction.test.ts - 0 issues



Reviewed by gpt-5.4-20260305 · 1,857,009 tokens

@chrarnoldus (Collaborator) commented:

Is this a workaround for Vercel's upload limit? FUNCTION_PAYLOAD_TOO_LARGE looks like a Vercel error
