# Adaptive Output Token Escalation Design

> Reduces GPU slot over-reservation by ~4x through a "low default + escalate on truncation" strategy for output tokens, with multi-turn recovery for responses that exceed even the escalated limit.

## Problem

Every API request reserves a fixed GPU slot proportional to `max_tokens`. The previous default of 32K tokens means each request reserves a 32K output slot, but 99% of responses are under 5K tokens. This over-reserves GPU capacity by 4-6x, limiting server concurrency and increasing cost.
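The 4-6x figure follows from simple division. A minimal back-of-envelope sketch, assuming a hypothetical fixed per-server output-token budget (the budget number below is illustrative, not from the source):

```typescript
// Hypothetical server-wide output-token budget for concurrent slots.
const serverTokenBudget = 1_000_000;

// Each request reserves a slot sized by its max_tokens value.
const concurrencyAt32k = Math.floor(serverTokenBudget / 32_000); // old default
const concurrencyAt8k = Math.floor(serverTokenBudget / 8_000); // capped default

// Since 99% of responses are under 5K tokens, most of a 32K slot is wasted;
// an 8K slot quadruples how many requests fit in the same budget.
```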

## Solution

Use a capped default of **8K** output tokens. When a response is truncated (the model hits `max_tokens`):

1. **Escalate** to the model's full output limit (with 64K as a floor for unknown models)
2. If still truncated, **recover** by keeping the partial response in history and injecting a continuation message, up to 3 times
3. If recovery is exhausted, fall back to the tool scheduler's truncation guidance

Since <1% of requests are actually truncated, this reduces average slot reservation significantly while preserving output quality for long responses.

## Architecture

```
Request (max_tokens = 8K)
┌─────────────────────────┐
│ Response truncated? │──── No ──▶ Done ✓
│ (MAX_TOKENS) │
└───────────┬──────────────┘
│ Yes
┌──────────────────────────────────────────────────┐
│ Layer 1: Escalate to model output limit │
│ ┌────────────────────────────────────────────┐ │
│ │ Pop partial response from history │ │
│ │ RETRY (isContinuation: false → reset UI) │ │
│ │ Re-send at max(64K, model output limit) │ │
│ └────────────────────────────────────────────┘ │
└───────────┬──────────────────────────────────────┘
┌─────────────────────────┐
│ Still truncated? │──── No ──▶ Done ✓
│ (MAX_TOKENS) │
└───────────┬──────────────┘
│ Yes
┌──────────────────────────────────────────────────┐
│ Layer 2: Multi-turn recovery (up to 3×) │
│ ┌────────────────────────────────────────────┐ │
│ │ Keep partial response in history │ │
│ │ Push user message: "Resume directly..." │ │
│ │ RETRY (isContinuation: true → keep UI buf) │ │
│ │ Re-send with updated history │ │
│ │ Model continues from where it left off │ │
│ └──────────────┬─────────────────────────────┘ │
│ │ │
│ ┌──────┴──────┐ │
│ │ Succeeded? │── Yes ──▶ Done ✓ │
│ └──────┬──────┘ │
│ │ No (still truncated) │
│ ▼ │
│ attempt < 3? ── Yes ──▶ loop back ↑ │
└───────────┬──────────────────────────────────────┘
│ No (exhausted)
┌──────────────────────────────────────────────────┐
│ Layer 3: Tool scheduler fallback │
│ ┌────────────────────────────────────────────┐ │
│ │ Reject truncated Edit/Write tool calls │ │
│ │ Return guidance: "You MUST split into │ │
│ │ smaller parts — write skeleton first, │ │
│ │ then edit incrementally." │ │
│ └────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
```

## Token limit determination

The effective `max_tokens` is resolved in the following priority order:

| Priority | Source | Value (known model) | Value (unknown model) | Escalation behavior |
| ----------- | ---------------------------------------------------- | ---------------------------- | --------------------- | ----------------------------------------------- |
| 1 (highest) | User config (`samplingParams.max_tokens`) | `min(userValue, modelLimit)` | `userValue` | No escalation |
| 2 | Environment variable (`QWEN_CODE_MAX_OUTPUT_TOKENS`) | `min(envValue, modelLimit)` | `envValue` | No escalation |
| 3 (lowest) | Capped default | `min(modelLimit, 8K)` | `min(32K, 8K)` = 8K | Escalates to model limit (64K floor) + recovery |

A "known model" is one that has an explicit entry in `OUTPUT_PATTERNS` (checked via `hasExplicitOutputLimit()`). For known models, the effective value is always capped at the model's declared output limit to avoid API errors. Unknown models (custom deployments, self-hosted endpoints) pass the user's value through directly, since the backend may support larger limits.
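The priority order above can be sketched as a small resolver. This is a minimal sketch, not the actual `geminiChat.ts`/`tokenLimits.ts` code: `resolveMaxTokens` and the `MaxTokensSources` shape are illustrative names, and `modelOutputLimit` stands in for the `OUTPUT_PATTERNS` lookup.

```typescript
const CAPPED_DEFAULT_MAX_TOKENS = 8_000;

// Hypothetical bundle of the three sources from the table above.
interface MaxTokensSources {
  userMaxTokens?: number; // samplingParams.max_tokens
  envMaxTokens?: number; // QWEN_CODE_MAX_OUTPUT_TOKENS
  modelOutputLimit?: number; // set only for "known" models
}

function resolveMaxTokens(src: MaxTokensSources): number {
  // Known models are capped at the declared limit to avoid API errors;
  // unknown models pass the value through untouched.
  const cap = (v: number) =>
    src.modelOutputLimit !== undefined ? Math.min(v, src.modelOutputLimit) : v;

  if (src.userMaxTokens !== undefined) return cap(src.userMaxTokens); // priority 1
  if (src.envMaxTokens !== undefined) return cap(src.envMaxTokens); // priority 2

  // Priority 3: capped default. For unknown models, min(32K default, 8K) = 8K,
  // so the plain constant suffices.
  return src.modelOutputLimit !== undefined
    ? Math.min(src.modelOutputLimit, CAPPED_DEFAULT_MAX_TOKENS)
    : CAPPED_DEFAULT_MAX_TOKENS;
}
```

Only the priority-3 path is eligible for escalation; the first two paths express explicit user intent and are never escalated.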

### Escalation steps (geminiChat.ts)

The escalation logic lives in `geminiChat.ts`, placed **outside** the main retry loop:

```
3. Guard checks pass:
- maxTokensEscalated === false (prevent infinite escalation)
- hasUserMaxTokensOverride === false (respect user intent)
4. Compute escalated limit: max(ESCALATED_MAX_TOKENS, tokenLimit(model, 'output'))
5. Pop the partial model response from chat history
6. Yield RETRY event (isContinuation: false) → UI discards partial output and resets buffers
7. Re-send the same request with maxOutputTokens: escalatedLimit
```
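The guard checks and limit computation above can be sketched as follows. This is a hedged sketch under stated assumptions: `shouldEscalate` and `escalatedLimit` are illustrative names, and `StreamResult` stands in for the real stream response type, but the guard fields mirror the flags named in the steps.

```typescript
const ESCALATED_MAX_TOKENS = 64_000; // floor for unknown models

interface StreamResult {
  finishReason: 'STOP' | 'MAX_TOKENS' | string;
}

interface EscalationState {
  maxTokensEscalated: boolean; // set after the one allowed escalation
  hasUserMaxTokensOverride: boolean; // user/env override present
}

function shouldEscalate(result: StreamResult, state: EscalationState): boolean {
  return (
    result.finishReason === 'MAX_TOKENS' &&
    !state.maxTokensEscalated && // prevent infinite escalation
    !state.hasUserMaxTokensOverride // respect user intent
  );
}

// Step 4: escalate to the model's output limit, with a 64K floor for
// models whose limit is unknown (tokenLimit would return a low default).
function escalatedLimit(modelOutputLimit: number): number {
  return Math.max(ESCALATED_MAX_TOKENS, modelOutputLimit);
}
```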

### Recovery steps (geminiChat.ts)

If the escalated response is also truncated (finishReason === MAX_TOKENS), the recovery loop runs up to `MAX_OUTPUT_RECOVERY_ATTEMPTS` (3) times:

```
1. Partial model response is already in history (pushed by processStreamResponse)
2. Push a recovery user message: OUTPUT_RECOVERY_MESSAGE
3. Yield RETRY event (isContinuation: true) → UI keeps text buffer for continuation
4. Re-send with updated history (model sees its partial output + recovery instruction)
5. If still truncated and attempts remain, loop back to step 1
6. If recovery attempt throws (empty response, network error):
- Pop the dangling recovery message from history
- Break out of recovery loop
```
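The recovery loop above can be sketched as a standalone function. This is a minimal sketch, assuming a synchronous `send` callback and a simplified history shape; `recoverTruncatedOutput`, `ChatTurn`, and the exact `OUTPUT_RECOVERY_MESSAGE` wording are illustrative, not the real `geminiChat.ts` API.

```typescript
const MAX_OUTPUT_RECOVERY_ATTEMPTS = 3;
const OUTPUT_RECOVERY_MESSAGE =
  'Your previous response was truncated. Resume directly from where it stopped.';

type ChatTurn = { role: 'user' | 'model'; text: string };
type SendFn = (history: ChatTurn[]) => { text: string; truncated: boolean };

function recoverTruncatedOutput(history: ChatTurn[], send: SendFn): boolean {
  for (let attempt = 0; attempt < MAX_OUTPUT_RECOVERY_ATTEMPTS; attempt++) {
    // The partial model response stays in history; ask the model to continue.
    history.push({ role: 'user', text: OUTPUT_RECOVERY_MESSAGE });
    try {
      const res = send(history);
      history.push({ role: 'model', text: res.text });
      if (!res.truncated) {
        return true; // continuation completed within the limit
      }
    } catch {
      history.pop(); // drop the dangling recovery message
      return false; // empty response / network error: stop recovering
    }
  }
  return false; // attempts exhausted; tool scheduler fallback takes over
}
```

Because the partial output is kept, each attempt only pays for the ~40-token recovery message plus the continuation, rather than regenerating the response from scratch.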

### State cleanup on RETRY (turn.ts)
When the `Turn` class receives a RETRY event, it clears accumulated state to prevent stale data from leaking into the retried response:
- `debugResponses` — cleared to avoid stale debug data
- `finishReason` — reset to `undefined` so the new response's finish reason is used

The `isContinuation` flag is passed through to the UI so it can decide whether to reset text buffers (escalation) or keep them (recovery).
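A minimal sketch of the reset, assuming simplified field types; any `Turn` fields beyond those listed above are not shown, and `handleRetry` is an illustrative method name rather than the real `turn.ts` signature:

```typescript
class Turn {
  debugResponses: unknown[] = [];
  finishReason: string | undefined;

  handleRetry(isContinuation: boolean): boolean {
    this.debugResponses = []; // avoid stale debug data
    this.finishReason = undefined; // next response's finish reason is used
    // isContinuation is forwarded to the UI, which resets its text buffers
    // on escalation (false) and keeps them on recovery (true).
    return isContinuation;
  }
}
```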

## Constants

Defined in `geminiChat.ts` and `tokenLimits.ts`:

| Constant | Value | Purpose |
| ------------------------------ | ------ | ------------------------------------------------------- |
| `CAPPED_DEFAULT_MAX_TOKENS` | 8,000 | Default output token limit when no user override is set |
| `ESCALATED_MAX_TOKENS` | 64,000 | Floor for escalation (used when model limit is unknown) |
| `MAX_OUTPUT_RECOVERY_ATTEMPTS` | 3 | Max multi-turn recovery attempts after escalation |

The effective escalated limit is `max(ESCALATED_MAX_TOKENS, tokenLimit(model, 'output'))`:

| Model | Escalated limit |
| ---------------- | --------------- |
| Claude Opus 4.6 | 131,072 (128K) |
| GPT-5 / o-series | 131,072 (128K) |
| Qwen3.x | 65,536 (64K) |
| Unknown models | 64,000 (floor) |
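The table above can be reproduced with the `max(...)` formula. In this sketch, `assumedOutputLimits` is a stand-in for the real `tokenLimit(model, 'output')` lookup, and the model-name keys are illustrative:

```typescript
const ESCALATED_MAX_TOKENS = 64_000;

// Assumed per-model output limits; the real values live in tokenLimits.ts.
const assumedOutputLimits: Record<string, number> = {
  'claude-opus-4.6': 131_072,
  'gpt-5': 131_072,
  'qwen3-coder': 65_536,
};

function effectiveEscalatedLimit(model: string): number {
  // Unknown models fall back to the default 32K, so the 64K floor applies.
  const modelLimit = assumedOutputLimits[model] ?? 32_000;
  return Math.max(ESCALATED_MAX_TOKENS, modelLimit);
}
```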

## Design decisions

### Why 8K default?
- 8K provides reasonable headroom for slightly longer responses without triggering unnecessary retries
- Reduces average slot reservation from 32K to 8K (4x improvement)

### Why escalate to model limit instead of fixed 64K?

- Models with higher output limits (Claude Opus 128K, GPT-5 128K) were constrained to 64K unnecessarily
- Using the model's actual limit captures the vast majority of long outputs without a second retry
- `ESCALATED_MAX_TOKENS` (64K) serves as a floor for unknown models where `tokenLimit()` returns the default 32K

### Why multi-turn recovery instead of progressive escalation?

- Progressive escalation (8K → 16K → 32K → 64K) requires regenerating the full response each time
- Multi-turn recovery keeps the partial response and lets the model continue, saving tokens and latency
- Recovery messages are cheap (~40 tokens each) compared to regenerating large responses
- The 3-attempt limit prevents infinite loops while covering most practical cases

### Why is escalation outside the retry loop?

- Truncation is a success case, not an error
- Errors from the escalated stream (rate limits, network failures) should propagate directly rather than being silently retried with incorrect parameters
- Keeps the retry loop focused on its original purpose (transient error recovery)
- Recovery errors are caught separately to avoid aborting the entire conversation
## `packages/cli/src/ui/hooks/useGeminiStream.ts` (19 additions, 3 deletions)

```ts
export const useGeminiStream = (
  // ...
        loopDetectedRef.current = true;
        break;
      case ServerGeminiEventType.Retry:
        // On fresh restart (escalation / rate-limit / invalid stream),
        // clear pending content and buffers to discard the failed attempt.
        // On continuation (recovery), keep the pending gemini item AND
        // buffers so the model's continuation text appends to them —
        // otherwise handleContentEvent would see a null pending item,
        // create a fresh one, and reset the buffer to just the new chunk,
        // losing the partial text we meant to preserve.
        if (!event.isContinuation) {
          if (pendingHistoryItemRef.current) {
            setPendingHistoryItem(null);
          }
          geminiMessageBuffer = '';
          thoughtBuffer = '';
        }
        // Always discard tool call requests from the truncated/failed
        // attempt to prevent duplicate execution after escalation or
        // recovery. The recovery path now skips turns that already
        // contain a functionCall (see geminiChat.ts), so this only
        // clears stale requests from pre-RETRY accumulation.
        toolCallRequests.length = 0;
        // Show retry info if available (rate-limit / throttling errors)
        if (event.retryInfo) {
          startRetryCountdown(event.retryInfo);
          // ...
```
## `packages/core/src/core/coreToolScheduler.ts` (9 additions, 7 deletions)

```ts
import { IdeClient } from '../ide/ide-client.js';

const TRUNCATION_PARAM_GUIDANCE =
  'Note: Your previous response was truncated due to max_tokens limit, ' +
  'which caused incomplete tool call parameters. ' +
  'Please retry the tool call with complete parameters. ' +
  'If the content is too large for a single response, ' +
  'you MUST split it into smaller parts: ' +
  'first write_file with a skeleton/partial content, ' +
  'then use edit to add the remaining sections incrementally.';

const TRUNCATION_EDIT_REJECTION =
  'Your previous response was truncated due to max_tokens limit, ' +
  'which produced incomplete file content. ' +
  'The tool call has been rejected to prevent writing ' +
  'truncated content to the file. ' +
  'You MUST split the content into smaller parts: ' +
  'first write_file with a skeleton/partial content, ' +
  'then use edit to add the remaining sections incrementally. ' +
  'Do NOT retry with the same large content.';

export type ValidatingToolCall = {
  status: 'validating';
  // ...
```