# Adaptive Output Token Escalation Design

> Reduces GPU slot over-reservation by ~4x through a "low default + escalate on truncation" strategy for output tokens, with multi-turn recovery for responses that exceed even the escalated limit.

## Problem

Every API request reserves a fixed GPU slot proportional to `max_tokens`. The previous default of 32K tokens means each request reserves a 32K output slot, but 99% of responses are under 5K tokens. This over-reserves GPU capacity by 4-6x, limiting server concurrency and increasing cost.
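The 4-6x figure follows from simple division. A minimal back-of-envelope sketch, assuming a hypothetical fixed per-server output-token budget (the budget number below is illustrative, not from the source):

```typescript
// Hypothetical server-wide output-token budget for concurrent slots.
const serverTokenBudget = 1_000_000;

// Each request reserves a slot sized by its max_tokens value.
const concurrencyAt32k = Math.floor(serverTokenBudget / 32_000); // old default
const concurrencyAt8k = Math.floor(serverTokenBudget / 8_000); // capped default

// Since 99% of responses are under 5K tokens, most of a 32K slot is wasted;
// an 8K slot quadruples how many requests fit in the same budget.
```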

## Solution

Use a capped default of **8K** output tokens. When a response is truncated (the model hits `max_tokens`):

1. **Escalate** to the model's full output limit (with 64K as a floor for unknown models)
2. If still truncated, **recover** by keeping the partial response in history and injecting a continuation message, up to 3 times
3. If recovery is exhausted, fall back to the tool scheduler's truncation guidance

Since <1% of requests are actually truncated, this reduces average slot reservation significantly while preserving output quality for long responses.

## Architecture

```
Request (max_tokens = 8K)
┌─────────────────────────┐
│ Response truncated? │──── No ──▶ Done ✓
│ (MAX_TOKENS) │
└───────────┬──────────────┘
│ Yes
┌──────────────────────────────────────────────────┐
│ Layer 1: Escalate to model output limit │
│ ┌────────────────────────────────────────────┐ │
│ │ Pop partial response from history │ │
│ │ RETRY (isContinuation: false → reset UI) │ │
│ │ Re-send at max(64K, model output limit) │ │
│ └────────────────────────────────────────────┘ │
└───────────┬──────────────────────────────────────┘
┌─────────────────────────┐
│ Still truncated? │──── No ──▶ Done ✓
│ (MAX_TOKENS) │
└───────────┬──────────────┘
│ Yes
┌──────────────────────────────────────────────────┐
│ Layer 2: Multi-turn recovery (up to 3×) │
│ ┌────────────────────────────────────────────┐ │
│ │ Keep partial response in history │ │
│ │ Push user message: "Resume directly..." │ │
│ │ RETRY (isContinuation: true → keep UI buf) │ │
│ │ Re-send with updated history │ │
│ │ Model continues from where it left off │ │
│ └──────────────┬─────────────────────────────┘ │
│ │ │
│ ┌──────┴──────┐ │
│ │ Succeeded? │── Yes ──▶ Done ✓ │
│ └──────┬──────┘ │
│ │ No (still truncated) │
│ ▼ │
│ attempt < 3? ── Yes ──▶ loop back ↑ │
└───────────┬──────────────────────────────────────┘
│ No (exhausted)
┌──────────────────────────────────────────────────┐
│ Layer 3: Tool scheduler fallback │
│ ┌────────────────────────────────────────────┐ │
│ │ Reject truncated Edit/Write tool calls │ │
│ │ Return guidance: "You MUST split into │ │
│ │ smaller parts — write skeleton first, │ │
│ │ then edit incrementally." │ │
│ └────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
```

## Token limit determination

The effective `max_tokens` is resolved in the following priority order:

| Priority | Source | Value (known model) | Value (unknown model) | Escalation behavior |
| ----------- | ---------------------------------------------------- | ---------------------------- | --------------------- | ----------------------------------------------- |
| 1 (highest) | User config (`samplingParams.max_tokens`) | `min(userValue, modelLimit)` | `userValue` | No escalation |
| 2 | Environment variable (`QWEN_CODE_MAX_OUTPUT_TOKENS`) | `min(envValue, modelLimit)` | `envValue` | No escalation |
| 3 (lowest) | Capped default | `min(modelLimit, 8K)` | `min(32K, 8K)` = 8K | Escalates to model limit (64K floor) + recovery |

A "known model" is one that has an explicit entry in `OUTPUT_PATTERNS` (checked via `hasExplicitOutputLimit()`). For known models, the effective value is always capped at the model's declared output limit to avoid API errors. Unknown models (custom deployments, self-hosted endpoints) pass the user's value through directly, since the backend may support larger limits.
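The priority order above can be sketched as a small resolver. This is a minimal sketch, not the actual `geminiChat.ts`/`tokenLimits.ts` code: `resolveMaxTokens` and the `MaxTokensSources` shape are illustrative names, and `modelOutputLimit` stands in for the `OUTPUT_PATTERNS` lookup.

```typescript
const CAPPED_DEFAULT_MAX_TOKENS = 8_000;

// Hypothetical bundle of the three sources from the table above.
interface MaxTokensSources {
  userMaxTokens?: number; // samplingParams.max_tokens
  envMaxTokens?: number; // QWEN_CODE_MAX_OUTPUT_TOKENS
  modelOutputLimit?: number; // set only for "known" models
}

function resolveMaxTokens(src: MaxTokensSources): number {
  // Known models are capped at the declared limit to avoid API errors;
  // unknown models pass the value through untouched.
  const cap = (v: number) =>
    src.modelOutputLimit !== undefined ? Math.min(v, src.modelOutputLimit) : v;

  if (src.userMaxTokens !== undefined) return cap(src.userMaxTokens); // priority 1
  if (src.envMaxTokens !== undefined) return cap(src.envMaxTokens); // priority 2

  // Priority 3: capped default. For unknown models, min(32K default, 8K) = 8K,
  // so the plain constant suffices.
  return src.modelOutputLimit !== undefined
    ? Math.min(src.modelOutputLimit, CAPPED_DEFAULT_MAX_TOKENS)
    : CAPPED_DEFAULT_MAX_TOKENS;
}
```

Only the priority-3 path is eligible for escalation; the first two paths express explicit user intent and are never escalated.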

### Escalation steps (geminiChat.ts)

The escalation logic lives in `geminiChat.ts`, placed **outside** the main retry loop:

```
3. Guard checks pass:
- maxTokensEscalated === false (prevent infinite escalation)
- hasUserMaxTokensOverride === false (respect user intent)
4. Compute escalated limit: max(ESCALATED_MAX_TOKENS, tokenLimit(model, 'output'))
5. Pop the partial model response from chat history
6. Yield RETRY event (isContinuation: false) → UI discards partial output and resets buffers
7. Re-send the same request with maxOutputTokens: escalatedLimit
```
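The guard checks and limit computation above can be sketched as follows. This is a hedged sketch under stated assumptions: `shouldEscalate` and `escalatedLimit` are illustrative names, and `StreamResult` stands in for the real stream response type, but the guard fields mirror the flags named in the steps.

```typescript
const ESCALATED_MAX_TOKENS = 64_000; // floor for unknown models

interface StreamResult {
  finishReason: 'STOP' | 'MAX_TOKENS' | string;
}

interface EscalationState {
  maxTokensEscalated: boolean; // set after the one allowed escalation
  hasUserMaxTokensOverride: boolean; // user/env override present
}

function shouldEscalate(result: StreamResult, state: EscalationState): boolean {
  return (
    result.finishReason === 'MAX_TOKENS' &&
    !state.maxTokensEscalated && // prevent infinite escalation
    !state.hasUserMaxTokensOverride // respect user intent
  );
}

// Step 4: escalate to the model's output limit, with a 64K floor for
// models whose limit is unknown (tokenLimit would return a low default).
function escalatedLimit(modelOutputLimit: number): number {
  return Math.max(ESCALATED_MAX_TOKENS, modelOutputLimit);
}
```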

### Recovery steps (geminiChat.ts)

If the escalated response is also truncated (finishReason === MAX_TOKENS), the recovery loop runs up to `MAX_OUTPUT_RECOVERY_ATTEMPTS` (3) times:

```
1. Partial model response is already in history (pushed by processStreamResponse)
2. Push a recovery user message: OUTPUT_RECOVERY_MESSAGE
3. Yield RETRY event (isContinuation: true) → UI keeps text buffer for continuation
4. Re-send with updated history (model sees its partial output + recovery instruction)
5. If still truncated and attempts remain, loop back to step 1
6. If recovery attempt throws (empty response, network error):
- Pop the dangling recovery message from history
- Break out of recovery loop
```
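The recovery loop above can be sketched as a standalone function. This is a minimal sketch, assuming a synchronous `send` callback and a simplified history shape; `recoverTruncatedOutput`, `ChatTurn`, and the exact `OUTPUT_RECOVERY_MESSAGE` wording are illustrative, not the real `geminiChat.ts` API.

```typescript
const MAX_OUTPUT_RECOVERY_ATTEMPTS = 3;
const OUTPUT_RECOVERY_MESSAGE =
  'Your previous response was truncated. Resume directly from where it stopped.';

type ChatTurn = { role: 'user' | 'model'; text: string };
type SendFn = (history: ChatTurn[]) => { text: string; truncated: boolean };

function recoverTruncatedOutput(history: ChatTurn[], send: SendFn): boolean {
  for (let attempt = 0; attempt < MAX_OUTPUT_RECOVERY_ATTEMPTS; attempt++) {
    // The partial model response stays in history; ask the model to continue.
    history.push({ role: 'user', text: OUTPUT_RECOVERY_MESSAGE });
    try {
      const res = send(history);
      history.push({ role: 'model', text: res.text });
      if (!res.truncated) {
        return true; // continuation completed within the limit
      }
    } catch {
      history.pop(); // drop the dangling recovery message
      return false; // empty response / network error: stop recovering
    }
  }
  return false; // attempts exhausted; tool scheduler fallback takes over
}
```

Because the partial output is kept, each attempt only pays for the ~40-token recovery message plus the continuation, rather than regenerating the response from scratch.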

### State cleanup on RETRY (turn.ts)
When the `Turn` class receives a RETRY event, it clears accumulated state to prevent stale data from leaking into the retried response:
- `debugResponses` — cleared to avoid stale debug data
- `finishReason` — reset to `undefined` so the new response's finish reason is used

The `isContinuation` flag is passed through to the UI so it can decide whether to reset text buffers (escalation) or keep them (recovery).
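A minimal sketch of the reset, assuming simplified field types; any `Turn` fields beyond those listed above are not shown, and `handleRetry` is an illustrative method name rather than the real `turn.ts` signature:

```typescript
class Turn {
  debugResponses: unknown[] = [];
  finishReason: string | undefined;

  handleRetry(isContinuation: boolean): boolean {
    this.debugResponses = []; // avoid stale debug data
    this.finishReason = undefined; // next response's finish reason is used
    // isContinuation is forwarded to the UI, which resets its text buffers
    // on escalation (false) and keeps them on recovery (true).
    return isContinuation;
  }
}
```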

## Constants

Defined in `geminiChat.ts` and `tokenLimits.ts`:

| Constant | Value | Purpose |
| ------------------------------ | ------ | ------------------------------------------------------- |
| `CAPPED_DEFAULT_MAX_TOKENS` | 8,000 | Default output token limit when no user override is set |
| `ESCALATED_MAX_TOKENS` | 64,000 | Floor for escalation (used when model limit is unknown) |
| `MAX_OUTPUT_RECOVERY_ATTEMPTS` | 3 | Max multi-turn recovery attempts after escalation |

The effective escalated limit is `max(ESCALATED_MAX_TOKENS, tokenLimit(model, 'output'))`:

| Model | Escalated limit |
| ---------------- | --------------- |
| Claude Opus 4.6 | 131,072 (128K) |
| GPT-5 / o-series | 131,072 (128K) |
| Qwen3.x | 65,536 (64K) |
| Unknown models | 64,000 (floor) |
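The table above can be reproduced with the `max(...)` formula. In this sketch, `assumedOutputLimits` is a stand-in for the real `tokenLimit(model, 'output')` lookup, and the model-name keys are illustrative:

```typescript
const ESCALATED_MAX_TOKENS = 64_000;

// Assumed per-model output limits; the real values live in tokenLimits.ts.
const assumedOutputLimits: Record<string, number> = {
  'claude-opus-4.6': 131_072,
  'gpt-5': 131_072,
  'qwen3-coder': 65_536,
};

function effectiveEscalatedLimit(model: string): number {
  // Unknown models fall back to the default 32K, so the 64K floor applies.
  const modelLimit = assumedOutputLimits[model] ?? 32_000;
  return Math.max(ESCALATED_MAX_TOKENS, modelLimit);
}
```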

## Design decisions

### Why 8K default?
- 8K provides reasonable headroom for slightly longer responses without triggering unnecessary retries
- Reduces average slot reservation from 32K to 8K (4x improvement)

### Why escalate to model limit instead of fixed 64K?

- Models with higher output limits (Claude Opus 128K, GPT-5 128K) were constrained to 64K unnecessarily
- Using the model's actual limit captures the vast majority of long outputs without a second retry
- `ESCALATED_MAX_TOKENS` (64K) serves as a floor for unknown models where `tokenLimit()` returns the default 32K

### Why multi-turn recovery instead of progressive escalation?

- Progressive escalation (8K → 16K → 32K → 64K) requires regenerating the full response each time
- Multi-turn recovery keeps the partial response and lets the model continue, saving tokens and latency
- Recovery messages are cheap (~40 tokens each) compared to regenerating large responses
- The 3-attempt limit prevents infinite loops while covering most practical cases

### Why is escalation outside the retry loop?

- Truncation is a success case, not an error
- Errors from the escalated stream (rate limits, network failures) should propagate directly rather than being silently retried with incorrect parameters
- Keeps the retry loop focused on its original purpose (transient error recovery)
- Recovery errors are caught separately to avoid aborting the entire conversation
## `packages/cli/src/ui/hooks/useGeminiStream.ts` (19 additions, 3 deletions)

```ts
export const useGeminiStream = (
  // ...
        loopDetectedRef.current = true;
        break;
      case ServerGeminiEventType.Retry:
        // On fresh restart (escalation / rate-limit / invalid stream),
        // clear pending content and buffers to discard the failed attempt.
        // On continuation (recovery), keep the pending gemini item AND
        // buffers so the model's continuation text appends to them —
        // otherwise handleContentEvent would see a null pending item,
        // create a fresh one, and reset the buffer to just the new chunk,
        // losing the partial text we meant to preserve.
        if (!event.isContinuation) {
          if (pendingHistoryItemRef.current) {
            setPendingHistoryItem(null);
          }
          geminiMessageBuffer = '';
          thoughtBuffer = '';
        }
        // Always discard tool call requests from the truncated/failed
        // attempt to prevent duplicate execution after escalation or
        // recovery. The recovery path now skips turns that already
        // contain a functionCall (see geminiChat.ts), so this only
        // clears stale requests from pre-RETRY accumulation.
        toolCallRequests.length = 0;
        // Show retry info if available (rate-limit / throttling errors)
        if (event.retryInfo) {
          startRetryCountdown(event.retryInfo);
          // ...
```
## `packages/core/src/core/coreToolScheduler.ts` (9 additions, 7 deletions)

```ts
import { IdeClient } from '../ide/ide-client.js';

const TRUNCATION_PARAM_GUIDANCE =
  'Note: Your previous response was truncated due to max_tokens limit, ' +
  'which caused incomplete tool call parameters. ' +
  'Please retry the tool call with complete parameters. ' +
  'If the content is too large for a single response, ' +
  'you MUST split it into smaller parts: ' +
  'first write_file with a skeleton/partial content, ' +
  'then use edit to add the remaining sections incrementally.';

const TRUNCATION_EDIT_REJECTION =
  'Your previous response was truncated due to max_tokens limit, ' +
  'which produced incomplete file content. ' +
  'The tool call has been rejected to prevent writing ' +
  'truncated content to the file. ' +
  'You MUST split the content into smaller parts: ' +
  'first write_file with a skeleton/partial content, ' +
  'then use edit to add the remaining sections incrementally. ' +
  'Do NOT retry with the same large content.';

export type ValidatingToolCall = {
  status: 'validating';
  // ...
```