feat(thinking): pass client thinking.budget_tokens through to fake reasoning tags#112

Open
kilhyeonjun wants to merge 2 commits into jwadow:main from kilhyeonjun:feat/thinking-budget-passthrough

Conversation

@kilhyeonjun
Contributor

Summary

Fixes #111. The fake reasoning feature now respects the client's `thinking.budget_tokens` from the OpenAI-compatible request body, instead of always using the hardcoded `FAKE_REASONING_MAX_TOKENS` env var.

Changes

  • models_openai.py: Add optional thinking: Dict[str, Any] field to ChatCompletionRequest
  • converters_openai.py: Extract budget_tokens from request.thinking and pass it downstream
  • converters_core.py: Accept optional max_tokens param in inject_thinking_tags() and thinking_budget param in build_kiro_payload(), falling back to FAKE_REASONING_MAX_TOKENS when not provided
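The passthrough described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the helper names and signatures mirror the bullet points, but the exact function bodies in `converters_openai.py` and `converters_core.py` are assumptions.

```python
from typing import Any, Dict, Optional

# Env default in the real gateway; hardcoded here for the sketch.
FAKE_REASONING_MAX_TOKENS = 4000


def extract_thinking_budget(thinking: Optional[Dict[str, Any]]) -> Optional[int]:
    """Pull budget_tokens out of the request's optional `thinking` dict."""
    if not thinking:
        return None
    budget = thinking.get("budget_tokens")
    return int(budget) if budget is not None else None


def inject_thinking_tags(prompt: str, max_tokens: Optional[int] = None) -> str:
    """Use the client budget when given, else fall back to the env default."""
    effective = max_tokens if max_tokens is not None else FAKE_REASONING_MAX_TOKENS
    return f"<max_thinking_length>{effective}</max_thinking_length>\n{prompt}"
```

With this shape, a request carrying `thinking: {"budget_tokens": 16000}` produces a `<max_thinking_length>16000</max_thinking_length>` tag, while a request with no `thinking` field keeps the 4000 default.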

Behavior

| Scenario | Before | After |
| --- | --- | --- |
| Client sends `thinking.budget_tokens: 16000` | `<max_thinking_length>4000</max_thinking_length>` (env default) | `<max_thinking_length>16000</max_thinking_length>` |
| Client sends no `thinking` field | `<max_thinking_length>4000</max_thinking_length>` | `<max_thinking_length>4000</max_thinking_length>` (unchanged) |

Testing

Verified locally via Docker with debug logging enabled:

DEBUG - Client requested thinking budget: 10000
DEBUG - Injecting fake reasoning tags with max_tokens=10000
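For reference, a request body that would trigger the debug lines above might look like the sketch below. Only `thinking.budget_tokens` is the field this PR adds; the model name is a placeholder and the other keys are ordinary OpenAI chat-completion fields, not taken from the PR.

```python
import json

# Hypothetical OpenAI-compatible request from a client such as OpenCode.
body = {
    "model": "example-model",  # placeholder, not a real model id
    "messages": [{"role": "user", "content": "Explain quicksort briefly."}],
    "max_tokens": 1024,
    # The new optional field: per-request reasoning budget.
    "thinking": {"budget_tokens": 10000},
}
payload = json.dumps(body)
```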

…asoning tags

The gateway currently hardcodes FAKE_REASONING_MAX_TOKENS for the
<max_thinking_length> XML tag, ignoring the client-provided
thinking.budget_tokens from the OpenAI-compatible request body.

This means clients (e.g. OpenCode IDE) that send a thinking budget
have no way to control reasoning depth per-request — every request
gets the same static value from the env var.

Changes:
- models_openai.py: Add optional `thinking` field to ChatCompletionRequest
- converters_openai.py: Extract budget_tokens from request.thinking dict
- converters_core.py: Accept optional max_tokens param in inject_thinking_tags()
  and thinking_budget param in build_kiro_payload(), falling back to
  FAKE_REASONING_MAX_TOKENS when not provided

Constraint: FAKE_REASONING_MAX_TOKENS remains the default when no client budget is sent
Confidence: high
Scope-risk: narrow
@cla-bot added the `cla-signed` label (Contributor License Agreement has been signed) on Mar 24, 2026
…ustion

When clients send large thinking.budget_tokens (e.g. 32768 for effort=max),
the fake reasoning XML tag tells the model it can think for up to 32K tokens.
Unlike native thinking where thinking tokens are separately allocated, fake
reasoning shares the output token pool — the model spends all output tokens
on reasoning and never produces actual content.

Add FAKE_REASONING_BUDGET_CAP env var (default: 10000) that limits how high
client budget_tokens can push the fake reasoning max_thinking_length tag.
Set to 0 to disable capping.
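The capping rule can be sketched as a one-liner; this is an illustration of the described behavior, not the commit's actual code, and the helper name is an assumption.

```python
# Env default per the commit message; 0 disables capping.
FAKE_REASONING_BUDGET_CAP = 10000


def cap_client_budget(client_budget: int, cap: int = FAKE_REASONING_BUDGET_CAP) -> int:
    """Clamp a client-supplied budget_tokens; the env default itself is never capped."""
    if cap <= 0:
        return client_budget
    return min(client_budget, cap)
```

So a client sending 32768 (effort=max) gets the tag clamped to 10000, a client sending 8000 passes through unchanged, and setting the cap to 0 restores the old passthrough behavior.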

Constraint: FAKE_REASONING_MAX_TOKENS (env default) is never capped — only client overrides
Confidence: high
Scope-risk: narrow
Development

Successfully merging this pull request may close these issues.

Fake reasoning ignores client thinking.budget_tokens — always uses hardcoded FAKE_REASONING_MAX_TOKENS