feat(thinking): pass client thinking.budget_tokens through to fake reasoning tags#112

Open
kilhyeonjun wants to merge 2 commits into jwadow:main from kilhyeonjun:feat/thinking-budget-passthrough

Conversation

@kilhyeonjun
Contributor

Summary

Fixes #111. The fake reasoning feature now respects the client's `thinking.budget_tokens` from the OpenAI-compatible request body, instead of always using the hardcoded `FAKE_REASONING_MAX_TOKENS` env var.

Changes

  • models_openai.py: Add optional thinking: Dict[str, Any] field to ChatCompletionRequest
  • converters_openai.py: Extract budget_tokens from request.thinking and pass it downstream
  • converters_core.py: Accept optional max_tokens param in inject_thinking_tags() and thinking_budget param in build_kiro_payload(), falling back to FAKE_REASONING_MAX_TOKENS when not provided
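The passthrough described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the helper names and signatures mirror the bullet points, but the exact function bodies in `converters_openai.py` and `converters_core.py` are assumptions.

```python
from typing import Any, Dict, Optional

# Env default in the real gateway; hardcoded here for the sketch.
FAKE_REASONING_MAX_TOKENS = 4000


def extract_thinking_budget(thinking: Optional[Dict[str, Any]]) -> Optional[int]:
    """Pull budget_tokens out of the request's optional `thinking` dict."""
    if not thinking:
        return None
    budget = thinking.get("budget_tokens")
    return int(budget) if budget is not None else None


def inject_thinking_tags(prompt: str, max_tokens: Optional[int] = None) -> str:
    """Use the client budget when given, else fall back to the env default."""
    effective = max_tokens if max_tokens is not None else FAKE_REASONING_MAX_TOKENS
    return f"<max_thinking_length>{effective}</max_thinking_length>\n{prompt}"
```

With this shape, a request carrying `thinking: {"budget_tokens": 16000}` produces a `<max_thinking_length>16000</max_thinking_length>` tag, while a request with no `thinking` field keeps the 4000 default.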

Behavior

| Scenario | Before | After |
| --- | --- | --- |
| Client sends `thinking.budget_tokens: 16000` | `<max_thinking_length>4000</max_thinking_length>` (env default) | `<max_thinking_length>16000</max_thinking_length>` |
| Client sends no `thinking` field | `<max_thinking_length>4000</max_thinking_length>` | `<max_thinking_length>4000</max_thinking_length>` (unchanged) |

Testing

Verified locally via Docker with debug logging enabled:

DEBUG - Client requested thinking budget: 10000
DEBUG - Injecting fake reasoning tags with max_tokens=10000
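For reference, a request body that would trigger the debug lines above might look like the sketch below. Only `thinking.budget_tokens` is the field this PR adds; the model name is a placeholder and the other keys are ordinary OpenAI chat-completion fields, not taken from the PR.

```python
import json

# Hypothetical OpenAI-compatible request from a client such as OpenCode.
body = {
    "model": "example-model",  # placeholder, not a real model id
    "messages": [{"role": "user", "content": "Explain quicksort briefly."}],
    "max_tokens": 1024,
    # The new optional field: per-request reasoning budget.
    "thinking": {"budget_tokens": 10000},
}
payload = json.dumps(body)
```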

…asoning tags

The gateway currently hardcodes FAKE_REASONING_MAX_TOKENS for the
<max_thinking_length> XML tag, ignoring the client-provided
thinking.budget_tokens from the OpenAI-compatible request body.

This means clients (e.g. OpenCode IDE) that send a thinking budget
have no way to control reasoning depth per-request — every request
gets the same static value from the env var.

Changes:
- models_openai.py: Add optional `thinking` field to ChatCompletionRequest
- converters_openai.py: Extract budget_tokens from request.thinking dict
- converters_core.py: Accept optional max_tokens param in inject_thinking_tags()
  and thinking_budget param in build_kiro_payload(), falling back to
  FAKE_REASONING_MAX_TOKENS when not provided

Constraint: FAKE_REASONING_MAX_TOKENS remains the default when no client budget is sent
Confidence: high
Scope-risk: narrow
@cla-bot added the `cla-signed` label (Contributor License Agreement has been signed) on Mar 24, 2026
…ustion

When clients send large thinking.budget_tokens (e.g. 32768 for effort=max),
the fake reasoning XML tag tells the model it can think for up to 32K tokens.
Unlike native thinking where thinking tokens are separately allocated, fake
reasoning shares the output token pool — the model spends all output tokens
on reasoning and never produces actual content.

Add FAKE_REASONING_BUDGET_CAP env var (default: 10000) that limits how high
client budget_tokens can push the fake reasoning max_thinking_length tag.
Set to 0 to disable capping.
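The capping rule can be sketched as a one-liner; this is an illustration of the described behavior, not the commit's actual code, and the helper name is an assumption.

```python
# Env default per the commit message; 0 disables capping.
FAKE_REASONING_BUDGET_CAP = 10000


def cap_client_budget(client_budget: int, cap: int = FAKE_REASONING_BUDGET_CAP) -> int:
    """Clamp a client-supplied budget_tokens; the env default itself is never capped."""
    if cap <= 0:
        return client_budget
    return min(client_budget, cap)
```

So a client sending 32768 (effort=max) gets the tag clamped to 10000, a client sending 8000 passes through unchanged, and setting the cap to 0 restores the old passthrough behavior.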

Constraint: FAKE_REASONING_MAX_TOKENS (env default) is never capped — only client overrides
Confidence: high
Scope-risk: narrow
Development

Successfully merging this pull request may close these issues.

Fake reasoning ignores client thinking.budget_tokens — always uses hardcoded FAKE_REASONING_MAX_TOKENS