feat(thinking): pass client thinking.budget_tokens through to fake reasoning tags #112
Open

kilhyeonjun wants to merge 2 commits into jwadow:main from …
Conversation
feat(thinking): pass client thinking.budget_tokens through to fake reasoning tags

The gateway currently hardcodes FAKE_REASONING_MAX_TOKENS for the `<max_thinking_length>` XML tag, ignoring the client-provided thinking.budget_tokens from the OpenAI-compatible request body. This means clients (e.g. the OpenCode IDE) that send a thinking budget have no way to control reasoning depth per request: every request gets the same static value from the env var.

Changes:
- models_openai.py: add an optional `thinking` field to ChatCompletionRequest
- converters_openai.py: extract budget_tokens from the request.thinking dict
- converters_core.py: accept an optional max_tokens param in inject_thinking_tags() and a thinking_budget param in build_kiro_payload(), falling back to FAKE_REASONING_MAX_TOKENS when not provided

Constraint: FAKE_REASONING_MAX_TOKENS remains the default when no client budget is sent.
Confidence: high. Scope-risk: narrow.
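The fallback described in this commit could look roughly like the sketch below. This is a minimal illustration, not the gateway's actual code: the real signature in converters_core.py may differ, and the 4000 default is assumed for illustration.

```python
import os
from typing import Optional

# Env default used when the client sends no thinking budget.
# The "4000" fallback here is illustrative, not the project's actual default.
FAKE_REASONING_MAX_TOKENS = int(os.getenv("FAKE_REASONING_MAX_TOKENS", "4000"))


def inject_thinking_tags(prompt: str, max_tokens: Optional[int] = None) -> str:
    """Append the fake-reasoning tag, preferring the client's budget.

    Falls back to FAKE_REASONING_MAX_TOKENS when no budget was sent,
    which preserves the pre-PR behavior for clients that omit `thinking`.
    """
    budget = max_tokens if max_tokens is not None else FAKE_REASONING_MAX_TOKENS
    return f"{prompt}\n<max_thinking_length>{budget}</max_thinking_length>"
```

With a client budget of 16000, the tag becomes `<max_thinking_length>16000</max_thinking_length>`; with no budget, the env default is used.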
…ustion

When clients send a large thinking.budget_tokens (e.g. 32768 for effort=max), the fake-reasoning XML tag tells the model it can think for up to 32K tokens. Unlike native thinking, where thinking tokens are allocated separately, fake reasoning shares the output token pool: the model spends all of its output tokens on reasoning and never produces actual content.

Add a FAKE_REASONING_BUDGET_CAP env var (default: 10000) that limits how high client budget_tokens can push the fake-reasoning max_thinking_length tag. Set it to 0 to disable capping.

Constraint: FAKE_REASONING_MAX_TOKENS (the env default) is never capped; only client overrides are.
Confidence: high. Scope-risk: narrow.
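The capping rule described above can be sketched as a standalone helper. The helper name and its placement are assumptions for illustration; the commit wires this logic into the gateway's converters.

```python
import os

# Cap for client-supplied budgets; 0 disables capping (per the commit's spec).
FAKE_REASONING_BUDGET_CAP = int(os.getenv("FAKE_REASONING_BUDGET_CAP", "10000"))


def cap_client_budget(budget_tokens: int) -> int:
    """Limit a client-provided thinking budget to FAKE_REASONING_BUDGET_CAP.

    Only client overrides pass through here; the env default
    FAKE_REASONING_MAX_TOKENS is used as-is when no budget is sent,
    so it is never capped.
    """
    if FAKE_REASONING_BUDGET_CAP <= 0:
        return budget_tokens  # capping disabled
    return min(budget_tokens, FAKE_REASONING_BUDGET_CAP)
```

Under the default cap, a 32768-token client budget is clamped to 10000, while budgets at or below the cap pass through unchanged.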
Summary
Fixes #111. The fake reasoning feature now respects the client's thinking.budget_tokens from the OpenAI-compatible request body, instead of always using the hardcoded FAKE_REASONING_MAX_TOKENS env var.

Changes
- models_openai.py: Add an optional `thinking: Dict[str, Any]` field to `ChatCompletionRequest`
- converters_openai.py: Extract `budget_tokens` from `request.thinking` and pass it downstream
- converters_core.py: Accept an optional `max_tokens` param in `inject_thinking_tags()` and a `thinking_budget` param in `build_kiro_payload()`, falling back to `FAKE_REASONING_MAX_TOKENS` when not provided

Behavior
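The extraction step in converters_openai.py might look like the sketch below. The helper name `extract_budget_tokens` is hypothetical; the PR only states that `budget_tokens` is pulled from the optional `thinking` dict and passed downstream.

```python
from typing import Any, Dict, Optional


def extract_budget_tokens(thinking: Optional[Dict[str, Any]]) -> Optional[int]:
    """Pull budget_tokens out of the optional `thinking` dict, if present.

    Returns None when the client sent no thinking field or no budget,
    letting downstream code fall back to FAKE_REASONING_MAX_TOKENS.
    """
    if not thinking:
        return None
    value = thinking.get("budget_tokens")
    return int(value) if value is not None else None
```

Returning `None` rather than a default keeps the fallback decision in one place (the converter that builds the payload).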
| Client request | Before | After |
| --- | --- | --- |
| `thinking.budget_tokens: 16000` | `<max_thinking_length>4000</max_thinking_length>` (env default) | `<max_thinking_length>16000</max_thinking_length>` |
| no `thinking` field | `<max_thinking_length>4000</max_thinking_length>` | `<max_thinking_length>4000</max_thinking_length>` (unchanged) |

Testing
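For reference, a client request exercising the budget override might look like this. The model id is a placeholder and the `thinking` field shape follows the extension described in this PR.

```python
# Illustrative OpenAI-compatible request body; the model id is a placeholder.
request_body = {
    "model": "claude-sonnet-4",  # placeholder, not a confirmed gateway model id
    "messages": [{"role": "user", "content": "Explain quicksort."}],
    # Client-provided thinking budget; per this PR the gateway maps it to
    # <max_thinking_length>16000</max_thinking_length> in the fake reasoning tag.
    "thinking": {"type": "enabled", "budget_tokens": 16000},
}
```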
Verified locally via Docker with debug logging enabled: