
fix(streaming): wire first-token retry and return correct finish_reason on truncation #114

Open
kilhyeonjun wants to merge 2 commits into jwadow:main from kilhyeonjun:feat/streaming-reliability

Conversation

@kilhyeonjun
Contributor

Summary

Fixes #113. Two streaming reliability fixes:

  1. Wire stream_with_first_token_retry into the OpenAI streaming path. The retry mechanism was implemented but never connected. routes_openai.py now uses it with a closure that reuses the initial validated response on the first attempt and issues new requests only on retries.

  2. Return the correct finish_reason/stop_reason on truncation. When the Kiro API truncates a response mid-stream (no completion signals), the gateway now returns finish_reason: "length" (OpenAI) / stop_reason: "max_tokens" (Anthropic) instead of "stop" / "end_turn".
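The closure pattern described in point 1 can be sketched as below. This is an illustrative model only: the names `stream_with_first_token_retry` and `FIRST_TOKEN_MAX_RETRIES` come from the PR, but the function bodies, the factory helper, and the retry-count value are assumptions, not the gateway's actual code.

```python
FIRST_TOKEN_MAX_RETRIES = 3  # name from the PR; the value here is assumed


def stream_with_first_token_retry(make_request, max_retries=FIRST_TOKEN_MAX_RETRIES):
    """Yield tokens from make_request(attempt), retrying while the
    upstream stream produces no first token at all (sketch)."""
    for attempt in range(max_retries + 1):
        stream = iter(make_request(attempt))
        try:
            first = next(stream)  # wait for the first token
        except StopIteration:
            continue  # no first token: retry with a fresh upstream request
        yield first
        yield from stream
        return
    raise RuntimeError("no first token after retries")


def make_retry_request_factory(initial_response, send_new_request):
    """Build the closure: reuse the already-validated 200 response on the
    first attempt; re-issue the upstream request only on retries."""
    def make_retry_request(attempt):
        if attempt == 0:
            return initial_response  # avoid double-requesting
        return send_new_request()
    return make_retry_request
```

The key design point is that attempt 0 never re-contacts the upstream API: the response that was already validated to be a 200 is consumed directly, and a fresh request happens only when that stream yields nothing.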

Changes

  • routes_openai.py: Replace stream_kiro_to_openai with stream_with_first_token_retry in the streaming handler, using a make_retry_request closure that reuses the initial 200 response on the first attempt.
  • streaming_openai.py: When content_was_truncated is true, set finish_reason = "length" instead of "stop".
  • streaming_anthropic.py: When content_was_truncated is true, set stop_reason = "max_tokens" instead of "end_turn".
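The last two bullets amount to a small mapping from the truncation flag to the reported reason. The helper below is purely illustrative; the PR sets these values inline in each streaming module rather than through a shared function.

```python
def completion_reason(content_was_truncated: bool, api: str) -> str:
    """Map the truncation flag to the reason string each API reports
    (hypothetical helper; the PR assigns these values inline)."""
    if api == "openai":
        return "length" if content_was_truncated else "stop"
    if api == "anthropic":
        return "max_tokens" if content_was_truncated else "end_turn"
    raise ValueError(f"unknown api: {api!r}")
```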

Behavior

| Scenario | Before | After |
| --- | --- | --- |
| First token timeout (OpenAI) | Immediate failure, no retry | Up to 3 retries (FIRST_TOKEN_MAX_RETRIES) |
| Kiro API truncation (OpenAI) | finish_reason: "stop" | finish_reason: "length" |
| Kiro API truncation (Anthropic) | stop_reason: "end_turn" | stop_reason: "max_tokens" |
| Normal completion | finish_reason: "stop" | finish_reason: "stop" (unchanged) |

Testing

Verified locally via Docker:

  • Streaming responses complete normally with finish_reason: "stop"
  • First-token retry mechanism is wired (confirmed via log path routes_openai:stream_wrapper)
  • Truncation detection logic unchanged — only the reported finish_reason differs
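A local check like the first bullet can be scripted by scanning the SSE stream for the final finish_reason. The snippet below is a sketch of that verification, not part of the PR; the chunk shape follows the OpenAI streaming format, and how you obtain the SSE lines (e.g. which host/port the gateway runs on) is up to your setup.

```python
import json


def last_finish_reason(sse_lines):
    """Scan 'data:' lines from an OpenAI-style chat completion stream
    and return the final non-null finish_reason, if any."""
    reason = None
    for line in sse_lines:
        if not line.startswith("data: ") or line.strip() == "data: [DONE]":
            continue
        chunk = json.loads(line[len("data: "):])
        fr = chunk["choices"][0].get("finish_reason")
        if fr is not None:
            reason = fr
    return reason
```

A normally completed stream should end with "stop"; after this PR, a truncated one should end with "length" instead.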

…on on truncation

Two issues fixed:

1. stream_with_first_token_retry was implemented but never wired into
   the OpenAI streaming path. routes_openai.py called stream_kiro_to_openai
   directly, bypassing retry logic entirely. Now uses
   stream_with_first_token_retry with a make_request closure that reuses
   the initial validated response on first attempt.

2. When Kiro API truncates a response mid-stream (no usage/context_usage
   completion signals), the gateway returned finish_reason="stop" (OpenAI)
   / stop_reason="end_turn" (Anthropic), making clients believe the
   response completed normally. Now returns "length" / "max_tokens"
   respectively, so clients can detect truncation and request continuation.

Constraint: First attempt reuses the already-validated 200 response to avoid double-requesting
Confidence: high
Scope-risk: moderate
cla-bot added the cla-signed label (Contributor License Agreement has been signed) on Mar 25, 2026
Tests that mock parse_kiro_stream without usage/context_usage events
now trigger truncation detection, causing finish_reason="length" and
stop_reason="max_tokens" instead of the expected "stop"/"end_turn".

Add usage event (OpenAI) and context_usage event (Anthropic) to mock
streams so they represent normal completions, not truncated ones.
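The interaction between the mocks and truncation detection can be pictured with a minimal heuristic. Event shapes below are assumptions for illustration, not the actual Kiro wire format.

```python
def looks_truncated(events):
    """Per the PR description: a stream that ends without a
    usage/context_usage completion signal is treated as truncated."""
    return not any(e.get("type") in ("usage", "context_usage") for e in events)


# A mock with no completion signal now trips truncation detection...
bare_mock = [{"type": "content", "text": "Hello"}]

# ...so the fixed mocks append the appropriate completion event,
# representing a normal (non-truncated) completion.
openai_mock = bare_mock + [{"type": "usage"}]
anthropic_mock = bare_mock + [{"type": "context_usage"}]
```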

Confidence: high
Scope-risk: narrow


Development

Successfully merging this pull request may close these issues.

Streaming: first-token retry not wired in OpenAI path; truncation returns wrong finish_reason
