[Serve][LLM] Support concurrent streaming in SGLang completions() endpoint #64094

Open
Truc54 wants to merge 3 commits into ray-project:master from Truc54:fix/sglang-concurrent-streaming

@Truc54 Truc54 commented Jun 15, 2026

Description

This PR addresses issue #63901 by refactoring the streaming path in SGLangServer.completions() so that generation for all prompts in a request runs concurrently instead of sequentially.

Why is this needed?

In the current implementation of the SGLang server engine, completing multiple prompts with stream=True runs a sequential for loop over the prompts. This means prompt i must stream to completion before prompt i+1 starts generating. This sequential bottleneck dramatically limits throughput and increases request latency under concurrent load.

Solution:

  • Refactored the request.stream branch in SGLangServer.completions() to execute _stream_generate concurrently for all prompts using asyncio.create_task().
  • Collected yielded text delta chunks from all concurrent tasks into a single shared asyncio.Queue.
  • Drained the queue in the main async generator and yielded SSE chunks as they arrive, enabling natural interleaving.
  • Added proper exception propagation from workers to the main generator.
  • Added a finally block to cancel outstanding tasks if the generator is closed or an error occurs, preventing task leaks.
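
The steps above can be sketched as a small fan-in generator. This is a minimal illustration of the pattern, not the code in this PR: fake_stream is a hypothetical stand-in for _stream_generate, and the names stream_all, produce, and collect are assumptions made for the example.

```python
import asyncio
from typing import AsyncIterator, List, Tuple


async def fake_stream(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for _stream_generate: yields one word at a time."""
    for word in prompt.split():
        await asyncio.sleep(0)  # give other producers a chance to run
        yield word


async def stream_all(prompts: List[str]) -> AsyncIterator[Tuple[int, str]]:
    """Run one producer task per prompt and yield (prompt_index, delta) as they arrive."""
    queue: asyncio.Queue = asyncio.Queue()

    async def produce(index: int, prompt: str) -> None:
        try:
            async for delta in fake_stream(prompt):
                await queue.put((index, delta))
            await queue.put(None)  # completion sentinel, sent only on success
        except Exception as exc:
            await queue.put(exc)  # forward the failure to the consumer

    tasks = [asyncio.create_task(produce(i, p)) for i, p in enumerate(prompts)]
    finished = 0
    try:
        while finished < len(tasks):
            item = await queue.get()
            if isinstance(item, Exception):
                raise item
            if item is None:
                finished += 1
                continue
            yield item
    finally:
        # Cancel whatever is still running, then wait for it to unwind.
        for task in tasks:
            if not task.done():
                task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


async def collect(prompts: List[str]) -> List[Tuple[int, str]]:
    return [item async for item in stream_all(prompts)]
```

Calling asyncio.run(collect(["a b c", "x y"])) returns every delta tagged with its prompt index. Deltas from different prompts may interleave, while the order within each prompt is preserved.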

Related issues

Closes #63901

Additional information

Implementation details:

  1. Concurrency (asyncio.Queue): Parallelized completion streams using an asynchronous producer-consumer queue pattern.
  2. Robustness & Cleanup: Active tasks are tracked and canceled in case of early client disconnection or errors.
  3. Tests: Added a new CPU-based mock unit test suite in python/ray/llm/tests/serve/cpu/deployments/test_sglang_server.py that utilizes a meta-path importer hook to bypass native compilation dependencies.
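
For readers unfamiliar with the meta-path importer technique mentioned in item 3, the following is a rough sketch of how such a hook can serve stub modules in place of packages that need native compilation. It is not the test suite's actual hook: StubFinder, install_stubs, and the package name hypothetical_native_pkg are all invented for illustration.

```python
import importlib
import importlib.machinery
import sys
from unittest import mock


class StubFinder:
    """Meta-path finder and loader that serves empty stub packages for given prefixes."""

    def __init__(self, prefixes):
        self._prefixes = tuple(prefixes)

    def find_spec(self, fullname, path=None, target=None):
        for prefix in self._prefixes:
            if fullname == prefix or fullname.startswith(prefix + "."):
                # is_package=True lets arbitrary submodules resolve through this finder.
                return importlib.machinery.ModuleSpec(fullname, self, is_package=True)
        return None  # defer to the regular import machinery

    def create_module(self, spec):
        return None  # use the default module object

    def exec_module(self, module):
        def _stub_getattr(name):
            # Leave dunder lookups alone so the import system behaves normally.
            if name.startswith("__"):
                raise AttributeError(name)
            return mock.MagicMock(name=f"{module.__name__}.{name}")

        # PEP 562: a module-level __getattr__ handles missing attributes.
        module.__getattr__ = _stub_getattr


def install_stubs(prefixes):
    """Register the finder ahead of the default ones and return it for later removal."""
    finder = StubFinder(prefixes)
    sys.meta_path.insert(0, finder)
    return finder
```

After install_stubs(["hypothetical_native_pkg"]), importing hypothetical_native_pkg.runtime succeeds without the real package, and any attribute read from it is a MagicMock. Removing the finder from sys.meta_path (and the stub entries from sys.modules) restores normal import behavior.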

Both tests pass successfully:

  • test_concurrent_streaming_completions (Verifies chunks are properly interleaved and delta-decoded).
  • test_concurrent_streaming_completions_exception_handling (Verifies exception propagation and task cancellation).

…point

Refactors SGLang completions() endpoint under request.stream=True to run prompt generations concurrently instead of sequentially:
- Uses asyncio.create_task to run _stream_generate in parallel for each prompt in the request.
- Uses an asyncio.Queue to collect streaming chunks from tasks and yields them as they arrive.
- Implements proper error propagation and cancellation of all active producer tasks inside a finally block to prevent resource leaks.

Closes ray-project#63901.

Signed-off-by: Truc54 <trungtruc5405@gmail.com>
@Truc54 Truc54 requested a review from a team as a code owner June 15, 2026 03:21

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request refactors the SGLang engine's streaming completions to process multiple prompts concurrently using an asyncio queue and background producer tasks, rather than sequentially. It also adds comprehensive unit tests to verify concurrent streaming and exception handling. The review feedback suggests awaiting the cancelled background tasks in the finally block to prevent potential resource leaks or unhandled task exception warnings.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +446 to +449
finally:
    for task in tasks:
        if not task.done():
            task.cancel()

Contributor

medium

When cancelling background tasks in the finally block, it is important to await them (e.g., using asyncio.gather with return_exceptions=True). Otherwise, the tasks might continue running their cleanup/finally blocks in the background after the generator has returned, which can lead to resource leaks, race conditions, or "Task exception was never retrieved" warnings if any task failed.

Suggested change
- finally:
-     for task in tasks:
-         if not task.done():
-             task.cancel()
+ finally:
+     for task in tasks:
+         if not task.done():
+             task.cancel()
+     if tasks:
+         await asyncio.gather(*tasks, return_exceptions=True)
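
To illustrate why awaiting matters, here is a small self-contained sketch in which each worker performs asynchronous cleanup when cancelled. The names worker, cancel_and_wait, and demo are hypothetical and exist only for this example.

```python
import asyncio


async def worker(log, name):
    try:
        await asyncio.sleep(3600)  # simulate a long-running stream
    finally:
        await asyncio.sleep(0)  # simulate asynchronous cleanup
        log.append(f"{name} cleaned up")


async def cancel_and_wait(tasks):
    """Cancel unfinished tasks, then wait until each one has fully unwound."""
    for task in tasks:
        if not task.done():
            task.cancel()
    if tasks:
        # return_exceptions=True collects CancelledError instead of re-raising it.
        return await asyncio.gather(*tasks, return_exceptions=True)
    return []


async def demo():
    log = []
    tasks = [asyncio.create_task(worker(log, n)) for n in ("a", "b")]
    await asyncio.sleep(0)  # let both workers start before cancelling
    results = await cancel_and_wait(tasks)
    return log, results
```

When demo() returns, both cleanup entries are already in the log, because gather only completes once every cancelled task has finished its finally block. Without the gather call, the caller could return while that cleanup is still pending.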

…s() finally block

Awaits cancelled background producer tasks using asyncio.gather with return_exceptions=True
in the finally block of the SGLang completions() streaming generator. This avoids potential
resource leaks, race conditions, and "Task exception was never retrieved" warnings if any
background task failed.

Signed-off-by: Truc54 <trungtruc5405@gmail.com>
@Truc54 Truc54 force-pushed the fix/sglang-concurrent-streaming branch from 8edf7b0 to 5c625d8 Compare June 15, 2026 04:14
Comment thread python/ray/llm/tests/serve/cpu/deployments/test_sglang_server.py
@ray-gardener ray-gardener Bot added the serve (Ray Serve Related Issue), llm, and community-contribution (Contributed by the community) labels Jun 15, 2026
@Truc54 Truc54 force-pushed the fix/sglang-concurrent-streaming branch from 5c625d8 to c0a12ac Compare June 15, 2026 12:15
Fix flakiness in test_concurrent_streaming_completions by replacing time-based
asyncio.sleep delays with explicit asyncio.Event synchronization objects.
CI environments with high CPU load can suffer scheduling delays, causing
asynchronous generators to yield out-of-order and fail text assertions.
Using events guarantees deterministic interleaving of chunks.

Signed-off-by: Truc54 <trungtruc5405@gmail.com>
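
A rough sketch of event-based synchronization of this kind, assuming two mock streams that hand a turn token back and forth. The helper names gated_stream and demo are made up for illustration and are not taken from the test file.

```python
import asyncio


async def gated_stream(chunks, my_turn, next_turn):
    """Yield each chunk only when it is this stream's turn, then pass the turn on."""
    for chunk in chunks:
        await my_turn.wait()
        my_turn.clear()
        yield chunk
        next_turn.set()  # runs when the consumer asks for the next chunk


async def demo():
    queue = asyncio.Queue()
    turn_a, turn_b = asyncio.Event(), asyncio.Event()

    async def produce(stream):
        async for chunk in stream:
            await queue.put(chunk)

    turn_a.set()  # stream A goes first
    await asyncio.gather(
        produce(gated_stream(["a1", "a2"], turn_a, turn_b)),
        produce(gated_stream(["b1", "b2"], turn_b, turn_a)),
    )
    return [queue.get_nowait() for _ in range(queue.qsize())]
```

Because each stream blocks until the other has handed over the turn, the chunk order does not depend on timers or scheduler load. This simple version assumes both streams have the same number of chunks; with uneven lengths the longer stream would wait forever for a turn that never comes.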
@Truc54 Truc54 force-pushed the fix/sglang-concurrent-streaming branch from c0a12ac to 5ac4f52 Compare June 16, 2026 00:57

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 5ac4f52. Configure here.

if isinstance(item, Exception):
    raise item
elif item is None:
    completed_tasks += 1

None sentinel masks producer failures

Medium Severity

Each streaming producer always enqueues None in finally after the try/except, but only Exception subclasses are forwarded on the queue. Failures such as asyncio.CancelledError or KeyboardInterrupt bypass except Exception, yet still emit the completion sentinel, so the main loop can finish as if every prompt completed when a worker actually aborted.


Labels

community-contribution (Contributed by the community), llm, serve (Ray Serve Related Issue)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Serve][LLM] SGLangServer: Fix Sequential Batch Processing in completions()

1 participant