[Serve][LLM] Support concurrent streaming in SGLang completions() endpoint by Truc54 · Pull Request #64094 · ray-project/ray

Truc54 · 2026-06-15T03:21:39Z

Description

This PR addresses issue #63901 by refactoring the streaming path in SGLangServer.completions() so that the prompts of a multi-prompt request are generated concurrently instead of sequentially.

Why is this needed?

In the current implementation of the SGLang server engine, a completions request with multiple prompts and stream=True runs a sequential for loop over the prompts, so prompt i must stream to completion before prompt i+1 starts generating. A multi-prompt streaming request therefore takes about as long as all of its prompts generated back to back, which limits throughput and increases request latency under concurrent load.

Solution:

Refactored the request.stream branch in SGLangServer.completions() to run _stream_generate concurrently for all prompts, with one asyncio.create_task() per prompt.
Each task pushes its text-delta chunks into a single shared asyncio.Queue.
The main async generator drains the queue and yields SSE chunks as they arrive, so output from different prompts interleaves naturally.
Exceptions raised in workers propagate to the main generator.
A finally block cancels outstanding tasks if the generator is closed or an error occurs, preventing task leaks.
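
The steps above can be sketched as a small fan-in pattern. This is a minimal illustration under stated assumptions, not the code from this PR: fake_stream is a hypothetical stand-in for _stream_generate, stream_all and _DONE are made-up names, and real SSE formatting is omitted.

```python
import asyncio
from typing import AsyncIterator, List, Tuple

_DONE = object()  # hypothetical per-producer completion marker


async def fake_stream(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for _stream_generate: one delta per word."""
    for word in prompt.split():
        await asyncio.sleep(0)  # give other tasks a chance to run
        yield word


async def stream_all(prompts: List[str]) -> AsyncIterator[Tuple[int, str]]:
    """Yield (prompt_index, delta) pairs from all prompts as they arrive."""
    queue: asyncio.Queue = asyncio.Queue()

    async def produce(index: int, prompt: str) -> None:
        try:
            async for delta in fake_stream(prompt):
                await queue.put((index, delta))
        except Exception as exc:
            await queue.put(exc)  # forward the failure to the consumer
        else:
            await queue.put(_DONE)  # only a clean finish is reported as done

    tasks = [asyncio.create_task(produce(i, p)) for i, p in enumerate(prompts)]
    finished = 0
    try:
        while finished < len(tasks):
            item = await queue.get()
            if item is _DONE:
                finished += 1
            elif isinstance(item, Exception):
                raise item
            else:
                yield item
    finally:
        # Runs on normal exit, on error, and when the generator is closed early.
        for task in tasks:
            if not task.done():
                task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
```

Because every producer writes to the same FIFO queue, the order of deltas within one prompt is preserved, while deltas from different prompts may interleave in any order.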

Related issues

Closes #63901

Additional information

Implementation details:

Concurrency (asyncio.Queue): Parallelized completion streams using an asynchronous producer-consumer queue pattern.
Robustness & Cleanup: Active tasks are tracked and canceled in case of early client disconnection or errors.
Tests: Added a CPU-only mock unit test suite in python/ray/llm/tests/serve/cpu/deployments/test_sglang_server.py. It uses a meta-path importer hook to bypass native compilation dependencies.

Both tests pass:

test_concurrent_streaming_completions: verifies that chunks are properly interleaved and delta-decoded.
test_concurrent_streaming_completions_exception_handling: verifies exception propagation and task cancellation.
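
For readers unfamiliar with the meta-path importer hook mentioned under Tests, here is a rough sketch of the general technique. It is not the hook from this PR: StubFinder and the package name fake_native_pkg are hypothetical, and the real test file may stub modules differently.

```python
import importlib.abc
import importlib.machinery
import sys
from unittest import mock


class StubFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Serve mock-backed modules for one top-level package and its submodules."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix

    def find_spec(self, fullname, path=None, target=None):
        if fullname == self.prefix or fullname.startswith(self.prefix + "."):
            # is_package=True gives the stub a __path__, so submodule imports
            # are routed back through this finder as well.
            return importlib.machinery.ModuleSpec(fullname, self, is_package=True)
        return None  # everything else uses the normal import machinery

    def create_module(self, spec):
        return None  # fall back to default module creation

    def exec_module(self, module):
        def stub_getattr(name: str):
            if name.startswith("__"):
                raise AttributeError(name)  # keep dunder lookups honest
            value = mock.MagicMock(name=f"{module.__name__}.{name}")
            setattr(module, name, value)  # cache so repeated lookups agree
            return value

        # Module-level __getattr__ (PEP 562) handles missing attributes.
        module.__getattr__ = stub_getattr


# Install the hook before importing code that depends on the stubbed package.
sys.meta_path.insert(0, StubFinder("fake_native_pkg"))

import fake_native_pkg.engine as engine  # resolved by StubFinder, no real package needed

fake_engine = engine.Engine(model_path="unused")  # a MagicMock call, never native code
```

The hook has to be installed before the module under test is imported, which is why such stubs usually sit at the top of the test file or in a conftest.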

…point

Refactors SGLang completions() endpoint under request.stream=True to run prompt generations concurrently instead of sequentially:
- Uses asyncio.create_task to run _stream_generate in parallel for each prompt in the request.
- Uses an asyncio.Queue to collect streaming chunks from tasks and yields them as they arrive.
- Implements proper error propagation and cancellation of all active producer tasks inside a finally block to prevent resource leaks.

Closes ray-project#63901.

Signed-off-by: Truc54 <trungtruc5405@gmail.com>

gemini-code-assist

Code Review

This pull request refactors the SGLang engine's streaming completions to process multiple prompts concurrently using an asyncio queue and background producer tasks, rather than sequentially. It also adds comprehensive unit tests to verify concurrent streaming and exception handling. The review feedback suggests awaiting the cancelled background tasks in the finally block to prevent potential resource leaks or unhandled task exception warnings.


gemini-code-assist · 2026-06-15T03:22:26Z

+            finally:
+                for task in tasks:
+                    if not task.done():
+                        task.cancel()


When cancelling background tasks in the finally block, it is important to await them (e.g., using asyncio.gather with return_exceptions=True). Otherwise, the tasks might continue running their cleanup/finally blocks in the background after the generator has returned, which can lead to resource leaks, race conditions, or "Task exception was never retrieved" warnings if any task failed.

Suggested change

-            finally:
-                for task in tasks:
-                    if not task.done():
-                        task.cancel()
+            finally:
+                for task in tasks:
+                    if not task.done():
+                        task.cancel()
+                if tasks:
+                    await asyncio.gather(*tasks, return_exceptions=True)
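
To illustrate the reviewer's point, the following sketch shows that awaiting cancelled tasks lets their cleanup finish before the caller moves on. It is an assumption-laden toy: worker, cancel_all, and cleanup_log are hypothetical names, not part of the PR.

```python
import asyncio

cleanup_log = []  # records which workers finished their cleanup


async def worker(name: str) -> None:
    try:
        await asyncio.sleep(3600)  # pretend to stream for a long time
    finally:
        await asyncio.sleep(0)  # cleanup that needs one more event-loop turn
        cleanup_log.append(name)


async def cancel_all(tasks):
    """Cancel unfinished tasks, then wait until each has actually stopped."""
    for task in tasks:
        if not task.done():
            task.cancel()
    if tasks:
        # return_exceptions=True collects CancelledError (and any failure)
        # as results instead of raising the first one here.
        return await asyncio.gather(*tasks, return_exceptions=True)
    return []


async def main():
    tasks = [asyncio.create_task(worker(f"w{i}")) for i in range(3)]
    await asyncio.sleep(0)  # let every worker reach its first await
    return await cancel_all(tasks)
```

Without the gather call, cancel_all would return while the workers are still inside their finally blocks, so cleanup_log could still be empty at that point.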

…s() finally block

Awaits cancelled background producer tasks using asyncio.gather with return_exceptions=True in the finally block of the SGLang completions() streaming generator. This avoids potential resource leaks, race conditions, and "Task exception was never retrieved" warnings if any background task failed.

Signed-off-by: Truc54 <trungtruc5405@gmail.com>

Fix flakiness in test_concurrent_streaming_completions by replacing time-based asyncio.sleep delays with explicit asyncio.Event synchronization objects. CI environments with high CPU load can suffer scheduling delays, causing asynchronous generators to yield out-of-order and fail text assertions. Using events guarantees deterministic interleaving of chunks.

Signed-off-by: Truc54 <trungtruc5405@gmail.com>
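
The event-based approach described in this commit message can be sketched as follows. This is only an illustration of the idea, assuming two producers that hand a turn back and forth; producer, run_interleaved, and the chunk values are hypothetical and not taken from the PR's test.

```python
import asyncio
from typing import List


async def producer(name: str, chunks: List[str], my_turn: asyncio.Event,
                   other_turn: asyncio.Event, out: asyncio.Queue) -> None:
    """Emit one chunk per turn, then hand the turn to the other producer."""
    for chunk in chunks:
        await my_turn.wait()  # block until it is this producer's turn
        my_turn.clear()
        await out.put(f"{name}:{chunk}")
        other_turn.set()  # explicit hand-off, no timing assumptions


async def run_interleaved() -> List[str]:
    out: asyncio.Queue = asyncio.Queue()
    turn_a, turn_b = asyncio.Event(), asyncio.Event()
    turn_a.set()  # producer A goes first
    await asyncio.gather(
        producer("A", ["1", "2"], turn_a, turn_b, out),
        producer("B", ["1", "2"], turn_b, turn_a, out),
    )
    return [out.get_nowait() for _ in range(out.qsize())]


if __name__ == "__main__":
    print(asyncio.run(run_interleaved()))  # ['A:1', 'B:1', 'A:2', 'B:2']
```

The order depends only on which event is set, not on how quickly the scheduler runs each task, so a loaded CI machine cannot reorder the chunks.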

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 5ac4f52.

cursor · 2026-06-16T01:00:16Z

+                    if isinstance(item, Exception):
+                        raise item
+                    elif item is None:
+                        completed_tasks += 1


None sentinel masks producer failures

Medium Severity

Each streaming producer always enqueues None in finally after the try/except, but only Exception subclasses are forwarded on the queue. Failures such as asyncio.CancelledError or KeyboardInterrupt bypass except Exception, yet still emit the completion sentinel, so the main loop can finish as if every prompt completed when a worker actually aborted.
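
The behavior described above can be reproduced in isolation. The sketch below contrasts the flagged pattern with one possible alternative; it is not the fix adopted in this PR, and slow_source, produce_masking, produce_strict, and drain_after_cancel are hypothetical names.

```python
import asyncio


async def slow_source():
    """Hypothetical stream that stalls after its first chunk."""
    yield "partial"
    await asyncio.sleep(3600)
    yield "never reached"


async def produce_masking(source, queue: asyncio.Queue) -> None:
    """Flagged pattern: the None sentinel is sent even after an abort."""
    try:
        async for chunk in source:
            await queue.put(chunk)
    except Exception as exc:
        await queue.put(exc)
    finally:
        await queue.put(None)  # also runs on CancelledError


async def produce_strict(source, queue: asyncio.Queue) -> None:
    """Alternative: only a clean finish sends None; aborts are reported."""
    try:
        async for chunk in source:
            await queue.put(chunk)
    except Exception as exc:
        await queue.put(exc)
    except BaseException as exc:  # CancelledError, KeyboardInterrupt, ...
        queue.put_nowait(RuntimeError(f"producer aborted: {type(exc).__name__}"))
        raise
    else:
        await queue.put(None)


async def drain_after_cancel(produce):
    """Cancel a producer mid-stream and return (first chunk, leftover items)."""
    queue: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(produce(slow_source(), queue))
    first = await queue.get()  # the producer is now parked inside slow_source
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    leftover = [queue.get_nowait() for _ in range(queue.qsize())]
    return first, leftover
```

With produce_masking the leftover queue holds a lone None, which a consumer would count as a completed prompt. With produce_strict it holds a RuntimeError, which an isinstance(item, Exception) check would raise.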


Truc54 requested a review from a team as a code owner June 15, 2026 03:21

gemini-code-assist Bot reviewed Jun 15, 2026


Truc54 force-pushed the fix/sglang-concurrent-streaming branch from 8edf7b0 to 5c625d8 Compare June 15, 2026 04:14

cursor Bot reviewed Jun 15, 2026


Comment thread python/ray/llm/tests/serve/cpu/deployments/test_sglang_server.py

ray-gardener Bot added serve Ray Serve Related Issue llm community-contribution Contributed by the community labels Jun 15, 2026

Truc54 force-pushed the fix/sglang-concurrent-streaming branch from 5c625d8 to c0a12ac Compare June 15, 2026 12:15

Truc54 force-pushed the fix/sglang-concurrent-streaming branch from c0a12ac to 5ac4f52 Compare June 16, 2026 00:57

cursor Bot reviewed Jun 16, 2026

Truc54 wants to merge 3 commits into ray-project:master from Truc54:fix/sglang-concurrent-streaming


1 participant
