# chore: introduce write queue for inference_store #3383
Merged
## Conversation
Force-pushed 3ab319c to 3cd55fb
ashwinb reviewed Sep 9, 2025
ashwinb reviewed Sep 9, 2025
Force-pushed b2eb8a0 to ae449e1
ashwinb reviewed Sep 9, 2025
ashwinb reviewed Sep 10, 2025
ashwinb approved these changes Sep 10, 2025
yeah seems reasonable.
# What does this PR do?

Adds a write worker queue for writes to the inference store. This avoids overwhelming request processing with slow inference writes.

## Test Plan

RPS from 21 -> 57.

Benchmark:

```
cd docs/source/distributions/k8s-benchmark

# start mock server
python openai-mock-server.py --port 8000

# start stack server
uv run --with llama-stack python -m llama_stack.core.server.server docs/source/distributions/k8s-benchmark/stack_run_config.yaml

# run benchmark script
uv run python3 benchmark.py --duration 120 --concurrent 50 --base-url=http://localhost:8321/v1/openai/v1 --model=vllm-inference/meta-llama/Llama-3.2-3B-Instruct
```

Before:

```
============================================================
BENCHMARK RESULTS

Response Time Statistics:
  Mean: 1.111s
  Median: 0.982s
  Min: 0.466s
  Max: 15.190s
  Std Dev: 1.091s

Percentiles:
  P50: 0.982s
  P90: 1.281s
  P95: 1.439s
  P99: 5.476s

Time to First Token (TTFT) Statistics:
  Mean: 0.474s
  Median: 0.347s
  Min: 0.175s
  Max: 15.129s
  Std Dev: 0.819s

TTFT Percentiles:
  P50: 0.347s
  P90: 0.661s
  P95: 0.762s
  P99: 2.788s

Streaming Statistics:
  Mean chunks per response: 67.2
  Total chunks received: 122154
============================================================
Total time: 120.00s
Concurrent users: 50
Total requests: 1919
Successful requests: 1819
Failed requests: 100
Success rate: 94.8%
Requests per second: 15.16

Errors (showing first 5):
  Request error:
  Request error:
  Request error:
  Request error:
  Request error:

Benchmark completed. Stopping server (PID: 679)...
Server stopped.
```

After:

```
============================================================
BENCHMARK RESULTS

Response Time Statistics:
  Mean: 1.085s
  Median: 1.089s
  Min: 0.451s
  Max: 2.002s
  Std Dev: 0.212s

Percentiles:
  P50: 1.089s
  P90: 1.343s
  P95: 1.409s
  P99: 1.617s

Time to First Token (TTFT) Statistics:
  Mean: 0.407s
  Median: 0.361s
  Min: 0.182s
  Max: 1.178s
  Std Dev: 0.175s

TTFT Percentiles:
  P50: 0.361s
  P90: 0.644s
  P95: 0.744s
  P99: 0.932s

Streaming Statistics:
  Mean chunks per response: 66.8
  Total chunks received: 367240
============================================================
Total time: 120.00s
Concurrent users: 50
Total requests: 5495
Successful requests: 5495
Failed requests: 0
Success rate: 100.0%
Requests per second: 45.79

Benchmark completed. Stopping server (PID: 97169)...
Server stopped.
```
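For a rough picture of the approach (this is a minimal sketch, not the PR's actual code), a write worker queue can be a bounded `asyncio.Queue` drained by background tasks, so request handlers enqueue records instead of awaiting slow database writes inline. The names `QueuedInferenceStore`, `enqueue_write`, and `write_chat_completion` are assumptions made up for this sketch:

```python
# Minimal sketch of a write worker queue; names are illustrative.
import asyncio


class QueuedInferenceStore:
    """Wraps a store so writes go through a bounded background queue."""

    def __init__(self, store, max_pending: int = 1000, num_writers: int = 4):
        self._store = store
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=max_pending)
        # Must be constructed inside a running event loop for create_task.
        self._workers = [
            asyncio.create_task(self._worker()) for _ in range(num_writers)
        ]

    async def enqueue_write(self, record) -> None:
        # Backpressure: once max_pending items are queued, producers wait
        # here instead of buffering unbounded writes in memory.
        await self._queue.put(record)

    async def _worker(self) -> None:
        while True:
            record = await self._queue.get()
            try:
                # Hypothetical write method on the underlying store.
                await self._store.write_chat_completion(record)
            finally:
                self._queue.task_done()

    async def flush(self) -> None:
        # Wait until all queued writes have landed (tests/shutdown).
        await self._queue.join()
```

Request handlers would call `enqueue_write(...)` and return immediately, which is what keeps slow writes from inflating response latency under load.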
mattf reviewed Sep 11, 2025
@ehhuang this is cool. did you consider enabling the write-ahead-log and using executemany?
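For reference, mattf's suggestion maps to two SQLite features: WAL mode lets readers proceed while a writer commits, and `executemany` batches many inserts into a single transaction. A minimal sketch of what that could look like, assuming a SQLite-backed store; the `chat_completions` table and its columns are made up for illustration:

```python
# Sketch of the suggested alternative: SQLite WAL + batched executemany.
import sqlite3

conn = sqlite3.connect("inference_store.db")
# WAL journal mode: concurrent readers are not blocked by the writer.
conn.execute("PRAGMA journal_mode=WAL")


def flush_batch(rows: list[tuple[str, str]]) -> None:
    # One transaction per batch amortizes commit/fsync cost across rows.
    with conn:
        conn.executemany(
            "INSERT INTO chat_completions (id, payload) VALUES (?, ?)",
            rows,
        )
```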