# chore: introduce write queue for inference_store #3383
Merged
## Conversation
Force-pushed 3ab319c to 3cd55fb
ashwinb reviewed Sep 9, 2025
ashwinb reviewed Sep 9, 2025
Force-pushed b2eb8a0 to ae449e1
ashwinb reviewed Sep 9, 2025
ashwinb reviewed Sep 10, 2025
ashwinb approved these changes Sep 10, 2025
yeah seems reasonable.
# What does this PR do?

Adds a write worker queue for writes to the inference store. This avoids overwhelming request processing with slow inference writes.

## Test Plan

RPS from 21 -> 57.

Benchmark:

```
cd docs/source/distributions/k8s-benchmark

# start mock server
python openai-mock-server.py --port 8000

# start stack server
uv run --with llama-stack python -m llama_stack.core.server.server docs/source/distributions/k8s-benchmark/stack_run_config.yaml

# run benchmark script
uv run python3 benchmark.py --duration 120 --concurrent 50 --base-url=http://localhost:8321/v1/openai/v1 --model=vllm-inference/meta-llama/Llama-3.2-3B-Instruct
```

Before:

```
============================================================
BENCHMARK RESULTS

Response Time Statistics:
  Mean: 1.111s
  Median: 0.982s
  Min: 0.466s
  Max: 15.190s
  Std Dev: 1.091s

Percentiles:
  P50: 0.982s
  P90: 1.281s
  P95: 1.439s
  P99: 5.476s

Time to First Token (TTFT) Statistics:
  Mean: 0.474s
  Median: 0.347s
  Min: 0.175s
  Max: 15.129s
  Std Dev: 0.819s

TTFT Percentiles:
  P50: 0.347s
  P90: 0.661s
  P95: 0.762s
  P99: 2.788s

Streaming Statistics:
  Mean chunks per response: 67.2
  Total chunks received: 122154
============================================================
Total time: 120.00s
Concurrent users: 50
Total requests: 1919
Successful requests: 1819
Failed requests: 100
Success rate: 94.8%
Requests per second: 15.16

Errors (showing first 5):
  Request error:
  Request error:
  Request error:
  Request error:
  Request error:

Benchmark completed. Stopping server (PID: 679)...
Server stopped.
```

After:

```
============================================================
BENCHMARK RESULTS

Response Time Statistics:
  Mean: 1.085s
  Median: 1.089s
  Min: 0.451s
  Max: 2.002s
  Std Dev: 0.212s

Percentiles:
  P50: 1.089s
  P90: 1.343s
  P95: 1.409s
  P99: 1.617s

Time to First Token (TTFT) Statistics:
  Mean: 0.407s
  Median: 0.361s
  Min: 0.182s
  Max: 1.178s
  Std Dev: 0.175s

TTFT Percentiles:
  P50: 0.361s
  P90: 0.644s
  P95: 0.744s
  P99: 0.932s

Streaming Statistics:
  Mean chunks per response: 66.8
  Total chunks received: 367240
============================================================
Total time: 120.00s
Concurrent users: 50
Total requests: 5495
Successful requests: 5495
Failed requests: 0
Success rate: 100.0%
Requests per second: 45.79

Benchmark completed. Stopping server (PID: 97169)...
Server stopped.
```
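For a rough picture of the approach (this is a minimal sketch, not the PR's actual code), a write worker queue can be a bounded `asyncio.Queue` drained by background tasks, so request handlers enqueue records instead of awaiting slow database writes inline. The names `QueuedInferenceStore`, `enqueue_write`, and `write_chat_completion` are assumptions made up for this sketch:

```python
# Minimal sketch of a write worker queue; names are illustrative.
import asyncio


class QueuedInferenceStore:
    """Wraps a store so writes go through a bounded background queue."""

    def __init__(self, store, max_pending: int = 1000, num_writers: int = 4):
        self._store = store
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=max_pending)
        # Must be constructed inside a running event loop for create_task.
        self._workers = [
            asyncio.create_task(self._worker()) for _ in range(num_writers)
        ]

    async def enqueue_write(self, record) -> None:
        # Backpressure: once max_pending items are queued, producers wait
        # here instead of buffering unbounded writes in memory.
        await self._queue.put(record)

    async def _worker(self) -> None:
        while True:
            record = await self._queue.get()
            try:
                # Hypothetical write method on the underlying store.
                await self._store.write_chat_completion(record)
            finally:
                self._queue.task_done()

    async def flush(self) -> None:
        # Wait until all queued writes have landed (tests/shutdown).
        await self._queue.join()
```

Request handlers would call `enqueue_write(...)` and return immediately, which is what keeps slow writes from inflating response latency under load.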
mattf reviewed Sep 11, 2025
@ehhuang this is cool. did you consider enabling the write-ahead-log and using executemany?
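For reference, mattf's suggestion maps to two SQLite features: WAL mode lets readers proceed while a writer commits, and `executemany` batches many inserts into a single transaction. A minimal sketch of what that could look like, assuming a SQLite-backed store; the `chat_completions` table and its columns are made up for illustration:

```python
# Sketch of the suggested alternative: SQLite WAL + batched executemany.
import sqlite3

conn = sqlite3.connect("inference_store.db")
# WAL journal mode: concurrent readers are not blocked by the writer.
conn.execute("PRAGMA journal_mode=WAL")


def flush_batch(rows: list[tuple[str, str]]) -> None:
    # One transaction per batch amortizes commit/fsync cost across rows.
    with conn:
        conn.executemany(
            "INSERT INTO chat_completions (id, payload) VALUES (?, ?)",
            rows,
        )
```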