#### blog/2025-05-20_announce.md
Kubernetes typically scales out application workloads with uniform replicas and round-robin load balancing.

This simple pattern is very effective for most request patterns, which have the following characteristics:

The LLM inference workload, however, is unique: requests are slow, non-uniform, and expensive. As a result, typical scale-out and load-balancing patterns fall short of optimal performance.
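To make the imbalance concrete, here is a minimal simulation sketch (the request mix and durations are invented for illustration and are not llm-d code): uniform round-robin placement of non-uniform requests quickly leaves some replicas far busier than others.

```python
import random

# Hypothetical request durations in seconds: short chat turns mixed with
# long RAG prefills and long reasoning generations (assumed values).
WORKLOADS = [("chat", 1), ("rag", 8), ("reasoning", 20)]

def round_robin_load(num_replicas=4, num_requests=200, seed=0):
    rng = random.Random(seed)
    busy = [0.0] * num_replicas              # accumulated busy time per replica
    for i in range(num_requests):
        _, duration = rng.choice(WORKLOADS)
        busy[i % num_replicas] += duration   # round-robin ignores request cost
    return busy

load = round_robin_load()
print("busy seconds per replica:", [round(b) for b in load])
print(f"max/min imbalance: {max(load) / min(load):.2f}x")
```

With uniform request durations the ratio stays at 1.00x; with a mixed workload it drifts well above that, and in a live system the overloaded replicas also accumulate queueing delay, which is the feedback loop described below.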

Let's take a look at each one step-by-step:
* RAG has long inputs - prompt and retrieved docs - and short generated outputs
* Reasoning has short or medium inputs and long generated outputs
* These differences in request times can lead to significant imbalances across instances, which are compounded as loaded instances get overwhelmed. Overloads lead to longer ITL (Inter-Token Latency), which leads to more load, which leads to more ITL.
* Agentic workflows (tool calls form an iterative request flow)
* Code completion tasks (requests reuse the current codebase as context)
* LLM inference servers like vLLM implement a method called "automatic prefix caching", which enables "skipping" a significant amount of prefill computation when there is a cache hit. If requests are routed to vLLM replicas that have the data in the cache, we skip computation. Increasing the likelihood of prefix cache hits with a larger cache size can dramatically improve tail latencies.
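The scheduling implication can be sketched in a few lines. This is a simplified illustration rather than the actual llm-d or vLLM routing code; the block size, hashing scheme, and per-replica index are assumptions made for the example:

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative; the real block size is configurable)

def prefix_block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block of the prompt, chaining hashes so a block's identity
    depends on everything before it (prefix semantics)."""
    hashes, running = [], b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        running = hashlib.sha256(running + str(token_ids[start:start + block_size]).encode()).digest()
        hashes.append(running)
    return hashes

def pick_replica(token_ids, replica_cached_blocks):
    """replica_cached_blocks: dict of replica name -> set of block hashes believed cached."""
    prompt_blocks = prefix_block_hashes(token_ids)

    def cached_prefix_len(cached):
        n = 0
        for h in prompt_blocks:
            if h not in cached:
                break
            n += 1
        return n

    # Prefer the replica expected to skip the most prefill work.
    return max(replica_cached_blocks, key=lambda r: cached_prefix_len(replica_cached_blocks[r]))
```

A real scheduler would blend this prefix-affinity signal with load signals rather than follow it unconditionally; otherwise hot prefixes would pile onto a single replica.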
*C. Specializing and coordinating replicas to process a single request can lead to more throughput per GPU.*
* DeepSeek released a [discussion of the design of their inference system](https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md), which leverages aggressive disaggregation to achieve remarkable performance at scale.
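A toy latency model helps show why this specialization pays off (the constants below are made-up, and this is a schematic of the idea, not DeepSeek's or llm-d's implementation): when prefill and decode share a replica, a long prefill can stall every in-flight decode step behind it, whereas dedicated decode replicas keep inter-token latency near a single decode step.

```python
# Assumed costs, for illustration only.
PREFILL_MS_PER_1K_TOKENS = 50.0   # prefill is compute-bound and scales with prompt length
DECODE_STEP_MS = 15.0             # one batched decode step is short but latency-critical

def worst_case_itl_colocated(prompt_tokens: int) -> float:
    """ITL for a decode step that gets queued behind a newly arrived prefill."""
    return prompt_tokens / 1000 * PREFILL_MS_PER_1K_TOKENS + DECODE_STEP_MS

def worst_case_itl_disaggregated() -> float:
    """Decode replicas never run prefill, so ITL stays near one decode step."""
    return DECODE_STEP_MS

print(f"colocated, behind a 32k-token prefill: {worst_case_itl_colocated(32_000):.0f} ms")
print(f"disaggregated:                         {worst_case_itl_disaggregated():.0f} ms")
```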
*D. Production deployments often have a range of quality of service (QoS) requirements.*
#### blog/2025-07-29_llm-d-v0.2-our-first-well-lit-paths.md
For this release, we ran sweeps to characterize throughput and scaling and to demonstrate the benefits of P/D disaggregation for long-context workloads. Scenarios cover representative workload shapes (input/output ratios of 10:1 and 100:1) and explore various parallelism schemes and P/D disaggregation ratios. For each setup, we measure throughput scaling (tokens per second per user and tokens per second per GPU) across increasing concurrency levels. These results provide a direct comparison between deployments with and without P/D separation (load-aware scheduling only), highlighting where llm-d's optimizations deliver significant benefits.
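For reference, the two reported metrics can be derived from raw benchmark samples roughly as follows (the result fields here are assumed for the sketch and are not the schema of the llm-d benchmarking suite):

```python
from dataclasses import dataclass

@dataclass
class SweepPoint:
    output_tokens: int      # total tokens generated during the run
    duration_sec: float     # wall-clock duration of the run
    concurrent_users: int   # concurrency level for this sweep point
    num_gpus: int           # GPUs serving the deployment

def tokens_per_sec_per_user(p: SweepPoint) -> float:
    """How fast each individual stream feels at this concurrency level."""
    return p.output_tokens / p.duration_sec / p.concurrent_users

def tokens_per_sec_per_gpu(p: SweepPoint) -> float:
    """How much useful output each GPU produces (hardware efficiency)."""
    return p.output_tokens / p.duration_sec / p.num_gpus

p = SweepPoint(output_tokens=1_200_000, duration_sec=600, concurrent_users=64, num_gpus=8)
print(f"{tokens_per_sec_per_user(p):.1f} tok/s/user, {tokens_per_sec_per_gpu(p):.0f} tok/s/GPU")
```

Plotting these two values against each other as concurrency rises is what makes the trade-off between per-user experience and GPU efficiency visible.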
#### blog/2025-09-03_intelligent-inference-scheduling-with-llm-d.md
Deploying large language models (LLMs) on Kubernetes has become the norm, but LLM inference workloads behave very differently from standard microservices. Traditional patterns like uniform replicas paired with round-robin load balancing assume each request uses the same amount of resources and finishes in roughly the same time. In contrast, LLM requests can vary wildly in token count and compute needs, making simple load-spread strategies prone to bottlenecks and imbalanced traffic.
The default EPP in IGW follows a structured scheduling cycle for each incoming request.
Building on IGW’s foundation, **llm-d augments the EPP with more advanced scheduling capabilities**. It introduces scorers that optimize for KV cache locality (boosting prefix-cache hit rates) and orchestrates multiple scheduling passes to disaggregate prefill and decode phases onto specialized pod variants. The result is a fully LLM-aware scheduler that drives higher throughput, lower tail latencies, and finer resource efficiency across the board.
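Conceptually, a scorer-based endpoint picker assigns each candidate pod a combined score and routes to the best one. The sketch below illustrates that idea only; the field names, weights, and interface are assumptions, not the actual EPP plugin API:

```python
from dataclasses import dataclass

@dataclass
class PodSnapshot:
    name: str
    queue_depth: int             # requests waiting on this vLLM pod
    kv_cache_utilization: float  # fraction of KV-cache memory in use (0.0-1.0)
    prefix_hit_blocks: int       # prompt blocks this pod is believed to have cached

def combined_score(pod: PodSnapshot, prompt_blocks: int,
                   w_prefix: float = 2.0, w_load: float = 1.0) -> float:
    # Reward expected prefix-cache reuse, penalize load; higher is better.
    prefix_affinity = pod.prefix_hit_blocks / max(prompt_blocks, 1)
    load_penalty = pod.queue_depth + pod.kv_cache_utilization
    return w_prefix * prefix_affinity - w_load * load_penalty

def pick_pod(pods, prompt_blocks: int) -> PodSnapshot:
    return max(pods, key=lambda p: combined_score(p, prompt_blocks))
```

When P/D disaggregation is enabled, the same kind of decision is made more than once per request, once for a prefill pod and once for a decode pod, as the post describes.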
Under low prefix-sharing workloads, the benefits of prefix-aware routing naturally diminish. In this case, adding load-awareness or prefix-awareness makes little difference: both strategies scale smoothly and meet latency targets.
#### blog/2025-09-24_kvcache-wins-you-can-see.md
In any multi-turn dialogue, from a customer service bot to a long-form assistant, the entire chat history and system prompt form a massive **prefix**. Each new user message is a tiny **suffix**. Effective caching means only the latest turn is prefilled, keeping the conversation fluid and responsive, preventing latency from increasing as the dialogue gets longer.

<small>*__FIGURE 1__: A diagram showing the conversational history as a growing prefix that gets cached, with only the new user query requiring prefill.*</small>
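A small accounting sketch makes the effect visible (the token counts are assumed, not measured): without caching, prefill work grows with every turn; with prefix caching, only the newest message is computed.

```python
SYSTEM_PROMPT_TOKENS = 800
TURN_TOKENS = 120        # assumed size of one user message plus the reply it produced
NEW_MESSAGE_TOKENS = 60  # assumed size of the latest user message alone

def prefill_tokens(turn: int, cached: bool) -> int:
    """Prompt tokens that must be prefilled when the user sends message number `turn`."""
    history = SYSTEM_PROMPT_TOKENS + (turn - 1) * TURN_TOKENS
    if cached and turn > 1:
        return NEW_MESSAGE_TOKENS        # the history is a prefix-cache hit
    return history + NEW_MESSAGE_TOKENS  # everything must be recomputed

for turn in (1, 5, 20):
    print(f"turn {turn:2d}: no cache = {prefill_tokens(turn, False):5d} tokens, "
          f"with cache = {prefill_tokens(turn, True):4d} tokens")
```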
#### **Agentic Workflows**
AI agents represent the most extreme case of prefix dominance. These systems operate in reasoning loops where the prefix contains the agent's goals, tool definitions, and a long history of actions and observations. Production data shows this can lead to input-to-output ratios exceeding **100:1** *(from the Manus [blog](https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus))*, making the prefix overwhelmingly large. Reusing context at every step makes agents computationally viable.
<small>*__FIGURE 2__: A visual of an agent loop, showing the massive, static context (tools, step-history) as the cached prefix and the new observation/action as the small suffix.*</small>
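A quick back-of-the-envelope check on that ratio (illustrative token counts, assuming a full prefix-cache hit on the next loop iteration):

```python
prompt_tokens = 100_000    # roughly the 100:1 agentic prompt size cited above
new_suffix_tokens = 1_000  # assumed size of the latest observation/action appended each step

# Share of the next step's prompt that is an already-cached, reusable prefix.
reusable = (prompt_tokens - new_suffix_tokens) / prompt_tokens
print(f"{reusable:.1%} of the prompt can skip prefill on a cache hit")  # -> 99.0%
```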
Let's revisit our agentic workflow example to see the direct impact of being blind to this unmanaged, disaggregated cache:
<small>*__FIGURE 3__: A heartbreaking KV-cache miss scenario.*</small>
The `KVEvents` provide a live feed of all physical cache changes across the cluster.
This two-layered architecture provides a continuously updated, scalable view of the cluster's cache state, which is the key to enabling intelligent, cache-aware routing.
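A consumer of such an event feed can be sketched as follows; the event fields and method names here are assumptions for illustration, not the actual vLLM KVEvents schema or the llm-d indexer API:

```python
from collections import defaultdict

class ClusterKVCacheIndex:
    """Approximate view of which KV-cache blocks each serving pod currently holds."""

    def __init__(self):
        self.blocks_by_pod = defaultdict(set)

    def apply_event(self, pod: str, stored=(), evicted=()):
        # `stored` / `evicted` are block hashes reported by that pod's cache events.
        self.blocks_by_pod[pod].update(stored)
        self.blocks_by_pod[pod].difference_update(evicted)

    def cached_prefix_length(self, pod: str, prompt_block_hashes) -> int:
        """How many leading blocks of a prompt the pod could serve from cache."""
        cached = self.blocks_by_pod[pod]
        count = 0
        for block_hash in prompt_block_hashes:
            if block_hash not in cached:
                break
            count += 1
        return count

index = ClusterKVCacheIndex()
index.apply_event("pod-a", stored=["b1", "b2", "b3"])
index.apply_event("pod-b", stored=["b1"])
print(index.cached_prefix_length("pod-a", ["b1", "b2", "b9"]))  # -> 2
```

The scheduler queries this kind of index per request, so routing decisions reflect where a prompt's prefix actually lives rather than a guess.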
<small>*__FIGURE 5__: A tri-panel of TTFT, TPoT and Throughput measured through progressively rising QPS rates.*</small>
The following graphs were captured throughout the benchmark runs.
First, we measure the **Effective Cache Throughput**: the number of prompt **tokens** per second served directly from the cache. This metric quantifies the computational work the GPUs ***avoided***. A high value means the system is consistently saving massive amounts of expensive prefill computation.
<small>*__FIGURE 6__: The total computational work **saved** by the KV-cache across the cluster, over the course of the benchmarks.*</small>
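As a rough sketch of how such a metric can be computed from per-request statistics (the input format is assumed; the post's own numbers come from the cluster's serving metrics):

```python
def effective_cache_throughput(requests, window_sec: float) -> float:
    """Prompt tokens served from cache per second over a measurement window.

    `requests` is an iterable of (prompt_tokens, cached_prefix_tokens) pairs for
    requests completed in the window.
    """
    tokens_saved = sum(cached for _, cached in requests)
    return tokens_saved / window_sec

# Example: three requests finished within a 10-second window, each reusing most of its prompt.
completed = [(4096, 3968), (2048, 1920), (8192, 8064)]
print(f"{effective_cache_throughput(completed, 10.0):.0f} cached prompt tokens/s")  # ~1395
```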
This saved work translates directly into system health. By avoiding prefill bottlenecks, the GPUs can focus on productive decoding. We can see this by comparing the number of "**Waiting**" requests (**queued**) and "**Running**" requests (**in decode**).
<small>*__FIGURE 8__: The number of **running requests** **(decoding)** in vLLM over the course of the benchmark.*</small>
The **`precise-scheduling`** plots on the left show a stable system. By effectively utilizing the disaggregated KV-cache, it maintains minimal waiting queues and maximizes the number of actively running requests. In contrast, the other schedulers are clearly overwhelmed; their growing waiting queues choke the system and prevent work from being done efficiently.