
Commit cd8a4d7

Removing unused assets and converting images to webp for faster page loads and improved SEO (#152)
1 parent 9b97f85 commit cd8a4d7

162 files changed: +66 additions, -84 deletions
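The commit message doesn't record which tool produced the WebP files, so the following is an illustration only: a minimal batch-conversion sketch using Pillow. The directory names and quality setting are assumptions, not taken from this repository.

```python
# Hypothetical batch-conversion sketch (not part of this commit): re-encode
# PNG/JPEG images as WebP with Pillow. Paths and quality are assumptions.
from pathlib import Path
from PIL import Image

SOURCE_DIRS = ["docs/assets/images", "static/img"]  # assumed asset locations
QUALITY = 80  # assumed lossy WebP quality

for source in SOURCE_DIRS:
    for path in Path(source).rglob("*"):
        if path.suffix.lower() not in {".png", ".jpg", ".jpeg"}:
            continue
        target = path.with_suffix(".webp")
        with Image.open(path) as img:
            # Normalize palette/alpha modes so WebP encoding keeps transparency.
            if img.mode in ("P", "LA"):
                img = img.convert("RGBA")
            img.save(target, "WEBP", quality=QUALITY)
        print(f"{path} -> {target}")
```

The `cwebp` encoder that ships with libwebp is a common command-line alternative for the same re-encoding step.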


CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion
@@ -190,7 +190,7 @@ image_url: https://your-website.com/path/to/your-photo.jpg

 **Examples from existing authors:**
 - GitHub avatar: [`robshaw`](blog/authors.yml#L10-L11) uses `https://avatars.githubusercontent.com/u/114415538?v=4`
-- Local image: [`cnuland`](blog/authors.yml#L39) uses `/img/blogs/cnuland.jpeg`
+- Local image: [`cnuland`](blog/authors.yml#L39) uses `/img/blogs/cnuland.webp`

 #### 4. **Write Your Content**

blog/2025-05-20_announce.md

Lines changed: 7 additions & 7 deletions
@@ -28,7 +28,7 @@ With llm-d, users can operationalize gen AI deployments with a modular, high-per

 Kubernetes typically scales out application workloads with uniform replicas and round-robin load balancing.

-![Figure 1: Deploying a service to multiple vLLM instances](../docs/assets/images/image5_46.png)
+![Figure 1: Deploying a service to multiple vLLM instances](../docs/assets/images/image5_46.webp)

 This simple pattern is very effective for most request patterns, which have the following characteristics:

@@ -41,7 +41,7 @@ This simple pattern is very effective for most request patterns, which have the

 The LLM inference workload, however, is unique with slow, non-uniform, expensive requests. This means that typical scale-out and load-balancing patterns fall short of optimal performance.

-![Figure 2: Comparison of modern HTTP requests](../docs/assets/images/image7_33.png)
+![Figure 2: Comparison of modern HTTP requests](../docs/assets/images/image7_33.webp)

 Let's take a look at each one step-by-step:

@@ -51,7 +51,7 @@ Let's take a look at each one step-by-step:
 * RAG has long inputs \- prompt and retrieved docs \- and short generated outputs
 * Reasoning has a short or medium inputs and long generated outputs

-![Figure 3: Comparing the RAG pattern and Thinking/Reasoning pattern with prefill and decode stages](../docs/assets/images/image2_4.jpg)
+![Figure 3: Comparing the RAG pattern and Thinking/Reasoning pattern with prefill and decode stages](../docs/assets/images/image2_4.webp)

 * These differences in request times can lead to significant imbalances across instances, which are compounded as loaded instances get overwhelmed. Overloads lead to longer ITL (Inter-Token Latency), which leads to more load, which leads to more ITL.

@@ -61,11 +61,11 @@ Let's take a look at each one step-by-step:
 * Agentic (tool calls are iterative request flow)
 * Code completion task (requests reuse current codebase as context)

-![The agentic pattern sequence](../docs/assets/images/image8_0.jpg)
+![The agentic pattern sequence](../docs/assets/images/image8_0.webp)

 * LLM inference servers like vLLM implement a method called "automatic prefix caching", which enables "skipping" a significant amount of prefill computation when there is a cache hit. If requests are routed to vLLM replicas that have the data in the cache, we skip computation. Increasing the likelihood of prefix cache hits with a larger cache size can dramatically improve tail latencies.

-![The prefix aching method](../docs/assets/images/image3.jpg)
+![The prefix aching method](../docs/assets/images/image3.webp)

 *C. Specializing and coordinating replicas to process a single request can lead to more throughput per GPU.*

@@ -78,7 +78,7 @@ Let's take a look at each one step-by-step:

 * DeepSeek released a [discussion of the design of their inference system](https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md), which leverages aggressive disaggregation to achieve remarkable performance at scale.

-![Disaggregation separates the prefill and decode phases](../docs/assets/images/image4_57.png)
+![Disaggregation separates the prefill and decode phases](../docs/assets/images/image4_57.webp)

 *D. Production deployments often have a range of quality of service (QoS) requirements.*

@@ -163,7 +163,7 @@ We conducted a series of experiments to evaluate the performance of the [llm-d-i
 | **S2** | LlaMA 4 Scout FP8 | TP2, 4 replicas | 12,000 | 100 | P95 TTFT \<= 2s |
 | **S3** | Llama 3.1 70B FP16 | TP2, 4 replicas | 8,000 | 100 | P95 TTFT \<= 2s |

-# ![](../docs/assets/images/image1_116.png)
+# ![](../docs/assets/images/image1_116.webp)

 **Key Observations:**


blog/2025-06-03_week_1_round_up.md

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ hide_table_of_contents: false

 We've hit 1000 ⭐️'s on [GitHub](https://github.com/llm-d/llm-d)

-![llm-d Star Chart](../docs/assets/images/star-history-202563.png)
+![llm-d Star Chart](../docs/assets/images/star-history-202563.webp)

 <!-- truncate -->


blog/2025-07-29_llm-d-v0.2-our-first-well-lit-paths.md

Lines changed: 1 addition & 1 deletion
@@ -66,7 +66,7 @@ Our benchmarking suite has matured significantly. It now supports testing any pr

 For this release, we ran sweeps to characterize throughput and scaling to demonstrate the benefits of P/D disaggregation for long context workloads. Scenarios cover representative workload shapes (input/output ratios of 10:1 and 100:1) and explore various parallelism schemes and P/D disaggregation ratios. For each setup, we're measuring throughput scaling (tokens per second per user and tokens per second per GPU) across increasing concurrency levels. These results provide direct comparison with and without P/D separation (just load-aware), highlighting where llm-d’s optimizations deliver significant benefits.

-![v0.2-llama-scout-1](../docs/assets/images/v0.2-llama-scout-1.png)
+![v0.2-llama-scout-1](../docs/assets/images/v0.2-llama-scout-1.webp)

 Fig. 1: Pareto curve for Llama-Scout on dual 8×H200 IB nodes, comparing monolithic (4tp4) and P/D-disaggregated (4ptp2–2dtp4) topologies.


blog/2025-09-03_intelligent-inference-scheduling-with-llm-d.md

Lines changed: 8 additions & 8 deletions
@@ -23,7 +23,7 @@ The llm-d project lays out clear, “well-lit” paths for anyone to adopt the l
 Deploying large language models (LLMs) on Kubernetes has become the norm, but LLM inference workloads behave very differently from standard microservices. Traditional patterns like uniform replicas paired with round-robin load balancing assume each request uses the same amount of resources and finishes in roughly the same time. In contrast, LLM requests can vary wildly in token count and compute needs, making simple load-spread strategies prone to bottlenecks and imbalanced traffic.

 <div style={{textAlign: 'center', margin: '20px 0'}}>
-<img src="/img/blogs/inference-scheduling/image01.png" alt="Intelligent inference scheduling diagram" style={{width: '75%', height: 'auto'}} />
+<img src="/img/blogs/inference-scheduling/image01.webp" alt="Intelligent inference scheduling diagram" style={{width: '75%', height: 'auto'}} />
 </div>

 <!-- truncate -->

@@ -51,7 +51,7 @@ The default EPP in IGW follows a structured scheduling cycle for each incoming r
 Building on IGW’s foundation, **llm-d** **augments the EPP with more advanced scheduling capabilities**. It introduces scorers that optimize for KV cache locality (boosting prefix-cache hit rates) and orchestrates multiple scheduling passes to disaggregate prefill and decode phases onto specialized pod variants. The result is a fully LLM-aware scheduler that drives higher throughput, lower tail latencies, and finer resource efficiency across the board.

 <div style={{textAlign: 'center', margin: '20px 0'}}>
-<img src="/img/blogs/inference-scheduling/image02.png" alt="Diagram" style={{width: '75%', height: 'auto'}} />
+<img src="/img/blogs/inference-scheduling/image02.webp" alt="Diagram" style={{width: '75%', height: 'auto'}} />
 </div>

 ### Intelligent Inference Scheduling with llm-d

@@ -83,21 +83,21 @@ When cache locality is abundant, the results are dramatic:

 <div style={{margin: '20px 0'}}>
 <div style={{marginBottom: '20px'}}>
-<img src="/img/blogs/inference-scheduling/image03.png" alt="Throughput vs Request Rate" style={{width: '100%', height: 'auto'}} />
+<img src="/img/blogs/inference-scheduling/image03.webp" alt="Throughput vs Request Rate" style={{width: '100%', height: 'auto'}} />
 <p style={{textAlign: 'center', fontSize: '0.9em', marginTop: '8px'}}><em>Throughput vs Request Rate</em></p>
 </div>

 <div style={{display: 'grid', gridTemplateColumns: '1fr 1fr 1fr', gap: '15px', alignItems: 'start'}}>
 <div style={{display: 'flex', flexDirection: 'column', justifyContent: 'center', height: '100%'}}>
-<img src="/img/blogs/inference-scheduling/image04.png" alt="Success Rate" style={{width: '100%', height: 'auto'}} />
+<img src="/img/blogs/inference-scheduling/image04.webp" alt="Success Rate" style={{width: '100%', height: 'auto'}} />
 <p style={{textAlign: 'center', fontSize: '0.85em', marginTop: '6px'}}><em>Success Rate</em></p>
 </div>
 <div style={{display: 'flex', flexDirection: 'column', justifyContent: 'center', height: '100%'}}>
-<img src="/img/blogs/inference-scheduling/image05.png" alt="TTFT and QPS" style={{width: '100%', height: 'auto'}} />
+<img src="/img/blogs/inference-scheduling/image05.webp" alt="TTFT and QPS" style={{width: '100%', height: 'auto'}} />
 <p style={{textAlign: 'center', fontSize: '0.85em', marginTop: '6px'}}><em>TTFT and QPS</em></p>
 </div>
 <div style={{display: 'flex', flexDirection: 'column', justifyContent: 'center', height: '100%'}}>
-<img src="/img/blogs/inference-scheduling/image06.png" alt="Intertoken Latency" style={{width: '100%', height: 'auto'}} />
+<img src="/img/blogs/inference-scheduling/image06.webp" alt="Intertoken Latency" style={{width: '100%', height: 'auto'}} />
 <p style={{textAlign: 'center', fontSize: '0.85em', marginTop: '6px'}}><em>Intertoken Latency</em></p>
 </div>
 </div>

@@ -125,8 +125,8 @@ When cache hits are rare, prefix-awareness provides little benefit, and both sco

 Under low prefix sharing workloads, the benefits of prefix-aware routing naturally diminish. In this case, adding load-awareness or prefix-awareness makes little difference \- both strategies scale smoothly and meet latency targets.

-![Latency vs request rate](/img/blogs/inference-scheduling/image07.png)
-![Throughput vs Request rate](/img/blogs/inference-scheduling/image08.png)
+![Latency vs request rate](/img/blogs/inference-scheduling/image07.webp)
+![Throughput vs Request rate](/img/blogs/inference-scheduling/image08.webp)

 ### **Takeaway**


blog/2025-09-24_kvcache-wins-you-can-see.md

Lines changed: 8 additions & 8 deletions
@@ -71,15 +71,15 @@ The power of vLLM's caching isn't theoretical; it directly maps to the structure

 In any multi-turn dialogue, from a customer service bot to a long-form assistant, the entire chat history and system prompt form a massive **prefix**. Each new user message is a tiny **suffix**. Effective caching means only the latest turn is prefilled, keeping the conversation fluid and responsive, preventing latency from increasing as the dialogue gets longer.

-![Conversational AI prefix caching diagram](/img/blogs/kv-cache-wins/image1.png)
+![Conversational AI prefix caching diagram](/img/blogs/kv-cache-wins/image1.webp)

 <small>*__FIGURE 1__: A diagram showing the conversational history as a growing prefix that gets cached, with only the new user query requiring prefill.*</small>

 #### **Agentic Workflows**

 AI agents represent the most extreme case of prefix dominance. These systems operate in reasoning loops where the prefix contains the agent's goals, tool definitions, and a long history of actions and observations. Production data shows this can lead to input-to-output ratios exceeding **100:1** *(from the Manus [blog](https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus))*, making the prefix overwhelmingly large. Reusing context at every step makes agents computationally viable.

-![Agentic workflow prefix caching diagram](/img/blogs/kv-cache-wins/image2.png)
+![Agentic workflow prefix caching diagram](/img/blogs/kv-cache-wins/image2.webp)

 <small>*__FIGURE 2__: A visual of an agent loop, showing the massive, static context (tools, step-history) as the cached prefix and the new observation/action as the small suffix.*</small>

@@ -97,7 +97,7 @@ What happens when we move from single-instance environment to distributed produc

 Let's revisit our agentic workflow example to see the direct impact of being blind to this unmanaged, disaggregated cache:

-![KV-cache miss scenario diagram](/img/blogs/kv-cache-wins/image3.png)
+![KV-cache miss scenario diagram](/img/blogs/kv-cache-wins/image3.webp)

 <small>*__FIGURE 3__: A heartbreaking KV-cache miss scenario.*</small>

@@ -129,7 +129,7 @@ The `KVEvents` provide a live feed of all physical cache changes across the clus

 This two-layered architecture provides a continuously updated, scalable view of the cluster's cache state, which is the key to enabling intelligent, cache-aware routing.

-![llm-d architecture diagram](/img/blogs/kv-cache-wins/image4.png)
+![llm-d architecture diagram](/img/blogs/kv-cache-wins/image4.webp)

 <small>*__FIGURE 4__: Simplified architecture diagram. (1) \- (3) show the read path, while (A) \- (B) show the write pipeline.*</small>

@@ -201,7 +201,7 @@ This efficiency in latency directly translates to higher system capacity. `preci

 This allows you to handle significantly more traffic on the exact same hardware, simply by eliminating the waste of cache misses.

-![Performance benchmark charts](/img/blogs/kv-cache-wins/image5.png)
+![Performance benchmark charts](/img/blogs/kv-cache-wins/image5.webp)

 <small>*__FIGURE 5__: A tri-panel of TTFT, TPoT and Throughput measured through progressively rising QPS rates.*</small>

@@ -219,7 +219,7 @@ The following graphs were captured throughout the benchmark runs. Schedulers are

 First, we measure the **Effective Cache Throughput** \- the number of prompt **tokens** per second served directly from the cache. This metric quantifies the computational work the GPUs ***avoided***. A high value means the system is consistently saving massive amounts of expensive prefill computation.

-![Effective cache throughput metrics](/img/blogs/kv-cache-wins/image6.png)
+![Effective cache throughput metrics](/img/blogs/kv-cache-wins/image6.webp)

 <small>*__FIGURE 6__: The total computational work **saved** by the KV-cache across the cluster, over the course of the benchmarks.*</small>

@@ -231,10 +231,10 @@ The chart clearly shows that `precise-scheduling` sustains a massive and stable

 This saved work translates directly into system health. By avoiding prefill bottlenecks, the GPUs can focus on productive decoding. We can see this by comparing the number of "**Waiting**" requests (**queued**) and "**Running**" requests (**in decode**).

-![vLLM waiting requests metrics](/img/blogs/kv-cache-wins/image7.png)
+![vLLM waiting requests metrics](/img/blogs/kv-cache-wins/image7.webp)
 <small>*__FIGURE 7__: The number of **waiting requests** in vLLM over the course of the benchmark.*</small>

-![vLLM running requests metrics](/img/blogs/kv-cache-wins/image8.png)
+![vLLM running requests metrics](/img/blogs/kv-cache-wins/image8.webp)
 <small>*__FIGURE 8__: The number of **running requests** **(decoding)** in vLLM over the course of the benchmark.*</small>

 The **`precise-scheduling`** plots on the left show a stable system. By effectively utilizing the disaggregated KV-cache, it maintains minimal waiting queues and maximizes the number of actively running requests. In contrast, the other schedulers are clearly overwhelmed; their growing waiting queues choke the system and prevent work from being done efficiently.
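Every markdown/MDX change in this commit is the same mechanical edit: swapping the image extension in a reference. A sweep of that kind could be scripted; the sketch below is a hypothetical illustration only (the commit does not record how the references were updated), and the directory list and regex are assumptions.

```python
# Hypothetical sketch (not taken from this commit): rewrite image references in
# markdown/MDX files so .png/.jpg/.jpeg paths point at the new .webp assets.
import re
from pathlib import Path

CONTENT_DIRS = ["blog", "docs"]  # assumed content locations for this site

# Match the extension at the end of a markdown image path "(...)" or an <img src="..."> value.
PATTERN = re.compile(r'(\([^)\s]+|src="[^"]+)\.(?:png|jpe?g)', re.IGNORECASE)

for content_dir in CONTENT_DIRS:
    for path in Path(content_dir).rglob("*"):
        if path.suffix.lower() not in {".md", ".mdx"}:
            continue
        text = path.read_text(encoding="utf-8")
        updated = PATTERN.sub(r"\1.webp", text)
        if updated != text:
            path.write_text(updated, encoding="utf-8")
            print(f"updated {path}")
```

A sweep like this would also catch external `.png` URLs inside links, so in practice the rewritten files should be reviewed before committing.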
