#### blog/2025-05-20_announce.md
Kubernetes typically scales out application workloads with uniform replicas and round-robin load balancing.

This simple pattern is very effective for most request patterns, which have the following characteristics:

The LLM inference workload, however, is unique: requests are slow, non-uniform, and expensive. As a result, typical scale-out and load-balancing patterns fall short of optimal performance.
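To make the imbalance concrete, here is a minimal simulation sketch (the request mix and durations are invented for illustration and are not llm-d code): uniform round-robin placement of non-uniform requests quickly leaves some replicas far busier than others.

```python
import random

# Hypothetical request durations in seconds: short chat turns mixed with
# long RAG prefills and long reasoning generations (assumed values).
WORKLOADS = [("chat", 1), ("rag", 8), ("reasoning", 20)]

def round_robin_load(num_replicas=4, num_requests=200, seed=0):
    rng = random.Random(seed)
    busy = [0.0] * num_replicas              # accumulated busy time per replica
    for i in range(num_requests):
        _, duration = rng.choice(WORKLOADS)
        busy[i % num_replicas] += duration   # round-robin ignores request cost
    return busy

load = round_robin_load()
print("busy seconds per replica:", [round(b) for b in load])
print(f"max/min imbalance: {max(load) / min(load):.2f}x")
```

With uniform request durations the ratio stays at 1.00x; with a mixed workload it drifts well above that, and in a live system the overloaded replicas also accumulate queueing delay, which is the feedback loop described below.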

Let's take a look at each one step-by-step:
* RAG has long inputs - prompt and retrieved docs - and short generated outputs
* Reasoning has short or medium inputs and long generated outputs
* These differences in request times can lead to significant imbalances across instances, which are compounded as loaded instances get overwhelmed. Overloads lead to longer ITL (Inter-Token Latency), which leads to more load, which leads to more ITL.
* Agentic workflows (tool calls form an iterative request flow)
* Code completion tasks (requests reuse the current codebase as context)
* LLM inference servers like vLLM implement a method called "automatic prefix caching", which enables "skipping" a significant amount of prefill computation when there is a cache hit. If requests are routed to vLLM replicas that have the data in the cache, we skip computation. Increasing the likelihood of prefix cache hits with a larger cache size can dramatically improve tail latencies.
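The scheduling implication can be sketched in a few lines. This is a simplified illustration rather than the actual llm-d or vLLM routing code; the block size, hashing scheme, and per-replica index are assumptions made for the example:

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative; the real block size is configurable)

def prefix_block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block of the prompt, chaining hashes so a block's identity
    depends on everything before it (prefix semantics)."""
    hashes, running = [], b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        running = hashlib.sha256(running + str(token_ids[start:start + block_size]).encode()).digest()
        hashes.append(running)
    return hashes

def pick_replica(token_ids, replica_cached_blocks):
    """replica_cached_blocks: dict of replica name -> set of block hashes believed cached."""
    prompt_blocks = prefix_block_hashes(token_ids)

    def cached_prefix_len(cached):
        n = 0
        for h in prompt_blocks:
            if h not in cached:
                break
            n += 1
        return n

    # Prefer the replica expected to skip the most prefill work.
    return max(replica_cached_blocks, key=lambda r: cached_prefix_len(replica_cached_blocks[r]))
```

A real scheduler would blend this prefix-affinity signal with load signals rather than follow it unconditionally; otherwise hot prefixes would pile onto a single replica.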
*C. Specializing and coordinating replicas to process a single request can lead to more throughput per GPU.*
* DeepSeek released a [discussion of the design of their inference system](https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md), which leverages aggressive disaggregation to achieve remarkable performance at scale.
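A toy latency model helps show why this specialization pays off (the constants below are made-up, and this is a schematic of the idea, not DeepSeek's or llm-d's implementation): when prefill and decode share a replica, a long prefill can stall every in-flight decode step behind it, whereas dedicated decode replicas keep inter-token latency near a single decode step.

```python
# Assumed costs, for illustration only.
PREFILL_MS_PER_1K_TOKENS = 50.0   # prefill is compute-bound and scales with prompt length
DECODE_STEP_MS = 15.0             # one batched decode step is short but latency-critical

def worst_case_itl_colocated(prompt_tokens: int) -> float:
    """ITL for a decode step that gets queued behind a newly arrived prefill."""
    return prompt_tokens / 1000 * PREFILL_MS_PER_1K_TOKENS + DECODE_STEP_MS

def worst_case_itl_disaggregated() -> float:
    """Decode replicas never run prefill, so ITL stays near one decode step."""
    return DECODE_STEP_MS

print(f"colocated, behind a 32k-token prefill: {worst_case_itl_colocated(32_000):.0f} ms")
print(f"disaggregated:                         {worst_case_itl_disaggregated():.0f} ms")
```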
*D. Production deployments often have a range of quality of service (QoS) requirements.*
#### blog/2025-07-29_llm-d-v0.2-our-first-well-lit-paths.md
For this release, we ran sweeps to characterize throughput and scaling and to demonstrate the benefits of P/D disaggregation for long-context workloads. Scenarios cover representative workload shapes (input/output ratios of 10:1 and 100:1) and explore various parallelism schemes and P/D disaggregation ratios. For each setup, we measure throughput scaling (tokens per second per user and tokens per second per GPU) across increasing concurrency levels. These results provide a direct comparison between deployments with and without P/D separation (load-aware scheduling only), highlighting where llm-d's optimizations deliver significant benefits.
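For reference, the two reported metrics can be derived from raw benchmark samples roughly as follows (the result fields here are assumed for the sketch and are not the schema of the llm-d benchmarking suite):

```python
from dataclasses import dataclass

@dataclass
class SweepPoint:
    output_tokens: int      # total tokens generated during the run
    duration_sec: float     # wall-clock duration of the run
    concurrent_users: int   # concurrency level for this sweep point
    num_gpus: int           # GPUs serving the deployment

def tokens_per_sec_per_user(p: SweepPoint) -> float:
    """How fast each individual stream feels at this concurrency level."""
    return p.output_tokens / p.duration_sec / p.concurrent_users

def tokens_per_sec_per_gpu(p: SweepPoint) -> float:
    """How much useful output each GPU produces (hardware efficiency)."""
    return p.output_tokens / p.duration_sec / p.num_gpus

p = SweepPoint(output_tokens=1_200_000, duration_sec=600, concurrent_users=64, num_gpus=8)
print(f"{tokens_per_sec_per_user(p):.1f} tok/s/user, {tokens_per_sec_per_gpu(p):.0f} tok/s/GPU")
```

Plotting these two values against each other as concurrency rises is what makes the trade-off between per-user experience and GPU efficiency visible.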
#### blog/2025-09-03_intelligent-inference-scheduling-with-llm-d.md
Deploying large language models (LLMs) on Kubernetes has become the norm, but LLM inference workloads behave very differently from standard microservices. Traditional patterns like uniform replicas paired with round-robin load balancing assume each request uses the same amount of resources and finishes in roughly the same time. In contrast, LLM requests can vary wildly in token count and compute needs, making simple load-spread strategies prone to bottlenecks and imbalanced traffic.
The default EPP in IGW follows a structured scheduling cycle for each incoming request.
Building on IGW’s foundation, **llm-d augments the EPP with more advanced scheduling capabilities**. It introduces scorers that optimize for KV cache locality (boosting prefix-cache hit rates) and orchestrates multiple scheduling passes to disaggregate prefill and decode phases onto specialized pod variants. The result is a fully LLM-aware scheduler that drives higher throughput, lower tail latencies, and finer resource efficiency across the board.
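Conceptually, a scorer-based endpoint picker assigns each candidate pod a combined score and routes to the best one. The sketch below illustrates that idea only; the field names, weights, and interface are assumptions, not the actual EPP plugin API:

```python
from dataclasses import dataclass

@dataclass
class PodSnapshot:
    name: str
    queue_depth: int             # requests waiting on this vLLM pod
    kv_cache_utilization: float  # fraction of KV-cache memory in use (0.0-1.0)
    prefix_hit_blocks: int       # prompt blocks this pod is believed to have cached

def combined_score(pod: PodSnapshot, prompt_blocks: int,
                   w_prefix: float = 2.0, w_load: float = 1.0) -> float:
    # Reward expected prefix-cache reuse, penalize load; higher is better.
    prefix_affinity = pod.prefix_hit_blocks / max(prompt_blocks, 1)
    load_penalty = pod.queue_depth + pod.kv_cache_utilization
    return w_prefix * prefix_affinity - w_load * load_penalty

def pick_pod(pods, prompt_blocks: int) -> PodSnapshot:
    return max(pods, key=lambda p: combined_score(p, prompt_blocks))
```

When P/D disaggregation is enabled, the same kind of decision is made more than once per request, once for a prefill pod and once for a decode pod, as the post describes.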
Under low prefix-sharing workloads, the benefits of prefix-aware routing naturally diminish. In this case, adding load-awareness or prefix-awareness makes little difference: both strategies scale smoothly and meet latency targets.
#### blog/2025-09-24_kvcache-wins-you-can-see.md
In any multi-turn dialogue, from a customer service bot to a long-form assistant, the entire chat history and system prompt form a massive **prefix**. Each new user message is a tiny **suffix**. Effective caching means only the latest turn is prefilled, keeping the conversation fluid and responsive, preventing latency from increasing as the dialogue gets longer.

<small>*__FIGURE 1__: A diagram showing the conversational history as a growing prefix that gets cached, with only the new user query requiring prefill.*</small>
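A small accounting sketch makes the effect visible (the token counts are assumed, not measured): without caching, prefill work grows with every turn; with prefix caching, only the newest message is computed.

```python
SYSTEM_PROMPT_TOKENS = 800
TURN_TOKENS = 120        # assumed size of one user message plus the reply it produced
NEW_MESSAGE_TOKENS = 60  # assumed size of the latest user message alone

def prefill_tokens(turn: int, cached: bool) -> int:
    """Prompt tokens that must be prefilled when the user sends message number `turn`."""
    history = SYSTEM_PROMPT_TOKENS + (turn - 1) * TURN_TOKENS
    if cached and turn > 1:
        return NEW_MESSAGE_TOKENS        # the history is a prefix-cache hit
    return history + NEW_MESSAGE_TOKENS  # everything must be recomputed

for turn in (1, 5, 20):
    print(f"turn {turn:2d}: no cache = {prefill_tokens(turn, False):5d} tokens, "
          f"with cache = {prefill_tokens(turn, True):4d} tokens")
```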
#### **Agentic Workflows**
AI agents represent the most extreme case of prefix dominance. These systems operate in reasoning loops where the prefix contains the agent's goals, tool definitions, and a long history of actions and observations. Production data shows this can lead to input-to-output ratios exceeding **100:1** *(from the Manus [blog](https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus))*, making the prefix overwhelmingly large. Reusing context at every step makes agents computationally viable.
<small>*__FIGURE 2__: A visual of an agent loop, showing the massive, static context (tools, step-history) as the cached prefix and the new observation/action as the small suffix.*</small>
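A quick back-of-the-envelope check on that ratio (illustrative token counts, assuming a full prefix-cache hit on the next loop iteration):

```python
prompt_tokens = 100_000    # roughly the 100:1 agentic prompt size cited above
new_suffix_tokens = 1_000  # assumed size of the latest observation/action appended each step

# Share of the next step's prompt that is an already-cached, reusable prefix.
reusable = (prompt_tokens - new_suffix_tokens) / prompt_tokens
print(f"{reusable:.1%} of the prompt can skip prefill on a cache hit")  # -> 99.0%
```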
Let's revisit our agentic workflow example to see the direct impact of being blind to this unmanaged, disaggregated cache:
<small>*__FIGURE 3__: A heartbreaking KV-cache miss scenario.*</small>
The `KVEvents` provide a live feed of all physical cache changes across the cluster.
This two-layered architecture provides a continuously updated, scalable view of the cluster's cache state, which is the key to enabling intelligent, cache-aware routing.
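A consumer of such an event feed can be sketched as follows; the event fields and method names here are assumptions for illustration, not the actual vLLM KVEvents schema or the llm-d indexer API:

```python
from collections import defaultdict

class ClusterKVCacheIndex:
    """Approximate view of which KV-cache blocks each serving pod currently holds."""

    def __init__(self):
        self.blocks_by_pod = defaultdict(set)

    def apply_event(self, pod: str, stored=(), evicted=()):
        # `stored` / `evicted` are block hashes reported by that pod's cache events.
        self.blocks_by_pod[pod].update(stored)
        self.blocks_by_pod[pod].difference_update(evicted)

    def cached_prefix_length(self, pod: str, prompt_block_hashes) -> int:
        """How many leading blocks of a prompt the pod could serve from cache."""
        cached = self.blocks_by_pod[pod]
        count = 0
        for block_hash in prompt_block_hashes:
            if block_hash not in cached:
                break
            count += 1
        return count

index = ClusterKVCacheIndex()
index.apply_event("pod-a", stored=["b1", "b2", "b3"])
index.apply_event("pod-b", stored=["b1"])
print(index.cached_prefix_length("pod-a", ["b1", "b2", "b9"]))  # -> 2
```

The scheduler queries this kind of index per request, so routing decisions reflect where a prompt's prefix actually lives rather than a guess.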
<small>*__FIGURE 5__: A tri-panel of TTFT, TPoT and Throughput measured through progressively rising QPS rates.*</small>
The following graphs were captured throughout the benchmark runs.
First, we measure the **Effective Cache Throughput**: the number of prompt **tokens** per second served directly from the cache. This metric quantifies the computational work the GPUs ***avoided***. A high value means the system is consistently saving massive amounts of expensive prefill computation.
<small>*__FIGURE 6__: The total computational work **saved** by the KV-cache across the cluster, over the course of the benchmarks.*</small>
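As a rough sketch of how such a metric can be computed from per-request statistics (the input format is assumed; the post's own numbers come from the cluster's serving metrics):

```python
def effective_cache_throughput(requests, window_sec: float) -> float:
    """Prompt tokens served from cache per second over a measurement window.

    `requests` is an iterable of (prompt_tokens, cached_prefix_tokens) pairs for
    requests completed in the window.
    """
    tokens_saved = sum(cached for _, cached in requests)
    return tokens_saved / window_sec

# Example: three requests finished within a 10-second window, each reusing most of its prompt.
completed = [(4096, 3968), (2048, 1920), (8192, 8064)]
print(f"{effective_cache_throughput(completed, 10.0):.0f} cached prompt tokens/s")  # ~1395
```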
This saved work translates directly into system health. By avoiding prefill bottlenecks, the GPUs can focus on productive decoding. We can see this by comparing the number of "**Waiting**" requests (**queued**) and "**Running**" requests (**in decode**).
<small>*__FIGURE 8__: The number of **running requests** **(decoding)** in vLLM over the course of the benchmark.*</small>
The **`precise-scheduling`** plots on the left show a stable system. By effectively utilizing the disaggregated KV-cache, it maintains minimal waiting queues and maximizes the number of actively running requests. In contrast, the other schedulers are clearly overwhelmed; their growing waiting queues choke the system and prevent work from being done efficiently.