# [RFC]: Omni Connector For Full Disaggregation Architecture 2026 Q1 Roadmap
## Motivation
To achieve higher scalability and resource efficiency in multimodal large model serving, we propose a roadmap centered on Full Disaggregation Architecture. By leveraging Omni Connector and RDMA technologies, we aim to decouple computation stages (Prefill, Decode, Generate, Encode) and enable flexible distributed execution across nodes. This roadmap focuses on architectural refactoring, high-performance transport, and expanding model support for Qwen Omni and Hunyuan Image families.
## Proposed Change
### 1. Omni Connector: High-Performance Transport
Enhancing the data transport layer to support the low-latency, asynchronous transfers required for disaggregated serving.
- P0: Mooncake Transfer Engine Integration
  - Goal: Provide full RDMA support based on the Mooncake transfer engine, specifically optimized for stage-to-stage data transmission and high-bandwidth AR/DiT KV cache communication.
  - Action: Deeply integrate Mooncake's optimized RDMA primitives into `MooncakeRDMAConnector`.
  - Reference: [RFC]: vLLM-Omni RDMA connector Feature Design #955, [WIP] vLLM-Omni RDMA connector #1019
- P0: Asynchronous Transmission
  - Goal: Enable non-blocking data transfer to maximize hardware utilization by overlapping communication overhead with computation tasks.
  - Action: Implement asynchronous operations in `OmniConnector` to hide transport latency in disaggregated pipelines.
  - Reference: [WIP] async scheduling to overlap chunk IO and compute #951
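To make the overlap concrete, here is a minimal sketch of the non-blocking pattern the async connector aims for, using only `asyncio`. The names `send_chunk`, `compute_chunk`, and `pipeline` are illustrative stand-ins, not the actual `OmniConnector` API; a real implementation would issue RDMA operations instead of sleeping.

```python
import asyncio

# Hypothetical sketch: hide transfer latency behind the next compute step.
# send_chunk stands in for an OmniConnector-style async transfer.

async def send_chunk(chunk_id: int) -> int:
    await asyncio.sleep(0.01)  # stands in for transfer latency
    return chunk_id

async def compute_chunk(chunk_id: int) -> int:
    await asyncio.sleep(0.01)  # stands in for kernel execution
    return chunk_id * 2

async def pipeline(num_chunks: int) -> list:
    results = []
    pending_send = None
    for i in range(num_chunks):
        out = await compute_chunk(i)
        if pending_send is not None:
            await pending_send  # previous transfer finished during compute
        # launch this chunk's transfer without blocking the next compute
        pending_send = asyncio.ensure_future(send_chunk(out))
        results.append(out)
    if pending_send is not None:
        await pending_send  # drain the last in-flight transfer
    return results

print(asyncio.run(pipeline(4)))  # [0, 2, 4, 6]
```

The key property is that each transfer is awaited one iteration late, so communication for chunk `i` runs concurrently with computation for chunk `i+1`.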
### 2. Full Disaggregation Functionality (EPDG)
Implementing fine-grained separation of duties (Encode-Prefill-Decode-Generate) for the Qwen Omni 2.5/3 model family to achieve optimal resource utilization and scalability.
- P0: PD Separation (Prefill-Decode Disaggregation)
  - Definition: Decouple the Prefill stage from the Decode stage into separate instances/stages for AR models.
  - Target: Qwen Omni 2.5 / 3 (Thinker & Talker).
  - Rationale: Prefill is compute-bound while Decode is memory-bound. Separating them allows independent scaling (e.g., different GPU types or parallelism strategies) and prevents head-of-line blocking, where long prefills delay decoding requests. Since vLLM already provides a mature PD disaggregation implementation, we aim to reuse its Mooncake-based mechanism as much as possible.
  - Implementation:
    - Use the vLLM KV connector to transmit the KV cache generated during the Prefill stage to the Decode stage via RDMA.
    - Ensure seamless handover of request state (e.g., SamplingParams, request ID) alongside the KV data, while maintaining full compatibility and interoperability with the existing Omni Connector architecture.
  - Reference: [RFC]: Support Prefill-Decode Disaggregation for vLLM-Omni Thinker Stage via vLLM KV Transfer #1188
- P1: E->P Separation (Encoder-Prefill Disaggregation)
  - Definition: Decouple the multimodal encoder (e.g., audio/vision encoders) from the LLM Prefill stage.
  - Target: Qwen Omni 2.5 / 3.
  - Rationale: Audio/vision encoding has different compute characteristics from LLM Prefill. Separating `E` allows the heavy signal-processing work to be offloaded (e.g., to CPU or specialized workers), reducing the load on the main GPU cluster running the LLM.
  - Implementation: Leverage existing separation patterns in vLLM to decouple the Encoder from the Prefill stage, standardizing the interface for transmitting embeddings and features via Omni Connector.
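The request-state handover described above can be sketched as a small metadata record that travels next to the KV payload. This is a minimal illustration, not the actual vLLM-Omni schema: the class name `HandoverState` and its fields (`request_id`, `num_computed_tokens`, `sampling`) are assumptions.

```python
import json
from dataclasses import dataclass, field, asdict

# Hypothetical handover record sent from the Prefill stage to the Decode
# stage alongside the RDMA-transferred KV cache. Field names are
# illustrative, not the real vLLM-Omni wire format.

@dataclass
class HandoverState:
    request_id: str
    num_computed_tokens: int  # how many prompt tokens were prefilled
    sampling: dict = field(default_factory=dict)  # serialized SamplingParams

    def to_wire(self) -> bytes:
        # Metadata is small, so a JSON blob next to the KV payload suffices.
        return json.dumps(asdict(self)).encode()

    @classmethod
    def from_wire(cls, raw: bytes) -> "HandoverState":
        return cls(**json.loads(raw))

state = HandoverState("req-42", 128, {"temperature": 0.7, "top_p": 0.9})
restored = HandoverState.from_wire(state.to_wire())
assert restored == state  # Decode side reconstructs the exact request state
```

The design point is that the Decode instance must be able to resume sampling as if the request had never left the Prefill instance, which is why sampling parameters and progress counters travel with the KV data rather than being re-negotiated.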
### 3. Disaggregated Model Support
Extending disaggregated serving support to advanced multimodal models.
- P0: Bagel Disaggregation Optimization
  - Goal: Optimize the performance of the existing Bagel disaggregation pipeline.
  - Details: Implement cross-node RDMA KV cache transfer to minimize latency in multi-node deployments.
  - Reference: RFC: Bagel deployment #936
- P1: Hunyuan Image AR -> DiT Separation
  - Details: Implement the separation between the autoregressive (AR) text/image understanding component and the Diffusion Transformer (DiT) generation component.
  - Transport: Enable RDMA transmission via `MooncakeRDMAConnector` between the AR and DiT stages.
  - Reference: [Model] Add Hunyuan Image3 AR Support #759, [Model] Support HunyuanImage3 Diffusion Model in GPU #1085
- P1: Data Transmission Optimization
  - Details: Implement `pin_memory` optimization for Hunyuan Image data transfers to maximize RDMA throughput and minimize CPU overhead.
  - Reference: [WIP] vLLM-Omni RDMA connector #1019
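As a rough illustration of the AR -> DiT handoff, the intermediate tensors must be framed so the receiving stage can reconstruct them from a raw RDMA buffer. The sketch below uses a hypothetical fixed header (dtype code, rank, shape) followed by the payload bytes; the actual `MooncakeRDMAConnector` framing may differ.

```python
import struct

# Hypothetical wire framing for stage-to-stage tensor transfer:
# a fixed-size header (dtype code, ndim, up to 4 dims) + raw payload.
HEADER = struct.Struct("<II4I")  # little-endian: 2 uint32 + 4 uint32 dims

def pack_tensor(dtype_code: int, shape: tuple, payload: bytes) -> bytes:
    """Prefix the raw buffer with enough metadata to rebuild the tensor."""
    dims = list(shape) + [0] * (4 - len(shape))  # pad shape to 4 dims
    return HEADER.pack(dtype_code, len(shape), *dims) + payload

def unpack_tensor(frame: bytes):
    """Split a received frame back into (dtype_code, shape, payload)."""
    dtype_code, ndim, *dims = HEADER.unpack(frame[:HEADER.size])
    return dtype_code, tuple(dims[:ndim]), frame[HEADER.size:]

frame = pack_tensor(dtype_code=1, shape=(2, 3), payload=b"\x01" * 6)
assert unpack_tensor(frame) == (1, (2, 3), b"\x01" * 6)
```

Keeping the header fixed-size matters for RDMA: the receiver can post a buffer of known layout and parse it without an extra round trip, and a pinned (page-locked) staging buffer on each side avoids copies on the critical path.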
### 4. Core Architecture: Distributed Execution Refactoring
Aligning with vLLM's native distributed architecture to support flexible multi-node deployment.
- P1: Native Worker Actor Pattern
  - Goal: Remove the `OmniStage` Ray actor wrapper. Currently, `OmniStage` and its worker must stay on the same node, which limits scaling flexibility (e.g., TP > 8).
  - Action: Refactor `omni_stage.py` so that the Worker itself acts as the Ray actor (consistent with the vLLM main repo). This enables a single stage to span multiple GPUs/nodes and facilitates TP/CP/PP across node boundaries.
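The shape of this refactor can be sketched with plain classes, with Ray specifics noted in comments. In the target design each `Worker` is its own actor and the stage only holds handles to them, so one stage can span workers placed on any node. The class and method names here are illustrative, not the actual `omni_stage.py` code.

```python
# Sketch of the worker-as-actor pattern (Ray replaced by plain classes).
# Today: OmniStage wraps a single in-process worker, forcing co-location.
# Target: each Worker is the (Ray) actor; OmniStage is a thin coordinator.

class Worker:
    """In the target design this class itself would be a @ray.remote actor,
    so the scheduler can place each instance on any node."""
    def __init__(self, rank: int):
        self.rank = rank

    def execute(self, step: str) -> str:
        return f"rank{self.rank}:{step}"

class OmniStage:
    """Thin coordinator holding worker handles instead of owning one worker."""
    def __init__(self, world_size: int):
        # With Ray this would be [Worker.remote(r) for r in ...],
        # allowing TP > 8 by spreading actors across nodes.
        self.workers = [Worker(r) for r in range(world_size)]

    def run(self, step: str) -> list:
        # With Ray: ray.get([w.execute.remote(step) for w in self.workers])
        return [w.execute(step) for w in self.workers]

stage = OmniStage(world_size=2)
print(stage.run("prefill"))  # ['rank0:prefill', 'rank1:prefill']
```

Because the stage no longer embeds the worker, placement decisions move to the Ray scheduler, which is what allows a single stage's tensor/context/pipeline parallel groups to cross node boundaries.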
## Roadmap Overview
| Priority | Category | Task | Description |
|---|---|---|---|
| P0 | Connector | Mooncake RDMA | Implement RDMA transport for Stage/KV communication. |
| P0 | Connector | Async Transfer | Support non-blocking transmission to overlap Comm/Comp. |
| P0 | Functionality | PD Separation | Implement Prefill-Decode disaggregation for Qwen Omni. |
| P0 | Model | Bagel Opt | Optimize Bagel disaggregation with cross-node RDMA. |
| P1 | Model | Hunyuan Separation | Support AR -> DiT separation with RDMA connector. |
| P1 | Architecture | Native Worker Actor | Remove the OmniStage actor wrapper; enable multi-node scaling. |
| P1 | Functionality | E->P Separation | Decouple Audio/Vision Encoders from the LLM Prefill stage. |
| P1 | Model | Pin Memory Opt | Optimize Hunyuan Image transfer using pinned memory. |
## Feedback Period
- We welcome feedback on the architectural alignment with vLLM and the design of the asynchronous connector APIs.
## CC List
@hsliuustc0106 @Gaohan123 @princepride @Shirley125 @ahengljh @spencerr221