# [RFC]: Omni Connector For Full Disaggregation Architecture 2026 Q1 Roadmap
## Motivation
To achieve higher scalability and resource efficiency in multimodal large model serving, we propose a roadmap centered on Full Disaggregation Architecture. By leveraging Omni Connector and RDMA technologies, we aim to decouple computation stages (Prefill, Decode, Generate, Encode) and enable flexible distributed execution across nodes. This roadmap focuses on architectural refactoring, high-performance transport, and expanding model support for Qwen Omni and Hunyuan Image families.
## Proposed Change
### 1. Omni Connector: High-Performance Transport
Enhancing the data transport layer to support the low-latency, asynchronous transfers required for disaggregated serving.
- P0: Mooncake Transfer Engine Integration
  - Goal: Provide full RDMA support based on the Mooncake transfer engine, specifically optimized for stage-to-stage data transmission and high-bandwidth AR/DiT KV cache communication.
  - Action: Deeply integrate Mooncake's optimized RDMA primitives into `MooncakeRDMAConnector`.
  - Reference: [RFC]: vLLM-Omni RDMA connector Feature Design #955, [WIP] vLLM-Omni RDMA connector #1019
- P0: Asynchronous Transmission
  - Goal: Enable non-blocking data transfer to maximize hardware utilization by overlapping communication overhead with computation tasks.
  - Action: Implement asynchronous operations in `OmniConnector` to hide transport latency in disaggregated pipelines.
  - Reference: [WIP] async scheduling to overlap chunk IO and compute #951
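To make the overlap concrete, here is a minimal sketch of the non-blocking pattern the async connector aims for, using only `asyncio`. The names `send_chunk`, `compute_chunk`, and `pipeline` are illustrative stand-ins, not the actual `OmniConnector` API; a real implementation would issue RDMA operations instead of sleeping.

```python
import asyncio

# Hypothetical sketch: hide transfer latency behind the next compute step.
# send_chunk stands in for an OmniConnector-style async transfer.

async def send_chunk(chunk_id: int) -> int:
    await asyncio.sleep(0.01)  # stands in for transfer latency
    return chunk_id

async def compute_chunk(chunk_id: int) -> int:
    await asyncio.sleep(0.01)  # stands in for kernel execution
    return chunk_id * 2

async def pipeline(num_chunks: int) -> list:
    results = []
    pending_send = None
    for i in range(num_chunks):
        out = await compute_chunk(i)
        if pending_send is not None:
            await pending_send  # previous transfer finished during compute
        # launch this chunk's transfer without blocking the next compute
        pending_send = asyncio.ensure_future(send_chunk(out))
        results.append(out)
    if pending_send is not None:
        await pending_send  # drain the last in-flight transfer
    return results

print(asyncio.run(pipeline(4)))  # [0, 2, 4, 6]
```

The key property is that each transfer is awaited one iteration late, so communication for chunk `i` runs concurrently with computation for chunk `i+1`.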
### 2. Full Disaggregation Functionality (EPDG)
Implementing fine-grained separation of duties (Encode-Prefill-Decode-Generate) for the Qwen Omni 2.5/3 model family to achieve optimal resource utilization and scalability.
- P0: PD Separation (Prefill-Decode Disaggregation)
  - Definition: Decouple the Prefill stage from the Decode stage into separate instances/stages for AR models.
  - Target: Qwen Omni 2.5 / 3 (Thinker & Talker).
  - Rationale: Prefill is compute-bound while Decode is memory-bound. Separating them allows independent scaling (e.g., different GPU types or parallelism strategies) and prevents head-of-line blocking, where long prefills delay decoding requests. Since vLLM already provides a mature PD disaggregation implementation, we aim to reuse its Mooncake-based mechanism as much as possible.
  - Implementation:
    - Use the vLLM KV connector to transmit the KV cache generated during the Prefill stage to the Decode stage via RDMA.
    - Ensure seamless handover of request state (e.g., SamplingParams, request ID) alongside the KV data, while maintaining full compatibility and interoperability with the existing Omni Connector architecture.
  - Reference: [RFC]: Support Prefill-Decode Disaggregation for vLLM-Omni Thinker Stage via vLLM KV Transfer #1188
- P1: E->P Separation (Encoder-Prefill Disaggregation)
  - Definition: Decouple the multimodal encoder (e.g., audio/vision encoders) from the LLM Prefill stage.
  - Target: Qwen Omni 2.5 / 3.
  - Rationale: Audio/vision encoding has different compute characteristics from LLM Prefill. Separating `E` allows the heavy signal-processing work to be offloaded (e.g., to CPU or specialized workers), reducing the load on the main GPU cluster running the LLM.
  - Implementation: Leverage existing separation patterns in vLLM to decouple the Encoder from the Prefill stage, standardizing the interface for transmitting embeddings and features via Omni Connector.
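The request-state handover described above can be sketched as a small metadata record that travels next to the KV payload. This is a minimal illustration, not the actual vLLM-Omni schema: the class name `HandoverState` and its fields (`request_id`, `num_computed_tokens`, `sampling`) are assumptions.

```python
import json
from dataclasses import dataclass, field, asdict

# Hypothetical handover record sent from the Prefill stage to the Decode
# stage alongside the RDMA-transferred KV cache. Field names are
# illustrative, not the real vLLM-Omni wire format.

@dataclass
class HandoverState:
    request_id: str
    num_computed_tokens: int  # how many prompt tokens were prefilled
    sampling: dict = field(default_factory=dict)  # serialized SamplingParams

    def to_wire(self) -> bytes:
        # Metadata is small, so a JSON blob next to the KV payload suffices.
        return json.dumps(asdict(self)).encode()

    @classmethod
    def from_wire(cls, raw: bytes) -> "HandoverState":
        return cls(**json.loads(raw))

state = HandoverState("req-42", 128, {"temperature": 0.7, "top_p": 0.9})
restored = HandoverState.from_wire(state.to_wire())
assert restored == state  # Decode side reconstructs the exact request state
```

The design point is that the Decode instance must be able to resume sampling as if the request had never left the Prefill instance, which is why sampling parameters and progress counters travel with the KV data rather than being re-negotiated.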
### 3. Disaggregated Model Support
Extending disaggregated serving support to advanced multimodal models.
- P0: Bagel Disaggregation Optimization
  - Goal: Optimize the performance of the existing Bagel disaggregation pipeline.
  - Details: Implement cross-node RDMA KV cache transfer to minimize latency in multi-node deployments.
  - Reference: RFC: Bagel deployment #936
- P1: Hunyuan Image AR -> DiT Separation
  - Details: Implement the separation between the autoregressive (AR) text/image understanding component and the Diffusion Transformer (DiT) generation component.
  - Transport: Enable RDMA transmission via `MooncakeRDMAConnector` between the AR and DiT stages.
  - Reference: [Model] Add Hunyuan Image3 AR Support #759, [Model] Support HunyuanImage3 Diffusion Model in GPU #1085
- P1: Data Transmission Optimization
  - Details: Implement `pin_memory` optimization for Hunyuan Image data transfers to maximize RDMA throughput and minimize CPU overhead.
  - Reference: [WIP] vLLM-Omni RDMA connector #1019
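As a rough illustration of the AR -> DiT handoff, the intermediate tensors must be framed so the receiving stage can reconstruct them from a raw RDMA buffer. The sketch below uses a hypothetical fixed header (dtype code, rank, shape) followed by the payload bytes; the actual `MooncakeRDMAConnector` framing may differ.

```python
import struct

# Hypothetical wire framing for stage-to-stage tensor transfer:
# a fixed-size header (dtype code, ndim, up to 4 dims) + raw payload.
HEADER = struct.Struct("<II4I")  # little-endian: 2 uint32 + 4 uint32 dims

def pack_tensor(dtype_code: int, shape: tuple, payload: bytes) -> bytes:
    """Prefix the raw buffer with enough metadata to rebuild the tensor."""
    dims = list(shape) + [0] * (4 - len(shape))  # pad shape to 4 dims
    return HEADER.pack(dtype_code, len(shape), *dims) + payload

def unpack_tensor(frame: bytes):
    """Split a received frame back into (dtype_code, shape, payload)."""
    dtype_code, ndim, *dims = HEADER.unpack(frame[:HEADER.size])
    return dtype_code, tuple(dims[:ndim]), frame[HEADER.size:]

frame = pack_tensor(dtype_code=1, shape=(2, 3), payload=b"\x01" * 6)
assert unpack_tensor(frame) == (1, (2, 3), b"\x01" * 6)
```

Keeping the header fixed-size matters for RDMA: the receiver can post a buffer of known layout and parse it without an extra round trip, and a pinned (page-locked) staging buffer on each side avoids copies on the critical path.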
### 4. Core Architecture: Distributed Execution Refactoring
Aligning with vLLM's native distributed architecture to support flexible multi-node deployment.
- P1: Native Worker Actor Pattern
  - Goal: Remove the `OmniStage` Ray actor wrapper. Currently, `OmniStage` and its worker must stay on the same node, which limits scaling flexibility (e.g., TP > 8).
  - Action: Refactor `omni_stage.py` so that the Worker itself acts as the Ray actor (consistent with the vLLM main repo). This enables a single stage to span multiple GPUs/nodes and facilitates TP/CP/PP across node boundaries.
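The shape of this refactor can be sketched with plain classes, with Ray specifics noted in comments. In the target design each `Worker` is its own actor and the stage only holds handles to them, so one stage can span workers placed on any node. The class and method names here are illustrative, not the actual `omni_stage.py` code.

```python
# Sketch of the worker-as-actor pattern (Ray replaced by plain classes).
# Today: OmniStage wraps a single in-process worker, forcing co-location.
# Target: each Worker is the (Ray) actor; OmniStage is a thin coordinator.

class Worker:
    """In the target design this class itself would be a @ray.remote actor,
    so the scheduler can place each instance on any node."""
    def __init__(self, rank: int):
        self.rank = rank

    def execute(self, step: str) -> str:
        return f"rank{self.rank}:{step}"

class OmniStage:
    """Thin coordinator holding worker handles instead of owning one worker."""
    def __init__(self, world_size: int):
        # With Ray this would be [Worker.remote(r) for r in ...],
        # allowing TP > 8 by spreading actors across nodes.
        self.workers = [Worker(r) for r in range(world_size)]

    def run(self, step: str) -> list:
        # With Ray: ray.get([w.execute.remote(step) for w in self.workers])
        return [w.execute(step) for w in self.workers]

stage = OmniStage(world_size=2)
print(stage.run("prefill"))  # ['rank0:prefill', 'rank1:prefill']
```

Because the stage no longer embeds the worker, placement decisions move to the Ray scheduler, which is what allows a single stage's tensor/context/pipeline parallel groups to cross node boundaries.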
## Roadmap Overview
| Priority | Category | Task | Description |
|---|---|---|---|
| P0 | Connector | Mooncake RDMA | Implement RDMA transport for Stage/KV communication. |
| P0 | Connector | Async Transfer | Support non-blocking transmission to overlap Comm/Comp. |
| P0 | Functionality | PD Separation | Implement Prefill-Decode disaggregation for Qwen Omni. |
| P0 | Model | Bagel Opt | Optimize Bagel disaggregation with cross-node RDMA. |
| P1 | Model | Hunyuan Separation | Support AR -> DiT separation with RDMA connector. |
| P1 | Architecture | Native Worker Actor | Remove the OmniStage actor wrapper; enable multi-node scaling. |
| P1 | Functionality | E->P Separation | Decouple Audio/Vision Encoders from the LLM Prefill stage. |
| P1 | Model | Pin Memory Opt | Optimize Hunyuan Image transfer using pinned memory. |
## Feedback Period
- We welcome feedback on the architectural alignment with vLLM and the design of the asynchronous connector APIs.
## CC List
@hsliuustc0106 @Gaohan123 @princepride @Shirley125 @ahengljh @spencerr221