[RFC]: Omni Connector for Full Disaggregation Architecture 2026 Q1 Roadmap #1192

@natureofnature

Description

Motivation

To achieve higher scalability and resource efficiency when serving large multimodal models, we propose a roadmap centered on a Full Disaggregation Architecture. By leveraging the Omni Connector and RDMA, we aim to decouple the computation stages (Encode, Prefill, Decode, Generate) and enable flexible distributed execution across nodes. This roadmap focuses on architectural refactoring, high-performance transport, and expanding model support for the Qwen Omni and Hunyuan Image families.

Proposed Change

1. Omni Connector: High-Performance Transport

Enhancing the data transport layer to support low-latency, asynchronous transfers required for disaggregated serving.

  • P0: Mooncake Transfer Engine Integration
  • P0: Asynchronous Transmission
    • Goal: Enable non-blocking data transfer to maximize hardware utilization by overlapping communication overhead with computation tasks.
    • Action: Implement asynchronous operations in OmniConnector to hide transport latency in disaggregated pipelines.
    • Reference: [wip]async scheduling to overlap chunk IO and compute #951
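
To make the intent concrete, here is a minimal sketch of a non-blocking send path. The names (`AsyncTransferStub`, `send_async`) are illustrative assumptions, not the actual OmniConnector API; a real implementation would issue RDMA operations (e.g., via the Mooncake Transfer Engine) instead of a thread-pool stub.

```python
# Illustrative sketch only; the real OmniConnector API may differ.
# The key property: send_async() returns immediately with a handle, so the
# caller can overlap the transfer with compute and block only when needed.
from concurrent.futures import Future, ThreadPoolExecutor

class AsyncTransferStub:
    """Hypothetical non-blocking transport wrapper (assumed names)."""

    def __init__(self, num_io_threads: int = 2):
        # Dedicated IO threads keep transfers off the compute loop.
        self._pool = ThreadPoolExecutor(max_workers=num_io_threads)

    def send_async(self, buffer: bytes, dest: str) -> Future:
        # Kick off the transfer and return immediately.
        return self._pool.submit(self._blocking_send, buffer, dest)

    def _blocking_send(self, buffer: bytes, dest: str) -> int:
        # Stand-in for the real RDMA write (e.g., Mooncake Transfer Engine).
        return len(buffer)

# Usage: overlap communication with computation.
connector = AsyncTransferStub()
handle = connector.send_async(b"kv-cache-chunk", dest="decode-node-0")
# ... run the next compute step here while the transfer is in flight ...
bytes_sent = handle.result()  # Block only when the result is actually needed.
```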

2. Full Disaggregation Functionality (EPDG)

Implementing fine-grained separation of duties (Encoder-Prefill-Decode-Generate) for the Qwen Omni 2.5/3 model family to achieve optimal resource utilization and scalability.

  • P0: PD Separation (Prefill-Decode Disaggregation)

    • Definition: Decoupling the Prefill stage from the Decode stage into separate instances/stages for AR models.
    • Target: Qwen Omni 2.5 / 3 (Thinker & Talker).
    • Rationale: Prefill is compute-bound while Decode is memory-bound. Separating them allows for independent scaling (e.g., using different GPU types or parallelism strategies) and prevents the "head-of-line blocking" problem where long prefills delay decoding requests. Given that vLLM already provides a mature implementation of PD disaggregation, we aim to leverage its existing Mooncake-based mechanism as much as possible.
    • Implementation:
      • Leverage the vLLM KV connector to transmit the KV cache generated during the Prefill stage to the Decode stage via RDMA.
      • Ensure seamless handover of request state (e.g., SamplingParams, Request ID) alongside the KV data, while maintaining full compatibility and interoperability with the existing Omni Connector architecture (a sketch of this handoff follows the list below).
    • Reference: [RFC]: Support Prefill-Decode Disaggregation for vLLM-Omni Thinker Stage via vLLM KV Transfer #1188
  • P1: E->P Separation (Encoder-Prefill Disaggregation)

    • Definition: Decoupling the Multimodal Encoder (e.g., Audio/Vision Encoders) from the LLM Prefill stage.
    • Target: Qwen Omni 2.5 / 3.
    • Rationale: Audio/Vision encoding has different compute characteristics from LLM Prefill. Separating the Encoder allows the heavy signal-processing work to be offloaded (e.g., to CPUs or specialized workers), reducing the load on the main GPU cluster running the LLM.
    • Implementation:
      • Leverage existing separation patterns in vLLM to decouple the Encoder from the Prefill stage, standardizing the interface for transmitting embeddings and features via Omni Connector (a second sketch below illustrates the envelope).
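
To make the PD handoff concrete, here is a minimal sketch of the payload a Prefill instance might hand to a Decode instance. The field names and the connector methods (`transfer_kv_blocks`, `send_metadata`) are assumptions for illustration; the actual wire format is defined by the vLLM KV connector.

```python
# Illustrative sketch; field names and connector methods are assumptions.
from dataclasses import dataclass, field

@dataclass
class KVHandoff:
    request_id: str                     # Identifies the request on both sides.
    sampling_params: dict               # Serialized SamplingParams (assumed form).
    prompt_token_ids: list[int]         # Needed so Decode can resume correctly.
    kv_block_ids: list[int] = field(default_factory=list)  # RDMA-registered KV blocks.

def hand_off_to_decode(handoff: KVHandoff, connector) -> None:
    # 1) Bulk data path: move the KV blocks out-of-band via RDMA.
    connector.transfer_kv_blocks(handoff.request_id, handoff.kv_block_ids)
    # 2) Control path: send the small request-state metadata alongside.
    connector.send_metadata(handoff.request_id, {
        "sampling_params": handoff.sampling_params,
        "prompt_token_ids": handoff.prompt_token_ids,
    })
```

Keeping the bulk KV transfer and the small request-state message on separate paths lets the RDMA path stay zero-copy while the control message remains easy to version.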
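
For the E->P separation, a second sketch shows a standardized envelope for shipping encoder outputs to the Prefill stage. `EncoderOutput` and `send_tensor` are likewise assumed names, not an existing Omni Connector interface.

```python
# Illustrative sketch; the envelope and connector call are assumptions.
from dataclasses import dataclass

import torch

@dataclass
class EncoderOutput:
    request_id: str
    modality: str                        # e.g. "audio" or "vision"
    embeddings: torch.Tensor             # [num_tokens, hidden_size] features
    placeholder_range: tuple[int, int]   # Where the features slot into the prompt

def send_encoder_output(out: EncoderOutput, connector) -> None:
    # Contiguous tensors transfer more efficiently over RDMA.
    features = out.embeddings.contiguous()
    connector.send_tensor(                      # hypothetical Omni Connector call
        key=f"{out.request_id}/{out.modality}",
        tensor=features,
        metadata={"placeholder_range": out.placeholder_range},
    )
```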

3. Disaggregated Model Support

Extending disaggregated serving support to advanced multimodal models.

  • P0: Bagel Disaggregation Optimization

    • Goal: Optimize the performance of the existing Bagel disaggregation pipeline.
    • Details:
      • Implement cross-node RDMA KV cache transfer to minimize latency in multi-node deployments.
    • Reference: RFC: Bagel deployment #936
  • P1: Hunyuan Image AR -> DiT Separation

  • P1: Data Transmission Optimization

    • Details: Implement pin_memory optimization for Hunyuan Image data transfers to maximize RDMA throughput and minimize CPU overhead.
    • Reference: [WIP]vLLM-Omni RDMA connector #1019
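
As a sketch of the pin_memory idea (plain PyTorch; buffer names and sizes are illustrative, and a CUDA-capable setup is assumed): staging data in page-locked host memory lets device-to-host copies run asynchronously and lets an RDMA NIC read the buffer directly, avoiding an extra bounce copy.

```python
# Illustrative sketch of pinned-memory staging for RDMA transfers.
import torch

def make_pinned_staging_buffer(num_bytes: int) -> torch.Tensor:
    # Page-locked allocation: required for truly asynchronous H2D/D2H copies
    # and directly registrable by RDMA transports (assumes CUDA is available).
    return torch.empty(num_bytes, dtype=torch.uint8, pin_memory=True)

def stage_for_transfer(gpu_tensor: torch.Tensor, staging: torch.Tensor) -> torch.Tensor:
    # Reinterpret the device tensor as raw bytes.
    flat = gpu_tensor.contiguous().view(-1).view(torch.uint8)
    dst = staging[: flat.numel()]
    # non_blocking=True overlaps with compute only because `staging` is pinned.
    dst.copy_(flat, non_blocking=True)
    torch.cuda.current_stream().synchronize()  # Copy must finish before the RDMA send.
    return dst
```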

4. Core Architecture: Distributed Execution Refactoring

Aligning with vLLM's native distributed architecture to support flexible multi-node deployment.

  • P1: Native Worker Actor Pattern
    • Goal: Remove the OmniStage Ray actor wrapper. Currently, OmniStage and its workers must reside on the same node, which limits scaling flexibility (e.g., TP > 8).
    • Action: Refactor omni_stage.py so that the Worker itself acts as the Ray actor (consistent with the vLLM main repo). This enables a single stage to span multiple GPUs/nodes and facilitates TP/CP/PP, etc., across node boundaries (a sketch follows).
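
A minimal sketch of the target pattern, assuming Ray (class names and the driver shape are illustrative, not the actual omni_stage.py refactor):

```python
# Illustrative sketch: each Worker is its own Ray actor, so one logical
# stage can place workers on GPUs across nodes instead of being pinned
# to a single node by a stage-level actor wrapper.
import ray

@ray.remote(num_gpus=1)
class Worker:
    """One actor per GPU; Ray may schedule these anywhere in the cluster."""

    def __init__(self, rank: int, world_size: int):
        self.rank = rank
        self.world_size = world_size
        # Real code would initialize the distributed groups (TP/PP/CP) here.

    def execute(self, payload: dict) -> dict:
        # Stand-in for one model-execution step on this worker's shard.
        return {"rank": self.rank, "result": payload}

class Stage:
    """Thin driver (no longer a Ray actor itself) that fans work out."""

    def __init__(self, world_size: int):
        self.workers = [
            Worker.remote(rank=i, world_size=world_size)
            for i in range(world_size)
        ]

    def step(self, payload: dict) -> list:
        # Broadcast one step to all workers; because workers are independent
        # actors, the set can span nodes, enabling e.g. TP > 8.
        return ray.get([w.execute.remote(payload) for w in self.workers])
```

Because each Worker is scheduled independently, Ray's placement machinery can spread one stage's workers across nodes, which the current stage-level wrapper prevents.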

Roadmap Overview

| Priority | Category | Task | Description |
| --- | --- | --- | --- |
| P0 | Connector | Mooncake RDMA | Implement RDMA transport for Stage/KV communication. |
| P0 | Connector | Async Transfer | Support non-blocking transmission to overlap communication with computation. |
| P0 | Functionality | PD Separation | Implement Prefill-Decode disaggregation for Qwen Omni. |
| P0 | Model | Bagel Opt | Optimize Bagel disaggregation with cross-node RDMA. |
| P1 | Model | Hunyuan Separation | Support AR -> DiT separation with the RDMA connector. |
| P1 | Architecture | Native Worker Actor | Remove the OmniStage actor wrapper; enable multi-node scaling. |
| P1 | Functionality | E->P Separation | Decouple audio/vision encoders from the LLM Prefill stage. |
| P1 | Model | Pin Memory Opt | Optimize Hunyuan Image transfers using pinned memory. |

Feedback Period

  • We welcome feedback on the architectural alignment with vLLM and the design of the asynchronous connector APIs.

CC List

@hsliuustc0106 @Gaohan123 @princepride @Shirley125 @ahengljh @spencerr221
