Tensor parallel shared-memory RPC lacks worker acknowledgement and can corrupt commands #246

Description

@banfeb

Summary

In nano-vllm tensor parallel mode (tensor_parallel_size > 1), rank0 sends method calls to worker ranks by writing them into a shared-memory buffer and signalling the workers with a multiprocessing.Event.

The current protocol only has a rank0-to-worker notification event. There is no worker-to-rank0 acknowledgement that the previous command has been fully read and executed. As a result, rank0 may overwrite the shared-memory buffer with the next RPC while one or more worker ranks are still reading or processing the previous RPC.
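A minimal sketch of what the missing acknowledgement could look like, not the actual nano-vllm code. The class `AckChannel`, the `done` event, and the 4-byte length prefix are all hypothetical names and layout choices made for illustration. `threading.Event` and a `bytearray` stand in for `multiprocessing.Event` and the shared-memory buffer so the example runs in a single process; both event types expose the same `set` / `wait` / `clear` calls.

```python
import pickle
import threading

BUF_SIZE = 1 << 16  # hypothetical buffer size


class AckChannel:
    """One rank0-to-worker command slot with an explicit acknowledgement."""

    def __init__(self):
        self.buf = bytearray(BUF_SIZE)   # stand-in for the shared-memory buffer
        self.ready = threading.Event()   # rank0 -> worker: a command was written
        self.done = threading.Event()    # worker -> rank0: command was consumed
        self.done.set()                  # nothing is in flight at startup

    def send(self, method, *args, timeout=5.0):
        # rank0 side: never touch the buffer until the previous command is acked.
        if not self.done.wait(timeout):
            raise TimeoutError("worker did not acknowledge the previous command")
        self.done.clear()
        data = pickle.dumps([method, *args])
        n = len(data)
        if n + 4 > len(self.buf):
            raise ValueError("command does not fit in the buffer")
        self.buf[0:4] = n.to_bytes(4, "little")  # assumed length-prefix framing
        self.buf[4:4 + n] = data
        self.ready.set()

    def recv(self):
        # worker side: block until rank0 signals, then copy the payload out.
        self.ready.wait()
        n = int.from_bytes(self.buf[0:4], "little")
        method, *args = pickle.loads(bytes(self.buf[4:4 + n]))
        self.ready.clear()
        return method, args

    def ack(self):
        # worker side: tell rank0 the buffer may be reused.
        self.done.set()


def worker_loop(chan, log):
    while True:
        method, args = chan.recv()
        log.append((method, args))  # stand-in for executing the method
        chan.ack()                  # ack only after read and execution
        if method == "exit":
            break


def demo():
    chan = AckChannel()
    log = []
    worker = threading.Thread(target=worker_loop, args=(chan, log))
    worker.start()
    for i in range(100):
        chan.send("run", i, "x" * (i * 10))  # back-to-back sends, varying size
    chan.send("exit")
    worker.join()
    return log
```

With several worker ranks, rank0 would need one `done` event per worker and would wait on all of them before the next write. Acknowledging right after the payload is copied out, rather than after execution, would also protect the buffer while letting rank0 run ahead; which point is appropriate depends on how much overlap the engine wants.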

This can corrupt the pickled command payload on worker ranks and lead to _pickle.UnpicklingError. Once one worker rank exits on that error, the remaining ranks can hang in NCCL collectives and eventually time out.
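The following sketch replays one possible interleaving deterministically to show how an overwrite can produce an undecodable payload. It is a single-process simulation, not a reproduction of the real race. The helper names `write_cmd` and `torn_read` and the 4-byte length prefix are assumptions made for illustration.

```python
import pickle


def write_cmd(buf, method, *args):
    # rank0 side: length prefix followed by the pickled command (assumed layout).
    data = pickle.dumps([method, *args])
    n = len(data)
    buf[0:4] = n.to_bytes(4, "little")
    buf[4:4 + n] = data


def torn_read(buf, next_cmd):
    # Worker reads the length of the current command...
    n = int.from_bytes(buf[0:4], "little")
    # ...rank0 overwrites the buffer before the worker copies the payload...
    write_cmd(buf, *next_cmd)
    # ...so the worker decodes the old length's worth of the new payload.
    return pickle.loads(bytes(buf[4:4 + n]))


buf = bytearray(4096)
write_cmd(buf, "exit")  # short command currently in the buffer
try:
    torn_read(buf, ("run", list(range(200)), True))  # longer command overwrites it
    outcome = "decoded"
except Exception as exc:  # exact exception type depends on where the payload is cut
    outcome = type(exc).__name__
print(outcome)
```

The reverse case, where a long command is overwritten by a short one, may decode without any error and silently run the wrong command, because `pickle.loads` ignores bytes after the end of the first pickled object.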
