Tensor parallel shared-memory RPC lacks worker acknowledgement and can corrupt commands #246

## Summary

In nano-vllm tensor parallel mode (`tensor_parallel_size > 1`), rank0 sends method calls to worker ranks by writing them to a shared-memory buffer and signalling the workers with a `multiprocessing.Event`.

The current protocol only has a rank0-to-worker notification event. There is no worker-to-rank0 acknowledgement that the previous command has been fully read and executed. As a result, rank0 may overwrite the shared-memory buffer with the next RPC while one or more worker ranks are still reading or processing the previous RPC.
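The race can be replayed deterministically in a single thread. The sketch below is illustrative only: the 4-byte length header, the `(method, args)` tuple, and the helper names `write_command` and `torn_read` are assumptions made for the example, not code taken from nano-vllm. It shows one unlucky interleaving in which the worker reads the length of one command and then the payload bytes of the next.

```python
import pickle

HEADER = 4  # bytes reserved for the payload length (assumed layout)


def write_command(buf, command):
    """rank0 side: store a length-prefixed pickle in the shared buffer."""
    data = pickle.dumps(command)
    buf[0:HEADER] = len(data).to_bytes(HEADER, "little")
    buf[HEADER:HEADER + len(data)] = data


def torn_read(buf, first, second):
    """Replay one interleaving: rank0 overwrites the buffer mid-read."""
    write_command(buf, first)
    # The worker reads the length header of the first command...
    n = int.from_bytes(buf[0:HEADER], "little")
    # ...rank0, with no acknowledgement to wait for, writes the next command...
    write_command(buf, second)
    # ...and the worker then reads the payload using the stale length.
    payload = bytes(buf[HEADER:HEADER + n])
    try:
        return ("ok", pickle.loads(payload))
    except Exception as exc:  # exact exception type depends on where the cut lands
        return ("error", type(exc).__name__)


buf = bytearray(4096)  # stand-in for the shared-memory buffer
short_cmd = ("exit", ())
long_cmd = ("run", (list(range(200)),))

# The stale, shorter length truncates the longer pickle, so decoding fails.
result = torn_read(buf, short_cmd, long_cmd)
print(result)
```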

This can corrupt the pickled command payload on worker ranks and lead to `_pickle.UnpicklingError`. After one worker rank exits, the remaining ranks can hang in NCCL collectives and eventually time out.
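One possible shape for the missing acknowledgement is a second event per worker that rank0 waits on before it reuses the buffer. The following is a minimal sketch under stated assumptions, not a patch for nano-vllm: threads and a `bytearray` stand in for worker processes and shared memory so that the example runs on its own, and `AckedChannel`, `call`, `serve`, `run_demo`, and the 4-byte length header are hypothetical names and layout chosen for illustration.

```python
import pickle
import threading

HEADER = 4  # assumed length-prefix size in bytes


class AckedChannel:
    """Shared buffer plus a notify/acknowledge event pair per worker."""

    def __init__(self, num_workers, size=4096):
        self.buf = bytearray(size)  # stand-in for the shared-memory buffer
        self.ready = [threading.Event() for _ in range(num_workers)]  # rank0 -> worker
        self.done = [threading.Event() for _ in range(num_workers)]   # worker -> rank0

    def call(self, method, *args, timeout=5.0):
        """rank0 side: publish one command, then wait for every worker to ack."""
        data = pickle.dumps((method, args))
        if len(data) > len(self.buf) - HEADER:
            raise ValueError("command does not fit in the buffer")
        self.buf[0:HEADER] = len(data).to_bytes(HEADER, "little")
        self.buf[HEADER:HEADER + len(data)] = data
        for ack in self.done:
            ack.clear()
        for event in self.ready:
            event.set()
        for rank, ack in enumerate(self.done):
            if not ack.wait(timeout):
                raise TimeoutError(f"worker {rank} did not acknowledge {method!r}")

    def serve(self, rank, handler):
        """Worker side: read, execute, acknowledge; repeat until 'exit'."""
        while True:
            self.ready[rank].wait()
            self.ready[rank].clear()
            n = int.from_bytes(self.buf[0:HEADER], "little")
            method, args = pickle.loads(bytes(self.buf[HEADER:HEADER + n]))
            handler(rank, method, args)
            self.done[rank].set()  # only now may rank0 reuse the buffer
            if method == "exit":
                return


def run_demo(num_workers=2, num_calls=100):
    """Send back-to-back commands and record what each worker decoded."""
    channel = AckedChannel(num_workers)
    seen = [[] for _ in range(num_workers)]

    def handler(rank, method, args):
        seen[rank].append((method, args))

    threads = [
        threading.Thread(target=channel.serve, args=(rank, handler))
        for rank in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for i in range(num_calls):
        channel.call("run", i)
    channel.call("exit")
    for thread in threads:
        thread.join()
    return seen


seen = run_demo()
```

In a multi-process setting the same pattern would presumably use `multiprocessing.Event` and `multiprocessing.shared_memory.SharedMemory` in place of the thread-based stand-ins. Acknowledging after execution serializes commands across ranks; acknowledging as soon as the payload has been copied out of the buffer would already protect it from being overwritten. The timeout on the acknowledgement wait lets rank0 raise an error when a worker has died instead of blocking indefinitely.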


