Summary
In nano-vllm tensor parallel mode (tensor_parallel_size > 1), rank0 sends method calls to worker ranks through a shared-memory buffer plus multiprocessing.Event.
The current protocol only has a rank0-to-worker notification event. There is no worker-to-rank0 acknowledgement that the previous command has been fully read and executed. As a result, rank0 may overwrite the shared-memory buffer with the next RPC while one or more worker ranks are still reading or processing the previous RPC.
This can corrupt the pickled command payload on worker ranks and lead to _pickle.UnpicklingError. After one worker rank exits, the remaining ranks can hang in NCCL collectives and eventually time out.
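One possible way to close the gap is a per-worker acknowledgement event that rank0 waits on before reusing the buffer. The following is a minimal sketch of that handshake, not nano-vllm's actual code: the names `write_command`, `worker_loop`, and `run_demo`, the 4-byte little-endian length prefix, and the use of threads in place of worker processes are all assumptions made to keep the example self-contained.

```python
import pickle
import threading
from multiprocessing import Event
from multiprocessing.shared_memory import SharedMemory

HEADER = 4  # assumed layout: 4-byte little-endian length, then the pickled payload


def write_command(shm, notify_events, ack_events, method, *args):
    """Rank0 side: publish one command only after all workers released the buffer."""
    for ack in ack_events:
        ack.wait()   # worker has finished reading and executing the previous command
        ack.clear()
    data = pickle.dumps([method, *args])
    shm.buf[0:HEADER] = len(data).to_bytes(HEADER, "little")
    shm.buf[HEADER:HEADER + len(data)] = data
    for notify in notify_events:
        notify.set()  # rank0-to-worker notification


def worker_loop(shm, notify, ack, executed):
    """Worker side: consume one command per notification, then acknowledge it."""
    while True:
        notify.wait()
        notify.clear()
        n = int.from_bytes(bytes(shm.buf[0:HEADER]), "little")
        method, *args = pickle.loads(bytes(shm.buf[HEADER:HEADER + n]))
        if method != "exit":
            executed.append((method, args))  # stand-in for running the real method
        ack.set()  # worker-to-rank0 acknowledgement: buffer may be overwritten now
        if method == "exit":
            break


def run_demo(num_workers=2, num_commands=20):
    """Send commands back to back; threads stand in for worker ranks here."""
    shm = SharedMemory(create=True, size=1 << 16)
    notifies = [Event() for _ in range(num_workers)]
    acks = [Event() for _ in range(num_workers)]
    for ack in acks:
        ack.set()  # the buffer starts out free, so the first write must not block
    executed = [[] for _ in range(num_workers)]
    threads = [
        threading.Thread(target=worker_loop, args=(shm, notifies[i], acks[i], executed[i]))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    try:
        for i in range(num_commands):
            write_command(shm, notifies, acks, "run", i)
        write_command(shm, notifies, acks, "exit")
        for t in threads:
            t.join()
    finally:
        shm.close()
        shm.unlink()
    return executed
```

Calling `run_demo()` sends commands with no delay between them, which is the pattern that triggers the overwrite in the one-way protocol. With the acknowledgement in place, each worker should see every command exactly once and in order. The cost is that rank0 now blocks until the slowest worker has finished the previous command.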