I see torchrec is misusing `_pin_and_move` to copy data from host to device. This results in a call to `cudaMemcpyAsync`, which occupies the GPU's copy engine. If another `cudaMemcpyAsync` is running simultaneously on a different CUDA stream, one of the two copies has to wait for the other to complete before it can start. This blocks one of the CUDA streams and breaks the concurrency of the program.
In practice, the user should be in charge of using pinned memory, not the library. I suggest torchrec either replace the `_pin_and_move` function entirely with `_move` (which skips the copy into pinned memory), or add an interface that lets the user opt into pinned memory.
In my opinion, any tensor that is not pre-allocated should not use `.pin_memory().to(...)`. `_pin_and_move` is used in these places, and almost every call site passes a dynamically allocated tensor, which seems incorrect to me.
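A minimal sketch of the two call patterns discussed above (the function names and signatures here are illustrative, not torchrec's actual implementation):

```python
import torch


def pin_and_move(tensor: torch.Tensor, device: torch.device) -> torch.Tensor:
    # The pattern this issue objects to: every call first copies the
    # (dynamically allocated) tensor into freshly pinned host memory,
    # then issues a cudaMemcpyAsync for the H2D transfer.
    if device.type == "cuda":
        return tensor.pin_memory().to(device, non_blocking=True)
    return tensor


def move(tensor: torch.Tensor, device: torch.device) -> torch.Tensor:
    # Suggested replacement: a plain device transfer. If the caller wants
    # an async copy, they allocate the source tensor in pinned memory
    # themselves (e.g. via pin_memory=True at allocation time) and pass
    # it in; the library does not pin on their behalf.
    return tensor.to(device)
```

With `move`, a user who pre-allocates a pinned staging buffer still gets the overlapped copy, while users on the default path avoid the per-call pinned allocation and extra host-side copy.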