I see torchrec is misusing `_pin_and_move` to copy data from host to device. This results in a call to `cudaMemcpyAsync`, which occupies the GPU's copy engine. If another `cudaMemcpyAsync` is running simultaneously on a different CUDA stream, one of the two copies has to wait for the other to complete before it can start. This blocks one of the CUDA streams and breaks the concurrency of the program.
In practice, the user should be in charge of using pinned memory, not the library. I suggest torchrec either replace the `_pin_and_move` function entirely with `_move` (which skips the copy into pinned memory), or add an interface that lets the user opt into pinned memory.
In my opinion, any tensor that is not pre-allocated should not use `.pin_memory().to(...)`. `_pin_and_move` is used in these places, and almost every call site passes a dynamically allocated tensor, which seems incorrect to me.
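A minimal sketch of the two call patterns discussed above (the function names and signatures here are illustrative, not torchrec's actual implementation):

```python
import torch


def pin_and_move(tensor: torch.Tensor, device: torch.device) -> torch.Tensor:
    # The pattern this issue objects to: every call first copies the
    # (dynamically allocated) tensor into freshly pinned host memory,
    # then issues a cudaMemcpyAsync for the H2D transfer.
    if device.type == "cuda":
        return tensor.pin_memory().to(device, non_blocking=True)
    return tensor


def move(tensor: torch.Tensor, device: torch.device) -> torch.Tensor:
    # Suggested replacement: a plain device transfer. If the caller wants
    # an async copy, they allocate the source tensor in pinned memory
    # themselves (e.g. via pin_memory=True at allocation time) and pass
    # it in; the library does not pin on their behalf.
    return tensor.to(device)
```

With `move`, a user who pre-allocates a pinned staging buffer still gets the overlapped copy, while users on the default path avoid the per-call pinned allocation and extra host-side copy.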