_pin_and_move Can Break Program Concurrency #3667

@leimao

Description

I see torchrec is abusing _pin_and_move to copy data from host to device. This ends up calling cudaMemcpyAsync, which occupies the copy engine on the GPU. If another cudaMemcpyAsync is running on a different CUDA stream at the same time, one of the two has to wait for the other to complete before it can start. That blocks one of the CUDA streams and breaks the concurrency of the program.
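One way to check whether two host-to-device copies on different streams overlap on a given machine is to time them. The sketch below is only an illustration: the function name time_h2d_copies and the buffer size are made up for this example, and it returns None when no CUDA device is present.

```python
import time

import torch


def time_h2d_copies(num_bytes=1 << 26):
    """Time one H2D copy vs. two H2D copies issued on two streams.

    Hypothetical helper for illustration. Returns None without CUDA.
    """
    if not torch.cuda.is_available():
        return None
    device = torch.device("cuda")
    # Sources are pinned up front so only the transfer itself is measured.
    a = torch.empty(num_bytes, dtype=torch.uint8, pin_memory=True)
    b = torch.empty(num_bytes, dtype=torch.uint8, pin_memory=True)
    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()

    def run(pairs):
        torch.cuda.synchronize()
        start = time.perf_counter()
        outs = []  # keep results alive until all copies have finished
        for stream, src in pairs:
            with torch.cuda.stream(stream):
                outs.append(src.to(device, non_blocking=True))
        torch.cuda.synchronize()
        return time.perf_counter() - start

    run([(s1, a), (s2, b)])  # warm-up: context, streams, allocator
    return {
        "single": run([(s1, a)]),
        "two_streams": run([(s1, a), (s2, b)]),
    }
```

If the two transfers really do queue behind each other, two_streams should come out at roughly twice single. The exact ratio depends on the GPU, the driver, and the transfer size, so treat the numbers as a hint rather than a proof.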

In practice, the user, not the library, should be in charge of deciding when pinned memory is used. I suggest torchrec either replace _pin_and_move entirely with _move (which does not copy the tensor into pinned memory first) or add an interface that lets the user opt in to pinned memory.
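A minimal sketch of what such an opt-in interface could look like. The name move_to_device and the pin_memory flag are hypothetical, not existing torchrec API. The default path is a plain transfer, and pinning only happens when the caller asks for it.

```python
import torch


def move_to_device(tensor, device, pin_memory=False):
    """Hypothetical replacement: pinning is the caller's decision."""
    if device.type != "cuda":
        # Pinned memory only matters for host-to-GPU transfers.
        return tensor.to(device=device)
    if pin_memory:
        # pin_memory() copies into page-locked memory unless the tensor
        # is already pinned; the copy to the GPU can then be asynchronous.
        return tensor.pin_memory().to(device=device, non_blocking=True)
    # Default: plain transfer, no staging copy into pinned memory.
    return tensor.to(device=device)
```

With this shape, existing callers get the non-pinned behavior by default, and a user who manages their own streams can pass pin_memory=True where it actually helps.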

In my opinion, any tensor that is not pre-allocated should not go through .pin_memory().to(...). _pin_and_move is currently being used in these places. Almost every call passes a dynamically created tensor, which looks incorrect to me.
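For contrast, here is a rough sketch of the pre-allocated case, where pinning is paid for once and the buffer is reused. The class PinnedStagingBuffer is hypothetical and only meant to illustrate the idea. It falls back to a plain copy when there is no CUDA device, when the tensor does not fit, or when the dtype differs.

```python
import torch


class PinnedStagingBuffer:
    """Hypothetical helper: one page-locked buffer, allocated once."""

    def __init__(self, capacity, dtype, device):
        self.device = device
        self.buffer = None
        self.copy_done = None
        if device.type == "cuda":
            self.buffer = torch.empty(capacity, dtype=dtype, pin_memory=True)
            self.copy_done = torch.cuda.Event()

    def to_device(self, src):
        if (
            self.buffer is None
            or src.numel() > self.buffer.numel()
            or src.dtype != self.buffer.dtype
        ):
            # No usable staging buffer: plain transfer.
            return src.to(self.device)
        # Wait until the previous async copy has finished reading the buffer.
        self.copy_done.synchronize()
        staged = self.buffer[: src.numel()]
        staged.copy_(src.reshape(-1))
        out = staged.to(self.device, non_blocking=True)
        self.copy_done.record()
        return out.reshape(src.shape)
```

A dynamically created tensor never gets this amortization: each .pin_memory() call has to produce a fresh page-locked copy before the transfer can even start.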
