We noticed an issue in `_wait_impl`: `self._output_tensor.view(self.num_workers, -1).T.tolist()` uses pageable memory for the D2H copy. If we run the embedding lookup on a different CUDA stream, it blocks the main stream.
Once we change this part to use pinned memory, the `cudaMemcpyAsync` becomes non-blocking:
```python
view = self._output_tensor.view(self.num_workers, -1).T
if view.is_cuda:
    pinned = torch.empty(
        view.shape, dtype=view.dtype, device="cpu", pin_memory=True
    )
    pinned.copy_(view)
    ret = pinned.tolist()
else:
    ret = view.tolist()
```
I think there are more copies like this in the torchrec embedding lookup path — should we change all of those to pinned memory as well?
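For illustration, here is a minimal, self-contained sketch of the staging pattern above as a standalone helper (the function name `to_list_via_pinned` is hypothetical, not part of torchrec). One caveat worth noting: a truly non-blocking D2H copy (`non_blocking=True`) requires the host to synchronize the stream before reading the pinned buffer; pinning only avoids stalling *other* streams on the pageable-memory sync.

```python
import torch


def to_list_via_pinned(view: torch.Tensor) -> list:
    # Hypothetical helper: stage a CUDA tensor through pinned host
    # memory so the D2H transfer runs as an async cudaMemcpyAsync
    # instead of a blocking pageable-memory copy.
    if view.is_cuda:
        pinned = torch.empty(
            view.shape, dtype=view.dtype, device="cpu", pin_memory=True
        )
        pinned.copy_(view, non_blocking=True)
        # The host must still wait for the copy to land before
        # reading the pinned buffer.
        torch.cuda.current_stream().synchronize()
        return pinned.tolist()
    # CPU tensors need no staging.
    return view.tolist()
```

With `non_blocking=True` the copy is enqueued on the current stream and other streams keep running; the explicit synchronize only gates the host-side read.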