
[release/2.5][ROCm] Indexing backward kernel improvements from mainline (multiple commits) #1937


Merged
merged 3 commits into ROCm:release/2.5 on Mar 4, 2025

Conversation

jerrymannil

No description provided.

xw285cornell and others added 3 commits February 28, 2025 23:08
Summary:
This slow path is bad because it has a sync point which makes the CPU really slow. I'm not sure whether AMD still needs it with newer ROCm versions.

{F1870213925}

Test Plan: CI

Differential Revision: D62731130

Pull Request resolved: pytorch#136136
Approved by: https://github.com/danzimm, https://github.com/jeffdaily, https://github.com/eqy
On ROCm, using a non-vectorized index_put kernel provides a ~2x perf improvement over the hipified CUDA kernel. None of the existing unit tests exercised the large-index case, so a new unit test was added.

It was also noted that the scale value in the original kernel was hard-coded to 1.0, making the multiply a no-op, so it was removed from the simplified ROCm kernel.
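The scatter-add behavior and the no-op scale can be illustrated with a small host-side model (a Python sketch, not the actual HIP kernel; the function name and signature are illustrative):

```python
import numpy as np

def index_put_accumulate(dst, indices, values, scale=1.0):
    """Host-side model of an accumulating index_put (scatter-add).

    The original kernel multiplied each value by `scale`, but `scale`
    was hard-coded to 1.0, so the multiply was a no-op and the
    simplified ROCm kernel drops it.
    """
    for i, idx in enumerate(indices):
        dst[idx] += values[i] * scale
    return dst

# Large-index-count case: many updates landing on the same slot.
dst = np.zeros(4)
out = index_put_accumulate(dst, [0, 0, 2, 0], [1.0, 2.0, 3.0, 4.0])
```

On the GPU, the hipified CUDA path vectorized these accumulations; the simplified ROCm kernel performs them non-vectorized, which measured ~2x faster for this workload.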

Pull Request resolved: pytorch#138259
Approved by: https://github.com/xw285cornell, https://github.com/leitian, https://github.com/eqy

This patch makes several changes to the stride 1 backwards indexing kernel as follows:
- enables the computation across the `sorted_indices` array to happen in parallel across all lanes of the warp, so the accesses to `sorted_indices` are now fully coalesced.
- the duplicate counting now happens in parallel: each lane in the warp counts the duplicates of a different `idx`.
- enables skipping during the duplicate count: for a large number of duplicates, this optimization lets us skip 32 values at a time to speed up the count.
- for a low number of duplicates (fewer than `warp-size`), we just perform the tail reduction, which avoids the wasteful parallel reduction across the warp (it would only add zero values).
- for a high number of duplicates (more than `warp-size`), we still use the full warp of lanes to compute the reduced value with as much parallelism as possible. This is done by making sure all lanes stick around and cooperatively execute the reduction when a single `idx` has a large number of duplicates (a duplicate spike). To enable this, shared memory is used to pass the duplicate count computed in parallel in the first part of the kernel to the cooperative reduction part of the kernel.
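The skip-by-warp duplicate count above can be modeled on the host (a Python sketch of the logic only, not the HIP kernel; `WARP_SIZE` and the function name are illustrative, and AMD wavefronts may be 64 lanes rather than 32):

```python
WARP_SIZE = 32  # illustrative; the kernel uses the hardware warp/wavefront size

def count_duplicates(sorted_indices, start):
    """Count the run length of sorted_indices[start] in a sorted array,
    skipping WARP_SIZE entries at a time while a whole window still
    matches (models the warp-wide skip optimization)."""
    idx = sorted_indices[start]
    n = len(sorted_indices)
    pos = start
    # Skip phase: because the array is sorted, if the last entry of the
    # next WARP_SIZE-wide window equals idx, the whole window does.
    while pos + WARP_SIZE <= n and sorted_indices[pos + WARP_SIZE - 1] == idx:
        pos += WARP_SIZE
    # Tail phase: count the remaining duplicates one at a time.
    while pos < n and sorted_indices[pos] == idx:
        pos += 1
    return pos - start
```

In the kernel, the returned count then selects the reduction strategy: runs shorter than the warp size take the serial tail reduction, while longer runs (duplicate spikes) keep all lanes around for the cooperative warp reduction, with the count handed over through shared memory.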

Benefits on examples extracted from workloads show a 3.6x to 10x speed-up.

co-author: Hashem Hashemi <[email protected]>

Pull Request resolved: pytorch#146420
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
@jerrymannil jerrymannil changed the title [release/2.5][ROCm] Indexing backward kernel improvements [release/2.5][ROCm] Indexing backward kernel improvements from mainline (Mutiple commits) Mar 1, 2025
@jerrymannil jerrymannil changed the title [release/2.5][ROCm] Indexing backward kernel improvements from mainline (Mutiple commits) [release/2.5][ROCm] Indexing backward kernel improvements from mainline (mutiple commits) Mar 1, 2025
@jerrymannil jerrymannil requested a review from pruthvistony March 1, 2025 00:05
@rocm-repo-management-api

rocm-repo-management-api bot commented Mar 1, 2025

Jenkins build for 9ea188fa4becbca088263584e67a1b68658939ca commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@pruthvistony pruthvistony merged commit c040e57 into ROCm:release/2.5 Mar 4, 2025
0 of 7 checks passed
@jerrymannil
Author

!cherry-pick --onto release/2.6

@rocm-mici

Created branch release/2.6_cherry-pick_pr-1937 and #1942

