
[release/2.5][ROCm] Indexing backward kernel improvements from mainline (multiple commits) #1937


Merged
merged 3 commits into ROCm:release/2.5 on Mar 4, 2025

Conversation

jerrymannil

No description provided.

xw285cornell and others added 3 commits February 28, 2025 23:08
Summary:
This slow path is bad because it has a sync point which makes the CPU really slow. I'm not sure whether AMD still needs it with newer ROCm versions.

{F1870213925}

Test Plan: CI

Differential Revision: D62731130

Pull Request resolved: pytorch#136136
Approved by: https://github.com/danzimm, https://github.com/jeffdaily, https://github.com/eqy
On ROCm, using a non-vectorized index_put kernel provides a ~2x perf improvement over the hipified CUDA kernel. None of the existing unit tests exercised the large-index case, so a new unit test was added.

It was also noted that the scale value in the original kernel was hard-coded to 1.0, making the multiply a no-op, so it was removed from the simplified ROCm kernel.
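The scatter-add behavior and the no-op scale can be illustrated with a small host-side model (a Python sketch, not the actual HIP kernel; the function name and signature are illustrative):

```python
import numpy as np

def index_put_accumulate(dst, indices, values, scale=1.0):
    """Host-side model of an accumulating index_put (scatter-add).

    The original kernel multiplied each value by `scale`, but `scale`
    was hard-coded to 1.0, so the multiply was a no-op and the
    simplified ROCm kernel drops it.
    """
    for i, idx in enumerate(indices):
        dst[idx] += values[i] * scale
    return dst

# Large-index-count case: many updates landing on the same slot.
dst = np.zeros(4)
out = index_put_accumulate(dst, [0, 0, 2, 0], [1.0, 2.0, 3.0, 4.0])
```

On the GPU, the hipified CUDA path vectorized these accumulations; the simplified ROCm kernel performs them non-vectorized, which measured ~2x faster for this workload.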

Pull Request resolved: pytorch#138259
Approved by: https://github.com/xw285cornell, https://github.com/leitian, https://github.com/eqy

This patch makes several changes to the stride 1 backwards indexing kernel as follows:
- enables the computation across the `sorted_indices` array to happen in parallel across all lanes of the warp, so the accesses to `sorted_indices` are now fully coalesced.
- the duplicate counting now happens in parallel: each lane in the warp counts the duplicates of a different `idx`.
- enables skipping during the duplicate count: for a large number of duplicates, this optimization lets us skip 32 values at a time to speed up the count.
- for a low number of duplicates (fewer than `warp-size`), we just perform the tail reduction, which avoids the wasteful parallel reduction across the warp (it would only add zero values).
- for a high number of duplicates (more than `warp-size`), we still use the full warp of lanes to compute the reduced value with as much parallelism as possible. This is done by making sure all lanes stick around and cooperatively execute the reduction when a single `idx` has a large number of duplicates (a duplicate spike). To enable this, shared memory is used to pass the duplicate count computed in parallel in the first part of the kernel to the cooperative reduction part of the kernel.
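The skip-by-warp duplicate count above can be modeled on the host (a Python sketch of the logic only, not the HIP kernel; `WARP_SIZE` and the function name are illustrative, and AMD wavefronts may be 64 lanes rather than 32):

```python
WARP_SIZE = 32  # illustrative; the kernel uses the hardware warp/wavefront size

def count_duplicates(sorted_indices, start):
    """Count the run length of sorted_indices[start] in a sorted array,
    skipping WARP_SIZE entries at a time while a whole window still
    matches (models the warp-wide skip optimization)."""
    idx = sorted_indices[start]
    n = len(sorted_indices)
    pos = start
    # Skip phase: because the array is sorted, if the last entry of the
    # next WARP_SIZE-wide window equals idx, the whole window does.
    while pos + WARP_SIZE <= n and sorted_indices[pos + WARP_SIZE - 1] == idx:
        pos += WARP_SIZE
    # Tail phase: count the remaining duplicates one at a time.
    while pos < n and sorted_indices[pos] == idx:
        pos += 1
    return pos - start
```

In the kernel, the returned count then selects the reduction strategy: runs shorter than the warp size take the serial tail reduction, while longer runs (duplicate spikes) keep all lanes around for the cooperative warp reduction, with the count handed over through shared memory.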

Benefits on examples extracted from workloads show a 3.6x to 10x speed-up.

co-author: Hashem Hashemi <[email protected]>

Pull Request resolved: pytorch#146420
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
@jerrymannil jerrymannil changed the title [release/2.5][ROCm] Indexing backward kernel improvements [release/2.5][ROCm] Indexing backward kernel improvements from mainline (Mutiple commits) Mar 1, 2025
@jerrymannil jerrymannil changed the title [release/2.5][ROCm] Indexing backward kernel improvements from mainline (Mutiple commits) [release/2.5][ROCm] Indexing backward kernel improvements from mainline (mutiple commits) Mar 1, 2025
@jerrymannil jerrymannil requested a review from pruthvistony March 1, 2025 00:05
@rocm-repo-management-api

rocm-repo-management-api bot commented Mar 1, 2025

Jenkins build for 9ea188fa4becbca088263584e67a1b68658939ca commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@pruthvistony pruthvistony merged commit c040e57 into ROCm:release/2.5 Mar 4, 2025
0 of 7 checks passed
@jerrymannil
Author

!cherry-pick --onto release/2.6

@rocm-mici

Created branch release/2.6_cherry-pick_pr-1937 and #1942

