[AUTOGENERATED] [release/2.6] [release/2.5][ROCm] Indexing backward kernel improvements from mainline #1942

Merged
merged 1 commit into from
Mar 11, 2025


@rocm-mici rocm-mici commented Mar 4, 2025

Cherry-pick of #1937.
The first 2 commits are already in the 2.6 branch.

This patch makes the following changes to the stride-1 indexing backward kernel:
- Traverses the `sorted_indices` array in parallel across all lanes in the warp, so accesses to `sorted_indices` are now fully coalesced.
- Counts duplicates in parallel: each lane in the warp counts the duplicates of a different `idx`.
- Enables skipping during the duplicate count: when there is a large number of duplicates, the count advances 32 values at a time to speed it up.
- For a low number of duplicates (fewer than `warp-size`), performs only the tail reduction. This avoids a wasteful parallel reduction across the warp, which would only add zero values in this case.
- For a high number of duplicates (more than `warp-size`), still uses the full warp of lanes to compute the reduced value with as much parallelism as possible. All lanes stay active and cooperatively execute the reduction when a single `idx` has a large number of duplicates (a duplicate spike). To make this possible, shared memory passes the duplicate counts computed in parallel in the first part of the kernel to the cooperative reduction part of the kernel.
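
The control flow described above can be modeled on the CPU. The following is a minimal Python sketch, not the actual HIP kernel: the function names `count_duplicates` and `index_backward_stride1`, the scalar `grads` list, and the fixed `WARP_SIZE = 32` are all assumptions made for illustration. It only mirrors the three ideas from the list: skipping a warp's worth of entries during the duplicate count, a tail-only reduction for short runs, and a lane-strided cooperative reduction for long runs.

```python
WARP_SIZE = 32  # assumed for illustration; the real warp size depends on the GPU


def count_duplicates(sorted_indices, start, warp_size=WARP_SIZE):
    """Length of the run of values equal to sorted_indices[start], from start."""
    n = len(sorted_indices)
    idx = sorted_indices[start]
    count = 0
    pos = start
    # Skip phase: the array is sorted, so if the entry warp_size - 1 positions
    # ahead still equals idx, every entry in between equals idx as well.
    while pos + warp_size - 1 < n and sorted_indices[pos + warp_size - 1] == idx:
        count += warp_size
        pos += warp_size
    # Tail phase: fewer than warp_size matching entries remain; count them one by one.
    while pos < n and sorted_indices[pos] == idx:
        count += 1
        pos += 1
    return count


def index_backward_stride1(sorted_indices, grads, num_rows, warp_size=WARP_SIZE):
    """Accumulate grads[i] into out[sorted_indices[i]] (scalar rows for brevity)."""
    out = [0.0] * num_rows
    for i in range(len(sorted_indices)):  # each i stands in for one lane
        # Only the first occurrence of an idx handles the whole run of duplicates.
        if i > 0 and sorted_indices[i - 1] == sorted_indices[i]:
            continue
        idx = sorted_indices[i]
        dups = count_duplicates(sorted_indices, i, warp_size)
        full = (dups // warp_size) * warp_size  # part covered by whole warps
        acc = 0.0
        if full > 0:
            # Duplicate spike: every lane sums a strided slice of the run,
            # then the per-lane partial sums are combined.
            partial = [0.0] * warp_size
            for lane in range(warp_size):
                for j in range(i + lane, i + full, warp_size):
                    partial[lane] += grads[j]
            acc = sum(partial)
        # Tail reduction: the only work needed when dups < warp_size.
        for j in range(i + full, i + dups):
            acc += grads[j]
        out[idx] += acc
    return out
```

In this model the outer loop is sequential, whereas the patch assigns each lane a different `idx` and passes the duplicate counts through shared memory; that synchronization is deliberately left out of the sketch.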

Examples extracted from workloads show a 3.6x to 10x speed-up.

co-author: Hashem Hashemi <[email protected]>

Pull Request resolved: pytorch#146420
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily

rocm-repo-management-api bot commented Mar 4, 2025

Jenkins build for 8448168b8c9cd6c2b8f1bc2db23b2b533875748b commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@jerrymannil jerrymannil requested a review from pruthvistony March 4, 2025 22:46

rocm-repo-management-api bot commented Mar 5, 2025

Jenkins build for 8448168b8c9cd6c2b8f1bc2db23b2b533875748b commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@pruthvistony
Collaborator

@jerrymannil ,
What is the reason to keep this in draft?

@jerrymannil jerrymannil marked this pull request as ready for review March 11, 2025 17:06
@jerrymannil jerrymannil merged commit 92b55d0 into release/2.6 Mar 11, 2025
0 of 4 checks passed
@jerrymannil jerrymannil deleted the release/2.6_cherry-pick_pr-1937 branch March 11, 2025 17:06
@jerrymannil jerrymannil changed the title from "[AUTOGENERATED] [release/2.6] [release/2.5][ROCm] Indexing backward kernel improvements from mainline (mutiple commits)" to "[AUTOGENERATED] [release/2.6] [release/2.5][ROCm] Indexing backward kernel improvements from mainline" Mar 11, 2025
4 participants