
[AUTOGENERATED] [release/2.6] [release/2.5] [ROCm] Indexing backward kernel improvements from mainline (multiple commits) #1942

Draft · wants to merge 1 commit into base: release/2.6
Conversation

@rocm-mici rocm-mici commented Mar 4, 2025

Cherry-pick of #1937
The first two commits are already in the 2.6 branch.


This patch makes several changes to the stride-1 backward indexing kernel:
- the scan across the `sorted_indices` array now happens in parallel across all lanes in the warp, so accesses to `sorted_indices` are fully coalesced.
- duplicate counting now happens in parallel: each lane in the warp counts the duplicates of a different `idx`.
- skipping is enabled during the duplicate count: when the number of duplicates is large, the count can skip 32 values at a time instead of stepping one element at a time.
- for a low number of duplicates (fewer than `warp-size`), only the tail reduction is performed. This avoids a wasteful parallel reduction across the warp, which would only add zero values in this case.
- for a high number of duplicates (more than `warp-size`), the full warp of lanes still computes the reduced value with as much parallelism as possible. All lanes stick around and cooperatively execute the reduction when a single `idx` has a very large number of duplicates (a duplicate spike). To make this possible, shared memory passes the duplicate count computed in parallel in the first part of the kernel to the cooperative reduction part of the kernel.

On examples extracted from workloads, these changes show a 3.6x to 10x speed-up.

co-author: Hashem Hashemi <[email protected]>

Pull Request resolved: pytorch#146420
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily

rocm-repo-management-api bot commented Mar 4, 2025

Jenkins build for 8448168b8c9cd6c2b8f1bc2db23b2b533875748b commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@jerrymannil jerrymannil requested a review from pruthvistony March 4, 2025 22:46

rocm-repo-management-api bot commented Mar 5, 2025

Jenkins build for 8448168b8c9cd6c2b8f1bc2db23b2b533875748b commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@pruthvistony
Collaborator

@jerrymannil, what is the reason to keep this in draft?
