[ROCm] Initial prototype for ck_tile sdpa FA backend #1592

alugorey · 2024-09-13T21:23:05Z

Initial prototype for the SDPA CK backend. Does not support an odd number of attention heads.

CK gemm header

Add templatization to ck kernel

rocm-mici · 2024-10-09T07:06:17Z

Jenkins build for 7be0d16ab01184a2ee6140ea2698af54113ea234 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

rocm-mici · 2024-10-09T07:07:00Z

Jenkins build for 7be0d16ab01184a2ee6140ea2698af54113ea234 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

rocm-mici · 2024-10-14T06:04:16Z

Jenkins build for 7be0d16ab01184a2ee6140ea2698af54113ea234 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

rocm-repo-management-api · 2024-12-11T22:21:43Z

Jenkins build for 7be0d16ab01184a2ee6140ea2698af54113ea234 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@tridao

Replaces ROCm#1592

This PR contains the initial implementation of SDPA with the composable_kernel (CK) backend. The CK path can be forced by calling `torch.backends.cuda.preferred_rocm_fa_library("ck")`. Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". If this option is not set, aotriton is used as the backend.

When CK is selected and PyTorch deems flash attention usable, the CK path is used in all the same places aotriton would have been used. This PR makes no changes to the heuristics that select which attention scheme to use (i.e. flash attention vs. memory-efficient attention vs. math, etc.). The CK path only gets called when flash attention is both enabled (via `USE_FLASH_ATTENTION`) and selected at runtime by the existing heuristics.

Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention, courtesy of the hard work of @tridao, who is the co-author.

NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch.

Pull Request resolved: #138947
Approved by: https://github.com/pruthvistony, https://github.com/xw285cornell, https://github.com/leitian
Co-authored-by: Xiaodong Wang <[email protected]>
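The backend selection described in this commit message can be sketched as follows. This is a minimal illustration, not code from the PR: the helper name `select_rocm_fa_backend` and its status strings are hypothetical, and the sketch assumes that `torch.backends.cuda.preferred_rocm_fa_library` accepts the strings "ck", "aotriton", and "default" as stated above. It only requests the library when PyTorch is installed, exposes that function, and is a ROCm build.

```python
import importlib.util

# Backend names taken from the commit message above.
VALID_BACKENDS = {"ck", "aotriton", "default"}


def select_rocm_fa_backend(name: str = "ck") -> str:
    """Hypothetical helper: request a ROCm flash-attention library.

    Returns a status string instead of raising when the environment
    cannot honour the request.
    """
    if name not in VALID_BACKENDS:
        raise ValueError(f"unknown backend {name!r}; expected one of {sorted(VALID_BACKENDS)}")

    # PyTorch may not be installed at all.
    if importlib.util.find_spec("torch") is None:
        return "torch-not-installed"
    try:
        import torch
        import torch.backends.cuda
    except ImportError:
        return "torch-not-installed"

    # Older PyTorch releases do not expose the selector.
    setter = getattr(torch.backends.cuda, "preferred_rocm_fa_library", None)
    if setter is None:
        return "api-unavailable"

    # torch.version.hip is None on non-ROCm builds.
    if getattr(torch.version, "hip", None) is None:
        return "not-a-rocm-build"

    # Assumption: a build without USE_CK_FLASH_ATTENTION=1 may ignore
    # the request for "ck" and keep using aotriton.
    setter(name)
    return f"requested:{name}"


if __name__ == "__main__":
    print(select_rocm_fa_backend("ck"))
```

Because the PR leaves the attention-selection heuristics unchanged, a request for "ck" only matters for calls where flash attention is enabled and chosen at runtime; other calls take the same path as before.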


@tridao

Replace #138947 for re-import.

Replaces ROCm#1592

This PR contains the initial implementation of SDPA with the composable_kernel (CK) backend. The CK path can be forced by calling torch.backends.cuda.preferred_rocm_fa_library("ck"). Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". If this option is not set, aotriton is used as the backend.

When CK is selected and PyTorch deems flash attention usable, the CK path is used in all the same places aotriton would have been used. This PR makes no changes to the heuristics that select which attention scheme to use (i.e. flash attention vs. memory-efficient attention vs. math, etc.). The CK path only gets called when flash attention is both enabled (via USE_FLASH_ATTENTION) and selected at runtime by the existing heuristics.

Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention, courtesy of the hard work of @tridao, who is the co-author.

NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch.

Pull Request resolved: #143695
Approved by: https://github.com/malfet
Co-authored-by: Andy Lugo <[email protected]>
Co-authored-by: Jithun Nair <[email protected]>


@tridao

Replace pytorch#138947 for re-import.

Replaces #1592

This PR contains the initial implementation of SDPA with the composable_kernel (CK) backend. The CK path can be forced by calling torch.backends.cuda.preferred_rocm_fa_library("ck"). Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". If this option is not set, aotriton is used as the backend.

When CK is selected and PyTorch deems flash attention usable, the CK path is used in all the same places aotriton would have been used. This PR makes no changes to the heuristics that select which attention scheme to use (i.e. flash attention vs. memory-efficient attention vs. math, etc.). The CK path only gets called when flash attention is both enabled (via USE_FLASH_ATTENTION) and selected at runtime by the existing heuristics.

Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention, courtesy of the hard work of @tridao, who is the co-author.

NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch.

Pull Request resolved: pytorch#143695
Approved by: https://github.com/malfet
Co-authored-by: Andy Lugo <[email protected]>
Co-authored-by: Jithun Nair <[email protected]>
(cherry picked from commit 0a94bb4)


jeffdaily and others added 8 commits June 26, 2024 23:45

add ck blas backend selector

b7a21fb

add composable_kernel submodule

9534797

copy bfloat16 gemm from fbgemm

851d4b5

CK gemm header (ROCm#1445)

9cbbb40

CK gemm header

use BLAS arg types for ck gemm kernel

ae6a64b

swap bf16 for float

9420e57

Ck template (ROCm#1447)

6cdf163

Add templatization to ck kernel

[ROCm] Initial prototype for ck_tile sdpa FA backend

7be0d16

pruthvistony force-pushed the rocm_gemm_ck branch from 3033c62 to 00abd95 Compare October 9, 2024 22:01

alugorey mentioned this pull request Oct 25, 2024

[ROCm] CK Flash Attention Backend pytorch/pytorch#138947

Closed

xw285cornell mentioned this pull request Dec 21, 2024

[ROCm] CK Flash Attention Backend pytorch/pytorch#143695

Closed
