
Refactoring attention.py part 1 #1542

Merged

Conversation

KshitijLakhani
Collaborator

@KshitijLakhani KshitijLakhani commented Mar 6, 2025

Description

attention.py has grown to more than 8,500 lines of code, which motivates this refactor. This PR is part 1 of a two-part effort to split attention.py into submodules for ease of development and testing.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Create a new module, dot_product_attention, containing the submodules inference.py, rope.py, and utils.py, and move the appropriate methods from attention.py into these three submodules. The details of these methods are in the list below for those interested (an import sketch follows the list):

List of Functions Moved/Added
  • rope.py (~200 lines of code)
    class RotaryPositionEmbedding(torch.nn.Module)
    class FusedRoPEFunc(torch.autograd.Function):
    def _rotate_half(x: torch.Tensor) -> torch.Tensor:
    def apply_rotary_pos_emb(

  • inference.py (estimated to be ~1000 lines of code when finished)
    class InferenceParams: # pylint: disable=too-few-public-methods

  • utils.py (~1500 lines of code)
    class AttentionParams: amenable to a different class; unsure whether it belongs in utils.py
    def get_attention_backend(
    def get_full_mask(
    def get_alibi(
    def get_cu_seqlens(mask: torch.Tensor) -> torch.Tensor:
    def get_cu_seqlens_and_indices(mask: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
    def get_indices(max_seqlen: int, cu_seqlens: torch.Tensor) -> torch.Tensor:
    def _get_full_cu_seqlens(
    def pack_tensor(
    def pack_2_tensors(
    def pack_3_tensors(
    def unpack_tensor(
    def unpack_2_tensors(
    def unpack_3_tensors(
    class PackTensors(torch.autograd.Function):
    class UnpackTensor(torch.autograd.Function):
    def get_qkv_layout(
    def check_set_window_size(
    def get_attention_quantizers - Added new function
    Create AttentionLogging class - Added new class
    Create FlashAttentionUtils class - Added new class
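
For orientation, the sketch below shows how the relocated pieces can be imported under the new layout. The package path matches the imports in this PR's diff; the aliases and the exact set of importable names are illustrative.

# Illustrative imports under the new dot_product_attention layout; aliases are
# arbitrary, and the available names follow the list above.
import transformer_engine.pytorch.dot_product_attention.utils as dpa_utils
from transformer_engine.pytorch.dot_product_attention.inference import InferenceParams
from transformer_engine.pytorch.dot_product_attention.rope import (
    RotaryPositionEmbedding,
    apply_rotary_pos_emb,
)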

Notable function signature changes (comments have been added for these in the respective functions):

  1. get_attention_backend() - This no longer populates the global _attention_backends cache; the responsibility for populating _attention_backends now rests with the caller of this function (see the sketch below this list).
  2. get_alibi() - This now accepts the global _alibi_cache as a function parameter from the caller and reads from and writes to that cache from within the function.
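
A hypothetical sketch of the new calling pattern described above; the signatures, return values, cache keying, and get_alibi() parameter names shown here are assumptions for illustration, not the PR's exact code.

import transformer_engine.pytorch.dot_product_attention.utils as dpa_utils

_attention_backends = {}  # global cache, now maintained by the caller
_alibi_cache = {}         # global cache, now passed in explicitly

def select_backend(attention_params):
    # get_attention_backend() no longer writes to _attention_backends itself;
    # the caller stores whatever it returns (keying scheme is illustrative).
    backend_info = dpa_utils.get_attention_backend(attention_params)
    _attention_backends["backend_info"] = backend_info
    return backend_info

# get_alibi() now takes the cache as an argument and reads/writes it internally, e.g.:
# alibi_bias = dpa_utils.get_alibi(_alibi_cache, num_heads, max_seqlen_q, max_seqlen_kv)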

TODO: The refactoring part 2 PR will create new modules/submodules for MultiHeadAttention.py and context parallelism, move some more generic PyTorch utility functions to pytorch/utils.py, and move whatever is left of attention.py into dot_product_attention.
Additionally, the part 2 PR will address any larger changes suggested during this part 1 PR's review.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@KshitijLakhani KshitijLakhani self-assigned this Mar 6, 2025
@KshitijLakhani KshitijLakhani force-pushed the klakhani/maint/refactor-pyt-attn-1 branch 2 times, most recently from 4a6ac72 to 9063a52 on March 9, 2025 19:20
@ptrendx ptrendx requested a review from cyanguwa March 10, 2025 21:45
@cyanguwa cyanguwa added the 2.2.0 label Mar 10, 2025
@cyanguwa
Collaborator

cyanguwa commented Mar 11, 2025

I wonder if it makes sense to leave those flash-attn version checks/imports as is, for now. We'd have to think of a cleaner way to do it, but I feel FlashAttentionUtils might not cut it.

Also, could you have a look at the usage of inference.py and rope.py, see if it makes more sense to keep them in dot_product_attention/ or outside that directory? Thanks!
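
For readers unfamiliar with what those checks look like, below is a generic sketch of version gating centralized in one utility class; it is not the FlashAttentionUtils implementation added in this PR, and the attribute names and version threshold are assumptions.

from importlib.metadata import PackageNotFoundError, version
from packaging.version import Version

class FlashAttentionUtils:
    # Hypothetical body: probe the installed flash-attn version once and expose
    # simple feature gates instead of scattering checks across attention.py.
    try:
        flash_attn_version = Version(version("flash-attn"))
        is_installed = True
    except PackageNotFoundError:
        flash_attn_version = Version("0")
        is_installed = False
    # The threshold below is illustrative only.
    use_flash_attn_2 = is_installed and flash_attn_version >= Version("2.0.0")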

@KshitijLakhani KshitijLakhani force-pushed the klakhani/maint/refactor-pyt-attn-1 branch 3 times, most recently from 2b3e300 to c65f750 on March 12, 2025 17:07
@KshitijLakhani KshitijLakhani marked this pull request as ready for review March 12, 2025 17:11
@KshitijLakhani
Collaborator Author

/te-ci pytorch L0 L1 L2 L3

@KshitijLakhani KshitijLakhani changed the title from "Refactoring attention.py" to "Refactoring attention.py part 1" on Mar 12, 2025
@yaox12
Collaborator

yaox12 commented Mar 13, 2025

Also, could you have a look at the usage of inference.py and rope.py, see if it makes more sense to keep them in dot_product_attention/ or outside that directory? Thanks!

I'm wondering whether we should create a folder named something like transformer_engine.pytorch.functional (similar to torch.nn.functional) for operators (here I mean what we usually use as autograd functions, not through layers or modules), and move RoPE, softmax, permutation, and cross_entropy into it. What do you think? @cyanguwa
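
For concreteness, a hypothetical layout under that suggestion (nothing below exists in this PR; file names are illustrative):

te/pytorch:
  - functional/
     - rope.py: apply_rotary_pos_emb and related ops
     - softmax.py
     - permutation.py
     - cross_entropy.py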

@KshitijLakhani KshitijLakhani force-pushed the klakhani/maint/refactor-pyt-attn-1 branch from b2fa7e2 to 8e774d8 on March 13, 2025 17:19
Collaborator

I think this probably is a good structure for our finished refactoring (maybe not now):

te/pytorch:
  - transformer.py: TELayer (~800 LOC)
  - multihead_attention.py: MHA (~800 LOC)
  - attention/
     - attention.py: DPA (~1000 LOC)
     - backends.py: FusedAttention, FlashAttention, UnfusedDPA (~1200 LOC)
     - context_parallel.py: P2P, A2A, AllGather, attn_with_cp (~3300 LOC)
     - softmax.py: only used in attention.py IIUC
     - utils.py: ~1600 LOC, but can move _SplitAlongDim, _combine_tensors to te/pytorch/utils.py (they can probably be merged with noop_cat, need investigation), and can condense PackTensors/UnpackTensors related funcs (one pack_tensors func instead of 3; see the sketch after this layout)
  - rope.py: ~200 LOC, but might get longer (a couple of PRs in the pipeline)
  - inference.py: ~50 LOC, but will get longer (PR 1355, 800 LOC)
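
On the utils.py note about condensing the pack/unpack helpers, here is a minimal sketch of a single variadic pack_tensors, assuming the existing helpers gather rows of each tensor by an index tensor; the signature and gather-based body are assumptions rather than the PR's code.

from typing import Tuple
import torch

def pack_tensors(indices: torch.Tensor, *tensors: torch.Tensor) -> Tuple[torch.Tensor, ...]:
    """Gather the rows selected by `indices` from each [total_rows, ...] tensor,
    replacing the separate 1-/2-/3-tensor packing variants with one helper."""
    packed = []
    for t in tensors:
        # Expand the (int64) index vector to match t's trailing dims so that
        # torch.gather selects whole rows along dim 0.
        idx = indices.view(-1, *([1] * (t.dim() - 1))).expand(-1, *t.shape[1:])
        packed.append(torch.gather(t, 0, idx))
    return tuple(packed)

# e.g. q_packed, k_packed, v_packed = pack_tensors(valid_indices, q, k, v)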

Collaborator Author

@KshitijLakhani KshitijLakhani Mar 13, 2025


Sounds good.
Will address it in the part 2 PR

Collaborator

I feel that if we're keeping multihead_attention.py outside, we should rename the dir structure as follows:

te/pytorch:
  - transformer.py: TELayer (~800 LOC)
  - multihead_attention.py: MHA (~800 LOC)
  - dot_product_attention/
     - dot_product_attention.py: DPA (~1000 LOC)
     - backends.py: FusedAttention, FlashAttention, UnfusedDPA (~1200 LOC)
     - context_parallel.py: P2P, A2A, AllGather, attn_with_cp (~3300 LOC)
     - softmax.py: only used in attention.py IIUC
     - utils.py: ~1600 LOC, but can move _SplitAlongDim, _combine_tensors to te/pytorch/utils.py (they can probably be merged with noop_cat, need investigation), and can condense PackTensors/UnpackTensors related funcs (one pack_tensors func instead of 3)
  - rope.py: ~200 LOC, but might get longer (a couple of PRs in the pipeline)
  - inference.py: ~50 LOC, but will get longer (PR 1355, 800 LOC)

Or keep multihead_attention.py inside attention/ as follows. Keeping m_h_a.py as a top-level file alongside an attention/ directory that contains attention/attention.py could be misleading.

te/pytorch:
  - transformer.py: TELayer (~800 LOC)
  - attention/
     - multihead_attention.py: MHA/GQA/MLA
     - dot_product_attention.py: DPA (~1000 LOC)
     - backends.py: FusedAttention, FlashAttention, UnfusedDPA (~1200 LOC)
     - context_parallel.py: P2P, A2A, AllGather, attn_with_cp (~3300 LOC)
     - softmax.py: only used in attention.py IIUC
     - utils.py: ~1600 LOC, but can move _SplitAlongDim, _combine_tensors to te/pytorch/utils.py (they can probably be merged with noop_cat, need investigation), and can condense PackTensors/UnpackTensors related funcs (one pack_tensors func instead of 3)
  - rope.py: ~200 LOC, but might get longer (a couple of PRs in the pipeline)
  - inference.py: ~50 LOC, but will get longer (PR 1355, 800 LOC)

The latter looks cleaner to me, if it doesn't break more than the former.

Collaborator

@cyanguwa cyanguwa Mar 14, 2025


The former looks better to me; the latter option's attention/ is a bit too chunky. But let's discuss this in Part 2, together with where rope.py, softmax.py, and inference.py should go.

@KshitijLakhani KshitijLakhani force-pushed the klakhani/maint/refactor-pyt-attn-1 branch from 7dbe271 to b0ee442 on March 13, 2025 23:37
@KshitijLakhani
Collaborator Author

/te-ci pytorch L0 L1 L2 L3

Collaborator

@sudhakarsingh27 sudhakarsingh27 left a comment


Overall lgtm since this is just setting up for Part 2


@KshitijLakhani
Collaborator Author

/te-ci pytorch L0 L1 L2 L3

# See LICENSE for license information.

"""
Rotary Position Embedding implementation of different types along with hlper functions
Collaborator

Nit: "helper"

_NVTE_FLASH_ATTN = int(os.getenv("NVTE_FLASH_ATTN", "1"))


# ----Helper/Util classes and methods-----
Collaborator

Nit: maybe remove these comments? L57 and L49? It doesn't look like we make these comments elsewhere in utils.py.

Collaborator

@cyanguwa cyanguwa left a comment


LGTM. After all CI passes, you can merge.

@@ -87,71 +74,45 @@
restore_from_saved,
)

# Import attention utils
import transformer_engine.pytorch.dot_product_attention.utils as dpa_utils
import transformer_engine.pytorch.dot_product_attention.inference as dpa_infer
Collaborator

Nit: could probably do "from transformer_engine.pytorch.dot_product_attention.inference import InferenceParams"?

@cyanguwa
Collaborator

Also, could you have a look at the usage of inference.py and rope.py, see if it makes more sense to keep them in dot_product_attention/ or outside that directory? Thanks!

I'm wondering whether we should create a folder named something like transformer_engine.pytorch.functional (similar to torch.nn.functional) for operators (here I mean what we usually use as autograd functions, not through layers or modules), and move RoPE, softmax, permutation, and cross_entropy into it. What do you think? @cyanguwa

Let's have a think about this in Refactoring Part 2. @KshitijLakhani

KshitijLakhani and others added 4 commits March 14, 2025 13:19
Move attention logging into a separate class in pytorch/d_p_a/utils.py

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>
…ning info

Move versioning info out of pytorch/attention.py

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>
…_p_a/utils.py

Fix tests and imports for the above refactor change

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>
KshitijLakhani and others added 22 commits March 14, 2025 13:19
…antizers() to d_p_a/utils.py

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>
….py to d_p_a/utils.py

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>
…d_p_a/utils.py

Rename cumulative functions from using _cu_ to using _cumul_ to differentiate them from the CUDA cu* call naming convention
Rename tensor packing methods with a leading underscore to mark them as internal to the file

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>
….py to it

Modify tests and other files to import InferenceParams correctly

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

Modify docs api for InferenceParams

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>
Signed-off-by: Kshitij Janardan Lakhani <[email protected]>
Code clean up

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>
Code clean up

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>
Signed-off-by: Kshitij Janardan Lakhani <[email protected]>
Use attn_log instead of att_log

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

Fix lint error

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>
@KshitijLakhani KshitijLakhani force-pushed the klakhani/maint/refactor-pyt-attn-1 branch from 9654cb8 to d0bed1c on March 14, 2025 20:22
@KshitijLakhani
Collaborator Author

/te-ci pytorch L0 L1 L2 L3

@KshitijLakhani KshitijLakhani merged commit 3733947 into NVIDIA:main Mar 14, 2025
11 checks passed