Conversation

@pbielak (Collaborator) commented Oct 23, 2025

What does this PR do?

After the Transformers 4.55 update, one of the attention classes failed to compute the attention scores due to an argument mismatch in the `torch.matmul` op. This PR updates the whole `Mllama` code base to be fully aligned with the code in Transformers 4.55. In particular, it uses `_attn_implementation` instead of custom attention classes.
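
For context, here is a minimal sketch of the dispatch pattern this aligns with: a single attention module looks up its forward function in a registry keyed by `config._attn_implementation`, instead of instantiating a separate attention class per backend. The registry and function names below are illustrative, not the actual Transformers or Optimum Habana code.

```python
from typing import Callable, Dict, Optional

import torch
import torch.nn.functional as F


def eager_attention_forward(query, key, value, attention_mask=None, scaling=None):
    # Plain matmul -> softmax -> matmul attention; tensors are (bsz, num_heads, seq_len, head_dim).
    scaling = query.shape[-1] ** -0.5 if scaling is None else scaling
    attn_weights = torch.matmul(query, key.transpose(-2, -1)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask
    attn_weights = F.softmax(attn_weights, dim=-1)
    return torch.matmul(attn_weights, value), attn_weights


def sdpa_attention_forward(query, key, value, attention_mask=None, scaling=None):
    # Fused scaled-dot-product attention; does not return attention weights.
    attn_output = F.scaled_dot_product_attention(
        query, key, value, attn_mask=attention_mask, scale=scaling  # `scale` needs PyTorch >= 2.1
    )
    return attn_output, None


# Illustrative registry; Transformers keeps a similar mapping internally.
ATTENTION_FUNCTIONS: Dict[str, Callable] = {
    "eager": eager_attention_forward,
    "sdpa": sdpa_attention_forward,
}


class AttentionLayer(torch.nn.Module):
    """One attention class; the backend is picked from the config, not by subclassing."""

    def __init__(self, config):
        super().__init__()
        self.config = config

    def forward(self, query, key, value, attention_mask: Optional[torch.Tensor] = None):
        attn_fn = ATTENTION_FUNCTIONS[self.config._attn_implementation]
        return attn_fn(query, key, value, attention_mask=attention_mask)
```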

@github-actions

The code quality check failed, please run make style.

@pbielak pbielak force-pushed the dev/pbielak/update-mllama-implementation branch 2 times, most recently from 9710363 to a270769 on October 23, 2025 11:27
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@pbielak pbielak self-assigned this Oct 23, 2025
@pbielak pbielak force-pushed the dev/pbielak/update-mllama-implementation branch 2 times, most recently from f5c5287 to 939e520 on October 28, 2025 10:19
@github-actions

The code quality check failed, please run make style.

@pbielak pbielak force-pushed the dev/pbielak/update-mllama-implementation branch from 939e520 to 720f6a7 on October 28, 2025 10:22
After the Transformers 4.55 update, one of the attention classes failed
to compute the attention scores due to an argument mismatch in
the `torch.matmul` op. This commit updates the whole `Mllama` code base
to be fully aligned with the code in Transformers 4.55. In particular, it:
- uses `_attn_implementation` instead of custom attention classes,
- applies the changes from PR [1],
- handles `_attn_implementation` passed to the model,
- fixes argument preparation in `gaudi_fused_sdpa_attention`.

[1] huggingface/transformers#40083
@pbielak pbielak force-pushed the dev/pbielak/update-mllama-implementation branch from 720f6a7 to 79a9ebc on October 28, 2025 13:02
@pbielak pbielak marked this pull request as ready for review October 28, 2025 13:02
@pbielak pbielak requested a review from regisss as a code owner October 28, 2025 13:02
]
args.prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

if model_type == "mllama" and args.use_flash_attention:

Collaborator:
Why not just allow the user to select `attn_implementation` and add a README section about it?

@pbielak (Collaborator, Author):
This is done because this script is also used by other models (such as llava), which are not yet aligned with the `attn_implementation` interface.
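
For reference, a sketch of how the suggestion could look once the other models are aligned as well; the `--attn_implementation` flag and the `model_type` gate are hypothetical, not part of this PR:

```python
import argparse

from transformers import AutoConfig, AutoModelForVision2Seq

parser = argparse.ArgumentParser()
parser.add_argument("--model_name_or_path", default="meta-llama/Llama-3.2-11B-Vision-Instruct")
parser.add_argument(
    "--attn_implementation",
    choices=["eager", "sdpa"],
    default="sdpa",
    help="Attention backend forwarded to from_pretrained (hypothetical flag).",
)
args = parser.parse_args()

config = AutoConfig.from_pretrained(args.model_name_or_path)

if config.model_type == "mllama":
    # Mllama follows the attn_implementation interface after this PR.
    model = AutoModelForVision2Seq.from_pretrained(
        args.model_name_or_path,
        attn_implementation=args.attn_implementation,
    )
else:
    # Other models served by this script (e.g. llava) keep their current attention path.
    model = AutoModelForVision2Seq.from_pretrained(args.model_name_or_path)
```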

) -> tuple[torch.Tensor, None]:
    bsz, num_heads, tgt_len, head_dim = query.shape

    softmax_mode = "fast" if os.getenv("FLASH_ATTENTION_FAST_SOFTMAX") == "1" else "None"

Collaborator:
Since the attention implementation is now separated from the model, we don't have to use env vars. Can you explore whether it's possible to use kwargs instead?

@pbielak (Collaborator, Author):
Makes sense - I will have a look at it
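
A possible direction for that, as a rough sketch: expose `softmax_mode` as a keyword argument on the fused-SDPA wrapper and keep the env var only as a backward-compatible fallback. The signature below is hypothetical and the HPU fused kernel call is replaced by a stand-in; the real `gaudi_fused_sdpa_attention` arguments may differ.

```python
import os
from typing import Optional

import torch
import torch.nn.functional as F


def gaudi_fused_sdpa_attention(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    softmax_mode: Optional[str] = None,  # hypothetical kwarg replacing FLASH_ATTENTION_FAST_SOFTMAX
    **kwargs,
) -> tuple[torch.Tensor, None]:
    # Prefer the explicit kwarg; fall back to the env var only if the caller did not set it.
    if softmax_mode is None:
        softmax_mode = "fast" if os.getenv("FLASH_ATTENTION_FAST_SOFTMAX") == "1" else "None"

    # Stand-in for the HPU fused kernel; `softmax_mode` would be forwarded to that kernel.
    attn_output = F.scaled_dot_product_attention(query, key, value, attn_mask=attention_mask)
    return attn_output, None
```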
