[Gemma3] compile ✨ #37447
Conversation
```python
# equivalent to: `attention_mask = attention_mask[:, :, :, offset : offset + effective_seq_len]`,
# but without data-dependent slicing (i.e. torch.compile friendly)
mask_indexes = torch.arange(
    min(effective_seq_len, attention_mask.shape[-1]), device=attention_mask.device
)
mask_indexes += offset
attention_mask = attention_mask[:, :, :, mask_indexes]
```
Core change for the PR. `attention_mask = attention_mask[:, :, :, offset : offset + effective_seq_len]` requires either passing an integer in the signature to build `offset` (the previous solution, which triggers recompilation at each forward 🚫) or doing data-dependent slicing with `offset` as a tensor (which crashes compile 🚫).
The solution is to:
- build an `arange` from shapes ✅ (we can use shapes to create compile-compatible arrays on the fly, as opposed to using arbitrary tensors to create tensors)
- add a tensor (`offset`) to a tensor (the fixed-shape array) ✅
- slice a tensor (the attention mask) with another tensor (the offset-shifted fixed-shape array) ✅
(Note: at first I tried `torch.roll` + fixed-shape slicing, but `torch.roll` doesn't support the argument `shifts=offset` when `offset` is a tensor; `shifts` has to be an integer 😢)
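For readers following along, here is a self-contained sketch of the pattern described above (function and variable names are illustrative, not the actual Gemma3 code):

```python
import torch

def slice_window(attention_mask, offset, effective_seq_len):
    # same trick as above: an arange built from shapes, shifted by a tensor offset,
    # then used to index the mask -- no data-dependent slicing
    mask_indexes = torch.arange(
        min(effective_seq_len, attention_mask.shape[-1]), device=attention_mask.device
    )
    mask_indexes = mask_indexes + offset
    return attention_mask[:, :, :, mask_indexes]

compiled_slice = torch.compile(slice_window, fullgraph=True)
mask = torch.ones(1, 1, 1, 16)
# offset changes across calls, but stays a tensor -> no recompilation expected
print(compiled_slice(mask, torch.tensor(4), 8).shape)  # torch.Size([1, 1, 1, 8])
print(compiled_slice(mask, torch.tensor(5), 8).shape)  # torch.Size([1, 1, 1, 8])
```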
```python
mask_indexes = torch.arange(
    min(effective_seq_len, attention_mask.shape[-1]), device=attention_mask.device
)
mask_indexes += offset
```
Are you sure this is CUDA graph compatible?
Yes, see e.g. the scripts at the top of the PR header. Also, see this comment explaining why :D
super nice
```diff
-# TODO (joao): dive deeper into gemma2 and paligemma -- there are reports of speed loss with compilation. Revert
-# ALL changes from the PR that commented the line below when reactivating it.
-# is_compileable = True
+is_compileable = True
```
Nice! Can we update the cache to also init the layers lazily, like we do for the HybridChunked cache?
@ArthurZucker `HybridChunkedCache` only works if we don't compile the first forward pass; `HybridCache` works regardless of whether we compile the first forward pass or not. `torch._dynamo.mark_static_address` can't be called inside `torch.compile`, which is what lazy init does.
This means that if a user creates their own custom code with `HybridChunkedCache`, they can't simply compile the forward pass. If anything, `HybridChunkedCache` should move away from lazy init :P
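As a rough illustration of the constraint being described (simplified toy code, not the actual cache implementation; it assumes `torch._dynamo.mark_static_address` is available in your torch version):

```python
import torch

batch_size, num_heads, sliding_window, head_dim = 1, 2, 8, 4

# eager (non-lazy) init: the buffer exists, and is marked as a static address,
# *before* any compiled code runs
key_cache = torch.zeros(batch_size, num_heads, sliding_window, head_dim)
torch._dynamo.mark_static_address(key_cache)

def decode_step(key_cache, new_key, cache_position):
    # in-place update of the pre-allocated buffer is fine inside torch.compile
    key_cache.index_copy_(2, cache_position, new_key)
    return key_cache

compiled_step = torch.compile(decode_step)
compiled_step(key_cache, torch.randn(batch_size, num_heads, 1, head_dim), torch.tensor([3]))

# a lazy cache would instead allocate (and call mark_static_address) inside the
# first forward pass, which is exactly what can't run under torch.compile
```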
Chatted offline:
- lazy init is needed for TP
- however, lazy init is incompatible with compiling the first forward pass (prefill). Lazy init + `@torch.compiler.disable()` doesn't solve it either
- solution: add a new flag `lazy_init = None`. If `torch.distributed` is initialized and the flag is unset, then it will be `True`
- apply this change to ALL caches -> ALL caches compatible with TP + no non-TP drawbacks
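A hypothetical sketch of what such a flag could look like (class, attribute, and shape names are made up for illustration; this is not the agreed-upon API):

```python
from typing import Optional

import torch
import torch.distributed as dist


class ToyHybridCache:
    """Hypothetical sketch of the proposed `lazy_init` flag -- not the real transformers API."""

    def __init__(self, lazy_init: Optional[bool] = None):
        if lazy_init is None:
            # default: only go lazy when TP (torch.distributed) is actually in use,
            # so single-device users keep the compile-friendly eager init
            lazy_init = dist.is_available() and dist.is_initialized()
        self.lazy_init = lazy_init
        self.key_cache = None if lazy_init else self._allocate()

    def _allocate(self):
        cache = torch.zeros(1, 2, 8, 4)  # (batch, heads, sliding_window, head_dim), toy sizes
        torch._dynamo.mark_static_address(cache)  # must happen outside torch.compile
        return cache

    def update(self, new_key, cache_position):
        if self.key_cache is None:
            # lazy path: the first call allocates, so the first forward can't be compiled
            self.key_cache = self._allocate()
        self.key_cache.index_copy_(2, cache_position, new_key)
        return self.key_cache
```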
Nice, thanks for the detailed investigation! Do you think we can add a slow test to compare generation time with compile, or include `HybridCache` in the benchmarks board, so we don't accidentally introduce graph breaks? Given that Gemma3 is a high-usage model and supports only the Hybrid cache, I think it's important not to break it.
```python
mask_indexes = torch.arange(
    min(effective_seq_len, attention_mask.shape[-1]), device=attention_mask.device
)
mask_indexes += offset
```
super nice
Hey! Super nice work, thanks a lot! Just the offset is wrong by 1; I added a detailed comment about that.
Also, as I assume you're already aware, this only works under the assumption of a single prefill step with more than 1 token, followed by decoding steps of 1 token each. This was bothering me for some time, I must say (what about context caching???), but I never took the time to change it to make it more general and foolproof. However, with Llama4, as I added prefill chunking, I had to make it work for any situation and any number of new tokens at any time. This is a nice precedent that I think we should use to now make the sliding caches more general.
However, this should probably be a separate PR; if you plan on working on it at some point, let me know!! 🤗
```diff
-offset = last_cache_position - effective_seq_len
+offset = cache_position[-1] - effective_seq_len
```
Very important detail: here it should actually be `offset = cache_position[-1] + 1 - effective_seq_len`. Or equivalently, `offset = cache_position[0] - sliding_window + 1` (perhaps more understandable, and more general if we want to extend the behavior later). The idea being that `last_cache_position` should in fact be the number of total processed tokens, not the final position (as positions start at 0). The comments were wrong, but the code was correct (it used to take the shape of the attention mask, which is the length, not the last index).
Also, note that this only works for one prefill step followed by decoding-only steps. It will fail in general with prefill chunking, or e.g. prefill caching (if the cache is already "full", i.e. we processed more than `sliding_window` tokens, and we want to do a forward with more than 1 new token, e.g. a new conversation turn). For the fully general case, see the work I did in Llama4 here, as well as the cache that goes with it here. `HybridCache` and `HybridChunkedCache` are fully equivalent in the tokens they return and the necessary mask offsets; `HybridChunkedCache` is just more general, as it can always handle an arbitrary number of input tokens and, based on its state, return the necessary past states. The only difference is then how Llama4 creates the mask from those states (chunked block vs sliding lower diagonal). But from a cache-logic point of view, they are fully equivalent (I first modified `HybridCache`, but then we decided to create a new one for now).
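To make the off-by-one concrete, a small numeric check (the numbers are arbitrary):

```python
import torch

sliding_window = 4                   # also the effective_seq_len during decoding
cache_position = torch.tensor([10])  # decoding the token at absolute position 10
effective_seq_len = sliding_window

# correct offset: the window ends at (and includes) the current position
offset = cache_position[-1] + 1 - effective_seq_len
window = torch.arange(effective_seq_len) + offset
print(window)  # tensor([ 7,  8,  9, 10])

# equivalent form from the comment above
assert offset == cache_position[0] - sliding_window + 1

# off-by-one version (missing the +1): the current position 10 is dropped
print(torch.arange(effective_seq_len) + cache_position[-1] - effective_seq_len)  # tensor([6, 7, 8, 9])
```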
> Very important detail: here it should actually be `offset = cache_position[-1] + 1 - effective_seq_len`. Or equivalently, `offset = cache_position[0] - sliding_window + 1` (perhaps more understandable, and more general if we want to extend behavior later). The idea being `last_cache_position` should in fact be the number of total processed tokens, not the final position (as it starts including 0). The comments are wrong, but the code was correct (it used to take the shape of the attention mask, which is the length, not last index).
Hehe, good point. I replaced the slicing according to the local comments and didn't double-check what was being fed into the `last_cache_position` variable 👍
> [working with prefill chunking / more than one input token]
Also a good point. Let's open a separate PR for it, to avoid bloating this PR. Having compile working on the base case is already very valuable for the community.
@zucchini-nlp Benchmarks are hard to turn into a test: different devices/versions -> different speeds, and they need multiple runs to avoid being flaky 💔 However, we should test that compilation only happens once for the entire forward pass, i.e. that there are no graph breaks nor recompilations. I'm going to explore the torch docs to see if we can add this as a test in this PR. At the moment, we are often doubting the quality of our compiled forward passes, which is not great.
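One possible shape for such a check, as a sketch only (it assumes `fullgraph=True` for graph breaks and `torch._dynamo.config.error_on_recompile` for recompilations; the toy compiled function below is not the model forward):

```python
import torch

torch._dynamo.reset()
torch._dynamo.config.error_on_recompile = True  # raise if any call triggers a recompilation

def sliced_forward(x, offset):
    # toy stand-in for a model forward using the tensor-offset slicing trick
    return x[:, torch.arange(4) + offset]

compiled_forward = torch.compile(sliced_forward, fullgraph=True)  # fullgraph=True errors on graph breaks

x = torch.randn(1, 16)
for step in range(3):
    # the changing offset is a tensor, so this should stay on a single compiled graph
    compiled_forward(x, torch.tensor([step]))
```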
@gante yeah, we can do something similar to what diffusers is trying to do (continuing yesterday's thread discussions). Fine by me, as long as we check that there are no recompilations at every step.
@Cyrilvallez off-by-1 comment addressed 👍 LMK if you'd like any further changes.
@Cyrilvallez @zucchini-nlp There are three follow-up items to this PR; each will be a separate PR.
LGTM!! 🤗 Thanks a lot, super glad to have compile back and to simplify by removing the extra arg that was introduced as well! 💛
What does this PR do?
Enables compilation on Gemma3 (and re-enables it on Gemma2 / Cohere2).
Reverts #36620
Supersedes #37433 (solves the same problem, but this PR is much cleaner)
Performance
Measured on an RTX 4090, excluding compile warmup time:
- `main` (not compiled) -> 2.39s this PR
- `main` (not compiled) -> 2.18s this PR

Tests
- `main` -> need to be revisited
- on `main`, `tests/models/gemma3/test_modeling_gemma3.py::Gemma3Vision2TextModelTest::test_eager_matches_sdpa_generate` gets fixed in this PR

Post-mortem: How did we break compile on Gemma 2?
- According to `git bisect`, compilation first "breaks" in the PR where the cache is initialized on the `meta` device (Init cache on meta device #35164). "Break" here doesn't mean "crash", but rather "becomes very slow". Curiously, this change doesn't slow down `StaticCache` + `llama` (why?), so it flew under the radar when we benchmarked before merging. Nevertheless, that specific PR has been reverted ([Cache] Don't initialize the cache on `meta` device #36543).
- The replacement solution was not `torch.compile` friendly: `forward` now has an `int` argument that is different at each forward pass at generation time, causing recompilation (reference). The changes in this PR work around this issue.
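For illustration, a minimal repro of the recompilation pattern described in the second point (toy code, not the actual modeling code):

```python
import torch

@torch.compile
def forward(hidden_states, last_cache_position: int):
    # the Python int ends up baked into guards/graph, so a value that changes at
    # every generation step can force repeated recompilations (exact behavior
    # depends on the torch version and dynamic-shape settings)
    return hidden_states[:, last_cache_position - 4 : last_cache_position]

hidden_states = torch.randn(1, 32, 8)
for pos in range(8, 12):
    forward(hidden_states, pos)

# the workaround in this PR keeps the offset as a tensor and expresses the slice
# with `torch.arange(...) + offset`, so a single compiled graph can be reused
```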