
Add MLA #278

Open · zzhhjjj wants to merge 10 commits into main

Conversation

zzhhjjj (Collaborator) commented on Feb 5, 2025:

Add MLA to Nanotron.

Compared MLA with GQA on 25B tokens:
    1. LM loss:
        MLA: 2.58, GQA: 2.51 (a 0.07 difference)
    2. Throughput:
        31279 vs. 27466 tokens/s/GPU; MLA reaches about 87% of GQA's end-to-end throughput, which is expected given the MLA structure
    3. KV cache (per token, per layer; see the sketch below):
        GQA: hidden dim 4096, 32 heads, 8 KV heads -> 2048 elements for keys and values
        MLA: kv_lora_rank 512 + qk_rope_head_dim 64 -> 576 elements cached in total
        Roughly 4x less KV cache
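For reference, a back-of-the-envelope sketch (Python) of the per-token, per-layer cache widths quoted above; the resulting ~3.6x ratio is what the summary rounds to 4x:

    # Per-token, per-layer KV-cache width in elements, using the numbers above.
    # GQA: 8 KV heads, head_dim = hidden_dim / n_heads = 4096 / 32 = 128
    gqa_kv_heads = 8
    head_dim = 4096 // 32                          # 128
    gqa_cache = 2 * gqa_kv_heads * head_dim        # keys + values -> 2048

    # MLA: only the compressed KV latent plus the decoupled RoPE key is cached
    kv_lora_rank = 512
    qk_rope_head_dim = 64
    mla_cache = kv_lora_rank + qk_rope_head_dim    # 576

    print(gqa_cache, mla_cache, gqa_cache / mla_cache)  # 2048 576 ~3.56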

zzhhjjj (Collaborator, Author) commented on Feb 17, 2025:

Config example

model:   
  ddp_bucket_cap_mb: 25
  dtype: bfloat16
  init_method:
    std: 0.025
  make_vocab_size_divisible_by: 1  
  model_config:  
    ...
    vocab_size: 50272
    # MLA
    q_lora_rank: 1536 
    kv_lora_rank: 512
    qk_nope_head_dim: 128
    qk_rope_head_dim: 64
    v_head_dim: 128
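For orientation, here is a rough sketch of how these values combine into projection shapes, following the standard DeepSeek-style MLA bookkeeping. The hidden size and head count are taken from the GQA comparison above, and every layer name except q_down (which appears in the diff further down) is illustrative rather than the PR's actual attribute:

    # Sketch only: names other than q_down are illustrative, not necessarily the PR's.
    hidden_size = 4096                 # assumption, taken from the comparison above
    n_heads = 32                       # assumption, taken from the comparison above
    q_lora_rank, kv_lora_rank = 1536, 512
    qk_nope_head_dim, qk_rope_head_dim, v_head_dim = 128, 64, 128

    qk_head_dim = qk_nope_head_dim + qk_rope_head_dim            # 192 per-head query/key dim
    # query path: hidden -> low-rank latent -> per-head queries
    q_down_shape = (hidden_size, q_lora_rank)                    # cf. self.q_down below
    q_up_shape = (q_lora_rank, n_heads * qk_head_dim)
    # key/value path: hidden -> compressed latent (+ one shared RoPE key)
    kv_down_shape = (hidden_size, kv_lora_rank + qk_rope_head_dim)
    kv_up_shape = (kv_lora_rank, n_heads * (qk_nope_head_dim + v_head_dim))
    # at inference time, only the latent and the RoPE key need caching per token
    cached_width = kv_lora_rank + qk_rope_head_dim               # 576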

xrsrke self-requested a review on February 24, 2025.
xrsrke (Member) left a comment:

Overall, it looks good, but I recommend adding unit tests to sanity-check MLA with different values of tp_mode, async_communication, and tp_recompute_allgather, and making sure that the output shape of the MLA class is as expected.
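A rough sketch of what such a test could look like. This is hedged heavily: build_mla, the tp_mode values, and the forward signature are placeholders for whatever this PR actually exposes, not real API:

    # Sketch of the suggested sanity check; build_mla and the MLA call signature
    # are hypothetical placeholders, not the PR's actual API.
    import pytest
    import torch

    @pytest.mark.parametrize("tp_mode", ["ALL_REDUCE", "REDUCE_SCATTER"])   # placeholder names
    @pytest.mark.parametrize("async_communication", [False, True])
    @pytest.mark.parametrize("tp_recompute_allgather", [False, True])
    def test_mla_output_shape(tp_mode, async_communication, tp_recompute_allgather):
        seq_len, batch_size, hidden_size = 16, 2, 4096
        mla = build_mla(                 # hypothetical helper that constructs the MLA module
            tp_mode=tp_mode,
            async_communication=async_communication,
            tp_recompute_allgather=tp_recompute_allgather,
        )
        x = torch.randn(seq_len, batch_size, hidden_size)
        out = mla(x)                     # adapt to the real forward signature
        # MLA should keep the [seq_len, batch_size, hidden_size] layout of CausalSelfAttention
        assert out.shape == (seq_len, batch_size, hidden_size)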

xrsrke (Member) left a comment:

LFG

        q, [self.qk_nope_head_dim, self.qk_rope_head_dim], dim=-1
    )  # [seq_len, batch_size, n_local_heads, qk_nope_head_dim], [seq_len, batch_size, n_local_heads, qk_rope_head_dim]
    q_pe = (
        self.rotary_embedding(q_pe.transpose(0, 1), position_ids=None).transpose(0, 1).contiguous()
Member:

Is the transpose(0, 1) needed here?

zzhhjjj (Collaborator, Author):

Yes, otherwise the results would be different.

Member:

I meant: why not transpose once at the beginning of MLA's forward? That way we avoid doing multiple small transposes.

Member:

Because transposes are very slow, and there are a lot of them in MLA's forward.

@@ -701,8 +855,9 @@ def __init__(
        layer_idx: int,
    ):
        super().__init__()
        attn_cls = MLA if config.kv_lora_rank is not None else CausalSelfAttention
Member:

I'd rather make this more explicit, e.g. use config.use_mla here, and assert somewhere that the other fields (e.g. kv_lora_rank) are well defined. This can be done in config.py.

zzhhjjj (Collaborator, Author) commented on Mar 6, 2025:

It seems a bit redundant to me, since kv_lora_rank being set implies MLA in this case, so there's no unexpected behavior.

Member:

Sometimes redundancy is fine if it makes the code cleaner! I still think we should have use_mla somewhere, as kv_lora_rank only relates to MLA for now.
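A minimal sketch of that suggestion; the class and field names here are illustrative, not the PR's actual config classes:

    # Sketch only: just the shape of the suggestion, not the actual config.py.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ModelConfig:
        use_mla: bool = False
        q_lora_rank: Optional[int] = None
        kv_lora_rank: Optional[int] = None
        qk_nope_head_dim: Optional[int] = None
        qk_rope_head_dim: Optional[int] = None
        v_head_dim: Optional[int] = None

        def __post_init__(self):
            # if MLA is requested, make sure all MLA dimensions are actually defined
            if self.use_mla:
                required = ("kv_lora_rank", "qk_nope_head_dim", "qk_rope_head_dim", "v_head_dim")
                missing = [name for name in required if getattr(self, name) is None]
                assert not missing, f"use_mla=True requires these fields to be set: {missing}"

The selection line above would then read attn_cls = MLA if config.use_mla else CausalSelfAttention.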

NouamaneTazi (Member) left a comment:

Left some comments!! Looking nice already.

if self.model_config.kv_lora_rank is not None:
    # set num_key_value_heads to None for MLA (as it's the same as num_attention_heads in the paper)
    # to avoid unintended errors
    self.model_config.num_key_value_heads = None
Member:

Please add a logger.warning here to warn the user.
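Something along these lines, for example; this is a sketch, and whether the codebase calls logger.warning directly or a rank-aware logging helper is an assumption here:

    # Sketch of the requested warning; the exact logging helper is an assumption.
    if self.model_config.kv_lora_rank is not None:
        logger.warning(
            "kv_lora_rank is set: using MLA and forcing num_key_value_heads=None "
            f"(was {self.model_config.num_key_value_heads})"
        )
        # MLA treats num_key_value_heads as num_attention_heads, so reset it to avoid misuse
        self.model_config.num_key_value_heads = None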

Comment on lines +60 to +64
q_lora_rank: Optional[int] = None
kv_lora_rank: Optional[int] = None
qk_nope_head_dim: Optional[int] = None
qk_rope_head_dim: Optional[int] = None
v_head_dim: Optional[int] = None
Member:

We could regroup these in an MLAConfig to keep them separate from the rest, or just follow transformers' config standards.
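For instance, a sketch of the MLAConfig grouping; names and defaults are illustrative only:

    # Sketch only: illustrative grouping, not the PR's actual config classes.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class MLAConfig:
        q_lora_rank: Optional[int] = None
        kv_lora_rank: int = 512
        qk_nope_head_dim: int = 128
        qk_rope_head_dim: int = 64
        v_head_dim: int = 128

    @dataclass
    class ModelConfig:
        # ... existing fields ...
        mla: Optional[MLAConfig] = None   # None -> standard attention, set -> MLA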

)

# Initialize linear layers
self.q_down = nn.Linear(self.dim, self.q_lora_rank, bias=False) # Note: this is duplicated across GPUs
Member:

Add a warning comment, please?
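For example, something like the following wording could work; the stated rationale for the replication is an assumption here and should be adjusted to the actual reason:

    # NOTE: q_down is a plain nn.Linear, so its weight is replicated on every GPU
    # rather than being tensor-parallel-sharded; keep this in mind for memory and
    # gradient synchronization. (Rationale wording is a suggestion, adapt as needed.)
    self.q_down = nn.Linear(self.dim, self.q_lora_rank, bias=False)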
