Add batch splitting in attention layer for decode to hide NIC latency #2334
Summary
This PR introduces a decode attention batch split feature: during the decoding phase, the batch is split into slices inside the attention layer so that compute can overlap with network transfers, hiding NIC latency. It also includes a distributed barrier optimization to reduce network communication overhead.
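Conceptually, splitting the decode batch lets one slice's tensor-parallel all-reduce travel over the NIC while the next slice's attention compute runs on the device. Below is a minimal sketch of that overlap pattern, not the PR's actual code; `attn_with_batch_split`, `attn_fn`, and `num_splits` are illustrative names, and the placement of the tensor-parallel all-reduce is an assumption.

```python
import torch
import torch.distributed as dist

def attn_with_batch_split(hidden_states, attn_fn, num_splits=2):
    """Run attention on batch slices, overlapping each slice's
    all-reduce with the next slice's compute.

    hidden_states: [batch, 1, hidden] decode-step input.
    attn_fn: returns this rank's partial attention output, which
             still needs an all-reduce across tensor-parallel ranks.
    """
    slices = hidden_states.chunk(num_splits, dim=0)
    outputs, handles = [], []
    for s in slices:
        out = attn_fn(s)  # compute this slice on the device
        # Async all-reduce: the NIC transfer starts now and overlaps
        # with the next iteration's attention compute.
        handles.append(dist.all_reduce(out, async_op=True))
        outputs.append(out)
    for h in handles:
        h.wait()  # drain outstanding reductions
    return torch.cat(outputs, dim=0)
```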
Key Changes
1. Decode Attention Batch Split Feature
- New `decode_attn_batch_split` parameter to enable batch splitting during the decoding phase (usage sketch after this list)
- Updated `GaudiLlamaDecoderLayer.forward()` method

2. Implementation Details

- Extended the existing `attn_batch_split` functionality to support the decode phase

Benefits
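A usage sketch, assuming `decode_attn_batch_split` is passed through the generation kwargs the same way as the existing `attn_batch_split` (the exact entry point is an assumption; only the parameter name comes from this PR):

```python
# Hypothetical call site; the split of 2 matches the benchmark below.
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    attn_batch_split=2,          # existing prefill-phase split
    decode_attn_batch_split=2,   # new: also split the batch during decode
)
```

A split of 2 is the smallest value that allows any overlap; larger splits would trade smaller, less efficient kernels for more overlap opportunity.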
Total performance for the Llama 405B model increases by 4.8%:
- Baseline, 2k_2k_180 config: 965 tokens/sec
- With decode split of 2, 2k_2k_180 config: 1010 tokens/sec