This is to unblock the distributed matmul experiments by @cowanmeg and @samnordmann.
I'll start with the tensor parallelism proposed in the original Megatron-LM paper.
- Only MHA and MLP are sharded.
- Activations are sharded in 2D, along the batch and hidden dimensions. However, the batch-dimension sharding is only for data parallelism, and that dimension is never resharded.
- Weights are sharded in 1D, along the hidden dimension (see the sketch below).
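
To make the scheme concrete, here is a minimal single-process sketch of the Megatron-style MLP sharding: the first GEMM is column-parallel, the second is row-parallel, and summing the per-rank partials stands in for the all-reduce. The shapes and names (`world_size`, `ffn`, etc.) are illustrative, not nvFuser APIs, and the batch-dimension (data-parallel) sharding is omitted since that dimension is never resharded.

```python
# Minimal sketch of Megatron-LM MLP tensor parallelism, simulating the
# "ranks" with plain tensor slices on one device. Illustrative only.
import torch

world_size = 2            # number of tensor-parallel devices being simulated
batch, hidden = 4, 8      # activations: [batch, hidden]
ffn = 4 * hidden          # MLP inner dimension

x  = torch.randn(batch, hidden)
w1 = torch.randn(hidden, ffn)   # first linear: column-parallel (split along ffn)
w2 = torch.randn(ffn, hidden)   # second linear: row-parallel (split along ffn)

# Reference: the unsharded MLP.
ref = torch.nn.functional.gelu(x @ w1) @ w2

# Sharded: each "rank" holds a 1D slice of each weight along ffn,
# computes a partial output, and the partials are summed (the all-reduce).
partials = []
for rank in range(world_size):
    cols = slice(rank * ffn // world_size, (rank + 1) * ffn // world_size)
    h_local = torch.nn.functional.gelu(x @ w1[:, cols])  # column-parallel GEMM
    partials.append(h_local @ w2[cols, :])               # row-parallel GEMM
out = sum(partials)                                      # all-reduce stand-in

assert torch.allclose(ref, out, atol=1e-5)
```

Splitting the first GEMM by columns is what keeps the GeLU local to each rank (it is elementwise), so the only communication in the whole MLP is the single all-reduce after the second GEMM. MHA is sharded the same way, with heads playing the role of the split hidden dimension.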