
Write a sharded transformer block in the nvFuser API. #2199

@wujingyue

Description


This is to unblock @cowanmeg and @samnordmann's distributed matmul experiments.

I'll start with the tensor parallelism proposed by the original Megatron-LM paper.

  1. Only MHA and MLP are sharded.
  2. Activations are sharded in 2D: batch and hidden. The batch dimension is sharded only for data parallelism and is never resharded.
  3. Weights are sharded in 1D, along the hidden dimension. (A rough sketch of the resulting per-rank shapes follows this list.)
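
To make the intended layout concrete, here is a minimal sketch in plain PyTorch (not nvFuser API code) of how the Megatron-LM style sharding maps to per-rank tensor shapes for the MLP block: the first linear is column-parallel, the second is row-parallel, followed by an all-reduce. The parallel degrees (`dp_size`, `tp_size`) and the toy dimensions are illustrative assumptions, not part of this issue.

```python
# Illustrative sketch only (plain PyTorch, not the nvFuser API): per-rank shapes
# for Megatron-LM style tensor parallelism of the MLP block. All sizes are made up.
import torch

dp_size, tp_size = 2, 4            # assumed data-parallel and tensor-parallel degrees
batch, seq, hidden = 16, 128, 1024
ffn = 4 * hidden

# Activations are sharded in 2D: batch (data parallelism) and hidden (tensor
# parallelism). The batch shard is fixed and never resharded.
x_local = torch.randn(batch // dp_size, seq, hidden)

# Weights are sharded in 1D along the hidden dimension:
#   first linear  -> column-parallel (output features split across TP ranks)
#   second linear -> row-parallel    (input features split across TP ranks)
w1_local = torch.randn(hidden, ffn // tp_size)
w2_local = torch.randn(ffn // tp_size, hidden)

h_local = torch.relu(x_local @ w1_local)   # [batch/dp, seq, ffn/tp]
y_partial = h_local @ w2_local             # partial sums, [batch/dp, seq, hidden]
# In a multi-process run this would be an all-reduce over the TP group, e.g.
# torch.distributed.all_reduce(y_partial); this single-process sketch skips it.
y_local = y_partial
print(y_local.shape)                       # torch.Size([8, 128, 1024])
```

The sketch only illustrates the sharding scheme described above; the actual task is to express the same block with the nvFuser API's multidevice scheduling.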
