
Conversation

Contributor

@fmassa fmassa commented Jul 4, 2025

This PR adds a self-contained implementation of the DeepSeekV3 MoE, taken from https://github.com/pytorch/torchtitan/tree/deepseek-v3/torchtitan/models/deepseek_v3 at commit pytorch/torchtitan@0a6ab71.
We could also add the attention-based part, but I decided to focus on the MoE first, since I've already addressed a few issues with the MLA-based attention in #7.

Quite a few problems came up along the way. My goal with this PR is to build a list of work items that will bring us closer to supporting DeepSeekV3 in AutoParallel.

With all these changes, the solver is able to produce a solution (although it will be inefficient, as unknown ops end up as Replicate() everywhere). We should at least add a batch-sharding fallback.
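
As a rough illustration of that fallback idea (a hypothetical sketch, names are made up, this is not code from this PR): for an op with no registered sharding rule, propose both "replicate everything" and "shard everything on the batch dimension" and let the solver pick.

from torch.distributed.tensor import Replicate, Shard

def fallback_placements(num_tensors: int):
    # one placement per tensor involved in the op: [output, inputs...]
    all_replicate = [Replicate()] * num_tensors
    batch_sharded = [Shard(0)] * num_tensors  # assumes dim 0 is the batch dim
    return [all_replicate, batch_sharded]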

First problem: custom triton kernels

The code as-is couldn't be traced because of custom Triton kernels. I wrapped the Triton kernel in a custom op and things were fine (see 4fdf95c)
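
For context, a minimal sketch of the wrapping pattern (the op name, shapes, and body below are made up, this is not the kernel from the PR): the Triton launch goes inside a torch.library custom op, with a fake implementation so tracing only needs shapes.

import torch

@torch.library.custom_op("moe_demo::token_gather", mutates_args=())
def token_gather(tokens: torch.Tensor, indices: torch.Tensor) -> torch.Tensor:
    # inside the custom op we can freely call data_ptr() / launch Triton;
    # the tracer treats the whole op as opaque
    return tokens[indices]  # stand-in for the real Triton launch

@token_gather.register_fake
def _(tokens, indices):
    # meta implementation: only shapes/dtypes, no data access
    return tokens.new_empty((indices.shape[0], tokens.shape[1]))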

Second problem: Node relabeling missing grad_input

For some reason we seem to have one additional output in the joint graph, which makes the grad_input relabeling fail. For now I just commented it out, as it's not necessary for getting the list of ops that need to be supported.

Third problem: Missing sharding propagation for a number of ops

This was expected (and was the main goal of getting this work done). After a bunch of workarounds, I was able to get an exhaustive list of all the ops that seem to require re-implementation.

Here is a list of ops for which sharding propagation needs to be implemented to support DeepSeekV3:

Note that the _softmax_backward_data and fma ops are redundant -- fma comes from the decomposition of _softmax_backward_data, so we can decide whether or not to decompose it.
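
To make the redundancy concrete (an illustrative sketch, not the exact decomposition PyTorch uses): softmax backward is dx = y*dy - y*sum(y*dy), and the final combine of the decomposed form is exactly an fma-style a*b + c.

import torch

def softmax_backward_fused(grad_out, out, dim):
    # what aten._softmax_backward_data computes
    return (grad_out - (grad_out * out).sum(dim, keepdim=True)) * out

def softmax_backward_decomposed(grad_out, out, dim):
    gy = grad_out * out
    s = gy.sum(dim, keepdim=True)
    # the last step is fma-like: gy + (-1) * out * s
    return torch.addcmul(gy, out, s, value=-1)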

Fourth problem: Missing int64 (and other dtype) flop formulas for compute estimation

For now I defaulted to ignoring unsupported dtypes and returning 0 flops, which is a reasonable fallback. But I think we should also account for the memory read/write cost, like Inductor does (although Inductor's estimation has issues of its own).
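
A hypothetical sketch of what such a fallback could look like (the function name and hardware constants below are made up, not code from this PR): when there is no flop formula for an op/dtype, fall back to a roofline-style cost based on bytes moved instead of treating the op as free.

import torch

def estimate_runtime_s(tensors, flops=0, tflops=300.0, gb_per_s=1000.0):
    # bytes read + written, approximated as the sizes of all involved tensors
    bytes_moved = sum(t.numel() * t.element_size() for t in tensors)
    compute_time = flops / (tflops * 1e12)
    memory_time = bytes_moved / (gb_per_s * 1e9)
    return max(compute_time, memory_time)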

Fifth problem: _grouped_mm requires stride % 16 == 0

For now I just assign an inf cost to these ops, but we should instead remove the invalid placements beforehand.

Potential Fix PR: pytorch/pytorch#158245

  • filters out the invalid (non-aligned) shardings earlier on so that we do not attempt to estimate flops for grouped_mm for invalid inputs
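
A hypothetical sketch of that kind of filter (names and numbers are made up for illustration): reject sharding candidates whose local shard would break the 16-alignment requirement before any flop estimation happens.

def is_aligned_for_grouped_mm(local_inner_dim: int, element_size: int = 2,
                              alignment_bytes: int = 16) -> bool:
    # the leading (row-major) stride in bytes must be a multiple of 16
    return (local_inner_dim * element_size) % alignment_bytes == 0

# e.g. K = 7168 sharded 64 ways gives a local K of 112 -> 224 bytes in bf16: aligned
assert is_aligned_for_grouped_mm(112)
assert not is_aligned_for_grouped_mm(129)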

@fmassa fmassa requested review from bdhirsh and wconstab July 4, 2025 14:07
@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Jul 4, 2025
fmassa added a commit that referenced this pull request Jul 4, 2025
fmassa added a commit that referenced this pull request Jul 4, 2025
assert len(op_spec[i].strategies) == len(
    op_spec[0].strategies
), "Assume each cat input has same number of strategies"
# for i in range(1, num_tensors):
Contributor

Why was this changed? Are we filtering to the least common strategies now? / should we be?

Contributor Author

@fmassa fmassa Jul 5, 2025

I just commented out those asserts so that I could move forward with figuring out what is needed for operator support. That's why I kept the torch.cat operator in the list of operators to implement: I basically butchered things just so it could run up to the solver construction.

Contributor

zpcore commented Jul 5, 2025

The following three ops are already registered. Did you notice any issues with those?

aten.cat.default
aten.index_put.default
aten.slice_scatter.default

Contributor Author

fmassa commented Jul 5, 2025

The following three ops are already registered. Did you notice any issues with those?

aten.cat.default
aten.index_put.default
aten.slice_scatter.default

IIRC, they assumed that the number of strategies for every input was the same, or something like that, and it caused issues in the past. It would be good to double-check if things work as expected nowadays.

EDIT: I commented out the custom registration for those three ops, and here were the issues I found:

aten.cat.default

(At least) the redistribute costs are missing, only one output strategy is proposed, and there is no tensor_meta.

aten.index_put.default

Got this error in the PyTorch rule:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/storage/home/fmassa/work/projects/autoparallel/examples/example_ds3.py", line 845, in <module>
[rank0]:     autop = AutoParallel(model, input_fn, mesh)
[rank0]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/storage/home/fmassa/work/projects/autoparallel/autoparallel/api.py", line 250, in __init__
[rank0]:     sharding_optimizer = ShardingOptimizer(self.gm, self.mesh)
[rank0]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/storage/home/fmassa/work/projects/autoparallel/autoparallel/optimize_sharding.py", line 45, in __init__
[rank0]:     self.strats = self.build_sharding_metadata()
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/storage/home/fmassa/work/projects/autoparallel/autoparallel/optimize_sharding.py", line 69, in build_sharding_metadata
[rank0]:     strat = get_placement_options(
[rank0]:             ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/storage/home/fmassa/work/projects/autoparallel/autoparallel/utils.py", line 176, in get_placement_options
[rank0]:     out_strat = torch.distributed.tensor.DTensor._op_dispatcher.sharding_propagator.op_strategy_funcs[
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/storage/home/fmassa/micromamba/envs/ptdev/lib/python3.12/site-packages/torch/distributed/tensor/_ops/_tensor_ops.py", line 744, in prop_index_put
[rank0]:     in_spec, indices_spec, values_spec = op_schema.args_schema
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: ValueError: too many values to unpack (expected 3)

aten.slice_scatter.default

(At least) the TensorMeta information is missing, different numbers of input shardings (for input and index) are not taken into account, and the redistribute costs are missing.

Contributor

zpcore commented Jul 6, 2025

(quoting fmassa's reply above in full)

Thanks for the detail!

print("grad_input")

# TODO: figure out and fix why this is not working
# rename_nodes(fx_g, grad_inputs, "grad_input", inputs_that_require_grad)
Contributor Author

Can someone have a look and see what was going wrong here and why we had more nodes than expected?

Contributor

@bdhirsh bdhirsh Jul 7, 2025

Took a look - if we're willing to support a copy_ mutable epilogue at the end of the joint graph (similar to what compile handles today), I can uncomment rename_nodes after patching in these PRs:

pytorch/pytorch#157730

#32

Contributor

ezyang commented Jul 8, 2025

The code as-is couldn't be traced because of custom Triton kernels. I wrapped the Triton kernel in a custom op and things were fine (see 4fdf95c)

If the code is a "user-defined Triton kernel" it should trace directly; the OmniFMv2 Triton kernels traced that way.

Contributor Author

fmassa commented Jul 8, 2025

The issue I was having was that somewhere in the Triton kernel path it was trying to grab the data_ptr of the tensor, and tracing through that didn't work.

fmassa added 10 commits July 18, 2025 07:17
Doesn't yet work, but the code for now is a copy-paste from https://github.com/pytorch/torchtitan/blob/deepseek-v3/torchtitan/models/deepseek_v3/model/moe.py so it will make it easier to track the changes
Needs to fix the grad_input renaming which is not working for some reason
They are not correct, it's just to get a list of what we need for DeepSeekV3
prims.fma is probably easier to implement, but I'm removing this decomp just in case
Now should handle all ops properly, with correct shapes
…m cases with invalid strides

The grouped_mm should be handled in the sharding propagation and those cases should just be removed I think
Otherwise we can't shard on the batch dimension. With this change everything works up to executing the solver
Contributor

zpcore commented Jul 25, 2025

@zpcore I rebased the PR on top of latest main and I tried removing the _softmax_backward_data fallback, but it seems like it doesn't have an input_spec so things fail https://github.com/pytorch/pytorch/blob/92e93bb580f31d405a72ee58f30fe82908bbeacf/torch/distributed/tensor/_ops/_math_ops.py#L563

This should probably be fixed before we can enable it

I see; I just sent out a PR to add back the missing field: pytorch/pytorch#159167.

fmassa added 5 commits July 26, 2025 06:55
There was no flop formula, which was making the solver think that computing this op is free
This is still approximate as we can't evenly shard on the tokens, but I'm doing this first before seeing if we can introduce a DynamicShard primitive
Comment on lines 671 to 731
@register_opschema_rule(torch.ops.aten.sort.stable)
def sort_rule(mesh, op_schema):
    op = torch.ops.aten.topk.default
    out_strat = torch.distributed.tensor.DTensor._op_dispatcher.sharding_propagator.op_strategy_funcs[
        op
    ](op_schema)
    return out_strat


@register_opschema_rule(torch.ops.aten.gather.default)
def gather_strategy(mesh, op_schema):
    from torch.distributed.tensor._op_schema import PlacementList
    from torch.distributed.tensor._ops._embedding_ops import _MaskPartial
    from torch.distributed.tensor._ops.utils import expand_to_full_mesh_op_strategy

    input_strategy = op_schema.args_schema[0]
    dim = op_schema.args_schema[1]
    index_strategy = op_schema.args_schema[2]

    input_shape = input_strategy.shape
    index_shape = index_strategy.shape

    single_mesh_dim_strategies = []

    # placement list stores placements of [output, input, index]
    # first we always have replicate all for inputs and output
    all_replicate: PlacementList = [Replicate()] * 3
    single_mesh_dim_strategies.append(all_replicate)

    # input sharding, input sharded, index accepts mask partial, output follows index
    # this only works when the input is sharded on the gather dimension, and
    # index has size 1 on the gather dimension
    if index_shape[dim] == 1:
        index_partial_placement = _MaskPartial(offset_shape=input_shape, offset_dim=dim)
        input_sharding: PlacementList = [
            index_partial_placement,
            Shard(dim),
            index_partial_placement,
        ]
        single_mesh_dim_strategies.append(input_sharding)

    # index sharding, input replicated, index sharded, output follows index
    # this only works when the sharding dimension is the gather dimension
    index_sharding: PlacementList = [Shard(dim), Replicate(), Shard(dim)]
    single_mesh_dim_strategies.append(index_sharding)

    if len(input_shape) == len(index_shape):
        for d in range(len(input_shape)):
            if d != dim:
                sharding = [Shard(d), Shard(d), Shard(d)]
                single_mesh_dim_strategies.append(sharding)

    return expand_to_full_mesh_op_strategy(
        mesh, op_schema, single_mesh_dim_strategies, input_index=1
    )


@register_opschema_rule(torch.ops.aten.scatter_add.default)
def scatter_add_strategy(mesh, op_schema):
Contributor Author

@wconstab @zpcore can we double-check those added rules and make sure they are valid / make sense?

The strategy for scatter_add basically follows what I've added for gather: we allow all tensors to be sharded on any dimension that is not the gather/scatter dim.
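
For concreteness, here is a sketch of the shape such a scatter_add rule could take under that assumption (it mirrors the gather rule above and is not necessarily the exact code in this diff; scatter_add also carries a src tensor, so the placement list covers [output, input, index, src]):

@register_opschema_rule(torch.ops.aten.scatter_add.default)
def scatter_add_strategy(mesh, op_schema):
    from torch.distributed.tensor._op_schema import PlacementList
    from torch.distributed.tensor._ops.utils import expand_to_full_mesh_op_strategy

    input_strategy = op_schema.args_schema[0]
    dim = op_schema.args_schema[1]
    index_strategy = op_schema.args_schema[2]

    input_shape = input_strategy.shape
    index_shape = index_strategy.shape

    # placement list stores placements of [output, input, index, src]
    single_mesh_dim_strategies: list[PlacementList] = [[Replicate()] * 4]

    # shard all tensors on the same dimension, as long as it is not the scatter dim
    if len(input_shape) == len(index_shape):
        for d in range(len(input_shape)):
            if d != dim:
                single_mesh_dim_strategies.append([Shard(d)] * 4)

    return expand_to_full_mesh_op_strategy(
        mesh, op_schema, single_mesh_dim_strategies, input_index=1
    )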

# mesh = torch.distributed.device_mesh.init_device_mesh("cuda", (world_size,), mesh_dim_names=("dp",))
mesh = torch.distributed.device_mesh.init_device_mesh(
    "cuda",
    (world_size // 32, 32),
Contributor Author

This should instead be

(world_size // 64, 64)

if I want to follow exactly what DeepSeek is doing

fmassa added a commit that referenced this pull request Aug 6, 2025
fmassa added a commit that referenced this pull request Aug 12, 2025
fmassa added a commit that referenced this pull request Aug 27, 2025
…nsum (#26)

* [WIP] Replace view -> mm -> view with matmul

This tries to support CP-style sharding, by overcoming a limitation of DTensor. Doesn't yet work as _mm_strategy is failing

* Fix matmul propagation rule

Some things are starting to work, but we are not yet there

* Move function to graph_utils.py

* Pull improvements from #29

* Fix equation for einsum

* Cleanup code now that PyTorch has fixed _gen_einsum_strategies

Requires pytorch/pytorch#157593

* Generalize to more than 3d

* Generalize backward pass as well and make everything call into einsum

* Add note about future work

* Add einsum flops and generalize creation of sharded tensors

Before this, if we had a list of tensors we wouldn't shard the tensors inside the list

* Disable erroneous sdpa rule from backward

* Account for compute cost in collectives as well

This removes a long-standing hack to tell the solver that S(1) -> R is more expensive than S(0) -> R because of an additional data movement

* Account for compute cost in collectives as well

This removes a long-standing hack to tell the solver that S(1) -> R is more expensive than S(0) -> R because of an additional data movement

* Support getitem as well

* Improve comments and suppose 80% efficiency

* Suppose 70% efficiency for comms

* Add comment and set it to false by default

* Revert changes from another PR

* Add spaces back
fmassa added a commit that referenced this pull request Sep 29, 2025
Taken from #3 and #29. Decomposing softmax_backward leads to prims.fma, which doesn't have a sharding rule and we end up having a Replicate showing up as only possible sharding
fmassa added a commit that referenced this pull request Oct 1, 2025
Taken from #3 and #29. Decomposing softmax_backward leads to prims.fma, which doesn't have a sharding rule and we end up having a Replicate showing up as only possible sharding