simplify streaming diloco #233
Conversation
6069b95 to 3d53cdc
LGTM
-            pseudogradient = local_param - self.original_parameters[name].to(
-                p.device
+            pseudogradient = (
+                self.original_parameters[name].to(p.device) - local_param
This is flipped because we don't do 1- below?
What do you mean by `1 -`? The outer optimizer will do `param = param - pseudo_grad`, but the loss goes down in the direction of `-pseudo_grad`. This is pretty much why we were seeing the loss go up earlier, I think.
Proof of this:
- Say `pseudo_grad = new_param - old_param`, where `new_param = old_param - grad` after the local steps.
- Then `pseudo_grad = (old_param - grad) - old_param = -grad`.
- The outer optimizer step is `param = old_param - pseudo_grad = old_param + grad`.
- This is incorrect: it should just be `param = old_param - grad`.
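
To make the sign convention concrete, here is a minimal numeric sketch (plain Python with illustrative variable names, not the actual torchft code):

```python
# Toy quadratic loss L(p) = 0.5 * p**2, so grad = p and the minimum is at 0.
old_param = 2.0
inner_lr = 0.1

# Local/inner steps move the parameter downhill.
local_param = old_param - inner_lr * old_param   # 1.8

# Correct convention: pseudo_grad = old - new points uphill, like a real gradient.
pseudo_grad = old_param - local_param            # +0.2
good = old_param - pseudo_grad                   # 1.8, loss went down

# Flipped convention: pseudo_grad = new - old undoes the local progress.
flipped = local_param - old_param                # -0.2
bad = old_param - flipped                        # 2.2, loss went up
assert abs(good) < abs(old_param) < abs(bad)
```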
@@ -588,7 +553,11 @@ def __init__(
        if sync_every < len(model_fragments):
            raise ValueError("Only 1 fragment can be synchronized at a time")

        if fragment_sync_delay >= sync_every:
        if sync_every % len(model_fragments) != 0:
            raise ValueError("sync_every must be a multiple of the number of fragments")
This makes sense for now -- we can relax this later if it turns out people want to sync different parts of the model at different rates, though that has other significant considerations.
We can still sync at different rates by passing in a different `sync_every`. We'll need to make it configurable in torchtitan too.
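
For intuition, here is a hedged sketch of how the fragment to sync could be derived from the manager's step alone once `sync_every` is a multiple of the number of fragments. `fragment_to_sync` and its arguments are illustrative names, not the actual torchft API:

```python
from typing import Optional

def fragment_to_sync(step: int, sync_every: int, num_fragments: int) -> Optional[int]:
    """Return the fragment index to sync at this manager step, or None if no sync is due.

    Assumes num_fragments divides sync_every (the check added in this PR), so syncs
    are spread evenly across steps and each fragment syncs once every sync_every steps.
    """
    steps_per_fragment = sync_every // num_fragments
    if step % steps_per_fragment != 0:
        return None
    return (step // steps_per_fragment) % num_fragments

# Example: 3 fragments, sync_every = 6 -> some fragment syncs every 2 steps,
# and each individual fragment syncs every 6 steps.
print([fragment_to_sync(s, sync_every=6, num_fragments=3) for s in range(1, 13)])
# [None, 1, None, 2, None, 0, None, 1, None, 2, None, 0]
```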
606428d to 5d57f7f
Summary:
- move the training loop to a separate file
- convert it into a class so that methods can be overridden without having to duplicate code
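
As a rough illustration of the refactor described above (class and method names are hypothetical, not the actual file in this PR):

```python
class TrainLoop:
    """Training loop as a class so individual steps can be overridden by subclasses."""

    def __init__(self, model, optimizer, data_loader):
        self.model = model
        self.optimizer = optimizer
        self.data_loader = data_loader

    def forward_backward(self, batch):
        # Subclasses can override this to change how the loss is computed.
        loss = self.model(batch).mean()
        loss.backward()
        return loss

    def train_step(self, batch):
        self.optimizer.zero_grad()
        loss = self.forward_backward(batch)
        self.optimizer.step()
        return loss

    def run(self, num_steps):
        data = iter(self.data_loader)
        for _ in range(num_steps):
            self.train_step(next(data))
```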
Summary:
- since we made the simplifying assumption that we will only ever have 1 in-flight fragment, we can simplify some of the logic, in particular getting rid of the local step in manager state
  - we'll just use the manager's step to determine which fragment to sync
  - this also lets us easily support heterogeneous hardware by tuning the sync_every setting so that slower/faster machines perform fewer/more local steps before they sync
- we can also perform quorum right before preparing a fragment sync
  - this easily ensures that all replicas will have the same max step and sync the same fragment
- fix some numeric issues
  - the sign of the pseudogradient
  - in-place lerp when mixing local and global parameters (a short sketch follows below)
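
For the in-place lerp fix mentioned above, a short sketch (the tensors and mixing fraction here are illustrative; this assumes the mix is between the local parameter and the globally averaged one):

```python
import torch

local_param = torch.randn(4)
global_param = torch.randn(4)
fraction = 0.5  # how much of the global average to mix in

# In-place: local_param <- local_param + fraction * (global_param - local_param).
# The in-place variant avoids allocating a new tensor and keeps the parameter's
# storage intact, so optimizer state and views keep pointing at the same memory.
local_param.lerp_(global_param, fraction)
```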
Summary:
Stack created with Sapling. Best reviewed with ReviewStack.