[FT] Support local_sgd / diloco in titan #1122
Conversation
It looks like we can share the training loop? We basically just add another context manager. Additionally, `training_method` is added to the main `JobConfig`, so we shouldn't create an additional `train.py`. If we don't want to expose this feature as the main feature yet, then we have to move the `training_method` configuration to `experiments/fault_tolerance/train.py` as well. You can check `torchtitan/tests/unit_tests/test_job_config.py` for an example.
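For illustration, a generic sketch of what keeping the knob experiment-local could look like. This is not torchtitan's actual extension mechanism (the linked test is the authoritative example); the class and field names below are hypothetical:

```python
from dataclasses import dataclass, field


@dataclass
class FaultTolerance:
    # Fields already present in the main JobConfig section (abridged).
    enable: bool = False
    min_replica_size: int = 1


@dataclass
class ExperimentFaultTolerance(FaultTolerance):
    # Experiment-only knob; None would mean fully synchronous training.
    training_method: str | None = None


@dataclass
class ExperimentJobConfig:
    fault_tolerance: ExperimentFaultTolerance = field(
        default_factory=ExperimentFaultTolerance
    )


print(ExperimentJobConfig().fault_tolerance.training_method)  # None by default
```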
```
    model=self.model_parts[0],
    optimizer=self.optimizers,
    sync_every=2,
) as semi_sync_training:
```
It seems that we are not using this variable. We don't need `as semi_sync_training`?
Yeah, we don't need it. I can remove it.
torchtitan/config_manager.py (Outdated)
```
@@ -502,6 +502,13 @@ class FaultTolerance:
    min_replica_size: int = 1
    """The minimum number of FT replica for each step."""

    training_method: str | None = "diloco"
```
`semi_sync_method` or `synchronize_method` would be less confusing; `training_method` is pretty general. And can we specify that if the value is not set, we will use synchronized training?
`semi_sync_method` sounds good. I will also update that comment.
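A rough sketch of what the renamed field and the clarified docstring could look like (the default and exact wording here are assumptions, not the final code):

```python
from dataclasses import dataclass


@dataclass
class FaultTolerance:
    # Existing fields elided; keeping one for context.
    min_replica_size: int = 1
    """The minimum number of FT replica for each step."""

    semi_sync_method: str | None = None
    """
    Semi-synchronous training method to use, e.g. "local_sgd" or "diloco".
    If unset (None), fully synchronous training is used.
    """
```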
Correct @fegin, we can share the training loop. My main worry was making the regular training loop harder to read by adding this, but yeah, you are right that the configs are already in `JobConfig`. The alternative is to ask the user to use …
I think if we can do it with the same train loop it'd be nice -- maybe we can use ExitStack for cleaner optional registration of the context managers?
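To make the ExitStack idea concrete, a minimal self-contained sketch; `semi_sync_training` below is only a stand-in for the context manager this PR adds, and the toy `train` function is not torchtitan's loop:

```python
import contextlib


@contextlib.contextmanager
def semi_sync_training(sync_every: int):
    # Stand-in for the semi-sync context manager added in this PR.
    print(f"entering semi-sync training (sync_every={sync_every})")
    try:
        yield
    finally:
        print("leaving semi-sync training")


def train(semi_sync_method: str | None = None, sync_every: int = 2) -> None:
    # ExitStack registers the context manager only when it is configured,
    # so the shared loop needs no extra nested `with` block.
    with contextlib.ExitStack() as stack:
        if semi_sync_method is not None:
            stack.enter_context(semi_sync_training(sync_every))
        for step in range(3):
            print(f"training step {step}")


train()                            # fully synchronous path
train(semi_sync_method="diloco")   # semi-sync path wraps the whole loop
```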
I'm okay either way. Since TorchFT is already in the core TorchTitan, I think it is okay to put the semi-sync logic into the main training loop. If you want to put it in the experiment folder, …
LGTM, one small requested change if it makes sense.
```
    model=self.model_parts[0],
    optimizer=self.optimizers,
    sync_every=job_config.fault_tolerance.sync_steps,
):
```
Uh, I just realized that this context can be initialized in `trainer.__init__()`, putting it into `dist_utils.get_train_context`. If my understanding is correct, that will only require changing `dist_utils.get_train_context` and its caller. Let me know if this makes sense.
Here it's entering the context manager for the entire training run, across iterations. `dist_utils.get_train_context` is a per-iteration context manager. Which way are we supposed to use the FT contexts?
Ah interesting. Yeah, the FT context spans iterations, since it adds hooks to the optimizers and performs synchronization every N iterations (based on the optimizer `.step()` calls).
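To make those mechanics concrete, a toy sketch (not torchft's implementation) of a context manager that counts optimizer `.step()` calls via PyTorch's `register_step_post_hook` and triggers a synchronization every N steps:

```python
import contextlib

import torch


@contextlib.contextmanager
def sync_every_n_steps(optimizer: torch.optim.Optimizer, sync_every: int):
    # Toy stand-in: fire a "sync" after every `sync_every` optimizer steps,
    # and remove the hook when the context exits.
    step_count = 0

    def _post_step(opt, args, kwargs):
        nonlocal step_count
        step_count += 1
        if step_count % sync_every == 0:
            print(f"[step {step_count}] synchronizing replicas")

    handle = optimizer.register_step_post_hook(_post_step)
    try:
        yield
    finally:
        handle.remove()


model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

with sync_every_n_steps(opt, sync_every=2):
    for _ in range(4):
        model(torch.randn(2, 4)).sum().backward()
        opt.step()
        opt.zero_grad()
```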
I see. Maybe it'd be good to organize them into two context manager util functions: one for the overall training run, the other for each training iteration.
Good idea, I will add it in a follow-up PR!
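For illustration, a toy sketch of that two-level split (the names and structure here are hypothetical, not the follow-up PR's actual API):

```python
import contextlib
from typing import Generator


@contextlib.contextmanager
def whole_run_context(semi_sync: bool) -> Generator[None, None, None]:
    """Entered once around the entire training run (semi-sync hooks would live here)."""
    print("run: enter" + (" (semi-sync hooks installed)" if semi_sync else ""))
    try:
        yield
    finally:
        print("run: exit")


@contextlib.contextmanager
def per_step_context(step: int) -> Generator[None, None, None]:
    """Entered once per training iteration (loss parallel, profiling, etc.)."""
    print(f"step {step}: enter")
    try:
        yield
    finally:
        print(f"step {step}: exit")


with whole_run_context(semi_sync=True):
    for step in range(2):
        with per_step_context(step):
            pass  # forward / backward / optimizer.step() would go here
```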
```
@@ -158,3 +162,44 @@ def ft_clip_grad_norm_util(total_norm: DTensor) -> torch.Tensor:
        return DTensor.from_local(local_tensor, mesh.mesh, placements)

    return total_norm


def maybe_semi_sync_training(
```
can we add typing?
will do
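For illustration, one possible typed signature. Only `model`, `optimizer`, and `sync_every` are visible in the diff above; the other parameters and annotations are guesses:

```python
import contextlib
from typing import Any, Generator

import torch


@contextlib.contextmanager
def maybe_semi_sync_training(
    method: str | None,
    ft_manager: Any,  # torchft manager object; the exact type is a guess
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    sync_every: int,
) -> Generator[None, None, None]:
    """Wrap the training run in LocalSGD/DiLoCo when `method` is set; no-op otherwise."""
    # Stub body: the real implementation lives in the PR's FT utilities.
    yield
```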
Depends on torchft changes:
This PR adds a new semi-sync method context manager which wraps around the train loop to run LocalSGD or DiLoCo. It also adds multiple config properties to set and control the training method.
To run (needs 3 different terminals):

Start the torchft lighthouse (terminal 1):

```
RUST_LOGS=debug RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 2 --quorum_tick_ms 100 --join_timeout_ms 10000
```

Start replica 1 (terminal 2, update the lighthouse URL):

```
TORCHFT_LIGHTHOUSE=<url> TORCHFT_MANAGER_PORT=29520 REPLICA_GROUP_ID=0 CUDA_VISIBLE_DEVICES=0,1,2,3 NGPU=4 ./run_train.sh --parallelism.data_parallel_shard_degree=4 --fault_tolerance.enable --fault_tolerance.group_size=2 --fault_tolerance.replica_id=0 --fault_tolerance.semi_sync_method="diloco"
```

Start replica 2 (terminal 3, update the lighthouse URL):

```
TORCHFT_LIGHTHOUSE=<url> TORCHFT_MANAGER_PORT=29522 REPLICA_GROUP_ID=1 CUDA_VISIBLE_DEVICES=4,5,6,7 NGPU=4 ./run_train.sh --parallelism.data_parallel_shard_degree=4 --fault_tolerance.enable --fault_tolerance.group_size=2 --fault_tolerance.replica_id=1 --fault_tolerance.semi_sync_method="diloco"
```