
Commit 2ef3b3a

Add monarch+torchft distributed training example (#264)

* add monarch+torchft distributed training example
* lint and feedback

1 parent eebdf3a commit 2ef3b3a

2 files changed: +477 -0

examples/monarch/README.md

Lines changed: 50 additions & 0 deletions

### Monarch-TorchFT-TorchTitan Distributed Training Orchestrator

#### Overview
This script orchestrates fault-tolerant distributed training using the TorchTitan and
TorchMonarch frameworks. It manages multiple training replicas across SLURM-scheduled
compute nodes, with automatic failure recovery and TorchFT lighthouse coordination.

##### PREREQUISITES
- Access to a SLURM cluster with GPU nodes
- A TorchTitan training configuration file in the script directory (debug_model.toml)
- A training dataset (c4_test) and tokenizer in the script directory

##### CONFIGURATION
Before running, update the cluster-specific constants:
- MACHINE: TorchX named resource for your cluster (currently: "gpu.xlarge")
- MACHINE_MEMORY: Memory per machine in MB (currently: 2062607)

You can also override the resource configuration manually (see the sketch below):
- https://docs.pytorch.org/torchx/main/specs.html#resource

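For illustration only, here is a minimal sketch of constructing a custom `torchx.specs.Resource` in place of a named resource; the `cpu` and `gpu` values are placeholders, and how `train_distributed.py` actually wires the resource in may differ:

```python
# Hypothetical sketch of a manual TorchX resource override.
# The cpu/gpu counts are placeholders; set them to match your machines.
from torchx import specs

MACHINE_MEMORY = 2062607  # memory per machine in MB, matching the constant above

custom_resource = specs.Resource(
    cpu=96,                # placeholder CPU cores per machine
    gpu=8,                 # GPUs per machine
    memMB=MACHINE_MEMORY,  # memory per machine in MB
)
```
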
##### USAGE
python train_distributed.py --help

Basic usage with 2 replicas, each with 1 node and 8 GPUs:
python train_distributed.py

Custom configuration:
python train_distributed.py --replica-count 3 --gpu-per-node 8 \
    --host-per-replica 2 --training-steps 100

With remote TorchFT lighthouse:
python train_distributed.py --remote-lighthouse

##### KEY COMPONENTS
- LighthouseActor: Coordination server for fault tolerance
- TrainingActor: Individual trainer processes
- ReplicaActor: Manages groups of trainers
- OrchestrationManager: Top-level orchestration and failure recovery

A skeleton of this actor layering is sketched below.

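As a structural illustration only, assuming Monarch's `Actor` base class and `@endpoint` decorator from `monarch.actor`; the class names come from the list above, but the method names and signatures are invented and need not match the script:

```python
# Hypothetical skeleton of the actor layering listed above.
# Only the class names come from this example; the endpoints are illustrative.
from monarch.actor import Actor, endpoint


class TrainingActor(Actor):
    """One trainer process wrapping a TorchTitan training loop."""

    @endpoint
    async def train(self, steps: int) -> None:
        ...  # run the TorchTitan trainer for `steps` steps


class ReplicaActor(Actor):
    """Manages the group of TrainingActors that make up one replica."""

    @endpoint
    async def run(self, steps: int) -> None:
        ...  # spawn TrainingActors on this replica's hosts and await them
```
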
##### FAILURE RECOVERY
- Automatic retry with configurable delays (PER_ATTEMPT_DELAY)
- New allocations after repeated failures (PROC_ATTEMPTS)
- Maximum attempts per replica (MAX_ATTEMPT)

The sketch below illustrates how these constants interact.

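For illustration, a minimal retry loop in the spirit of the policy above; the constant values and the `start_replica`/`allocate_procs` callables are placeholders, not the script's actual implementation:

```python
# Illustrative retry policy for one replica, using the constants named above.
# Values and the start_replica()/allocate_procs() callables are hypothetical.
import time

MAX_ATTEMPT = 5         # maximum attempts per replica (placeholder)
PROC_ATTEMPTS = 2       # failures tolerated before requesting a new allocation (placeholder)
PER_ATTEMPT_DELAY = 30  # seconds to wait between attempts (placeholder)


def run_with_retries(start_replica, allocate_procs):
    procs = allocate_procs()  # initial process allocation (e.g. SLURM-backed)
    for attempt in range(1, MAX_ATTEMPT + 1):
        try:
            return start_replica(procs)
        except Exception as err:
            print(f"attempt {attempt} failed: {err}")
            if attempt == MAX_ATTEMPT:
                raise
            # After PROC_ATTEMPTS failures on the same allocation, request a fresh one.
            if attempt % PROC_ATTEMPTS == 0:
                procs = allocate_procs()
            time.sleep(PER_ATTEMPT_DELAY)
```
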
##### OUTPUT
- Training outputs saved to ./outputs directory
- Logs streamed from all distributed processes
- TensorBoard metrics enabled by default

##### CLEANUP
All SLURM jobs are automatically terminated at script completion.
