### Monarch-TorchFT-TorchTitan Distributed Training Orchestrator

#### Overview
This script orchestrates fault-tolerant distributed training using the TorchTitan and Monarch
frameworks. It manages multiple training replicas across SLURM-scheduled compute nodes,
providing automatic failure recovery and coordination through a TorchFT lighthouse.

##### PREREQUISITES
- Access to a SLURM cluster with GPU nodes
- TorchTitan training configuration file in the script directory (debug_model.toml)
- A training dataset (c4_test) and tokenizer in the script directory

##### CONFIGURATION
Before running, update the cluster-specific constants:
- `MACHINE`: TorchX named resource for your cluster (currently: `"gpu.xlarge"`)
- `MACHINE_MEMORY`: memory per machine in MB (currently: 2062607)

You can also override the resource configuration manually (see the sketch below):
- https://docs.pytorch.org/torchx/main/specs.html#resource

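As a minimal sketch of a manual override, you can construct a `torchx.specs.Resource` directly instead of relying on the `MACHINE` named resource. The field values below are illustrative placeholders, not your cluster's actual spec:

```python
# Sketch: describe one machine explicitly rather than using a TorchX named resource.
# The numbers are placeholders; substitute your cluster's real CPU/GPU/memory figures.
from torchx import specs

custom_resource = specs.Resource(
    cpu=96,         # illustrative core count per node
    gpu=8,          # matches --gpu-per-node in the usage examples below
    memMB=2062607,  # MACHINE_MEMORY from the script's constants
)
```
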
##### USAGE
    python train_distributed.py --help

Basic usage with 2 replicas, each with 1 node and 8 GPUs:

    python train_distributed.py

Custom configuration:

    python train_distributed.py --replica-count 3 --gpu-per-node 8 \
        --host-per-replica 2 --training-steps 100

With remote TorchFT lighthouse:

    python train_distributed.py --remote-lighthouse

##### KEY COMPONENTS
- `LighthouseActor`: Coordination server for fault tolerance
- `TrainingActor`: Individual trainer processes
- `ReplicaActor`: Manages groups of trainers
- `OrchestrationManager`: Top-level orchestration and failure recovery

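As a rough illustration of how such a component is expressed, the sketch below defines a trainer as a Monarch actor with a remotely callable endpoint. It assumes Monarch's `monarch.actor` `Actor`/`endpoint` API; the class name, constructor arguments, and method body are illustrative, not the script's actual definitions.

```python
# Sketch of a Monarch-style actor, assuming the monarch.actor Actor/endpoint API.
# Names and behavior here are illustrative; the real TrainingActor in the script
# carries TorchTitan- and TorchFT-specific setup.
from monarch.actor import Actor, endpoint


class ToyTrainingActor(Actor):
    """One trainer process; endpoints are invoked remotely by the orchestrator."""

    def __init__(self, replica_id: int, training_steps: int):
        self.replica_id = replica_id
        self.training_steps = training_steps

    @endpoint
    async def train(self) -> str:
        # Placeholder for the real TorchTitan training loop.
        for _step in range(self.training_steps):
            pass
        return f"replica {self.replica_id} finished {self.training_steps} steps"
```
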
##### FAILURE RECOVERY
- Automatic retry with configurable delays (`PER_ATTEMPT_DELAY`)
- New allocations after repeated failures (`PROC_ATTEMPTS`)
- Maximum attempts per replica (`MAX_ATTEMPT`)

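The constants above come from the script; the loop below is only a sketch of the recovery policy they describe, with illustrative values and hypothetical helper callables rather than the script's actual implementation:

```python
import time

# Illustrative values; the real constants are defined in the script.
PER_ATTEMPT_DELAY = 30  # seconds to wait between attempts
PROC_ATTEMPTS = 3       # failures tolerated on one allocation before re-allocating
MAX_ATTEMPT = 9         # total attempts before the replica is given up


def run_with_recovery(start_training, request_new_allocation):
    """Retry a replica's training, requesting fresh nodes after repeated failures."""
    for attempt in range(1, MAX_ATTEMPT + 1):
        try:
            return start_training()
        except Exception:
            if attempt == MAX_ATTEMPT:
                raise  # out of attempts: surface the failure to the orchestrator
            if attempt % PROC_ATTEMPTS == 0:
                # Too many failures on the current allocation: request fresh nodes.
                request_new_allocation()
            time.sleep(PER_ATTEMPT_DELAY)
```
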
##### OUTPUT
- Training outputs saved to the `./outputs` directory
- Logs streamed from all distributed processes
- TensorBoard metrics enabled by default

##### CLEANUP
All SLURM jobs are automatically terminated at script completion.