Hi!
I am working on a pre-training job for GPT-OSS 20B (25B tokens, trained in 3 phases) on 8x B200 GPUs, and I need your feedback on the pre-training recipe to cut our current 2-week timeline. We are currently running at a global batch size (GBS) of 16, which gives the 2-week estimate. Any setting above GBS 256 triggers an immediate OOM error, and when we raise the batch size to our target of 256, the projected training time actually increases to 3 weeks.
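For reference, the step-count arithmetic behind those timelines looks like this (a quick sketch; the 4096-token sequence length is an illustrative assumption, not our actual setting):

```python
# Step-count arithmetic for the 25B-token budget at two global batch
# sizes. seq_len = 4096 is an illustrative assumption.
tokens_total = 25_000_000_000
seq_len = 4096  # assumed, not our real value

for gbs in (16, 256):
    tokens_per_step = gbs * seq_len
    steps = tokens_total // tokens_per_step
    print(f"GBS {gbs:>3}: {tokens_per_step:>9,} tokens/step -> ~{steps:,} optimizer steps")
```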
Current Recipe Hyperparameters:

- Optimizer: Distributed Optimizer (Adam) with overlap_param_gather=True
- Memory: Selective Activation Checkpointing (~75% of layers)
- Learning Rate: Max LR 7e-6 (Phase 1) with Cosine Annealing and 3% Warmup
- Data Loading: 32 workers, prefetch_factor=8, and persistent_workers=True (sketched below)
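The data-loading line maps directly onto PyTorch's DataLoader; here is a minimal runnable sketch (the dataset and micro-batch size below are dummy placeholders, not our real values):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-ins so the sketch runs end-to-end; the real dataset and
# per-rank micro-batch size come from the training job.
micro_batch_size = 2
train_dataset = TensorDataset(torch.zeros(1024, 2048, dtype=torch.long))

train_loader = DataLoader(
    train_dataset,
    batch_size=micro_batch_size,  # per-rank micro-batch, not the global batch
    num_workers=32,               # 32 worker processes per rank
    prefetch_factor=8,            # each worker keeps 8 batches queued ahead
    persistent_workers=True,      # workers stay alive between epochs
)
```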
Any ideas on the best way to increase the GBS and accelerate training? I think the batch size is the bottleneck, but I am not sure; maybe increasing TP?
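For context on how these knobs interact, assuming a Megatron-style setup (an assumption on my part about how our stack accounts for batch size), the global batch decomposes as GBS = micro-batch × grad-accumulation steps × data-parallel size:

```python
# Batch-size bookkeeping in a Megatron-style setup (an assumption; adjust
# for whatever framework we actually launch with). On a fixed 8-GPU node,
# raising tensor parallelism (TP) shrinks the data-parallel (DP) group.
NUM_GPUS = 8

def global_batch_size(micro_batch: int, grad_accum: int, tp: int) -> int:
    dp = NUM_GPUS // tp                   # DP ranks left after TP
    return micro_batch * grad_accum * dp  # GBS = MBS * accum * DP

# micro_batch=2 is illustrative, not our real per-rank setting.
print(global_batch_size(micro_batch=2, grad_accum=1, tp=1))   # -> 16 (current GBS)
print(global_batch_size(micro_batch=2, grad_accum=16, tp=1))  # -> 256 (target GBS)
print(global_batch_size(micro_batch=2, grad_accum=16, tp=4))  # -> 64 (same per-rank settings at TP=4)
```

As the last line shows, with per-rank settings held fixed, raising TP by itself shrinks the GBS rather than increasing it.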