Issue Description:
I'm experiencing training instability when attempting to reproduce the caformer_b36 model. The training proceeds normally for the first 38 epochs, but the loss becomes NaN at epoch 39 and persists for the remainder of training.
Environment:
Code path: metaformer repository (latest main branch)
Data: ImageNet dataset at $DATA_PATH
Hardware: 8 GPUs
Mixed precision training enabled (--amp)
Training Configuration:
DATA_PATH=/opt/work/Flow/data
CODE_PATH=/opt/work/metaformer # modify code path here
ALL_BATCH_SIZE=4096
NUM_GPU=8
GRAD_ACCUM_STEPS=4 # Adjust according to your GPU numbers and memory size.
let BATCH_SIZE=ALL_BATCH_SIZE/NUM_GPU/GRAD_ACCUM_STEPS
cd $CODE_PATH && sh distributed_train.sh $NUM_GPU $DATA_PATH \
--model caformer_b36 --opt lamb --lr 8e-3 --warmup-epochs 20 \
-b $BATCH_SIZE --grad-accum-steps $GRAD_ACCUM_STEPS \
--drop-path 0.6 --head-dropout 0.5 --amp --epochs 300
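For clarity, the per-GPU batch size these values produce works out as follows (just the arithmetic from the script above, written out):
# BATCH_SIZE = ALL_BATCH_SIZE / NUM_GPU / GRAD_ACCUM_STEPS = 4096 / 8 / 4 = 128 images per GPU per step
# Effective batch size per optimizer update = 128 * 8 GPUs * 4 accumulation steps = 4096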
Observed Behavior:
Training loss decreases normally until epoch 38, but at epoch 39 the loss suddenly becomes NaN mid-epoch:
Train: 39 [ 700/1251 ( 56%)] Loss: 4.256 (4.08) [...]
Train: 39 [ 750/1251 ( 60%)] Loss: nan (nan) [...]
Train: 39 [ 800/1251 ( 64%)] Loss: nan (nan) [...]
... (remaining iterations all NaN)
Questions:
Have you encountered similar instability with caformer_b36 during your experiments?
Are there recommended gradient clipping settings or learning rate adjustments for this model configuration?
Could this be related to the LAMB optimizer interacting with mixed precision training? Are there specific --amp flags or optimizer parameters that should be tuned?
Are the hyperparameters (--lr 8e-3, --drop-path 0.6, --head-dropout 0.5) in the example command verified for caformer_b36, or are they primarily for smaller models?
Any guidance on stabilizing training for this model would be greatly appreciated.
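For reference, here is the variant I plan to try next while waiting for feedback. It assumes the repo's train.py keeps timm's standard --clip-grad, --clip-mode, and --resume options, which I have not verified for this repository; the clip value and the checkpoint path are placeholders chosen for illustration, not recommended settings.
# Same launch as above, plus gradient-norm clipping, resuming from the last
# checkpoint saved before the divergence (the path below is a placeholder).
cd $CODE_PATH && sh distributed_train.sh $NUM_GPU $DATA_PATH \
--model caformer_b36 --opt lamb --lr 8e-3 --warmup-epochs 20 \
-b $BATCH_SIZE --grad-accum-steps $GRAD_ACCUM_STEPS \
--drop-path 0.6 --head-dropout 0.5 --amp --epochs 300 \
--clip-grad 1.0 --clip-mode norm \
--resume output/train/caformer_b36_run/checkpoint-38.pth.tar
If clipping alone is not enough, I would next try lowering --lr or extending --warmup-epochs, per the questions above.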