Issue Description:
I'm experiencing training instability when attempting to reproduce the caformer_b36 model. The training proceeds normally for the first 38 epochs, but the loss becomes NaN at epoch 39 and persists for the remainder of training.
Environment:
Code path: metaformer repository (latest main branch)
Data: ImageNet dataset at $DATA_PATH
Hardware: 8 GPUs
Mixed precision training enabled (--amp)
Training Configuration:
DATA_PATH=/opt/work/Flow/data
CODE_PATH=/opt/work/metaformer # modify code path here
ALL_BATCH_SIZE=4096
NUM_GPU=8
GRAD_ACCUM_STEPS=4 # Adjust according to your GPU numbers and memory size.
let BATCH_SIZE=ALL_BATCH_SIZE/NUM_GPU/GRAD_ACCUM_STEPS
cd $CODE_PATH && sh distributed_train.sh $NUM_GPU $DATA_PATH \
--model caformer_b36 --opt lamb --lr 8e-3 --warmup-epochs 20 \
-b $BATCH_SIZE --grad-accum-steps $GRAD_ACCUM_STEPS \
--drop-path 0.6 --head-dropout 0.5 --amp --epochs 300
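For clarity, the per-GPU batch size these values produce works out as follows (just the arithmetic from the script above, written out):
# BATCH_SIZE = ALL_BATCH_SIZE / NUM_GPU / GRAD_ACCUM_STEPS = 4096 / 8 / 4 = 128 images per GPU per step
# Effective batch size per optimizer update = 128 * 8 GPUs * 4 accumulation steps = 4096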
Observed Behavior:
Training loss decreases normally until epoch 38, but at epoch 39 the loss suddenly becomes NaN mid-epoch:
Train: 39 [ 700/1251 ( 56%)] Loss: 4.256 (4.08) [...]
Train: 39 [ 750/1251 ( 60%)] Loss: nan (nan) [...]
Train: 39 [ 800/1251 ( 64%)] Loss: nan (nan) [...]
... (remaining iterations all NaN)
Questions:
Have you encountered similar instability with caformer_b36 during your experiments?
Are there recommended gradient clipping settings or learning rate adjustments for this model configuration?
Could this be related to the LAMB optimizer interacting with mixed precision training? Are there specific --amp flags or optimizer parameters that should be tuned?
Are the hyperparameters (--lr 8e-3, --drop-path 0.6, --head-dropout 0.5) in the example command verified for caformer_b36, or are they primarily for smaller models?
Any guidance on stabilizing training for this model would be greatly appreciated.
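For reference, here is the variant I plan to try next while waiting for feedback. It assumes the repo's train.py keeps timm's standard --clip-grad, --clip-mode, and --resume options, which I have not verified for this repository; the clip value and the checkpoint path are placeholders chosen for illustration, not recommended settings.
# Same launch as above, plus gradient-norm clipping, resuming from the last
# checkpoint saved before the divergence (the path below is a placeholder).
cd $CODE_PATH && sh distributed_train.sh $NUM_GPU $DATA_PATH \
--model caformer_b36 --opt lamb --lr 8e-3 --warmup-epochs 20 \
-b $BATCH_SIZE --grad-accum-steps $GRAD_ACCUM_STEPS \
--drop-path 0.6 --head-dropout 0.5 --amp --epochs 300 \
--clip-grad 1.0 --clip-mode norm \
--resume output/train/caformer_b36_run/checkpoint-38.pth.tar
If clipping alone is not enough, I would next try lowering --lr or extending --warmup-epochs, per the questions above.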