
Training instability: NaN loss when training caformer_b36 with LAMB optimizer #20

@liveck

Issue Description:
I'm experiencing training instability when attempting to reproduce the caformer_b36 model. The training proceeds normally for the first 38 epochs, but the loss becomes NaN at epoch 39 and persists for the remainder of training.

Environment:

- Code path: metaformer repository (latest main branch)
- Data: ImageNet dataset at $DATA_PATH
- Hardware: 8 GPUs
- Mixed precision training enabled (--amp)
Training Configuration:

```sh
DATA_PATH=/opt/work/Flow/data
CODE_PATH=/opt/work/metaformer  # modify code path here

ALL_BATCH_SIZE=4096
NUM_GPU=8
GRAD_ACCUM_STEPS=4  # adjust according to your GPU count and memory size
let BATCH_SIZE=ALL_BATCH_SIZE/NUM_GPU/GRAD_ACCUM_STEPS

cd $CODE_PATH && sh distributed_train.sh $NUM_GPU $DATA_PATH \
--model caformer_b36 --opt lamb --lr 8e-3 --warmup-epochs 20 \
-b $BATCH_SIZE --grad-accum-steps $GRAD_ACCUM_STEPS \
--drop-path 0.6 --head-dropout 0.5 --amp --epochs 300
```
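As one stabilization attempt, I could add gradient clipping from the command line. This is a hedged sketch: it assumes distributed_train.sh forwards timm-style `--clip-grad` / `--clip-mode` options to the underlying train script, which I have not verified for this repo.

```sh
# Same run as above, with global-norm gradient clipping added as a
# stabilization attempt. --clip-grad 1.0 / --clip-mode norm follow timm's
# train.py conventions; confirm distributed_train.sh forwards these flags.
cd $CODE_PATH && sh distributed_train.sh $NUM_GPU $DATA_PATH \
--model caformer_b36 --opt lamb --lr 8e-3 --warmup-epochs 20 \
-b $BATCH_SIZE --grad-accum-steps $GRAD_ACCUM_STEPS \
--drop-path 0.6 --head-dropout 0.5 --amp --epochs 300 \
--clip-grad 1.0 --clip-mode norm
```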

Observed Behavior:
Training loss decreases normally until epoch 38, but at epoch 39 the loss suddenly becomes NaN mid-epoch:

```
Train: 39 [ 700/1251 ( 56%)]  Loss: 4.256 (4.08)  [...]
Train: 39 [ 750/1251 ( 60%)]  Loss: nan (nan)  [...]
Train: 39 [ 800/1251 ( 64%)]  Loss: nan (nan)  [...]
... (remaining iterations all NaN)
```

Questions:

1. Have you encountered similar instability with caformer_b36 during your experiments?
2. Are there recommended gradient clipping settings or learning rate adjustments for this model configuration?
3. Could this be related to the LAMB optimizer interacting with mixed precision training? Are there specific --amp flags or optimizer parameters that should be tuned?
4. Are the hyperparameters (--lr 8e-3, --drop-path 0.6, --head-dropout 0.5) in the example command verified for caformer_b36, or are they primarily for smaller models?

Any guidance on stabilizing training for this model would be greatly appreciated.
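In the meantime, a generic workaround I am considering (not specific to this repo) is to detect a non-finite loss and skip that optimizer step before the weights are corrupted. A minimal sketch in plain Python; `loss_is_finite` and `guarded_step` are hypothetical helper names, not functions from the metaformer codebase:

```python
import math

def loss_is_finite(loss_value: float) -> bool:
    """True when the scalar loss can safely be used for a weight update."""
    return math.isfinite(loss_value)

def guarded_step(loss_value: float, step_fn, skipped_losses: list) -> bool:
    """Run step_fn() only for finite losses; record and skip divergent ones.

    In a real AMP loop the GradScaler already skips steps whose scaled
    gradients contain inf/NaN, but once the loss itself is NaN the model
    state may already be damaged, so checking the loss early helps.
    """
    if loss_is_finite(loss_value):
        step_fn()
        return True
    skipped_losses.append(loss_value)
    return False

# Usage with stand-in loss values:
skipped = []
applied = []
for loss in [4.256, float("nan"), 3.9]:
    guarded_step(loss, lambda: applied.append(loss), skipped)
```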
