[BUG] total_train_steps was not calculated or passed in correctly #357

@jia-huang

Describe the bug

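Fine-tuning fails at startup with an AssertionError: total_train_steps is None when the polynomial_decay LR scheduler is built, so it appears that total_train_steps was not calculated or passed in correctly.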
My run script is as follows:

#!/bin/bash

# UniMol Fine-tuning script (updated for torchrun)

# ===== Configuration parameters =====

data_path="/root/PythonProject/Uni-Mol-main/Uni-Mol-main0/unimol/example_data"
save_dir="./save_demo"
MASTER_PORT=10086
n_gpu=1
dict_name="dict.txt"
weight_path='/root/PythonProject/Uni-Mol-main/Uni-Mol-main0/unimol/notebooks/mol_pre_no_h_220816.pt'
task_name="bace"
task_num=2
loss_func="finetune_cross_entropy"
lr=1e-4
batch_size=32
epoch=5
dropout=0.1
warmup=0.06
local_batch_size=32
only_polar=0
conf_size=11
seed=0
metric="valid_agg_auc"
#update_freq=$((batch_size / local_batch_size))
update_freq=1

# ===== Copy the dictionary file =====

cp ../example_data/molecule/$dict_name $data_path

# ===== Environment variable settings =====

export NCCL_ASYNC_ERROR_HANDLING=1
export OMP_NUM_THREADS=1

# ===== Launch training =====

torchrun \
    --nproc_per_node=$n_gpu \
    --master_port=$MASTER_PORT \
    $(which unicore-train) $data_path \
    --task-name $task_name \
    --user-dir ../unimol \
    --train-subset train \
    --valid-subset valid \
    --conf-size $conf_size \
    --num-workers 8 \
    --ddp-backend=c10d \
    --dict-name $dict_name \
    --task mol_finetune \
    --loss $loss_func \
    --arch unimol_base \
    --classification-head-name $task_name \
    --num-classes $task_num \
    --optimizer adam \
    --adam-betas '(0.9, 0.99)' \
    --adam-eps 1e-6 \
    --clip-norm 1.0 \
    --lr-scheduler polynomial_decay \
    --lr $lr \
    --warmup-ratio $warmup \
    --max-epoch $epoch \
    --batch-size $local_batch_size \
    --pooler-dropout $dropout \
    --update-freq $update_freq \
    --seed $seed \
    --fp16 \
    --fp16-init-scale 4 \
    --fp16-scale-window 256 \
    --log-interval 100 \
    --log-format simple \
    --validate-interval 1 \
    --keep-last-epochs 10 \
    --finetune-from-model $weight_path \
    --best-checkpoint-metric $metric \
    --patience 20 \
    --save-dir $save_dir \
    --only-polar $only_polar \
    --maximize-best-checkpoint-metric
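
For reference, here is a rough sketch of the step count these settings imply. It is only an estimate of what total_train_steps might be, not how Uni-Core computes it. The sample count `n_train` is a hypothetical placeholder, since the dataset size is not stated here; the other values are copied from the script above.

```bash
#!/bin/bash
# Rough estimate of optimizer steps for the settings in the run script.
# n_train is a HYPOTHETICAL sample count (assumption, not from the issue).
n_train=1000
n_gpu=1
local_batch_size=32
update_freq=1
epoch=5
warmup=0.06

# Samples consumed per optimizer step across all GPUs and accumulation steps.
effective_batch=$((local_batch_size * n_gpu * update_freq))
# Ceiling division: a final partial batch still counts as one step.
steps_per_epoch=$(( (n_train + effective_batch - 1) / effective_batch ))
total_train_steps=$((steps_per_epoch * epoch))
# Bash has no floating point, so awk applies the warmup ratio (truncated to an integer).
warmup_steps=$(awk -v r="$warmup" -v t="$total_train_steps" 'BEGIN { printf "%d", r * t }')

echo "effective_batch=$effective_batch steps_per_epoch=$steps_per_epoch"
echo "total_train_steps=$total_train_steps warmup_steps=$warmup_steps"
```

A warmup ratio only becomes a concrete number of warmup steps once the total is known, which is presumably why the scheduler needs total_train_steps at construction time.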

The error is as follows:
[rank0]: return build_lr_scheduler_(args, optimizer, total_train_steps)
[rank0]: File "/root/miniconda3/envs/py38-unimol/lib/python3.8/site-packages/unicore/registry.py", line 42, in build_x
[rank0]: return builder(args, *extra_args, **extra_kwargs)
[rank0]: File "/root/miniconda3/envs/py38-unimol/lib/python3.8/site-packages/unicore/optim/lr_scheduler/polynomial_decay_schedule.py", line 19, in __init__
[rank0]: assert total_train_steps is not None
[rank0]: AssertionError
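
To illustrate why a ratio-based warmup cannot work without a total step count, here is a minimal sketch of a warmup plus polynomial-decay schedule. It is not Uni-Core's implementation: the function `lr_at_step`, the decay `power`, and the final learning rate `end_lr` are all assumptions made for illustration, and only `peak_lr` and `warmup_ratio` come from the run script.

```bash
#!/bin/bash
# Illustrative warmup + polynomial-decay schedule (NOT Uni-Core's code).
peak_lr=1e-4        # --lr from the run script
warmup_ratio=0.06   # --warmup-ratio from the run script
power=1.0           # assumption: linear decay
end_lr=0.0          # assumption: decay to zero

# lr_at_step STEP TOTAL_STEPS: prints the learning rate for that step.
lr_at_step() {
    local step=$1 total=$2
    # Mirrors the failing assertion: without a total step count,
    # the warmup ratio cannot be turned into a number of steps.
    if [ -z "$total" ]; then
        echo "error: total_train_steps is not set" >&2
        return 1
    fi
    awk -v s="$step" -v t="$total" -v lr="$peak_lr" -v r="$warmup_ratio" \
        -v p="$power" -v e="$end_lr" 'BEGIN {
        w = int(r * t)                        # warmup steps derived from the ratio
        if (w > 0 && s <= w) { printf "%.3e\n", lr * s / w; exit }
        if (s >= t)          { printf "%.3e\n", e; exit }
        frac = 1 - (s - w) / (t - w)          # remaining fraction of the decay phase
        printf "%.3e\n", (lr - e) * frac ^ p + e
    }'
}

lr_at_step 5 160      # during warmup
lr_at_step 9 160      # last warmup step, peak learning rate
lr_at_step 160 160    # end of training
lr_at_step 5 "" || echo "scheduler could not be built"
```

The last call fails in the same spirit as the traceback above: with no total step count, neither the warmup length nor the decay horizon is defined.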

Uni-Mol Version

Uni-Mol

Expected behavior

total_train_steps should be calculated and passed to the LR scheduler so that fine-tuning starts without the AssertionError.

To Reproduce

No response

Environment

No response

Additional Context

No response
