Describe the bug
There is an issue where `total_train_steps` is not calculated or passed in correctly.
My run script is as follows:
```bash
#!/bin/bash
# UniMol fine-tuning script (updated for torchrun)

# ===== Configuration =====
data_path="/root/PythonProject/Uni-Mol-main/Uni-Mol-main0/unimol/example_data"
save_dir="./save_demo"
MASTER_PORT=10086
n_gpu=1
dict_name="dict.txt"
weight_path='/root/PythonProject/Uni-Mol-main/Uni-Mol-main0/unimol/notebooks/mol_pre_no_h_220816.pt'
task_name="bace"
task_num=2
loss_func="finetune_cross_entropy"
lr=1e-4
batch_size=32
epoch=5
dropout=0.1
warmup=0.06
local_batch_size=32
only_polar=0
conf_size=11
seed=0
metric="valid_agg_auc"
#update_freq=$((batch_size / local_batch_size))
update_freq=1

# ===== Copy dictionary file =====
cp ../example_data/molecule/$dict_name $data_path

# ===== Environment variables =====
export NCCL_ASYNC_ERROR_HANDLING=1
export OMP_NUM_THREADS=1

# ===== Launch training =====
torchrun \
    --nproc_per_node=$n_gpu \
    --master_port=$MASTER_PORT \
    $(which unicore-train) $data_path \
    --task-name $task_name \
    --user-dir ../unimol \
    --train-subset train \
    --valid-subset valid \
    --conf-size $conf_size \
    --num-workers 8 \
    --ddp-backend=c10d \
    --dict-name $dict_name \
    --task mol_finetune \
    --loss $loss_func \
    --arch unimol_base \
    --classification-head-name $task_name \
    --num-classes $task_num \
    --optimizer adam \
    --adam-betas '(0.9, 0.99)' \
    --adam-eps 1e-6 \
    --clip-norm 1.0 \
    --lr-scheduler polynomial_decay \
    --lr $lr \
    --warmup-ratio $warmup \
    --max-epoch $epoch \
    --batch-size $local_batch_size \
    --pooler-dropout $dropout \
    --update-freq $update_freq \
    --seed $seed \
    --fp16 \
    --fp16-init-scale 4 \
    --fp16-scale-window 256 \
    --log-interval 100 \
    --log-format simple \
    --validate-interval 1 \
    --keep-last-epochs 10 \
    --finetune-from-model $weight_path \
    --best-checkpoint-metric $metric \
    --patience 20 \
    --save-dir $save_dir \
    --only-polar $only_polar \
    --maximize-best-checkpoint-metric
```
The error is as follows:
```
[rank0]:     return build_lr_scheduler_(args, optimizer, total_train_steps)
[rank0]:   File "/root/miniconda3/envs/py38-unimol/lib/python3.8/site-packages/unicore/registry.py", line 42, in build_x
[rank0]:     return builder(args, *extra_args, **extra_kwargs)
[rank0]:   File "/root/miniconda3/envs/py38-unimol/lib/python3.8/site-packages/unicore/optim/lr_scheduler/polynomial_decay_schedule.py", line 19, in __init__
[rank0]:     assert total_train_steps is not None
[rank0]: AssertionError
```
Uni-Mol Version
Uni-Mol
Expected behavior
`total_train_steps` should be calculated from the training configuration and passed to the `polynomial_decay` LR scheduler, so that fine-tuning starts without hitting the assertion.
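For reference, the scheduler only needs the total number of optimizer updates for the run. A minimal sketch of how that value is typically derived from the flags in the script above (the function name and the assumed BACE train-split size of 1210 molecules are illustrative, not Uni-Mol code):

```python
import math

def estimate_total_train_steps(num_train_samples: int,
                               batch_size: int,
                               update_freq: int,
                               max_epoch: int,
                               n_gpu: int = 1) -> int:
    """Estimate total optimizer updates: one update consumes
    batch_size * update_freq * n_gpu samples."""
    steps_per_epoch = math.ceil(num_train_samples / (batch_size * update_freq * n_gpu))
    return steps_per_epoch * max_epoch

# With the script's values (--batch-size 32, --update-freq 1, --max-epoch 5,
# one GPU) and a hypothetical train split of 1210 samples:
print(estimate_total_train_steps(1210, batch_size=32, update_freq=1, max_epoch=5))
# → 190
```

When this number never reaches the scheduler (e.g. when `--max-epoch` is set but the trainer does not convert it to a step count), the `assert total_train_steps is not None` in `polynomial_decay_schedule.py` fails exactly as shown in the traceback.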
To Reproduce
No response
Environment
No response
Additional Context
No response