Description
I trained the model with train_14B_lora.sh after changing zero2 to zero3, and got the following error:
[rank0]: saved met
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/work/lw_us/projects/StableAvatar/train_14B_lora_org_lw.py", line 1432, in <module>
[rank1]:     main()
[rank1]:   File "/home/work/lw_us/projects/StableAvatar/train_14B_lora_org_lw.py", line 1370, in main
[rank1]:     accelerator.backward(loss)
[rank1]:   File "/home/work/lw_data/miniforge3/envs/stableavatar/lib/python3.10/site-packages/accelerate/accelerator.py", line 2726, in backward
[rank1]:     self.deepspeed_engine_wrapped.backward(loss, sync_gradients=self.sync_gradients, **kwargs)
[rank1]:   File "/home/work/lw_data/miniforge3/envs/stableavatar/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 270, in backward
[rank1]:     self.engine.backward(loss, **kwargs)
[rank1]:   File "/home/work/lw_data/miniforge3/envs/stableavatar/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank1]:     ret_val = func(*args, **kwargs)
[rank1]:   File "/home/work/lw_data/miniforge3/envs/stableavatar/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2267, in backward
[rank1]:     self._do_optimizer_backward(loss, retain_graph)
[rank1]:   File "/home/work/lw_data/miniforge3/envs/stableavatar/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2213, in _do_optimizer_backward
[rank1]:     self.optimizer.backward(loss, retain_graph=retain_graph)
[rank1]:   File "/home/work/lw_data/miniforge3/envs/stableavatar/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank1]:     ret_val = func(*args, **kwargs)
[rank1]:   File "/home/work/lw_data/miniforge3/envs/stableavatar/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2310, in backward
[rank1]:     self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank1]:   File "/home/work/lw_data/miniforge3/envs/stableavatar/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 65, in backward
[rank1]:     scaled_loss.backward(retain_graph=retain_graph)
[rank1]:   File "/home/work/lw_data/miniforge3/envs/stableavatar/lib/python3.10/site-packages/torch/_tensor.py", line 626, in backward
[rank1]:     torch.autograd.backward(
[rank1]:   File "/home/work/lw_data/miniforge3/envs/stableavatar/lib/python3.10/site-packages/torch/autograd/__init__.py", line 347, in backward
[rank1]:     _engine_run_backward(
[rank1]:   File "/home/work/lw_data/miniforge3/envs/stableavatar/lib/python3.10/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
[rank1]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank1]:   File "/home/work/lw_data/miniforge3/envs/stableavatar/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 1129, in unpack_hook
[rank1]:     frame.check_recomputed_tensors_match(gid)
[rank1]:   File "/home/work/lw_data/miniforge3/envs/stableavatar/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 903, in check_recomputed_tensors_match
[rank1]:     raise CheckpointError(
[rank1]: torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: Recomputed values for the following tensors have different metadata than during the forward pass.