
14B model LoRA training error #36

@liwang0621

Description


I trained the model with train_14B_lora.sh after changing zero2 to zero3, and got the following error:

[rank0]: saved met
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/work/lw_us/projects/StableAvatar/train_14B_lora_org_lw.py", line 1432, in <module>
[rank1]:     main()
[rank1]:   File "/home/work/lw_us/projects/StableAvatar/train_14B_lora_org_lw.py", line 1370, in main
[rank1]:     accelerator.backward(loss)
[rank1]:   File "/home/work/lw_data/miniforge3/envs/stableavatar/lib/python3.10/site-packages/accelerate/accelerator.py", line 2726, in backward
[rank1]:     self.deepspeed_engine_wrapped.backward(loss, sync_gradients=self.sync_gradients, **kwargs)
[rank1]:   File "/home/work/lw_data/miniforge3/envs/stableavatar/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 270, in backward
[rank1]:     self.engine.backward(loss, **kwargs)
[rank1]:   File "/home/work/lw_data/miniforge3/envs/stableavatar/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank1]:     ret_val = func(*args, **kwargs)
[rank1]:   File "/home/work/lw_data/miniforge3/envs/stableavatar/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2267, in backward
[rank1]:     self._do_optimizer_backward(loss, retain_graph)
[rank1]:   File "/home/work/lw_data/miniforge3/envs/stableavatar/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2213, in _do_optimizer_backward
[rank1]:     self.optimizer.backward(loss, retain_graph=retain_graph)
[rank1]:   File "/home/work/lw_data/miniforge3/envs/stableavatar/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank1]:     ret_val = func(*args, **kwargs)
[rank1]:   File "/home/work/lw_data/miniforge3/envs/stableavatar/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2310, in backward
[rank1]:     self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank1]:   File "/home/work/lw_data/miniforge3/envs/stableavatar/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 65, in backward
[rank1]:     scaled_loss.backward(retain_graph=retain_graph)
[rank1]:   File "/home/work/lw_data/miniforge3/envs/stableavatar/lib/python3.10/site-packages/torch/_tensor.py", line 626, in backward
[rank1]:     torch.autograd.backward(
[rank1]:   File "/home/work/lw_data/miniforge3/envs/stableavatar/lib/python3.10/site-packages/torch/autograd/__init__.py", line 347, in backward
[rank1]:     _engine_run_backward(
[rank1]:   File "/home/work/lw_data/miniforge3/envs/stableavatar/lib/python3.10/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
[rank1]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank1]:   File "/home/work/lw_data/miniforge3/envs/stableavatar/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 1129, in unpack_hook
[rank1]:     frame.check_recomputed_tensors_match(gid)
[rank1]:   File "/home/work/lw_data/miniforge3/envs/stableavatar/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 903, in check_recomputed_tensors_match
[rank1]:     raise CheckpointError(
[rank1]: torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: Recomputed values for the following tensors have different metadata than during the forward pass.


Labels: help wanted (Extra attention is needed)
