Can't load SD 3.5 checkpoint // sd3-sd3.5-flux branch #3017

Open
ZfE0QQ6ds92W opened this issue Dec 16, 2024 · 3 comments

@ZfE0QQ6ds92W
Hi,

I am using the sd3-sd3.5-flux branch. It seems to me that (for some reason) the correct Python files are not used. Please note that everything works with SDXL.

SDXL -> working
INFO loading model for process 0/1 sdxl_train_util.py:32
INFO load StableDiffusion checkpoint: ./sd_xl_base_1.0.safetensors sdxl_train_util.py:73
INFO building U-Net sdxl_model_util.py:198
INFO loading U-Net from checkpoint sdxl_model_util.py:202

SD 3.5 -> not working:

INFO loading model for process 0/1 train_util.py:5359
INFO load StableDiffusion checkpoint: ./sd3.5_large_fp8_scaled.safetensors train_util.py:5315
Traceback (most recent call last):
File ".\kohya_ss\sd-scripts\train_network.py", line 1513, in
trainer.train(args)
File ".\kohya_ss\sd-scripts\train_network.py", line 413, in train
model_version, text_encoder, vae, unet = self.load_target_model(args, weight_dtype, accelerator)
File ".\kohya_ss\sd-scripts\train_network.py", line 128, in load_target_model
text_encoder, vae, unet, _ = train_util.load_target_model(args, weight_dtype, accelerator)
File ".\kohya_ss\sd-scripts\library\train_util.py", line 5361, in load_target_model
text_encoder, vae, unet, load_stable_diffusion_format = _load_target_model(
File ".\kohya_ss\sd-scripts\library\train_util.py", line 5316, in _load_target_model
text_encoder, vae, unet = model_util.load_models_from_stable_diffusion_checkpoint(
File ".\kohya_ss\sd-scripts\library\model_util.py", line 1005, in load_models_from_stable_diffusion_checkpoint
converted_unet_checkpoint = convert_ldm_unet_checkpoint(v2, state_dict, unet_config)
File ".\kohya_ss\sd-scripts\library\model_util.py", line 267, in convert_ldm_unet_checkpoint
new_checkpoint["time_embedding.linear_1.weight"] = unet_state_dict["time_embed.0.weight"]
KeyError: 'time_embed.0.weight'
Traceback (most recent call last):
File ".\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File ".\lib\runpy.py", line 86, in run_code
exec(code, run_globals)
File ".\Scripts\accelerate.EXE_main
.py", line 7, in
sys.exit(main())
File ".\lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main
args.func(args)
File ".\lib\site-packages\accelerate\commands\launch.py", line 1106, in launch_command
simple_launcher(args)
File ".\lib\site-packages\accelerate\commands\launch.py", line 704, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['.\python.exe', './kohya_ss/sd-scripts/train_network.py', '--config_file', './Output/config_lora-20241216-151851.toml']' returned non-zero exit status 1.
15:19:02-995047 INFO Training has ended.

Training config:
bucket_no_upscale = true
bucket_reso_steps = 64
cache_latents = true
caption_extension = ".txt"
clip_skip = 1
dynamo_backend = "no"
enable_bucket = true
epoch = 1
gradient_accumulation_steps = 1
huber_c = 0.1
huber_schedule = "snr"
loss_type = "l2"
lr_scheduler = "constant"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
max_bucket_reso = 2048
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 1000
max_token_length = 225
max_train_steps = 1600
min_bucket_reso = 256
mixed_precision = "fp16"
network_alpha = 1
network_args = []
network_dim = 8
network_module = "networks.lora"
network_train_unet_only = true
noise_offset_type = "Original"
optimizer_args = []
optimizer_type = "AdamW8bit"
output_dir = "./Output"
output_name = "Last_01"
pretrained_model_name_or_path = "./sd3.5_large_fp8_scaled.safetensors"
prior_loss_weight = 1
resolution = "2048,2048"
sample_prompts = ".\sample/prompt.txt"
sample_sampler = "euler_a"
save_every_n_epochs = 1
save_model_as = "safetensors"
save_precision = "fp16"
text_encoder_lr = []
train_batch_size = 1
train_data_dir = "./img"
unet_lr = 0.0001
wandb_run_name = "Last_01"
xformers = true

@ZfE0QQ6ds92W
Author

Small addition: the same thing happens with the sd3-flux.1 branch on a completely new Windows 11 install where only the necessary drivers and the Windows prerequisites are installed. LoRA training works with SDXL, but the SD3.5 model can't be loaded.

@Impudence12

The SD3 checkbox doesn't do anything. The train button defaults to train_network.py. It also defaults to network_module = "networks.lora". This is not a proper fix, but you can edit lora_gui.py to default to the SD3 scripts.

kohya_ss\kohya_gui\lora_gui.py
Line 1150: change run_cmd.append(rf"{scriptdir}/sd-scripts/train_network.py") to run_cmd.append(rf"{scriptdir}/sd-scripts/sd3_train_network.py")
Line 1267: change network_module = "networks.lora" to network_module = "networks.lora_sd3"
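
For reference, a minimal sketch of what those two edits amount to (simplified stand-in code, not the GUI's actual implementation; scriptdir is a placeholder here):

# Hypothetical, simplified reconstruction of the relevant lora_gui.py logic
scriptdir = "./kohya_ss"  # placeholder path
run_cmd = ["accelerate", "launch"]
run_cmd.append(rf"{scriptdir}/sd-scripts/sd3_train_network.py")  # was .../train_network.py
network_module = "networks.lora_sd3"  # was "networks.lora"
print(" ".join(run_cmd), "--network_module", network_module)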

Then you will need to add the CLIP/T5 paths to the Additional Parameters field, since there won't be any GUI boxes to put them in.

--clip_l "./models/clip/clip_l.safetensors" --clip_g "./models/clip/clip_g.safetensors" --t5xxl "./models/clip/t5xxl_fp16.safetensors"

No idea why neither branch that "supports" SD3.5 actually supports it. Technically the scripts do, but the GUI does not.
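
For what it's worth, calling the SD3 script directly (bypassing the GUI) should be equivalent once those arguments are in place; roughly, with example paths:

accelerate launch ./sd-scripts/sd3_train_network.py --config_file ./Output/config_lora.toml --clip_l ./models/clip/clip_l.safetensors --clip_g ./models/clip/clip_g.safetensors --t5xxl ./models/clip/t5xxl_fp16.safetensors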

@ZfE0QQ6ds92W
Author

Thanks for your help. Unfortunately, it still does not work. Is it a problem that I run it within an Anaconda environment?

INFO Building VAE sd3_utils.py:258
INFO Loading state dict... sd3_utils.py:260
INFO Loaded VAE: <All keys matched successfully> sd3_utils.py:262

import network module: networks.lora_sd3
INFO [Dataset 0] train_util.py:2495
INFO caching latents with caching strategy. train_util.py:1048
INFO caching latents... train_util.py:1097
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|
Traceback (most recent call last):
File ".\sd-scripts\sd3_train_network.py", line 480, in
trainer.train(args)
File ".\sd-scripts\train_network.py", line 461, in train
self.cache_text_encoder_outputs_if_needed(args, accelerator, unet, vae, text_encoders, train_dataset_group, weight_dtype)
File ".\sd-scripts\sd3_train_network.py", line 264, in cache_text_encoder_outputs_if_needed
text_encoders[1].to(accelerator.device, dtype=weight_dtype)
File "C:\Users\thoma.conda\envs\Kohya_sd35_large\lib\site-packages\transformers\modeling_utils.py", line 2905, in to
return super().to(*args, **kwargs)
File "C:\Users[...].conda\envs\Kohya_sd35_large\lib\site-packages\torch\nn\modules\module.py", line 1340, in to
return self._apply(convert)
File "C:\Users[...].conda\envs\Kohya_sd35_large\lib\site-packages\torch\nn\modules\module.py", line 900, in _apply
module._apply(fn)
File "C:\Users[...].conda\envs\Kohya_sd35_large\lib\site-packages\torch\nn\modules\module.py", line 900, in _apply
module._apply(fn)
File "C:\Users[...].conda\envs\Kohya_sd35_large\lib\site-packages\torch\nn\modules\module.py", line 900, in _apply
module._apply(fn)
File "C:\Users[...].conda\envs\Kohya_sd35_large\lib\site-packages\torch\nn\modules\module.py", line 927, in _apply
param_applied = fn(param)
File "C:\Users[...].conda\envs\Kohya_sd35_large\lib\site-packages\torch\nn\modules\module.py", line 1333, in convert
raise NotImplementedError(
NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.
Traceback (most recent call last):
File "C:\Users[...].conda\envs\Kohya_sd35_large\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users[...].conda\envs\Kohya_sd35_large\lib\runpy.py", line 86, in run_code
exec(code, run_globals)
File "C:\Users[...].conda\envs\Kohya_sd35_large\Scripts\accelerate.EXE_main
.py", line 7, in
sys.exit(main())
File "C:\Users[...].conda\envs\Kohya_sd35_large\lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main
args.func(args)
File "C:\Users[...].conda\envs\Kohya_sd35_large\lib\site-packages\accelerate\commands\launch.py", line 1106, in launch_command
simple_launcher(args)
File "C:\Users[...].conda\envs\Kohya_sd35_large\lib\site-packages\accelerate\commands\launch.py", line 704, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Users\[...]\.conda\envs\Kohya_sd35_large\python.exe', './sd-scripts/sd3_train_network.py', '--config_file', './config_lora-20241222-214855.toml', '--clip_l', './Models/clip/clip_l.safetensors', '--clip_g', './Models/clip/clip_g.safetensors', '--t5xxl', './Models/clip/t5xxl_fp16.safetensors']' returned non-zero exit status 1.
