RuntimeError: cuDNN Frontend error #3049

Open
techidsk opened this issue Jan 14, 2025 · 1 comment
2025-01-14 16:41:25 INFO     Checking the state dict: Diffusers or BFL, dev or schnell                                                                                           flux_utils.py:43
                    INFO     Building Flux model dev from BFL checkpoint                                                                                                        flux_utils.py:101
2025-01-14 16:41:26 INFO     Loading state dict from /home/techidsk/code/kohya_ss/models/flux1-dev.safetensors                                                                  flux_utils.py:118
                    INFO     Loaded Flux: <All keys matched successfully>                                                                                                       flux_utils.py:137
                    INFO     Cast FLUX model to fp8. This may take a while. You can reduce the time by using fp8 checkpoint. /                                          flux_train_network.py:101
                             FLUXモデルをfp8に変換しています。これには時間がかかる場合があります。fp8チェックポイントを使用することで時間を短縮できます。
2025-01-14 16:41:43 INFO     Building CLIP-L                                                                                                                                    flux_utils.py:179
                    INFO     Loading state dict from /home/techidsk/code/ComfyUI/models/clip/clip_l.safetensors                                                                 flux_utils.py:275
                    INFO     Loaded CLIP-L: <All keys matched successfully>                                                                                                     flux_utils.py:278
                    INFO     Loading state dict from /home/techidsk/code/ComfyUI/models/clip/t5xxl_fp16.safetensors                                                             flux_utils.py:330
2025-01-14 16:41:44 INFO     Loaded T5xxl: <All keys matched successfully>                                                                                                      flux_utils.py:333
                    INFO     Building AutoEncoder                                                                                                                               flux_utils.py:144
                    INFO     Loading state dict from /home/techidsk/code/kohya_ss/models/ae.safetensors                                                                         flux_utils.py:149
                    INFO     Loaded AE: <All keys matched successfully>                                                                                                         flux_utils.py:152
import network module: networks.lora_flux
                    INFO     [Dataset 0]                                                                                                                                       train_util.py:2495
                    INFO     caching latents with caching strategy.                                                                                                            train_util.py:1048
                    INFO     caching latents...                                                                                                                                train_util.py:1097
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 123/123 [00:00<00:00, 2214.25it/s]
                    INFO     move vae and unet to cpu to save memory                                                                                                    flux_train_network.py:203
                    INFO     move text encoders to gpu                                                                                                                  flux_train_network.py:211
2025-01-14 16:41:55 INFO     [Dataset 0]                                                                                                                                       train_util.py:2517
                    INFO     caching Text Encoder outputs with caching strategy.                                                                                               train_util.py:1231
                    INFO     checking cache validity...                                                                                                                        train_util.py:1242
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 123/123 [00:00<00:00, 182232.21it/s]
                    INFO     caching Text Encoder outputs...                                                                                                                   train_util.py:1273
  0%|                                                                                                                                                                    | 0/123 [00:00<?, ?it/s]Could not load library libcuda.so. Error: libcuda.so: cannot open shared object file: No such file or directory
Could not load library libcuda.so. Error: libcuda.so: cannot open shared object file: No such file or directory
Could not load library libcuda.so. Error: libcuda.so: cannot open shared object file: No such file or directory
  0%|                                                                                                                                                                    | 0/123 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/techidsk/code/kohya_ss/sd-scripts/flux_train_network.py", line 583, in <module>
    trainer.train(args)
  File "/home/techidsk/code/kohya_ss/sd-scripts/train_network.py", line 461, in train
    self.cache_text_encoder_outputs_if_needed(args, accelerator, unet, vae, text_encoders, train_dataset_group, weight_dtype)
  File "/home/techidsk/code/kohya_ss/sd-scripts/flux_train_network.py", line 223, in cache_text_encoder_outputs_if_needed
    dataset.new_cache_text_encoder_outputs(text_encoders, accelerator)
  File "/home/techidsk/code/kohya_ss/sd-scripts/library/train_util.py", line 2518, in new_cache_text_encoder_outputs
    dataset.new_cache_text_encoder_outputs(models, accelerator)
  File "/home/techidsk/code/kohya_ss/sd-scripts/library/train_util.py", line 1276, in new_cache_text_encoder_outputs
    caching_strategy.cache_batch_outputs(tokenize_strategy, models, text_encoding_strategy, batch)
  File "/home/techidsk/code/kohya_ss/sd-scripts/library/strategy_flux.py", line 162, in cache_batch_outputs
    l_pooled, t5_out, txt_ids, _ = flux_text_encoding_strategy.encode_tokens(tokenize_strategy, models, tokens_and_masks)
  File "/home/techidsk/code/kohya_ss/sd-scripts/library/strategy_flux.py", line 68, in encode_tokens
    l_pooled = clip_l(l_tokens.to(clip_l.device))["pooler_output"]
  File "/home/techidsk/miniconda3/envs/kohya/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/techidsk/miniconda3/envs/kohya/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/techidsk/miniconda3/envs/kohya/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 986, in forward
    return self.text_model(
  File "/home/techidsk/miniconda3/envs/kohya/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/techidsk/miniconda3/envs/kohya/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/techidsk/miniconda3/envs/kohya/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 890, in forward
    encoder_outputs = self.encoder(
  File "/home/techidsk/miniconda3/envs/kohya/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/techidsk/miniconda3/envs/kohya/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/techidsk/miniconda3/envs/kohya/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 813, in forward
    layer_outputs = encoder_layer(
  File "/home/techidsk/miniconda3/envs/kohya/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/techidsk/miniconda3/envs/kohya/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/techidsk/miniconda3/envs/kohya/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 548, in forward
    hidden_states, attn_weights = self.self_attn(
  File "/home/techidsk/miniconda3/envs/kohya/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/techidsk/miniconda3/envs/kohya/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/techidsk/miniconda3/envs/kohya/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 480, in forward
    attn_output = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: cuDNN Frontend error: [cudnn_frontend] Error: No execution plans support the graph.
                    WARNING  Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLEOFError(8, '[SSL:               connectionpool.py:870
                             UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1007)')': /api/4504800232407040/envelope/
2025-01-14 16:41:56 WARNING  Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLEOFError(8, '[SSL:               connectionpool.py:870
                             UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1007)')': /api/4504800232407040/envelope/
                    WARNING  Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLEOFError(8, '[SSL:               connectionpool.py:870
                             UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1007)')': /api/4504800232407040/envelope/
2025-01-14 16:41:57 WARNING  Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLEOFError(8, '[SSL:               connectionpool.py:870
                             UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1007)')': /api/4504800232407040/envelope/
                    WARNING  Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLEOFError(8, '[SSL:               connectionpool.py:870
                             UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1007)')': /api/4504800232407040/envelope/
Traceback (most recent call last):
  File "/home/techidsk/miniconda3/envs/kohya/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/techidsk/miniconda3/envs/kohya/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/techidsk/miniconda3/envs/kohya/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1106, in launch_command
    simple_launcher(args)
  File "/home/techidsk/miniconda3/envs/kohya/lib/python3.10/site-packages/accelerate/commands/launch.py", line 704, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/techidsk/miniconda3/envs/kohya/bin/python3.10', '/home/techidsk/code/kohya_ss/sd-scripts/flux_train_network.py', '--config_file', '/home/techidsk/code/kohya_ss/outputs/0114_shaoxing/config_lora-20250114-164115.toml']' returned non-zero exit status 1.

I used the sd3.1 + flux branch and found the error may be caused by PyTorch 2.5.

When I pip install PyTorch 2.4 and restart the GUI, it automatically reinstalls PyTorch 2.5.
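The report pins the failure to the PyTorch version rather than the model or dataset. A minimal sketch of that version gate (the `is_affected_torch` helper and the 2.5 cutoff are my assumptions from this thread, not an official compatibility table):

```python
def is_affected_torch(version: str) -> bool:
    """Return True for torch releases (>= 2.5) where, per this report,
    the cuDNN SDPA path can raise 'No execution plans support the graph'."""
    # Strip any local build tag like "+cu124" before parsing major.minor.
    major, minor = (int(p) for p in version.split("+")[0].split(".")[:2])
    return (major, minor) >= (2, 5)
```

In practice you would feed it `torch.__version__` before launching training and warn the user to downgrade.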

@techidsk (Author)

Downgrading PyTorch to 2.4 works:
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1

Restarting with python kohya_gui.py --noverify skips the requirements check, so the pinned versions are not reinstalled.
