
Deterministic issue in DiffusionPipeline when getting dtype or device #10671

Open
dimitribarbot opened this issue Jan 28, 2025 · 3 comments · May be fixed by #10696
Labels
bug Something isn't working

Comments

@dimitribarbot
Contributor

Describe the bug

This issue is a follow-up to this PR.

The idea was to fix an issue leading to the following error message:

RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Half

This was caused by a dtype mismatch between the image input, which was encoded by a VAE running in float32 precision, and a unet configured with float16 precision.
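For reference, the usual remedy for this class of error is to cast the VAE-encoded input to the unet's dtype before it reaches the offending matmul. A one-line sketch with hypothetical variable names (not necessarily the exact change the linked PR made):

image_latents = image_latents.to(dtype=self.unet.dtype)  # align the VAE output dtype with the unet weights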

Although this bug is fixed, I would like to mention how difficult it was to debug, since it didn't occur on every run. After digging in, I found that the dtype property of the diffusion pipeline does not return its value deterministically (the same seems to apply to device). It returns the dtype of the first pipeline component it finds, but the set of pipeline components has no fixed order.

I think this may lead to other subtle bugs in the future and may be worth changing.
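For context, the non-determinism itself is plain Python behaviour: _get_signature_keys returns the component names as a set (see the logs below), and the iteration order of a set of strings changes from one interpreter run to the next because Python randomizes string hashes by default. A minimal sketch, independent of diffusers:

# Run this script several times: the "first" element of a string set
# differs between runs because str hashes are randomized per process.
names = {"tokenizer", "text_encoder", "vae", "unet", "scheduler"}
print(next(iter(names)))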

Reproduction

To reproduce the non-deterministic behaviour, you can first modify the DiffusionPipeline dtype property in the src/diffusers/pipelines/pipeline_utils.py file to add a log line:

@property
def dtype(self) -> torch.dtype:
    r"""
    Returns:
        `torch.dtype`: The torch dtype on which the pipeline is located.
    """
    module_names, _ = self._get_signature_keys(self)
    modules = [getattr(self, n, None) for n in module_names]
    modules = [m for m in modules if isinstance(m, torch.nn.Module)]

    for module in modules:
        print(f"module.dtype is {module.dtype} using {type(module).__name__} from {module_names}.")  # <--- Add this line
        return module.dtype

    return torch.float32

Then create a new Python script with the following content:

from diffusers import StableDiffusionXLPipeline
from diffusers.schedulers import UniPCMultistepScheduler
import torch

# initialize the models and pipeline
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16"
)
pipe.to(torch.device("cuda"))
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

# log the pipeline dtype, then generate an image
print("Pipeline dtype:", pipe.dtype)
image = pipe(prompt="a dog", num_inference_steps=20).images[0]

Run this script multiple times and observe the results (see "Logs" section below).
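Note: since the ordering comes from Python's string-hash randomization, I believe you can also pin the hash seed to make the runs reproducible while debugging (repro.py being whatever name you gave the script above):

PYTHONHASHSEED=0 python repro.py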

Logs

1st script run gives:
module.dtype is torch.float16 using CLIPTextModelWithProjection from {'tokenizer', 'text_encoder_2', 'text_encoder', 'vae', 'unet', 'scheduler', 'image_encoder', 'tokenizer_2', 'feature_extractor'}.
Pipeline dtype: torch.float16

2nd script run gives:
module.dtype is torch.float16 using CLIPTextModel from {'tokenizer_2', 'tokenizer', 'text_encoder', 'text_encoder_2', 'vae', 'feature_extractor', 'scheduler', 'image_encoder', 'unet'}.
Pipeline dtype: torch.float16

3rd script run gives:
module.dtype is torch.float16 using UNet2DConditionModel from {'unet', 'tokenizer', 'text_encoder', 'vae', 'text_encoder_2', 'tokenizer_2', 'scheduler', 'image_encoder', 'feature_extractor'}.
Pipeline dtype: torch.float16

4th script run gives:
module.dtype is torch.float16 using AutoencoderKL from {'vae', 'scheduler', 'text_encoder_2', 'unet', 'image_encoder', 'tokenizer_2', 'tokenizer', 'text_encoder', 'feature_extractor'}.
Pipeline dtype: torch.float16

System Info

  • 🤗 Diffusers version: 0.32.2
  • Platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
  • Running on Google Colab?: No
  • Python version: 3.10.12
  • PyTorch version (GPU?): 2.5.1+cu124 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Huggingface_hub version: 0.26.5
  • Transformers version: 4.47.1
  • Accelerate version: 1.2.1
  • PEFT version: 0.13.2
  • Bitsandbytes version: not installed
  • Safetensors version: 0.4.5
  • xFormers version: 0.0.28.post3
  • Accelerator: NVIDIA GeForce RTX 4090, 24564 MiB
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

No response

@dimitribarbot dimitribarbot added the bug Something isn't working label Jan 28, 2025
@a-r-r-o-w
Member

cc @sayakpaul @DN6

This is indeed a problem (with determining device as well), and it requires some additional device-determination gymnastics for the custom offloading strategy (#10503)

@sayakpaul
Member

sayakpaul commented Jan 30, 2025

Thanks for the investigation and the detailed reproducer (all reproducers should be like yours)! So, really thank you!

The order problem can be fixed with simple changes:

Patch
diff --git a/src/diffusers/pipelines/pipeline_utils.py b/src/diffusers/pipelines/pipeline_utils.py
index 0c1371c75..945ac24d0 100644
--- a/src/diffusers/pipelines/pipeline_utils.py
+++ b/src/diffusers/pipelines/pipeline_utils.py
@@ -507,6 +507,7 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):
         modules = [m for m in modules if isinstance(m, torch.nn.Module)]
 
         for module in modules:
+            print(f"module.dtype is {module.dtype} using {type(module).__name__} from {module_names}.")  # <--- Add this line
             return module.dtype
 
         return torch.float32
@@ -1577,7 +1578,7 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):
                 expected_modules.add(name)
                 optional_parameters.remove(name)
 
-        return expected_modules, optional_parameters
+        return sorted(expected_modules), sorted(optional_parameters)
 
     @classmethod
     def _get_signature_types(cls):
@@ -1619,10 +1620,12 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):
             k: getattr(self, k) for k in self.config.keys() if not k.startswith("_") and k not in optional_parameters
         }
 
-        if set(components.keys()) != expected_modules:
+        actual = sorted(set(components.keys()))
+        expected = sorted(expected_modules)
+        if actual != expected:
             raise ValueError(
                 f"{self} has been incorrectly initialized or {self.__class__} is incorrectly implemented. Expected"
-                f" {expected_modules} to be defined, but {components.keys()} are defined."
+                f" {expected} to be defined, but {actual} are defined."
             )
 
         return components
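
The sorted() calls give the signature keys a stable alphabetical order, so dtype and device are always read from the same component, regardless of how the underlying set happens to iterate.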

Would you maybe be interested in opening a PR for this?

@dimitribarbot
Contributor Author

Thank you for your answer and for the patch.

Yes, no problem, I will open a PR.
I'll take a look at it tomorrow, as I'm not available today.
