Broken video output with Wan 2.1 I2V pipeline + quantized transformer

### Describe the bug

Since there is no proper documentation yet, I'm not sure if there is a difference to other video pipelines that I'm unaware of – but with the code below, the video results are reproducibly broken.

There is a warning:
`Expected types for image_encoder: (<class 'transformers.models.clip.modeling_clip.CLIPVisionModel'>,), got <class 'transformers.models.clip.modeling_clip.CLIPVisionModelWithProjection'>.`
which I assume I'm expected to ignore.

Init image:

![Image](https://github.com/user-attachments/assets/776aafca-f8d6-4f7b-81e6-c6d41d20dcee)

Result:

https://github.com/user-attachments/assets/c2e591e7-4cd5-4849-bec4-5938058c0775

Result with different seed:

https://github.com/user-attachments/assets/7006e400-3018-4891-9c4f-06d44ebc704f

Result with different prompt:

https://github.com/user-attachments/assets/42f15f68-bd2b-4b22-b6da-6d5182bc6b22


### Reproduction

```
# Tested on Google Colab with an A100 (40GB).
# Uses ~21 GB VRAM, takes ~150 sec per step, ~75 min in total.


!pip install git+https://github.com/huggingface/diffusers.git
!pip install -U bitsandbytes
!pip install ftfy


import os
import torch
from diffusers import (
    BitsAndBytesConfig,
    WanImageToVideoPipeline,
    WanTransformer3DModel
)
from diffusers.utils import export_to_video
from PIL import Image


model_id = "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers"
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
transformer = WanTransformer3DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config
)
pipe = WanImageToVideoPipeline.from_pretrained(
    model_id,
    transformer=transformer
)
pipe.enable_model_cpu_offload()


def render(
    filename,
    image,
    prompt,
    seed=0,
    width=832,
    height=480,
    num_frames=81,
    num_inference_steps=30,
    guidance_scale=5.0,
    fps=16
):
    video = pipe(
        image=image,
        prompt=prompt,
        generator=torch.Generator(device=pipe.device).manual_seed(seed),
        width=width,
        height=height,
        num_frames=num_frames,
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale
    ).frames[0]
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    export_to_video(video, filename, fps=fps)


render(
    filename="/content/test.mp4",
    image=Image.open("/content/test.png"),
    prompt="a woman in a yellow coat is dancing in the desert",
    seed=42
)
```

### Logs

```shell

```

### System Info

- 🤗 Diffusers version: 0.33.0.dev0
- Platform: Linux-6.1.85+-x86_64-with-glibc2.35
- Running on Google Colab?: Yes
- Python version: 3.11.11
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Flax version (CPU?/GPU?/TPU?): 0.10.4 (gpu)
- Jax version: 0.4.33
- JaxLib version: 0.4.33
- Huggingface_hub version: 0.28.1
- Transformers version: 4.48.3
- Accelerate version: 1.3.0
- PEFT version: 0.14.0
- Bitsandbytes version: 0.45.3
- Safetensors version: 0.5.3
- xFormers version: not installed
- Accelerator: NVIDIA A100-SXM4-40GB, 40960 MiB

### Who can help?

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Broken video output with Wan 2.1 I2V pipeline + quantized transformer #11006

Describe the bug

Reproduction

Logs

System Info

Who can help?

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Broken video output with Wan 2.1 I2V pipeline + quantized transformer #11006

Description

Describe the bug

Reproduction

Logs

System Info

Who can help?

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions