Description
Describe the bug
- enable_model_cpu_offload() works with 4-bit quantization but not with 8-bit. Is this the expected behavior or a bug? With 8-bit, the following message is logged:
The module 'HunyuanVideoTransformer3DModel' has been loaded in bitsandbytes 8bit and moving it to cpu via .to() is not supported. Module is still on cuda:0.
- device_map="balanced" also does not work with int8 (tested with the enable_model_cpu_offload() call commented out; not tested with int4). It fails with:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
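To narrow down a "found at least two devices" error like the one above, it can help to list which device each pipeline component's tensors actually live on. The following is only a diagnostic sketch, not part of the original report: the helper names summarize_devices and report_pipeline_devices are hypothetical, and the code assumes each component exposes named_parameters() and named_buffers() the way a torch.nn.Module does.

```python
from collections import Counter


def summarize_devices(module):
    """Count parameters and buffers per device for a torch.nn.Module-like object."""
    counts = Counter()
    for _name, tensor in module.named_parameters():
        counts[str(tensor.device)] += 1
    for _name, tensor in module.named_buffers():
        counts[str(tensor.device)] += 1
    return dict(counts)


def report_pipeline_devices(components):
    """Map component name -> per-device tensor counts.

    Entries without named_parameters()/named_buffers() (for example
    tokenizers or schedulers) are skipped.
    """
    report = {}
    for name, component in components.items():
        if hasattr(component, "named_parameters") and hasattr(component, "named_buffers"):
            report[name] = summarize_devices(component)
    return report
```

With the pipeline from the reproduction, calling report_pipeline_devices(pipe.components) just before pipe(...) would show whether a component is split across cpu and cuda:0, which is the situation the RuntimeError complains about.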
Reproduction
import torch
from diffusers import BitsAndBytesConfig, HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

# Load the transformer in 8-bit via bitsandbytes.
quant_config = BitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = HunyuanVideoTransformer3DModel.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

# Build the pipeline around the quantized transformer.
# Note: the pipeline uses float16 while the transformer above uses bfloat16.
pipe = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    # device_map="balanced",
)

# This call triggers the "moving it to cpu via .to() is not supported" message.
pipe.enable_model_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

prompt = "A cat walks on the grass, realistic style."
video = pipe(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "cat.mp4", fps=15)
Logs
(venv) C:\aiOWN\diffuser_webui>python HunyuanVideo_8bit_4bit.py
import error: No module named 'triton'
Fetching 6 files: 100%|█████████████████████████████████████████████████████| 6/6 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|████████████████████████████████████| 4/4 [00:01<00:00, 3.24it/s]
Loading pipeline components...: 100%|███████████████████████████████| 7/7 [00:04<00:00, 1.70it/s]
The module 'HunyuanVideoTransformer3DModel' has been loaded in `bitsandbytes` 8bit and moving it to cpu via `.to()` is not supported. Module is still on cuda:0.
Traceback (most recent call last):
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
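One possible contributing factor to the out-of-memory error is that the 8-bit weights stay on cuda:0 (as the message in the log says) and may simply be larger than the GPU. The rough estimate below is a sketch under stated assumptions: the parameter count of about 13 billion is an assumption that does not come from this report, the 8188 MiB figure is the one listed under System Info, and the calculation ignores quantization metadata, activations, and the other pipeline components.

```python
MIB = 1024 ** 2

# Assumption (not from the report): the transformer has roughly 13B parameters.
ASSUMED_NUM_PARAMS = 13_000_000_000

# GPU memory as listed under System Info in this report.
GPU_MEMORY_MIB = 8188


def weight_footprint_mib(num_params, bits_per_param):
    """Approximate size of the raw weights in MiB, ignoring all overhead."""
    return num_params * bits_per_param / 8 / MIB


def weights_fit_on_gpu(num_params, bits_per_param, gpu_memory_mib=GPU_MEMORY_MIB):
    """True if the raw weights alone are smaller than the GPU memory."""
    return weight_footprint_mib(num_params, bits_per_param) < gpu_memory_mib


if __name__ == "__main__":
    for label, bits in (("bf16", 16), ("int8", 8), ("4-bit", 4)):
        size = weight_footprint_mib(ASSUMED_NUM_PARAMS, bits)
        verdict = "fits" if weights_fit_on_gpu(ASSUMED_NUM_PARAMS, bits) else "does not fit"
        print(f"{label}: ~{size:.0f} MiB of weights, {verdict} in {GPU_MEMORY_MIB} MiB")
```

Under these assumptions the 8-bit weights alone exceed the GPU's memory while the 4-bit weights do not, which would be consistent with 4-bit running and 8-bit failing once the module cannot be moved off the GPU. This is an estimate, not a confirmed diagnosis.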
System Info
- 🤗 Diffusers version: 0.33.0.dev0
- Platform: Windows-10-10.0.26100-SP0
- Running on Google Colab?: No
- Python version: 3.10.11
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.27.1
- Transformers version: 4.48.1
- Accelerate version: 1.4.0.dev0
- PEFT version: not installed
- Bitsandbytes version: 0.45.0
- Safetensors version: 0.5.2
- xFormers version: not installed
- Accelerator: NVIDIA GeForce RTX 4060 Laptop GPU, 8188 MiB
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
No response