The module 'HunyuanVideoTransformer3DModel' has been loaded in `bitsandbytes` 8bit and moving it to cpu via `.to()` is not supported. Module is still on cuda:0 #10653

### Describe the bug

1. `enable_model_cpu_offload()` works when the transformer is loaded in 4-bit but not when it is loaded in 8-bit. Is this expected behavior or a bug? With 8-bit, the following message is printed:
_The module 'HunyuanVideoTransformer3DModel' has been loaded in bitsandbytes 8bit and moving it to cpu via .to() is not supported. Module is still on cuda:0_

2. `device_map="balanced"` also fails with int8 (with `enable_model_cpu_offload()` commented out; not tested with int4):
_RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)_
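One way to narrow down the device mismatch in item 2 is to check where each component's tensors actually live. The following is a minimal sketch: `device_summary` is a hypothetical helper, not part of diffusers, and it relies only on the `named_parameters()` and `named_buffers()` methods that every `torch.nn.Module` provides.

```python
from collections import Counter


def device_summary(module) -> dict:
    """Count parameters and buffers per device for a torch.nn.Module-like object."""
    counts = Counter()
    for _name, tensor in module.named_parameters():
        counts[str(tensor.device)] += 1
    for _name, tensor in module.named_buffers():
        counts[str(tensor.device)] += 1
    return dict(counts)
```

Calling it on each component, for example `device_summary(pipe.transformer)` and `device_summary(pipe.text_encoder)`, shows which ones are split across `cpu` and `cuda:0`. The `index_select` in the error message suggests, though it does not prove, that a lookup table and its index tensor ended up on different devices.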

### Reproduction

```python
import torch
from diffusers import BitsAndBytesConfig, HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

# Load only the transformer in 8-bit via bitsandbytes.
quant_config = BitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = HunyuanVideoTransformer3DModel.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

# Build the pipeline around the quantized transformer; the other components load in float16.
pipe = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    # device_map="balanced",  # item 2: enable this and comment out enable_model_cpu_offload()
)
pipe.enable_model_cpu_offload()  # item 1: triggers the "not supported" message with 8-bit
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

prompt = "A cat walks on the grass, realistic style."
video = pipe(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "cat.mp4", fps=15)
```
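For comparison, item 1 reports that offloading works when the transformer is loaded in 4-bit. The report does not show the 4-bit settings that were used, so the `nf4` quant type and bfloat16 compute dtype in this sketch are assumptions; only `load_in_4bit=True` is implied by the report.

```python
import torch
from diffusers import BitsAndBytesConfig

# ASSUMPTION: quant type and compute dtype are illustrative; the report only says "4bit".
quant_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```

Passing `quant_config_4bit` as `quantization_config` in place of `quant_config` in the reproduction above would give the 4-bit variant.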

### Logs

```shell
(venv) C:\aiOWN\diffuser_webui>python HunyuanVideo_8bit_4bit.py
import error: No module named 'triton'
Fetching 6 files: 100%|█████████████████████████████████████████████████████| 6/6 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|████████████████████████████████████| 4/4 [00:01<00:00,  3.24it/s]
Loading pipeline components...: 100%|███████████████████████████████| 7/7 [00:04<00:00,  1.70it/s]
The module 'HunyuanVideoTransformer3DModel' has been loaded in `bitsandbytes` 8bit and moving it to cpu via `.to()` is not supported. Module is still on cuda:0.
Traceback (most recent call last):
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
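The out-of-memory error in the log can be put in context with a rough weights-only estimate. This is a back-of-the-envelope sketch, not a measurement: the parameter count of about 13 billion is an assumption that does not appear in this report, the 8188 MiB figure is taken from System Info below, and the estimate ignores quantization metadata, activations, and the other pipeline components.

```python
# Rough weights-only footprint of the transformer versus available VRAM.
PARAM_COUNT = 13e9  # ASSUMPTION: ~13B parameters; not stated in this report
GPU_MIB = 8188      # accelerator memory listed under System Info


def weights_mib(param_count: float, bits_per_param: int) -> float:
    """Return the weights-only size in MiB for a given bit width."""
    return param_count * bits_per_param / 8 / 2**20


for bits in (16, 8, 4):
    size = weights_mib(PARAM_COUNT, bits)
    verdict = "fits within" if size <= GPU_MIB else "exceeds"
    print(f"{bits:>2}-bit: {size:8.0f} MiB ({verdict} {GPU_MIB} MiB)")
```

Under these assumptions the 8-bit weights alone exceed the GPU's memory while the 4-bit weights fit, which would be consistent with the 8-bit run failing once the module cannot be moved off `cuda:0`.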

### System Info

- 🤗 Diffusers version: 0.33.0.dev0
- Platform: Windows-10-10.0.26100-SP0
- Running on Google Colab?: No
- Python version: 3.10.11
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.27.1
- Transformers version: 4.48.1
- Accelerate version: 1.4.0.dev0
- PEFT version: not installed
- Bitsandbytes version: 0.45.0
- Safetensors version: 0.5.2
- xFormers version: not installed
- Accelerator: NVIDIA GeForce RTX 4060 Laptop GPU, 8188 MiB
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

### Who can help?

_No response_
