
"index_select_cuda" not implemented for 'Float8_e4m3fn' error from CogVideoXImageToVideoPipeline #9539

Closed
FurkanGozukara opened this issue Sep 26, 2024 · 17 comments
Labels: bug (Something isn't working)

@FurkanGozukara commented Sep 26, 2024:

Describe the bug

Hello. I am trying to load CogVideoXImageToVideo in FP8 and I am getting this error.

Without FP8 there are no such errors.

I am simply following this page: https://huggingface.co/THUDM/CogVideoX-5b

The Diffusers commit is the latest: diffusers @ git+https://github.com/huggingface/diffusers.git@665c6b47a23bc841ad1440c4fe9cbb1782258656

Here are the full details of the error:

[Screenshots: full traceback for the "index_select_cuda" not implemented for 'Float8_e4m3fn' error]

System Info

Microsoft Windows [Version 10.0.19045.4894]
(c) Microsoft Corporation. All rights reserved.

R:\CogVideoX_v1\CogVideoX_SECourses\venv\Scripts>activate

(venv) R:\CogVideoX_v1\CogVideoX_SECourses\venv\Scripts>pip freeze
accelerate==0.34.2
aiofiles==23.2.1
annotated-types==0.7.0
anyio==4.6.0
certifi==2024.8.30
charset-normalizer==3.3.2
click==8.1.7
colorama==0.4.6
contourpy==1.3.0
cycler==0.12.1
decorator==4.4.2
diffusers @ git+https://github.com/huggingface/diffusers.git@665c6b47a23bc841ad1440c4fe9cbb1782258656
distro==1.9.0
einops==0.8.0
exceptiongroup==1.2.2
fastapi==0.115.0
ffmpy==0.4.0
filelock==3.16.1
fonttools==4.54.1
fsspec==2024.9.0
gradio==4.44.0
gradio_client==1.3.0
h11==0.14.0
httpcore==1.0.5
httpx==0.27.2
huggingface-hub==0.25.1
idna==3.10
imageio==2.35.1
imageio-ffmpeg==0.5.1
importlib_metadata==8.5.0
importlib_resources==6.4.5
Jinja2==3.1.4
jiter==0.5.0
kiwisolver==1.4.7
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.2
mdurl==0.1.2
moviepy==1.0.3
mpmath==1.3.0
networkx==3.3
numpy==1.26.0
openai==1.48.0
opencv-python==4.10.0.84
orjson==3.10.7
packaging==24.1
pandas==2.2.3
Pillow==9.5.0
proglog==0.1.10
psutil==6.0.0
pydantic==2.9.2
pydantic_core==2.23.4
pydub==0.25.1
Pygments==2.18.0
pyparsing==3.1.4
python-dateutil==2.9.0.post0
python-multipart==0.0.10
pytz==2024.2
PyYAML==6.0.2
regex==2024.9.11
requests==2.32.3
rich==13.8.1
ruff==0.6.8
safetensors==0.4.5
scikit-video==1.1.11
scipy==1.14.1
semantic-version==2.10.0
sentencepiece==0.2.0
shellingham==1.5.4
six==1.16.0
sniffio==1.3.1
spandrel==0.4.0
starlette==0.38.6
sympy==1.13.3
tokenizers==0.20.0
tomlkit==0.12.0
torch==2.4.1+cu124
torchvision==0.19.1+cu124
tqdm==4.66.5
transformers==4.45.0
triton @ https://huggingface.co/MonsterMMORPG/SECourses/resolve/main/triton-3.0.0-cp310-cp310-win_amd64.whl
typer==0.12.5
typing_extensions==4.12.2
tzdata==2024.2
urllib3==2.2.3
uvicorn==0.30.6
websockets==12.0
xformers==0.0.28.post1
zipp==3.20.2

(venv) R:\CogVideoX_v1\CogVideoX_SECourses\venv\Scripts>diffusers-cli env

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

  • 🤗 Diffusers version: 0.31.0.dev0
  • Platform: Windows-10-10.0.19045-SP0
  • Running on Google Colab?: No
  • Python version: 3.10.11
  • PyTorch version (GPU?): 2.4.1+cu124 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Huggingface_hub version: 0.25.1
  • Transformers version: 4.45.0
  • Accelerate version: 0.34.2
  • PEFT version: not installed
  • Bitsandbytes version: not installed
  • Safetensors version: 0.4.5
  • xFormers version: 0.0.28.post1
  • Accelerator: NVIDIA GeForce RTX 3090 Ti, 24564 MiB
    NVIDIA GeForce RTX 3060, 12288 MiB
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

(venv) R:\CogVideoX_v1\CogVideoX_SECourses\venv\Scripts>

Who can help?

Text-to-Video / Video-to-Video @DN6 @a-r-r-o-w

@FurkanGozukara (Author) commented Sep 26, 2024:

This doesn't make sense to me because we are able to use the FP8 mode of the FLUX model and T5 XXL when using FLUX with ComfyUI.

@a-r-r-o-w (Member) commented:

Hey. As far as I understand, ComfyUI uses layerwise upcasting to allow inference in fp8_e4m3, that is, weights are stored in fp8 dtype but inference is done in float16/bfloat16. This is the standard way of doing quantization things. Diffusers doesn't support layerwise upcasting but there was an attempt in #9177.

I'm not sure what your exact code is, so I won't speculate on how you could fix it. But you can find some more information and gotchas in the diffusers-torchao repo. Torchao fp8 quantization (float8_weight_only(), float8_dynamic_activation_float8_weight(), float8_dynamic_activation_float8_weight(granularity=PerRow())) only works on Hopper architectures (so H100 and similar). But that's a bit different from using fp8_e4m3, which comes from PyTorch itself as a storage dtype (and would require the layerwise upcasting trick that isn't in Diffusers yet).

My personal recommendation is to use int8 quantization from torchao, or fp8 from quanto, both of which are supported on the 3090 Ti and similar architectures. They are almost on par with bf16 in terms of quality and speed.
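
For illustration, a minimal sketch of the torchao int8 weight-only route (an assumption-laden example, not taken from this thread; it presumes a recent torchao release and bf16 weights):

    # Sketch: int8 weight-only quantization of the CogVideoX transformer with torchao.
    # Weights are stored in int8; compute still happens in bf16.
    import torch
    from diffusers import CogVideoXImageToVideoPipeline
    from torchao.quantization import quantize_, int8_weight_only

    pipe = CogVideoXImageToVideoPipeline.from_pretrained(
        "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
    )
    quantize_(pipe.transformer, int8_weight_only())  # quantizes the module in place
    pipe.to("cuda")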

cc @sayakpaul in case he has a different idea of what's happening here.

@FurkanGozukara (Author) commented:

@a-r-r-o-w thank you for the answer.

The pipeline is here: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py

Any hope of making it support FP8 the way ComfyUI does?

@sayakpaul (Member) commented:

You can use this https://gist.github.com/a-r-r-o-w/31be62828b00a9292821b85c1017effa for FP8. And @a-r-r-o-w has already explained another way of doing FP8:

Hey. As far as I understand, ComfyUI uses layerwise upcasting to allow inference in fp8_e4m3, that is, weights are stored in fp8 dtype but inference is done in float16/bfloat16. This is the standard way of doing quantization things. Diffusers doesn't support layerwise upcasting but there was an attempt in #9177.

@FurkanGozukara (Author) commented Sep 27, 2024:

@a-r-r-o-w

I applied the quantization as below, but it made zero difference in VRAM. What could be the reason?

I am on Windows 10 and Python 3.10.

It is as if the quantized weights remain in RAM: it uses a huge amount of RAM and the RAM usage doesn't drop.

It uses 26 GB of VRAM without any optimization, so I expected it to fit into a 24 GB GPU.

I am monitoring VRAM usage: after executing .to(device) it gets almost full, and then the quantize and freeze steps make zero difference in memory. I am monitoring via nvitop and stepping through the code in Microsoft Visual Studio.

    pipe_image = CogVideoXImageToVideoPipeline.from_pretrained(
        "THUDM/CogVideoX-5b-I2V",
        transformer=transformer,
        vae=vae,
        scheduler=pipe.scheduler,
        tokenizer=pipe.tokenizer,
        text_encoder=text_encoder,
        torch_dtype=default_dtype,
    ).to(device)

    quantize(pipe.transformer, weights=qfloat8)
    quantize(pipe.vae, weights=qfloat8)
    freeze(pipe.transformer)
    freeze(pipe.vae)

Here is what I mean.

Before quantizing, RAM and VRAM look like this:

[Screenshot: RAM/VRAM usage before quantization]

After quantizing, it uses far more RAM and there is no difference in VRAM. @sayakpaul

[Screenshot: RAM/VRAM usage after quantization]

@a-r-r-o-w (Member) commented:

I applied the quantization as below, but it made zero difference in VRAM. What could be the reason?

I do not have a 3090, so I'm unable to reproduce this 1:1 but when running the gist script shared above, I do see the memory savings in model memory on a 4090/A100. It does take 26 GB for inference, as reported in the gist results, so it OOMs. You will need to enable VAE tiling for it to work under 24 GB. If you still face OOMs, I think CPU offloading might be required too.

Another solution, if you don't want to use CPU offloading, is to quantize the T5 encoder to 8-bit using a bitsandbytes config (or, better yet, to nf4). The quality loss is minimal and probably not very noticeable. You can keep the rest of the model in normal bf16 precision. This works in 22-23 GB, I believe.
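
For reference, both memory-saving switches mentioned above are one-liners on the pipeline; a minimal sketch (enable_slicing is an extra, optional knob not mentioned in the comment):

    pipe.vae.enable_tiling()         # decode the video latents tile by tile
    pipe.vae.enable_slicing()        # optional: also slice the VAE decode across the batch
    # If it still OOMs, trade speed for memory:
    pipe.enable_model_cpu_offload()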

@FurkanGozukara (Author) commented Sep 27, 2024:

@a-r-r-o-w that bitsandbytes approach is a great solution. How can I do that?

With CPU offloading it works on the RTX 3090, but it becomes too slow, and the image-to-video pipeline uses 26 GB without CPU offloading. I see you didn't test that one.

The thing is,

    quantize(pipe.transformer, weights=qfloat8)
    quantize(pipe.vae, weights=qfloat8)
    freeze(pipe.transformer)
    freeze(pipe.vae)

has made absolutely zero difference, but I see a 6 GB+ reduction in your experiments.

@a-r-r-o-w (Member) commented:

For quantizing the T5 encoder, you could follow this guide: https://huggingface.co/docs/transformers/main/en/quantization/bitsandbytes (this quantization feature will be coming to Diffusers too, soon). Or, if you want a checkpoint that has already been quantized, you could use this.

has made absolutely zero difference, but I see a 6 GB+ reduction in your experiments

This is very weird and unexpected. I can't reproduce it and don't have the bandwidth to run more experiments on different GPUs, so I hope someone else can take a look.

If you're still facing problems on the 3090 Ti, I'm not sure how helpful I can be. My advice would be, if you prefer not to use CPU offloading, to encode all your testing prompts first, save the embeddings, delete the text encoder and free memory, then perform normal denoising + decoding by passing prompt_embeds and negative_prompt_embeds.
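
A rough sketch of that prompt-precomputation workflow (assuming the pipeline's encode_prompt helper, whose exact signature may vary between diffusers versions, and the pipe_image/image variables from the snippet above):

    import gc
    import torch

    # Encode the prompt once, then drop the T5 encoder before denoising.
    with torch.no_grad():
        prompt_embeds, negative_prompt_embeds = pipe_image.encode_prompt(
            prompt="your prompt here",
            negative_prompt="",
            do_classifier_free_guidance=True,
            device="cuda",
        )

    pipe_image.text_encoder = None
    gc.collect()
    torch.cuda.empty_cache()

    # Pass the precomputed embeddings instead of the raw prompt.
    video = pipe_image(
        image=image,
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_prompt_embeds,
    ).frames[0]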

@FurkanGozukara (Author) commented Sep 27, 2024:

@a-r-r-o-w thank you so much

If I change the repo name to sayakpaul/cog-t5-nf4, it simply fails.

If I add it like this, which I suppose is the accurate way:

    from transformers import BitsAndBytesConfig, T5EncoderModel

    nf4_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
    )
    text_encoder = T5EncoderModel.from_pretrained("sayakpaul/cog-t5-nf4", quantization_config=nf4_config)

Then I start getting this error (no CPU offloading is enabled):

[Screenshot: error traceback]

As soon as I add the bitsandbytes import to the code, with nothing else changed from the code above, it has a system-wide impact like the following:

Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. 
These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
`low_cpu_mem_usage` was None, now set to True since model is quantized.

@a-r-r-o-w (Member) commented Sep 29, 2024:

@FurkanGozukara I noticed you commented on Reddit that you fixed it and were able to run in under 24 GB while not using cpu offloading. Great work figuring it out! LMK if I can close this

@FurkanGozukara (Author) commented:

@FurkanGozukara I noticed you commented on Reddit that you fixed it and were able to run in under 24 GB while not using cpu offloading. Great work figuring it out! LMK if I can close this

Yes, but we (actually my friend) modified a lot of the pipelines, so it wasn't anything like what was said here; it was all pipeline hacking :/

I think you really should handle this properly. Currently the situation is not fixed, and everyone else who tries will get the same errors.

@a-r-r-o-w (Member) commented Sep 29, 2024:

Could you let me know what you/your friend tried? It would be very valuable feedback for improving the pipeline, that is, if the hacks were pipeline/modeling-level fixes.

I know that most people resort to using CPU offloading, which is slow, but if we can do anything to improve this, I'd be happy to incorporate it in no time.

Other than that, as I said in one of my comments above, I had no problem running in under 24 GB on a 4090.

@FurkanGozukara (Author) commented Sep 29, 2024:

Could you let me know what you/your friend tried? It would be very valuable feedback for improving the pipeline, that is, if the hacks were pipeline/modeling-level fixes.

I know that most people resort to using CPU offloading, which is slow, but if we can do anything to improve this, I'd be happy to incorporate it in no time.

Other than that, as I said in one of my comments above, I had no problem running in under 24 GB on a 4090.

Here you can see the code: FurkanGozukara/ImageToVideoAI@37dad09

A lot of pipeline changes :D

I don't know of anyone else who has made image-to-video work on Windows on 24 GB GPUs without CPU offloading yet, other than us.

I think you can compare it with the official pipelines.

By the way, video-to-video and text-to-video are still not working at all; only image-to-video is working.

@a-r-r-o-w (Member) commented Sep 29, 2024:

Thank you for the quick response! I'll take a look and see what I can do as soon as I'm free

@sayakpaul (Member) commented:

I think Dhruv’s PR on dynamic upcasting could work without quantization and offloading.
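
For context, a very rough conceptual sketch of what dynamic/layerwise upcasting means here (purely illustrative, not the implementation from the PR mentioned above): weights are stored in fp8_e4m3 and each block is temporarily cast to bf16 only while its forward pass runs.

    import torch

    def apply_layerwise_upcasting(model: torch.nn.Module,
                                  storage_dtype=torch.float8_e4m3fn,
                                  compute_dtype=torch.bfloat16):
        # Hypothetical helper, for illustration only.
        def upcast(module, args):
            module.to(compute_dtype)     # cast to bf16 just before this block runs

        def downcast(module, args, output):
            module.to(storage_dtype)     # cast back to fp8 right after
            return output

        for block in model.children():
            block.to(storage_dtype)
            block.register_forward_pre_hook(upcast)
            block.register_forward_hook(downcast)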

@a-r-r-o-w (Member) commented:

I looked at the changes, and it looks like there is now an autocast statement in your version of the pipeline:

[Screenshot: the added autocast block in the modified pipeline]

I'm not sure why/how this helps you run i2v in under 24 GB, but I noticed that you don't have this change in the t2v and v2v pipelines. If you can run the 5B image-to-video checkpoint in under 24 GB (as you mentioned above), there is no way that running t2v/v2v with Cog-5b would lead to OOMs, because the effective latent size is doubled in the former due to the image latents.
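
For reference, a minimal illustration of what such an autocast wrapper around the pipeline call can look like (not the exact change from the linked commit; pipe_image, image, and prompt are assumed from earlier snippets):

    import torch

    # Run the pipeline under bf16 autocast so mixed-precision kernels are selected.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        video = pipe_image(image=image, prompt=prompt, num_inference_steps=50).frames[0]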

Also, I'm unable to reproduce this, so I'm afraid I can't be of any more help. Sayak's bitsandbytes PR will be merged as a quantization backend, so hopefully that helps a little bit in reducing memory requirements without offloading.

@FurkanGozukara (Author) commented:

@a-r-r-o-w our text-to-video and video-to-video are not working at all :D I also tried the official ones from the official demo on Hugging Face.

Thanks a lot.
