"index_select_cuda" not implemented for 'Float8_e4m3fn' error from CogVideoXImageToVideoPipeline #9539
Comments
This doesn't make sense to me, because we are able to use the FP8 mode of the FLUX model and T5 XXL when using FLUX with ComfyUI.
Hey. As far as I understand, ComfyUI uses layerwise upcasting to allow inference with fp8_e4m3: weights are stored in fp8 dtype, but inference is done in float16/bfloat16. This is the standard way of doing these quantization things. Diffusers doesn't support layerwise upcasting, but there was an attempt in #9177. I'm not sure what your exact code is, so I won't speculate on how you could fix it, but you can find some more information and gotchas in the diffusers-torchao repo. With torchao fp8 quantization (…)

My personal opinion is to use int8 quantization from torchao, or fp8 from quanto, which is supported on the 3090 Ti and similar architectures. They are almost at the same level in terms of quality/speed compared to bf16.

cc @sayakpaul in case he has a different idea of what's happening here.
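For reference, a minimal sketch of the torchao int8 weight-only route mentioned above (my own illustration, not code from this thread; the checkpoint name THUDM/CogVideoX-5b-I2V and a recent torchao release are assumptions):

```python
# Sketch: int8 weight-only quantization of the CogVideoX transformer with torchao.
import torch
from diffusers import CogVideoXImageToVideoPipeline
from torchao.quantization import int8_weight_only, quantize_

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)

# Only the weights are stored in int8; computation stays in bf16, so ops such as
# index_select never run on a low-bit dtype.
quantize_(pipe.transformer, int8_weight_only())
pipe.to("cuda")
```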
@a-r-r-o-w thank you for the answer. The pipeline is here: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py. Any hope of making it support FP8 like ComfyUI does?
You can use this https://gist.github.com/a-r-r-o-w/31be62828b00a9292821b85c1017effa for FP8, and @a-r-r-o-w has already explained another way of doing FP8.
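For context, the pattern in that gist is roughly the following (a sketch of the optimum-quanto approach, not the gist verbatim; the checkpoint name is an assumption):

```python
# Sketch: fp8 weight quantization with optimum-quanto (weights in fp8, compute in bf16).
import torch
from diffusers import CogVideoXImageToVideoPipeline
from optimum.quanto import freeze, qfloat8, quantize

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)

# Quantize and freeze the transformer weights to fp8; activations stay in bf16.
quantize(pipe.transformer, weights=qfloat8)
freeze(pipe.transformer)

pipe.to("cuda")
```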
I applied the quantize and freeze steps, but they made zero VRAM difference. What could be the reason? I am on Windows 10 and Python 3.10. It looks like the quantized weights remain in RAM, since the process uses a huge amount of RAM and RAM usage doesn't drop. It uses 26 GB of VRAM without any optimization, so I expected it to fit into a 24 GB GPU. I am monitoring VRAM usage: after executing .to(device) the GPU gets almost full, and then the quantize and freeze steps make zero difference in memory. I am monitoring via nvitop and stepping through the code in Microsoft Visual Studio.
Here is what I mean: before quantizing, check RAM and VRAM; after quantizing, it is using way more RAM and there is no difference in VRAM. @sayakpaul
I do not have a 3090, so I'm unable to reproduce this 1:1, but when running the gist script shared above I do see the memory savings in model memory on a 4090/A100. It does take 26 GB for inference, as reported in the gist results, so it OOMs. You will need to enable VAE tiling for it to work under 24 GB. If you still face OOMs, I think CPU offloading might be required too. Another solution, if you don't want to use CPU offloading, is to quantize the T5 encoder to 8-bit using a bitsandbytes config (or better yet, to NF4). The quality loss is minimal and probably not very noticeable either. You can keep the rest of the components in normal bf16 precision. This works in 22-23 GB, I believe.
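To make those memory switches concrete, this is roughly what enabling them looks like (a sketch; `pipe` is a CogVideoX pipeline loaded as above):

```python
# Decode video latents in tiles so the VAE doesn't spike memory at decode time.
pipe.vae.enable_tiling()

# Only if tiling alone still OOMs: trade speed for memory with CPU offloading.
# pipe.enable_model_cpu_offload()
```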
@a-r-r-o-w that bitsandbytes idea sounds like a great solution; how can I do that? With CPU offloading it works on the RTX 3090, but it becomes too slow, and the image-to-video pipeline uses 26 GB without CPU offloading (you didn't test that, I see). The thing is, the quantize/freeze step has made absolutely zero difference for me, yet I see a 6 GB+ reduction in your experiments.
For quantizing the T5 encoder, you could follow this guide: https://huggingface.co/docs/transformers/main/en/quantization/bitsandbytes (this quantization feature will be coming to Diffusers too, soon). Or, if you want to download an available checkpoint where the quantization has already been done, you could use this.
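A sketch of what that looks like for the T5 encoder, following the linked transformers guide (the repo and subfolder names are my assumptions based on the CogVideoX-5b layout):

```python
# Sketch: load only the T5 text encoder in NF4 via bitsandbytes, keep the rest in bf16.
import torch
from transformers import BitsAndBytesConfig, T5EncoderModel
from diffusers import CogVideoXImageToVideoPipeline

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

text_encoder = T5EncoderModel.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",
    subfolder="text_encoder",
    quantization_config=bnb_config,
)

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",
    text_encoder=text_encoder,
    torch_dtype=torch.bfloat16,
)
```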
This is very weird and unexpected. I can't reproduce it and don't have the bandwidth to run more experiments on different GPUs, so I'm going to hope someone else can take a look. If you're still facing problems on the 3090 Ti, I'm not sure how helpful I can be. My advice, if you prefer not to use CPU offloading, would be to encode all your testing prompts first, save the embeddings, delete the text encoder and free the memory, and then perform normal denoising + decoding by passing the saved prompt embeddings.
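A sketch of that workflow, assuming the pipeline's `encode_prompt` helper and a `pipe` already on the GPU (the exact parameter names here are my best guess, not tested):

```python
# Sketch: encode prompts once, drop the T5 encoder, then denoise/decode with the
# precomputed embeddings to keep peak VRAM down.
import gc
import torch

with torch.no_grad():
    prompt_embeds, negative_prompt_embeds = pipe.encode_prompt(
        prompt="an astronaut riding a horse", device="cuda"
    )

# Free the text encoder before the memory-heavy denoising + decoding stages.
pipe.text_encoder = None
gc.collect()
torch.cuda.empty_cache()

video = pipe(
    image=image,  # the conditioning image, prepared elsewhere
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
).frames[0]
```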
@a-r-r-o-w thank you so much. If I change the repo name to sayakpaul/cog-t5-nf4, it simply fails. If I add it the way I suppose is the accurate way, I start getting this error (no CPU offloading is enabled). As soon as I add the bitsandbytes import to the code, with nothing else changed from the code above, it has a system-wide impact, like below.
@FurkanGozukara I noticed you commented on Reddit that you fixed it and were able to run under 24 GB without using CPU offloading. Great work figuring it out! Let me know if I can close this.
Yes, but we (actually my friend) modified a lot of the pipelines with hacks, so it wasn't anything like what was said here; it was all pipeline hacking :/ I think you really should make this properly handled. Currently the situation is not fixed, and everyone else who tries will get the same errors.
Could you let me know what you/your friend tried? It would be very valuable feedback for improving the pipeline, that is, if the hacks were pipeline/modeling-level fixes. I know that most people resort to using CPU offloading, which is slow, but if we can do anything to improve this, I'd be happy to incorporate it in no time. Other than that, as I said in one of my comments above, I had no problem running under 24 GB on a 4090.
Here you can see the code: FurkanGozukara/ImageToVideoAI@37dad09. A lot of pipeline changes :D I don't know anyone else who made image-to-video work on Windows without CPU offloading on 24 GB GPUs at the moment, other than us. I think you can compare it with the official pipelines. By the way, video-to-video and text-to-video are still not working at all; only image-to-video works.
Thank you for the quick response! I'll take a look and see what I can do as soon as I'm free.
I think Dhruv’s PR on dynamic upcasting could work without quantization and offloading. |
I looked at the changes, and it looks like there is now an autocast statement in your version of the pipeline. I'm not sure why/how this helps you run i2v under 24 GB, but I noticed that you don't have this change in the t2v and v2v pipelines. If you can run the 5B image-to-video checkpoint under 24 GB (as you mentioned above), there is no way that running t2v/v2v with Cog-5b would lead to OOMs, because the effective latent size is doubled in the former due to the image latents. Also, I'm unable to reproduce, so I'm afraid I can't be of any more help. Sayak's bitsandbytes PR will be merged as a quantization backend, so hopefully that helps a bit in reducing memory requirements without offloading.
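For readers following along, a rough guess at the kind of autocast wrapping being discussed (my illustration, not the actual change from the linked commit):

```python
# Run the pipeline under autocast so intermediate compute happens in bf16
# regardless of the stored weight dtype.
import torch

with torch.autocast("cuda", dtype=torch.bfloat16):
    video = pipe(image=image, prompt="an astronaut riding a horse").frames[0]
```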
@a-r-r-o-w our text-to-video and video-to-video are not working at all :D I also tried the official ones from the official demo on Hugging Face. Thanks a lot.
Describe the bug
Hello. I am trying to load CogVideoXImageToVideoPipeline in FP8 and I am getting this error. Without FP8 there are no such errors.
I am simply following this page: https://huggingface.co/THUDM/CogVideoX-5b
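For anyone trying to reproduce, my guess at a minimal trigger for this error (an assumption; the original script is not shown here): casting fp8 weights onto modules that still do embedding lookups makes `index_select` run on `Float8_e4m3fn`, which has no CUDA kernel.

```python
# Guessed reproduction: casting the text encoder (and transformer) to fp8 directly.
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import load_image

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.text_encoder.to(torch.float8_e4m3fn)  # fp8 storage for T5
pipe.transformer.to(torch.float8_e4m3fn)
pipe.to("cuda")

image = load_image("path/to/conditioning_image.png")  # any input image
# Fails during text encoding with: "index_select_cuda" not implemented for 'Float8_e4m3fn'
video = pipe(image=image, prompt="a video").frames[0]
```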
The Diffusers commit id is the latest: diffusers @ git+https://github.com/huggingface/diffusers.git@665c6b47a23bc841ad1440c4fe9cbb1782258656
Here are the full details of the error:
System Info
Microsoft Windows [Version 10.0.19045.4894]
(c) Microsoft Corporation. All rights reserved.
R:\CogVideoX_v1\CogVideoX_SECourses\venv\Scripts>activate
(venv) R:\CogVideoX_v1\CogVideoX_SECourses\venv\Scripts>pip freeze
accelerate==0.34.2
aiofiles==23.2.1
annotated-types==0.7.0
anyio==4.6.0
certifi==2024.8.30
charset-normalizer==3.3.2
click==8.1.7
colorama==0.4.6
contourpy==1.3.0
cycler==0.12.1
decorator==4.4.2
diffusers @ git+https://github.com/huggingface/diffusers.git@665c6b47a23bc841ad1440c4fe9cbb1782258656
distro==1.9.0
einops==0.8.0
exceptiongroup==1.2.2
fastapi==0.115.0
ffmpy==0.4.0
filelock==3.16.1
fonttools==4.54.1
fsspec==2024.9.0
gradio==4.44.0
gradio_client==1.3.0
h11==0.14.0
httpcore==1.0.5
httpx==0.27.2
huggingface-hub==0.25.1
idna==3.10
imageio==2.35.1
imageio-ffmpeg==0.5.1
importlib_metadata==8.5.0
importlib_resources==6.4.5
Jinja2==3.1.4
jiter==0.5.0
kiwisolver==1.4.7
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.2
mdurl==0.1.2
moviepy==1.0.3
mpmath==1.3.0
networkx==3.3
numpy==1.26.0
openai==1.48.0
opencv-python==4.10.0.84
orjson==3.10.7
packaging==24.1
pandas==2.2.3
Pillow==9.5.0
proglog==0.1.10
psutil==6.0.0
pydantic==2.9.2
pydantic_core==2.23.4
pydub==0.25.1
Pygments==2.18.0
pyparsing==3.1.4
python-dateutil==2.9.0.post0
python-multipart==0.0.10
pytz==2024.2
PyYAML==6.0.2
regex==2024.9.11
requests==2.32.3
rich==13.8.1
ruff==0.6.8
safetensors==0.4.5
scikit-video==1.1.11
scipy==1.14.1
semantic-version==2.10.0
sentencepiece==0.2.0
shellingham==1.5.4
six==1.16.0
sniffio==1.3.1
spandrel==0.4.0
starlette==0.38.6
sympy==1.13.3
tokenizers==0.20.0
tomlkit==0.12.0
torch==2.4.1+cu124
torchvision==0.19.1+cu124
tqdm==4.66.5
transformers==4.45.0
triton @ https://huggingface.co/MonsterMMORPG/SECourses/resolve/main/triton-3.0.0-cp310-cp310-win_amd64.whl
typer==0.12.5
typing_extensions==4.12.2
tzdata==2024.2
urllib3==2.2.3
uvicorn==0.30.6
websockets==12.0
xformers==0.0.28.post1
zipp==3.20.2
(venv) R:\CogVideoX_v1\CogVideoX_SECourses\venv\Scripts>diffusers-cli env
Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.
NVIDIA GeForce RTX 3060, 12288 MiB
(venv) R:\CogVideoX_v1\CogVideoX_SECourses\venv\Scripts>
Who can help?
Text-to-Video / Video-to-Video @DN6 @a-r-r-o-w