possibly to avoid from_single_file loading in fp32 to save RAM #10679
Comments
@yiyixuxu I switched to Windows and a mobile 4090, since the original issue was reported on Windows.

Without the PR:
- Initial memory usage: 548.29 MB
- Peak memory usage: 4667.13 MB
- Time taken: 1.49 seconds
- Final memory usage: 1387.07 MB

With the PR:
- Initial memory usage: 548.34 MB
- Peak memory usage: 4668.90 MB
- Time taken: 3.08 seconds
- Final memory usage: 1388.89 MB

So RAM usage is almost the same and not related to this issue, but that PR doubles the time for loading the model, which is not good. cc @SunMarc just in case.
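For reference, numbers like these can be collected with a simple RSS check around the load call. This is only a minimal sketch, assuming `psutil` is installed and using a placeholder checkpoint path and model class rather than the exact setup benchmarked above; peak usage would additionally require polling RSS from a background thread.

```python
# Hypothetical measurement sketch, not the benchmark script used above.
# Assumes `psutil` is installed and "flux1-dev.safetensors" is a placeholder
# path to a local single-file checkpoint.
import time

import psutil
import torch
from diffusers import FluxTransformer2DModel


def to_mb(nbytes: int) -> float:
    return nbytes / 1024**2


proc = psutil.Process()

initial = proc.memory_info().rss
start = time.perf_counter()

model = FluxTransformer2DModel.from_single_file(
    "flux1-dev.safetensors", torch_dtype=torch.bfloat16
)

elapsed = time.perf_counter() - start
final = proc.memory_info().rss

print(f"Initial memory usage: {to_mb(initial):.2f} MB")
print(f"Time taken: {elapsed:.2f} seconds")
print(f"Final memory usage: {to_mb(final):.2f} MB")
```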
Adding a bit more context from the original discussion: the conversion itself uses a lot of RAM. Combined with the fact that there is no way (that I know of) to load the model in its original format from the file, there will be an overhead in most situations. See this example (I'm using this file):

- loading in fp8 format from a bf16 file:
- loading in fp32 format from a bf16 file:
- loading in bf16 format from a bf16 file:
- loading in an unspecified format from a bf16 file:

**Feature request for a new parameter to remove any conversion overhead**

Additionally, it would be great to have an option to not load the weights at all. This can be done by removing any read access to the tensors. The safetensors library already supports lazy tensor loading out of the box: only tensors that are actually read are loaded from the file. At the moment this is triggered by the
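To illustrate the lazy loading that comment refers to, here is a rough sketch of the underlying safetensors primitive; the file name is a placeholder, and this is not the diffusers code path, just the library feature being requested as an option.

```python
# Sketch of safetensors' built-in lazy loading: opening the file only reads
# metadata, and a tensor is materialized in RAM only when it is accessed.
# "model.safetensors" is a placeholder file name.
from safetensors import safe_open

with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    keys = list(f.keys())            # header/metadata only, no weights read yet
    first = f.get_tensor(keys[0])    # only this tensor is loaded from disk
    print(first.dtype, first.shape)
```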
I've tested loading this flux_dev model on my machine with 46 GB of free memory. `from_single_file()` internally loads the state_dict in bfloat16 before converting, so my free memory dropped to 32 GB. Then, in the step where it converts the checkpoint to the diffusers format, my free memory dropped again to 20 GB. Finally, when it moves the model onto the meta device, only then does it convert it to FP8, so my free memory dropped to less than 4 GB before the memory was freed again.
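A hedged sketch of requesting the low-precision dtype up front, as in the scenario described above; the checkpoint path is a placeholder and a PyTorch build with float8 support is assumed. Per that comment, the intermediate bf16 copies may still be created during conversion, so peak RAM stays much higher than the final footprint.

```python
# Sketch: request FP8 storage at load time. As described in the comment above,
# intermediate bf16 copies of the state dict may still exist during loading.
# "flux1-dev.safetensors" is a placeholder path; float8 requires a recent PyTorch.
import torch
from diffusers import FluxTransformer2DModel

transformer = FluxTransformer2DModel.from_single_file(
    "flux1-dev.safetensors",
    torch_dtype=torch.float8_e4m3fn,  # final storage dtype; conversion happens late
)
print(next(transformer.parameters()).dtype)  # torch.float8_e4m3fn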
The RAM used depends on the size of the file being loaded, so you either have to convert the model to a smaller size before loading it, or free the RAM after loading.
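For the "free RAM after loading" part, a minimal sketch of what that usually amounts to in user code is below; note it only releases Python-side references and cached GPU memory and does not reduce the peak reached inside the loader itself.

```python
# Sketch: release leftover references and caches after the model has been loaded.
# This does not lower the peak memory reached inside from_single_file() itself.
import gc

import torch

gc.collect()                  # drop unreferenced Python objects (e.g. temporary state dicts)
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # return cached CUDA memory to the driver
```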
It was due to a small mistake on my part, sorry for that! I fixed it in the latest commit. Also, the PR should only speed up loading for diffusion models for now.
Describe the bug

When loading a model using `from_single_file()`, the RAM usage is really high, possibly because the weights are loaded in FP32 before conversion.

Reproduction
Logs
System Info
not relevant here
Who can help?
@DN6