Add `to` overload for GGMLTensor, so calling `to` on the model moves the quantized data. #7949
base: main
Conversation
Ah, now the test failure is going to be fun to resolve. There's also a question of whether we should remove
Force-pushed from a159fb3 to fcc452b
I rebased. The failing test is skipped on CI (unsure why; the comment says it was flaky, though on v5.9.1 it passes for me locally), so we need to run the test locally to make sure it passes. The error occurs on Linux and macOS for me. Could the problem causing the failure cause issues at runtime?
The failing test shouldn't have caused issues at runtime; it's one of those checks for the 'user' doing something they shouldn't, namely attempting to move the GGMLTensor to a new dtype. The test expects an exception to be raised if you try to do that, and there are no instances of doing that in the InvokeAI code base (as it would have caused the exception and a failure). As you can see from the re-run checks, I added a check for attempting to change the dtype to the new code, raising the same exception, so the test now passes.
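For reference, a minimal sketch of the kind of check that test exercises; the fixture name, test name, and broad exception type here are assumptions for illustration, not the actual InvokeAI test code:

```python
import pytest
import torch


def test_to_rejects_dtype_change(ggml_tensor):  # `ggml_tensor`: hypothetical fixture yielding a GGML-quantized tensor
    # Moving to another device is fine, but converting the quantised data to a
    # new dtype should raise instead of silently dequantizing.
    with pytest.raises(Exception):
        ggml_tensor.to(dtype=torch.float32)
```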
Deep breath... Yep, my change is responsible for the tests failing, so I went back to the drawing board. It seems PyTorch knows this behaviour is 'wrong' and wants to change it, but is concerned about backwards compatibility. Luckily they've started prepping the way and provide a call to swap behaviours, so I've tested this by reverting all my code and wrapping the call that moves the model with:

```python
old_value = torch.__future__.get_overwrite_module_params_on_conversion()
torch.__future__.set_overwrite_module_params_on_conversion(True)
self._model.to(self._compute_device)
torch.__future__.set_overwrite_module_params_on_conversion(old_value)
```

So I'm looking at a couple of ways of dealing with this. The easiest is the change above, which needs a lot of testing; the other would be to change the GGUF code to use the `data` field, which would be a lot of work. How do I get the tests to run locally? I was getting model errors when I tried earlier today.
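As an aside, if the futures-flag route is taken, the set/restore pattern could be wrapped so the old value is restored even if the model move raises. This is only a sketch of that idea (the helper name is made up, not code from the PR):

```python
from contextlib import contextmanager

import torch


@contextmanager
def overwrite_params_on_conversion(enabled: bool = True):
    """Temporarily toggle torch's overwrite-module-params-on-conversion behaviour."""
    old_value = torch.__future__.get_overwrite_module_params_on_conversion()
    torch.__future__.set_overwrite_module_params_on_conversion(enabled)
    try:
        yield
    finally:
        # Restore the previous setting even if the model move fails.
        torch.__future__.set_overwrite_module_params_on_conversion(old_value)
```

It would be used as `with overwrite_params_on_conversion(): self._model.to(self._compute_device)`.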
Bah, changing `quantized_data` to `data` causes what looks like an infinite recursion.
To test locally, install the test extras:

```
uv pip install -e ".[dev,test]"
```

Then, from the repo root:

```
pytest tests
```

Or, to run just this test:

```
pytest tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py
```

I run the tests in the VSCode debugger so I can step through. I use this launch config:

```json
{
    "name": "InvokeAI Single Test",
    "type": "debugpy",
    "request": "launch",
    "module": "pytest",
    "args": ["tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py"],
    "justMyCode": false
},
```

Then I can run the test with this button: Screen.Recording.2025-04-24.at.7.38.57.am.mov
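If you'd rather stay inside Python (for example, to attach a different debugger), the same test module can also be run programmatically; this is just an alternative invocation of the path given above, not something from the PR:

```python
import pytest

# Same test module as the CLI invocation above, run from inside Python.
pytest.main(
    [
        "tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py",
        "-v",
    ]
)
```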
Yeah, I got stuck on this and wasn't sure how to proceed when I was investigating. Realized I'm in over my head though.
Won't have time to look at this for a few days. I think PyTorch has some 'interesting' behaviour going on with Tensors if you try to access the data directly:

```python
>>> import torch
>>> x = torch.tensor([[1., -1.], [1., -1.]])
>>> print(x)
tensor([[ 1., -1.],
        [ 1., -1.]])
>>> print(x.data)
tensor([[ 1., -1.],
        [ 1., -1.]])
>>> type(x.data)
<class 'torch.Tensor'>
```

It's like it's passing the call back up to the class level. Oh, just thought of this test:

```python
>>> import torch
>>> x = torch.tensor([1])
>>> print(x)
tensor([1])
>>> print(x.data)
tensor([1])
>>> print(x.data.data.data)
tensor([1])
>>>
```

If GGMLTensors are doing the same thing, I get why it's infinitely recursing: it looks like there's a detach call in there which gets dispatched to
Okay, I've convinced myself that using the `data` field won't work; pretty much everything to do with the lower-level Tensor stuff is done in the C libraries, so I can't really see how it dodges the recursion, and I don't think there'll be a way to apply that to GGMLTensors. So that leaves three choices. We go with the in-place move that I originally started with. The last looks something like this:

```python
if self._cpu_state_dict is not None:
    new_state_dict: dict[str, torch.Tensor] = {}
    for k, v in self._cpu_state_dict.items():
        new_state_dict[k] = v.to(self._compute_device, copy=True)
    self._model.load_state_dict(new_state_dict, assign=True)

# new code
check_for_gguf = self._model.state_dict().get("img_in.weight")
if isinstance(check_for_gguf, GGMLTensor):
    new_state_dict: dict[str, torch.Tensor] = {}
    for k, v in self._model.state_dict().items():
        new_state_dict[k] = v.to(self._compute_device, copy=True)
    self._model.load_state_dict(new_state_dict, assign=True)
# end of new code

self._model.to(self._compute_device)
```

Using the futures with the check would end up like this:

```python
if self._cpu_state_dict is not None:
    new_state_dict: dict[str, torch.Tensor] = {}
    for k, v in self._cpu_state_dict.items():
        new_state_dict[k] = v.to(self._compute_device, copy=True)
    self._model.load_state_dict(new_state_dict, assign=True)

# new code
check_for_gguf = self._model.state_dict().get("img_in.weight")
if isinstance(check_for_gguf, GGMLTensor):
    old_value = torch.__future__.get_overwrite_module_params_on_conversion()
    torch.__future__.set_overwrite_module_params_on_conversion(True)
    self._model.to(self._compute_device)
    torch.__future__.set_overwrite_module_params_on_conversion(old_value)
else:
    self._model.to(self._compute_device)
```

I personally prefer the futures code, as it is ready if PyTorch ever switches behaviours, it probably avoids an extra run-through and copy of the weights compared to the manual move, and it shouldn't break the tests. Forgot to mention: the GGMLTensor check requires the appropriate import:

```python
from invokeai.backend.quantization.gguf.ggml_tensor import GGMLTensor
```
Summary
If a user has not enabled partial loading and has disabled keeping copies of the model's state dict in CPU RAM (both likely for MPS users), then GGUF models' state dicts are not moved to the compute device from the CPU device that models are loaded onto by default. This is because the quantised data is stored in its own field instead of the inherited Tensor's `data` field.
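As a loose analogy (not the actual GGMLTensor code), data kept in a field that PyTorch's conversion machinery does not traverse simply stays behind when the rest of the model is converted; the module and attribute names below are made up for illustration:

```python
import torch


class TinyLayer(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(4))
        # Analogous to the separate quantised payload: a plain tensor attribute
        # that Module.to() knows nothing about.
        self.side_table = torch.ones(4)


layer = TinyLayer()
layer.to(torch.float16)        # dtype conversion used here only so this runs on any machine
print(layer.weight.dtype)      # torch.float16 - registered parameter was converted
print(layer.side_table.dtype)  # torch.float32 - the extra attribute was left untouched
```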
This PR uses an overload of the Tensor `to` method to load the quantised data to the compute device when the model is moved by calling its `to` method.
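In rough terms, the overload has to move the quantised payload along with the tensor and refuse dtype changes. The class and field names below are assumptions for illustration, not the PR's actual implementation:

```python
import torch


class QuantizedTensorSketch:
    """Hypothetical stand-in for a GGMLTensor-like object, for illustration only."""

    def __init__(self, quantized_data: torch.Tensor, dtype: torch.dtype) -> None:
        self.quantized_data = quantized_data  # raw quantised payload
        self._dtype = dtype                   # logical dtype of the dequantised weights

    def to(self, device=None, dtype=None, non_blocking: bool = False) -> "QuantizedTensorSketch":
        # Converting quantised data to a new dtype makes no sense, so reject it
        # (this mirrors the check the existing test expects).
        if dtype is not None and dtype != self._dtype:
            raise ValueError("Cannot convert a quantised tensor to a new dtype.")
        # A device move must also move the quantised payload, otherwise the
        # model ends up with weights split across devices.
        return QuantizedTensorSketch(
            self.quantized_data.to(device=device, non_blocking=non_blocking),
            self._dtype,
        )
```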
Related Issues / Discussions
Closes #7939
QA Instructions
Duplicate the failure by removing `enable_partial_loading: true` and adding `keep_ram_copy_of_weights: false` to invokeai.yaml as necessary, then attempt to run an image generation using a GGUF quantised Flux model. Note: `enable_partial_loading: true` needs to be removed; it seems commenting things out in the yaml file doesn't work (sometimes?).
You should get an error related to mixed compute devices; on MPS the error is as follows:
Apply the PR, restart InvokeAI, and attempt the same render; it should now work.
Merge Plan
This is a small standalone PR; it should merge without issue.
Checklist
What's New copy (if doing a release after this PR)