
Add to overload for GGMLTensor, so calling to on the model moves the quantized data. #7949


Open · wants to merge 4 commits into base: main

Conversation

Vargol commented Apr 22, 2025

Summary

If a user has not enabled partial loading and has disabled keeping copies of the model's state dict in CPU RAM (both likely for MPS users), then a GGUF model's state dict is not moved to the compute device from the CPU device that models are loaded onto by default. This is because the quantised data is stored in its own field rather than in the inherited Tensor's data field.
This PR overloads the Tensor to method so that the quantised data is moved to the compute device when the model is moved via its to method.
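
As a rough sketch of the idea (illustrative only, not the literal diff: GGMLTensor and its quantized_data field are real names from this discussion, the rest is simplified, and torch._C._nn._parse_to is the same private helper nn.Module.to uses to normalise its arguments):

import torch

class GGMLTensor(torch.Tensor):
    # The packed GGUF payload lives in its own field, outside the inherited .data.
    quantized_data: torch.Tensor

    def to(self, *args, **kwargs):
        device, dtype, non_blocking, _ = torch._C._nn._parse_to(*args, **kwargs)
        # (dtype handling comes up later in this thread)
        if device is not None:
            # Move the quantized payload in place, so that Module.to's propagation,
            # which otherwise only touches .data, still lands the packed weights
            # on the compute device.
            self.quantized_data = self.quantized_data.to(device, non_blocking=non_blocking)
        return self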

Related Issues / Discussions

Closes #7939

QA Instructions

Reproduce the failure by removing enable_partial_loading: true and adding keep_ram_copy_of_weights: false to invokeai.yaml as necessary, then attempting to run an image generation using a GGUF-quantised Flux model.
Note: enable_partial_loading: true needs to be removed entirely; commenting it out in the yaml file doesn't always seem to work.

You should get an error related to mixed compute devices; on MPS the error is as follows:

RuntimeError: Tensor for argument weight is on cpu but expected on mps

Apply the PR, restart InvokeAI, and attempt the same render; it should now work.

Merge Plan

This is a small, stand-alone PR; it should merge without issue.

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

github-actions bot added the python and backend labels Apr 22, 2025

Vargol commented Apr 22, 2025

Ah, now that test failure is going to be fun to resolve. to is in the GGML operations table, but apply_to_quantized_tensor returns a new GGMLTensor, which, as I mentioned in the issue, doesn't seem to work in combination with Module.to to move a GGMLTensor in the module's state dict. I can think of a few fixes:
  • Call super(), which seems to go through the OPS table but creates a tensor we then throw away.
  • Check for an attempted dtype move in the new to overload.
  • Get rid of the test.

There's also a question of whether we should remove to from the OPS table.

@psychedelicious

I rebased on main which has triggered CI to run.

The failing test is skipped on CI (unsure why; the comment says it was flaky - on v5.9.1 it does pass for me locally). So we need to run the test locally to ensure it passes. The error occurs on Linux and macOS for me.

Could the problem causing the failure cause issues at runtime?


Vargol commented Apr 23, 2025

The failing test shouldn't have caused issues at run time; it's one of those checks for the 'user' doing something they shouldn't, i.e. attempting to move the GGMLTensor to a new dtype. The test expects an exception to be raised if you try to do that, and there are no instances of doing that in the InvokeAI code base (as it would have caused the exception and a failure).

As you can see from the re-run checks, I added a check for an attempted dtype change to the new code, using the same exception, so the test now passes.
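
Roughly, the added guard amounts to something like this (illustrative sketch, not the literal code; the TypeError here is a stand-in for the exception the original dtype-cast path raises):

import torch

def reject_dtype_cast(tensor: torch.Tensor, dtype: torch.dtype | None) -> None:
    # Called from the to overload before any device move: GGML-quantized tensors
    # must not be cast to a new dtype, which is exactly what the test checks.
    if dtype is not None and dtype != tensor.dtype:
        raise TypeError("Cannot cast a GGML-quantized tensor to a new dtype.")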


psychedelicious commented Apr 23, 2025

The checks are skipped in CI:

[screenshot: CI run showing the check skipped]

It's failing for me locally on macOS (MPS) and Linux (CUDA).

test on macOS
=============================================================================== test session starts ===============================================================================
platform darwin -- Python 3.12.9, pytest-8.3.5, pluggy-1.5.0
rootdir: /Users/spencer/Documents/Code/InvokeAI
configfile: pyproject.toml
plugins: datadir-1.6.1, anyio-4.9.0, Faker-37.1.0, timeout-2.3.1, cov-6.1.1
collected 5 items                                                                                                                                                                 

tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py s.sFs                                                                      [100%]

==================================================================================== FAILURES =====================================================================================
______________________________________________________________ test_torch_module_autocast_linear_layer[gguf-device1] ______________________________________________________________

device = device(type='mps'), model = ModelWithLinearLayer(
  (linear): Linear(in_features=32, out_features=64, bias=True)
)

    @cuda_and_mps
    @torch.no_grad()
    def test_torch_module_autocast_linear_layer(device: torch.device, model: torch.nn.Module):
        # Skip this test with MPS on GitHub Actions. It fails but I haven't taken the tie to figure out why. It passes
        # locally on MacOS.
        if os.environ.get("GITHUB_ACTIONS") == "true" and device.type == "mps":
            pytest.skip("This test is flaky on GitHub Actions")
    
        # Model parameters should start off on the CPU.
        assert all(p.device.type == "cpu" for p in model.parameters())
    
        torch.manual_seed(0)
    
        # Run inference on the CPU.
        x = torch.randn(1, 32, device="cpu")
        expected = model(x)
        assert expected.device.type == "cpu"
    
        # Apply the custom layers to the model.
        apply_custom_layers_to_model(model, device_autocasting_enabled=True)
    
        # Run the model on the device.
        autocast_result = model(x.to(device))
    
        # The model output should be on the device.
        assert autocast_result.device.type == device.type
        # The model parameters should still be on the CPU.
        assert all(p.device.type == "cpu" for p in model.parameters())
    
        # Remove the custom layers from the model.
        remove_custom_layers_from_model(model)
    
        # After removing the custom layers, the model should no longer be able to run inference on the device.
        with pytest.raises(RuntimeError):
            _ = model(x.to(device))
    
        # Run inference again on the CPU.
>       after_result = model(x)

tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py:93: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1739: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1750: in _call_impl
    return forward_call(*args, **kwargs)
tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py:39: in forward
    return self.linear(x)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1739: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1750: in _call_impl
    return forward_call(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = Linear(in_features=32, out_features=64, bias=True)
input = tensor([[-1.1258, -1.1524, -0.2506, -0.4339,  0.8487,  0.6920, -0.3160, -2.1152,
          0.3223, -1.2633,  0.3500,  ...  0.5988, -1.5551, -0.3414,  1.8530,
          0.7502, -0.5855, -0.1734,  0.1835,  1.3894,  1.5863,  0.9463, -0.8437]])

    def forward(self, input: Tensor) -> Tensor:
>       return F.linear(input, self.weight, self.bias)
E       RuntimeError: Tensor for argument #1 'self' is on CPU, but expected it to be on GPU (while checking arguments for addmm_out_mps_impl)

.venv/lib/python3.12/site-packages/torch/nn/modules/linear.py:125: RuntimeError
================================================================================ warnings summary =================================================================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: Type google._upb._message.MessageMapContainer uses PyType_Spec with a metaclass that has custom tp_new. This is deprecated and will no longer be allowed in Python 3.14.

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: Type google._upb._message.ScalarMapContainer uses PyType_Spec with a metaclass that has custom tp_new. This is deprecated and will no longer be allowed in Python 3.14.

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================================= short test summary info =============================================================================
FAILED tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py::test_torch_module_autocast_linear_layer[gguf-device1] - RuntimeError: Tensor for argument #1 'self' is on CPU, but expected it to be on GPU (while checking arguments for addmm_out_mps_impl)
=============================================================== 1 failed, 1 passed, 3 skipped, 2 warnings in 0.28s ================================================================
test on Linux
================================================================================ test session starts =================================================================================
platform linux -- Python 3.12.7, pytest-8.3.5, pluggy-1.5.0
rootdir: /home/bat/Documents/Code/InvokeAI
configfile: pyproject.toml
plugins: timeout-2.3.1, Faker-37.1.0, anyio-4.9.0, cov-6.1.1, datadir-1.6.1
collected 5 items                                                                                                                                                                    

tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py .sFs.                                                                         [100%]

====================================================================================== FAILURES ======================================================================================
_______________________________________________________________ test_torch_module_autocast_linear_layer[gguf-device0] ________________________________________________________________

device = device(type='cuda'), model = ModelWithLinearLayer(
  (linear): Linear(in_features=32, out_features=64, bias=True)
)

    @cuda_and_mps
    @torch.no_grad()
    def test_torch_module_autocast_linear_layer(device: torch.device, model: torch.nn.Module):
        # Skip this test with MPS on GitHub Actions. It fails but I haven't taken the tie to figure out why. It passes
        # locally on MacOS.
        if os.environ.get("GITHUB_ACTIONS") == "true" and device.type == "mps":
            pytest.skip("This test is flaky on GitHub Actions")
    
        # Model parameters should start off on the CPU.
        assert all(p.device.type == "cpu" for p in model.parameters())
    
        torch.manual_seed(0)
    
        # Run inference on the CPU.
        x = torch.randn(1, 32, device="cpu")
        expected = model(x)
        assert expected.device.type == "cpu"
    
        # Apply the custom layers to the model.
        apply_custom_layers_to_model(model, device_autocasting_enabled=True)
    
        # Run the model on the device.
        autocast_result = model(x.to(device))
    
        # The model output should be on the device.
        assert autocast_result.device.type == device.type
        # The model parameters should still be on the CPU.
        assert all(p.device.type == "cpu" for p in model.parameters())
    
        # Remove the custom layers from the model.
        remove_custom_layers_from_model(model)
    
        # After removing the custom layers, the model should no longer be able to run inference on the device.
        with pytest.raises(RuntimeError):
            _ = model(x.to(device))
    
        # Run inference again on the CPU.
>       after_result = model(x)

tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py:93: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1739: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1750: in _call_impl
    return forward_call(*args, **kwargs)
tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py:39: in forward
    return self.linear(x)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1739: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1750: in _call_impl
    return forward_call(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = Linear(in_features=32, out_features=64, bias=True)
input = tensor([[-1.1258, -1.1524, -0.2506, -0.4339,  0.8487,  0.6920, -0.3160, -2.1152,
          0.3223, -1.2633,  0.3500,  ...  0.5988, -1.5551, -0.3414,  1.8530,
          0.7502, -0.5855, -0.1734,  0.1835,  1.3894,  1.5863,  0.9463, -0.8437]])

    def forward(self, input: Tensor) -> Tensor:
>       return F.linear(input, self.weight, self.bias)
E       RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_addmm)

.venv/lib/python3.12/site-packages/torch/nn/modules/linear.py:125: RuntimeError
================================================================================== warnings summary ==================================================================================
tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py::test_torch_module_autocast_bnb_llm_int8_linear_layer
  /home/bat/Documents/Code/InvokeAI/.venv/lib/python3.12/site-packages/bitsandbytes/autograd/_functions.py:315: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
    warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================================== short test summary info ===============================================================================
FAILED tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py::test_torch_module_autocast_linear_layer[gguf-device0] - RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_addmm)
================================================================= 1 failed, 2 passed, 2 skipped, 1 warning in 0.27s ==================================================================

The error messages are slightly different, but I think that's just an inconsistency between PyTorch's MPS and CUDA implementations.


Vargol commented Apr 23, 2025

Deep breath... Yep, my change is responsible for the tests failing, so I went back to the drawing board.
I spent the last few hours knee-deep in the PyTorch code and eventually confirmed what I expected: moving a model, or technically moving an nn.Module, 'breaks' things because it doesn't use the new Tensor created by the calls to to when it propagates the call to the nn.Parameter values, e.g. weight and bias. It creates the new Tensor and then copies its .data field into the existing tensor. See:

https://github.com/pytorch/pytorch/blob/1eba9b3aa3c43f86f4a2c807ac8e12c4a7767340/torch/nn/modules/module.py#L907

It seems PyTorch knows this is 'wrong' and wants to change it, but is concerned about backwards compatibility. Luckily they've started prepping the way and provide a call to swap behaviours, so I've tested this by reverting all my code and wrapping the call that moves the model with:

        old_value = torch.__future__.get_overwrite_module_params_on_conversion()
        torch.__future__.set_overwrite_module_params_on_conversion(True)    
        self._model.to(self._compute_device)
        torch.__future__.set_overwrite_module_params_on_conversion(old_value)
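
The difference is easy to see in a quick standalone check (illustrative only, not code from this PR):

import torch

def param_identity_preserved() -> bool:
    m = torch.nn.Linear(4, 4)
    w_before = m.weight        # keep a reference to the original Parameter object
    m.to(torch.float64)        # a dtype conversion, which goes through Module._apply()
    return m.weight is w_before

# Default behaviour: the existing Parameter survives and only its .data is replaced.
print(param_identity_preserved())   # True

# With the future flag set, the tensor returned by Tensor.to() replaces the Parameter.
torch.__future__.set_overwrite_module_params_on_conversion(True)
print(param_identity_preserved())   # False
torch.__future__.set_overwrite_module_params_on_conversion(False)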

So I'm looking at a couple of ways of dealing with this. The easiest is to make the change above, which needs a lot of testing; the other would be to change the GGUF code to use the data field, which would be a lot of work.

How do I get the tests to run locally? I was getting model errors when I tried earlier today.


Vargol commented Apr 23, 2025

Bah, changing quantized_data to data causes what looks like an infinite recursion

@psychedelicious

To test locally, install the test extras:

uv pip install -e ".[dev,test]"

Then, from repo root:

pytest tests

Or, to run just this test:

pytest tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py

I run the tests in the VSCode debugger so I can step through. I use this launch config:

      {
        "name": "InvokeAI Single Test",
        "type": "debugpy",
        "request": "launch",
        "module": "pytest",
        "args": ["tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py"],
        "justMyCode": false
      },

Then I can run the test w/ this button:

[screen recording: running the test from the VSCode debugger]

> Bah, changing quantized_data to data causes what looks like an infinite recursion

Yeah, I got stuck on this and wasn't sure how to proceed when I was investigating. Realized I'm in over my head though.


Vargol commented Apr 24, 2025

I won't be able to look at this for a few days. I think PyTorch has some 'interesting' behaviour going on with Tensors if you try to access the data directly.

>>> import torch
>>> x = torch.tensor([[1., -1.], [1., -1.]])
>>> print(x)
tensor([[ 1., -1.],
        [ 1., -1.]])
>>> print(x.data)
tensor([[ 1., -1.],
        [ 1., -1.]])
>>> type(x.data)
<class 'torch.Tensor'>

It's like it's passing the call back up to the class level. Oh, I just thought of this test:

>>> import torch
>>> x = torch.tensor([1])
>>> print (x)
tensor([1])
>>> print (x.data)
tensor([1])
>>> print (x.data.data.data)
tensor([1])
>>> 

If GGMLTensors are doing the same thing, I get why it's infinitely recursing: it looks like there's a detach call in there which gets dispatched to apply_to_quantized_data, but as that now calls GGMLTensor.data.detach(), it punts back up to GGMLTensor.detach, hits the dispatch code again, gets redirected back to apply_to_quantized_data, and so on.
I assume the PyTorch calls for Tensor.detach are doing something at a lower level to avoid the recursion.


Vargol commented Apr 26, 2025

Okay, I've convinced myself that using the data field won't work; pretty much anything to do with the lower-level Tensor stuff is done in the C libraries, so I can't really see how it dodges the recursion, and I don't think there'd be a way to apply that to GGMLTensors.

So that leaves three choices:
  • Go with the in-place move that I originally started with.
  • Use the futures behaviour, perhaps with an "is it a GGUF model" check.
  • Copy the state dictionary manually if there's a GGUF model.

The last option looks something like this:

if self._cpu_state_dict is not None:
    new_state_dict: dict[str, torch.Tensor] = {}
    for k, v in self._cpu_state_dict.items():
        new_state_dict[k] = v.to(self._compute_device, copy=True)
    self._model.load_state_dict(new_state_dict, assign=True)

# new code
check_for_gguf = self._model.state_dict().get("img_in.weight")
if isinstance(check_for_gguf, GGMLTensor):
    new_state_dict: dict[str, torch.Tensor] = {}
    for k, v in self._model.state_dict().items():
        new_state_dict[k] = v.to(self._compute_device, copy=True)
    self._model.load_state_dict(new_state_dict, assign=True)
# end of new code

self._model.to(self._compute_device)

Using the futures call with a check would end up like this:

if self._cpu_state_dict is not None:
    new_state_dict: dict[str, torch.Tensor] = {}
    for k, v in self._cpu_state_dict.items():
        new_state_dict[k] = v.to(self._compute_device, copy=True)
    self._model.load_state_dict(new_state_dict, assign=True)

# new code
check_for_gguf = self._model.state_dict().get("img_in.weight")
if isinstance(check_for_gguf, GGMLTensor):
    old_value = torch.__future__.get_overwrite_module_params_on_conversion()
    torch.__future__.set_overwrite_module_params_on_conversion(True)
    self._model.to(self._compute_device)
    torch.__future__.set_overwrite_module_params_on_conversion(old_value)
else:
    self._model.to(self._compute_device)

I personally prefer the futures code: it is ready if PyTorch ever switches behaviours, it probably avoids an extra run through and copy of the weights compared to the manual move, and it shouldn't break the tests.

Forgot to mention: the GGMLTensor check requires the appropriate import:

from invokeai.backend.quantization.gguf.ggml_tensor import GGMLTensor


Successfully merging this pull request may close these issues.

[bug]: GGUF models no longer work on MacOS, tensors on cpu not on mps