
Add to overload for GGMLTensor, so calling to on the model moves the quantized data. #7949


Open · wants to merge 4 commits into base: main

Conversation

Vargol commented Apr 22, 2025

Summary

If a user has not enabled partial loading and has disabled keeping copies of the model's state dict in CPU RAM (both likely for MPS users), then a GGUF model's state dict is not moved to the compute device from the CPU device that models are loaded onto by default. This is because the quantised data is stored in its own field rather than in the inherited Tensor's data field.
This PR overloads the Tensor to method so that the quantised data is moved to the compute device when the model is moved via its to method.
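
As a rough sketch of the idea (illustrative only, not the literal diff: GGMLTensor and its quantized_data field are real names from this discussion, the rest is simplified, and torch._C._nn._parse_to is the same private helper nn.Module.to uses to normalise its arguments):

import torch

class GGMLTensor(torch.Tensor):
    # The packed GGUF payload lives in its own field, outside the inherited .data.
    quantized_data: torch.Tensor

    def to(self, *args, **kwargs):
        device, dtype, non_blocking, _ = torch._C._nn._parse_to(*args, **kwargs)
        # (dtype handling comes up later in this thread)
        if device is not None:
            # Move the quantized payload in place, so that Module.to's propagation,
            # which otherwise only touches .data, still lands the packed weights
            # on the compute device.
            self.quantized_data = self.quantized_data.to(device, non_blocking=non_blocking)
        return self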

Related Issues / Discussions

Closes #7939

QA Instructions

Reproduce the failure by removing enable_partial_loading: true and adding keep_ram_copy_of_weights: false to invokeai.yaml as necessary, then attempting to run an image generation using a GGUF-quantised Flux model.
Note: enable_partial_loading: true needs to be removed entirely; commenting it out in the yaml file doesn't always seem to work.

You should get an error related to mixed compute devices; on MPS the error is as follows:

RuntimeError: Tensor for argument weight is on cpu but expected on mps

Apply the PR, restart InvokeAI, and attempt the same render; it should now work.

Merge Plan

This is a small, stand-alone PR; it should merge without issue.

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

github-actions bot added the python and backend labels Apr 22, 2025

Vargol commented Apr 22, 2025

Ah, now that test failure is going to be fun to resolve. to is in the GGML operations table, but apply_to_quantized_tensor returns a new GGMLTensor, which, as I mentioned in the issue, doesn't seem to work in combination with Module.to to move a GGMLTensor in the module's state dict. I can think of a few fixes:
  • Call super(), which seems to go through the OPS table but creates a tensor we then throw away.
  • Check for an attempted dtype move in the new to overload.
  • Get rid of the test.

There's also a question of whether we should remove to from the OPS table.

@psychedelicious

I rebased on main which has triggered CI to run.

The failing test is skipped on CI (unsure why; the comment says it was flaky - on v5.9.1 it does pass for me locally). So we need to run the test locally to ensure it passes. The error occurs on Linux and macOS for me.

Could the problem causing the failure cause issues at runtime?


Vargol commented Apr 23, 2025

The failing test shouldn't have caused issues at run time; it's one of those checks for the 'user' doing something they shouldn't, i.e. attempting to move the GGMLTensor to a new dtype. The test expects an exception to be raised if you try to do that, and there are no instances of doing that in the InvokeAI code base (as it would have caused the exception and a failure).

As you can see from the re-run checks, I added a check for an attempted dtype change to the new code, using the same exception, so the test now passes.
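
Roughly, the added guard amounts to something like this (illustrative sketch, not the literal code; the TypeError here is a stand-in for the exception the original dtype-cast path raises):

import torch

def reject_dtype_cast(tensor: torch.Tensor, dtype: torch.dtype | None) -> None:
    # Called from the to overload before any device move: GGML-quantized tensors
    # must not be cast to a new dtype, which is exactly what the test checks.
    if dtype is not None and dtype != tensor.dtype:
        raise TypeError("Cannot cast a GGML-quantized tensor to a new dtype.")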


psychedelicious commented Apr 23, 2025

The checks are skipped in CI:

[screenshot: CI run showing the check skipped]

It's failing for me locally on macOS (MPS) and Linux (CUDA).

test on macOS
=============================================================================== test session starts ===============================================================================
platform darwin -- Python 3.12.9, pytest-8.3.5, pluggy-1.5.0
rootdir: /Users/spencer/Documents/Code/InvokeAI
configfile: pyproject.toml
plugins: datadir-1.6.1, anyio-4.9.0, Faker-37.1.0, timeout-2.3.1, cov-6.1.1
collected 5 items                                                                                                                                                                 

tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py s.sFs                                                                      [100%]

==================================================================================== FAILURES =====================================================================================
______________________________________________________________ test_torch_module_autocast_linear_layer[gguf-device1] ______________________________________________________________

device = device(type='mps'), model = ModelWithLinearLayer(
  (linear): Linear(in_features=32, out_features=64, bias=True)
)

    @cuda_and_mps
    @torch.no_grad()
    def test_torch_module_autocast_linear_layer(device: torch.device, model: torch.nn.Module):
        # Skip this test with MPS on GitHub Actions. It fails but I haven't taken the tie to figure out why. It passes
        # locally on MacOS.
        if os.environ.get("GITHUB_ACTIONS") == "true" and device.type == "mps":
            pytest.skip("This test is flaky on GitHub Actions")
    
        # Model parameters should start off on the CPU.
        assert all(p.device.type == "cpu" for p in model.parameters())
    
        torch.manual_seed(0)
    
        # Run inference on the CPU.
        x = torch.randn(1, 32, device="cpu")
        expected = model(x)
        assert expected.device.type == "cpu"
    
        # Apply the custom layers to the model.
        apply_custom_layers_to_model(model, device_autocasting_enabled=True)
    
        # Run the model on the device.
        autocast_result = model(x.to(device))
    
        # The model output should be on the device.
        assert autocast_result.device.type == device.type
        # The model parameters should still be on the CPU.
        assert all(p.device.type == "cpu" for p in model.parameters())
    
        # Remove the custom layers from the model.
        remove_custom_layers_from_model(model)
    
        # After removing the custom layers, the model should no longer be able to run inference on the device.
        with pytest.raises(RuntimeError):
            _ = model(x.to(device))
    
        # Run inference again on the CPU.
>       after_result = model(x)

tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py:93: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1739: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1750: in _call_impl
    return forward_call(*args, **kwargs)
tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py:39: in forward
    return self.linear(x)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1739: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1750: in _call_impl
    return forward_call(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = Linear(in_features=32, out_features=64, bias=True)
input = tensor([[-1.1258, -1.1524, -0.2506, -0.4339,  0.8487,  0.6920, -0.3160, -2.1152,
          0.3223, -1.2633,  0.3500,  ...  0.5988, -1.5551, -0.3414,  1.8530,
          0.7502, -0.5855, -0.1734,  0.1835,  1.3894,  1.5863,  0.9463, -0.8437]])

    def forward(self, input: Tensor) -> Tensor:
>       return F.linear(input, self.weight, self.bias)
E       RuntimeError: Tensor for argument #1 'self' is on CPU, but expected it to be on GPU (while checking arguments for addmm_out_mps_impl)

.venv/lib/python3.12/site-packages/torch/nn/modules/linear.py:125: RuntimeError
================================================================================ warnings summary =================================================================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: Type google._upb._message.MessageMapContainer uses PyType_Spec with a metaclass that has custom tp_new. This is deprecated and will no longer be allowed in Python 3.14.

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: Type google._upb._message.ScalarMapContainer uses PyType_Spec with a metaclass that has custom tp_new. This is deprecated and will no longer be allowed in Python 3.14.

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================================= short test summary info =============================================================================
FAILED tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py::test_torch_module_autocast_linear_layer[gguf-device1] - RuntimeError: Tensor for argument #1 'self' is on CPU, but expected it to be on GPU (while checking arguments for addmm_out_mps_impl)
=============================================================== 1 failed, 1 passed, 3 skipped, 2 warnings in 0.28s ================================================================
test on Linux
================================================================================ test session starts =================================================================================
platform linux -- Python 3.12.7, pytest-8.3.5, pluggy-1.5.0
rootdir: /home/bat/Documents/Code/InvokeAI
configfile: pyproject.toml
plugins: timeout-2.3.1, Faker-37.1.0, anyio-4.9.0, cov-6.1.1, datadir-1.6.1
collected 5 items                                                                                                                                                                    

tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py .sFs.                                                                         [100%]

====================================================================================== FAILURES ======================================================================================
_______________________________________________________________ test_torch_module_autocast_linear_layer[gguf-device0] ________________________________________________________________

device = device(type='cuda'), model = ModelWithLinearLayer(
  (linear): Linear(in_features=32, out_features=64, bias=True)
)

    @cuda_and_mps
    @torch.no_grad()
    def test_torch_module_autocast_linear_layer(device: torch.device, model: torch.nn.Module):
        # Skip this test with MPS on GitHub Actions. It fails but I haven't taken the tie to figure out why. It passes
        # locally on MacOS.
        if os.environ.get("GITHUB_ACTIONS") == "true" and device.type == "mps":
            pytest.skip("This test is flaky on GitHub Actions")
    
        # Model parameters should start off on the CPU.
        assert all(p.device.type == "cpu" for p in model.parameters())
    
        torch.manual_seed(0)
    
        # Run inference on the CPU.
        x = torch.randn(1, 32, device="cpu")
        expected = model(x)
        assert expected.device.type == "cpu"
    
        # Apply the custom layers to the model.
        apply_custom_layers_to_model(model, device_autocasting_enabled=True)
    
        # Run the model on the device.
        autocast_result = model(x.to(device))
    
        # The model output should be on the device.
        assert autocast_result.device.type == device.type
        # The model parameters should still be on the CPU.
        assert all(p.device.type == "cpu" for p in model.parameters())
    
        # Remove the custom layers from the model.
        remove_custom_layers_from_model(model)
    
        # After removing the custom layers, the model should no longer be able to run inference on the device.
        with pytest.raises(RuntimeError):
            _ = model(x.to(device))
    
        # Run inference again on the CPU.
>       after_result = model(x)

tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py:93: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1739: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1750: in _call_impl
    return forward_call(*args, **kwargs)
tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py:39: in forward
    return self.linear(x)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1739: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1750: in _call_impl
    return forward_call(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = Linear(in_features=32, out_features=64, bias=True)
input = tensor([[-1.1258, -1.1524, -0.2506, -0.4339,  0.8487,  0.6920, -0.3160, -2.1152,
          0.3223, -1.2633,  0.3500,  ...  0.5988, -1.5551, -0.3414,  1.8530,
          0.7502, -0.5855, -0.1734,  0.1835,  1.3894,  1.5863,  0.9463, -0.8437]])

    def forward(self, input: Tensor) -> Tensor:
>       return F.linear(input, self.weight, self.bias)
E       RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_addmm)

.venv/lib/python3.12/site-packages/torch/nn/modules/linear.py:125: RuntimeError
================================================================================== warnings summary ==================================================================================
tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py::test_torch_module_autocast_bnb_llm_int8_linear_layer
  /home/bat/Documents/Code/InvokeAI/.venv/lib/python3.12/site-packages/bitsandbytes/autograd/_functions.py:315: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
    warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================================== short test summary info ===============================================================================
FAILED tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py::test_torch_module_autocast_linear_layer[gguf-device0] - RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_addmm)
================================================================= 1 failed, 2 passed, 2 skipped, 1 warning in 0.27s ==================================================================

The error messages are slightly different, but I think that's just an inconsistency between PyTorch's MPS and CUDA implementations.


Vargol commented Apr 23, 2025

Deep breath... Yep, my change is responsible for the tests failing, so I went back to the drawing board.
I spent the last few hours knee-deep in the PyTorch code and eventually confirmed what I expected: moving a model, or technically moving an nn.Module, 'breaks' things because it doesn't use the new Tensor created by the calls to to when it propagates the call to the nn.Parameter values, e.g. weight and bias. It creates the new Tensor and then copies its .data field into the existing tensor. See:

https://github.com/pytorch/pytorch/blob/1eba9b3aa3c43f86f4a2c807ac8e12c4a7767340/torch/nn/modules/module.py#L907

It seems PyTorch knows this is 'wrong' and wants to change it, but is concerned about backwards compatibility. Luckily they've started prepping the way and provide a call to swap behaviours, so I've tested this by reverting all my code and wrapping the call that moves the model with:

        old_value = torch.__future__.get_overwrite_module_params_on_conversion()
        torch.__future__.set_overwrite_module_params_on_conversion(True)    
        self._model.to(self._compute_device)
        torch.__future__.set_overwrite_module_params_on_conversion(old_value)
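
The difference is easy to see in a quick standalone check (illustrative only, not code from this PR):

import torch

def param_identity_preserved() -> bool:
    m = torch.nn.Linear(4, 4)
    w_before = m.weight        # keep a reference to the original Parameter object
    m.to(torch.float64)        # a dtype conversion, which goes through Module._apply()
    return m.weight is w_before

# Default behaviour: the existing Parameter survives and only its .data is replaced.
print(param_identity_preserved())   # True

# With the future flag set, the tensor returned by Tensor.to() replaces the Parameter.
torch.__future__.set_overwrite_module_params_on_conversion(True)
print(param_identity_preserved())   # False
torch.__future__.set_overwrite_module_params_on_conversion(False)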

So I'm looking at a couple of ways of dealing with this. The easiest is to make the change above, which needs a lot of testing; the other would be to change the GGUF code to use the data field, which would be a lot of work.

How do I get the tests to run locally? I was getting model errors when I tried earlier today.


Vargol commented Apr 23, 2025

Bah, changing quantized_data to data causes what looks like an infinite recursion

@psychedelicious

To test locally, install the test extras:

uv pip install -e ".[dev,test]"

Then, from repo root:

pytest tests

Or, to run just this test:

pytest tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py

I run the tests in the VSCode debugger so I can step through. I use this launch config:

      {
        "name": "InvokeAI Single Test",
        "type": "debugpy",
        "request": "launch",
        "module": "pytest",
        "args": ["tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py"],
        "justMyCode": false
      },

Then I can run the test w/ this button:

[screen recording: running the test from the VSCode debugger]

> Bah, changing quantized_data to data causes what looks like an infinite recursion

Yeah, I got stuck on this and wasn't sure how to proceed when I was investigating. Realized I'm in over my head though.


Vargol commented Apr 24, 2025

I won't be able to look at this for a few days. I think PyTorch has some 'interesting' behaviour going on with Tensors if you try to access the data directly.

>>> import torch
>>> x = torch.tensor([[1., -1.], [1., -1.]])
>>> print(x)
tensor([[ 1., -1.],
        [ 1., -1.]])
>>> print(x.data)
tensor([[ 1., -1.],
        [ 1., -1.]])
>>> type(x.data)
<class 'torch.Tensor'>

It's like it's passing the call back up to the class level. Oh, I just thought of this test:

>>> import torch
>>> x = torch.tensor([1])
>>> print (x)
tensor([1])
>>> print (x.data)
tensor([1])
>>> print (x.data.data.data)
tensor([1])
>>> 

If GGMLTensors are doing the same thing, I get why it's infinitely recursing: it looks like there's a detach call in there which gets dispatched to apply_to_quantized_data, but as that now calls GGMLTensor.data.detach(), it punts back up to GGMLTensor.detach, hits the dispatch code again, gets redirected back to apply_to_quantized_data, and so on.
I assume the PyTorch calls for Tensor.detach are doing something at a lower level to avoid the recursion.


Vargol commented Apr 26, 2025

Okay, I've convinced myself that using the data field won't work; pretty much anything to do with the lower-level Tensor stuff is done in the C libraries, so I can't really see how it dodges the recursion, and I don't think there'd be a way to apply that to GGMLTensors.

So that leaves three choices:
  • Go with the in-place move that I originally started with.
  • Use the futures behaviour, perhaps with an "is it a GGUF model" check.
  • Copy the state dictionary manually if there's a GGUF model.

The last option looks something like this:

if self._cpu_state_dict is not None:
    new_state_dict: dict[str, torch.Tensor] = {}
    for k, v in self._cpu_state_dict.items():
        new_state_dict[k] = v.to(self._compute_device, copy=True)
    self._model.load_state_dict(new_state_dict, assign=True)

# new code
check_for_gguf = self._model.state_dict().get("img_in.weight")
if isinstance(check_for_gguf, GGMLTensor):
    new_state_dict: dict[str, torch.Tensor] = {}
    for k, v in self._model.state_dict().items():
        new_state_dict[k] = v.to(self._compute_device, copy=True)
    self._model.load_state_dict(new_state_dict, assign=True)
# end of new code

self._model.to(self._compute_device)

Using the futures call with a check would end up like this:

if self._cpu_state_dict is not None:
    new_state_dict: dict[str, torch.Tensor] = {}
    for k, v in self._cpu_state_dict.items():
        new_state_dict[k] = v.to(self._compute_device, copy=True)
    self._model.load_state_dict(new_state_dict, assign=True)

# new code
check_for_gguf = self._model.state_dict().get("img_in.weight")
if isinstance(check_for_gguf, GGMLTensor):
    old_value = torch.__future__.get_overwrite_module_params_on_conversion()
    torch.__future__.set_overwrite_module_params_on_conversion(True)
    self._model.to(self._compute_device)
    torch.__future__.set_overwrite_module_params_on_conversion(old_value)
else:
    self._model.to(self._compute_device)

I personally prefer the futures code: it is ready if PyTorch ever switches behaviours, it probably avoids an extra run through and copy of the weights compared to the manual move, and it shouldn't break the tests.

Forgot to mention: the GGMLTensor check requires the appropriate import:

from invokeai.backend.quantization.gguf.ggml_tensor import GGMLTensor


Successfully merging this pull request may close these issues.

[bug]: GGUF models no longer work on MacOS, tensors on cpu not on mps