
demo: Show map_location default doesn't fix CUDA-to-CPU loading #3

Open
robosimon wants to merge 1 commit into main from demo/map-location-default

Conversation

@robosimon (Owner)

Demonstration: Why Simple map_location Default Fails

This PR demonstrates that Steffen's suggested simple fix (adding a map_location="cpu" default when CUDA is unavailable) does not solve the CUDA-to-CPU loading issue.

Changes

  • Added a simple 4-line fix in CEBRA.load() that sets map_location="cpu" when CUDA is not available
  • Added the comprehensive tests from the main fix PR
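
The naive default can be sketched as pure decision logic. This is a simplified illustration, not the actual patch: the real fix lives inside CEBRA.load() and feeds the result to torch.load(), and the helper name here is hypothetical:

```python
def default_map_location(map_location, cuda_available):
    """The 4-line fix in essence: if the caller passed no map_location
    and CUDA is unavailable, fall back to loading tensors onto the CPU."""
    if map_location is None and not cuda_available:
        return "cpu"
    return map_location

# With no CUDA, an unset map_location becomes "cpu" ...
print(default_map_location(None, cuda_available=False))      # cpu
# ... while an explicit caller choice is always respected.
print(default_map_location("cuda:0", cuda_available=True))   # cuda:0
```

This only changes where torch.load() places the tensors; as the test results below show, that is not where the loading actually breaks.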

Test Results (All FAIL)

$ pytest tests/test_sklearn.py::test_load_cuda_checkpoint_falls_back_to_cpu -v

FAILED tests/test_sklearn.py::test_load_cuda_checkpoint_falls_back_to_cpu[offset1-model-cuda]
FAILED tests/test_sklearn.py::test_load_cuda_checkpoint_falls_back_to_cpu[offset1-model-cuda:0]
FAILED tests/test_sklearn.py::test_load_cuda_checkpoint_falls_back_to_cpu[offset1-model-saved_device2]
FAILED tests/test_sklearn.py::test_load_cuda_checkpoint_falls_back_to_cpu[offset1-model-saved_device3]
FAILED tests/test_sklearn.py::test_load_cuda_checkpoint_falls_back_to_cpu[parametrized-model-5-cuda]
FAILED tests/test_sklearn.py::test_load_cuda_checkpoint_falls_back_to_cpu[parametrized-model-5-cuda:0]
FAILED tests/test_sklearn.py::test_load_cuda_checkpoint_falls_back_to_cpu[parametrized-model-5-saved_device2]
FAILED tests/test_sklearn.py::test_load_cuda_checkpoint_falls_back_to_cpu[parametrized-model-5-saved_device3]

8 failed, 3 warnings in 26.79s
$ pytest tests/test_sklearn.py::test_safe_torch_load_cuda_fallback -v

FAILED tests/test_sklearn.py::test_safe_torch_load_cuda_fallback - RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False

Root Cause

The simple map_location="cpu" default only affects torch.load(): it loads the tensors onto the CPU. But _load_cebra_with_sklearn_backend() then calls:

model = cebra.models.init(...).to(state['device_'])  # device_ is still 'cuda'!

This tries to move the model to CUDA after loading, causing:

AssertionError: Torch not compiled with CUDA enabled
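
The failure mode can be reproduced without torch at all. The stand-in class below only mimics how a module's .to() asserts CUDA availability; the class name and behavior are illustrative, not CEBRA code:

```python
class FakeModel:
    """Stand-in for a torch module (illustration only)."""

    def to(self, device, cuda_available=False):
        # Mirrors torch's behavior: moving to a CUDA device on a
        # CPU-only build raises an AssertionError.
        if str(device).startswith("cuda") and not cuda_available:
            raise AssertionError("Torch not compiled with CUDA enabled")
        return self

# map_location="cpu" made torch.load() put the tensors on CPU, but the
# checkpoint metadata is untouched:
state = {"device_": "cuda"}  # still records the saving machine's device

try:
    FakeModel().to(state["device_"])  # the .to() call made after loading
except AssertionError as e:
    print(e)  # Torch not compiled with CUDA enabled
```

In other words, map_location fixes tensor placement but leaves the stale device string in the checkpoint, and that string is what the sklearn backend uses next.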

Conclusion

This confirms that Steffen was right: the simple default map_location="cpu" alone is insufficient. The comprehensive fix in PR AdaptiveMotorControlLab#296 is necessary because we need to:

  1. Resolve the device from the checkpoint before using it
  2. Update state['device_'] to CPU when CUDA is unavailable
  3. Handle both str and torch.device types
  4. Use the resolved device for ALL .to() calls (model, criterion, solver)
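
Steps 1-3 amount to a small device-resolution helper, sketched here in pure Python (the helper name is hypothetical; torch.device inputs are handled via str(), which renders e.g. device('cuda:0') as 'cuda:0'). Step 4 then passes the resolved name to every .to() call:

```python
def resolve_device(saved_device, cuda_available):
    """Resolve the device recorded in a checkpoint to one usable now.

    saved_device may be a str ('cuda', 'cuda:0', 'cpu') or a
    torch.device; normalizing through str() covers both (step 3).
    Returns 'cpu' when the checkpoint wants CUDA but none is
    available (step 2); otherwise returns the saved device (step 1).
    """
    name = str(saved_device)
    if name.startswith("cuda") and not cuda_available:
        return "cpu"
    return name

print(resolve_device("cuda:0", cuda_available=False))  # cpu
print(resolve_device("cpu", cuda_available=False))     # cpu
print(resolve_device("cuda", cuda_available=True))     # cuda
```

The key difference from the naive default: the resolved name replaces state['device_'] itself, so the model, criterion, and solver all end up on a device that actually exists.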

See the actual working fix: AdaptiveMotorControlLab#296
