casteryh (Contributor) commented Nov 3, 2025

Summary

This PR adds a workaround for the monarch bug in which proc_mesh.stop() results in exit code 1, causing all other tests to fail. It replaces the earlier workaround of commenting out the proc_mesh.stop() calls.

Changes

  • Added monarch.actor.unhandled_fault_hook = lambda failure: None to all test files that use monarch actors
  • This prevents the unhandled fault from propagating and failing tests
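
The change in the first bullet could be applied once per test module, roughly as sketched below. The attribute monarch.actor.unhandled_fault_hook and the no-op lambda are taken from this PR; the helper name install_noop_fault_hook and the ImportError guard are assumptions, added only so the snippet also runs where monarch is not installed. A later comment on this PR reports that the workaround did not resolve the failure, so treat this as an illustration of the attempted approach rather than a confirmed fix.

```python
import importlib


def install_noop_fault_hook() -> bool:
    """Replace monarch's unhandled-fault hook with a no-op, if monarch is importable.

    Hypothetical helper (not part of monarch or of this PR).
    Returns True when the hook was set, False when monarch is unavailable.
    """
    try:
        actor = importlib.import_module("monarch.actor")
    except ImportError:
        # monarch is not installed in this environment; nothing to patch.
        return False
    # Assignment quoted from the PR description: ignore the reported failure
    # instead of letting it propagate.
    actor.unhandled_fault_hook = lambda failure: None
    return True


# Run at import time of the test module.
installed = install_noop_fault_hook()
```

In the PR the assignment is added directly to each test file; wrapping it in a helper here is only a convenience for the sketch.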

Files Updated

  • tests/test_store.py
  • tests/test_tensor_slice.py
  • tests/test_state_dict.py
  • tests/test_large_tensors.py
  • tests/test_resharding_basic.py
  • tests/test_keys.py

Testing

This workaround allows proc_mesh.stop() calls to remain in place while preventing the exit code 1 error.

Notes

This is a temporary workaround until the underlying monarch bug is fixed.

This is a better workaround than commenting out proc_mesh.stop() calls.
The unhandled_fault_hook prevents proc_mesh.stop() from raising exit code 1,
which was causing all other tests to fail.

Files updated:
- test_store.py
- test_tensor_slice.py
- test_state_dict.py
- test_large_tensors.py
- test_resharding_basic.py
- test_keys.py
meta-cla bot added the CLA Signed label Nov 3, 2025
With the unhandled_fault_hook workaround in place, we can safely
uncomment all _proc_mesh.stop() calls that were previously commented out.

Files updated:
- test_store.py (4 locations)
- test_tensor_slice.py (4 locations)
- test_state_dict.py (2 locations)
- test_large_tensors.py (1 location)
- test_resharding_basic.py (2 locations)
casteryh (Contributor Author) commented Nov 3, 2025

Turns out this workaround still doesn't work.
