
Conversation

@littlebullGit (Contributor) commented Nov 27, 2025

What does this PR do?

  • Adds a CUDA-only integration test that mirrors the reporter’s compiled ModelParallel setup, so the KeyError('model.0.weight') reproduces in CI.
  • Fixes ModelParallelStrategy.optimizer_state so that when torch.compile wraps the module, the optimizer state is rekeyed through both the compiled wrapper and the original module before single-file checkpointing, preventing the KeyError (a sketch of the rekeying idea follows below).
  • Documents the fix in the unreleased changelog.

Fixes #21357
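
For context, here is a minimal sketch of the rekeying idea, not Lightning’s actual implementation: torch.compile returns an OptimizedModule that keeps the user’s model under `_orig_mod`, so its parameter names carry a `_orig_mod.` prefix; building the id-to-name lookup from both the wrapper and the original module lets plain keys like `model.0.weight` resolve either way. The helper names `fqn_lookup` and `rekey_optimizer_state` are hypothetical.

```python
import torch


def fqn_lookup(module: torch.nn.Module) -> dict[int, str]:
    """Map id(param) -> fully qualified parameter name within `module`."""
    return {id(p): name for name, p in module.named_parameters()}


def rekey_optimizer_state(optimizer: torch.optim.Optimizer,
                          module: torch.nn.Module) -> dict[str, dict]:
    """Return the optimizer's per-parameter state keyed by parameter name."""
    lookup = fqn_lookup(module)
    # torch.compile wraps the model in an OptimizedModule whose parameters
    # are named with a `_orig_mod.` prefix; merging the original module's
    # names makes the un-prefixed keys (e.g. `model.0.weight`) win for the
    # same underlying parameter objects.
    original = getattr(module, "_orig_mod", None)
    if original is not None:
        lookup.update(fqn_lookup(original))
    return {
        lookup[id(p)]: optimizer.state[p]
        for group in optimizer.param_groups
        for p in group["params"]
        if p in optimizer.state
    }
```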

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes? (The CUDA test runs in CI; the CPU run skips as expected, see the skip-guard sketch after this list.)
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)
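
For reference, the CPU skip behavior can be illustrated with a plain-pytest guard. This is a hypothetical sketch with an illustrative test name and elided body; Lightning’s test suite uses its own RunIf helper for this:

```python
import pytest
import torch


# ModelParallel needs a device mesh, so gate the test on CUDA availability;
# on CPU-only runners pytest reports it as skipped rather than failed.
@pytest.mark.skipif(
    not torch.cuda.is_available() or torch.cuda.device_count() < 2,
    reason="requires at least 2 CUDA devices",
)
def test_compiled_model_parallel_single_file_checkpoint(tmp_path):
    # Hypothetical body: build the reporter's compiled ModelParallel setup,
    # run a training step, and save a non-distributed (single-file) checkpoint.
    ...
```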

@github-actions bot added the pl (Generic label for PyTorch Lightning package) label Nov 27, 2025
@littlebullGit marked this pull request as ready for review November 27, 2025 03:23
@littlebullGit force-pushed the fix/21357-modelparallel-checkpoint branch from 82f9a7d to db3d718 on November 27, 2025 03:48
@littlebullGit force-pushed the fix/21357-modelparallel-checkpoint branch from db3d718 to d4e476f on November 27, 2025 04:57
@littlebullGit changed the title from "Add regression test for ModelParallel single-file checkpoint" to "CUDA test to reproduce: ModelParallelStrategy fails with non-distributed checkpoint. #21357" on Nov 27, 2025
@codecov bot commented Nov 27, 2025

Codecov Report

❌ Patch coverage is 92.30769% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 79%. Comparing base (f3f6605) to head (ffcd6e4).
⚠️ Report is 1 commit behind head on master.
✅ All tests successful. No failed tests found.

❗ There is a different number of reports uploaded between BASE (f3f6605) and HEAD (ffcd6e4). See details below.

HEAD has 385 fewer uploads than BASE.
Flag               BASE (f3f6605)   HEAD (ffcd6e4)
python3.10         12               3
cpu                119              30
lightning          59               15
pytest             60               0
python3.11         24               6
lightning_fabric   30               0
python3.12         36               9
python3.12.7       35               9
python             12               3
pytorch2.1         12               6
pytest-full        59               30
pytorch2.2.2       6                3
pytorch2.3         6                3
pytorch_lightning  30               15
pytorch2.7         6                3
pytorch2.9         6                3
pytorch2.8         6                3
pytorch2.5.1       6                3
pytorch2.4.1       5                3
pytorch2.6         6                3
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #21384     +/-   ##
=========================================
- Coverage      87%      79%     -8%     
=========================================
  Files         269      266      -3     
  Lines       23813    23784     -29     
=========================================
- Hits        20628    18746   -1882     
- Misses       3185     5038   +1853     

@bhimrazy marked this pull request as draft November 27, 2025 07:34
@littlebullGit changed the title from "CUDA test to reproduce: ModelParallelStrategy fails with non-distributed checkpoint. #21357" to "Fix ModelParallelStrategy fails with non-distributed checkpoint. #21384" on Nov 27, 2025
@littlebullGit marked this pull request as ready for review November 27, 2025 16:08
@littlebullGit changed the title from "Fix ModelParallelStrategy fails with non-distributed checkpoint. #21384" to "Fix ModelParallelStrategy fails with non-distributed checkpoint." on Nov 27, 2025
@littlebullGit force-pushed the fix/21357-modelparallel-checkpoint branch 2 times, most recently from 31b0976 to 646e01b on November 27, 2025 19:42
@littlebullGit force-pushed the fix/21357-modelparallel-checkpoint branch from 90af5cb to cd663c6 on November 28, 2025 06:47
@SkafteNicki added the distributed (Generic distributed-related topic) and torch.compile labels Dec 1, 2025

Labels

  • distributed (Generic distributed-related topic)
  • has conflicts
  • pl (Generic label for PyTorch Lightning package)
  • torch.compile

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ModelParallelStrategy fails with non-distributed checkpoint.

2 participants