test: skip mfsdp_fully_shard cases when world_size < mesh size #4487

Open
wujingyue wants to merge 1 commit into NVIDIA:main from wujingyue:fsdp-skip-when-not-enough-gpus

@wujingyue
Contributor

Summary

  • tests/unit_tests/distributed/megatron_fsdp/test_mfsdp_fully_shard.py unconditionally builds an 8-rank device mesh with init_device_mesh, so launching the file with fewer GPUs (e.g. 2) fails with "Mesh should not be bigger than default world size N, but found 8 ranks!" instead of skipping.
  • Added a guard in build_distributed_environment that computes the required world size from mesh_dim_config and calls pytest.skip(...) when the launched world size is too small. This mirrors the existing world_size != dp_size skip in the sibling test_mcore_fully_sharded_data_parallel.py.
  • No behavior change in CI (still 8 GPUs); only affects developer runs with fewer GPUs.
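
A minimal sketch of the guard described above, not the exact diff. It assumes mesh_dim_config is a sequence of per-dimension mesh sizes whose product is the required world size; the helper names (required_world_size, skip_reason, skip_if_world_size_too_small) are hypothetical and do not appear in the test file.

```python
import math


def required_world_size(mesh_dim_config):
    # Assumption: mesh_dim_config holds one size per mesh dimension,
    # so the mesh needs the product of those sizes in ranks.
    return math.prod(mesh_dim_config)


def skip_reason(world_size, mesh_dim_config):
    # Return a skip message when the launched world size cannot hold
    # the requested mesh, or None when the test can run.
    required = required_world_size(mesh_dim_config)
    if world_size < required:
        return (
            f"requires world_size >= {required} for mesh "
            f"{tuple(mesh_dim_config)}, got {world_size}"
        )
    return None


def skip_if_world_size_too_small(world_size, mesh_dim_config):
    # Hypothetical wrapper: skip the current pytest test instead of
    # letting device mesh construction raise.
    reason = skip_reason(world_size, mesh_dim_config)
    if reason is not None:
        import pytest  # imported lazily; only needed when skipping

        pytest.skip(reason)
```

In the actual test, world_size would presumably come from torch.distributed.get_world_size() after the process group is initialized, and the check would run before init_device_mesh is called.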

Test plan

  • torch.distributed.run --nproc_per_node 2 ... pytest test_mfsdp_fully_shard.py — previously failed with mesh-size errors, now reports 6 passed, 314 skipped, 6 xfailed, 0 failed.
  • CI on 8 GPUs to confirm no regression.

🤖 Generated with Claude Code

The tests in tests/unit_tests/distributed/megatron_fsdp/test_mfsdp_fully_shard.py
construct an 8-rank device mesh unconditionally, so launching the file with fewer
GPUs (e.g. 2) errors out with "Mesh should not be bigger than default world size".
Compute the required world size from mesh_dim_config and pytest.skip when the
launched world size is too small, mirroring the guard already in
test_mcore_fully_sharded_data_parallel.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@copy-pr-bot

copy-pr-bot Bot commented Apr 27, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@wujingyue
Contributor Author

/ok to test

@svcnvidia-nemo-ci svcnvidia-nemo-ci added this to the Core 0.16 milestone Apr 27, 2026
@wujingyue wujingyue marked this pull request as ready for review April 27, 2026 23:30
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team April 27, 2026 23:30
@wujingyue wujingyue requested review from cspades and shjwudp April 27, 2026 23:31