Refactored auto-microbatching hook handles for FSDP #3843

rithwik-db · 2025-05-02T01:21:48Z

Refactored auto-microbatching hook handles for FSDP1 with additional documentation.

This PR was originally designed to support FSDP2 auto microbatching, but because of additional issues with FSDP2 state, we moved that work to a draft PR: #3866

bowenyang008 · 2025-05-21T23:27:43Z

can we run an e2e test to verify it works?

rithwik-db · 2025-05-22T06:39:25Z

> can we run an e2e test to verify it works?

It seems hard to curate an e2e test to catch this failure that would be more informative than our current unit tests. We have these two unit tests:

FSDP1
FSDP2

In theory, we could create a larger example where a certain MPT module raises a CUDA OOM error at a certain epoch for a specific batch size. Alternatively, we could use an MPT module with a massive hidden layer in the FFN that runs into OOM for one batch size but not for half of it...
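As a rough illustration of the second idea (not code from this PR or from Composer), the halve-on-OOM behavior such a test would exercise can be sketched in plain Python. Everything here is hypothetical: `FakeOOMError` stands in for a CUDA OOM, and `memory_limit` stands in for the largest microbatch the oversized FFN could fit.

```python
class FakeOOMError(RuntimeError):
    """Hypothetical stand-in for a CUDA out-of-memory error."""


def run_step(batch, microbatch_size, memory_limit):
    """Process `batch` in microbatches; fail if one exceeds the fake limit."""
    if microbatch_size > memory_limit:
        raise FakeOOMError(
            f"microbatch of {microbatch_size} exceeds limit {memory_limit}"
        )
    chunks = [
        batch[i:i + microbatch_size]
        for i in range(0, len(batch), microbatch_size)
    ]
    # Summing per chunk mimics accumulating results across microbatches.
    return sum(sum(chunk) for chunk in chunks)


def run_with_auto_microbatching(batch, memory_limit):
    """Halve the microbatch size after each simulated OOM until a step fits."""
    microbatch_size = len(batch)
    while True:
        try:
            return run_step(batch, microbatch_size, memory_limit), microbatch_size
        except FakeOOMError:
            if microbatch_size == 1:
                raise  # nothing smaller left to try
            microbatch_size = max(1, microbatch_size // 2)


# A batch of 16 does not fit under a limit of 8, but half of it does.
result, size = run_with_auto_microbatching(list(range(16)), memory_limit=8)
```

A real test would replace the fake limit with a model large enough to OOM on the full batch, and would assert that the accumulated result matches a run that started at the smaller microbatch size.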

bowenyang008 · 2025-05-22T16:47:14Z

> > can we run an e2e test to verify it works?
>
> It seems hard to curate an e2e test to catch this failure that would be more informative than our current unit tests. We have these two unit tests:
>
> FSDP1
> FSDP2
>
> I guess in theory, we could create a larger example where a certain MPT module raises a CUDA OOM error at a certain epoch given a specific batch size or we could use an MPT module with a massive hidden layer in the FFN that will run into OOM for one batch size but not 1/2 of it...

I meant just a general e2e test; it does not have to trigger OOM.

rithwik-db · 2025-05-22T20:03:30Z

Tested here: mpt-7b-fsdp2-p39FPR, compared against the base run (mpt-7b-fsdp2-AKLNwv). The numbers look good based on the tolerances mentioned in the regression testing PR, @bowenyang008. (Note that the microbatch size defaults to 8 when auto is set.)

fixed test issues
formatted
gated non-wrapped to FSDP1
updated for FSDP2
propagated changes to trainer
added minor test fix
formatted
formatted once more
addressed comments
formatted
minor fix

rithwik-db changed the title ~~Added hook handles for FSDP2 to address automicrobatching~~ [WIP] Added hook handles for FSDP2 to address automicrobatching May 2, 2025

rithwik-db force-pushed the hookhandles branch from 643c29b to a5cc584 Compare May 2, 2025 23:37

rithwik-db changed the title ~~[WIP] Added hook handles for FSDP2 to address automicrobatching~~ Added hook handles for FSDP2 to supported auto microbatching May 2, 2025

rithwik-db changed the title ~~Added hook handles for FSDP2 to supported auto microbatching~~ Added hook handles for FSDP2 to support auto microbatching May 2, 2025

rithwik-db requested review from bowenyang008 and dakinggg May 3, 2025 00:01

bowenyang008 reviewed May 5, 2025

composer/distributed/fsdp2.py (outdated)

bowenyang008 reviewed May 5, 2025

tests/common/models.py (outdated)

bowenyang008 reviewed May 5, 2025

tests/common/models.py (outdated)

bowenyang008 reviewed May 5, 2025

tests/trainer/test_fsdp2.py (outdated)

dakinggg reviewed May 5, 2025

composer/distributed/fsdp2.py (outdated)

rithwik-db force-pushed the hookhandles branch 2 times, most recently from b126818 to 2a1cfb4 Compare May 5, 2025 22:11

rithwik-db requested review from bowenyang008 and dakinggg May 5, 2025 22:12

bowenyang008 reviewed May 6, 2025

composer/distributed/shared_utils.py

bowenyang008 reviewed May 6, 2025

composer/distributed/shared_utils.py (outdated)

bowenyang008 reviewed May 6, 2025

composer/distributed/shared_utils.py (outdated)

bowenyang008 reviewed May 6, 2025

composer/distributed/shared_utils.py (outdated)

rithwik-db force-pushed the hookhandles branch 2 times, most recently from fa82a6b to 647ad56 Compare May 21, 2025 00:32

rithwik-db requested a review from bowenyang008 May 21, 2025 20:27