Background
It is time to rewrite the current checkpointing, for two reasons:
- Functionality support: move from FSDPv1 logic to DTensor logic, so that we support DCP for FSDPv2, TP, etc.
- Cleanup: the current checkpointing has served many generations of fms-fsdp, and many of the features developed along the way were either never used or are no longer used. We can take this chance to rebuild it from scratch.
Proposal
- move from FSDPv1 checkpointing to DTensor checkpointing
- remove metadata.pth file saving and loading, and move all metadata into state_dict as train_state, so that state_dict has a uniform layout:
state_dict = {
    "model_state": model_state,
    "optim_state": optim_state,
    "train_state": train_state,
}
- simplify APIs:
old: def __init__(self, ckpdir, n_to_save, parallel_mode, rank, local_rank, report_fn=None, model_auto_placement=False)
new: def __init__(self, path)
old: def load(self, model, optimizer, dataloader, path="", reset_stepcount=False, strict=True, is_compiled=False)
new: def load(self, model, optimizer, train_state)
old: def save(self, step, model, optimizer, dataloader, **kwargs)
new: def save(self, step, model, optimizer, train_state)
Detailed cleanups
- remove dataloader support, as the data loader was moved out of the checkpoint util a long time ago.
- remove support for loading from a single ckpt and saving a single ckpt, for the following reasons:
- we barely/never used them.
- conversion, when needed, should be done as a post-processing step.
- remove support for checkpoint cleanup with max_ckpt_count, as it is not used and we are saving all checkpoints with the smart GPFS solution.
- remove "get_oldest", as it is never used.
- remove the special handling for HSDP, as FSDPv2 no longer has this issue.
- remove parallel_mode arg, for the same reason as above
- remove report_fn arg; the default is good enough.
- remove model_auto_placement arg.
Many of these dropped features could be left in place without harm, but it is better to start from a cleaner version, and we can always add them back if necessary.