Skip to content

MSC Checkpointing Changes #789

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 47 commits into from
Apr 2, 2025

Conversation

chris-hawes
Copy link
Contributor

@chris-hawes chris-hawes commented Feb 11, 2025

Modulus Pull Request

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

Copy link
Collaborator

@akshaysubr akshaysubr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks good to me. Had a few stylistic comments and only one other important comment about unifying the filesystem abstractions used in the code base.

@NickGeneva
Copy link
Collaborator

/blossom-ci

@NickGeneva
Copy link
Collaborator

/blossom-ci

@NickGeneva
Copy link
Collaborator

/blossom-ci

@NickGeneva
Copy link
Collaborator

/blossom-ci

2 similar comments
@NickGeneva
Copy link
Collaborator

/blossom-ci

@NickGeneva
Copy link
Collaborator

/blossom-ci

@NickGeneva
Copy link
Collaborator

/blossom-ci

@NickGeneva
Copy link
Collaborator

/blossom-ci

2 similar comments
@NickGeneva
Copy link
Collaborator

/blossom-ci

@NickGeneva
Copy link
Collaborator

/blossom-ci

@NickGeneva
Copy link
Collaborator

/blossom-ci

@NickGeneva NickGeneva merged commit 51a90bd into NVIDIA:main Apr 2, 2025
1 check passed
RishikeshRanade pushed a commit that referenced this pull request Apr 14, 2025
* Working changes to be cleaned up.

* Rename msc_config.yaml

* Fixed pytorch test issue by removing MSC Cache

* Updated project dependencies

* Find MSC config using absolute path.

* Re-added cuda test parameter.

* Add test to read from public S3 bucket using MSC.

* Revert save_checkpoint_freq value.

* Remove temporary printing

* Remove unnecessary dependency

* Switched to use consistent mechanism for detecting msc URIs

* Moved fsspec.filesystem logic into filesystem.py

* Change to cache for non-file protocols when reading non-modulus models.

* Moved code to generate checkpoint directory.directory

* Added get_checkpoint_dir import

* Address review feedback.

* Changes from code review.

* Addressed file test issue from review.

* Fix to file existence check.

* Fix merge conflicts due to project name change.

* Updated CHANGELOG.

* Added Multi-Storage Client to allow checkpointing to/from Object Storage

Signed-off-by: Chris Hawes <[email protected]>

* Addressed issues identified by pre-commit.

* Update filesystem.py

* Update __init__.py

* Update Dockerfile

---------

Signed-off-by: Chris Hawes <[email protected]>
Co-authored-by: Nicholas Geneva <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants