@daviswer
Collaborator

Addresses #126

Moves the final single-file checkpoint one folder up, so that it is saved directly in the specified ckp directory rather than under the 'checkpoints' subfolder. That way, the dataloader checkpointer only has to deal with distributed checkpoint folders.
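
As a rough illustration of why this helps, here is a minimal sketch of a checkpoint scanner that only ever sees step folders under the 'checkpoints' subfolder. The function name `latest_step_folder` and the `step_<number>` folder naming are assumptions made for this example, not code taken from the repository.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed folder naming for illustration only (e.g. "step_2000").
STEP_RE = re.compile(r"step_(\d+)")


def latest_step_folder(ckp_dir: Path) -> Optional[Path]:
    """Return the highest-numbered step folder under ckp_dir/checkpoints, or None."""
    root = ckp_dir / "checkpoints"
    if not root.is_dir():
        return None
    best, best_step = None, -1
    for entry in root.iterdir():
        match = STEP_RE.search(entry.name)
        # Only directories with a parsable step number are candidates.
        if entry.is_dir() and match and int(match.group(1)) > best_step:
            best, best_step = entry, int(match.group(1))
    return best
```

With the single-file checkpoint stored outside 'checkpoints', a scanner like this never has to decide whether an entry is a distributed checkpoint folder or a lone file.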

daviswer requested a review from lchu6 January 14, 2025 21:12
@lchu6
Contributor

lchu6 commented Jan 14, 2025

@daviswer one convention I have been using is:

ckpt_root_folder/checkpoints/step_xxx: for dcp
ckpt_root_folder/pth/step_xxx: for pth
ckpt_root_folder/hf/step_xxx: for converted hf format

Would this convention fit the need here as well? If so, we can use it.

@daviswer
Collaborator Author

oh yeah that's a good idea, I'll set that up

Signed-off-by: Davis Wertheimer <[email protected]>
@lchu6
Contributor

lchu6 commented Jan 14, 2025

@daviswer is the final pth output a single file or multiple files?

@lchu6
Contributor

lchu6 commented Jan 14, 2025

I think we should make it ckpt_root_folder/pth/step_xxx/consolidated.00.pth.

This will make it easier for future consumption/conversion, e.g. https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py#L231

To summarize, our folders should look like:
ckpt_root_folder/checkpoints/step_xxx/: many dcp files
ckpt_root_folder/pth/step_xxx/: single consolidated.00.pth file
ckpt_root_folder/hf/step_xxx/: many safetensors files
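
The layout above could be captured in a small path helper along these lines. This is only a sketch: `FORMAT_SUBDIRS` and `checkpoint_path` are hypothetical names, and formatting the step as a plain integer in place of `xxx` is an assumption.

```python
from pathlib import Path

# One subfolder per checkpoint format, following the convention listed above.
FORMAT_SUBDIRS = {"dcp": "checkpoints", "pth": "pth", "hf": "hf"}


def checkpoint_path(root: str, fmt: str, step: int) -> Path:
    """Build the save location for a given format and step (hypothetical helper)."""
    if fmt not in FORMAT_SUBDIRS:
        raise ValueError(f"unknown checkpoint format: {fmt!r}")
    step_dir = Path(root) / FORMAT_SUBDIRS[fmt] / f"step_{step}"
    # The pth format is always a single file, so return the file path itself.
    if fmt == "pth":
        return step_dir / "consolidated.00.pth"
    # dcp and hf checkpoints consist of many files, so return the folder.
    return step_dir
```

Keeping each format under its own subfolder means a consumer of one format, such as a conversion script reading `consolidated.00.pth`, never needs to filter out files belonging to another.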

@daviswer
Collaborator Author

Sure, I can do that. It is always a single file.

lchu6 merged commit 0fdb43d into main Jan 14, 2025
3 of 4 checks passed
lchu6 deleted the ckp-path-handling branch January 14, 2025 23:59
daviswer added a commit that referenced this pull request Jun 3, 2025
* make mamba

* add quick debug

* add quick debug

* revert debug verbosity

* Learning rate scheduler changed (Constant)

* Add AutoHandler

* Add Auto cfg option for AutoHAndler

* Len gets called before open

* path/filepath typo fix

* Partitioning fix from mup-search

* Cosine 0.01 decay

* Warmup interval change

* Schedule change

* Constant schedule

* LR schedule change (cool down and constant lr)

* Update dataset_utils.py

Added a check for length of doc

* LR schedule change (Warmup + constant)

* Update dataset_utils.py

* Cosine schedule

* For constant lr 1.5e5

* Schedule change

* Schedule change

* Final singlefile checkpoint saves one folder up (#127)

* Final singlefile checkpoint saves one folder up

Signed-off-by: Davis Wertheimer <[email protected]>

* save file under new pth subfolder

Signed-off-by: Davis Wertheimer <[email protected]>

* Repath for easier consumption/conversion

Signed-off-by: Davis Wertheimer <[email protected]>

---------

Signed-off-by: Davis Wertheimer <[email protected]>

* Added cool down

* length of doc check

* splitstrip cols and pass to fhandler

* fhandler col_names support

* Warmup for annealing

* Debugging

* Debugging II

* Empty shard check

* Added constant lr schedule with warmup

* added print for lenght of doc

* added print for lenght of doc II

* Update dataset_utils.py

* Update dataset_utils.py

* Update dataset_utils.py

* Update dataset_utils.py

* Adding print for debug

* Revert "Pulled from data-fixes branch"

This reverts commit ac5194b, reversing
changes made to 1b50708.

reverting changes

* Revert all changes made after March 6 (before merge)

* Revert all changes made after March 6 (before merge)

* removed print

---------

Signed-off-by: Davis Wertheimer <[email protected]>
Co-authored-by: Linsong Chu <[email protected]>
Co-authored-by: divykum2 <[email protected]>
Co-authored-by: divya-kumari32 <[email protected]>