Final singlefile checkpoint saves one folder up #127
Signed-off-by: Davis Wertheimer <[email protected]>
@daviswer one convention I have been using is:
would using this convention fit the need here as well? if so, we can use this convention.
oh yeah that's a good idea, I'll set that up
@daviswer is the final pth file a single file or multiple?
I think we should make it
This will make it easier for future consumption/conversion, e.g. https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py#L231
to summarize, our folders should look like:
Sure I can do that. It is always a single file.
* make mamba
* add quick debug
* add quick debug
* revert debug verbosity
* Learning rate scheduler changed (Constant)
* Add AutoHandler
* Add Auto cfg option for AutoHAndler
* Len gets called before open
* path/filepath typo fix
* Partitioning fix from mup-search
* Cosine 0.01 decay
* Warmup interval change
* Schedule change
* Constant schedule
* LR schedule change (cool down and constant lr)
* Update dataset_utils.py
  Added a check for length of doc
* LR schedule change (Warmup + constant)
* Update dataset_utils.py
* Cosine schedule
* For constant lr 1.5e5
* Schedule change
* Schedule change
* Final singlefile checkpoint saves one folder up (#127)
  * Final singlefile checkpoint saves one folder up
    Signed-off-by: Davis Wertheimer <[email protected]>
  * save file under new pth subfolder
    Signed-off-by: Davis Wertheimer <[email protected]>
  * Repath for easier consumption/conversion
    Signed-off-by: Davis Wertheimer <[email protected]>
  ---------
  Signed-off-by: Davis Wertheimer <[email protected]>
* Added cool down
* length of doc check
* splitstrip cols and pass to fhandler
* fhandler col_names support
* Warmup for annealing
* Debugging
* Debugging II
* Empty shard check
* Added constant lr schedule with warmup
* added print for lenght of doc
* added print for lenght of doc II
* Update dataset_utils.py
* Update dataset_utils.py
* Update dataset_utils.py
* Update dataset_utils.py
* Adding print for debug
* Revert "Pulled from data-fixes branch"
  This reverts commit ac5194b, reversing changes made to 1b50708. reverting changes
* Revert all changes made after March 6 (before merge)
* Revert all changes made after March 6 (before merge)
* removed print
---------
Signed-off-by: Davis Wertheimer <[email protected]>
Co-authored-by: Linsong Chu <[email protected]>
Co-authored-by: divykum2 <[email protected]>
Co-authored-by: divya-kumari32 <[email protected]>
Addresses #126
Adjusts the save location of the final single-file checkpoint to be one folder up (directly in the specified ckp directory, rather than under the 'checkpoints' subfolder). That way the dataloader checkpointer only has to deal with distributed checkpoint folders, with no stray single file mixed in among them.
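
To make the layout in the description concrete, here is a minimal sketch of the idea, not the code from this PR. The helper names and the `step_{N}_ckp` / `step_{N}_final.pth` naming are assumptions made up for illustration; only the 'checkpoints' subfolder name comes from the description. It uses plain `pathlib` and a placeholder byte string in place of real serialized weights.

```python
from pathlib import Path
import tempfile

# All helper, folder, and file names here are hypothetical, for illustration only.
DIST_SUBFOLDER = "checkpoints"  # subfolder name taken from the PR description


def distributed_ckpt_dir(ckp_root: Path, step: int) -> Path:
    """Folder holding one distributed (sharded) checkpoint."""
    return ckp_root / DIST_SUBFOLDER / f"step_{step}_ckp"


def final_singlefile_path(ckp_root: Path, step: int) -> Path:
    """Final single-file checkpoint, placed one level above the distributed folders."""
    return ckp_root / f"step_{step}_final.pth"


def list_distributed_ckpts(ckp_root: Path) -> list[Path]:
    """Return the distributed checkpoint folders, sorted by name."""
    folder = ckp_root / DIST_SUBFOLDER
    if not folder.is_dir():
        return []
    # Only directories are expected here once the single file lives one level up.
    return sorted(p for p in folder.iterdir() if p.is_dir())


def demo() -> tuple[list[str], str]:
    """Build the layout in a temp dir and report what a checkpoint scanner would see."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for step in (100, 200):
            distributed_ckpt_dir(root, step).mkdir(parents=True)
        final = final_singlefile_path(root, 200)
        final.write_bytes(b"placeholder for serialized weights")
        found = [p.name for p in list_distributed_ckpts(root)]
        return found, final.relative_to(root).as_posix()
```

Calling `demo()` returns the names of the distributed checkpoint folders plus the relative path of the final file, showing that a scan of the 'checkpoints' subfolder never encounters the single-file checkpoint.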