
Handling edge case of continued pretraining from finished run #126

@daviswer


When the dataloader resumes from a checkpoint, it expects a path to the checkpoints directory. It picks the most recent checkpoint folder in that directory and loads the relevant data from it.

This breaks when continuing pretraining from a completed run, because the final step of a completed run saves a single-file checkpoint into the checkpoints directory. The most recent item in that directory is then a file rather than a folder, so the dataloader cannot resume correctly.

Model checkpointing already handles this case by accepting either the checkpoints path, in which case it pulls the latest item, or a path to a particular checkpoint directory. The dataloader does not currently support the latter. We can either add this capability to the dataloader, or change the single-file save at the end of the run so that it goes outside the checkpoints directory, which should probably contain only checkpoint folders anyway.
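
To make the first option concrete, here is a minimal sketch of how the dataloader side could resolve its load path. Everything in it is hypothetical: `resolve_checkpoint_dir`, `build_example_tree`, and the folder and file names (`step_100`, `final_model.pth`, `loader_state.pth`) are illustrative and not taken from the existing code. It also assumes that a checkpoint folder contains only files, so a directory with no subfolders is treated as a specific checkpoint, and that "most recent" means latest modification time.

```python
import os
import tempfile


def resolve_checkpoint_dir(path):
    """Hypothetical helper: return the checkpoint folder to load from, or None.

    Accepts either the checkpoints directory (newest subfolder is chosen)
    or a path to one particular checkpoint folder (returned unchanged).
    """
    if not os.path.isdir(path):
        # A single-file checkpoint holds nothing the dataloader can load.
        return None
    subdirs = [entry.path for entry in os.scandir(path) if entry.is_dir()]
    if not subdirs:
        # Assumption: checkpoint folders contain only files, so a directory
        # without subfolders is itself a specific checkpoint folder.
        return path
    # Ignore loose files (such as the final single-file save) and pick
    # the most recently modified folder.
    return max(subdirs, key=os.path.getmtime)


def build_example_tree(root):
    """Create a finished-run layout: two folders plus a newer single file."""
    for offset, name in enumerate(["step_100", "step_200"]):
        folder = os.path.join(root, name)
        os.mkdir(folder)
        with open(os.path.join(folder, "loader_state.pth"), "w") as f:
            f.write("")
        # Set explicit modification times so the ordering is deterministic.
        os.utime(folder, (1000 + offset, 1000 + offset))
    final = os.path.join(root, "final_model.pth")
    with open(final, "w") as f:
        f.write("")
    os.utime(final, (5000, 5000))  # newest item in the directory is a file
    return root


if __name__ == "__main__":
    root = build_example_tree(tempfile.mkdtemp())
    print(resolve_checkpoint_dir(root))
```

With this approach the final single-file save can stay where it is, since files are skipped when looking for the latest folder. Moving that save outside the checkpoints directory would make the filtering unnecessary, at the cost of changing where finished runs write their output.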
