Handling edge case of continued pretraining from finished run

When the dataloader loads from a checkpoint, it expects a path to the checkpoints directory, picks the most recent checkpoint folder inside it, and loads its state from there.

This is a problem when continuing a completed run, because the final step of a completed run saves a single-file checkpoint to the checkpoints directory. On resume, the most recent item in the checkpoints directory is then a file rather than a folder, which breaks the dataloader.

Model checkpointing handles this by accepting either the checkpoints path, in which case it pulls the latest item, or a path to a particular checkpoint directory. The dataloader does not currently support the latter. We can either add this capability, or change the single-file save at the end of the run so that it goes outside the checkpoints directory, which should probably contain only checkpoint folders anyhow.
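A rough sketch of the first option, to make the proposal concrete. Nothing here is taken from the existing code: `resolve_checkpoint_dir` and the folder and file names are hypothetical, and it assumes that a particular checkpoint folder contains no subfolders of its own and that "most recent" can be judged by modification time.

```python
import os
import tempfile
from pathlib import Path


def resolve_checkpoint_dir(path):
    """Return the checkpoint folder the dataloader should load from.

    Accepts either the checkpoints directory or one checkpoint folder.
    """
    path = Path(path)
    if not path.is_dir():
        raise NotADirectoryError(f"expected a directory, got: {path}")
    # Only folders are candidates, so a single-file checkpoint is skipped.
    subdirs = [p for p in path.iterdir() if p.is_dir()]
    if not subdirs:
        # Assumption: no subfolders means this is already a checkpoint folder.
        return path
    # Assumption: the most recently modified folder is the latest checkpoint.
    return max(subdirs, key=lambda p: p.stat().st_mtime)


# Usage: a finished run with two checkpoint folders plus a final single file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for step, mtime in (("step_100", 1_000), ("step_200", 2_000)):
        (root / step).mkdir()
        (root / step / "loader_state.bin").touch()
        os.utime(root / step, (mtime, mtime))  # fix folder ages for the demo
    (root / "final_model.bin").touch()  # newest item, but a file

    from_root = resolve_checkpoint_dir(root).name
    from_folder = resolve_checkpoint_dir(root / "step_100").name
```

With this, the single-file save no longer matters to the dataloader, and a caller can also pin a specific checkpoint folder. If checkpoint folders can contain subfolders, the check would need a different signal, such as a naming convention.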


Handling edge case of continued pretraining from finished run #126
