- Generate a metadata.yaml file containing the number of training samples and evaluation samples for existing HDF5 datasets.
- Write the same sample counts to metadata.yaml for JSONL files after tokenization.
metadata.yaml
...
training_samples: 10000
evaluation_samples: 2000
The number of training samples is needed to convert a number of epochs into a number of training steps for a dataset.
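As a rough illustration of that conversion, here is a minimal Python sketch. The function name `epochs_to_steps` and the `batch_size` parameter are hypothetical and not part of this change. It also assumes that one step consumes one global batch and that a final partial batch still counts as a step; a trainer that drops the last batch would round down instead.

```python
import math


def epochs_to_steps(training_samples: int, batch_size: int, epochs: int = 1) -> int:
    """Convert a number of epochs into optimizer steps.

    Hypothetical helper: assumes one step per global batch and that a
    trailing partial batch is kept (rounded up), not dropped.
    """
    if training_samples <= 0 or batch_size <= 0 or epochs < 0:
        raise ValueError("training_samples and batch_size must be positive, epochs non-negative")
    # Steps needed to see every training sample once.
    steps_per_epoch = math.ceil(training_samples / batch_size)
    return steps_per_epoch * epochs


# training_samples as it appears in the sample metadata.yaml above.
steps = epochs_to_steps(training_samples=10000, batch_size=32, epochs=3)
print(steps)
```

With `training_samples: 10000` and an assumed batch size of 32, one epoch is 313 steps (312 full batches plus one partial batch), so three epochs come to 939 steps.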