Problem
For studio to effectively use data prep, metadata should be generated for older datasets already onboarded to studio.
Requirements
- Create a function that accepts a folder containing existing hdf5 files.
- The dataset might support both training and evaluation.
- Process the contents of the existing dataset and derive the metrics present in metadata.yaml.
- Open question: is it possible to generate all the fields in metadata.yaml from an existing dataset?
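As a starting point for the first two requirements, here is a minimal sketch of a function that accepts a folder and counts the hdf5 files per split. It assumes that the split can be read from a file-name prefix (train, dev, test) and that files end in .hdf5 or .h5; both are assumptions, not something this page confirms, and count_split_files is a hypothetical name. Reading the token contents of each file (for example with h5py) is not shown here.

```python
from pathlib import Path

# ASSUMPTION: the split is encoded as a file-name prefix. The real naming
# convention of onboarded datasets may differ.
SPLIT_FIELDS = {
    "train": "number_of_training_files",
    "dev": "number_of_dev_files",
    "test": "number_of_test_files",
}

# ASSUMPTION: hdf5 files use one of these extensions.
HDF5_SUFFIXES = {".hdf5", ".h5"}


def count_split_files(folder):
    """Count hdf5 files per split, keyed by the metadata.yaml field name.

    Files that are not hdf5, or whose name matches no known prefix,
    are ignored.
    """
    counts = {field: 0 for field in SPLIT_FIELDS.values()}
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or path.suffix.lower() not in HDF5_SUFFIXES:
            continue
        for prefix, field in SPLIT_FIELDS.items():
            if path.name.startswith(prefix):
                counts[field] += 1
                break
    return counts
```

A dataset that supports only training would simply report 0 for the dev and test counts, which matches the shape of the sample metadata.yaml on this page.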
Background
- Studio uses the add_seq_metadata_dataset function to populate the training sequences for existing hdf5 tokenized datasets.
- This populates only the train_sequences field in metadata.yaml. Other fields are not available for existing datasets.
- If metadata.yaml already exists, train_sequences is added to it.
- metadata.yaml for existing datasets
- For new datasets, metadata.yaml is used directly. For new datasets prepared with data prep, all the fields are available:
train_articles: 100
train_completion_tokens: 53020
train_input_tokens: 53020
max_batch_size_dev: null
max_batch_size_train: 13
max_seq_length: 1024
number_of_dev_files: 0
number_of_test_files: 0
number_of_training_files: 4
train_output_tokens: 55296
train_padding_tokens: 2276
train_prompt_tokens: 0
train_sequences: 54
token_type_ids: true
tokenizer_model_type: "<class 'transformers.models.gpt2.configuration_gpt2.GPT2Config'>"
train_tokens_dropped_from_all_prompt: 0
train_tokens_dropped_from_packing: 0
vocab_size: 50257
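Some of the fields above appear to be derivable from a few counts, which bears on the open question of how much can be regenerated. The sketch below is only a consistency check against the sample values: the three relationships it encodes are inferred from those numbers, not from a documented definition, and derive_token_fields is a hypothetical helper. The train_tokens_dropped_* fields are 0 in the sample, so their role cannot be checked this way.

```python
def derive_token_fields(train_sequences, max_seq_length,
                        train_prompt_tokens, train_completion_tokens):
    """Derive token totals from per-dataset counts.

    ASSUMPTION: these relationships are inferred from the sample
    metadata.yaml, not from a confirmed definition of the fields.
    """
    # Input tokens look like prompt tokens plus completion tokens.
    train_input_tokens = train_prompt_tokens + train_completion_tokens
    # Output tokens look like the full padded size of every sequence.
    train_output_tokens = train_sequences * max_seq_length
    # Padding looks like whatever remains after the real tokens.
    train_padding_tokens = train_output_tokens - train_input_tokens
    return {
        "train_input_tokens": train_input_tokens,
        "train_output_tokens": train_output_tokens,
        "train_padding_tokens": train_padding_tokens,
    }


# Values copied from the sample metadata.yaml.
sample = {
    "train_input_tokens": 53020,
    "train_output_tokens": 55296,
    "train_padding_tokens": 2276,
}

derived = derive_token_fields(
    train_sequences=54,
    max_seq_length=1024,
    train_prompt_tokens=0,
    train_completion_tokens=53020,
)

# Collect any field where the derived value disagrees with the sample.
mismatches = {
    name: (value, sample[name])
    for name, value in derived.items()
    if value != sample[name]
}
print(mismatches)  # prints {}
```

If these relationships hold in general, an existing dataset would only need train_sequences, max_seq_length, and the prompt and completion token counts to fill in the padding and output totals. Fields such as train_articles may not be recoverable from tokenized files alone.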