Generate metadata.yaml for existing tokenized datasets

## Problem

For studio to effectively use data prep, metadata should be generated for older datasets already onboarded to studio.

Requirements
- Create a function that accepts a folder with existing hdf5 files
- Dataset might support both training and evaluation
- Process the contents of the existing dataset and derive the metrics that appear in metadata.yaml

- Is it possible to generate all the fields in metadata.yaml from an existing dataset?
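
As a starting point for the question above, here is a minimal sketch of how the token counts might be derived from the `token_type_ids` rows of an existing dataset. Everything named here is an assumption for illustration: the functions `summarize_token_types` and `iter_token_type_rows`, the `token_type_ids` dataset key, the `train*.hdf5` file pattern, and the 0/1/2 encoding for prompt/completion/padding. Fields that describe the pre-tokenization input (for example `train_articles` or the dropped-token counts) probably cannot be recovered from the hdf5 files alone.

```python
from pathlib import Path
from typing import Iterable, Iterator, List, Sequence

# Assumed token-type encoding; verify against the pipeline that wrote the files.
PROMPT, COMPLETION, PADDING = 0, 1, 2


def summarize_token_types(rows: Iterable[Sequence[int]]) -> dict:
    """Derive per-split counts from rows of token_type_ids (one row per sequence)."""
    stats = {
        "sequences": 0,
        "max_seq_length": 0,
        "output_tokens": 0,
        "prompt_tokens": 0,
        "completion_tokens": 0,
        "padding_tokens": 0,
    }
    for row in rows:
        row = list(row)
        stats["sequences"] += 1
        stats["max_seq_length"] = max(stats["max_seq_length"], len(row))
        stats["output_tokens"] += len(row)  # every slot, padding included
        stats["prompt_tokens"] += row.count(PROMPT)
        stats["completion_tokens"] += row.count(COMPLETION)
        stats["padding_tokens"] += row.count(PADDING)
    return stats


def iter_token_type_rows(folder: str, pattern: str = "train*.hdf5") -> Iterator[List[int]]:
    """Yield token_type_ids rows from every matching file (requires h5py)."""
    import h5py  # imported lazily so summarize_token_types runs without it

    for path in sorted(Path(folder).glob(pattern)):
        with h5py.File(path, "r") as f:
            # "token_type_ids" is an assumed dataset key, not a confirmed one.
            for row in f["token_type_ids"][:].tolist():
                yield row
```

With these two pieces, `summarize_token_types(iter_token_type_rows(folder))` would give the counts for the training split; an evaluation split could reuse the same function with a different file pattern.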

## Background 

- Studio uses the `add_seq_metadata_dataset` function to populate the training sequences for existing hdf5 tokenized datasets.
- This populates only the train_sequences field in metadata.yaml. The other fields are not available for existing datasets.
- In case metadata.yaml already exists, train_sequences is added to it.

- metadata.yaml for existing datasets:
```
train_sequences: 54
```

- For new datasets prepared with data prep, metadata.yaml is used directly and all the fields are available:
```
train_articles: 100
train_completion_tokens: 53020
train_input_tokens: 53020
max_batch_size_dev: null
max_batch_size_train: 13
max_seq_length: 1024
number_of_dev_files: 0
number_of_test_files: 0
number_of_training_files: 4
train_output_tokens: 55296
train_padding_tokens: 2276
train_prompt_tokens: 0
train_sequences: 54
token_type_ids: true
tokenizer_model_type: "<class 'transformers.models.gpt2.configuration_gpt2.GPT2Config'>"
train_tokens_dropped_from_all_prompt: 0
train_tokens_dropped_from_packing: 0
vocab_size: 50257
```
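
The example values above are internally consistent, which suggests some fields could be cross-checked or back-filled from others. The sketch below encodes three relationships inferred from this one example only; they are not documented invariants, and the helper name `check_token_accounting` is hypothetical.

```python
def check_token_accounting(meta: dict) -> list:
    """Return a list of relationships (inferred from one example) that do not hold."""
    problems = []

    # Every sequence is assumed to be padded to max_seq_length.
    if meta["train_sequences"] * meta["max_seq_length"] != meta["train_output_tokens"]:
        problems.append("output_tokens != sequences * max_seq_length")

    # Output slots that are not padding are assumed to be input tokens.
    if meta["train_output_tokens"] - meta["train_padding_tokens"] != meta["train_input_tokens"]:
        problems.append("input_tokens != output_tokens - padding_tokens")

    # Input tokens are assumed to split into prompt and completion tokens.
    if meta["train_prompt_tokens"] + meta["train_completion_tokens"] != meta["train_input_tokens"]:
        problems.append("input_tokens != prompt_tokens + completion_tokens")

    return problems


# Values copied from the metadata.yaml example above.
example = {
    "train_sequences": 54,
    "max_seq_length": 1024,
    "train_output_tokens": 55296,
    "train_padding_tokens": 2276,
    "train_input_tokens": 53020,
    "train_prompt_tokens": 0,
    "train_completion_tokens": 53020,
}
print(check_token_accounting(example))
```

If these relationships hold in general, train_output_tokens could be filled in for an existing dataset from train_sequences and max_seq_length alone.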


Generate metadata.yaml for existing tokenized datasets #135
