Generate metadata.yaml for existing tokenized datasets #135

@vmly

Description

Problem

For studio to use data prep effectively, metadata.yaml should be generated for older datasets that are already onboarded to studio.

Requirements

  • Create a function that accepts a folder with existing hdf5 files

  • The dataset might support both training and evaluation

  • Process the contents of the existing dataset and derive the metrics present in metadata.yaml

  • Is it possible to generate all the fields in metadata.yaml from an existing dataset?
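One possible starting point for the requirements above is sketched below. It is not an existing data prep API: the function names, the `train*.hdf5` file pattern, the `token_type_ids` dataset key, and the 0/1/2 encoding for prompt/completion/padding are all assumptions to check against the real files. The counting logic is kept separate from the HDF5 reading so it can be tested without any files on disk.

```python
from collections import Counter
from pathlib import Path

# ASSUMPTION: token type encoding; verify against the real data prep output.
PROMPT, COMPLETION, PADDING = 0, 1, 2


def compute_split_stats(token_type_rows, prefix="train"):
    """Derive count-based metadata fields from per-sequence token type ids.

    `prefix` is a hypothetical knob for reusing this on evaluation files.
    """
    counts = Counter()
    num_sequences = 0
    max_seq_length = 0
    for row in token_type_rows:
        num_sequences += 1
        max_seq_length = max(max_seq_length, len(row))
        counts.update(int(t) for t in row)
    total = sum(counts.values())
    padding = counts[PADDING]
    return {
        "max_seq_length": max_seq_length,
        f"{prefix}_sequences": num_sequences,
        f"{prefix}_prompt_tokens": counts[PROMPT],
        f"{prefix}_completion_tokens": counts[COMPLETION],
        f"{prefix}_padding_tokens": padding,
        # ASSUMPTION: input tokens = everything that is not padding.
        f"{prefix}_input_tokens": total - padding,
        f"{prefix}_output_tokens": total,
    }


def iter_token_type_rows(folder, pattern="train*.hdf5", key="token_type_ids"):
    """Yield one list of token type ids per sequence from matching HDF5 files.

    ASSUMPTION: the file pattern and dataset key are guesses at the layout.
    """
    import h5py  # third-party; imported lazily so the counting code runs without it

    for path in sorted(Path(folder).glob(pattern)):
        with h5py.File(str(path), "r") as f:
            for row in f[key][:]:
                yield row.tolist()
```

With files laid out as assumed, `compute_split_stats(iter_token_type_rows("path/to/dataset"))` would return the count fields for the training split. Fields such as train_articles, train_tokens_dropped_from_packing and tokenizer_model_type describe the raw input or the tokenizer rather than the packed sequences, so they probably cannot be recovered this way.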

Background

  • Studio uses the add_seq_metadata_dataset function to populate the training sequences for existing hdf5 tokenized datasets.

  • This populates only the train_sequences field in metadata.yaml. Other fields are not available for existing datasets.

  • In case metadata.yaml already exists, train_sequences is added to it.

  • metadata.yaml for existing datasets:

train_sequences: 54

  • For new datasets, metadata.yaml is used directly. For new datasets prepared with data prep, all the fields are available:

train_articles: 100
train_completion_tokens: 53020
train_input_tokens: 53020
max_batch_size_dev: null
max_batch_size_train: 13
max_seq_length: 1024
number_of_dev_files: 0
number_of_test_files: 0
number_of_training_files: 4
train_output_tokens: 55296
train_padding_tokens: 2276
train_prompt_tokens: 0
train_sequences: 54
token_type_ids: true
tokenizer_model_type: "<class 'transformers.models.gpt2.configuration_gpt2.GPT2Config'>"
train_tokens_dropped_from_all_prompt: 0
train_tokens_dropped_from_packing: 0
vocab_size: 50257
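
The token counts in the example above appear to be related by simple arithmetic, which suggests some fields can be derived from others or used to validate a regenerated file. The sketch below checks three relations that hold for these sample values. Whether they hold for every dataset is an assumption, and `check_token_counts` is a hypothetical helper.

```python
# Values copied from the example metadata.yaml above.
metadata = {
    "max_seq_length": 1024,
    "train_sequences": 54,
    "train_prompt_tokens": 0,
    "train_completion_tokens": 53020,
    "train_input_tokens": 53020,
    "train_padding_tokens": 2276,
    "train_output_tokens": 55296,
}


def check_token_counts(md):
    """Return a list of inconsistencies found in the token counts (empty if none)."""
    problems = []
    # Every sequence is padded to max_seq_length, so output = sequences * length.
    if md["train_sequences"] * md["max_seq_length"] != md["train_output_tokens"]:
        problems.append("train_sequences * max_seq_length != train_output_tokens")
    # Output tokens are the real (input) tokens plus padding.
    if md["train_input_tokens"] + md["train_padding_tokens"] != md["train_output_tokens"]:
        problems.append("train_input_tokens + train_padding_tokens != train_output_tokens")
    # Input tokens split into prompt and completion tokens.
    if md["train_prompt_tokens"] + md["train_completion_tokens"] != md["train_input_tokens"]:
        problems.append("train_prompt_tokens + train_completion_tokens != train_input_tokens")
    return problems


print(check_token_counts(metadata))  # prints []
```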
