qtransform
Install directly from the repository:

```bash
pip install git+https://github.com/fhswf/eki-transformer-dev.git@develop#subdirectory=qtransform
```

or clone the repo and run `pip install -e .` inside the `eki-transformer/qtransform` folder.
Depending on your system, qtransform will use either the current working directory or the package folder (e.g. `.venv/lib64/python3.x/site-packages/qtransform/`) for the config files. If you want to change settings or modify the code, you may have to do that inside the package folder.
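If you are unsure which folder is in use, you can check where the package is installed (a generic Python check, not a qtransform-specific command):

```python
# Print the location of the installed qtransform package.
import qtransform
print(qtransform.__file__)
```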
qtransform uses wandb for logging; wandb is not optional at the moment.
WIP
Example training run:

```bash
qtransform run=train model=tinystack dataset=tinystack run.epochs=1 run.max_iters=300 tokenizer=tinystack
```
Available run modes:

- infer
- train

To train a model you need to specify:

- `model`, `dataset`, `tokenizer`: names of the config files to use
- `run.epochs`: number of epochs to train
- `run.max_iters`: number of training iterations for one epoch (total iterations: `max_iters * epochs`)
Every time a model is trained, a checkpoint is created. When loading a checkpoint, the `model` setting has to match the model that was used to create the checkpoint. Checkpoints are loaded from the folder `qtransform/chkpts`. Checkpoints are named with the following pattern:

```
YYYY-MM-DD_HH:MM:SS-<Random String><Model Name><Dataset Name>
```

Checkpoints are saved with `torch.save`.
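Since checkpoints are written with `torch.save`, you can inspect one directly. A minimal sketch; the path below is hypothetical and the structure of the saved object is an assumption:

```python
import torch

# Hypothetical checkpoint path following the naming pattern above.
path = "qtransform/chkpts/2024-01-01_12:00:00-abc123GPTtinystack"
checkpoint = torch.load(path, map_location="cpu")

# The checkpoint is assumed to be a dict containing at least the model weights;
# print the top-level keys to see what was actually saved.
print(checkpoint.keys() if isinstance(checkpoint, dict) else type(checkpoint))
```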
qtransform uses multiple config files to define the model, dataset, tokenizer and other settings. These config files are stored as YAML files in the `qtransform` folder. The config files to use are selected via command line arguments. For example, `model=tinystack` will load the config file `model/tinystack.yaml`.
Config files can contain placeholders. A value set to `???` is replaced by the value of the corresponding command line argument, so a value for every placeholder must be passed in the run command. For example:

```yaml
args:
  n_layer: ???
  n_head: ???
  n_embd: ???
```

Example run command:

```bash
qtransform run=train ... n_layer=2 n_head=4 n_embd=512
```
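The `???` placeholder syntax (and the `defaults:` lists in the configs below) matches the OmegaConf/Hydra convention. As an illustration of how such a placeholder behaves, here is a minimal sketch, not qtransform's actual config loading code:

```python
from omegaconf import OmegaConf

# A "???" value is a mandatory missing value until it is overridden.
cfg = OmegaConf.create("""
args:
  n_layer: ???
  n_head: ???
  n_embd: ???
""")
print(OmegaConf.is_missing(cfg.args, "n_layer"))  # True: reading it now would raise an error

# Command line overrides fill in the placeholders:
cfg.merge_with_dotlist(["args.n_layer=2", "args.n_head=4", "args.n_embd=512"])
print(cfg.args.n_layer, cfg.args.n_head, cfg.args.n_embd)  # 2 4 512
```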
The dataset config is in `qtransform/dataset`. This config defines the dataset that will be used and how it will be loaded.
The dataset must consist of three splits:

- train (training data)
- eval (test data)
- bench (benchmarking data / used?)

If the dataset does not have all splits, the config must define how to create the missing splits, as in the following examples.
For example, the openwebtext config creates the missing eval and bench splits from slices of the train split:

```yaml
name: openwebtext
tokenized:
  type: huggingface # name of the python module which implements a class of type TokenizedDatasetGenerator
untokenized:
  splits:
    train:
    eval: # eval split does not exist, therefore use a slice of the train split
      split: eval
      mapping: train
      size: 0.05
      exists: False
    bench: # benchmarking split does not exist, therefore use a slice of the train split
      split: bench
      mapping: train
      size: 0.05
      exists: False
defaults:
  - untokenized/huggingface
```
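Conceptually, an entry with `mapping: train` and `size: 0.05` takes a 5% slice of the train split to stand in for the missing split. With the Huggingface `datasets` API this corresponds roughly to the following sketch (not qtransform's implementation; the dataset name is only reused from the examples in this README):

```python
from datasets import load_dataset

# Create a missing eval split from 5% of train, mirroring
# "mapping: train, size: 0.05, exists: False" above.
train = load_dataset("fhswf/tiny-stack", split="train")
sliced = train.train_test_split(test_size=0.05, seed=42)
train_split, eval_split = sliced["train"], sliced["test"]
print(len(train_split), len(eval_split))
```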
The tiny-stack config maps existing splits of the Huggingface dataset onto the expected split names:

```yaml
name: fhswf/tiny-stack
untokenized:
  splits:
    bench:
      split: train
    eval:
      split: validation
      mapping: test # eval split is called test in the dataset and has to be mapped to eval
defaults:
  - untokenized/huggingface
  - tokenized/hf
```
Datasets can also be read from local files:

```yaml
# test datasets from files
# https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
name: tiny_shakespeare
tokenized:
  type: files # name of the python module which implements a class of type TokenizedDatasetGenerator
defaults:
  - untokenized/files
untokenized:
  args:
    cache_dir: ~/.qtransform/datasets/files/tiny_shakespeare/raw # absolute path to files. all files within the dir are tokenized
```
The model config is in `qtransform/model`. This config defines the model that will be used and how it will be loaded. For example:
```yaml
cls: GPT
calc_loss_in_model: True
type: checkpoint
args:
  n_layer: 2
  n_head: 4
  n_embd: 256
  dropout: 0.1
  bias: True
  block_size: 512
  vocab_size: 12631
  transformer_active_func: ReLU
  norm_layer: BatchNormTranspose
  flash: False
  single_output: False
  use_weight_tying: False
  shift_targets: True
  pos_layer: "learned"
```
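For orientation, the example values above describe a small GPT-style model. A rough size estimate using the standard transformer parameter formula (an approximation under assumptions about the block layout, not a number reported by qtransform):

```python
# Rough parameter estimate for the example config: token and learned position embeddings,
# a separate output head (use_weight_tying: False), and ~12 * n_embd^2 weights per layer
# (attention + MLP), ignoring biases and norm parameters.
n_layer, n_embd, block_size, vocab_size = 2, 256, 512, 12631

embeddings = vocab_size * n_embd + block_size * n_embd
output_head = vocab_size * n_embd
blocks = n_layer * 12 * n_embd ** 2
print(f"~{(embeddings + output_head + blocks) / 1e6:.1f}M parameters")  # ~8.2M
```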
The tokenizer config is in `qtransform/tokenizer`. This config defines the tokenizer that will be used and how it will be loaded. For example:
```yaml
wrapper: TransformersTokenizer ???
pretrained_tokenizer: AutoTokenizer ???
module: transformers ???
fast: True # (https://huggingface.co/docs/transformers/main_classes/tokenizer)
# encoding options
encoding: fhswf/BPE_GPT2_TinyStoriesV2_cleaned_2048 # name of the Huggingface tokenizer
```
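These values map onto the Huggingface `transformers` API. Loading the named tokenizer directly would look roughly like this (a sketch of what the wrapper presumably does, not qtransform's actual code):

```python
from transformers import AutoTokenizer

# Load the Huggingface tokenizer named in the "encoding" field; use_fast mirrors "fast: True".
tokenizer = AutoTokenizer.from_pretrained(
    "fhswf/BPE_GPT2_TinyStoriesV2_cleaned_2048", use_fast=True
)
ids = tokenizer.encode("Once upon a time")
print(ids, tokenizer.decode(ids))
```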
All values shown below are the default values for `run`:
```yaml
always_save_checkpoint: True # if True, always save a checkpoint after each eval
epochs: 1
gradient_accumulation_steps: 1 # used to simulate larger batch sizes, leave empty or set to 1 for no accumulation
flash: False
export: False # perform an export after all epochs are done
compile: True
max_iters: # number of training iterations for one epoch. total number of iterations: max_iters * epochs
save_epoch_interval: 1
log_steps_interval: 10
grad_clip: 0.7 # clip gradients at this value, or set to 0.0 to disable clipping
eval_steps_interval: 500 # currently the number of batches, not the number of training samples!
eval_epoch_interval: 1 # perform eval after the specified amount of epochs
# eval_iters: 200 # retrieve the specified amount of batches for evaluation (unused)
save_steps_interval: 500 # save the model .pt every n steps (= batches for now)
scheduler_steps_interval: 500 # adjust learning rate every x steps (= batches for now), only applies when scheduler_step_type == "steps"
scheduler_step_type: "steps" # either 'epoch' or 'steps' (= samples or batches ?= len of dataloader)
```
These options can be adjusted in the run command, for example:

```bash
qtransform run=train model=tinystack dataset=tinystack run.epochs=1 run.max_iters=300 tokenizer=tinystack run.gradient_accumulation_steps=2 run.eval_steps_interval=100
```
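For context, `run.gradient_accumulation_steps=2` means the optimizer only steps after every two micro-batches, which doubles the effective batch size. A minimal, generic sketch of that mechanism (not qtransform's training loop):

```python
import torch

# Generic gradient accumulation loop illustrating gradient_accumulation_steps=2.
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
gradient_accumulation_steps = 2

batches = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(4)]
for step, (x, y) in enumerate(batches):
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / gradient_accumulation_steps).backward()  # average gradients over the micro-batches
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()       # one optimizer step per 2 micro-batches -> effective batch size 8
        optimizer.zero_grad()
```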