qtransform
Install directly from the repository:

```bash
pip install git+https://github.com/fhswf/eki-transformer-dev.git@develop#subdirectory=qtransform
```

or clone the repo and run `pip install -e .` inside the `eki-transformer/qtransform` folder.
Depending on your system, qtransform will use either the current working directory or the package folder (e.g. `.venv/lib64/python3.x/site-packages/qtransform/`) for the config files. If you want to change settings or modify the code, you may have to do that inside the package folder.
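If you are unsure which folder is in use, you can check where the package is installed (a generic Python check, not a qtransform-specific command):

```python
# Print the location of the installed qtransform package.
import qtransform
print(qtransform.__file__)
```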
qtransform uses wandb for logging; wandb is not optional at the moment.
WIP
Example training run:

```bash
qtransform run=train model=tinystack dataset=tinystack run.epochs=1 run.max_iters=300 tokenizer=tinystack
```
Available run modes:

- infer
- train

To train a model you need to specify:

- `model`, `dataset`, `tokenizer`: names of the config files to use
- `run.epochs`: number of epochs to train
- `run.max_iters`: number of training iterations for one epoch (total iterations: `max_iters * epochs`)
Every time a model is trained, a checkpoint is created. When loading a checkpoint, the `model` setting has to match the model that was used to create the checkpoint. Checkpoints are loaded from the folder `qtransform/chkpts`. Checkpoints are named with the following pattern:

```
YYYY-MM-DD_HH:MM:SS-<Random String><Model Name><Dataset Name>
```

Checkpoints are saved with `torch.save`.
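Since checkpoints are written with `torch.save`, you can inspect one directly. A minimal sketch; the path below is hypothetical and the structure of the saved object is an assumption:

```python
import torch

# Hypothetical checkpoint path following the naming pattern above.
path = "qtransform/chkpts/2024-01-01_12:00:00-abc123GPTtinystack"
checkpoint = torch.load(path, map_location="cpu")

# The checkpoint is assumed to be a dict containing at least the model weights;
# print the top-level keys to see what was actually saved.
print(checkpoint.keys() if isinstance(checkpoint, dict) else type(checkpoint))
```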
qtransform uses multiple config files to define the model, dataset, tokenizer and other settings. These config files are stored as YAML files in the `qtransform` folder. The config files to use are selected via command line arguments. For example, `model=tinystack` will load the config file `model/tinystack.yaml`.
Config files can contain placeholders. A value set to `???` is replaced by the value of the corresponding command line argument, so a value for every placeholder must be passed in the run command. For example:

```yaml
args:
  n_layer: ???
  n_head: ???
  n_embd: ???
```

Example run command:

```bash
qtransform run=train ... n_layer=2 n_head=4 n_embd=512
```
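The `???` placeholder syntax (and the `defaults:` lists in the configs below) matches the OmegaConf/Hydra convention. As an illustration of how such a placeholder behaves, here is a minimal sketch, not qtransform's actual config loading code:

```python
from omegaconf import OmegaConf

# A "???" value is a mandatory missing value until it is overridden.
cfg = OmegaConf.create("""
args:
  n_layer: ???
  n_head: ???
  n_embd: ???
""")
print(OmegaConf.is_missing(cfg.args, "n_layer"))  # True: reading it now would raise an error

# Command line overrides fill in the placeholders:
cfg.merge_with_dotlist(["args.n_layer=2", "args.n_head=4", "args.n_embd=512"])
print(cfg.args.n_layer, cfg.args.n_head, cfg.args.n_embd)  # 2 4 512
```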
The dataset config is in `qtransform/dataset`. This config defines the dataset that will be used and how it will be loaded.
The dataset must consist of three splits:

- train (training data)
- eval (test data)
- bench (benchmarking data / used?)

If the dataset does not have all splits, the config must define how to create the missing splits, as in the following examples.
For example, the openwebtext config creates the missing eval and bench splits from slices of the train split:

```yaml
name: openwebtext
tokenized:
  type: huggingface # name of the python module which implements a class of type TokenizedDatasetGenerator
untokenized:
  splits:
    train:
    eval: # eval split does not exist, therefore use a slice of the train split
      split: eval
      mapping: train
      size: 0.05
      exists: False
    bench: # benchmarking split does not exist, therefore use a slice of the train split
      split: bench
      mapping: train
      size: 0.05
      exists: False
defaults:
  - untokenized/huggingface
```
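Conceptually, an entry with `mapping: train` and `size: 0.05` takes a 5% slice of the train split to stand in for the missing split. With the Huggingface `datasets` API this corresponds roughly to the following sketch (not qtransform's implementation; the dataset name is only reused from the examples in this README):

```python
from datasets import load_dataset

# Create a missing eval split from 5% of train, mirroring
# "mapping: train, size: 0.05, exists: False" above.
train = load_dataset("fhswf/tiny-stack", split="train")
sliced = train.train_test_split(test_size=0.05, seed=42)
train_split, eval_split = sliced["train"], sliced["test"]
print(len(train_split), len(eval_split))
```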
The tiny-stack config maps existing splits of the Huggingface dataset onto the expected split names:

```yaml
name: fhswf/tiny-stack
untokenized:
  splits:
    bench:
      split: train
    eval:
      split: validation
      mapping: test # eval split is called test in the dataset and has to be mapped to eval
defaults:
  - untokenized/huggingface
  - tokenized/hf
```
Datasets can also be read from local files:

```yaml
# test datasets from files
# https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
name: tiny_shakespeare
tokenized:
  type: files # name of the python module which implements a class of type TokenizedDatasetGenerator
defaults:
  - untokenized/files
untokenized:
  args:
    cache_dir: ~/.qtransform/datasets/files/tiny_shakespeare/raw # absolute path to files. all files within the dir are tokenized
```
The model config is in `qtransform/model`. This config defines the model that will be used and how it will be loaded. For example:
```yaml
cls: GPT
calc_loss_in_model: True
type: checkpoint
args:
  n_layer: 2
  n_head: 4
  n_embd: 256
  dropout: 0.1
  bias: True
  block_size: 512
  vocab_size: 12631
  transformer_active_func: ReLU
  norm_layer: BatchNormTranspose
  flash: False
  single_output: False
  use_weight_tying: False
  shift_targets: True
  pos_layer: "learned"
```
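For orientation, the example values above describe a small GPT-style model. A rough size estimate using the standard transformer parameter formula (an approximation under assumptions about the block layout, not a number reported by qtransform):

```python
# Rough parameter estimate for the example config: token and learned position embeddings,
# a separate output head (use_weight_tying: False), and ~12 * n_embd^2 weights per layer
# (attention + MLP), ignoring biases and norm parameters.
n_layer, n_embd, block_size, vocab_size = 2, 256, 512, 12631

embeddings = vocab_size * n_embd + block_size * n_embd
output_head = vocab_size * n_embd
blocks = n_layer * 12 * n_embd ** 2
print(f"~{(embeddings + output_head + blocks) / 1e6:.1f}M parameters")  # ~8.2M
```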
The tokenizer config is in `qtransform/tokenizer`. This config defines the tokenizer that will be used and how it will be loaded. For example:
```yaml
wrapper: TransformersTokenizer ???
pretrained_tokenizer: AutoTokenizer ???
module: transformers ???
fast: True # (https://huggingface.co/docs/transformers/main_classes/tokenizer)
# encoding options
encoding: fhswf/BPE_GPT2_TinyStoriesV2_cleaned_2048 # name of the Huggingface tokenizer
```
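These values map onto the Huggingface `transformers` API. Loading the named tokenizer directly would look roughly like this (a sketch of what the wrapper presumably does, not qtransform's actual code):

```python
from transformers import AutoTokenizer

# Load the Huggingface tokenizer named in the "encoding" field; use_fast mirrors "fast: True".
tokenizer = AutoTokenizer.from_pretrained(
    "fhswf/BPE_GPT2_TinyStoriesV2_cleaned_2048", use_fast=True
)
ids = tokenizer.encode("Once upon a time")
print(ids, tokenizer.decode(ids))
```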
All values shown below are the default values for `run`:
```yaml
always_save_checkpoint: True # if True, always save a checkpoint after each eval
epochs: 1
gradient_accumulation_steps: 1 # used to simulate larger batch sizes, leave empty or set to 1 for no accumulation
flash: False
export: False # perform an export after all epochs are done
compile: True
max_iters: # number of training iterations for one epoch. total number of iterations: max_iters * epochs
save_epoch_interval: 1
log_steps_interval: 10
grad_clip: 0.7 # clip gradients at this value, or set to 0.0 to disable clipping
eval_steps_interval: 500 # currently the number of batches, not the number of training samples!
eval_epoch_interval: 1 # perform eval after the specified amount of epochs
# eval_iters: 200 # retrieve the specified amount of batches for evaluation (unused)
save_steps_interval: 500 # save the model .pt every n steps (= batches for now)
scheduler_steps_interval: 500 # adjust learning rate every x steps (= batches for now), only applies when scheduler_step_type == "steps"
scheduler_step_type: "steps" # either 'epoch' or 'steps' (= samples or batches ?= len of dataloader)
```
These options can be adjusted in the run command, for example:

```bash
qtransform run=train model=tinystack dataset=tinystack run.epochs=1 run.max_iters=300 tokenizer=tinystack run.gradient_accumulation_steps=2 run.eval_steps_interval=100
```
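For context, `run.gradient_accumulation_steps=2` means the optimizer only steps after every two micro-batches, which doubles the effective batch size. A minimal, generic sketch of that mechanism (not qtransform's training loop):

```python
import torch

# Generic gradient accumulation loop illustrating gradient_accumulation_steps=2.
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
gradient_accumulation_steps = 2

batches = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(4)]
for step, (x, y) in enumerate(batches):
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / gradient_accumulation_steps).backward()  # average gradients over the micro-batches
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()       # one optimizer step per 2 micro-batches -> effective batch size 8
        optimizer.zero_grad()
```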