Tinystack

Paul Weber edited this page May 7, 2025 · 1 revision

Tiny Stack

Tiny Stack is a project to generate descriptions for Python error messages. This repository produces the training data for Tiny Stack.

The main parts of the project are:

  • Generation of code snippets and descriptions
  • Code Execution to generate error messages
  • Huggingface Dataset Creation
  • Huggingface Tokenizer Creation

To generate the code snippets, the project calls OpenAI's API. Requests are sent as batch requests to reduce API cost, and the generated snippets are saved as .py files.

Currently gpt-4o-mini is used, because it offers a good price-performance ratio; batch requests reduce the cost further.
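A batch request is prepared as a JSONL file and submitted through OpenAI's Batch API. A minimal sketch of that flow (the prompt, file name, and custom_id scheme here are illustrative, not necessarily the project's actual ones):

```python
import json

def build_batch_file(prompts, path="batch_input.jsonl", model="gpt-4o-mini"):
    """Write one Batch API request per prompt, in the JSONL input format."""
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts):
            request = {
                "custom_id": f"snippet-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "max_tokens": 500,  # matches the 500-token snippet limit
                    "messages": [{"role": "user", "content": prompt}],
                },
            }
            f.write(json.dumps(request) + "\n")
    return path

# Submitting the file requires an API key; shown for completeness:
# from openai import OpenAI
# client = OpenAI()
# batch_file = client.files.create(file=open(path, "rb"), purpose="batch")
# batch = client.batches.create(input_file_id=batch_file.id,
#                               endpoint="/v1/chat/completions",
#                               completion_window="24h")
```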

One datapoint consists of:

  • A code snippet (max 500 tokens)
  • A faulty code snippet (max 500 tokens)
  • An error message (generated by running the code locally)
  • A description of the code snippet (max 250 tokens)

In total, a datapoint is at most 1250 tokens.

With the current API pricing, a single datapoint costs at most $0.000375.
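The per-datapoint cost is simple arithmetic. The sketch below assumes a batch rate of $0.30 per million tokens, which is the rate implied by the $0.000375 figure; check the current OpenAI pricing page before relying on it:

```python
# Assumed gpt-4o-mini batch rate in USD per 1M tokens (implied by the figure above).
PRICE_PER_MILLION_TOKENS = 0.30

def max_datapoint_cost(max_tokens: int = 1250) -> float:
    """Upper bound on the API cost of one datapoint."""
    return max_tokens * PRICE_PER_MILLION_TOKENS / 1_000_000

print(max_datapoint_cost())  # 0.000375
```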

After the data generation, the data is saved in a Huggingface Dataset and a Huggingface Tokenizer is created.

Dataset and Tokenizer are available on Huggingface.

Snippet Generation

The generation is split into multiple files. Code, faulty code and description generation are each split into request and download scripts. The request scripts start a batch request to OpenAI's API and save the response in a folder. The download scripts download the batch response and save the responses as .py files.
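Once a batch has finished, a download script retrieves the output file, which is JSONL with one line per request. A hedged sketch of the parsing step, following the Batch API output format (the surrounding download code is illustrative):

```python
import json

def extract_answer(batch_output_line: str):
    """Return (custom_id, message content) from one Batch API output line."""
    record = json.loads(batch_output_line)
    body = record["response"]["body"]
    return record["custom_id"], body["choices"][0]["message"]["content"]

# Fetching the raw output requires a finished batch; shown for completeness:
# from openai import OpenAI
# client = OpenAI()
# batch = client.batches.retrieve(batch_id)
# raw = client.files.content(batch.output_file_id).text
# for line in raw.splitlines():
#     custom_id, code = extract_answer(line)
#     # ... save `code` as a .py file
```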

Execution Order

  1. request_code.py
  2. download_code.py
  3. request_faulty_code.py
  4. download_faulty_code.py
  5. Generate the error messages with the container in the script_runner directory.
  6. request_description.py
  7. download_description.py

All scripts must be run in order.

Script Runner

To obtain the error messages from a faulty Python script, the scripts are run in a container based on the python:3.11-slim image.

The container runs all scripts in the internal scripts directory and saves the error messages to the internal logs directory.

Containers can hang if they require user input!
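The runner loop can be sketched in Python. This is a minimal illustration, not the project's actual implementation; a timeout and a closed stdin guard against scripts that wait for user input:

```python
import pathlib
import subprocess
import sys

def run_scripts(scripts_dir, logs_dir, timeout=30):
    """Run every .py file in scripts_dir and write its stderr to logs_dir."""
    logs_dir = pathlib.Path(logs_dir)
    logs_dir.mkdir(parents=True, exist_ok=True)
    for script in sorted(pathlib.Path(scripts_dir).glob("*.py")):
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout,
                stdin=subprocess.DEVNULL,  # input() raises EOFError instead of hanging
            )
            stderr = result.stderr
        except subprocess.TimeoutExpired:
            stderr = f"TimeoutExpired: {script.name} exceeded {timeout}s"
        (logs_dir / f"{script.stem}.log").write_text(stderr)
```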

Usage

You need podman or docker to run and build the container. All commands expect to be run in the script_runner directory.

Build Container

podman build -t python-script-runner .

Run Container

In this example, the container uses the ./error_messages directory to save the error messages and the ./faulty_code directory for the scripts to run.

podman run --rm \
    -v "./error_messages:/app/logs" \
    -v "./faulty_code:/app/scripts:ro" \
    --privileged  \
    python-script-runner

Code Generation

All of the following commands should be run in the code_generation directory.

Dependencies

pip3 install -r requirements.txt

.env File

OPENAI_API_KEY=sk-...
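The scripts presumably read OPENAI_API_KEY from this .env file (for example via the python-dotenv package). A stdlib-only sketch of the loading step, in case you want to see what it does:

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: KEY=VALUE lines; blank lines and '#' comments ignored."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# load_env()
# api_key = os.environ["OPENAI_API_KEY"]
```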

Run the scripts

Example

python3 request_code.py
python3 download_code.py
...

Keep in mind that the download is only available once the batch request has finished. That can take up to 24 hours.

Dataset Creation

The dataset will be published on Huggingface. To transform the data into a Huggingface Dataset, the create_dataset.py script is used.

Folders

All folders are relative to the script location; the paths can be changed in the Python script.

  • code_dir = ../data/code
  • faulty_code_dir = ../data/faulty_code
  • error_message_dir = ../data/error_messages
  • description_dir = ../data/description
  • output_dir = data
  • output_train_file = train.json
  • output_test_file = test.json
  • test_size = 0.2 # proportion of the dataset to be used for testing
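The assembly step can be sketched as follows. This is an illustration, not the actual create_dataset.py: the datapoints are matched by file stem, and the file extensions for error messages (.log) and descriptions (.txt) are assumptions:

```python
import json
import pathlib
import random

def create_dataset(code_dir, faulty_code_dir, error_message_dir, description_dir,
                   output_dir="data", test_size=0.2, seed=0):
    """Join the four folders into datapoints and split into train/test JSON files."""
    records = []
    for code_file in sorted(pathlib.Path(code_dir).glob("*.py")):
        stem = code_file.stem
        faulty = pathlib.Path(faulty_code_dir) / f"{stem}.py"
        error = pathlib.Path(error_message_dir) / f"{stem}.log"
        desc = pathlib.Path(description_dir) / f"{stem}.txt"
        if not (faulty.exists() and error.exists() and desc.exists()):
            continue  # skip incomplete datapoints
        records.append({
            "code": code_file.read_text(),
            "faulty_code": faulty.read_text(),
            "error_message": error.read_text(),
            "description": desc.read_text(),
        })
    random.Random(seed).shuffle(records)
    n_test = int(len(records) * test_size)
    split = len(records) - n_test
    out = pathlib.Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "train.json").write_text(json.dumps(records[:split]))
    (out / "test.json").write_text(json.dumps(records[split:]))
    return split, n_test
```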

Usage

The script can be run with python create_dataset.py. Once the script is run, it will output two files:

  • train.json: The training set
  • test.json: The test set

Those files can be uploaded to HuggingFace.

Tokenizer

A special tokenizer is needed to tokenize the code snippets. The tokenizer training script downloads the dataset from HuggingFace and uses it to train the tokenizer.

All files will be saved in the tinystack-tokenizer directory.

It will create a tokenizer with the following files:

  • tokenizer.json
  • vocab.json
  • merges.txt
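Training a tokenizer that produces exactly these files can be sketched with the Hugging Face tokenizers library. This is a minimal BPE example; the vocabulary size and special token are illustrative, not the project's actual settings:

```python
import os

from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

def train_tokenizer(texts, vocab_size=8192, out_dir="tinystack-tokenizer"):
    """Train a byte-level BPE tokenizer and save tokenizer.json, vocab.json, merges.txt."""
    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tokenizer.decoder = decoders.ByteLevel()
    trainer = trainers.BpeTrainer(vocab_size=vocab_size,
                                  special_tokens=["<|endoftext|>"])
    tokenizer.train_from_iterator(texts, trainer=trainer)
    os.makedirs(out_dir, exist_ok=True)
    tokenizer.save(f"{out_dir}/tokenizer.json")  # full tokenizer definition
    tokenizer.model.save(out_dir)                # writes vocab.json and merges.txt
    return tokenizer
```

In the real script, the texts would come from the dataset downloaded from HuggingFace rather than an in-memory list.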