# Tinystack

Tiny Stack is a project that generates descriptions for Python error messages. This repository generates the training data for Tiny Stack.
The main parts of the project are:
- Generation of code snippets and descriptions
- Code execution to generate error messages
- Hugging Face dataset creation
- Hugging Face tokenizer creation
To generate code snippets, the project uses OpenAI's API. Snippets are requested in batches, which reduces the API cost, and the generated snippets are saved as `.py` files. Currently `gpt-4o-mini` is used, because it has a good price-performance ratio.
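A batch request is submitted as a JSONL file in which each line is one chat-completion request. A minimal sketch of building such a file (the prompt, the number of requests, and the file name are illustrative, not the project's actual values):

```python
import json

# Each line of the batch input file is one request; "custom_id" lets you
# match responses back to requests later. Prompt and file name are
# illustrative here.
requests = [
    {
        "custom_id": f"snippet-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": "Write a short Python snippet."}],
            "max_tokens": 500,  # matches the 500-token snippet limit
        },
    }
    for i in range(3)
]

with open("batch_input.jsonl", "w") as f:
    for request in requests:
        f.write(json.dumps(request) + "\n")

# The file would then be uploaded with client.files.create(..., purpose="batch")
# and submitted with client.batches.create(..., completion_window="24h").
```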
One datapoint consists of:

- A code snippet (max. 500 tokens)
- A faulty code snippet (max. 500 tokens)
- An error message (generated by running the code locally)
- A description of the code snippet (max. 250 tokens)

In total, a datapoint contains at most 1,250 tokens.
With the current API pricing, a single datapoint costs at most $0.000375.
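This ceiling follows from quick arithmetic, assuming a rate of $0.30 per million tokens for `gpt-4o-mini` via the Batch API (the output rate, i.e. the more expensive direction, at the time of writing):

```python
# Upper-bound cost per datapoint, assuming $0.30 per 1M tokens
# (gpt-4o-mini Batch API output rate; input tokens are cheaper,
# so this overestimates the real cost).
price_per_token = 0.30 / 1_000_000
max_tokens = 1250  # code + faulty code + description
max_cost = max_tokens * price_per_token
print(f"${max_cost:.6f} per datapoint")  # $0.000375 per datapoint
```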
After the data generation, the data is saved as a Hugging Face dataset and a Hugging Face tokenizer is created. Both the dataset and the tokenizer are available on Hugging Face.
The generation is split into multiple files. Code, faulty code, and description generation are each split into request and download scripts. The request scripts start a batch request to OpenAI's API and save the response in a folder. The download scripts download the batch responses and save them as `.py` files.

1. `request_code.py`
2. `download_code.py`
3. `request_faulty_code.py`
4. `download_faulty_code.py`
5. Generate the error messages with the container in the `script_runner` directory.
6. `request_description.py`
7. `download_description.py`

All scripts must be run in this order.
To obtain the error messages from the faulty Python scripts, the scripts are run in a container based on the `python:3.11-slim` image. The container runs all scripts in its internal `scripts` directory and saves the error messages to its internal `logs` directory.
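Inside the container, capturing an error message boils down to running each script and recording its stderr. A minimal sketch of that idea (the file names are illustrative; the timeout guards against scripts that wait for user input):

```python
import pathlib
import subprocess
import sys

# Illustrative faulty script; the real ones come from the scripts directory.
script = pathlib.Path("faulty_example.py")
script.write_text("print(1 / 0)\n")

# Run the script and capture its traceback from stderr; the timeout
# prevents a hang if the script unexpectedly waits for input.
result = subprocess.run(
    [sys.executable, str(script)],
    capture_output=True,
    text=True,
    timeout=10,
)

# Save the error message the way the container saves them to logs/.
log = pathlib.Path("faulty_example.log")
log.write_text(result.stderr)  # contains the ZeroDivisionError traceback
```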
Containers can hang if they require user input!
You need Podman or Docker to build and run the container. All commands are expected to be run in the `script_runner` directory.
```shell
podman build -t python-script-runner .
```
In this example, the container uses the `./error_messages` directory to save the error messages and the `./faulty_code` directory for the scripts to run.
```shell
podman run --rm \
  -v "./error_messages:/app/logs" \
  -v "./faulty_code:/app/scripts:ro" \
  --privileged \
  python-script-runner
```
All of the following instructions are to be run in the `code_generation` directory.
```shell
pip3 install -r requirements.txt
OPENAI_API_KEY=sk-...
python3 request_code.py
python3 download_code.py
...
```
Keep in mind that the download is only available once the batch request has finished; this can take up to 24 hours.
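Because a batch can take up to 24 hours, a download script has to wait for the batch to reach a terminal status. A small helper sketching that polling pattern (the status callable is injected so the sketch stays self-contained; in practice it would be something like `lambda: client.batches.retrieve(batch_id).status` with the `openai` client):

```python
import time

TERMINAL = {"completed", "failed", "expired", "cancelled"}

def wait_for_batch(get_status, poll_interval=1.0, max_wait=86400.0):
    """Poll get_status() until the batch reaches a terminal state.

    get_status is a zero-argument callable returning the batch status
    string; injecting it keeps this sketch free of real API calls.
    """
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if waited >= max_wait:
            raise TimeoutError("batch did not finish within max_wait")
        time.sleep(poll_interval)
        waited += poll_interval

# Example with a fake status sequence instead of a real API call:
statuses = iter(["validating", "in_progress", "completed"])
print(wait_for_batch(lambda: next(statuses), poll_interval=0.01))  # completed
```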
This dataset will be published on Hugging Face. To transform the data into a Hugging Face dataset, the `create_dataset.py` script is used. All folders are relative to the script location; the paths can be changed in the Python script.
- `code_dir = ../data/code`
- `faulty_code_dir = ../data/faulty_code`
- `error_message_dir = ../data/error_messages`
- `description_dir = ../data/description`
- `output_dir = data`
- `output_train_file = train.json`
- `output_test_file = test.json`
- `test_size = 0.2` (proportion of the dataset to be used for testing)
The script can be run with `python create_dataset.py`. Once the script has run, it outputs two files:

- `train.json`: the training set
- `test.json`: the test set

These files can be uploaded to Hugging Face.
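The split itself can be sketched as follows; the record fields are assumptions based on the datapoint structure described above, and `test_size = 0.2` matches the default configuration:

```python
import json
import random

# Dummy records standing in for the real datapoints; the field names
# mirror the datapoint structure but are an assumption here.
records = [
    {
        "code": f"print({i})",
        "faulty_code": f"print({i}",
        "error_message": "SyntaxError: '(' was never closed",
        "description": f"Prints the number {i}.",
    }
    for i in range(10)
]

random.seed(0)   # deterministic shuffle for a reproducible split
random.shuffle(records)

test_size = 0.2  # proportion of the dataset used for testing
cut = int(len(records) * (1 - test_size))
train, test = records[:cut], records[cut:]

with open("train.json", "w") as f:
    json.dump(train, f, indent=2)
with open("test.json", "w") as f:
    json.dump(test, f, indent=2)

print(len(train), len(test))  # 8 2
```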
A special tokenizer is needed to tokenize the code snippets. The tokenizer training script downloads the dataset from Hugging Face and uses it to train the tokenizer.
All files will be saved in the `tinystack-tokenizer` directory. The script creates a tokenizer with the following files:

- `tokenizer.json`
- `vocab.json`
- `merges.txt`
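With the `tokenizers` library, a byte-level BPE tokenizer produces exactly these files: `save_model()` writes `vocab.json` and `merges.txt`, and `save()` writes `tokenizer.json`. A minimal sketch on a dummy corpus (the corpus, vocabulary size, and output path are assumptions, not the project's actual settings):

```python
import pathlib

from tokenizers import ByteLevelBPETokenizer

# Dummy training corpus; the real script would train on the code
# snippets from the Hugging Face dataset instead.
corpus = pathlib.Path("corpus.txt")
corpus.write_text("def add(a, b):\n    return a + b\n" * 50)

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=[str(corpus)], vocab_size=500, min_frequency=2)

out_dir = pathlib.Path("tinystack-tokenizer")
out_dir.mkdir(exist_ok=True)
tokenizer.save_model(str(out_dir))               # vocab.json, merges.txt
tokenizer.save(str(out_dir / "tokenizer.json"))  # tokenizer.json
```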