This repository implements the main experiments of our CIKM'25 paper, Exploring Database Normalization Effects on SQL Generation.
In our paper, we systematically investigate how different levels of database normalization impact the performance of large language models (LLMs) in generating SQL queries.
Please refer to our paper for more details.
git clone https://github.com/CyberAgentAILab/exploring-dbnorm.git
poetry isntallCreate a .env file in the root directory and add your API keys for the LLMs you want to use:
OPENAI_API_KEY=your_openai_api_key (if you want to use OpenAI models)
GEMINI_API_KEY=your_gemini_api_key (if you want to use Gemini models)
ANTHROPIC_API_KEY=your_anthropic_api_key (if you want to use Claude models)sh scripts/create_scenario_instance.sh <scenario_name:basic,flight,ligbrary,class_registration> <random_seed:42>
# e.g., sh scripts/create_scenario_instance.sh basic 42This script creates a new scenario directory under output/<scenario_name>_<random_seed> containing the following files and directories:
cases/: Directory containing test case files in JSON format.db/: Directory containing the SQLite database file.data.csv: raw data file in CSV format.scenario.json: Configuration file for the scenario.
sh scripts/run_test.sh <scenario_output_dir> [model1 model2 ...]
# e.g., sh scripts/run_test.sh output/basic_s42 gpt-4o gpt-4o-miniThis script runs the full experiment pipeline for the specified scenario output directory, including generation, execution, evaluation, and analysis.
The results will be saved in a timestamped directory under results/ within the scenario output directory.
Also, you can specify the result directory to resume the experiment from a specific step:
sh scripts/run_test.sh <scenario_output_dir> [model1 model2 ...] --result-dir <result_directory>
# e.g., sh scripts/run_test.sh output/basic_s42 gpt-4o gpt-4o-mini --result-dir output/basic_s42/results/20231010_123456Experimental scenarios are defined in scenarios/<scenario_name> directory. Each scenario has its own subdirectory containing:
cases/: Test case files in JSON format.instructions/: Instructions for the LLM to generate SQL queries.schemas/: Database schema files in JSON format.scripts/: Scripts for data generation and database building.
Please refer to the individual scenario directories for more details on each scenario.
@misc{
kohita2025exploringdatabasenormalizationeffects,
title={Exploring Database Normalization Effects on SQL Generation},
author={Ryosuke Kohita},
year={2025},
eprint={2510.01989},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.01989},
}