Exploring Database Normalization Effects on SQL Generation

[arXiv](https://arxiv.org/abs/2510.01989)

This repository implements the main experiments of our CIKM'25 paper, Exploring Database Normalization Effects on SQL Generation. In our paper, we systematically investigate how different levels of database normalization impact the performance of large language models (LLMs) in generating SQL queries. Please refer to our paper for more details.

Setup

git clone https://github.com/CyberAgentAILab/exploring-dbnorm.git
cd exploring-dbnorm
poetry install

Create a .env file in the root directory and add your API keys for the LLMs you want to use:

OPENAI_API_KEY=your_openai_api_key (if you want to use OpenAI models)
GEMINI_API_KEY=your_gemini_api_key (if you want to use Gemini models)
ANTHROPIC_API_KEY=your_anthropic_api_key (if you want to use Claude models)
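The project presumably reads these keys from the environment at runtime. As an illustration of the `.env` format above, here is a minimal stdlib-only loader sketch (the repository itself may use python-dotenv or similar; this helper is not part of the repo):

```python
import os
from pathlib import Path

def load_env(path=".env"):
    # Minimal .env loader: KEY=value lines; blanks and #-comments ignored.
    # Existing environment variables are not overwritten.
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())
```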

How to synthesize a new scenario

sh scripts/create_scenario_instance.sh <scenario_name:basic,flight,library,class_registration> <random_seed:42>

# e.g., sh scripts/create_scenario_instance.sh basic 42

This script creates a new scenario directory under output/<scenario_name>_<random_seed> containing the following files and directories:

  • cases/: Directory containing test case files in JSON format.
  • db/: Directory containing the SQLite database file.
  • data.csv: Raw data file in CSV format.
  • scenario.json: Configuration file for the scenario.
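A generated instance can be sanity-checked with a short script. The sketch below relies only on the directory layout listed above; it makes no assumptions about the fields inside scenario.json, and is not part of the repository:

```python
import json
from pathlib import Path

def summarize_scenario(scenario_dir):
    # Count JSON test cases and peek at the scenario configuration
    # for an instance laid out as described above.
    root = Path(scenario_dir)
    config = json.loads((root / "scenario.json").read_text())
    n_cases = len(list((root / "cases").glob("*.json")))
    return {"num_cases": n_cases, "config_keys": sorted(config)}
```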

How to run experiments

sh scripts/run_test.sh <scenario_output_dir> [model1 model2 ...] 

# e.g., sh scripts/run_test.sh output/basic_s42 gpt-4o gpt-4o-mini

This script runs the full experiment pipeline for the specified scenario output directory, including generation, execution, evaluation, and analysis. The results will be saved in a timestamped directory under results/ within the scenario output directory.

You can also pass an existing result directory to resume the experiment from a specific step:

sh scripts/run_test.sh <scenario_output_dir> [model1 model2 ...] --result-dir <result_directory>
# e.g., sh scripts/run_test.sh output/basic_s42 gpt-4o gpt-4o-mini --result-dir output/basic_s42/results/20231010_123456
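Because the timestamped directory names (e.g. 20231010_123456, in YYYYMMDD_HHMMSS form) sort lexicographically by creation time, the most recent run for a scenario can be located with a small helper like this (a hypothetical convenience, not part of the repo):

```python
from pathlib import Path

def latest_result_dir(scenario_dir):
    # YYYYMMDD_HHMMSS names sort lexicographically in time order,
    # so the last directory in sorted order is the most recent run.
    results = Path(scenario_dir) / "results"
    runs = sorted(d for d in results.iterdir() if d.is_dir())
    return runs[-1] if runs else None
```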

Scenario details

Experimental scenarios are defined in scenarios/<scenario_name> directory. Each scenario has its own subdirectory containing:

  • cases/: Test case files in JSON format.
  • instructions/: Instructions for the LLM to generate SQL queries.
  • schemas/: Database schema files in JSON format.
  • scripts/: Scripts for data generation and database building.

Please refer to the individual scenario directories for more details on each scenario.

Citation

@misc{kohita2025exploringdatabasenormalizationeffects,
    title={Exploring Database Normalization Effects on SQL Generation},
    author={Ryosuke Kohita},
    year={2025},
    eprint={2510.01989},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2510.01989},
}
