Exploring Database Normalization Effects on SQL Generation

[arXiv](https://arxiv.org/abs/2510.01989)

This repository implements the main experiments of our CIKM'25 paper, Exploring Database Normalization Effects on SQL Generation. In our paper, we systematically investigate how different levels of database normalization impact the performance of large language models (LLMs) in generating SQL queries. Please refer to our paper for more details.

Setup

git clone https://github.com/CyberAgentAILab/exploring-dbnorm.git
cd exploring-dbnorm
poetry install

Create a .env file in the root directory and add your API keys for the LLMs you want to use:

OPENAI_API_KEY=your_openai_api_key (if you want to use OpenAI models)
GEMINI_API_KEY=your_gemini_api_key (if you want to use Gemini models)
ANTHROPIC_API_KEY=your_anthropic_api_key (if you want to use Claude models)
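The project presumably reads these keys from the environment at runtime. As an illustration of the `.env` format above, here is a minimal stdlib-only loader sketch (the repository itself may use python-dotenv or similar; this helper is not part of the repo):

```python
import os
from pathlib import Path

def load_env(path=".env"):
    # Minimal .env loader: KEY=value lines; blanks and #-comments ignored.
    # Existing environment variables are not overwritten.
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())
```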

How to synthesize a new scenario

sh scripts/create_scenario_instance.sh <scenario_name:basic,flight,library,class_registration> <random_seed:42>

# e.g., sh scripts/create_scenario_instance.sh basic 42

This script creates a new scenario directory under output/<scenario_name>_<random_seed> containing the following files and directories:

  • cases/: Directory containing test case files in JSON format.
  • db/: Directory containing the SQLite database file.
  • data.csv: Raw data file in CSV format.
  • scenario.json: Configuration file for the scenario.
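A generated instance can be sanity-checked with a short script. The sketch below relies only on the directory layout listed above; it makes no assumptions about the fields inside scenario.json, and is not part of the repository:

```python
import json
from pathlib import Path

def summarize_scenario(scenario_dir):
    # Count JSON test cases and peek at the scenario configuration
    # for an instance laid out as described above.
    root = Path(scenario_dir)
    config = json.loads((root / "scenario.json").read_text())
    n_cases = len(list((root / "cases").glob("*.json")))
    return {"num_cases": n_cases, "config_keys": sorted(config)}
```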

How to run experiments

sh scripts/run_test.sh <scenario_output_dir> [model1 model2 ...] 

# e.g., sh scripts/run_test.sh output/basic_s42 gpt-4o gpt-4o-mini

This script runs the full experiment pipeline for the specified scenario output directory, including generation, execution, evaluation, and analysis. The results will be saved in a timestamped directory under results/ within the scenario output directory.

You can also pass an existing result directory to resume the experiment from a specific step:

sh scripts/run_test.sh <scenario_output_dir> [model1 model2 ...] --result-dir <result_directory>
# e.g., sh scripts/run_test.sh output/basic_s42 gpt-4o gpt-4o-mini --result-dir output/basic_s42/results/20231010_123456
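Because the timestamped directory names (e.g. 20231010_123456, in YYYYMMDD_HHMMSS form) sort lexicographically by creation time, the most recent run for a scenario can be located with a small helper like this (a hypothetical convenience, not part of the repo):

```python
from pathlib import Path

def latest_result_dir(scenario_dir):
    # YYYYMMDD_HHMMSS names sort lexicographically in time order,
    # so the last directory in sorted order is the most recent run.
    results = Path(scenario_dir) / "results"
    runs = sorted(d for d in results.iterdir() if d.is_dir())
    return runs[-1] if runs else None
```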

Scenario details

Experimental scenarios are defined in scenarios/<scenario_name> directory. Each scenario has its own subdirectory containing:

  • cases/: Test case files in JSON format.
  • instructions/: Instructions for the LLM to generate SQL queries.
  • schemas/: Database schema files in JSON format.
  • scripts/: Scripts for data generation and database building.

Please refer to the individual scenario directories for more details on each scenario.

Citation

@misc{kohita2025exploringdatabasenormalizationeffects,
    title={Exploring Database Normalization Effects on SQL Generation},
    author={Ryosuke Kohita},
    year={2025},
    eprint={2510.01989},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2510.01989},
}
