15-799 Special Topics in Databases: Query Optimization | Spring 2025
A project that optimizes SQL queries by giving an LLM the table schemas and statistics and prompting it to generate a rewritten query, then verifying equivalence in two ways: a semantic prover checks that the queries are semantically equivalent, and both queries are run on generated datasets to check for identical outputs.
It uses the parser and prover from the Query Equivalence Decider (QED) to perform the semantic equivalence check.
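For intuition, the rewrite step looks roughly like the sketch below. This is a minimal illustration rather than the actual `prompt_llm.py` code; the prompt wording and the model name are assumptions.

```python
import os
import openai  # this project pins openai==0.28.1, which uses the ChatCompletion API

openai.api_key = os.environ["OPENAI_API_KEY"]

def rewrite_query(original_sql: str, schema_sql: str, stats: str) -> str:
    """Ask the LLM for a semantically equivalent, hopefully faster query.

    Minimal sketch of the prompting idea; the real prompt_llm.py may build
    its prompt and parse the model's response differently.
    """
    prompt = (
        "You are a SQL query optimizer. Rewrite the query below so that it "
        "runs faster but returns exactly the same results.\n\n"
        f"Table schemas:\n{schema_sql}\n\n"
        f"Table statistics:\n{stats}\n\n"
        f"Original query:\n{original_sql}\n\n"
        "Return only the rewritten SQL."
    )
    response = openai.ChatCompletion.create(
        model="gpt-4",  # model choice is an assumption, not taken from the repo
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"].strip()
```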
After cloning this project, include the submodules (prover & parser):
```
git submodule update --init --recursive
```

Export your OpenAI API key:

```
export OPENAI_API_KEY=[your api key]
```

Install the necessary packages:

```
pip install openai==0.28.1 faker duckdb
curl -L https://nixos.org/nix/install | sh
```

Restart the terminal, then run the commands below in order.
- `prompt_llm.py`: prompts the LLM for optimized queries by giving it the original queries, the table schemas, and statistics on the tables
- `generate_data.py`: generates random tables, based on the given schema, for testing query equivalence
- `check_equivalence.py`: combines the results from the above and outputs whether the original query and the LLM-optimized query are equivalent, based on QED and the randomly generated tables (a minimal sketch of this check follows the list)
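As referenced above, here is a minimal sketch of the data-generation and differential-comparison idea behind `generate_data.py` and `check_equivalence.py`. The table, row count, and queries are made up for illustration; the real scripts drive everything from `create_tables.sql` and `load_tables.sql`.

```python
import duckdb
from faker import Faker

fake = Faker()

def differential_check(original_sql: str, optimized_sql: str) -> bool:
    """Run both queries over randomly generated rows and compare their outputs."""
    con = duckdb.connect()  # in-memory DuckDB database
    con.execute("CREATE TABLE customer (c_id INTEGER, c_name VARCHAR, c_city VARCHAR)")
    for i in range(100):
        # Populate the table with fake but plausible values.
        con.execute("INSERT INTO customer VALUES (?, ?, ?)", [i, fake.name(), fake.city()])
    original_rows = sorted(con.execute(original_sql).fetchall())
    optimized_rows = sorted(con.execute(optimized_sql).fetchall())
    return original_rows == optimized_rows

# A trivially rewritten query should pass (c_id is never NULL here).
assert differential_check(
    "SELECT c_city, COUNT(*) FROM customer GROUP BY c_city",
    "SELECT c_city, COUNT(c_id) FROM customer GROUP BY c_city",
)
```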
```
python prompt_llm.py --workload_path "dsb/queries" --schema_path "dsb/queries/create_tables.sql" --stats_path "dsb/statistics.txt"
python generate_data.py --schema_path "dsb/queries/create_tables.sql" --data_output_path "data"
python check_equivalence.py --workload_path "dsb/queries/optimized_queries.txt" --schema_path "dsb/queries/create_tables.sql" --load_path "dsb/queries/load_tables.sql"
```

The shuffle utility is for benchmarking: it helps determine whether the LLM's performance comes from having seen the queries before (a possibility for more common benchmarks like TPC-H, TPC-DS, etc.) or from an actual ability to rewrite queries. Refer to the shuffle folder's documentation for how to use it.
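The shuffle folder's documentation has the authoritative details; the sketch below only illustrates the general idea of disguising a benchmark query before prompting, so that memorized TPC-H/TPC-DS text is of no help. The identifier mapping and helper function here are hypothetical.

```python
import re

# Hypothetical renaming map; the actual shuffle tooling decides its own mapping.
RENAMES = {
    "store_sales": "t1",
    "ss_quantity": "c1",
    "ss_net_profit": "c2",
}

def shuffle_identifiers(sql: str) -> str:
    """Rename tables/columns so the model cannot rely on memorized benchmark queries."""
    for old, new in RENAMES.items():
        sql = re.sub(rf"\b{old}\b", new, sql)
    return sql

print(shuffle_identifiers(
    "SELECT ss_quantity, ss_net_profit FROM store_sales WHERE ss_quantity > 10"
))
# -> SELECT c1, c2 FROM t1 WHERE c1 > 10
```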
Install the necessary packages:
```
pip install pytest pytest-cov pandas
```

To run all unit tests & check code coverage:

```
pytest --cov=. --cov-config=.coveragerc
```