15-799 Special Topics in Databases: Query Optimization | Spring 2025
A project that optimizes SQL queries by giving an LLM the table schemas and statistics and prompting it to generate a rewritten query, then verifying equivalence in two ways: a semantic prover checks that the queries are semantically equivalent, and both queries are run on generated datasets to check for identical outputs.
It uses the parser and prover from the Query Equivalence Decider (QED) to perform the semantic equivalence check.
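For intuition, the rewrite step looks roughly like the sketch below. This is a minimal illustration rather than the actual `prompt_llm.py` code; the prompt wording and the model name are assumptions.

```python
import os
import openai  # this project pins openai==0.28.1, which uses the ChatCompletion API

openai.api_key = os.environ["OPENAI_API_KEY"]

def rewrite_query(original_sql: str, schema_sql: str, stats: str) -> str:
    """Ask the LLM for a semantically equivalent, hopefully faster query.

    Minimal sketch of the prompting idea; the real prompt_llm.py may build
    its prompt and parse the model's response differently.
    """
    prompt = (
        "You are a SQL query optimizer. Rewrite the query below so that it "
        "runs faster but returns exactly the same results.\n\n"
        f"Table schemas:\n{schema_sql}\n\n"
        f"Table statistics:\n{stats}\n\n"
        f"Original query:\n{original_sql}\n\n"
        "Return only the rewritten SQL."
    )
    response = openai.ChatCompletion.create(
        model="gpt-4",  # model choice is an assumption, not taken from the repo
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"].strip()
```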
After cloning this project, include the submodules (prover & parser):
```
git submodule update --init --recursive
```

Export your OpenAI API key:

```
export OPENAI_API_KEY=[your api key]
```

Install the necessary packages:

```
pip install openai==0.28.1 faker duckdb
curl -L https://nixos.org/nix/install | sh
```

Restart the terminal, then run the commands below in order.
- `prompt_llm.py`: prompts the LLM for optimized queries by giving it the original queries, the table schemas, and statistics on the tables
- `generate_data.py`: generates random tables, based on the given schema, for testing query equivalence
- `check_equivalence.py`: combines the results from the above and outputs whether the original query and the LLM-optimized query are equivalent, based on QED and the randomly generated tables (a minimal sketch of this check follows the list)
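As referenced above, here is a minimal sketch of the data-generation and differential-comparison idea behind `generate_data.py` and `check_equivalence.py`. The table, row count, and queries are made up for illustration; the real scripts drive everything from `create_tables.sql` and `load_tables.sql`.

```python
import duckdb
from faker import Faker

fake = Faker()

def differential_check(original_sql: str, optimized_sql: str) -> bool:
    """Run both queries over randomly generated rows and compare their outputs."""
    con = duckdb.connect()  # in-memory DuckDB database
    con.execute("CREATE TABLE customer (c_id INTEGER, c_name VARCHAR, c_city VARCHAR)")
    for i in range(100):
        # Populate the table with fake but plausible values.
        con.execute("INSERT INTO customer VALUES (?, ?, ?)", [i, fake.name(), fake.city()])
    original_rows = sorted(con.execute(original_sql).fetchall())
    optimized_rows = sorted(con.execute(optimized_sql).fetchall())
    return original_rows == optimized_rows

# A trivially rewritten query should pass (c_id is never NULL here).
assert differential_check(
    "SELECT c_city, COUNT(*) FROM customer GROUP BY c_city",
    "SELECT c_city, COUNT(c_id) FROM customer GROUP BY c_city",
)
```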
```
python prompt_llm.py --workload_path "dsb/queries" --schema_path "dsb/queries/create_tables.sql" --stats_path "dsb/statistics.txt"
python generate_data.py --schema_path "dsb/queries/create_tables.sql" --data_output_path "data"
python check_equivalence.py --workload_path "dsb/queries/optimized_queries.txt" --schema_path "dsb/queries/create_tables.sql" --load_path "dsb/queries/load_tables.sql"
```

The shuffle utility is for benchmarking: it helps determine whether the LLM's performance comes from having seen the queries before (a possibility for more common benchmarks like TPC-H, TPC-DS, etc.) or from an actual ability to rewrite queries. Refer to the shuffle folder's documentation for how to use it.
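The shuffle folder's documentation has the authoritative details; the sketch below only illustrates the general idea of disguising a benchmark query before prompting, so that memorized TPC-H/TPC-DS text is of no help. The identifier mapping and helper function here are hypothetical.

```python
import re

# Hypothetical renaming map; the actual shuffle tooling decides its own mapping.
RENAMES = {
    "store_sales": "t1",
    "ss_quantity": "c1",
    "ss_net_profit": "c2",
}

def shuffle_identifiers(sql: str) -> str:
    """Rename tables/columns so the model cannot rely on memorized benchmark queries."""
    for old, new in RENAMES.items():
        sql = re.sub(rf"\b{old}\b", new, sql)
    return sql

print(shuffle_identifiers(
    "SELECT ss_quantity, ss_net_profit FROM store_sales WHERE ss_quantity > 10"
))
# -> SELECT c1, c2 FROM t1 WHERE c1 > 10
```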
Install the necessary packages:
```
pip install pytest pytest-cov pandas
```

To run all unit tests & check code coverage:

```
pytest --cov=. --cov-config=.coveragerc
```