feat: Support cross-database evaluation with SQLite ground truth#465
feat: Support cross-database evaluation with SQLite ground truth#465arieljassan wants to merge 1 commit into
Conversation
…d truth resolution
|
/gcbrun |
IsmailMehdi
left a comment
There was a problem hiding this comment.
Solves a real problem (cross-engine BIRD eval on BQ), but a few correctness issues to address. Inline comments cover the file-specific items. Three cross-cutting points that don't fit a specific line:
No tests for any of the new code. Both sqlite_bridge.py and hybrid_xa_judge.py are trivially unit-testable (in-memory SQLite via sqlite3.connect(":memory:"), mock sys.argv). Without tests the fragile bits flagged inline will silently rot. Minimum I'd want to see: test_get_sqlite_ground_truth_uses_named_db (after the #2 fix), test_compare_result_sets_handles_decimal_vs_float, test_compare_result_sets_ignores_column_names_and_row_order, test_hybrid_xa_judge_main_with_matching_results.
No documentation in docs/configs/run-config.md or dataset-config docs for this new mode. PRs #453 and #459 got the same note — easier to add docs at PR time than to remember later.
PR description "Verification" is functional/manual only — no automated test references. Worth at least pointing to a scenario YAML that exercises the hybrid path so others can repro the test. Also: if you confirm the uv run --isolated import issue (inline comment on hybrid_xa_judge.py:10), worth describing in the PR how the manual verification got past that — I may be misreading the invocation.
| self.client = bigquery.Client(project=self.project_id) | ||
| self.tmp_users = [] | ||
|
|
||
| def ensure_database_exists(self): |
There was a problem hiding this comment.
Signature violates the abstract base class. databases/db.py:182 declares:
@abstractmethod
def ensure_database_exists(self, database_name: str) -> None:All 8 other implementations (mysql, postgres, sqlite, spanner, bigtable, mongodb, sqlserver) match this signature. The new BQ override drops the parameter entirely. Any generic caller that does db.ensure_database_exists("foo") — which is the documented contract — will hit TypeError: ensure_database_exists() takes 1 positional argument but 2 were given.
def ensure_database_exists(self, database_name: str) -> None:
# BigQuery datasets are project-scoped; no per-database creation needed.
pass| candidates = [ | ||
| f[:-7] for f in os.listdir(db_dir) if f.endswith(".sqlite") | ||
| ] | ||
| for cand in candidates: |
There was a problem hiding this comment.
Picks an arbitrary SQLite file by trial-and-error. Iterating every .sqlite file in the directory and using whichever one accepts the query is wrong in two ways:
- Non-deterministic. Returns whatever file
os.listdiryielded first that didn't raise. If two BIRD databases (california_schools,card_games) happen to have tables/columns with overlapping names, the resolved "truth" depends on the OS's directory ordering. Run the same scenario on two machines, get two different scores. - Silent wrong-answer risk. A query may "succeed" against the wrong DB by returning bogus rows that don't match the scenario's intent. The judge then scores against those rows and reports a misleading PASS/FAIL.
The scenario record already carries the database name (the database field in BIRD dataset entries — california_schools, etc.). Pass it through:
def get_sqlite_ground_truth(query: str, database: str) -> list:
sqlite_path = os.path.join(db_dir, f"{database}.sqlite")
if not os.path.exists(sqlite_path):
return []
conn = sqlite3.connect(sqlite_path)
try:
return pd.read_sql_query(query, conn).to_dict(orient="records")
finally:
conn.close()Bonus: dramatically faster — one connect instead of N.
| """Resolves candidate SQLite database files and executes query.""" | ||
| parent_dir = os.path.dirname(__file__) | ||
| root_dir = os.path.abspath(os.path.join(parent_dir, "..", "..")) | ||
| db_dir = os.path.join(root_dir, "db_connections", "bird") |
There was a problem hiding this comment.
Hardcoded path. db_connections/bird is specific to one dataset and won't work for other suites. The experiment_config already carries database_configs; derive the SQLite path from there instead of hardcoding the directory. As written, this scorer is silently bird-only — and nothing in the docstring says so.
| return [] | ||
|
|
||
|
|
||
| def is_hybrid_cross_db_enabled() -> bool: |
There was a problem hiding this comment.
Sniffing sys.argv to detect mode is fragile and creates several real problems:
- Only works for
python evalbench.py --experiment_config=foo.yaml. Other invocation paths don't have this argv pattern: the gRPC server (eval_server.py) constructs configs programmatically,run_suitepasses config paths via a different mechanism, and any programmatic / in-process caller doesn't have argv at all. Hybrid mode silently disables in these cases. - Re-parses the YAML on every call.
compare()runs once per scenario per trial; this is O(N) file reads of the same YAML. - Cross-scorer coupling via string match. Checks the PythonScorer's
script_pathto decide what LLMRater does. If someone renameshybrid_xa_judge.pyor moves it, LLMRater silently goes back to returning 0 on golden errors.
The signal that hybrid mode is active is conceptually a scorer config, not a global mode. Lift it to LLMRater.__init__:
def __init__(self, config, global_models):
...
self.hybrid_ground_truth = config.get("hybrid_ground_truth", False)And in YAML:
scorers:
llm_rater:
hybrid_ground_truth: trueThen the compare() branch checks self.hybrid_ground_truth. No argv sniffing, no cross-scorer coupling, no file I/O in the hot path.
| # If using hybrid judge, fetch ground truth from SQLite when BQ | ||
| # fails on golden query syntax (e.g. SQLite functions in reference | ||
| # queries). | ||
| if sqlite_bridge.is_hybrid_cross_db_enabled(): |
There was a problem hiding this comment.
Silent fallback — no log line when it fires. When hybrid mode kicks in and resolves a golden answer from SQLite, nothing in the logs records that this happened. The eval row just has a normal-looking score. Debugging "why does my BQ run report fewer golden errors than expected" becomes a source-dive.
Add one log line:
if sqlite_bridge.is_hybrid_cross_db_enabled():
logging.info(
"Hybrid ground truth: BQ golden query failed, resolving from "
"SQLite reference. query=%s", golden_query
)
golden_execution_result = ...|
|
||
| import pandas as pd | ||
|
|
||
| from evalbench.scorers.sqlite_bridge import get_sqlite_ground_truth |
There was a problem hiding this comment.
This import will fail at runtime. pythonscorer.py:57 invokes the script as:
command = ["uv", "run", "--isolated", self.script_path]--isolated strips the parent's PYTHONPATH and runs in a clean venv with no access to the evalbench source tree. from evalbench.scorers.sqlite_bridge import ... will raise ModuleNotFoundError. The script returns nonzero, the scorer reports a generic FAIL: Script failed with exit code..., and the user never sees a real result.
The PR description mentions functional end-to-end testing, so either I'm misreading the invocation OR the testing didn't actually exercise this code path. Worth confirming with a fresh-venv repro.
Fix: inline the ~10 lines of get_sqlite_ground_truth directly into hybrid_xa_judge.py so the script has no project-local imports. Together with #2's fix (pass database in), the inlined function becomes very small.
| @@ -0,0 +1,109 @@ | |||
| """Hybrid Execution Accuracy (XA) Cross-Database Evaluator for EvalBench.""" | |||
There was a problem hiding this comment.
Script lives at the repo root. All other Python under this repo lives under evalbench/ or datasets/. Convention break. Suggest moving to something like evalbench/scorers/judges/hybrid_xa_judge.py and updating the script_path in any example configs that reference it.
| normalized_row.append(None) | ||
| elif isinstance(val, (int, float, Decimal)): | ||
| try: | ||
| normalized_row.append(round(float(val), 4)) |
There was a problem hiding this comment.
compare_result_sets has three lossy behaviors worth pinning down in the docstring so the next reader doesn't second-guess them:
- Floats rounded to 4 decimals (
round(float(val), 4)). For some BIRD queries — AVG/SUM over money or rate fields — the reference may exceed 4 decimal places, turning a real mismatch into a false PASS. Either widen the precision or document the choice. - Sort key
lambda x: str(x)sorts[1, 10, 2]as["(1,)", "(10,)", "(2,)"]. Fine for equality (both sides sort identically) but unreadable in FAIL logs. Considerkey=lambda r: tuple(str(v) for v in r)for slightly less surprising debug output. .0suffix stripped from string values. If BQ returns the literal string"3.0"(e.g., a version label) and SQLite returns"3", they're considered equal. Almost certainly intentional for numeric-as-string coercion but worth a one-line comment.
A short docstring listing these explicitly makes the contract clear and lets future maintainers reason about edge cases without re-deriving them.
|
Maybe this should be built as a tool instead of inline with evals. wdyt ? |
bb25c95 to
1c07b41
Compare
Overview
This PR adds support for running the BIRD benchmark (and similar cross-database SQL evaluations) on BigQuery while using local SQLite databases for ground truth reference execution.
When evaluating AI-generated queries on BigQuery against reference queries written in SQLite syntax (e.g., using
STRFTIMEor SQLite math functions), BigQuery cannot execute the reference queries directly. This update introduces a hybrid bridging mechanism that dynamically resolves reference answers from local SQLite files while evaluating generated queries on BigQuery.Key Changes
evalbench/databases/bigquery.py: Implements the requiredensure_database_existsabstract method onBQDBto fulfill the baseDBclass contract.evalbench/scorers/sqlite_bridge.py&llmrater.py: Adds conditional ground truth resolution. Whengolden_erroroccurs and a hybrid judge is configured,LLMRaterautomatically fetches the true reference execution rows from the local SQLite database.hybrid_xa_judge.py: Adds a self-contained Execution Accuracy (XA) evaluator script forPythonScorer. It normalizes across engine data types (Decimal/Int64vsfloat), ignores column header differences, and compares rows order-independently.Verification
california_schoolsandcard_games).FLOAT64score compatibility).