feat: Support cross-database evaluation with SQLite ground truth by arieljassan · Pull Request #465 · GoogleCloudPlatform/evalbench

arieljassan · 2026-06-28T17:26:22Z

Overview

This PR adds support for running the BIRD benchmark (and similar cross-database SQL evaluations) on BigQuery while using local SQLite databases for ground truth reference execution.
When evaluating AI-generated queries on BigQuery against reference queries written in SQLite syntax (e.g., using STRFTIME or SQLite math functions), BigQuery cannot execute the reference queries directly. This update introduces a hybrid bridging mechanism that dynamically resolves reference answers from local SQLite files while evaluating generated queries on BigQuery.
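
To make the dialect gap concrete, here is a minimal sketch using only the standard-library sqlite3 module. The table, rows, and query are invented for illustration and are not taken from BIRD or from this PR. It only shows that a reference query written with SQLite's STRFTIME runs natively against a SQLite database, which is the property the hybrid path relies on.

```python
import sqlite3

# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schools (name TEXT, opened TEXT)")
conn.executemany(
    "INSERT INTO schools VALUES (?, ?)",
    [("Alpha", "1998-09-01"), ("Beta", "2004-01-15")],
)

# A reference query in SQLite syntax; STRFTIME is a SQLite date function.
reference_query = (
    "SELECT name, STRFTIME('%Y', opened) AS year FROM schools ORDER BY name"
)
reference_rows = conn.execute(reference_query).fetchall()
conn.close()
print(reference_rows)  # [('Alpha', '1998'), ('Beta', '2004')]
```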

Key Changes

evalbench/databases/bigquery.py: Implements the required ensure_database_exists abstract method on BQDB to fulfill the base DB class contract.
evalbench/scorers/sqlite_bridge.py & llmrater.py: Adds conditional ground truth resolution. When golden_error occurs and a hybrid judge is configured, LLMRater automatically fetches the true reference execution rows from the local SQLite database.
hybrid_xa_judge.py: Adds a self-contained Execution Accuracy (XA) evaluator script for PythonScorer. It normalizes across engine data types (Decimal/Int64 vs float), ignores column header differences, and compares rows order-independently.

Verification

Ran functional end-to-end benchmark tests on two BIRD evaluation datasets (california_schools and card_games).
Confirmed that ground truth resolves from the local SQLite reference tables during live BigQuery evaluation runs.
Verified that evaluation results are written cleanly to BigQuery (scores are compatible with FLOAT64).


arieljassan · 2026-06-29T10:26:17Z

/gcbrun

IsmailMehdi

Solves a real problem (cross-engine BIRD eval on BQ), but a few correctness issues to address. Inline comments cover the file-specific items. Three cross-cutting points that don't fit a specific line:

No tests for any of the new code. Both sqlite_bridge.py and hybrid_xa_judge.py are trivially unit-testable (in-memory SQLite via sqlite3.connect(":memory:"), mock sys.argv). Without tests the fragile bits flagged inline will silently rot. Minimum I'd want to see: test_get_sqlite_ground_truth_uses_named_db (after the #2 fix), test_compare_result_sets_handles_decimal_vs_float, test_compare_result_sets_ignores_column_names_and_row_order, test_hybrid_xa_judge_main_with_matching_results.

No documentation in docs/configs/run-config.md or dataset-config docs for this new mode. PRs #453 and #459 got the same note — easier to add docs at PR time than to remember later.

PR description "Verification" is functional/manual only — no automated test references. Worth at least pointing to a scenario YAML that exercises the hybrid path so others can repro the test. Also: if you confirm the uv run --isolated import issue (inline comment on hybrid_xa_judge.py:10), worth describing in the PR how the manual verification got past that — I may be misreading the invocation.

IsmailMehdi · 2026-06-29T17:52:31Z

        self.client = bigquery.Client(project=self.project_id)
        self.tmp_users = []

+    def ensure_database_exists(self):


Signature violates the abstract base class. databases/db.py:182 declares:

    @abstractmethod
    def ensure_database_exists(self, database_name: str) -> None:

The other implementations (mysql, postgres, sqlite, spanner, bigtable, mongodb, sqlserver) all match this signature. The new BQ override drops the parameter entirely. Any generic caller that does db.ensure_database_exists("foo"), which is the documented contract, will hit TypeError: ensure_database_exists() takes 1 positional argument but 2 were given.

    def ensure_database_exists(self, database_name: str) -> None:
        # BigQuery datasets are project-scoped; no per-database creation needed.
        pass
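
A minimal sketch of why this only surfaces at call time: Python's abc module checks that an abstract method is overridden, but it does not compare signatures, so the mismatched override instantiates without complaint. The class names DB, BrokenBQDB, and FixedBQDB below are stand-ins for illustration, not the real classes in the repo.

```python
from abc import ABC, abstractmethod


class DB(ABC):
    """Stand-in for the abstract base class; only one method is modeled."""

    @abstractmethod
    def ensure_database_exists(self, database_name: str) -> None:
        ...


class BrokenBQDB(DB):
    # Mirrors the override in this PR: the parameter is dropped.
    def ensure_database_exists(self):
        pass


class FixedBQDB(DB):
    # Matches the base signature; the argument is accepted and ignored.
    def ensure_database_exists(self, database_name: str) -> None:
        pass


def call_generic(db: DB) -> str:
    """Call the method the way a generic caller would."""
    try:
        db.ensure_database_exists("foo")
        return "ok"
    except TypeError as exc:
        return f"TypeError: {exc}"


# Both classes instantiate; only the call reveals the mismatch.
broken_result = call_generic(BrokenBQDB())
fixed_result = call_generic(FixedBQDB())
print(broken_result)
print(fixed_result)
```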

IsmailMehdi · 2026-06-29T17:52:31Z

+    candidates = [
+        f[:-7] for f in os.listdir(db_dir) if f.endswith(".sqlite")
+    ]
+    for cand in candidates:


Picks an arbitrary SQLite file by trial-and-error. Iterating every .sqlite file in the directory and using whichever one accepts the query is wrong in two ways:

Non-deterministic. Returns whatever file os.listdir yielded first that didn't raise. If two BIRD databases (california_schools, card_games) happen to have tables/columns with overlapping names, the resolved "truth" depends on the OS's directory ordering. Run the same scenario on two machines, get two different scores.

Silent wrong-answer risk. A query may "succeed" against the wrong DB by returning bogus rows that don't match the scenario's intent. The judge then scores against those rows and reports a misleading PASS/FAIL.

The scenario record already carries the database name (the database field in BIRD dataset entries — california_schools, etc.). Pass it through:

    def get_sqlite_ground_truth(query: str, database: str) -> list:
        sqlite_path = os.path.join(db_dir, f"{database}.sqlite")
        if not os.path.exists(sqlite_path):
            return []
        conn = sqlite3.connect(sqlite_path)
        try:
            return pd.read_sql_query(query, conn).to_dict(orient="records")
        finally:
            conn.close()

Bonus: dramatically faster — one connect instead of N.
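
A rough, self-contained sketch of the named-database lookup, exercised against two throwaway SQLite files that deliberately share a table name. Two things here are assumptions made for the sake of a runnable example: db_dir is passed as a parameter rather than computed inside the module, and rows are fetched with plain sqlite3 instead of pandas. The helper make_db is hypothetical test scaffolding.

```python
import os
import sqlite3
import tempfile


def get_sqlite_ground_truth(query: str, database: str, db_dir: str) -> list:
    """Run `query` against `<db_dir>/<database>.sqlite` only."""
    sqlite_path = os.path.join(db_dir, f"{database}.sqlite")
    if not os.path.exists(sqlite_path):
        return []
    conn = sqlite3.connect(sqlite_path)
    try:
        return conn.execute(query).fetchall()
    finally:
        conn.close()


def make_db(db_dir: str, name: str, label: str) -> None:
    """Create a tiny database with a table `t` holding one label."""
    conn = sqlite3.connect(os.path.join(db_dir, f"{name}.sqlite"))
    conn.execute("CREATE TABLE t (label TEXT)")
    conn.execute("INSERT INTO t VALUES (?)", (label,))
    conn.commit()
    conn.close()


with tempfile.TemporaryDirectory() as db_dir:
    # Both files accept the same query, so trial-and-error would be ambiguous.
    make_db(db_dir, "california_schools", "schools")
    make_db(db_dir, "card_games", "cards")
    query = "SELECT label FROM t"
    schools_rows = get_sqlite_ground_truth(query, "california_schools", db_dir)
    cards_rows = get_sqlite_ground_truth(query, "card_games", db_dir)
    missing_rows = get_sqlite_ground_truth(query, "no_such_db", db_dir)
```

Because the file is chosen by name, the result no longer depends on directory ordering, and an unknown database yields an empty list instead of rows from some other file.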

IsmailMehdi · 2026-06-29T17:52:31Z

+    """Resolves candidate SQLite database files and executes query."""
+    parent_dir = os.path.dirname(__file__)
+    root_dir = os.path.abspath(os.path.join(parent_dir, "..", ".."))
+    db_dir = os.path.join(root_dir, "db_connections", "bird")


Hardcoded path. db_connections/bird is specific to one dataset and won't work for other suites. The experiment_config already carries database_configs; derive the SQLite path from there instead of hardcoding the directory. As written, this scorer is silently bird-only — and nothing in the docstring says so.

IsmailMehdi · 2026-06-29T17:52:31Z

+    return []
+
+
+def is_hybrid_cross_db_enabled() -> bool:


Sniffing sys.argv to detect mode is fragile and creates several real problems:

Only works for python evalbench.py --experiment_config=foo.yaml. Other invocation paths don't have this argv pattern: the gRPC server (eval_server.py) constructs configs programmatically, run_suite passes config paths via a different mechanism, and any programmatic / in-process caller doesn't have argv at all. Hybrid mode silently disables in these cases.

Re-parses the YAML on every call. compare() runs once per scenario per trial; this is O(N) file reads of the same YAML.

Cross-scorer coupling via string match. Checks the PythonScorer's script_path to decide what LLMRater does. If someone renames hybrid_xa_judge.py or moves it, LLMRater silently goes back to returning 0 on golden errors.

The signal that hybrid mode is active is conceptually a scorer config, not a global mode. Lift it to LLMRater.__init__:

    def __init__(self, config, global_models):
        ...
        self.hybrid_ground_truth = config.get("hybrid_ground_truth", False)

And in YAML:

    scorers:
      llm_rater:
        hybrid_ground_truth: true

Then the compare() branch checks self.hybrid_ground_truth. No argv sniffing, no cross-scorer coupling, no file I/O in the hot path.
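
To illustrate the shape of that change, here is a small sketch with a hypothetical stand-in class. LLMRaterSketch, resolve_golden, and the sqlite_lookup callable are invented names for this example; the real LLMRater constructor and compare() take other arguments that are not modeled here.

```python
class LLMRaterSketch:
    """Hypothetical stand-in for LLMRater; only the flag handling is modeled."""

    def __init__(self, config: dict):
        # Read once at construction; no argv parsing or YAML reads later.
        self.hybrid_ground_truth = config.get("hybrid_ground_truth", False)

    def resolve_golden(self, golden_error, golden_rows, sqlite_lookup):
        """Return the rows to score against, or None if unavailable."""
        if not golden_error:
            return golden_rows
        if self.hybrid_ground_truth:
            # Fallback is opt-in and visible in the scorer's own config.
            return sqlite_lookup()
        return None


def lookup():
    """Placeholder for the SQLite reference execution."""
    return [(1,)]


rater_on = LLMRaterSketch({"hybrid_ground_truth": True})
rater_off = LLMRaterSketch({})
on_rows = rater_on.resolve_golden("syntax error", None, lookup)
off_rows = rater_off.resolve_golden("syntax error", None, lookup)
ok_rows = rater_off.resolve_golden(None, [(2,)], lookup)
```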

IsmailMehdi · 2026-06-29T17:52:31Z

+            # If using hybrid judge, fetch ground truth from SQLite when BQ
+            # fails on golden query syntax (e.g. SQLite functions in reference
+            # queries).
+            if sqlite_bridge.is_hybrid_cross_db_enabled():


Silent fallback — no log line when it fires. When hybrid mode kicks in and resolves a golden answer from SQLite, nothing in the logs records that this happened. The eval row just has a normal-looking score. Debugging "why does my BQ run report fewer golden errors than expected" becomes a source-dive.

Add one log line:

    if sqlite_bridge.is_hybrid_cross_db_enabled():
        logging.info(
            "Hybrid ground truth: BQ golden query failed, resolving from "
            "SQLite reference. query=%s",
            golden_query
        )
        golden_execution_result = ...

IsmailMehdi · 2026-06-29T17:52:31Z

+
+import pandas as pd
+
+from evalbench.scorers.sqlite_bridge import get_sqlite_ground_truth


This import will fail at runtime. pythonscorer.py:57 invokes the script as:

command = ["uv", "run", "--isolated", self.script_path]

--isolated strips the parent's PYTHONPATH and runs in a clean venv with no access to the evalbench source tree. from evalbench.scorers.sqlite_bridge import ... will raise ModuleNotFoundError. The script returns nonzero, the scorer reports a generic FAIL: Script failed with exit code..., and the user never sees a real result.

The PR description mentions functional end-to-end testing, so either I'm misreading the invocation OR the testing didn't actually exercise this code path. Worth confirming with a fresh-venv repro.

Fix: inline the ~10 lines of get_sqlite_ground_truth directly into hybrid_xa_judge.py so the script has no project-local imports. Together with #2's fix (pass database in), the inlined function becomes very small.

IsmailMehdi · 2026-06-29T17:52:31Z

@@ -0,0 +1,109 @@
+"""Hybrid Execution Accuracy (XA) Cross-Database Evaluator for EvalBench."""


Script lives at the repo root. All other Python under this repo lives under evalbench/ or datasets/. Convention break. Suggest moving to something like evalbench/scorers/judges/hybrid_xa_judge.py and updating the script_path in any example configs that reference it.

IsmailMehdi · 2026-06-29T17:52:31Z

+                    normalized_row.append(None)
+                elif isinstance(val, (int, float, Decimal)):
+                    try:
+                        normalized_row.append(round(float(val), 4))


compare_result_sets has three lossy behaviors worth pinning down in the docstring so the next reader doesn't second-guess them:

Floats rounded to 4 decimals (round(float(val), 4)). For some BIRD queries (AVG/SUM over money or rate fields), two values that differ only beyond the fourth decimal place round to the same number, turning a real mismatch into a false PASS. Either widen the precision or document the choice.

Sort key lambda x: str(x) sorts [1, 10, 2] as ["(1,)", "(10,)", "(2,)"]. Fine for equality (both sides sort identically) but unreadable in FAIL logs. Consider key=lambda r: tuple(str(v) for v in r) for slightly less surprising debug output.

.0 suffix stripped from string values. If BQ returns the literal string "3.0" (e.g., a version label) and SQLite returns "3", they're considered equal. Almost certainly intentional for numeric-as-string coercion but worth a one-line comment.

A short docstring listing these explicitly makes the contract clear and lets future maintainers reason about edge cases without re-deriving them.
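
As a sketch of what such a documented contract could look like, the following is an illustrative reimplementation, not the code in this PR. It assumes rows arrive as lists of dicts, and the precision parameter and the normalize_value helper are additions made for the example.

```python
from decimal import Decimal


def normalize_value(val, precision: int = 4):
    """Map one cell to a comparable form (lossy by design)."""
    if val is None:
        return None
    if isinstance(val, (int, float, Decimal)):
        return round(float(val), precision)
    text = str(val)
    # Lossy: the strings "3.0" and "3" compare equal after this strip.
    return text[:-2] if text.endswith(".0") else text


def compare_result_sets(generated, reference, precision: int = 4) -> bool:
    """Compare two lists of row dicts.

    Lossy behaviors:
      * numbers are rounded to `precision` decimals;
      * column names are ignored, only value positions within a row count;
      * row order is ignored;
      * a trailing ".0" on string values is stripped.
    """
    def prepare(rows):
        normalized = [
            tuple(normalize_value(v, precision) for v in row.values())
            for row in rows
        ]
        # String keys keep None and mixed types sortable.
        return sorted(normalized, key=lambda r: tuple(str(v) for v in r))

    return prepare(generated) == prepare(reference)


bq_rows = [{"f0_": Decimal("1.50"), "n": "3.0"}, {"f0_": Decimal("2"), "n": "x"}]
lite_rows = [{"avg": 2.0, "name": "x"}, {"avg": 1.5, "name": "3"}]
same = compare_result_sets(bq_rows, lite_rows)

# Values that differ in the fifth decimal pass at 4 places, fail at 6.
loose = compare_result_sets([{"a": 0.12341}], [{"a": 0.12344}])
strict = compare_result_sets([{"a": 0.12341}], [{"a": 0.12344}], precision=6)
```

The last two calls show the false-PASS risk from the first point above: the same pair of values is judged equal at 4 decimals and different at 6.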

IsmailMehdi · 2026-06-29T17:55:05Z

Maybe this should be built as a tool instead of inline with evals. wdyt ?

feat(scorers): add hybrid execution accuracy judging and SQLite groun…

1c07b41

…d truth resolution

arieljassan requested a review from IsmailMehdi as a code owner June 28, 2026 17:26

IsmailMehdi reviewed Jun 29, 2026


arieljassan force-pushed the feat/bird-xa-benchmark branch from bb25c95 to 1c07b41 Compare June 30, 2026 10:28

arieljassan wants to merge 1 commit into GoogleCloudPlatform:main from arieljassan:feat/bird-xa-benchmark
