Skip to content

feat: Support cross-database evaluation with SQLite ground truth#465

Open
arieljassan wants to merge 1 commit into
GoogleCloudPlatform:mainfrom
arieljassan:feat/bird-xa-benchmark
Open

feat: Support cross-database evaluation with SQLite ground truth#465
arieljassan wants to merge 1 commit into
GoogleCloudPlatform:mainfrom
arieljassan:feat/bird-xa-benchmark

Conversation

@arieljassan

Copy link
Copy Markdown
Member

Overview

This PR adds support for running the BIRD benchmark (and similar cross-database SQL evaluations) on BigQuery while using local SQLite databases for ground truth reference execution.
When evaluating AI-generated queries on BigQuery against reference queries written in SQLite syntax (e.g., using STRFTIME or SQLite math functions), BigQuery cannot execute the reference queries directly. This update introduces a hybrid bridging mechanism that dynamically resolves reference answers from local SQLite files while evaluating generated queries on BigQuery.

Key Changes

  • evalbench/databases/bigquery.py: Implements the required ensure_database_exists abstract method on BQDB to fulfill the base DB class contract.
  • evalbench/scorers/sqlite_bridge.py & llmrater.py: Adds conditional ground truth resolution. When golden_error occurs and a hybrid judge is configured, LLMRater automatically fetches the true reference execution rows from the local SQLite database.
  • hybrid_xa_judge.py: Adds a self-contained Execution Accuracy (XA) evaluator script for PythonScorer. It normalizes across engine data types (Decimal/Int64 vs float), ignores column header differences, and compares rows order-independently.

Verification

  • Performed functional end-to-end benchmark tests across dual BIRD evaluation datasets (california_schools and card_games).
  • Confirmed functional ground truth resolution from local SQLite reference tables during live BigQuery evaluation runs.
  • Verified clean telemetry data warehousing into BigQuery (FLOAT64 score compatibility).

@arieljassan arieljassan requested a review from IsmailMehdi as a code owner June 28, 2026 17:26
@arieljassan

Copy link
Copy Markdown
Member Author

/gcbrun

@IsmailMehdi IsmailMehdi left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solves a real problem (cross-engine BIRD eval on BQ), but a few correctness issues to address. Inline comments cover the file-specific items. Three cross-cutting points that don't fit a specific line:

No tests for any of the new code. Both sqlite_bridge.py and hybrid_xa_judge.py are trivially unit-testable (in-memory SQLite via sqlite3.connect(":memory:"), mock sys.argv). Without tests the fragile bits flagged inline will silently rot. Minimum I'd want to see: test_get_sqlite_ground_truth_uses_named_db (after the #2 fix), test_compare_result_sets_handles_decimal_vs_float, test_compare_result_sets_ignores_column_names_and_row_order, test_hybrid_xa_judge_main_with_matching_results.

No documentation in docs/configs/run-config.md or dataset-config docs for this new mode. PRs #453 and #459 got the same note — easier to add docs at PR time than to remember later.

PR description "Verification" is functional/manual only — no automated test references. Worth at least pointing to a scenario YAML that exercises the hybrid path so others can repro the test. Also: if you confirm the uv run --isolated import issue (inline comment on hybrid_xa_judge.py:10), worth describing in the PR how the manual verification got past that — I may be misreading the invocation.

self.client = bigquery.Client(project=self.project_id)
self.tmp_users = []

def ensure_database_exists(self):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Signature violates the abstract base class. databases/db.py:182 declares:

@abstractmethod
def ensure_database_exists(self, database_name: str) -> None:

All 8 other implementations (mysql, postgres, sqlite, spanner, bigtable, mongodb, sqlserver) match this signature. The new BQ override drops the parameter entirely. Any generic caller that does db.ensure_database_exists("foo") — which is the documented contract — will hit TypeError: ensure_database_exists() takes 1 positional argument but 2 were given.

def ensure_database_exists(self, database_name: str) -> None:
    # BigQuery datasets are project-scoped; no per-database creation needed.
    pass

candidates = [
f[:-7] for f in os.listdir(db_dir) if f.endswith(".sqlite")
]
for cand in candidates:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Picks an arbitrary SQLite file by trial-and-error. Iterating every .sqlite file in the directory and using whichever one accepts the query is wrong in two ways:

  1. Non-deterministic. Returns whatever file os.listdir yielded first that didn't raise. If two BIRD databases (california_schools, card_games) happen to have tables/columns with overlapping names, the resolved "truth" depends on the OS's directory ordering. Run the same scenario on two machines, get two different scores.
  2. Silent wrong-answer risk. A query may "succeed" against the wrong DB by returning bogus rows that don't match the scenario's intent. The judge then scores against those rows and reports a misleading PASS/FAIL.

The scenario record already carries the database name (the database field in BIRD dataset entries — california_schools, etc.). Pass it through:

def get_sqlite_ground_truth(query: str, database: str) -> list:
    sqlite_path = os.path.join(db_dir, f"{database}.sqlite")
    if not os.path.exists(sqlite_path):
        return []
    conn = sqlite3.connect(sqlite_path)
    try:
        return pd.read_sql_query(query, conn).to_dict(orient="records")
    finally:
        conn.close()

Bonus: dramatically faster — one connect instead of N.

"""Resolves candidate SQLite database files and executes query."""
parent_dir = os.path.dirname(__file__)
root_dir = os.path.abspath(os.path.join(parent_dir, "..", ".."))
db_dir = os.path.join(root_dir, "db_connections", "bird")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoded path. db_connections/bird is specific to one dataset and won't work for other suites. The experiment_config already carries database_configs; derive the SQLite path from there instead of hardcoding the directory. As written, this scorer is silently bird-only — and nothing in the docstring says so.

return []


def is_hybrid_cross_db_enabled() -> bool:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sniffing sys.argv to detect mode is fragile and creates several real problems:

  • Only works for python evalbench.py --experiment_config=foo.yaml. Other invocation paths don't have this argv pattern: the gRPC server (eval_server.py) constructs configs programmatically, run_suite passes config paths via a different mechanism, and any programmatic / in-process caller doesn't have argv at all. Hybrid mode silently disables in these cases.
  • Re-parses the YAML on every call. compare() runs once per scenario per trial; this is O(N) file reads of the same YAML.
  • Cross-scorer coupling via string match. Checks the PythonScorer's script_path to decide what LLMRater does. If someone renames hybrid_xa_judge.py or moves it, LLMRater silently goes back to returning 0 on golden errors.

The signal that hybrid mode is active is conceptually a scorer config, not a global mode. Lift it to LLMRater.__init__:

def __init__(self, config, global_models):
    ...
    self.hybrid_ground_truth = config.get("hybrid_ground_truth", False)

And in YAML:

scorers:
  llm_rater:
    hybrid_ground_truth: true

Then the compare() branch checks self.hybrid_ground_truth. No argv sniffing, no cross-scorer coupling, no file I/O in the hot path.

# If using hybrid judge, fetch ground truth from SQLite when BQ
# fails on golden query syntax (e.g. SQLite functions in reference
# queries).
if sqlite_bridge.is_hybrid_cross_db_enabled():

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silent fallback — no log line when it fires. When hybrid mode kicks in and resolves a golden answer from SQLite, nothing in the logs records that this happened. The eval row just has a normal-looking score. Debugging "why does my BQ run report fewer golden errors than expected" becomes a source-dive.

Add one log line:

if sqlite_bridge.is_hybrid_cross_db_enabled():
    logging.info(
        "Hybrid ground truth: BQ golden query failed, resolving from "
        "SQLite reference. query=%s", golden_query
    )
    golden_execution_result = ...

Comment thread hybrid_xa_judge.py

import pandas as pd

from evalbench.scorers.sqlite_bridge import get_sqlite_ground_truth

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This import will fail at runtime. pythonscorer.py:57 invokes the script as:

command = ["uv", "run", "--isolated", self.script_path]

--isolated strips the parent's PYTHONPATH and runs in a clean venv with no access to the evalbench source tree. from evalbench.scorers.sqlite_bridge import ... will raise ModuleNotFoundError. The script returns nonzero, the scorer reports a generic FAIL: Script failed with exit code..., and the user never sees a real result.

The PR description mentions functional end-to-end testing, so either I'm misreading the invocation OR the testing didn't actually exercise this code path. Worth confirming with a fresh-venv repro.

Fix: inline the ~10 lines of get_sqlite_ground_truth directly into hybrid_xa_judge.py so the script has no project-local imports. Together with #2's fix (pass database in), the inlined function becomes very small.

Comment thread hybrid_xa_judge.py
@@ -0,0 +1,109 @@
"""Hybrid Execution Accuracy (XA) Cross-Database Evaluator for EvalBench."""

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Script lives at the repo root. All other Python under this repo lives under evalbench/ or datasets/. Convention break. Suggest moving to something like evalbench/scorers/judges/hybrid_xa_judge.py and updating the script_path in any example configs that reference it.

Comment thread hybrid_xa_judge.py
normalized_row.append(None)
elif isinstance(val, (int, float, Decimal)):
try:
normalized_row.append(round(float(val), 4))

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compare_result_sets has three lossy behaviors worth pinning down in the docstring so the next reader doesn't second-guess them:

  1. Floats rounded to 4 decimals (round(float(val), 4)). For some BIRD queries — AVG/SUM over money or rate fields — the reference may exceed 4 decimal places, turning a real mismatch into a false PASS. Either widen the precision or document the choice.
  2. Sort key lambda x: str(x) sorts [1, 10, 2] as ["(1,)", "(10,)", "(2,)"]. Fine for equality (both sides sort identically) but unreadable in FAIL logs. Consider key=lambda r: tuple(str(v) for v in r) for slightly less surprising debug output.
  3. .0 suffix stripped from string values. If BQ returns the literal string "3.0" (e.g., a version label) and SQLite returns "3", they're considered equal. Almost certainly intentional for numeric-as-string coercion but worth a one-line comment.

A short docstring listing these explicitly makes the contract clear and lets future maintainers reason about edge cases without re-deriving them.

@IsmailMehdi

Copy link
Copy Markdown
Collaborator

Maybe this should be built as a tool instead of inline with evals. wdyt ?

@arieljassan arieljassan force-pushed the feat/bird-xa-benchmark branch from bb25c95 to 1c07b41 Compare June 30, 2026 10:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants