Summary
BigQuery backend in SQLFrame generates Spark-style MAP(...) for Python dict columns, which BigQuery does not support. This causes queries like CREATE OR REPLACE TABLE ... SELECT MAP(['headline'], ['haferflocki']) AS title to fail with Function not found: MAP.
Reproduction Example as repo
https://github.com/kaojo/sql_frame_issue_repro
Minimal Reproduction as code examples
# Python 3.11+ (tested with 3.13)
# pip install sqlframe[bigquery] google-cloud-bigquery
from sqlframe.bigquery import BigQuerySession

# Adjust these to a project/dataset you have access to. The dataset must exist.
PROJECT_ID = "your-gcp-project"
DATASET = "your_dataset"
TABLE = f"{PROJECT_ID}.{DATASET}.repro_map_issue"

# Create a session that points to the dataset (database in SQLFrame terms)
session = BigQuerySession()

# Input data includes a Python dict (nested) column `title`.
# On Spark-like dialects this becomes a MAP/STRUCT; SQLFrame emits MAP() here.
data = [
    {"country": "DE", "dan": 3456789, "title": {"headline": "haferflocki"}},
    {"country": "DE", "dan": 3456781, "title": {"headline": "erdbaerkaese"}},
]

# Create DataFrame
df = session.createDataFrame(data)

# Repro path A: writing to table (CTAS) -> always fails with BigQuery error
# Expected: either a STRUCT or JSON-compatible expression
# Actual: BigQuery error: Function not found: MAP
try:
    df.write.mode("overwrite").saveAsTable(TABLE)
except Exception as e:
    print("saveAsTable failed:", e)

# Repro path B: even collecting (no write) can trigger a query with MAP(...)
try:
    df.collect()
except Exception as e:
    print("collect failed:", e)
Actual Error
Observed on df.write.mode("overwrite").saveAsTable(TABLE):
google.cloud.bigquery.dbapi.exceptions.DatabaseError: 400 ... Function not found: MAP; Did you mean max? at [1:...]
The formatted SQL (from stack trace) includes segments like:
CREATE OR REPLACE TABLE `project.dataset.repro_map_issue` AS
SELECT ... MAP(['headline'], ['haferflocki']) AS `title` ...
Expected Behavior
- The SQLFrame BigQuery dialect should not emit MAP(...) (a Spark SQL function), because BigQuery does not implement it.
- For a Python dict column, the BigQuery dialect should serialize to a BigQuery-compatible type/expression, such as:
  - STRUCT('haferflocki' AS headline) for a typed struct column; or
  - a JSON column using TO_JSON(STRUCT(...)) / JSON '{"headline":"haferflocki"}' when using the BigQuery JSON type.
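To make the expected STRUCT output concrete, here is a minimal, stdlib-only sketch; the helper name `dict_to_struct_literal` is hypothetical and not part of SQLFrame, it only illustrates the kind of expression the dialect could emit for a flat dict of string/number values:

```python
# Hypothetical sketch (not SQLFrame code): compile a flat Python dict of
# string/number values into a BigQuery STRUCT literal instead of MAP(...).
def dict_to_struct_literal(d):
    fields = ", ".join(
        (f"'{v}'" if isinstance(v, str) else str(v)) + f" AS {k}"
        for k, v in d.items()
    )
    return f"STRUCT({fields})"

print(dict_to_struct_literal({"headline": "haferflocki"}))
# prints: STRUCT('haferflocki' AS headline)
```

A real fix would of course also need to handle quoting, nesting, and NULLs; this only shows the target shape of the SQL.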
Why This Matters
- Writing meaningful tests with SQLFrame is not possible when the input data contains nested/dict fields.
Environment
- SQLFrame: issue reproduced with sqlframe 3.43.8 (latest as of Nov 2025)
- Python: 3.13.5
- google-cloud-bigquery: 3.x
- Backend: BigQuery Standard SQL
Suggested Fix (for SQLFrame)
- In the BigQuery dialect/layer that compiles Python dict literals or map-typed expressions, replace the MAP(keys, values) emission with a BigQuery-compatible representation, for example:
CREATE OR REPLACE TABLE `your_project_id`.your_dataset_id.your_table AS
SELECT
a1.country AS country,
a1.dan AS dan,
STRUCT((
SELECT
value
FROM
UNNEST(a1.title)
WHERE
key = 'headline'
) AS headline) AS title
FROM
UNNEST(ARRAY[STRUCT('DE' AS country, 3456789 AS dan, ARRAY[STRUCT('headline' AS key, 'haferflocki' AS value)] AS title),
STRUCT('DE' AS country, 3456781 AS dan, ARRAY[STRUCT('headline' AS key, 'erdbaerkaese' AS value)] AS title),
STRUCT('AT' AS country, 3456789 AS dan, ARRAY[STRUCT('headline' AS key, 'haferflocki' AS value)] AS title),
STRUCT('AT' AS country, 3456781 AS dan, ARRAY[STRUCT('headline' AS key, 'erdbaerkaese' AS value)] AS title)]) AS a1;
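The JSON-column alternative mentioned under Expected Behavior could be emitted along similar lines. A stdlib-only sketch (the helper `dict_to_json_literal` is hypothetical, not SQLFrame API):

```python
import json

# Hypothetical sketch (not SQLFrame code) of the JSON-column alternative:
# serialize the dict with json.dumps and wrap it in a BigQuery JSON literal.
def dict_to_json_literal(d):
    payload = json.dumps(d)
    # Escape single quotes so the payload survives inside a '...' literal.
    return "JSON '" + payload.replace("'", "\\'") + "'"

print(dict_to_json_literal({"headline": "haferflocki"}))
# prints: JSON '{"headline": "haferflocki"}'
```

This variant would lose the typed schema but avoids having to infer a STRUCT type from heterogeneous dicts.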
Notes
- The failure is reproducible even with a single-row createDataFrame containing a dict.
- The error also appears when calling .collect() without writing, because the generated SELECT still uses MAP(...).
Workaround
Define your data instead with Row objects rather than dicts for the nested column:

from sqlframe.base.types import Row  # import path may vary by sqlframe version

data = [
    {"country": "DE", "dan": 3456789, "title": Row(headline="haferflocki")},
]