BigQuery backend emits Spark-style MAP() for dict columns, causing Function not found: MAP errors #555

@kaojo

Description

Summary

The BigQuery backend in SQLFrame generates the Spark-style MAP(...) constructor for Python dict columns, which BigQuery does not support. As a result, queries like CREATE OR REPLACE TABLE ... SELECT MAP(['headline'], ['haferflocki']) AS title fail with Function not found: MAP.

Reproduction example (repo)

https://github.com/kaojo/sql_frame_issue_repro

Minimal reproduction (code)

# Python 3.11+ (tested with 3.13)
# pip install sqlframe[bigquery] google-cloud-bigquery

from sqlframe.bigquery import BigQuerySession

# Adjust these to a project/dataset you have access to. The dataset must exist.
PROJECT_ID = "your-gcp-project"
DATASET = "your_dataset"
TABLE = f"{PROJECT_ID}.{DATASET}.repro_map_issue"

# Create a session that points to the dataset (database in SQLFrame terms)
session = BigQuerySession()

# Input data includes a Python dict (nested) column `title`.
# On Spark-like dialects this becomes a MAP/STRUCT; SQLFrame emits MAP() here.
data = [
    {"country": "DE", "dan": 3456789, "title": {"headline": "haferflocki"}},
    {"country": "DE", "dan": 3456781, "title": {"headline": "erdbaerkaese"}},
]

# Create DataFrame
df = session.createDataFrame(data)

# Repro path A: writing to table (CTAS) -> always fails with BigQuery error
# Expected: either a STRUCT or JSON-compatible expression
# Actual: BigQuery error: Function not found: MAP
try:
    df.write.mode("overwrite").saveAsTable(TABLE)
except Exception as e:
    print("saveAsTable failed:", e)

# Repro path B: even collecting (no write) can trigger a query with MAP(...)
try:
    df.collect()
except Exception as e:
    print("collect failed:", e)

Actual Error

Observed on df.write.mode("overwrite").saveAsTable(TABLE):

google.cloud.bigquery.dbapi.exceptions.DatabaseError: 400 ... Function not found: MAP; Did you mean max? at [1:...]

The formatted SQL (from the stack trace) includes segments like:

CREATE OR REPLACE TABLE `project.dataset.repro_map_issue` AS
SELECT ... MAP(['headline'], ['haferflocki']) AS `title` ...

Expected Behavior

  • SQLFrame BigQuery dialect should not emit MAP(...) (Spark SQL function) because BigQuery does not implement it.
  • For a Python dict column, the BigQuery dialect should serialize to a BigQuery-compatible type/expression, such as:
    • STRUCT('haferflocki' AS headline) for a typed struct column; or
    • a JSON column using TO_JSON(STRUCT(...)) / JSON '{"headline":"haferflocki"}' when using BigQuery JSON type.
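
For illustration, here is a minimal sketch of how a flat Python dict could be rendered as a BigQuery STRUCT literal. The helper name and approach are hypothetical (mine, not SQLFrame's actual compiler code), shown only to make the expected output concrete:

```python
import json

def dict_to_bq_struct(d: dict) -> str:
    # Hypothetical helper: render a flat Python dict as a BigQuery
    # STRUCT literal, e.g. {"headline": "haferflocki"} ->
    # STRUCT("haferflocki" AS headline).
    # json.dumps handles quoting/escaping of string values; BigQuery
    # accepts double-quoted string literals in Standard SQL.
    fields = ", ".join(
        f"{json.dumps(value)} AS {key}" for key, value in d.items()
    )
    return f"STRUCT({fields})"

print(dict_to_bq_struct({"headline": "haferflocki"}))
# STRUCT("haferflocki" AS headline)
```

A real fix inside the dialect would also need to handle nested dicts (recursive STRUCTs) and decide between STRUCT and the JSON type, but the shape of the emitted SQL would look like the above.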

Why This Matters

  • Writing meaningful tests with SQLFrame is not possible when nested/dict fields are present in the input data.

Environment

  • SQLFrame: 3.43.8 (latest release as of Nov 2025)
  • Python: 3.13.5
  • google-cloud-bigquery: 3.x
  • Backend: BigQuery Standard SQL

Suggested Fix (for SQLFrame)

  • In the BigQuery dialect/layer that compiles Python dict literals or map-typed expressions, replace MAP(keys, values) emission with a BigQuery-compatible representation:
CREATE OR REPLACE TABLE `your_project_id`.your_dataset_id.your_table  AS 
SELECT
  a1.country AS country,
  a1.dan AS dan,
  STRUCT((
      SELECT
        value
      FROM
        UNNEST(a1.title)
      WHERE
        key = 'headline'
    ) AS headline) AS title
FROM
  UNNEST(ARRAY[STRUCT('DE' AS country, 3456789 AS dan, ARRAY[STRUCT('headline' AS key, 'haferflocki' AS value)] AS title),
  STRUCT('DE' AS country, 3456781 AS dan, ARRAY[STRUCT('headline' AS key, 'erdbaerkaese' AS value)] AS title),
  STRUCT('AT' AS country, 3456789 AS dan, ARRAY[STRUCT('headline' AS key, 'haferflocki' AS value)] AS title),
  STRUCT('AT' AS country, 3456781 AS dan, ARRAY[STRUCT('headline' AS key, 'erdbaerkaese' AS value)] AS title)]) AS a1;

Notes

  • The failure is reproducible even with a single-row createDataFrame containing a dict.
  • The error also appears when calling .collect() without writing, because the generated SELECT still uses MAP(...).

Workaround

Define your data instead like this:

data = [
    {"country": "DE", "dan": 3456789, "title": Row(headline="haferflocki")},
]
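
To avoid hand-converting every nested field, the workaround can be generalized: recursively replace dicts with Row objects before calling createDataFrame. This is a sketch assuming a PySpark-style Row constructor (keyword arguments); the helper name is mine, and where Row is imported from depends on your SQLFrame version:

```python
def deep_rows(obj, row_factory):
    # Recursively replace dicts with row_factory(**fields) so the
    # BigQuery dialect emits STRUCT(...) instead of the unsupported
    # MAP(...). Pass SQLFrame's Row class as row_factory.
    if isinstance(obj, dict):
        return row_factory(
            **{k: deep_rows(v, row_factory) for k, v in obj.items()}
        )
    if isinstance(obj, list):
        return [deep_rows(v, row_factory) for v in obj]
    return obj

# Usage (Row imported from your SQLFrame installation):
# data = [{"country": "DE", "dan": 3456789,
#          "title": deep_rows({"headline": "haferflocki"}, Row)}]
```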
