-
Notifications
You must be signed in to change notification settings - Fork 23
Description
Column selections in generated SQL may be duplicated under the following conditions:
- There are shared column names between the DataFrames that have been joined. They need not be included in the join condition.
.withColumnRenamedis called multiple times on the DataFrame after joining.
Example
from sqlframe.standalone.session import StandaloneSession
spark = StandaloneSession()
df = spark.createDataFrame([{"foo": 1, "bar": "a", "baz": "a"}])
df2 = spark.createDataFrame([{"foo": 1, "baz": "a", "qux": "a"}]) # shares `foo` and `baz` columns in common with `df`
df_joined = (
df.join(df2, on=df.foo == df2.foo)
.withColumnRenamed("bar", "bar2")
.withColumnRenamed("qux", "qux2") # Calling `.withColumnRenamed` a second time after join
)
print(df_joined.sql())SELECT
CAST(`a1`.`foo` AS BIGINT) AS `foo`,
CAST(`a1`.`bar` AS STRING) AS `bar2`,
CAST(`a1`.`baz` AS STRING) AS `baz`,
CAST(`a1`.`foo` AS BIGINT) AS `foo`, -- selects from `a1` again
CAST(`a1`.`baz` AS STRING) AS `baz`, -- selects from `a1` again
CAST(`a2`.`qux` AS STRING) AS `qux2`
FROM VALUES
(1, 'a', 'a') AS `a1`(`foo`, `bar`, `baz`)
JOIN VALUES
(1, 'a', 'a') AS `a2`(`foo`, `baz`, `qux`)
ON CAST(`a1`.`foo` AS BIGINT) = CAST(`a2`.`foo` AS BIGINT)Notice that
fooandbazstatements are selected twice, both coming froma1and none froma2.
Expected
SELECT
CAST(`a1`.`foo` AS BIGINT) AS `foo`,
CAST(`a1`.`bar` AS STRING) AS `bar2`,
CAST(`a1`.`baz` AS STRING) AS `baz`,
CAST(`a2`.`foo` AS BIGINT) AS `foo`, -- selects from `a2`
CAST(`a2`.`baz` AS STRING) AS `baz`, -- selects from `a2`
CAST(`a2`.`qux` AS STRING) AS `qux2`
FROM VALUES
(1, 'a', 'a') AS `a1`(`foo`, `bar`, `baz`)
JOIN VALUES
(1, 'a', 'a') AS `a2`(`foo`, `baz`, `qux`)
ON CAST(`a1`.`foo` AS BIGINT) = CAST(`a2`.`foo` AS BIGINT)I understand that there are duplicate columns in this immediate result, and it would be a rather useless query on its own. But when attempting to select intended columns using aliases, this becomes problematic:
from sqlframe.standalone.session import StandaloneSession
spark = StandaloneSession()
df = spark.createDataFrame([{"foo": 1, "bar": "a", "baz": "a"}]).alias("df")
df2 = spark.createDataFrame([{"foo": 1, "baz": "a", "qux": "a"}]).alias("df2")
df_joined = (
df.join(df2, on=df.foo == df2.foo)
.withColumnRenamed("bar", "bar2")
.withColumnRenamed("qux", "qux2")
)
print(df_joined.select("df2.foo").sql()) # Select a column from `df2`sqlglot.errors.OptimizeError: Column '"t28660952"."foo"' could not be resolved for table: 't28660952'
Just confirming that it's fine when using the "df" alias:
print(df_joined.select("df.foo").sql())SELECT
CAST(`a1`.`foo` AS BIGINT) AS `df.foo`
FROM VALUES
(1, 'a', 'a') AS `a1`(`foo`, `bar`, `baz`)
JOIN VALUES
(1, 'a', 'a') AS `a2`(`foo`, `baz`, `qux`)
ON CAST(`a1`.`foo` AS BIGINT) = CAST(`a2`.`foo` AS BIGINT)Using withColumnsRenamed
The obvious solution is to use .withColumnsRenamed rather than .withColumnRenamed when there are multiple sequential renames like shown in the example:
from sqlframe.standalone.session import StandaloneSession
spark = StandaloneSession()
df = spark.createDataFrame([{"foo": 1, "bar": "a", "baz": "a"}])
df2 = spark.createDataFrame([{"foo": 1, "baz": "a", "qux": "a"}])
df_joined = df.join(df2, on=df.foo == df2.foo).withColumnsRenamed(
{"bar": "bar2", "qux": "qux2"}
)
print(df_joined.sql())SELECT
CAST(`a1`.`foo` AS BIGINT) AS `foo`,
CAST(`a1`.`bar` AS STRING) AS `bar2`,
CAST(`a1`.`baz` AS STRING) AS `baz`,
CAST(`a2`.`foo` AS BIGINT) AS `foo`,
CAST(`a2`.`baz` AS STRING) AS `baz`,
CAST(`a2`.`qux` AS STRING) AS `qux2`
FROM VALUES
(1, 'a', 'a') AS `a1`(`foo`, `bar`, `baz`)
JOIN VALUES
(1, 'a', 'a') AS `a2`(`foo`, `baz`, `qux`)
ON CAST(`a1`.`foo` AS BIGINT) = CAST(`a2`.`foo` AS BIGINT)However, that places this requirement on the developer to ensure they know to do this. Additionally, there may be legitimate cases where multiple .withColumnRenamed statements may occur, although not sequentially as shown in the example.
Without Optimizer
Also worth noting what the generated query looks like when using optimize=False:
# (uses the original example)
print(df_joined.sql(optimize=False))WITH `t60937266` AS (
SELECT
CAST(`foo` AS BIGINT) AS `foo`,
CAST(`bar` AS STRING) AS `bar`,
CAST(`baz` AS STRING) AS `baz`
FROM VALUES
(1, 'a', 'a') AS `a1`(`foo`, `bar`, `baz`)
), `t31938746` AS (
SELECT
CAST(`foo` AS BIGINT) AS `foo`,
CAST(`baz` AS STRING) AS `baz`,
CAST(`qux` AS STRING) AS `qux`
FROM VALUES
(1, 'a', 'a') AS `a2`(`foo`, `baz`, `qux`)
), `t47132488` AS (
SELECT
`t60937266`.`foo`,
`t60937266`.`bar` AS `bar2`,
`t60937266`.`baz`,
`t31938746`.`foo`, -- correctly selects from `a2`
`t31938746`.`baz`, -- correctly selects from `a2`
`t31938746`.`qux`
FROM `t60937266`
INNER JOIN `t31938746`
ON `t60937266`.`foo` = `t31938746`.`foo`
)
SELECT
`foo` AS `foo`,
`bar2` AS `bar2`,
`baz` AS `baz2`,
`foo` AS `foo`,
`baz` AS `baz2`,
`qux`
FROM `t47132488`While the t47132488 CTE gets the select statement right, the final select statement does not alias these shared columns. Calling .withColumnRenamed multiple times again contributes here, because here is what the non-optimized version looks like when only calling .withColumnRenamed once:
from sqlframe.standalone.session import StandaloneSession
spark = StandaloneSession()
df = spark.createDataFrame([{"foo": 1, "bar": "a", "baz": "a"}])
df2 = spark.createDataFrame([{"foo": 1, "baz": "a", "qux": "a"}])
df_joined = df.join(df2, on=df.foo == df2.foo).withColumnRenamed("bar", "bar2")
print(df_joined.sql(optimize=False))WITH `t60937266` AS (
SELECT
CAST(`foo` AS BIGINT) AS `foo`,
CAST(`bar` AS STRING) AS `bar`,
CAST(`baz` AS STRING) AS `baz`
FROM VALUES
(1, 'a', 'a') AS `a1`(`foo`, `bar`, `baz`)
), `t31938746` AS (
SELECT
CAST(`foo` AS BIGINT) AS `foo`,
CAST(`baz` AS STRING) AS `baz`,
CAST(`qux` AS STRING) AS `qux`
FROM VALUES
(1, 'a', 'a') AS `a2`(`foo`, `baz`, `qux`)
)
SELECT
`t60937266`.`foo` AS `foo`,
`t60937266`.`bar` AS `bar2`,
`t60937266`.`baz` AS `baz`,
`t31938746`.`foo` AS `foo`,
`t31938746`.`baz` AS `baz`,
`t31938746`.`qux`
FROM `t60937266`
INNER JOIN `t31938746`
ON `t60937266`.`foo` = `t31938746`.`foo`I will try to dig in further, but I wanted to report this while it was still fresh in my mind after encountering it directly.