Skip to content

Same column being selected multiple times when there are shared column names and multiple .withColumnRenamed statements #549

@themattmorris

Description

@themattmorris

Column selections in generated SQL may be duplicated under the following conditions:

  • There are shared column names between the DataFrames that have been joined. They need not be included in the join condition.
  • .withColumnRenamed is called multiple times on the DataFrame after joining.

Example

from sqlframe.standalone.session import StandaloneSession


spark = StandaloneSession()


df = spark.createDataFrame([{"foo": 1, "bar": "a", "baz": "a"}])
df2 = spark.createDataFrame([{"foo": 1, "baz": "a", "qux": "a"}])  # shares `foo` and `baz` columns in common with `df`
df_joined = (
    df.join(df2, on=df.foo == df2.foo)
    .withColumnRenamed("bar", "bar2")
    .withColumnRenamed("qux", "qux2")  # Calling `.withColumnRenamed` a second time after join
)
print(df_joined.sql())
SELECT
  CAST(`a1`.`foo` AS BIGINT) AS `foo`,
  CAST(`a1`.`bar` AS STRING) AS `bar2`,
  CAST(`a1`.`baz` AS STRING) AS `baz`,
  CAST(`a1`.`foo` AS BIGINT) AS `foo`,  -- selects from `a1` again
  CAST(`a1`.`baz` AS STRING) AS `baz`,  -- selects from `a1` again
  CAST(`a2`.`qux` AS STRING) AS `qux2`
FROM VALUES
  (1, 'a', 'a') AS `a1`(`foo`, `bar`, `baz`)
JOIN VALUES
  (1, 'a', 'a') AS `a2`(`foo`, `baz`, `qux`)
  ON CAST(`a1`.`foo` AS BIGINT) = CAST(`a2`.`foo` AS BIGINT)

Notice that foo and baz statements are selected twice, both coming from a1 and none from a2.

Expected

SELECT
  CAST(`a1`.`foo` AS BIGINT) AS `foo`,
  CAST(`a1`.`bar` AS STRING) AS `bar2`,
  CAST(`a1`.`baz` AS STRING) AS `baz`,
  CAST(`a2`.`foo` AS BIGINT) AS `foo`,  -- selects from `a2`
  CAST(`a2`.`baz` AS STRING) AS `baz`,  -- selects from `a2`
  CAST(`a2`.`qux` AS STRING) AS `qux2`
FROM VALUES
  (1, 'a', 'a') AS `a1`(`foo`, `bar`, `baz`)
JOIN VALUES
  (1, 'a', 'a') AS `a2`(`foo`, `baz`, `qux`)
  ON CAST(`a1`.`foo` AS BIGINT) = CAST(`a2`.`foo` AS BIGINT)

I understand that there are duplicate columns in this immediate result, and it would be a rather useless query on its own. But when attempting to select intended columns using aliases, this becomes problematic:

from sqlframe.standalone.session import StandaloneSession


spark = StandaloneSession()


df = spark.createDataFrame([{"foo": 1, "bar": "a", "baz": "a"}]).alias("df")
df2 = spark.createDataFrame([{"foo": 1, "baz": "a", "qux": "a"}]).alias("df2")
df_joined = (
    df.join(df2, on=df.foo == df2.foo)
    .withColumnRenamed("bar", "bar2")
    .withColumnRenamed("qux", "qux2")
)


print(df_joined.select("df2.foo").sql())  # Select a column from `df2`
sqlglot.errors.OptimizeError: Column '"t28660952"."foo"' could not be resolved for table: 't28660952'

Just confirming that it's fine when using the "df" alias:

print(df_joined.select("df.foo").sql())
SELECT
  CAST(`a1`.`foo` AS BIGINT) AS `df.foo`
FROM VALUES
  (1, 'a', 'a') AS `a1`(`foo`, `bar`, `baz`)
JOIN VALUES
  (1, 'a', 'a') AS `a2`(`foo`, `baz`, `qux`)
  ON CAST(`a1`.`foo` AS BIGINT) = CAST(`a2`.`foo` AS BIGINT)

Using withColumnsRenamed

The obvious solution is to use .withColumnsRenamed rather than .withColumnRenamed when there are multiple sequential renames like shown in the example:

from sqlframe.standalone.session import StandaloneSession


spark = StandaloneSession()


df = spark.createDataFrame([{"foo": 1, "bar": "a", "baz": "a"}])
df2 = spark.createDataFrame([{"foo": 1, "baz": "a", "qux": "a"}])
df_joined = df.join(df2, on=df.foo == df2.foo).withColumnsRenamed(
    {"bar": "bar2", "qux": "qux2"}
)


print(df_joined.sql())
SELECT
  CAST(`a1`.`foo` AS BIGINT) AS `foo`,
  CAST(`a1`.`bar` AS STRING) AS `bar2`,
  CAST(`a1`.`baz` AS STRING) AS `baz`,
  CAST(`a2`.`foo` AS BIGINT) AS `foo`,
  CAST(`a2`.`baz` AS STRING) AS `baz`,
  CAST(`a2`.`qux` AS STRING) AS `qux2`
FROM VALUES
  (1, 'a', 'a') AS `a1`(`foo`, `bar`, `baz`)
JOIN VALUES
  (1, 'a', 'a') AS `a2`(`foo`, `baz`, `qux`)
  ON CAST(`a1`.`foo` AS BIGINT) = CAST(`a2`.`foo` AS BIGINT)

However, that places this requirement on the developer to ensure they know to do this. Additionally, there may be legitimate cases where multiple .withColumnRenamed statements may occur, although not sequentially as shown in the example.

Without Optimizer

Also worth noting what the generated query looks like when using optimize=False:

# (uses the original example)
print(df_joined.sql(optimize=False))
WITH `t60937266` AS (
  SELECT
    CAST(`foo` AS BIGINT) AS `foo`,
    CAST(`bar` AS STRING) AS `bar`,
    CAST(`baz` AS STRING) AS `baz`
  FROM VALUES
    (1, 'a', 'a') AS `a1`(`foo`, `bar`, `baz`)
), `t31938746` AS (
  SELECT
    CAST(`foo` AS BIGINT) AS `foo`,
    CAST(`baz` AS STRING) AS `baz`,
    CAST(`qux` AS STRING) AS `qux`
  FROM VALUES
    (1, 'a', 'a') AS `a2`(`foo`, `baz`, `qux`)
), `t47132488` AS (
  SELECT
    `t60937266`.`foo`,
    `t60937266`.`bar` AS `bar2`,
    `t60937266`.`baz`,
    `t31938746`.`foo`,  -- correctly selects from `a2`
    `t31938746`.`baz`,  -- correctly selects from `a2`
    `t31938746`.`qux`
  FROM `t60937266`
  INNER JOIN `t31938746`
    ON `t60937266`.`foo` = `t31938746`.`foo`
)
SELECT
  `foo` AS `foo`,
  `bar2` AS `bar2`,
  `baz` AS `baz2`,
  `foo` AS `foo`,
  `baz` AS `baz2`,
  `qux`
FROM `t47132488`

While the t47132488 CTE gets the select statement right, the final select statement does not alias these shared columns. Calling .withColumnRenamed multiple times again contributes here, because here is what the non-optimized version looks like when only calling .withColumnRenamed once:

from sqlframe.standalone.session import StandaloneSession


spark = StandaloneSession()


df = spark.createDataFrame([{"foo": 1, "bar": "a", "baz": "a"}])
df2 = spark.createDataFrame([{"foo": 1, "baz": "a", "qux": "a"}])
df_joined = df.join(df2, on=df.foo == df2.foo).withColumnRenamed("bar", "bar2")
print(df_joined.sql(optimize=False))
WITH `t60937266` AS (
  SELECT
    CAST(`foo` AS BIGINT) AS `foo`,
    CAST(`bar` AS STRING) AS `bar`,
    CAST(`baz` AS STRING) AS `baz`
  FROM VALUES
    (1, 'a', 'a') AS `a1`(`foo`, `bar`, `baz`)
), `t31938746` AS (
  SELECT
    CAST(`foo` AS BIGINT) AS `foo`,
    CAST(`baz` AS STRING) AS `baz`,
    CAST(`qux` AS STRING) AS `qux`
  FROM VALUES
    (1, 'a', 'a') AS `a2`(`foo`, `baz`, `qux`)
)
SELECT
  `t60937266`.`foo` AS `foo`,
  `t60937266`.`bar` AS `bar2`,
  `t60937266`.`baz` AS `baz`,
  `t31938746`.`foo` AS `foo`,
  `t31938746`.`baz` AS `baz`,
  `t31938746`.`qux`
FROM `t60937266`
INNER JOIN `t31938746`
  ON `t60937266`.`foo` = `t31938746`.`foo`

I will try to dig in further, but I wanted to report this while it was still fresh in my mind after encountering it directly.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions