Same column being selected multiple times when there are shared column names and multiple `.withColumnRenamed` statements

Column selections in generated SQL may be duplicated under the following conditions:

* There are shared column names between the DataFrames that have been joined. They need not be included in the join condition.
* `.withColumnRenamed` is called multiple times on the DataFrame after joining.

## Example

```python
from sqlframe.standalone.session import StandaloneSession


spark = StandaloneSession()


df = spark.createDataFrame([{"foo": 1, "bar": "a", "baz": "a"}])
df2 = spark.createDataFrame([{"foo": 1, "baz": "a", "qux": "a"}])  # shares `foo` and `baz` columns in common with `df`
df_joined = (
    df.join(df2, on=df.foo == df2.foo)
    .withColumnRenamed("bar", "bar2")
    .withColumnRenamed("qux", "qux2")  # Calling `.withColumnRenamed` a second time after join
)
print(df_joined.sql())
```

```sql
SELECT
  CAST(`a1`.`foo` AS BIGINT) AS `foo`,
  CAST(`a1`.`bar` AS STRING) AS `bar2`,
  CAST(`a1`.`baz` AS STRING) AS `baz`,
  CAST(`a1`.`foo` AS BIGINT) AS `foo`,  -- selects from `a1` again
  CAST(`a1`.`baz` AS STRING) AS `baz`,  -- selects from `a1` again
  CAST(`a2`.`qux` AS STRING) AS `qux2`
FROM VALUES
  (1, 'a', 'a') AS `a1`(`foo`, `bar`, `baz`)
JOIN VALUES
  (1, 'a', 'a') AS `a2`(`foo`, `baz`, `qux`)
  ON CAST(`a1`.`foo` AS BIGINT) = CAST(`a2`.`foo` AS BIGINT)
```

> Notice that `foo` and `baz` statements are selected twice, both coming from `a1` and none from `a2`.

## Expected

```sql
SELECT
  CAST(`a1`.`foo` AS BIGINT) AS `foo`,
  CAST(`a1`.`bar` AS STRING) AS `bar2`,
  CAST(`a1`.`baz` AS STRING) AS `baz`,
  CAST(`a2`.`foo` AS BIGINT) AS `foo`,  -- selects from `a2`
  CAST(`a2`.`baz` AS STRING) AS `baz`,  -- selects from `a2`
  CAST(`a2`.`qux` AS STRING) AS `qux2`
FROM VALUES
  (1, 'a', 'a') AS `a1`(`foo`, `bar`, `baz`)
JOIN VALUES
  (1, 'a', 'a') AS `a2`(`foo`, `baz`, `qux`)
  ON CAST(`a1`.`foo` AS BIGINT) = CAST(`a2`.`foo` AS BIGINT)
```

I understand that there are duplicate columns in this immediate result, and it would be a rather useless query on its own. But when attempting to select intended columns using aliases, this becomes problematic:

```python
from sqlframe.standalone.session import StandaloneSession


spark = StandaloneSession()


df = spark.createDataFrame([{"foo": 1, "bar": "a", "baz": "a"}]).alias("df")
df2 = spark.createDataFrame([{"foo": 1, "baz": "a", "qux": "a"}]).alias("df2")
df_joined = (
    df.join(df2, on=df.foo == df2.foo)
    .withColumnRenamed("bar", "bar2")
    .withColumnRenamed("qux", "qux2")
)


print(df_joined.select("df2.foo").sql())  # Select a column from `df2`
```

```
sqlglot.errors.OptimizeError: Column '"t28660952"."foo"' could not be resolved for table: 't28660952'
```

Just confirming that it's fine when using the "df" alias:

```python
print(df_joined.select("df.foo").sql())
```

```sql
SELECT
  CAST(`a1`.`foo` AS BIGINT) AS `df.foo`
FROM VALUES
  (1, 'a', 'a') AS `a1`(`foo`, `bar`, `baz`)
JOIN VALUES
  (1, 'a', 'a') AS `a2`(`foo`, `baz`, `qux`)
  ON CAST(`a1`.`foo` AS BIGINT) = CAST(`a2`.`foo` AS BIGINT)
```

## Using `withColumnsRenamed`

The obvious solution is to use `.withColumnsRenamed` rather than `.withColumnRenamed` when there are multiple sequential renames like shown in the example:

```python
from sqlframe.standalone.session import StandaloneSession


spark = StandaloneSession()


df = spark.createDataFrame([{"foo": 1, "bar": "a", "baz": "a"}])
df2 = spark.createDataFrame([{"foo": 1, "baz": "a", "qux": "a"}])
df_joined = df.join(df2, on=df.foo == df2.foo).withColumnsRenamed(
    {"bar": "bar2", "qux": "qux2"}
)


print(df_joined.sql())
```

```sql
SELECT
  CAST(`a1`.`foo` AS BIGINT) AS `foo`,
  CAST(`a1`.`bar` AS STRING) AS `bar2`,
  CAST(`a1`.`baz` AS STRING) AS `baz`,
  CAST(`a2`.`foo` AS BIGINT) AS `foo`,
  CAST(`a2`.`baz` AS STRING) AS `baz`,
  CAST(`a2`.`qux` AS STRING) AS `qux2`
FROM VALUES
  (1, 'a', 'a') AS `a1`(`foo`, `bar`, `baz`)
JOIN VALUES
  (1, 'a', 'a') AS `a2`(`foo`, `baz`, `qux`)
  ON CAST(`a1`.`foo` AS BIGINT) = CAST(`a2`.`foo` AS BIGINT)
```

However, that places this requirement on the developer to ensure they know to do this. Additionally, there may be legitimate cases where multiple `.withColumnRenamed` statements may occur, although not sequentially as shown in the example.

## Without Optimizer

Also worth noting what the generated query looks like when using `optimize=False`:

```python
# (uses the original example)
print(df_joined.sql(optimize=False))
```

```sql
WITH `t60937266` AS (
  SELECT
    CAST(`foo` AS BIGINT) AS `foo`,
    CAST(`bar` AS STRING) AS `bar`,
    CAST(`baz` AS STRING) AS `baz`
  FROM VALUES
    (1, 'a', 'a') AS `a1`(`foo`, `bar`, `baz`)
), `t31938746` AS (
  SELECT
    CAST(`foo` AS BIGINT) AS `foo`,
    CAST(`baz` AS STRING) AS `baz`,
    CAST(`qux` AS STRING) AS `qux`
  FROM VALUES
    (1, 'a', 'a') AS `a2`(`foo`, `baz`, `qux`)
), `t47132488` AS (
  SELECT
    `t60937266`.`foo`,
    `t60937266`.`bar` AS `bar2`,
    `t60937266`.`baz`,
    `t31938746`.`foo`,  -- correctly selects from `a2`
    `t31938746`.`baz`,  -- correctly selects from `a2`
    `t31938746`.`qux`
  FROM `t60937266`
  INNER JOIN `t31938746`
    ON `t60937266`.`foo` = `t31938746`.`foo`
)
SELECT
  `foo` AS `foo`,
  `bar2` AS `bar2`,
  `baz` AS `baz2`,
  `foo` AS `foo`,
  `baz` AS `baz2`,
  `qux`
FROM `t47132488`
```

While the `t47132488` CTE gets the select statement right, the final select statement does not alias these shared columns. Calling `.withColumnRenamed` multiple times again contributes here, because here is what the non-optimized version looks like when only calling `.withColumnRenamed` once:

```python
from sqlframe.standalone.session import StandaloneSession


spark = StandaloneSession()


df = spark.createDataFrame([{"foo": 1, "bar": "a", "baz": "a"}])
df2 = spark.createDataFrame([{"foo": 1, "baz": "a", "qux": "a"}])
df_joined = df.join(df2, on=df.foo == df2.foo).withColumnRenamed("bar", "bar2")
print(df_joined.sql(optimize=False))
```

```sql
WITH `t60937266` AS (
  SELECT
    CAST(`foo` AS BIGINT) AS `foo`,
    CAST(`bar` AS STRING) AS `bar`,
    CAST(`baz` AS STRING) AS `baz`
  FROM VALUES
    (1, 'a', 'a') AS `a1`(`foo`, `bar`, `baz`)
), `t31938746` AS (
  SELECT
    CAST(`foo` AS BIGINT) AS `foo`,
    CAST(`baz` AS STRING) AS `baz`,
    CAST(`qux` AS STRING) AS `qux`
  FROM VALUES
    (1, 'a', 'a') AS `a2`(`foo`, `baz`, `qux`)
)
SELECT
  `t60937266`.`foo` AS `foo`,
  `t60937266`.`bar` AS `bar2`,
  `t60937266`.`baz` AS `baz`,
  `t31938746`.`foo` AS `foo`,
  `t31938746`.`baz` AS `baz`,
  `t31938746`.`qux`
FROM `t60937266`
INNER JOIN `t31938746`
  ON `t60937266`.`foo` = `t31938746`.`foo`
```

I will try to dig in further, but I wanted to report this while it was still fresh in my mind after encountering it directly.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Same column being selected multiple times when there are shared column names and multiple `.withColumnRenamed` statements #549

Example

Expected

Using `withColumnsRenamed`

Without Optimizer

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Same column being selected multiple times when there are shared column names and multiple .withColumnRenamed statements #549

Description

Example

Expected

Using withColumnsRenamed

Without Optimizer

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

Same column being selected multiple times when there are shared column names and multiple `.withColumnRenamed` statements #549

Using `withColumnsRenamed`