Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Creating tables with aggregation with automatically named columns results in unnamed columns error. #2175

Open
2 tasks done
cottrell opened this issue Mar 14, 2023 · 6 comments

Comments

@cottrell
Copy link

What happened?

This one compiles without error

let A = (
  from invoices
  filter invoice_date >= @1970-01-16
  derive [                        # This adds columns
    transaction_fees = 0.8,
    income = total - transaction_fees  # Columns can use other columns
  ]
  filter income > 1     # Transforms can be repeated.
  group customer_id (   # Use a nested pipeline on each group
    aggregate [         # Aggregate each group to a single row
      count = count,
    ]
  )
)

from A

But the below PRQL input does not.

Some minor name thing probably.

PRQL input

let A = (
  from invoices
  filter invoice_date >= @1970-01-16
  derive [                        # This adds columns
    transaction_fees = 0.8,
    income = total - transaction_fees  # Columns can use other columns
  ]
  filter income > 1     # Transforms can be repeated.
  group customer_id (   # Use a nested pipeline on each group
    aggregate [         # Aggregate each group to a single row
      count,
    ]
  )
)

from A

SQL output

Error: 
    ╭─[:16:6]
    │
 16from A
    ·      ┬  
    ·      ╰── This table contains unnamed columns that need to be referenced by name
    · 
    · Help: The name may have been overridden later in the pipeline.
────╯

Expected SQL output

WITH table_1 AS (
  SELECT
    customer_id,
    total - 0.8 AS _expr_0
  FROM
    invoices
  WHERE
    invoice_date >= DATE '1970-01-16'
),
"A" AS (
  SELECT
    customer_id,
    COUNT(*) AS count
  FROM
    table_1 AS table_0
  WHERE
    _expr_0 > 1
  GROUP BY
    customer_id
)
SELECT
  customer_id,
  count
FROM
  "A"

-- Generated by PRQL compiler version:0.6.1 (https://prql-lang.org)

MVCE confirmation

  • Minimal example
  • New issue

Anything else?

No response

@cottrell cottrell added the bug Invalid compiler output or panic label Mar 14, 2023
@max-sixty
Copy link
Member

This is surprising, thanks a lot @cottrell ...

Here's an even simpler example:

let a = (
  from invoices
  group customer_id (
    aggregate [count]
  )
)

from a
Error: 
   ╭─[:8:6]
   │
 8 │ from a
   ·      ┬  
   ·      ╰── This table contains unnamed columns that need to be referenced by name
   · 
   · Help: The name may have been overridden later in the pipeline.
───╯

@aljazerzen
Copy link
Member

I think this message is quite self-explanatory, but as its author, of course I'm biased. Let me elaborate:

Definition of the table compiles to this:

WITH a AS (
    SELECT customer_id, COUNT(*) FROM invoices GROUP BY customer_id
)

If you try to write the main SELECT, you get something like:

WITH ...
SELECT customer_id, ? FROM a

... but because the second column doesn't have a name, you cannot reference it!

(a partial solution would be to use a SELECT * here, but you cannot do that if you append select ![customer_id] to the main query)

How should we change the error to make this more obvious?

@max-sixty
Copy link
Member

(a partial solution would be to use a SELECT * here, but you cannot do that if you append select ![customer_id] to the main query)

Just from the perspective of the result, without thinking at all about how it's built:

@aljazerzen
Copy link
Member

Ok, I must admit that for this specific case, it could produce a *.

@cottrell
Copy link
Author

I can barely remember this one now but it seems to me like it might simplify to do away with the convenience of "nameless" or "implicitly named" aggregations? i.e. "count = count" not "count". I'm not sure if it was more complicated than that. More explicit and less surface area is usually better in the long run.

@aljazerzen
Copy link
Member

Automatic name inference can be tricky. In this case it seems obvious that the inferred named should be count, but we have had different suggestions (that I cannot find now):

  • infer column name from function name:

    from orders
    aggregate [count] # column named `count`
    

    but also:

    from orders
    select [lag 1 created_at] # column named `lag`
    
  • infer column name from the only agg function argument:

    from orders
    aggregate [sum total, sum discount_percent]
    # columns `total` and `discount_percent`
    

There is also an option to combine these two rules, but I'm hesitant to choose something that would be too complicated. I don't want something unpredictable that people would rely on, but would then change when the function definition is changed.

SQL engines each have their own rules around this so there is no option to say "let's just do what most SQL engines do".

(this is a bit off topic, let's create a new issue for more discussion)

@aljazerzen aljazerzen added compiler and removed bug Invalid compiler output or panic labels Apr 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants