Skip to content

union all +aggregate function in the recursive cte results an infinite loop #1131

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
l1t1 opened this issue May 21, 2025 · 2 comments
Open
Labels
bug Something isn't working

Comments

@l1t1
Copy link

l1t1 commented May 21, 2025

Describe the bug
union all +aggregate function in the recursive cte results an infinite loop

To Reproduce
Steps to reproduce the behavior:

import datafusion
from datafusion import SessionContext

ctx = SessionContext()
df = ctx.sql("""
with recursive t as(select value i,1 lv from generate_series(1,6,1)
union all
select max(i),max(lv)+1 from t where lv<2)
select * from t where lv=2;
""")

>>> print(df)

Expected behavior
it outputs

DataFrame()
+---+----+
| i | lv |
+---+----+
| 6 | 2  |
+---+----+

Additional context
Add any other context about the problem here.

@l1t1 l1t1 added the bug Something isn't working label May 21, 2025
@kosiew
Copy link
Contributor

kosiew commented May 23, 2025

hi @l1t1 ,

duckdb too goes into an infinite loop:

❯ duckdb
v1.1.3 19864453f7
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D WITH RECURSIVE t(i, lv) AS (
    SELECT generate_series AS i, 1 AS lv
      FROM generate_series(1, 6, 1)
    UNION ALL
    SELECT MAX(i), MAX(lv) + 1
      FROM t
     WHERE lv < 2
  )
  SELECT *
    FROM t
   WHERE lv = 2;

I think this isn’t a bug in DataFusion-Python so much as a quirk of SQL’s recursive-CTE semantics when you use UNION ALL with an aggregate that never changes its output. Any engine that follows the SQL standard will do the same. DuckDB likewise spins forever because the recursive member keeps re-emitting the same row, and UNION ALL does not remove duplicates, so the CTE never reaches a fixpoint.

Amending the query to UNION does complete in duckdb:

❯ duckdb
v1.1.3 19864453f7
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D WITH RECURSIVE t(i, lv) AS (
      SELECT generate_series AS i, 1 AS lv
        FROM generate_series(1, 6, 1)
      UNION
      SELECT MAX(i), MAX(lv) + 1
        FROM t
       WHERE lv < 2
    )
    SELECT *
      FROM t
     WHERE lv = 2;
┌───────┬───────┐
│   i   │  lv   │
│ int64 │ int32 │
├───────┼───────┤
│     6 │     2 │
└───────┴───────┘

But the same query is not yet supported in Datafusion:

>>> sql = """
... with recursive t as(select value i,1 lv from generate_series(1,6,1)
... union
... select max(i),max(lv)+1 from t where lv<2)
... select * from t where lv=2;
... """
>>> df = ctx.sql(sql)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/kosiew/GitHub/datafusion-python/python/datafusion/context.py", line 589, in sql
    return DataFrame(self.ctx.sql(query))
                     ^^^^^^^^^^^^^^^^^^^
Exception: DataFusion error: NotImplemented("Recursive queries with a distinct 'UNION' (in which the previous iteration's results will be de-duplicated) is not supported")

@l1t1
Copy link
Author

l1t1 commented May 24, 2025

@kosiew thank you, I learned it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants