feat(join): support `_.col.upper()` and `lambda left, right: ...` as predicates #10703

NickCrews · 2025-01-22T07:55:35Z

Is your feature request related to a problem?

I have these two tables, which I want to join regardless of capitalization. I want to be able to do this:

import ibis
tl = ibis.memtable({"f": ["a", "b", "c"]})
tr = ibis.memtable({"f": ["A", "B", "Z"]})
tl.join(tr, ibis._.f.upper())
# expect {"f": ["a", "b"], "f_right": ["A", "B"]}
# Currently get `SignatureValidationError: JoinLink(...)`

Currently I have to actually pass tl.f.upper() == tr.f.upper().

Another problem I have is reusing my predicates. I want to be able to do all of these:

def equalish(col: str):
    def pred(left: ibis.Table, right: ibis.Table):
        return (left[col] == right[col]) | (left[col].isnull() & right[col].isnull())
   return pred

preds = [
    "a",
    _.x.upper(),
    lambda t: t.y.abs(),
    lambda l, r: l.foo == r.bar,
    (_.baz, _.qux.upper()),
    equalish("x"),
    equalish("y"),
]
a.join(b, preds)
c.join(d, preds)

Currently, I have to reference the tables directly in the predicates, so I have to actually do the binding at right-before-join time, which prevents me from being able to reuse them.

def equalish(left, right, col):
    return (left[col] == right[col]) | (left[col].isnull() & right[col].isnull())

a.join(b, [equalish(a, b, "x"), equalish(a, b, "y")])
c.join(d, [equalish(c, d, "x"), equalish(c, d, "y")])

Describe the solution you'd like

See below

Code of Conduct

I agree to follow this project's Code of Conduct

The text was updated successfully, but these errors were encountered:

resolves ibis-project#10703

NickCrews · 2025-01-25T02:10:06Z

After @kszucs's comment, I think this needs more general discussion of the spec before I dive into an implementation.

After thinking about it and experimenting with current behavior, I have a more generalized proposal for how to adjust join conditions.

"Single" Values

Currently, if given something that is "singular", it should be resolved against each table individually, and then combined with ==. This is good behavior, I want to keep this, and just add the ability to pass _.col.

`lt.join(rt, "f")`

Currently works as expected. My proposals below are just generalizations of this.

`lt.join(rt, lambda t: t.f)`

Currently, works as expected, the same as lt.join(rt, "f")

After, it works the same.

`lt.join(rt, _.f)`

Currently, it fails (resolves against lt, giving lt.f, which isn't a BooleanColumn)

After, it resolves to lt.f == rt.f, same semantics of lt.join(rt, "f") More generally, we should support lt.join(rt, _.anything_that_join_supports)

If we change this behavior, then this would actually NOT break current usages of lt.join(rt, _.f == rt.f)!
Currently, this works by binding any Deferreds to the left table, resulting in (lt.f == rt.f).
After the change, after resolving against each table individually and then combining with ==, we would get a condition of
(lt.f == rt.f) == (rt.f == rt.f) which is the same as (lt.f == rt.f) == True which is the same as (lt.f == rt.f), which is the same as the current behavior! Double check my algebra/logic here, I might be wrong. We could add a test before the change and ensure that it doesn't start failing.

However, a "better" way of doing this, that I think we should actually document, is lt.join(rt, lambda l, r: l.f == r.f), see below. Even though lt.join(rt, _.f == rt.f) "works", I don't like it: It feels arbitrary how it binds to the left table, not the right. Also, I don't see how it is that useful, since you still need the direct reference to rt, which makes it so you can't reuse it in other join conditions. Since you are already hardcoding in the rt in there, it's not that much more of a pain to also hardcode in the lt.

"Pair" Values

We already support this with 2-tuples eg lt.join(rt, [("f", _.f + 1), (_.f.abs(), _.f + 1)]).
I think we should extend this to lambdas.

`lt.join(rt, lambda l, r: l.f == r.f)`

Currently, it fails with TypeError: <lambda>() missing 1 required positional argument: 'r'
After, it works. This is the preferred new spelling of lt.join(rt, _.f == rt.f)

In all of these, we should probably support recursion. eg lt.join(rt, lambda l, r: (lambda y: _.f)) should work.
It looks like currently lt.join(rt, lambda x: (lambda y: y.f)) works as expected, so maybe this wouldn't actually be a huge lift.

kszucs · 2025-01-28T13:37:16Z

Don't want to further complicate it, but deferreds actually support multiple placeholders, not just _, like:

l, r = var("l"), var("r")

left_table.join(right_table, l.field_a == r.field_a && r.field_b == l.field_b)

NickCrews · 2025-01-28T20:05:02Z

Oh wow, I didn't know about that. That's not a part of the public API though, right? Are you saying there is a way to use this API to support one of the use cases I describe above?

Am I correct here: It requires the relation name to match the var name, eg t.name("left").join(t2.name("right"), var("x").field_g == var("y").field_h) would fail? If it does require a match, then this would not allow for reusing predicates between diffferent joins.

NickCrews added the feature Features or general enhancements label Jan 22, 2025

github-project-automation bot added this to Ibis planning and roadmap Jan 22, 2025

github-project-automation bot moved this to backlog in Ibis planning and roadmap Jan 22, 2025

NickCrews added a commit to NickCrews/ibis that referenced this issue Jan 22, 2025

feat(api): support passing Deferreds as predicates to joins

7617e9a

resolves ibis-project#10703

NickCrews linked a pull request Jan 22, 2025 that will close this issue

feat(api): support passing Deferreds as predicates to joins #10704

Draft

NickCrews added a commit to NickCrews/ibis that referenced this issue Jan 22, 2025

feat(api): support passing Deferreds as predicates to joins

5718c7e

resolves ibis-project#10703

NickCrews mentioned this issue Jan 26, 2025

feat: support lambda left, right: (left.col == right.col) | (left.col.isnull() & right.col.isnull()) as predicate in joins #10705

Closed

1 task

NickCrews changed the title ~~feat: support Deferreds as predicates to joins~~ feat(join): support Deferreds and lambda left, right: ... as predicates Jan 26, 2025

NickCrews changed the title ~~feat(join): support Deferreds and lambda left, right: ... as predicates~~ feat(join): support _.col.upper() and lambda left, right: ... as predicates Jan 26, 2025

NickCrews marked this as a duplicate of #10705 Jan 26, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(join): support `_.col.upper()` and `lambda left, right: ...` as predicates #10703

feat(join): support `_.col.upper()` and `lambda left, right: ...` as predicates #10703

NickCrews commented Jan 22, 2025 •

edited

Loading

NickCrews commented Jan 25, 2025 •

edited

Loading

kszucs commented Jan 28, 2025

NickCrews commented Jan 28, 2025

feat(join): support _.col.upper() and lambda left, right: ... as predicates #10703

feat(join): support _.col.upper() and lambda left, right: ... as predicates #10703

Comments

NickCrews commented Jan 22, 2025 • edited Loading

Is your feature request related to a problem?

Describe the solution you'd like

Code of Conduct

NickCrews commented Jan 25, 2025 • edited Loading

"Single" Values

lt.join(rt, "f")

lt.join(rt, lambda t: t.f)

lt.join(rt, _.f)

"Pair" Values

lt.join(rt, lambda l, r: l.f == r.f)

kszucs commented Jan 28, 2025

NickCrews commented Jan 28, 2025

feat(join): support `_.col.upper()` and `lambda left, right: ...` as predicates #10703

feat(join): support `_.col.upper()` and `lambda left, right: ...` as predicates #10703

NickCrews commented Jan 22, 2025 •

edited

Loading

NickCrews commented Jan 25, 2025 •

edited

Loading

`lt.join(rt, "f")`

`lt.join(rt, lambda t: t.f)`

`lt.join(rt, _.f)`

`lt.join(rt, lambda l, r: l.f == r.f)`