Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(join): support _.col.upper() and lambda left, right: ... as predicates #10703

Open
1 task done
NickCrews opened this issue Jan 22, 2025 · 3 comments · May be fixed by #10704
Open
1 task done

feat(join): support _.col.upper() and lambda left, right: ... as predicates #10703

NickCrews opened this issue Jan 22, 2025 · 3 comments · May be fixed by #10704
Labels
feature Features or general enhancements

Comments

@NickCrews
Copy link
Contributor

NickCrews commented Jan 22, 2025

Is your feature request related to a problem?

I have these two tables, which I want to join regardless of capitalization. I want to be able to do this:

import ibis
tl = ibis.memtable({"f": ["a", "b", "c"]})
tr = ibis.memtable({"f": ["A", "B", "Z"]})
tl.join(tr, ibis._.f.upper())
# expect {"f": ["a", "b"], "f_right": ["A", "B"]}
# Currently get `SignatureValidationError: JoinLink(...)`

Currently I have to actually pass tl.f.upper() == tr.f.upper().

Another problem I have is reusing my predicates. I want to be able to do all of these:

def equalish(col: str):
    def pred(left: ibis.Table, right: ibis.Table):
        return (left[col] == right[col]) | (left[col].isnull() & right[col].isnull())
   return pred

preds = [
    "a",
    _.x.upper(),
    lambda t: t.y.abs(),
    lambda l, r: l.foo == r.bar,
    (_.baz, _.qux.upper()),
    equalish("x"),
    equalish("y"),
]
a.join(b, preds)
c.join(d, preds)

Currently, I have to reference the tables directly in the predicates, so I have to actually do the binding at right-before-join time, which prevents me from being able to reuse them.

def equalish(left, right, col):
    return (left[col] == right[col]) | (left[col].isnull() & right[col].isnull())

a.join(b, [equalish(a, b, "x"), equalish(a, b, "y")])
c.join(d, [equalish(c, d, "x"), equalish(c, d, "y")])

Describe the solution you'd like

See below

Code of Conduct

  • I agree to follow this project's Code of Conduct
@NickCrews NickCrews added the feature Features or general enhancements label Jan 22, 2025
NickCrews added a commit to NickCrews/ibis that referenced this issue Jan 22, 2025
NickCrews added a commit to NickCrews/ibis that referenced this issue Jan 22, 2025
@NickCrews
Copy link
Contributor Author

NickCrews commented Jan 25, 2025

After @kszucs's comment, I think this needs more general discussion of the spec before I dive into an implementation.

After thinking about it and experimenting with current behavior, I have a more generalized proposal for how to adjust join conditions.

"Single" Values

Currently, if given something that is "singular", it should be resolved against each table individually, and then combined with ==. This is good behavior, I want to keep this, and just add the ability to pass _.col.

lt.join(rt, "f")

Currently works as expected. My proposals below are just generalizations of this.

lt.join(rt, lambda t: t.f)

Currently, works as expected, the same as lt.join(rt, "f")

After, it works the same.

lt.join(rt, _.f)

Currently, it fails (resolves against lt, giving lt.f, which isn't a BooleanColumn)

After, it resolves to lt.f == rt.f, same semantics of lt.join(rt, "f") More generally, we should support lt.join(rt, _.anything_that_join_supports)

If we change this behavior, then this would actually NOT break current usages of lt.join(rt, _.f == rt.f)!
Currently, this works by binding any Deferreds to the left table, resulting in (lt.f == rt.f).
After the change, after resolving against each table individually and then combining with ==, we would get a condition of
(lt.f == rt.f) == (rt.f == rt.f) which is the same as (lt.f == rt.f) == True which is the same as (lt.f == rt.f), which is the same as the current behavior! Double check my algebra/logic here, I might be wrong. We could add a test before the change and ensure that it doesn't start failing.

However, a "better" way of doing this, that I think we should actually document, is lt.join(rt, lambda l, r: l.f == r.f), see below. Even though lt.join(rt, _.f == rt.f) "works", I don't like it: It feels arbitrary how it binds to the left table, not the right. Also, I don't see how it is that useful, since you still need the direct reference to rt, which makes it so you can't reuse it in other join conditions. Since you are already hardcoding in the rt in there, it's not that much more of a pain to also hardcode in the lt.

"Pair" Values

We already support this with 2-tuples eg lt.join(rt, [("f", _.f + 1), (_.f.abs(), _.f + 1)]).
I think we should extend this to lambdas.

lt.join(rt, lambda l, r: l.f == r.f)

Currently, it fails with TypeError: <lambda>() missing 1 required positional argument: 'r'
After, it works. This is the preferred new spelling of lt.join(rt, _.f == rt.f)

In all of these, we should probably support recursion. eg lt.join(rt, lambda l, r: (lambda y: _.f)) should work.
It looks like currently lt.join(rt, lambda x: (lambda y: y.f)) works as expected, so maybe this wouldn't actually be a huge lift.

@NickCrews NickCrews changed the title feat: support Deferreds as predicates to joins feat(join): support Deferreds and lambda left, right: ... as predicates Jan 26, 2025
@NickCrews NickCrews changed the title feat(join): support Deferreds and lambda left, right: ... as predicates feat(join): support _.col.upper() and lambda left, right: ... as predicates Jan 26, 2025
@NickCrews NickCrews marked this as a duplicate of #10705 Jan 26, 2025
@kszucs
Copy link
Member

kszucs commented Jan 28, 2025

Don't want to further complicate it, but deferreds actually support multiple placeholders, not just _, like:

l, r = var("l"), var("r")

left_table.join(right_table, l.field_a == r.field_a && r.field_b == l.field_b)

@NickCrews
Copy link
Contributor Author

Oh wow, I didn't know about that. That's not a part of the public API though, right? Are you saying there is a way to use this API to support one of the use cases I describe above?

Am I correct here: It requires the relation name to match the var name, eg t.name("left").join(t2.name("right"), var("x").field_g == var("y").field_h) would fail? If it does require a match, then this would not allow for reusing predicates between diffferent joins.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Features or general enhancements
Projects
Status: backlog
Development

Successfully merging a pull request may close this issue.

2 participants