feat: add isin to the specification #959
Thanks @kgryte. Looks pretty good to me. I agree with the design choices in the PR description.
Would we be okay with requiring that value equality must be used? Is there a scenario where we want to allow libraries some wiggle room, such as with NaN and signed zero comparison?
I am not sure wiggle room is needed here. This function has more to do with equal than with unique I think. I just checked NumPy, PyTorch, JAX and CuPy - all seem to be using value equality for nan.
Are we okay with leaving out assume_unique?
Yes.
Are we okay with not mandating reshape behavior if x2 is multi-dimensional?
I think that part of the np.isin docstring is confusing. Reshaping is meaningless; its only purpose is to express that the comparisons are element-wise. It'd be better to have a simple double for-loop in pseudo-code. There is no broadcasting either: any shapes should work, and the output has the same shape as x1.
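A minimal sketch of the double for-loop semantics described above, in plain Python with nested lists standing in for arrays. The names isin_reference and _flatten are hypothetical and not part of the proposed specification; the sketch only illustrates that x2 is scanned element by element, with no reshaping or broadcasting, and that the result mirrors the shape of x1.

```python
from typing import Any, Iterator


def _flatten(values: Any) -> Iterator[Any]:
    """Yield the scalar leaves of a scalar or an arbitrarily nested list."""
    if isinstance(values, list):
        for item in values:
            yield from _flatten(item)
    else:
        yield values


def isin_reference(x1: Any, x2: Any, invert: bool = False) -> Any:
    """Element-wise membership test; the result has the same nesting as x1."""
    if isinstance(x1, list):
        # Outer loop: visit every element of x1 and keep its position.
        return [isin_reference(item, x2, invert) for item in x1]
    found = False
    # Inner loop: compare this element with every element of x2.
    for candidate in _flatten(x2):
        if x1 == candidate:  # value equality: NaN never matches, 0.0 == -0.0
            found = True
            break
    return (not found) if invert else found
```

For example, isin_reference([[1, 2], [3, 4]], [[[2]], [[4]]]) returns a 2x2 nested list even though x2 has a different number of dimensions.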
| Parameters | ||
| ---------- | ||
| x1: Union[array, int, float, complex, bool] | ||
| first input array. **May** have any data type. | 
Just to double-check, are we happy with e.g. torch not allowing complex values here:
In [16]: torch.isin(1j, torch.arange(3, dtype=torch.float64))
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[16], line 1
----> 1 torch.isin(1j, torch.arange(3, dtype=torch.float64))
RuntimeError: Unsupported input type encountered for isin(): ComplexDouble
@ev-br How difficult would this be to work around in the compat layer?
It's doable of course. That said, we are then adding to the growing "thickness" of the allegedly thin compat shim that is array-api-compat, and we'd be adding one more thing that realistically pytorch is not going to implement in the foreseeable future. Meaning we should be very clear that these small steps all move the compat layer from a temporary solution to a permanent one. At which point we should probably just stop pretending that the compat layer is temporary.
What are actual use-cases for isin?  I suspect it's used on floats, but in general it is more of an integer API (since true floats are likely to not match exactly), so if there are very few use-cases maybe it makes sense to limit what is promised to work?
(In the sense that one point of the Array API was to not add a lot of complicated/awkward API.)
| x2: Union[array, int, float, complex, bool] | ||
| second input array. **May** have any data type. | ||
| invert: bool | ||
| boolean indicating whether to invert the test criterion. If ``True``, the function **must** test whether each element in ``x1`` is *not* in ``x2``. If ``False``, the function **must** test whether each element in ``x1`` is in ``x2``. Default: ``False``. | 
So if isin(x1, x2, invert=True) is exactly equivalent to logical_not(isin(x1, x2)), we could drop the argument completely.
Yeah, I think that may be true... which in fact might be a bit awkward, because I am not sure if NaN logic adds up nicely or not.
As long as NaNs never test as "in" under equality comparison, it seems to be unambiguous (if strange at first sight):
In [23]: np.isin(np.nan, [np.nan], invert=True)
Out[23]: array(True)
I suppose the weird thing is whether np.nan counts as "not in" [3.], since np.nan != 3. So how does invert=True work? Is np.nan not in [3.], or not?
In [24]: np.isin(np.nan, [3], invert=True)
Out[24]: array(True)
In [25]: np.isin(np.nan, [3])
Out[25]: array(False)
Yeah, sorry, mind-slip. Somehow I sometimes think just inverting can lead to weird things with NaNs, but that only applies to the other comparisons, not to == and !=.
Yeah, here it only works because of equality comparison IIUC. Otherwise you're completely right, NaNs throw off logical inversion.
@ev-br Yes, in principle, we could drop the invert kwarg; however, the point of having that kwarg is to allow for more efficient operations in libraries not supporting graph-based optimization. Personally, I would prefer a separate "is not in" API, but the current lay of the land is having a kwarg in isin which negates the element-wise result.
I recognize that the performance optimization argument I just stated is a bit tenuous here, given that the OP explicitly advocates against including assume_unique, but the difference between invert and assume_unique is that the former conveys semantic meaning, whereas the latter is primarily about making implementer lives easier.
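The NaN point from this thread can be checked in plain Python. This is only a sketch over flat lists, and isin_flat and not_in_flat are hypothetical helper names, not part of the proposal or of any library. The "not in" result is computed directly with != rather than by negating, so the comparison with the negated result is not true by construction.

```python
import math


def isin_flat(x1, x2):
    """True where an element of x1 equals some element of x2."""
    return [any(a == b for b in x2) for a in x1]


def not_in_flat(x1, x2):
    """True where an element of x1 differs from every element of x2."""
    # Computed directly with !=, not by negating isin_flat.
    return [all(a != b for b in x2) for a in x1]


x1 = [math.nan, 3.0, 1.0]
x2 = [math.nan, 3.0]

negated = [not r for r in isin_flat(x1, x2)]
direct = not_in_flat(x1, x2)
print(negated == direct)  # True

# Ordering comparisons do not invert this way when NaN is involved.
print(not (math.nan < 3.0), math.nan >= 3.0)  # True False
```

The two results agree because a != b is exactly the negation of a == b, even for NaN, which is why inverting is safe here but not for comparisons such as < and >=.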
|  | ||
| def isin( | ||
| x1: Union[array, int, float, complex, bool], | ||
| x2: Union[array, int, float, complex, bool], | 
Apologies for the very late comment! I'm afraid allowing x2 to be an array is problematic for ndonnx (or rather ONNX in general). If x2 is known "at build time" - as it would be if it were, for instance, a Sequence[int | float | complex | bool] - then there are efficient ways to implement this in the ONNX standard using a HashMap. However, if the values of x2 are not known ahead of time, then one would be forced to implement this using broadcasting, such as x1[:, None] == x2[None, :] and a subsequent any reduction. The performance of such an implementation may be surprisingly bad in some circumstances. This is also the reason why we use a Sequence rather than an array in the current ndonnx implementation.
@cbourjau Does your comment also apply to searchsorted (ref: https://data-apis.org/array-api/latest/API_specification/generated/array_api.searchsorted.html#array_api.searchsorted)? There, you also have an x2 containing potentially "unknown" values.
We indeed fall back to broadcasting in our implementation of searchsorted (see here).
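The trade-off discussed in this thread can be illustrated with a plain-Python analogy. This is not ONNX or ndonnx code, and isin_pairwise and make_isin_lookup are hypothetical names. The first function mirrors the broadcasting fallback (every element of x1 compared against every element of x2, then an any reduction); the second assumes x2 is known up front, so a hash table can be built once and reused.

```python
from typing import Callable, Iterable, List, Sequence, Union

Scalar = Union[int, float, complex, bool]


def isin_pairwise(x1: Sequence[Scalar], x2: Sequence[Scalar]) -> List[bool]:
    """Compare all pairs, then reduce with any: O(len(x1) * len(x2)) work."""
    return [any(a == b for b in x2) for a in x1]


def make_isin_lookup(x2: Iterable[Scalar]) -> Callable[[Sequence[Scalar]], List[bool]]:
    """Build a hash table once for an x2 that is known ahead of time."""
    # NaN is dropped: it must never match, and a set could match it by identity.
    table = frozenset(v for v in x2 if v == v)

    def lookup(x1: Sequence[Scalar]) -> List[bool]:
        # One hash probe per element of x1: O(len(x1)) on average.
        return [a in table for a in x1]

    return lookup
```

Both variants give the same answers under value equality; the difference is only in how much work is done when x2 is large and reused across calls.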
| invert: bool = False, | ||
| ) -> array: | ||
| """ | ||
| Tests whether each element in ``x1`` is in ``x2``. | 
How about this wording:
| Tests whether each element in ``x1`` is in ``x2``. | |
| Tests for each element in ``x1`` if it is in ``x2``. | 
You're right that the current description is ambiguous whether the function performs an element-wise operation or a reduction. Will update.
This PR:
- resolves RFC: add isin for elementwise set inclusion test #854 by adding isin to the specification.
- of the keyword arguments determined according to array comparison data, this PR chooses to support only the invert kwarg. The assume_unique kwarg was not included for the following reasons:
  - [...] assume_unique when using isin and that was when searching lists of already known unique values.
  - assume_unique is something of a performance optimization/implementation detail which we have generally attempted to avoid when standardizing APIs.
- does not place restrictions on the shape of x2. While some libraries may choose to flatten a multi-dimensional x2, that is something of an implementation detail and not strictly necessary. For example, an implementation could defer to an "includes" kernel which performs nested loop iteration without needing to perform explicit reshapes/copies.
- adds support for scalar arguments for either x1 or x2. This follows recent general practice in standardized APIs, with the restriction that at least one of x1 or x2 must be an array.
- specifies that value equality should be used, but not must be used. This follows other set APIs (e.g., unique*). As a consequence of value equality, NaN values can never test as True and there is no distinction between signed zeros.
- allows both x1 and x2 to be of any data type. However, if x1 and x2 have no promotable data type, behavior is left unspecified and thus implementation-defined.

Questions
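Update: answers provided based on feedback below and discussions during workgroup meetings.
- Would we be okay with requiring that value equality must be used? Is there a scenario where we want to allow libraries some wiggle room, such as with NaN and signed zero comparison?
  - must, not should, due to predominant usage patterns.
- Are we okay with leaving out assume_unique?
- Are we okay with not mandating reshape behavior if x2 is multi-dimensional?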