Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API (string dtype): comparisons between different string classes #60639

Open
rhshadrach opened this issue Jan 1, 2025 · 25 comments
Open

API (string dtype): comparisons between different string classes #60639

rhshadrach opened this issue Jan 1, 2025 · 25 comments
Labels
API - Consistency Internal Consistency of API/Behavior Needs Discussion Requires discussion from core team before further action Numeric Operations Arithmetic, Comparison, and Logical operations Strings String extension data type and string data

Comments

@rhshadrach
Copy link
Member

Some comparisons between different classes of string (e.g. string[pyarrow] and str) raise. Resolving this is straightforward except for what class should be returned. I would expect it should always be the left obj, e.g. string[pyarrow] == str should return string[pyarrow] whereas str == string[pyarrow] should return str. Is this the concensus?

We currently run into issues with how Python handles subclasses with comparison dunders.

lhs = pd.array(["x", pd.NA, "y"], dtype="string[pyarrow]")
rhs = pd.array(["x", pd.NA, "y"], dtype=pd.StringDtype("pyarrow", np.nan))

print(lhs.__eq__(rhs))
# <ArrowExtensionArray>
# [True, <NA>, True]
# Length: 3, dtype: bool[pyarrow]

print(lhs == rhs)
# [ True False  True]

The two results above differ because ArrowStringArrayNumpySemantics is a proper subclass of ArrowStringArray and therefore Python first calls rhs.__eq__(lhs).

We can avoid this by special casing this particular case in ArrowStringArrayNumpySemantics, but I wanted to open up an issue for discussion before proceeding.

cc @WillAyd @jorisvandenbossche

@rhshadrach rhshadrach added Numeric Operations Arithmetic, Comparison, and Logical operations Strings String extension data type and string data Needs Discussion Requires discussion from core team before further action API - Consistency Internal Consistency of API/Behavior labels Jan 1, 2025
@WillAyd
Copy link
Member

WillAyd commented Jan 1, 2025

I would expect it should always be the left obj, e.g. string[pyarrow] == str should return string[pyarrow] whereas str == string[pyarrow] should return str

You mean the return types should be bool right? I'm assuming so but let me know if I misunderstand

This is another motivating case for PDEP-13 #58455

I think without that, they should probably just return the bool extension type to at least preserve the possibility of null values

@rhshadrach
Copy link
Member Author

rhshadrach commented Jan 1, 2025

Ah, indeed, thanks! I meant the Boolean dtype determined by the left side. So string[pyarrow] == str would return bool[pyarrow]. Is this in conflict with your last sentence above? I'm not sure what the bool extension type means.

@WillAyd
Copy link
Member

WillAyd commented Jan 1, 2025

pd.BooleanDtype is the bool extension type

@jorisvandenbossche
Copy link
Member

Just to be explicit, the two possible return values we currently have for the example above (in case of consistent dtypes for left and right operand):

>>> lhs == rhs.astype(lhs.dtype)
<ArrowExtensionArray>
[True, <NA>, True]
Length: 3, dtype: bool[pyarrow]

>>> lhs.astype(rhs.dtype) == rhs
array([ True, False,  True])

are bool[pyarrow] and np.dtype('bool').
I personally agree that this should be pd.BooleanDtype instead of ArrowDtype for the first one (that is something that has been changed a while ago, and I thought we had an issue about it, but can't directly find it). But let's here focus on which of the current dtypes should be returned in case of mixed operands, i.e. essentially the question of whether to always prioritize the left operand or whether there should be a kind of hierarchy.

While in general letting the left operand take priority sounds fine and something that typically happens in Python (with Python also automatically first calling __eq__ of the lhs before trying the rhs), in the context of array objects and data types, it might make more sense to have a form of hierarchy / priority between groups of dtypes?

For example also for the nullable numeric dtypes vs numpy dtypes, we always give preference to the nullable dtype in that case:

>>> ser1 = pd.Series([1, 2, 3], dtype="int64")   # numpy
>>> ser2 = pd.Series([1, 2, 3], dtype="Int64")   # nullable

# numpy gives numpy bool
>>> (ser1 == ser1).dtype
dtype('bool')

# but once left OR right is nullable, result is nullable
>>> (ser1 == ser2).dtype
BooleanDtype
>>> (ser2 == ser1).dtype
BooleanDtype

@jorisvandenbossche
Copy link
Member

Making the overview of the result of == for all possible string dtype combinations (row labels is LHS, column labels RHS dtype):

object str str string string string[pyarrow]
object bool bool bool bool[pyarrow] boolean bool[pyarrow]
str (pyarrow) bool bool <error> bool <error> bool
str (python) bool bool bool bool bool bool
string (pyarrow) bool[pyarrow] bool <error> bool[pyarrow] <error> bool[pyarrow]
string (python) boolean boolean bool boolean boolean boolean
string[pyarrow] bool[pyarrow] bool <error> bool[pyarrow] <error> bool[pyarrow]
import numpy as np
import pandas as pd
import pyarrow as pa

dtypes = [
    np.dtype(object),
    pd.StringDtype("pyarrow", na_value=np.nan),
    pd.StringDtype("python", na_value=np.nan),
    pd.StringDtype("pyarrow", na_value=pd.NA),
    pd.StringDtype("python", na_value=pd.NA),
    pd.ArrowDtype(pa.string())
]

results = []
for dt1, dt2 in itertools.product(dtypes, dtypes):
    ser1 = pd.Series(["a", None, "b"], dtype=dt1)
    ser2 = pd.Series(["a", None, "b"], dtype=dt2)
    try:
        res = ser1 == ser2
        results.append((dt1, dt2, res.dtype))
    except:
        results.append((dt1, dt2, "<error>"))

df_results = pd.DataFrame(results, columns=["left", "right", "result"])

print(df_results.pivot(index="left", columns="right", values="result").to_markdown())

Some quick observations:

  • There are a few cases that error right now, especially involving the python vs pyarrow storage of the future default str dtype, so something we should fix.
  • For object dtype "strings", all other actual string dtypes take precedence (i.e. they return the "native" bool dtype for their family of strings), even if they are the RHS operand (i.e. this is the first row of the table)
    • Should we do the same for the new default str dtype? I.e. should essentially the second and third row be the same as the first row?

@WillAyd
Copy link
Member

WillAyd commented Jan 3, 2025

That's quite the table there...thanks for putting that together. I still think that all of the extension types equality ops should return the boolean extension type. The "string(python)" type has done this for years, so I think the new iterations (backed by pyarrow and the "str" types) should continue along with that API.

In general I find this a really non-value added activity for end users of the library, so long term still push for the logical type system. But in the interim I think we can not rock the boat to much and just stick with what was in place already for the string extension type

@jorisvandenbossche
Copy link
Member

(about returning the boolean extension dtype) ... The "string(python)" type has done this for years, so I think the new iterations (backed by pyarrow and the "str" types) should continue along with that API.

For "string(pyarrow)", fully agreed (and as mentioned above, this dtype also did that in the past, this was only changed in pandas 2.0).
But to be clear, for the "str" variants, that is IMO out of the question on the short term. As that was for me the whole point to write a PDEP about it and the reason we introduced the "str" / pd.StringDtype(na_value=np.nan) variants in addition to the existing StringDtype using pd.NA, i.e. so it could have different beahviour regarding default dtypes and missing value (and so not use the opt-in "nullable" dtypes, for now).

(of course we could also have added a NaN variant of the boolean extension dtype, and then it could have used that. But we didn't do that, and I think that is now too late for 3.x?)

And I also fully agree we should improve that messy table with logical dtypes.


But that said, we still need to make a decision on the short term for 2.3 / 3.x about which resulting dtype to use in case of mixed dtype operands:

  1. Give priority to the LHS
    • ser[str] == ser[string] -> numpy bool dtype
    • ser[string] == ser[str] -> boolean extension dtype
  2. Define some hierarchy (e.g. object < NaN-variant "str" < NA-variant "string" < ArrowDtype):
    • ser[str] == ser[string] -> boolean extension dtype
    • ser[string] == ser[str] -> boolean extension dtype

(and for the hierarchy option, also have "python < pyarrow" for the storage within the NaN or NA variant. This is less relevant for == equality, but for example for + resulting in new strings, the exact string dtype and its storage is also relevant)

I personally lean to the second option, I think.

(side note: also when having logical dtypes, we will have to answer this question if we have multiple variants of the same logical dtype)

@WillAyd
Copy link
Member

WillAyd commented Jan 3, 2025

Out of those two options I definitely prefer the hierarchy; I think giving the lhs more importance than the rhs is really hard to keep track of, especially in a potentially long pipeline of operations. I think your proposed solution works well too, although something does feel weird about mixing between NA markers for the str / string types...but that may be the least of all evils.

(side note: also when having logical dtypes, we will have to answer this question if we have multiple variants of the same logical dtype)

This may be worth further discussing in the PDEP, but I think it is just an implementation detail (?) and not something that should worry the end user too much

@rhshadrach
Copy link
Member Author

rhshadrach commented Jan 3, 2025

My main opposition to "always return Boolean extension dtype", in addition to what @jorisvandenbossche mentioned, is that if I have two pyarrow-backed Series with different NA-semantics, I would be very surprised to get back a NumPy-backed Series. Likewise, if I have two NaN Series (pd.StringDtype("python", na_value=np.nan) and pd.StringDtype("pyarrow", na_value=np.nan), I would be also surprised to get back a Series with pd.NA-semantics.

I think this puts me in the hierarchy camp. My proposed hierarchy is:

object < (pyarrow, NaN) < (python, NaN) <  (pyarrow, NA) < (python, NA)

My reasoning is that if you have NA-backed data, you've done something to opt into it as it isn't the default. Likewise, if you have PyArrow installed and you have python-backed strings, you've done something to get that as it isn't the default. So since you've opted into, we should give it preference.

Of course, hopefully all of this is a pathological edge-case that users don't encounter.

@WillAyd
Copy link
Member

WillAyd commented Jan 4, 2025

My reasoning is that if you have NA-backed data, you've done something to opt into it as it isn't the default. Likewise, if you have PyArrow installed and you have python-backed strings, you've done something to get that as it isn't the default. So since you've opted into, we should give it preference.

Of course, hopefully all of this is a pathological edge-case that users don't encounter.

This makes sense but then shouldn't the pyarrow implementations get a higher priority than the python ones in your hierarchy?

Of course, hopefully all of this is a pathological edge-case that users don't encounter.

Unfortunately I think this is going to be a very common occurrence. Especially when considering I/O formats that preserve metadata (namely parquet) it will very pretty easy to mix all these up

My main opposition to "always return Boolean extension dtype", in addition to what @jorisvandenbossche mentioned, is that if I have two pyarrow-backed Series with different NA-semantics, I would be very surprised to get back a NumPy-backed Series.

This is definitely valid with the current implementation of the extension types, although keep in mind with the proposed logical types that the storage becomes an implementation detail. We could set up whatever rules we wanted to manage this, although in the least pathological cases I would think pyarrow would be the main storage

@WillAyd
Copy link
Member

WillAyd commented Jan 4, 2025

object < (pyarrow, NaN) < (python, NaN) < (pyarrow, NA) < (python, NA)

Thinking through this one some more, I'm also not sure that the python string implementations should ever take priority over pyarrow. That's a huge performance degradation

@rhshadrach
Copy link
Member Author

rhshadrach commented Jan 4, 2025

So since you've opted into, we should give it preference.
This makes sense but then shouldn't the pyarrow implementations get a higher priority than the python ones in your hierarchy?

If you have PyArrow installed, then pd.Series(list("xyz")) gives you PyArrow backed. So to end up with Python, you need to have opted into it.

Unfortunately I think this is going to be a very common occurrence. Especially when considering I/O formats that preserve metadata (namely parquet) it will very pretty easy to mix all these up

I expect the common occurrence will be object dtype against one of the other 4, but not e.g. NA-pyarrow against NaN-pyarrow.

@WillAyd
Copy link
Member

WillAyd commented Jan 4, 2025

That's probably true for data that you create in your process, but when you bring I/O into the mix things can easily get mixed up.. For instance, if you load a parquet file that someone saved with dtype="string" but you use the default type system, you are going to mix up NA/NaN sentinels, even with PyArrow installed.

I don't know if our parquet I/O keeps the storage backend as part of the metadata, but if it does that would also make it easy to mix up types (ex: a user without PyArrow installed saves a file that gets loaded by someone with PyArrow)

@rhshadrach
Copy link
Member Author

rhshadrach commented Jan 4, 2025

I don't know if our parquet I/O keeps the storage backend as part of the metadata

Current parquet behavior is to always infer str.

pd.set_option("infer_string", True)

df = pd.DataFrame({"a": pd.array(list("xyz"), dtype="object")})
df.to_parquet("test.parquet")
print(pd.read_parquet("test.parquet")["a"].dtype)
# str

df = pd.DataFrame({"a": pd.array(list("xyz"), dtype="string")})
df.to_parquet("test.parquet")
print(pd.read_parquet("test.parquet")["a"].dtype)
# str

df = pd.DataFrame({"a": pd.array(list("xyz"), dtype="str")})
df.to_parquet("test.parquet")
print(pd.read_parquet("test.parquet")["a"].dtype)
# str

@WillAyd
Copy link
Member

WillAyd commented Jan 4, 2025

Ah that looks like a bug. dtype="string" should be roundtripping (it does without infer_string)

@WillAyd
Copy link
Member

WillAyd commented Jan 4, 2025

I vaguely recall there being some work done upstream in Arrow to better differentiate these. Not sure if the version of that matters but Joris would know best

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jan 5, 2025

For the hierarchy, my first reaction was also to find it odd to do "pyarrow < python" (since pyarrow is always preferred over python if installed). But your explanation certainly makes sense (you only get that situation when explicitly opting in to "python", so in that case preserve it).

Now, the question if how often users will accidentally have data using the "python" storage?
Our constructors and most IO methods (for formats that don't store pandas-specific metadata) will be fine I suppose.

One exception I can think of is pickle. Using to_pickle()/read_pickle roundtrip right now will preserve the storage (so when pickling in an env without pyarrow and then reading an env with pyarrow can give you a non-default storage for the string dtype).

For Parquet:

Current parquet behavior is to always infer str.

Ah that looks like a bug. dtype="string" should be roundtripping (it does without infer_string)
..
I vaguely recall there being some work done upstream in Arrow to better differentiate these. Not sure if the version of that matters but Joris would know best

That's indeed a known limitation right now. Normally pyarrow's to_pandas() tries to preserve the pandas extension dtype stored in the metadata, but for getting the default string dtype to work we currently overrule that mechanism by simply asking for "str" for all string columns through the types_mapper keyword:

elif using_string_dtype():
types_mapper = _arrow_string_types_mapper()

(and when specifying that keyword, that currently overrides any other mechanisms to infer the dtype from the metadata)

I am fixing this on the pyarrow side in apache/arrow#44195 (essentially moving the logic of choosing the "str" dtype by default for pandas >= 3 to pyarrow, so that then we don't have to override that anymore). So starting with pyarrow 19.0, we should be able to restore the functionality of preserving "string" dtype.

For both "string" and "str" dtype, it will preserve that dtype (i.e. whether it uses NaN vs NA), but not the storage. That is because we don't include that information in the __str__ repr, only in the __repr__ repr (for "string", as for "str" we don't include it in either), and pyarrow uses str(..) for storing information about the dtype in the metadata.

(we currently don't really control the pyarrow->pandas conversion, which makes changes to the default behaviour for this conversion tricky and difficult, see xref #59780 for moving that logic from pyarrow to pandas)

@WillAyd
Copy link
Member

WillAyd commented Jan 5, 2025

Maybe we can just disallow a user from trying to force python storage when pyarrow is installed? That would help simplify the rules further

@rhshadrach
Copy link
Member Author

Maybe we can just disallow a user from trying to force python storage when pyarrow is installed?

What happens with pickles that contain Python storage?

@jbrockmendel
Copy link
Member

Haven’t read the whole thread, just OP. In general == should be commutative. We use pandas_priority to determine which object defers to which.

@WillAyd
Copy link
Member

WillAyd commented Jan 7, 2025

What happens with pickles that contain Python storage?

Do you think this would happen from someone starting in an environment without pyarrow, exporting an object via pickle, then loading it into another environment with pyarrow? If so, I wonder if we shouldn't restore that state using the pyarrow backend.

@rhshadrach
Copy link
Member Author

Talked about this in the dev meeting today, the main idea floated there was raising on certain cases:

  • raise anytime storage or NA-semantics differ; or
  • raise anytime NA-semantics differ, but not storage.

I think the main thing captured is that we do not what to be changing what NA-semantics the user is dealing with. I think that when presented with both NA and NaN we should resist the temptation to guess.

For storage, there was a desire to have pyarrow take priority over python if we aren't going to raise.

However we choose to proceed, the consensus in the dev meeting that it should apply to all binary operations.

@WillAyd
Copy link
Member

WillAyd commented Jan 8, 2025

I definitely don't think we should be raising on storage - one of the main values of PDEP-14 was to offer an abstraction over what storage is being used.

I think the main thing captured is that we do not what to be changing what NA-semantics the user is dealing with. I think that when presented with both NA and NaN we should resist the temptation to guess.

This is a good point when thinking about the output of an operation. The input is pretty clear though - we are comparing values that are missing (at least with strings, floats have some ambiguity). Given our long term strategy for types has been to move away from using np.nan as a missing value, I think we should prefer pd.NA in this case

@jorisvandenbossche
Copy link
Member

the main idea floated there was raising on certain cases:

I would find it strange to start raising an error for those certain cases, while right now we almost never raise an error, but rather return False instead of an error if values are not really comparable.
(for example if you ask if floats are equal to datetimes, we don't error but return all Falses)

I know that in this case it is not really about the values being comparable or not (the string values itself are perfectly comparable), but about implicit conversion to different NA semantics. So that's a different situation ..

But still, we are talking here in the context of string dtype. But what do we do for all the other dtypes where we also support those different NA semantics? Would you also want to start raising for that? (eg when comparing numpy floats with nullable ints or floats?)

@WillAyd
Copy link
Member

WillAyd commented Jan 13, 2025

That's a good point. FWIW our status quo seems to prefer using pd.NA as the missing value sentinel when mixing NA-semantics today:

>>> ser1 = pd.Series([1, np.nan, 2])
>>> ser2 = pd.Series([1, pd.NA, 2], dtype=pd.Float64Dtype())
>>> ser1 < ser2
0    False
1     <NA>
2    False
dtype: boolean

Although that comparison is likely not very common, I think sticking with that status quo is the best path forward for now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior Needs Discussion Requires discussion from core team before further action Numeric Operations Arithmetic, Comparison, and Logical operations Strings String extension data type and string data
Projects
None yet
Development

No branches or pull requests

4 participants