Fix cats global #953
I understand that with Datapage V2, the fix is not working. Martin, would you have any advice on how to fix this?
There's a lot of work here, it will take me time to get to!
I don't think there's any difference in how dictionaries are stored in v2, but I'll try to think about it.
Thanks a lot Martin. I understand it can take time, no worries.
Martin, after further investigation, I can see why the remapping that is calculated is not applied with v2. The v2 branch ends with a continue:

    if ph.type == parquet_thrift.PageType.DATA_PAGE_V2:
        num += read_data_page_v2(infile, schema_helper, se, ph.data_page_header_v2, cmd,
                                 dic, assign, num, use_cat, off, ph, row_idx, selfmade=selfmade,
                                 row_filter=row_filter)
        continue

And because of the continue, the following block, which contains the remapping, is never reached for v2 pages:

    elif defi is not None:
        part = assign[num:num+len(defi)]
        if isinstance(part.dtype, pd.core.arrays.masked.BaseMaskedDtype):
            # TODO: could have read directly into array
            part._mask[:] = defi != max_defi
            part = part._data
        elif part.dtype.kind != "O":
            part[defi != max_defi] = my_nan
        if d and not use_cat:
            part[defi == max_defi] = dic[val]
        elif not use_cat:
            part[defi == max_defi] = convert(val, se, dtype=assign.dtype)
        elif remap_dict:
            # Apply remapping of categorical codes
            part[defi == max_defi] = remap_array[val]  # /!\ remapping code here /!\
        else:
            part[defi == max_defi] = val

So this clarifies why it is not executed. The corresponding branch on the v2 side appears to be this one:

    elif (use_cat and data_header2.encoding in [
        parquet_thrift.Encoding.PLAIN_DICTIONARY,
        parquet_thrift.Encoding.RLE_DICTIONARY,
    ]) or (data_header2.encoding == parquet_thrift.Encoding.RLE):

Can you confirm?
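To make the remapping step concrete, here is a minimal sketch of what a line like `part[defi == max_defi] = remap_array[val]` is doing. It is not fastparquet code: `build_remap` and `apply_remap` are hypothetical helpers, plain lists stand in for the NumPy arrays, and the `-1` null code is an assumption borrowed from the usual categorical convention.

```python
def build_remap(local_dict, global_cats):
    """Map each row-group-local dictionary index to a global category code.

    Values not yet known are appended to global_cats (mutated in place).
    """
    index = {value: code for code, value in enumerate(global_cats)}
    remap = []
    for value in local_dict:
        if value not in index:
            index[value] = len(global_cats)
            global_cats.append(value)
        remap.append(index[value])
    return remap


def apply_remap(codes, defi, max_defi, remap, null_code=-1):
    """Translate local codes to global ones; rows with defi != max_defi are nulls."""
    out = []
    remaining = iter(codes)  # codes only exist for non-null rows
    for level in defi:
        if level == max_defi:
            out.append(remap[next(remaining)])
        else:
            out.append(null_code)
    return out


# Categories collected from earlier row groups (hypothetical data).
global_cats = ["a", "b"]
# This row group's dictionary page lists its values in a different order.
remap = build_remap(["b", "c", "a"], global_cats)        # [1, 2, 0]
codes = apply_remap([0, 1, 2, 1], [1, 1, 0, 1, 1], 1, remap)
# codes == [1, 2, -1, 0, 2] and global_cats == ["a", "b", "c"]
```

If a page is decoded on a path that never calls the equivalent of `apply_remap`, the local codes are stored as if they were global ones, which is the symptom described for v2 pages.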
Has been replaced by PR #954
PR aiming to solve #949.
The solution implemented is an "at read time" solution. Step by step:
- A ParquetFile attribute has been created to store categorical values: global_cats.
- When a row group is read (ParquetFile.read_row_group_file), this global attribute is passed down to read_col.
- read_col uses this global attribute and populates it with the new categorical values encountered as successive row groups are read.
Additional modifications are:
- When slicing a ParquetFile (row group selection) with __getitem__, global_cats is reset. It could be used in a future modification to retrieve categorical values (why not), but in case of slicing, fewer categorical values would remain relevant. Anyhow, at the next read_row_group_file operation, it would be repopulated with the right values.
- global_cats has been added to __getstate__ to ensure correct pickling.
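The flow described in this PR description can be sketched with a toy model. This is an illustration only, not the PR's implementation: `MiniFile` and its `read_row_group` method are hypothetical stand-ins for ParquetFile and read_row_group_file, row groups are plain dicts of lists, and "reset" is assumed to mean that a selection starts with an empty global_cats.

```python
def read_col(values, cats):
    """Encode values as category codes, extending cats with unseen values."""
    index = {value: code for code, value in enumerate(cats)}
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(cats)
            cats.append(value)
        codes.append(index[value])
    return codes


class MiniFile:
    """Toy stand-in for ParquetFile (hypothetical, for illustration only)."""

    def __init__(self, row_groups):
        # Each row group is a dict: column name -> list of values.
        self.row_groups = row_groups
        # Column name -> categories seen so far across row groups.
        self.global_cats = {}

    def __getitem__(self, item):
        # Row-group selection: the result starts with an empty global_cats.
        selected = self.row_groups[item]
        if not isinstance(selected, list):
            selected = [selected]
        return MiniFile(selected)

    def __getstate__(self):
        # global_cats is part of the pickled state.
        return {"row_groups": self.row_groups, "global_cats": self.global_cats}

    def __setstate__(self, state):
        self.__dict__.update(state)

    def read_row_group(self, i, column):
        # Pass the shared category list down to the column reader.
        cats = self.global_cats.setdefault(column, [])
        return read_col(self.row_groups[i][column], cats)


pf = MiniFile([{"c": ["x", "y"]}, {"c": ["z", "x"]}])
first = pf.read_row_group(0, "c")    # [0, 1]
second = pf.read_row_group(1, "c")   # [2, 0]: "x" keeps the code it got earlier
```

After both reads, `pf.global_cats` is `{"c": ["x", "y", "z"]}`, while `pf[1:].global_cats` is empty until that selection is read again.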