Skip to content

Conversation

@yohplala
Copy link

@yohplala yohplala commented Mar 24, 2025

PR aiming to solve #949

The solution implemented is a "at read time" solution. Step by step:

  • a new ParquetFile attribute has been created to store categorical values : global_cats,
  • when reading row group (with ParquetFile.read_row_group_file), it passes this global attribute down to read_col
  • read_col uses this global attribute and populates it with new categorical values that are encountered when reading successively new row group
  • whenever new values are found (or existing values with inconsistent codes compared to previous row groups), it creates a remapping table specific for this row group. It uses it to update to correct codes the column values.

Additional modifications are:

  • when slicing a ParquetFile (row group selection) with __getitem__, global_cats is reset. It could be used in a future modification to retrieve categorical values (why not) but in case of slicing, fewer categorical values would remain relevant.
    Anyhow, at next read_row_group_file operation, it would be repopulated with the right values
  • global_cats has been added to __getstate__ to ensure corret pickling
  • 2 test cases have been provided, testing 1 or 2 categorical columns, appending up to 3 times, using categorical strings and integers

@yohplala
Copy link
Author

I understand that with Datapage V2, the fix is not working.
https://github.com/dask/fastparquet/actions/runs/14037016911/job/39297384215?pr=953

Martin, would you have any advice on how to fix this?
I will try to look a bit later.

@martindurant
Copy link
Member

There's a lot of work here, it will take me time to get to!

with Datapage V2, the fix is not working

I don't think there's any difference in how dictionaries are stored in v2, but I'll try to think about it.

@yohplala
Copy link
Author

There's a lot of work here, it will take me time to get to!

I don't think there's any difference in how dictionaries are stored in v2, but I'll try to think about it.

Thanks a lot Martin. I understand it can take time no worries.
Yes, if you think to something regarding data page v2, I am interested to know. This is the difference I can see between this CI which is failing and the other CIs, which are all ok.

@yohplala
Copy link
Author

yohplala commented Mar 24, 2025

I don't think there's any difference in how dictionaries are stored in v2, but I'll try to think about it.

Martin, after further investigations, I can see why the remapping that is calculated is not applied with v2.

In read_col(), there is an "early exit" with a continue statement if v2 is activated and the mapping seems to be managed actually in read_data_page_v2 here:

        if ph.type == parquet_thrift.PageType.DATA_PAGE_V2:
            num += read_data_page_v2(infile, schema_helper, se, ph.data_page_header_v2, cmd,
                                     dic, assign, num, use_cat, off, ph, row_idx, selfmade=selfmade,
                                     row_filter=row_filter)
            continue

And because of the continue statement, the code I have implemented after to achieve the remapping is not executed:

        elif defi is not None:
            part = assign[num:num+len(defi)]
            if isinstance(part.dtype, pd.core.arrays.masked.BaseMaskedDtype):
                # TODO: could have read directly into array
                part._mask[:] = defi != max_defi
                part = part._data
            elif part.dtype.kind != "O":
                part[defi != max_defi] = my_nan
            if d and not use_cat:
                part[defi == max_defi] = dic[val]
            elif not use_cat:
                part[defi == max_defi] = convert(val, se, dtype=assign.dtype)
            elif remap_dict:
                # Apply remapping of categorical codes
                part[defi == max_defi] = remap_array[val]       # /!\ remapping code here /!\
            else:
                part[defi == max_defi] = val

So this clarifies why it is not executed.
In read_data_page_v2, I would however be glad of your confirmation as to the elif branch where I should achieve the remapping?
We have several elif and which one is the one that relates to categorical management?
I would think it is this elif branch, because use_cat is present:

    elif (use_cat and data_header2.encoding in [
        parquet_thrift.Encoding.PLAIN_DICTIONARY,
        parquet_thrift.Encoding.RLE_DICTIONARY,
    ]) or (data_header2.encoding == parquet_thrift.Encoding.RLE):

Can you confirm?

@yohplala yohplala force-pushed the fix-cats-global branch 2 times, most recently from 0874baa to eda1eb9 Compare March 25, 2025 13:20
@yohplala
Copy link
Author

yohplala commented Mar 25, 2025

Has been replaced by PR #954

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants