Fix cats global #953

yohplala · 2025-03-24T13:50:47Z

PR aiming to solve #949

The solution implemented is a "at read time" solution. Step by step:

a new ParquetFile attribute has been created to store categorical values : global_cats,
when reading row group (with ParquetFile.read_row_group_file), it passes this global attribute down to read_col
read_col uses this global attribute and populates it with new categorical values that are encountered when reading successively new row group
whenever new values are found (or existing values with inconsistent codes compared to previous row groups), it creates a remapping table specific for this row group. It uses it to update to correct codes the column values.

Additional modifications are:

when slicing a ParquetFile (row group selection) with __getitem__, global_cats is reset. It could be used in a future modification to retrieve categorical values (why not) but in case of slicing, fewer categorical values would remain relevant.
Anyhow, at next read_row_group_file operation, it would be repopulated with the right values
global_cats has been added to __getstate__ to ensure corret pickling
2 test cases have been provided, testing 1 or 2 categorical columns, appending up to 3 times, using categorical strings and integers

yohplala · 2025-03-24T13:57:59Z

I understand that with Datapage V2, the fix is not working.
https://github.com/dask/fastparquet/actions/runs/14037016911/job/39297384215?pr=953

Martin, would you have any advice on how to fix this?
I will try to look a bit later.

martindurant · 2025-03-24T13:59:40Z

There's a lot of work here, it will take me time to get to!

with Datapage V2, the fix is not working

I don't think there's any difference in how dictionaries are stored in v2, but I'll try to think about it.

yohplala · 2025-03-24T14:20:06Z

There's a lot of work here, it will take me time to get to!

I don't think there's any difference in how dictionaries are stored in v2, but I'll try to think about it.

Thanks a lot Martin. I understand it can take time no worries.
Yes, if you think to something regarding data page v2, I am interested to know. This is the difference I can see between this CI which is failing and the other CIs, which are all ok.

yohplala · 2025-03-24T20:48:22Z

I don't think there's any difference in how dictionaries are stored in v2, but I'll try to think about it.

Martin, after further investigations, I can see why the remapping that is calculated is not applied with v2.

In read_col(), there is an "early exit" with a continue statement if v2 is activated and the mapping seems to be managed actually in read_data_page_v2 here:

        if ph.type == parquet_thrift.PageType.DATA_PAGE_V2:
            num += read_data_page_v2(infile, schema_helper, se, ph.data_page_header_v2, cmd,
                                     dic, assign, num, use_cat, off, ph, row_idx, selfmade=selfmade,
                                     row_filter=row_filter)
            continue

And because of the continue statement, the code I have implemented after to achieve the remapping is not executed:

        elif defi is not None:
            part = assign[num:num+len(defi)]
            if isinstance(part.dtype, pd.core.arrays.masked.BaseMaskedDtype):
                # TODO: could have read directly into array
                part._mask[:] = defi != max_defi
                part = part._data
            elif part.dtype.kind != "O":
                part[defi != max_defi] = my_nan
            if d and not use_cat:
                part[defi == max_defi] = dic[val]
            elif not use_cat:
                part[defi == max_defi] = convert(val, se, dtype=assign.dtype)
            elif remap_dict:
                # Apply remapping of categorical codes
                part[defi == max_defi] = remap_array[val]       # /!\ remapping code here /!\
            else:
                part[defi == max_defi] = val

So this clarifies why it is not executed.
In read_data_page_v2, I would however be glad of your confirmation as to the elif branch where I should achieve the remapping?
We have several elif and which one is the one that relates to categorical management?
I would think it is this elif branch, because use_cat is present:

    elif (use_cat and data_header2.encoding in [
        parquet_thrift.Encoding.PLAIN_DICTIONARY,
        parquet_thrift.Encoding.RLE_DICTIONARY,
    ]) or (data_header2.encoding == parquet_thrift.Encoding.RLE):

Can you confirm?

yohplala · 2025-03-25T13:43:28Z

Has been replaced by PR #954

yohplala mentioned this pull request Mar 24, 2025

Wrong categories data on reading with engine='fastparquet' #949

Open

yohplala force-pushed the fix-cats-global branch 2 times, most recently from 0874baa to eda1eb9 Compare March 25, 2025 13:20

Fix categorical data appending - ticket dask#949.

d55ce77

yohplala force-pushed the fix-cats-global branch from eda1eb9 to d55ce77 Compare March 25, 2025 13:25

yohplala closed this Mar 25, 2025

yohplala deleted the fix-cats-global branch March 25, 2025 13:36

yohplala mentioned this pull request Mar 25, 2025

Fix categorical data handling with global dictionaries #954

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Fix cats global #953

Fix cats global #953

Uh oh!

yohplala commented Mar 24, 2025 •

edited

Loading

Uh oh!

yohplala commented Mar 24, 2025

Uh oh!

martindurant commented Mar 24, 2025

Uh oh!

yohplala commented Mar 24, 2025

Uh oh!

yohplala commented Mar 24, 2025 •

edited

Loading

Uh oh!

yohplala commented Mar 25, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Fix cats global #953

Fix cats global #953

Uh oh!

Conversation

yohplala commented Mar 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yohplala commented Mar 24, 2025

Uh oh!

martindurant commented Mar 24, 2025

Uh oh!

yohplala commented Mar 24, 2025

Uh oh!

yohplala commented Mar 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yohplala commented Mar 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yohplala commented Mar 24, 2025 •

edited

Loading

yohplala commented Mar 24, 2025 •

edited

Loading

yohplala commented Mar 25, 2025 •

edited

Loading