Support for PyArrow parquet partitioning #7465

bsprenger · 2025-03-18T10:49:42Z

Hi all! 😄 Is there support for loading parquet datasets with partitioning as described in the arrow docs?

In particular, load_dataset() does not seem to reconstruct the partition columns as described in the docs. For example, given this folder structure:

dataset_name/
  year=2007/
    month=01/
       data0.parquet
       data1.parquet
       ...
    month=02/
       data0.parquet
       data1.parquet
       ...
    month=03/
    ...
  year=2008/
    month=01/
    ...
  ...

Using PyArrow datasets, I can load this dataset, specify the partition columns, and have year, month, etc. added as columns automatically. I don't see a way to do this through load_dataset(), which ignores the key=value folder structure. Does anyone know of a way around this?

Thanks so much! 😄

Nintorac · 2025-08-04T13:05:05Z

Hey, did you ever figure anything out for this?

1 reply

bsprenger Aug 14, 2025
Author

Sorry for the delayed response. Per this thread, it does not appear to be supported :( And from what I can tell looking in the Datasets internals, this appears to still be the case.
