Explore changing the form of the Parquet page indexes in ParquetMetaData #8818

@etseidl

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The page indexes in ParquetMetaData are currently defined as nested vectors, indexed first by row group and then by column chunk:

pub type ParquetColumnIndex = Vec<Vec<ColumnIndexMetaData>>;
pub type ParquetOffsetIndex = Vec<Vec<OffsetIndexMetaData>>;

The trouble is that the indexes are optional per column chunk.

pub struct ColumnChunkMetaData {
    // ... other fields elided ...
    offset_index_offset: Option<i64>,
    offset_index_length: Option<i32>,
    column_index_offset: Option<i64>,
    column_index_length: Option<i32>,
    // ... other fields elided ...
}

This often leads to issues when an index is present for some column chunks but missing for others. For the column index, we currently have an enum variant to represent a missing index (ColumnIndexMetaData::NONE). The offset index has no equivalent, so a missing offset index can sometimes lead to a panic.
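To make the asymmetry concrete, here is a minimal sketch using simplified stand-in types rather than the real parquet crate definitions. The fields (`null_pages`, `page_offsets`), the `Present` variant, and both helper functions are hypothetical. The sketch also assumes, purely for illustration, that a missing offset index shows up as a shorter inner vector; the actual failure mode in the crate may differ.

```rust
// Stand-ins only: these are NOT the parquet crate's real definitions.
#[allow(non_camel_case_types, dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum ColumnIndexMetaData {
    // Sentinel for "this column chunk has no column index".
    NONE,
    // Hypothetical variant standing in for the typed index variants.
    Present { null_pages: Vec<bool> },
}

#[derive(Debug, Clone, PartialEq)]
struct OffsetIndexMetaData {
    // Hypothetical field: byte offset of each page in the chunk.
    page_offsets: Vec<i64>,
}

#[allow(dead_code)]
type ParquetColumnIndex = Vec<Vec<ColumnIndexMetaData>>;
type ParquetOffsetIndex = Vec<Vec<OffsetIndexMetaData>>;

// The column index can always answer "is it there?" via the sentinel.
#[allow(dead_code)]
fn has_column_index(ci: &ColumnIndexMetaData) -> bool {
    !matches!(ci, ColumnIndexMetaData::NONE)
}

// The offset index has no sentinel, so callers must use checked access.
// Writing `index[rg][col]` instead would panic when the entry is absent.
#[allow(dead_code)]
fn first_page_offset(index: &ParquetOffsetIndex, rg: usize, col: usize) -> Option<i64> {
    index.get(rg)?.get(col)?.page_offsets.first().copied()
}
```

With this shape, every caller has to remember to use `get` rather than direct indexing, which is easy to overlook.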

Describe the solution you'd like
I think the index structures should mirror the ColumnChunkMetaData and allow for None at the leaf level. That is, redefine the page index structures to be

pub type ParquetColumnIndex = Vec<Vec<Option<ColumnIndexMetaData>>>;
pub type ParquetOffsetIndex = Vec<Vec<Option<OffsetIndexMetaData>>>;
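As a rough sketch of how the proposed shape might be consumed, the lookup below returns `None` both when the position is out of range and when the chunk simply has no offset index. The struct is a simplified stand-in, and the `page_offsets` field and `offset_index_for` helper are hypothetical names, not part of the parquet crate.

```rust
// Stand-in only: NOT the parquet crate's real definition.
#[derive(Debug, Clone, PartialEq)]
struct OffsetIndexMetaData {
    // Hypothetical field: byte offset of each page in the chunk.
    page_offsets: Vec<i64>,
}

// Proposed shape from the issue: one Option per (row group, column chunk).
type ParquetOffsetIndex = Vec<Vec<Option<OffsetIndexMetaData>>>;

// Hypothetical helper: a missing index and an out-of-range position
// both collapse into `None`, so no caller can panic on a partial index.
#[allow(dead_code)]
fn offset_index_for(
    index: &ParquetOffsetIndex,
    rg: usize,
    col: usize,
) -> Option<&OffsetIndexMetaData> {
    index.get(rg)?.get(col)?.as_ref()
}
```

Because the inner vectors keep one slot per column chunk, positions stay aligned with the column order even when some indexes are absent.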

Describe alternatives you've considered
A more radical change could be to move the column and offset index structures to the ColumnChunkMetaData and either do away with the current Vec<Vec<Idx>> indexes, or convert them to a vector of references or Arcs.
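One possible reading of this alternative, sketched with stand-in types: each column chunk owns its indexes behind an `Arc`, and the nested-vector view is derived on demand by cloning the `Arc`s. The field names (`column_index`, `offset_index`, `page_offsets`, `null_pages`) and the `offset_index_view` helper are hypothetical and do not reflect the actual crate layout.

```rust
use std::sync::Arc;

// Stand-ins only: NOT the parquet crate's real definitions.
#[derive(Debug, PartialEq)]
struct OffsetIndexMetaData {
    page_offsets: Vec<i64>, // hypothetical field
}

#[allow(dead_code)]
#[derive(Debug, PartialEq)]
struct ColumnIndexMetaData {
    null_pages: Vec<bool>, // hypothetical field
}

// Hypothetical layout: the indexes live next to the chunk they describe.
#[allow(dead_code)]
struct ColumnChunkMetaData {
    column_index: Option<Arc<ColumnIndexMetaData>>,
    offset_index: Option<Arc<OffsetIndexMetaData>>,
}

// Derives the nested-vector view (row group, then column chunk).
// Cloning an Arc only bumps a reference count; the index data is shared.
#[allow(dead_code)]
fn offset_index_view(
    row_groups: &[Vec<ColumnChunkMetaData>],
) -> Vec<Vec<Option<Arc<OffsetIndexMetaData>>>> {
    row_groups
        .iter()
        .map(|cols| cols.iter().map(|c| c.offset_index.clone()).collect())
        .collect()
}
```

The trade-off in this sketch is that the per-chunk `Option` becomes the single source of truth, while existing consumers of the nested vectors would need to switch to the derived view.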

Additional context

Labels: enhancement (Any new improvement worthy of an entry in the changelog)