Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The page indexes in ParquetMetaData are currently defined as
pub type ParquetColumnIndex = Vec<Vec<ColumnIndexMetaData>>;
pub type ParquetOffsetIndex = Vec<Vec<OffsetIndexMetaData>>;
The trouble is that the indexes are optional per column chunk.
pub struct ColumnChunkMetaData {
    ...
    offset_index_offset: Option<i64>,
    offset_index_length: Option<i32>,
    column_index_offset: Option<i64>,
    column_index_length: Option<i32>,
    ...
}
This often leads to issues when dealing with partially missing indexes. For the column index, we currently have an enum variant to handle a missing index (ColumnIndexMetaData::NONE). But the offset index has no such structure, so a missing offset index can sometimes lead to a panic.
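To make the failure mode concrete, here is a minimal sketch of one way a partially missing offset index could go wrong. It uses a stand-in `OffsetIndexMetaData` rather than the real parquet type, and it assumes (hypothetically) that only the indexes actually found end up in the inner `Vec`. `offset_index_for` and `partially_indexed` are illustrative helpers, not existing APIs.

```rust
/// Stand-in for the real offset index type; the field is illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct OffsetIndexMetaData {
    pub page_offsets: Vec<i64>,
}

/// Current shape from the issue: row groups, then column chunks.
pub type ParquetOffsetIndex = Vec<Vec<OffsetIndexMetaData>>;

/// Hypothetical checked lookup: returns None rather than panicking when the
/// row group or the column has no entry.
pub fn offset_index_for(
    index: &ParquetOffsetIndex,
    row_group: usize,
    column: usize,
) -> Option<&OffsetIndexMetaData> {
    index.get(row_group)?.get(column)
}

/// Builds the assumed failure case: one row group with two columns, where
/// only column 0 has an offset index, so the inner Vec holds a single entry.
pub fn partially_indexed() -> ParquetOffsetIndex {
    vec![vec![OffsetIndexMetaData {
        page_offsets: vec![4, 128],
    }]]
}
```

With this layout, `partially_indexed()[0][1]` would panic with an out-of-bounds index, while `offset_index_for(&index, 0, 1)` returns `None`. Note that the checked lookup only hides the problem: the caller still cannot tell which column a surviving entry belongs to once entries are skipped.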
Describe the solution you'd like
I think the index structures should mirror the ColumnChunkMetaData and allow for None at the leaf level. That is, redefine the page index structures to be
pub type ParquetColumnIndex = Vec<Vec<Option<ColumnIndexMetaData>>>;
pub type ParquetOffsetIndex = Vec<Vec<Option<OffsetIndexMetaData>>>;
Describe alternatives you've considered
A more radical change could be to move the column and offset index structures to the ColumnChunkMetaData and either do away with the current Vec<Vec<Idx>> indexes, or convert them to a vector of references or Arcs.
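A rough sketch of what this alternative might look like, assuming the indexes are stored as `Option<Arc<_>>` on each column chunk. All types here are simplified stand-ins, and the `column_index` and `offset_index` fields and the `offset_index_view` helper are hypothetical names, not part of the current API.

```rust
use std::sync::Arc;

/// Stand-ins for the real index types; the fields are illustrative only.
#[derive(Debug)]
pub struct ColumnIndexMetaData {
    pub null_pages: Vec<bool>,
}

#[derive(Debug)]
pub struct OffsetIndexMetaData {
    pub page_offsets: Vec<i64>,
}

/// Simplified column chunk that owns its (optional) page indexes.
#[derive(Debug, Default)]
pub struct ColumnChunkMetaData {
    pub column_index: Option<Arc<ColumnIndexMetaData>>,
    pub offset_index: Option<Arc<OffsetIndexMetaData>>,
}

/// Hypothetical derived view with the familiar row group / column shape.
/// Cloning an Arc only bumps a reference count; the index data is shared.
pub fn offset_index_view(
    row_groups: &[Vec<ColumnChunkMetaData>],
) -> Vec<Vec<Option<Arc<OffsetIndexMetaData>>>> {
    row_groups
        .iter()
        .map(|columns| columns.iter().map(|c| c.offset_index.clone()).collect())
        .collect()
}
```

In this arrangement a missing index is simply `None` on the chunk it belongs to, and the nested-vector view can be rebuilt on demand for callers that still expect it.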
Additional context
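For illustration, a minimal sketch of how callers might work with the proposed leaf-level `Option` layout. The index types are simplified stand-ins, and `offset_index_for` and `columns_missing_offset_index` are hypothetical helpers rather than existing functions.

```rust
/// Stand-ins for the real index types; the fields are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnIndexMetaData {
    pub null_pages: Vec<bool>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct OffsetIndexMetaData {
    pub page_offsets: Vec<i64>,
}

/// Proposed shapes: every column chunk keeps its slot, and a missing index
/// is recorded as None.
pub type ParquetColumnIndex = Vec<Vec<Option<ColumnIndexMetaData>>>;
pub type ParquetOffsetIndex = Vec<Vec<Option<OffsetIndexMetaData>>>;

/// Hypothetical lookup: None covers both "out of range" and "no index".
pub fn offset_index_for(
    index: &ParquetOffsetIndex,
    row_group: usize,
    column: usize,
) -> Option<&OffsetIndexMetaData> {
    index.get(row_group)?.get(column)?.as_ref()
}

/// Hypothetical helper: positions of the columns in a row group that have
/// no offset index. Returns an empty Vec for an unknown row group.
pub fn columns_missing_offset_index(
    index: &ParquetOffsetIndex,
    row_group: usize,
) -> Vec<usize> {
    index
        .get(row_group)
        .map(|columns| {
            columns
                .iter()
                .enumerate()
                .filter(|(_, entry)| entry.is_none())
                .map(|(i, _)| i)
                .collect()
        })
        .unwrap_or_default()
}
```

Because every column keeps its position, `index[rg][col]` stays aligned with the column chunk at the same position, and callers decide explicitly what to do when the entry is `None`.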