Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: refine record batch project #1014

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

ZENOTME
Copy link
Contributor

@ZENOTME ZENOTME commented Feb 27, 2025

This PR:

Context

For read or write in iceberg-rust, we interoperate with arrow RecordBatch. There are similar project tasks when reading and write to project RecordBatch:

  • In equality delete, we need to project record batch according to field id
  • in read, the user may scan part of the column and we also need to project record batch according to field id
  • for projects in reading, the iceberg also defines project rules to fulfill the value when a column is missing. Like init-default

For now, these tasks are implemented in different structs, RecordBatchTransform, and RecordBatchProjector and there are redundant codes. After we introduce the SchemaWithPartner visitor in #731, we can unify the implementation under the "visitor" framework.

The project task is generally separated into two parts:

  1. accessor: retrieve the array of partners.
    We can custom accessor to achieve init-default. E.g. when fail to retrieve the array, accessor can construct the array according to init-default. And we also can custom accessor to achieve name-map.
  2. visitor: do some convert task.
    We can customize visitors to do the type of promotion here.

The PR try this work to refactor our RecordBatchTransform, and RecordBatchProjector. And it also complete #405 (comment) using ArrowProjectionVisitor as @liurenjie1024 mention

@ZENOTME ZENOTME force-pushed the record_batch_projector branch 2 times, most recently from a9dc195 to 28dbba9 Compare February 28, 2025 05:39
@ZENOTME ZENOTME changed the title refine: unify record batch project feat: refine record batch project Feb 28, 2025
@ZENOTME ZENOTME marked this pull request as ready for review February 28, 2025 07:10
@ZENOTME
Copy link
Contributor Author

ZENOTME commented Feb 28, 2025

cc @liurenjie1024 @sdd @Xuanwo @Fokko

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Feb 28, 2025

Potential improvement: For now, we will trave the whole record in visitor pattern, but in some case we don't need to travese child column. E.g. the map column is passthrough, we don't need type promotion for its child so we hope to return map column directly and stop traversing its child.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant