-
Notifications
You must be signed in to change notification settings - Fork 715
feat(iceberg): support DeleteScan for datafusion engine #24065
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Mingzhuo Yin <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds support for DeleteScan operations (both equality and position deletes) to the DataFusion engine for Iceberg tables. Previously, only DataScan was supported. The changes enable proper handling of delete files in Iceberg's copy-on-write operations.
Key changes:
- Converted
IcebergScan::new()to an async function to support pre-loading file scan tasks during initialization - Added
list_iceberg_scan_task()method to enumerate and filter data/delete files based on scan type - Implemented metadata column appending (sequence number, file path, file position) for tracking row lineage
- Changed partition handling from single partition to multi-partition based on file scan tasks
- Removed hidden column filtering to expose Iceberg metadata columns through the DataFusion schema
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/frontend/src/datafusion/iceberg_table_provider.rs | Made scan() async and removed hidden column filter to expose Iceberg metadata columns |
| src/frontend/src/datafusion/iceberg_executor.rs | Added delete scan support with task enumeration, multi-partition execution, and metadata column handling |
| e2e_test/iceberg/test_case/pure_slt/iceberg_datafusion_engine.slt | Added test cases for delete operations with DataFusion engine |
Comments suppressed due to low confidence (1)
src/frontend/src/datafusion/iceberg_executor.rs:64
- The
#[allow(dead_code)]attributes onneed_seq_numandneed_file_path_and_posare no longer needed as these fields are now actively used in theexecute_innermethod (lines 265-266). These attributes should be removed.
need_seq_num: bool,
#[allow(dead_code)]
need_file_path_and_pos: bool,
Signed-off-by: Mingzhuo Yin <[email protected]>
Signed-off-by: Mingzhuo Yin <[email protected]>
| plan_properties, | ||
| }; | ||
| inner.tasks = inner.list_iceberg_scan_task().try_collect().await?; | ||
| inner.plan_properties.partitioning = Partitioning::UnknownPartitioning(inner.tasks.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a session variable called batch_parallelism, I think we can use this value to control the parallelism instead of using the task number directly, because the task number could be very large.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in new commit
Signed-off-by: Mingzhuo Yin <[email protected]>
This stack of pull requests is managed by Graphite. Learn more about stacking. |
chenzl25
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
…bs#24065) Signed-off-by: Mingzhuo Yin <[email protected]>

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
Checklist
Documentation
Release note