feat(datafusion): Support insert_into in IcebergTableProvider #1511

Draft · CTTY wants to merge 19 commits into main from ctty/df-insert

Conversation

@CTTY (Contributor) commented on Jul 15, 2025

Which issue does this PR close?

What changes are included in this PR?

Are these changes tested?

@@ -440,10 +440,12 @@ impl PartnerAccessor<ArrayRef> for ArrowArrayAccessor {
Ok(schema_partner)
}

// todo generate field_pos in datafusion instead of passing to here
@CTTY (Contributor, Author) commented:

I found it tricky to handle this case: the input from DataFusion won't have field IDs, so we will need to assign them manually. Maybe there is a way to do name mapping here?
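
For illustration, here is a minimal sketch of one way to assign field IDs by position when the incoming Arrow schema carries none, using the PARQUET:field_id metadata key that iceberg-rust's Arrow conversion reads. The function name and the sequential-ID strategy are assumptions for this example, not the PR's code; nested types are omitted.

use std::collections::HashMap;

use arrow_schema::{Field, Schema};

const PARQUET_FIELD_ID_META_KEY: &str = "PARQUET:field_id";

/// Assign sequential, 1-based field IDs by position to a schema whose
/// fields carry no IDs yet (top-level fields only, for brevity).
fn assign_field_ids(schema: &Schema) -> Schema {
    let fields: Vec<Field> = schema
        .fields()
        .iter()
        .enumerate()
        .map(|(pos, field)| {
            let mut metadata: HashMap<String, String> = field.metadata().clone();
            metadata.insert(
                PARQUET_FIELD_ID_META_KEY.to_string(),
                (pos as i32 + 1).to_string(),
            );
            field.as_ref().clone().with_metadata(metadata)
        })
        .collect();
    Schema::new(fields)
}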

@CTTY force-pushed the ctty/df-insert branch from 7843b0d to 2f9efa8 on July 16, 2025 at 03:37
@liurenjie1024 (Contributor) left a comment:

Thanks @CTTY for this PR, I just finished a round of review. My suggestion is to start with an unpartitioned table first.

// Define a schema.
Arc::new(ArrowSchema::new(vec![
Field::new("data_files", DataType::Utf8, false),
Field::new("count", DataType::UInt64, false),
@liurenjie1024 (Contributor) commented:

What's the meaning of count?

@CTTY (Contributor, Author) replied:

DataFusion expects insert_into to return the number of rows (count) it has written: https://datafusion.apache.org/user-guide/sql/dml.html#insert. Here I'm sending count to the commit node and having the commit node return the number of rows eventually.

Technically we don't need to follow DataFusion's convention on insert_into and can return nothing; do you think that would be better?
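
For context, the DataFusion convention referenced above is a single-row result with one UInt64 column named count. A minimal sketch of building such a batch with arrow (illustrative only, not code from this PR):

use std::sync::Arc;

use arrow_array::{ArrayRef, RecordBatch, UInt64Array};
use arrow_schema::{DataType, Field, Schema};

/// Build the single-row `count` batch that DataFusion expects an
/// insert to report back.
fn make_count_batch(row_count: u64) -> RecordBatch {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "count",
        DataType::UInt64,
        false,
    )]));
    let count: ArrayRef = Arc::new(UInt64Array::from(vec![row_count]));
    RecordBatch::try_new(schema, vec![count]).expect("schema matches the single column")
}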

@liurenjie1024 (Contributor) replied:

I think we should still follow DataFusion's convention. But do we really need this? DataFile has a field called record_count, and I think that's enough for the insert-only case?

@CTTY (Contributor, Author) replied:

Yeah, using record_count makes more sense; I'll fix this.
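
Sketched out, the fix amounts to deriving the row count from the data files being committed instead of carrying a separate count column. This assumes iceberg-rust's DataFile exposes a record_count() accessor; treat the wiring as hypothetical:

use iceberg::spec::DataFile;

/// Total rows written, derived from the committed data files
/// (assumes a `record_count()` accessor on `DataFile`).
fn rows_written(data_files: &[DataFile]) -> u64 {
    data_files.iter().map(|f| f.record_count()).sum()
}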

@@ -432,3 +433,69 @@ async fn test_metadata_table() -> Result<()> {

Ok(())
}

#[tokio::test]
async fn test_insert_into() -> Result<()> {
@liurenjie1024 (Contributor) commented:

I'm not a big fan of adding this kind of integration test. How about adding sqllogictests?
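
For reference, a sqllogictest case for this feature might look like the following (table name and values are hypothetical):

statement ok
INSERT INTO t1 VALUES (1, 'a'), (2, 'b')

query I
SELECT count(*) FROM t1
----
2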

PlanProperties::new(
EquivalenceProperties::new(schema),
input.output_partitioning().clone(),
input.pipeline_behavior(),
@liurenjie1024 (Contributor) commented:

This should be Final?

@CTTY (Contributor, Author) commented on Jul 17, 2025:

I was thinking IcebergWriteExec could maybe be used for the streaming case, so the pipeline behavior and boundedness should be the same as the input's. For a normal INSERT INTO query it shouldn't matter either way.

EquivalenceProperties::new(schema),
input.output_partitioning().clone(),
input.pipeline_behavior(),
input.boundedness(),
@liurenjie1024 (Contributor) commented:

It should be Bounded.
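
Taken together, the two review points pin the plan properties instead of inheriting them from the input. A sketch against DataFusion's physical-plan API (assuming a DataFusion version where PlanProperties::new takes an EmissionType and a Boundedness):

use std::sync::Arc;

use datafusion::arrow::datatypes::Schema;
use datafusion::physical_expr::EquivalenceProperties;
use datafusion::physical_plan::execution_plan::{Boundedness, EmissionType};
use datafusion::physical_plan::{ExecutionPlan, PlanProperties};

/// Properties for the write/commit node: it emits its single result
/// only after all input is consumed, and that result is always finite.
fn write_plan_properties(schema: Arc<Schema>, input: &dyn ExecutionPlan) -> PlanProperties {
    PlanProperties::new(
        EquivalenceProperties::new(schema),
        input.output_partitioning().clone(),
        EmissionType::Final,  // emit after all input is processed
        Boundedness::Bounded, // the result set is finite regardless of input
    )
}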
