
Uniform: Remove snapshot expiration patch, replace with custom delete callback which checks against metadata location #4059

Open
wants to merge 1 commit into
base: master

Conversation

amogh-jahagirdar
Contributor


Which Delta project/connector is this regarding?

  • [x] Spark
  • Standalone
  • Flink
  • Kernel
  • Other

Description

This change removes the snapshot expiration patch for Uniform, which currently prevents any deletion of shared data files as part of expiration. We can achieve the same behavior by passing in a custom delete callback that checks whether a path to delete is under the Iceberg metadata location: if it is, we can clean it up; if it is not, we must not delete it, since it would be a data file.

Note that one edge case is also addressed: the user may have configured their Iceberg metadata location to be the same as the data location. In this case, we take the conservative approach of performing no metadata cleanup at all. Such a configuration is very rare in practice, since separating metadata into a different prefix is in the user's interest (it ensures better throughput, for instance), but defending against it added little complexity, so the check was included.
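The callback logic described above can be sketched roughly as follows. This is a minimal illustration, not the actual Delta/Iceberg API: the function name `should_delete` and the prefix-based comparison are assumptions for the sketch, standing in for whatever path comparison the real change performs against the table's metadata location.

```python
def should_delete(path: str, metadata_location: str, data_location: str) -> bool:
    """Decide whether snapshot expiration may delete `path`.

    Only files under the Iceberg metadata location are eligible; if the
    metadata and data locations coincide, conservatively delete nothing.
    (Hypothetical sketch of the callback described in this PR.)
    """
    # Normalize with a trailing slash so "/tbl/metadata2/f" is not
    # mistakenly treated as being inside "/tbl/metadata".
    norm = lambda p: p if p.endswith("/") else p + "/"
    meta, data = norm(metadata_location), norm(data_location)
    if meta == data:
        return False  # edge case: shared prefix -> skip all metadata cleanup
    return path.startswith(meta)  # metadata files only, never shared data files
```

The key design point is the direction of the check: expiration deletes only paths proven to be metadata, so any path outside the metadata location (i.e. any shared data file) is left untouched by default.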

In general, we want to remove Uniform's custom Iceberg patches, since this puts us on a path toward upgrading the Iceberg dependency more effectively and makes Uniform more maintainable.

How was this patch tested?

Added integration tests for expiration to ConvertToIcebergSuite

Does this PR introduce any user-facing changes?

No, behavior is preserved

… callback which checks against metadata location
@@ -123,21 +128,160 @@ class ConvertToIcebergSuite extends QueryTest with Eventually {
withDefaultTablePropsInSQLConf {
deltaSpark.range(10).write.format("delta")
.option("path", testTablePath)
.option("delta.enableIcebergCompatV2", "true")
Contributor Author:
None of these tests passed for me (even before my change) without setting this, so I'm fixing it in this PR.

.saveAsTable(testTableName)
}
verifyReadWithIceberg(testTableName, 0 to 19 map (Row(_)))
}
}

def runDeltaSql(sqlStr: String): Unit = {
test("Expire Snapshots") {
Contributor:
put this case in spark/src/test/scala/org/apache/spark/sql/delta/UniversalFormatSuiteBase.scala
