[Feature][Spark] Fixing data skipping for timestamp stats with truncated timezone offsets (Fixes #5512) #5568
Which Delta project/connector is this regarding?
Spark
Description
This PR fixes a correctness bug in the 3.3 line where timestamp column statistics can cause Delta to skip files incorrectly for historical timestamps (e.g. Europe/Stockholm in 1850) whose timezone offsets include seconds.
Older Delta versions (3.2.x / 3.3.x) wrote stats with timezone offsets truncated to minutes in JSON (e.g. +00:53 instead of +00:53:28). When data skipping uses these stats, files that do contain matching rows may be pruned, so queries silently “lose” data.
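Where the truncation comes from can be illustrated with java.time offset patterns. This is only a sketch of the mechanism, not Delta's actual serialization code; the offset value is taken from the example above:

```scala
import java.time.{OffsetDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

// Second-level offset from the example above (+00:53:28).
val offset = ZoneOffset.ofHoursMinutesSeconds(0, 53, 28)
val ts = OffsetDateTime.of(1850, 1, 1, 0, 0, 0, 0, offset)

// "XXX" renders the offset as HH:mm only, silently dropping the seconds:
DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ssXXX").format(ts)
// => 1850-01-01T00:00:00+00:53

// "XXXXX" keeps the seconds component of the offset:
DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ssXXXXX").format(ts)
// => 1850-01-01T00:00:00+00:53:28
```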
Note: the fix has a reader side and a write side. The reader-side change preserves backward compatibility and safety: existing tables with already-bad 3.3.2 stats continue to work, because widening the [min, max] window prevents data loss when predicates hit those files. The write-side change stops truncated timestamps from being written, so newly written stats are correct.
This PR:
- widens the effective [min, max] interval for timestamp stats on the read path, and
- fixes the write path so that timezone offsets are no longer truncated in serialized stats.
Code used for reproducing the issue
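The repro snippet itself is not inlined here; the following is a minimal sketch in the same spirit, assuming a local Spark session with the Delta extensions on the classpath (the path and column name are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object TruncatedOffsetRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("delta-truncated-offset-repro")
      // Historical timestamps in this zone carry a second-level offset.
      .config("spark.sql.session.timeZone", "Europe/Stockholm")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()

    val path = "/tmp/delta_ts_repro"
    spark.sql("SELECT TIMESTAMP'1850-01-01 00:00:00' AS ts")
      .write.format("delta").mode("overwrite").save(path)

    // On affected versions the stats carry an offset truncated to minutes,
    // so this point lookup can prune the only file and return 0 rows.
    spark.read.format("delta").load(path)
      .where("ts = TIMESTAMP'1850-01-01 00:00:00'")
      .show()
  }
}
```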
Output in delta-3.3.2
Delta stats for 3.3.2:
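The captured output is not reproduced here; schematically, the 3.3.2 stats carry an offset truncated to minutes (values illustrative, not actual output):

```json
{
  "numRecords": 1,
  "minValues": { "ts": "1850-01-01T00:00:00.000+00:53" },
  "maxValues": { "ts": "1850-01-01T00:00:00.000+00:53" }
}
```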
Output in delta-4.0.0
Delta stats in 4.0:
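Schematically, the 4.0 stats keep the full second-level offset (again, values illustrative):

```json
{
  "numRecords": 1,
  "minValues": { "ts": "1850-01-01T00:00:00.000+00:53:28" },
  "maxValues": { "ts": "1850-01-01T00:00:00.000+00:53:28" }
}
```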
What changes have been made in the PR
DataSkippingReader: widen timestamp stats bounds
Delta already compensates for millisecond truncation of timestamps in JSON by widening MAX by +1 ms. For issue #5249, older writers can also be off by up to 59 seconds because timezone offsets were truncated from HH:mm:ss to HH:mm.
To avoid incorrect pruning, this PR treats timestamp stats as approximate and widens the effective interval.
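As a sketch of the widening, using Spark SQL Column expressions (illustrative only, not the PR's actual diff; widenedMin/widenedMax are hypothetical names):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.expr

// Offsets truncated from HH:mm:ss to HH:mm can shift stored stats by up to
// 59 seconds; JSON millisecond truncation additionally costs up to 1 ms on MAX.
val offsetSlack = expr("INTERVAL 59 SECONDS")
val millisSlack = expr("INTERVAL 1 MILLISECOND")

// Hypothetical helpers: bounds used for pruning instead of the raw stats.
def widenedMin(statsMin: Column): Column = statsMin - offsetSlack
def widenedMax(statsMax: Column): Column = statsMax + offsetSlack + millisSlack

// A file may only be skipped for `ts = v` if v falls outside the widened
// window, i.e. v < widenedMin(min) || v > widenedMax(max).
```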
How was this patch tested?
Added a regression test, data skipping with timestamp stats truncated by seconds (issue 5249), in DataSkippingDeltaTests.scala, which:
- writes a row with a TIMESTAMP value,
- rewrites the stats so that minValues/maxValues are shifted by +30 seconds (mimicking the 3.2.x/3.3.x truncated-offset behavior), and
- verifies the file is not skipped by data skipping (a sketch of the test's shape follows below).
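The shape of such a test, sketched against the usual Spark/Delta test utilities (shiftTimestampStats is a hypothetical helper for rewriting the per-file stats in the log):

```scala
import org.apache.spark.sql.Row

test("data skipping with timestamp stats truncated by seconds (issue 5249)") {
  withTempDir { dir =>
    val path = dir.getCanonicalPath
    spark.sql("SELECT TIMESTAMP'1850-01-01 00:00:00' AS ts")
      .write.format("delta").save(path)

    // Hypothetical helper: rewrite the file's stats so minValues/maxValues
    // are shifted by +30 seconds, mimicking the 3.2.x/3.3.x truncated offsets.
    shiftTimestampStats(path, column = "ts", shiftSeconds = 30)

    // With the widened bounds the file must not be skipped, so the row is found.
    checkAnswer(
      spark.read.format("delta").load(path)
        .where("ts = TIMESTAMP'1850-01-01 00:00:00'"),
      Row(java.sql.Timestamp.valueOf("1850-01-01 00:00:00")))
  }
}
```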
Ran spark/testOnly org.apache.spark.sql.delta.DataSkippingDeltaTests locally. An assembly jar was also built with the changes, and the same repro code was run against it locally; it works and we get the expected output.
Delta stats in 3.3 after the fix:
Does this PR introduce any user-facing changes?
Yes, but only as a bug fix:
- TIMESTAMP filters on historical timezones (with second-level offsets, e.g. +00:53:28) will now return the correct rows instead of silently missing them due to overly aggressive stats-based data skipping.
- The [min, max] range for timestamp stats is widened by up to 59 seconds, so in rare cases slightly more files may be scanned for timestamp predicates. No API, configuration, or protocol-breaking changes are introduced.