[Bug][Spark]Add read-path assertion for corrupted checkpoint recovery (issue #5458) #5565
Which Delta project/connector is this regarding?
Spark
Description
This PR adds a missing read-path assertion to the existing corrupted-checkpoint recovery test in SnapshotManagementSuite, so the behavior described in #5458 is captured as an executable test.
Concretely, the test now also reads the table data after corrupting checkpoint 0 and asserts the row count.
This encodes in code the expectation that a corrupted checkpoint must not make a healthy table unreadable and that Delta should fall back to JSON log files.
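The expectation encoded by the test can be illustrated with a small file-system simulation (plain Python, no Spark; every name below is hypothetical, not the real Delta implementation): a reader that prefers the latest checkpoint but, when the checkpoint file is unreadable, falls back to replaying the JSON log files.

```python
import json
import os
import tempfile

def read_row_count(log_dir):
    """Toy model of the read path: prefer the checkpoint, fall back to JSON logs.
    This is a hedged sketch, not Delta's actual logic."""
    ckpt = os.path.join(log_dir, "00000000000000000000.checkpoint.parquet")
    try:
        with open(ckpt, "rb") as f:
            data = f.read()
        if data.startswith(b"PAR1"):   # Parquet magic bytes
            return int(data[4:])       # toy encoding: count stored after the magic
        raise IOError("corrupted checkpoint")
    except OSError:
        # Fallback: replay the JSON commit files instead of failing the read.
        count = 0
        for name in sorted(os.listdir(log_dir)):
            if name.endswith(".json"):
                with open(os.path.join(log_dir, name)) as f:
                    for line in f:
                        if "add" in json.loads(line):
                            count += 1  # one row per AddFile action, for simplicity
        return count

d = tempfile.mkdtemp()
# One JSON commit adding two files, plus a checkpoint we then corrupt.
with open(os.path.join(d, "00000000000000000000.json"), "w") as f:
    f.write(json.dumps({"add": {"path": "a"}}) + "\n")
    f.write(json.dumps({"add": {"path": "b"}}) + "\n")
with open(os.path.join(d, "00000000000000000000.checkpoint.parquet"), "wb") as f:
    f.write(b"\x00garbage")  # corrupted: wrong magic bytes

print(read_row_count(d))  # the corrupted checkpoint must not break the read
```

Despite the corrupted checkpoint, the read succeeds via the JSON files, which is exactly what the new assertion checks for the real table.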
Problem
In Delta 4.0, when the latest checkpoint Parquet file is corrupted, snapshot creation and spark.read.format("delta").load(path) can fail with [FAILED_READ_FILE.NO_HINT]. In Delta 3.2 the same scenario succeeded by falling back to the JSON log files (see #5458).
Fix
In SnapshotManagement, broaden the exception type caught during snapshot creation from checkpoint segments, so that IO/Parquet errors encountered while reading checkpoint files go through the existing retry/fallback path. If all retries fail, the first error is still rethrown, as before.
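The shape of the intended fix can be sketched as follows (Python for brevity; the real change is in the Scala SnapshotManagement class, and all names below are hypothetical): rather than catching only a narrow exception type, any error while building a snapshot from a checkpoint-based segment routes through the existing retry/fallback loop, and the first error is rethrown only if every alternative fails.

```python
def create_snapshot_with_fallback(segments, build):
    """Try each log segment in order (checkpoint-based first, JSON-only last).
    Any exception while building from a segment triggers fallback to the next
    one; if all segments fail, the FIRST error is rethrown, as before."""
    first_error = None
    for segment in segments:
        try:
            return build(segment)      # may raise IO/Parquet errors
        except Exception as e:         # broadened from a narrower exception type
            if first_error is None:
                first_error = e
    raise first_error

# Toy demo: the checkpoint-based segment fails, the JSON-only segment succeeds.
def build(segment):
    if segment == "checkpoint":
        raise IOError("FAILED_READ_FILE.NO_HINT: corrupted checkpoint parquet")
    return {"source": segment, "rows": 2}

snapshot = create_snapshot_with_fallback(["checkpoint", "json-only"], build)
print(snapshot["source"])  # json-only
```

With a narrow catch, the IOError above would escape the loop and fail the read outright; broadening the catch is what lets the JSON-only segment be tried at all.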
Tests
Verified that:
How to Reproduce / Run
Run just this test:
build/sbt \
  'project spark' \
  'testOnly org.apache.spark.sql.delta.SnapshotManagementSuite -- -z "recover from a corrupt checkpoint: previous checkpoint doesn"'
Current behavior (expected with this PR):
On a 4.x stack, this test fails with [FAILED_READ_FILE.NO_HINT] when trying to read the corrupted checkpoint Parquet file; the same test does not fail on 3.2.
This PR is intentionally test‑only. A follow-up PR will:
How was this tested?
I ran the code below on Delta 3.2 and Delta 4.0 and compared the printed count. The corresponding logs are below:
Code:
Logs in Delta 3.2 (jar: io.delta:delta-spark_2.13:3.2.2)
Logs in Delta 4.0 (jar: io.delta:delta-spark_2.13:4.0.0)
In 3.2 the count gets printed (the checkpoint error is logged, but the read falls back to the JSON log files); in 4.0 the count does not get printed because the read fails outright.
Does this PR introduce any user-facing changes?
No. This PR only adds a test case; the actual fix will be submitted in a separate PR.