Avoid OOM-killing query if result-level caching fails #17652
Fixes #17651.
Description
Currently, result-level caching attempts to allocate a buffer large enough to store the entire query result, which can overflow the `Integer.MAX_VALUE` capacity of `ByteArrayOutputStream`. `ByteArrayOutputStream` materializes this case as an `OutOfMemoryError`, which is not caught and terminates the node. This PR limits the allocated buffer for storing query results to whatever is set in `CacheConfig.getResultLevelCacheLimit()`.
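As a minimal sketch of the technique (the class and method names below are illustrative, not Druid's actual API): a byte-limited wrapper fails with a catchable `IOException` once the configured limit is exceeded, instead of letting the underlying `ByteArrayOutputStream` grow until the JVM throws an `OutOfMemoryError` the caller cannot reasonably recover from.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical stand-in for Druid's LimitedOutputStream: rejects writes
// past a fixed byte limit with an IOException the caller can catch.
class CappedOutputStream extends OutputStream {
    private final OutputStream delegate;
    private final long limit;
    private long written;

    CappedOutputStream(OutputStream delegate, long limit) {
        this.delegate = delegate;
        this.limit = limit;
    }

    @Override
    public void write(int b) throws IOException {
        if (written + 1 > limit) {
            throw new IOException("result exceeds cache limit of " + limit + " bytes");
        }
        delegate.write(b);
        written++;
    }
}

public class Demo {
    public static void main(String[] args) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // A tiny 8-byte limit stands in for CacheConfig.getResultLevelCacheLimit().
        try (OutputStream capped = new CappedOutputStream(buffer, 8)) {
            for (int i = 0; i < 100; i++) {
                capped.write(i); // throws on the 9th byte
            }
        } catch (IOException e) {
            // Cache population failed: skip caching this result and keep serving
            // the query, rather than dying with an uncatchable OutOfMemoryError.
            System.out.println("skipped caching: " + e.getMessage());
        }
    }
}
```

The key point is that the failure becomes an ordinary checked exception at the write site, so the caching query runner can degrade gracefully.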
Important Note
I opted to use `LimitedOutputStream` here as it is already used with `ByteArrayOutputStream`. While this is fine in a QueryRunner (single-threaded), it is still less than ideal in the general case, because `LimitedOutputStream` does not guarantee strict consistency between overflow-exception delivery and the ordering of writes to the buffer (see another example below). As such, this class in general is *not* thread-safe, and I think it should be refactored to account for this. Because every use of `LimitedOutputStream` already wraps a `ByteArrayOutputStream`, which already takes locks, we should suffer no performance hit from synchronizing the `LimitedOutputStream::write` methods. This is just in the general spirit of future-proofing the code: given that we're already using locks, we might as well avoid as many future races as we can :). Since this would take some changes to the `LimitedOutputStream` API (away from extending `ByteArrayOutputStream` directly), I've opted not to change these APIs here, but in a separate PR.
Changes to LimitedOutputStream
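For illustration only, the synchronization suggested in the note above might look roughly like the sketch below. This is a hypothetical follow-up, not the code in this PR: the class name and message are invented, and Druid's actual `LimitedOutputStream` API may differ. The idea is that holding the monitor across both the limit check and the delegate write keeps the byte count and the overflow exception consistent under concurrent writers.

```java
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch of the suggested refactor: synchronized write methods
// so the running byte count, the overflow check, and the delegate write are
// atomic with respect to each other.
class SynchronizedLimitedStream extends OutputStream {
    private final OutputStream delegate;
    private final long limit;
    private long written; // guarded by "this"

    SynchronizedLimitedStream(OutputStream delegate, long limit) {
        this.delegate = delegate;
        this.limit = limit;
    }

    @Override
    public synchronized void write(int b) throws IOException {
        ensureCapacity(1);
        delegate.write(b);
        written += 1;
    }

    @Override
    public synchronized void write(byte[] b, int off, int len) throws IOException {
        ensureCapacity(len);
        delegate.write(b, off, len);
        written += len;
    }

    // Called with the monitor held, so check-then-write cannot interleave
    // with another writer's update of "written".
    private void ensureCapacity(long extra) throws IOException {
        if (written + extra > limit) {
            throw new IOException("limit of " + limit + " bytes exceeded");
        }
    }
}
```

Since `ByteArrayOutputStream`'s own methods are already synchronized, adding synchronization at this layer should not introduce a meaningful extra cost in the single-threaded QueryRunner case.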
Release note
Avoid OOM-killing node if large result-level cache population fails for query
Key changed/added classes in this PR
processing/src/main/java/org/apache/druid/io/LimitedOutputStream.java
server/src/main/java/org/apache/druid/query/ResultLevelCachingQueryRunner.java
server/src/test/java/org/apache/druid/query/ResultLevelCachingQueryRunnerTest.java