
Avoid OOM-killing query if result-level caching fails #17652

Open · wants to merge 1 commit into master from fix-oom-on-result-level-cache-population

Conversation

jtuglu-netflix (Contributor) commented Jan 22, 2025

Fixes #17651.

Description

Currently, result-level cache population may try to allocate a buffer for query results that grows past the Integer.MAX_VALUE capacity of ByteArrayOutputStream. ByteArrayOutputStream surfaces this as an OutOfMemoryError, which is not caught and terminates the node. This PR limits the buffer allocated for storing query results to the value configured in CacheConfig.getResultLevelCacheLimit().
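The intended failure mode can be sketched as a bounded stream (class and method names here are hypothetical, not the actual Druid classes): once the configured limit is reached, the writer gets a catchable IOException instead of a JVM-fatal OutOfMemoryError, so cache population can be skipped while the query still completes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Illustrative sketch: fail fast with a catchable IOException once the
// configured result-level cache limit is exceeded, instead of letting the
// backing ByteArrayOutputStream grow until it throws OutOfMemoryError.
class BoundedCacheStream extends OutputStream {
    private final ByteArrayOutputStream delegate = new ByteArrayOutputStream();
    private final int limit; // e.g. the value of CacheConfig.getResultLevelCacheLimit()
    private int written = 0;

    BoundedCacheStream(int limit) {
        this.limit = limit;
    }

    @Override
    public void write(int b) throws IOException {
        if (written + 1 > limit) {
            throw new IOException("Result-level cache limit exceeded: " + limit);
        }
        delegate.write(b);
        written++;
    }

    byte[] toByteArray() {
        return delegate.toByteArray();
    }
}
```

The caller can catch the IOException, log a warning, and simply skip caching for that query rather than terminating the node.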

Important Note

I opted to use LimitedOutputStream here, as it is already used with ByteArrayOutputStream. While this is fine inside a QueryRunner (single-threaded), it is less than ideal in the general case because it doesn't guarantee strict consistency between overflow-exception delivery and the ordering of writes to the buffer (see the example below). As such, this class is *not* thread-safe in general, and I think it should be refactored to account for this. Since every existing use of LimitedOutputStream already wraps a ByteArrayOutputStream, whose methods are already synchronized, we should suffer no performance hit by synchronizing the LimitedOutputStream::write methods. This is in the spirit of future-proofing: given that we're already paying for locks, we might as well avoid as many future races as we can : ). Because this would require changes to the LimitedOutputStream API (it currently extends ByteArrayOutputStream directly), I've opted not to change those APIs here, but in a separate PR.
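The synchronization argument above can be sketched like this (illustrative names, not Druid's actual LimitedOutputStream): because the delegate ByteArrayOutputStream already synchronizes internally, putting the whole check-then-write sequence behind one lock makes overflow delivery consistent with write ordering at little extra cost.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Illustrative sketch of the proposed refactor: one synchronized block
// covers both the limit check and the write, so the check-then-act
// sequence can no longer be interleaved by another thread.
class SynchronizedLimitedStream extends OutputStream {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private final long limit;
    private long written = 0;

    SynchronizedLimitedStream(long limit) {
        this.limit = limit;
    }

    @Override
    public synchronized void write(int b) throws IOException {
        if (written + 1 > limit) {
            throw new IOException("limit of " + limit + " bytes exceeded");
        }
        out.write(b);
        written++;
    }
}
```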

Changes to LimitedOutputStream

  • Expose a public OutputStream get() method that returns the wrapped output stream for stream-specific operations.
  • Make `wrapped` atomic. This isn't a complete fix for the thread-safety concerns above, but it at least prevents a simple future race in which multiple writing threads can produce an uncaught buffer overflow:
T1: write(): reads written = INT_MAX - 1
T2: write(): reads written = INT_MAX - 1
T2: writes written += 1 (written = INT_MAX)
T2: write() succeeds
T1: writes written += 1 (a write past the limit, but no overflow is raised)
T1: write() succeeds
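A sketch of how an atomic counter closes this interleaving (hypothetical class, not the PR's actual code): getAndIncrement reserves the byte before writing, so two threads can never both observe written = INT_MAX - 1 and both succeed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch: AtomicLong.getAndIncrement makes the
// read-modify-write on the counter a single atomic step, so at most
// `limit` writes can ever succeed, regardless of interleaving.
class AtomicLimitedStream extends OutputStream {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private final long limit;
    private final AtomicLong written = new AtomicLong();

    AtomicLimitedStream(long limit) {
        this.limit = limit;
    }

    @Override
    public void write(int b) throws IOException {
        // Reserve a slot first; on failure the counter stays past the
        // limit, which is fine since the stream is unusable after overflow.
        if (written.getAndIncrement() >= limit) {
            throw new IOException("limit of " + limit + " bytes exceeded");
        }
        out.write(b);
    }
}
```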

Release note

Avoid OOM-killing the node if large result-level cache population fails for a query.


Key changed/added classes in this PR
  • processing/src/main/java/org/apache/druid/io/LimitedOutputStream.java
  • server/src/main/java/org/apache/druid/query/ResultLevelCachingQueryRunner.java
  • server/src/test/java/org/apache/druid/query/ResultLevelCachingQueryRunnerTest.java

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

jtuglu-netflix force-pushed the fix-oom-on-result-level-cache-population branch from 5f7c019 to 551d891 on January 22, 2025 06:34
jtuglu-netflix force-pushed the fix-oom-on-result-level-cache-population branch from 551d891 to 8bc7891 on January 22, 2025 06:38
Successfully merging this pull request may close these issues.

ResultLevelCache Population OOM