@zhztheplayer zhztheplayer commented Nov 14, 2025

What changes were proposed in this pull request?

This PR makes Spark throw an OOM error rather than hang when there is not enough JVM heap memory to build a broadcast hashed relation. The temporary UnifiedMemoryManager created for broadcasting is now sized from the JVM's maximum heap size (Runtime.getRuntime.maxMemory) instead of Long.MaxValue-based values.

This setting is optimal in the following sense: if the size passed is too large, as with the current Long.MaxValue-based values, Spark hangs; if the size is smaller than the JVM heap size, the OOM might be thrown too early, even when there is still room in memory for the newly created hashed relation.

Before:

new UnifiedMemoryManager(
    new SparkConf().set(MEMORY_OFFHEAP_ENABLED.key, "false"),
    Long.MaxValue,
    Long.MaxValue / 2,
    1)

After:

new UnifiedMemoryManager(
    new SparkConf().set(MEMORY_OFFHEAP_ENABLED.key, "false"),
    Runtime.getRuntime.maxMemory,
    Runtime.getRuntime.maxMemory / 2,
    1)
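
To illustrate why the size of the budget matters, here is a minimal, self-contained sketch. It does not use Spark's classes: `ToyPool`, `fullGrants`, and `BroadcastPoolSketch` are hypothetical names, and the pool is a deliberately simplified model with a single budget number. It also assumes that each failed page allocation keeps its reservation before the retry, so that the number of retries is bounded only by the budget.

```scala
object BroadcastPoolSketch {
  /** Hypothetical toy pool, not Spark's API: grants at most the bytes still unreserved. */
  final class ToyPool(val capacity: Long) {
    private var reserved = 0L

    def acquire(requested: Long): Long = {
      val granted = math.max(0L, math.min(requested, capacity - reserved))
      reserved += granted
      granted
    }
  }

  /** How many full page-sized grants a budget covers before a grant comes back short. */
  def fullGrants(capacity: Long, pageSize: Long): Long = capacity / pageSize

  def main(args: Array[String]): Unit = {
    val pageSize = 8589934592L // 8 GiB, the page size shown in the logs of this PR

    // Before: a budget derived from Long.MaxValue is never the limiting factor.
    println(s"Long.MaxValue budget covers ${fullGrants(Long.MaxValue, pageSize)} full grants")

    // After: a budget derived from the JVM's maximum heap size.
    val heapBudget = Runtime.getRuntime.maxMemory
    val granted = new ToyPool(heapBudget).acquire(pageSize)
    if (granted < pageSize) {
      println(s"Heap-sized budget grants only $granted of $pageSize bytes: fail fast")
    } else {
      println(s"Heap-sized budget ($heapBudget bytes) covers the full $pageSize bytes")
    }
  }
}
```

Under these assumptions, a Long.MaxValue budget allows on the order of a billion full-size grants, so the bookkeeping never reports a shortfall even though the JVM cannot back the pages. A heap-sized budget returns a short grant as soon as the request exceeds what the heap could hold, which is the point at which the caller can raise an OOM.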

Why are the changes needed?

To report the error quickly instead of hanging.

Does this PR introduce any user-facing change?

Yes. In some scenarios where large unsafe hashed relations are allocated for a broadcast hash join, users will see a meaningful OOM error instead of a hang.

Before (hangs):

15:07:38.456 WARN org.apache.spark.memory.TaskMemoryManager: Failed to allocate a page (8589934592 bytes), try again.
15:07:38.501 WARN org.apache.spark.memory.TaskMemoryManager: Failed to allocate a page (8589934592 bytes), try again.
15:07:38.539 WARN org.apache.spark.memory.TaskMemoryManager: Failed to allocate a page (8589934592 bytes), try again.
15:07:38.580 WARN org.apache.spark.memory.TaskMemoryManager: Failed to allocate a page (8589934592 bytes), try again.
15:07:38.613 WARN org.apache.spark.memory.TaskMemoryManager: Failed to allocate a page (8589934592 bytes), try again.
15:07:38.647 WARN org.apache.spark.memory.TaskMemoryManager: Failed to allocate a page (8589934592 bytes), try again.
...

After (OOM):

An exception or error caused a run to abort: [UNABLE_TO_ACQUIRE_MEMORY] Unable to acquire 8589934592 bytes of memory, got 7194909081. SQLSTATE: 53200 
org.apache.spark.memory.SparkOutOfMemoryError: [UNABLE_TO_ACQUIRE_MEMORY] Unable to acquire 8589934592 bytes of memory, got 7194909081. SQLSTATE: 53200
	at org.apache.spark.errors.SparkCoreErrors$.outOfMemoryError(SparkCoreErrors.scala:456)
	at org.apache.spark.errors.SparkCoreErrors.outOfMemoryError(SparkCoreErrors.scala)
	at org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157)
	at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:98)
	at org.apache.spark.unsafe.map.BytesToBytesMap.allocate(BytesToBytesMap.java:868)
	at org.apache.spark.unsafe.map.BytesToBytesMap.<init>(BytesToBytesMap.java:202)
	at org.apache.spark.unsafe.map.BytesToBytesMap.<init>(BytesToBytesMap.java:209)
	at org.apache.spark.sql.execution.joins.UnsafeHashedRelation$.apply(HashedRelation.scala:464)
	at org.apache.spark.sql.execution.joins.HashedRelationSuite.$anonfun$new$90(HashedRelationSuite.scala:760)

How was this patch tested?

Added tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@zhztheplayer zhztheplayer changed the title from "[SPARK-54354][SQL] Spark hangs when there's not enough JVM heap memory for broadcast hashed relation" to "[SPARK-54354][SQL] Fix Spark hanging when there's not enough JVM heap memory for broadcast hashed relation" Nov 14, 2025
@zhztheplayer
Member Author

@HyukjinKwon @yaooqinn @dongjoon-hyun Thanks.

@zhztheplayer
Member Author

cc @cloud-fan
