Shuffle files written by native CometExchange operator cannot be cleaned #1567

Open
Kontinuation opened this issue Mar 24, 2025 · 1 comment · May be fixed by #1568
Labels
bug Something isn't working

Comments

@Kontinuation (Member)

Describe the bug

Running TPC-H SF=100 repeatedly on a single node will eventually run out of disk when native or auto shuffle mode is enabled. The shuffle files generated while running the queries never get deleted. Setting spark.cleaner.periodicGC.interval=60s or manually triggering a GC on the driver does not help.

This problem only happens when spark.comet.exec.shuffle.mode is native or auto. It does not happen when shuffle mode is jvm.
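Given that observation, switching the shuffle mode back to jvm is a possible stopgap until the leak is fixed. A minimal config change, assuming the same spark-submit invocation as below:

```shell
# Workaround sketch: fall back to the JVM shuffle writer, which per the
# report above does not leak shuffle files. All other flags stay unchanged.
--conf spark.comet.exec.shuffle.mode=jvm \
```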

Steps to reproduce

Running tpcbench.py with --iterations 10 takes hundreds of gigabytes of disk space. Here is an example:

spark-submit \
        --master local[8] \
        --conf spark.driver.memory=3g \
        --conf spark.memory.offHeap.enabled=true \
        --conf spark.memory.offHeap.size=16g \
        --conf spark.cleaner.periodicGC.interval=60s \
        --conf spark.jars=/path/to/comet-spark-spark3.5_2.12-0.8.0-SNAPSHOT.jar \
        --conf spark.driver.extraClassPath=/path/to/comet-spark-spark3.5_2.12-0.8.0-SNAPSHOT.jar \
        --conf spark.executor.extraClassPath=/path/to/comet-spark-spark3.5_2.12-0.8.0-SNAPSHOT.jar \
        --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
        --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
        --conf spark.comet.enabled=true \
        --conf spark.comet.exec.shuffle.enabled=true \
        --conf spark.comet.exec.shuffle.mode=native \
        --conf spark.comet.exec.shuffle.fallbackToColumnar=true \
        --conf spark.comet.exec.shuffle.compression.codec=lz4 \
        --conf spark.comet.exec.replaceSortMergeJoin=true \
        tpcbench.py \
        --benchmark tpch \
        --data /path/to/tpch/sf100_parquet \
        --queries ../../tpch/queries \
        --output tpc-results \
        --iterations 10

Expected behavior

Unused shuffle files should be deleted when GC is triggered on the Spark driver.
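For context on why driver GC is the expected trigger: Spark's ContextCleaner registers weak references to shuffle dependencies, and when a dependency is garbage-collected it asks the shuffle manager to remove the corresponding files. The toy sketch below (plain Python, not Spark's actual API) illustrates the pattern; files that are never registered through this mechanism, which appears to be the situation for the native writer's output here, have nothing to delete them no matter how often GC runs.

```python
import weakref

class ShuffleCleaner:
    """Toy model of a GC-driven shuffle cleaner (illustration only)."""
    def __init__(self):
        self.files = {}   # shuffle_id -> list of file paths
        self._refs = []   # keep weakrefs alive so callbacks can fire

    def register(self, shuffle_id, dep, paths):
        # When `dep` is garbage-collected, drop its shuffle files.
        self.files[shuffle_id] = list(paths)
        self._refs.append(
            weakref.ref(dep, lambda _r, sid=shuffle_id: self._clean(sid)))

    def _clean(self, shuffle_id):
        # A real cleaner would unlink the files on disk here.
        self.files.pop(shuffle_id, None)

class Dep:  # stands in for a ShuffleDependency on the driver
    pass

cleaner = ShuffleCleaner()
dep = Dep()
cleaner.register(0, dep, ["shuffle_0_0_0.data"])
del dep  # GC collects the dependency; the weakref callback cleans up

# Files written outside the tracked path never get a weak reference,
# so no amount of GC will ever clean them -- modeling the leak here.
cleaner.files[1] = ["native-written-file.data"]
```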

Additional context

No response

@Kontinuation Kontinuation added the bug Something isn't working label Mar 24, 2025
@comphead (Contributor)

@mbutrovich
