Skip to content

GC string views on hash join build side #16463

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ctsk
Copy link
Contributor

@ctsk ctsk commented Jun 19, 2025

Which issue does this PR close?

What changes are included in this PR?

  • A utility function to garbage collect (gc) all view-type columns of a batch.
  • Perform gc on the build-side batch of a hash join

Are these changes tested?

Are there any user-facing changes?

No.

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Jun 19, 2025
@ctsk
Copy link
Contributor Author

ctsk commented Jun 19, 2025

Benchmark results:

Comparing main and fix_build-side-gc
--------------------
Benchmark tpch_sf10.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ fix_build-side-gc ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 2273.99 ms │        2561.92 ms │  1.13x slower │
│ QQuery 2     │  414.94 ms │         420.20 ms │     no change │
│ QQuery 3     │  978.78 ms │         942.35 ms │     no change │
│ QQuery 4     │  424.50 ms │         479.00 ms │  1.13x slower │
│ QQuery 5     │ 1206.94 ms │        1402.47 ms │  1.16x slower │
│ QQuery 6     │  455.31 ms │         222.48 ms │ +2.05x faster │
│ QQuery 7     │ 1803.05 ms │        1657.75 ms │ +1.09x faster │
│ QQuery 8     │ 1069.79 ms │        1520.80 ms │  1.42x slower │
│ QQuery 9     │ 2336.36 ms │        2277.62 ms │     no change │
│ QQuery 10    │ 1061.73 ms │         948.20 ms │ +1.12x faster │
│ QQuery 11    │  288.41 ms │         294.14 ms │     no change │
│ QQuery 12    │  996.93 ms │         726.61 ms │ +1.37x faster │
│ QQuery 13    │  692.57 ms │         832.22 ms │  1.20x slower │
│ QQuery 14    │  315.32 ms │         489.95 ms │  1.55x slower │
│ QQuery 15    │  814.54 ms │         694.10 ms │ +1.17x faster │
│ QQuery 16    │  299.17 ms │         223.10 ms │ +1.34x faster │
│ QQuery 17    │ 2000.39 ms │        1892.36 ms │ +1.06x faster │
│ QQuery 18    │ 2443.55 ms │        2400.40 ms │     no change │
│ QQuery 19    │  532.30 ms │         796.33 ms │  1.50x slower │
│ QQuery 20    │  904.87 ms │         759.41 ms │ +1.19x faster │
│ QQuery 21    │ 2586.84 ms │        2944.05 ms │  1.14x slower │
│ QQuery 22    │  456.56 ms │         414.87 ms │ +1.10x faster │
└──────────────┴────────────┴───────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main)                │ 24356.86ms │
│ Total Time (fix_build-side-gc)   │ 24900.32ms │
│ Average Time (main)              │  1107.13ms │
│ Average Time (fix_build-side-gc) │  1131.83ms │
│ Queries Faster                   │          9 │
│ Queries Slower                   │          8 │
│ Queries with No Change           │          5 │
│ Queries with Failure             │          0 │
└──────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      main ┃ fix_build-side-gc ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 349.95 ms │         355.99 ms │     no change │
│ QQuery 2     │  74.36 ms │          43.87 ms │ +1.69x faster │
│ QQuery 3     │  55.24 ms │          59.94 ms │  1.09x slower │
│ QQuery 4     │  46.10 ms │          71.84 ms │  1.56x slower │
│ QQuery 5     │ 122.83 ms │         147.10 ms │  1.20x slower │
│ QQuery 6     │  60.77 ms │          30.01 ms │ +2.02x faster │
│ QQuery 7     │ 151.49 ms │         234.96 ms │  1.55x slower │
│ QQuery 8     │  79.57 ms │          99.71 ms │  1.25x slower │
│ QQuery 9     │ 210.99 ms │         202.58 ms │     no change │
│ QQuery 10    │ 132.90 ms │         144.02 ms │  1.08x slower │
│ QQuery 11    │  38.18 ms │          33.87 ms │ +1.13x faster │
│ QQuery 12    │ 118.61 ms │         167.74 ms │  1.41x slower │
│ QQuery 13    │  55.91 ms │          60.70 ms │  1.09x slower │
│ QQuery 14    │  76.92 ms │          62.10 ms │ +1.24x faster │
│ QQuery 15    │  84.87 ms │          60.17 ms │ +1.41x faster │
│ QQuery 16    │  23.91 ms │          25.35 ms │  1.06x slower │
│ QQuery 17    │ 145.09 ms │         182.89 ms │  1.26x slower │
│ QQuery 18    │ 228.53 ms │         244.52 ms │  1.07x slower │
│ QQuery 19    │  96.24 ms │          66.66 ms │ +1.44x faster │
│ QQuery 20    │  56.56 ms │         104.25 ms │  1.84x slower │
│ QQuery 21    │ 199.91 ms │         230.70 ms │  1.15x slower │
│ QQuery 22    │  45.35 ms │          49.77 ms │  1.10x slower │
└──────────────┴───────────┴───────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main)                │ 2454.30ms │
│ Total Time (fix_build-side-gc)   │ 2678.74ms │
│ Average Time (main)              │  111.56ms │
│ Average Time (fix_build-side-gc) │  121.76ms │
│ Queries Faster                   │         6 │
│ Queries Slower                   │        14 │
│ Queries with No Change           │         2 │
│ Queries with Failure             │         0 │
└──────────────────────────────────┴───────────┘

Seems like it slows down SF 1, 10 a bit :/

@ctsk
Copy link
Contributor Author

ctsk commented Jun 19, 2025

At SF=100, this PR is 10% faster:

--------------------
Benchmark tpch_sf100.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃         pr ┃         old ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 1728.00 ms │  1755.28 ms │     no change │
│ QQuery 2     │ 1356.73 ms │  1241.24 ms │ +1.09x faster │
│ QQuery 3     │ 1987.91 ms │  1995.85 ms │     no change │
│ QQuery 4     │ 1509.51 ms │  1593.78 ms │  1.06x slower │
│ QQuery 5     │ 3676.40 ms │  3770.53 ms │     no change │
│ QQuery 6     │  277.50 ms │   287.08 ms │     no change │
│ QQuery 7     │ 3542.47 ms │  3499.00 ms │     no change │
│ QQuery 8     │ 4037.40 ms │  4228.35 ms │     no change │
│ QQuery 9     │ 5968.77 ms │  5910.46 ms │     no change │
│ QQuery 10    │ 1716.65 ms │  1666.84 ms │     no change │
│ QQuery 11    │ 1059.97 ms │  1005.02 ms │ +1.05x faster │
│ QQuery 12    │ 1091.32 ms │  1110.42 ms │     no change │
│ QQuery 13    │ 1121.01 ms │  1234.91 ms │  1.10x slower │
│ QQuery 14    │  587.82 ms │   535.24 ms │ +1.10x faster │
│ QQuery 15    │  904.97 ms │   890.88 ms │     no change │
│ QQuery 16    │  731.39 ms │   745.86 ms │     no change │
│ QQuery 17    │ 5760.02 ms │  5919.10 ms │     no change │
│ QQuery 18    │ 6404.64 ms │  8073.58 ms │  1.26x slower │
│ QQuery 19    │  688.99 ms │   666.94 ms │     no change │
│ QQuery 20    │ 1725.25 ms │  1742.06 ms │     no change │
│ QQuery 21    │ 7427.94 ms │ 10795.91 ms │  1.45x slower │
│ QQuery 22    │ 1373.74 ms │  1370.31 ms │     no change │
└──────────────┴────────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (fx)        │ 54678.41ms │
│ Total Time (HEAD)      │ 60038.64ms │
│ Average Time (fx)      │  2485.38ms │
│ Average Time (HEAD)    │  2729.03ms │
│ Queries Faster         │          3 │
│ Queries Slower         │          4 │
│ Queries with No Change │         15 │
│ Queries with Failure   │          0 │
└────────────────────────┴────────────┘

Perhaps a threshold is needed?

@2010YOUY01
Copy link
Contributor

Thank you! This is great. I got some minor suggestions:

  1. The issue explained the motivation of the change clearly, I recommend to add the same rationale to the code comment to make the code more understandable.
  2. We can unify gc_record_batch() and this existing function
    fn organize_stringview_arrays(
    I'm also happy to take this on as a follow-up task.

@ctsk
Copy link
Contributor Author

ctsk commented Jun 20, 2025

I will do both of these things later today. I am concerned about the performance impact for smaller-scale tasks. I suspect many users of datafusion are not doing such large joins....

@Dandandan
Copy link
Contributor

We have quite some implementations of gc-ing arrays. I am wondering in this case if the performance can be improved for smaller tables by this heuristic used here:

https://github.com/apache/datafusion/blob/3c4e39ac0cf83bd8ead45722a5873bac731b53f1/datafusion/physical-plan/src/coalesce/mod.rs#L222C4-L222C24

@ctsk ctsk marked this pull request as draft June 21, 2025 07:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-plan Changes to the physical-plan crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Excessive Arc-clone in HashJoinStream with StringView on build-side
3 participants