Question
When running #95 (Part B of the TBB refactor) I added tbb::parallel_for to the classic Patchwork main loop and got a clean 1.73× speedup. The same pattern applied to Patchwork++ was slower at every thread count. This issue records the measurement so future contributors don't re-do the experiment, and to consolidate the decision on whether Patchwork++ should grow a TBB mode.
Setup
Numbers
Classic Patchwork (pypatchworkpp.patchwork) — TBB helps
| Configuration |
Median ms/frame |
Median Hz |
taskset -c 0 (1 thread) |
8.31 |
120.4 |
| Default TBB scheduler (24 logical cores) |
4.81 |
207.8 |
Speedup: 1.73×.
Patchwork++ (pypatchworkpp.patchworkpp) — TBB hurts, monotonically
Same TBB pattern as classic (one tbb::parallel_for over all patches, then serial per-ring reduction for TGR / A-GLE), measured at every taskset size:
| Threads |
Median ms/frame |
Median Hz |
vs 1-thread |
| 1 (taskset -c 0) |
8.98 |
111.4 |
— |
| 2 (taskset -c 0,1) |
10.73 |
93.2 |
-16% |
| 4 |
11.03 |
90.6 |
-19% |
| 8 |
11.05 |
90.5 |
-19% |
| 16 |
11.80 |
84.7 |
-24% |
| 24 (full machine) |
14.55 |
68.7 |
-38% |
The slowdown is monotonic and reproducible across re-runs.
Why
Profiling showed the per-patch work in Patchwork++ averages ~14 µs and is dominated by short-lived std::vector<PointXYZ> and Eigen::Matrix allocations inside R-VPF and R-GPF. Specifically per patch we allocate / deallocate:
sorted_pts (full copy of the patch points).
src_wo_verticals (another full copy inside extract_piecewiseground).
src_tmp (yet another full copy inside the R-VPF iterations).
ground_pc_ (refilled and cleared num_iter times).
- A few Eigen temporaries inside
JacobiSVD per iteration.
That's ~5-10 short-lived heap allocations per patch × 648 patches × multiple worker threads. The glibc heap allocator serialises on its central arenas, so concurrent mallocs become the bottleneck. TBB's per-task scheduling overhead (a few microseconds) then dominates the remaining useful work.
Classic Patchwork avoids this because it has no R-VPF, fewer iterations, and writes outputs into caller-owned buffers — so the per-patch allocation count is much lower.
Current implementation
Patchwork++ (cpp/patchworkpp/src/patchworkpp.cpp) stays single-threaded in PR #95. The estimateGround main loop has a long-form code comment explaining the decision so the next person who looks at this with parallelisation in mind can find it immediately. The classic Patchwork (cpp/patchwork/src/patchwork.cpp) gets tbb::parallel_for. TBB is linked only by the classic library; patchworkpp does not depend on TBB.
Do we need to add TBB to Patchwork++?
Probably not, on current evidence.
- Patchwork++ single-thread on KITTI HDL-64E is already ~111 Hz on this machine. The paper reports 55 Hz on i7-7700K with TGR enabled (Sec. V.E). We are already 2× the paper.
- For 10 Hz LiDARs the algorithm has 100 ms of budget per frame; Patchwork++ uses 9. There is no real-time pressure to fight the allocator.
- For users who do hit a CPU-budget wall, the next move would be a thread-aware allocator (jemalloc / tbbmalloc / mimalloc), or refactoring R-VPF / R-GPF to use a slab of pre-allocated buffers per worker, before sprinkling
tbb::parallel_for on it. Either of those is an order of magnitude more work than this PR and should only be justified by a real perf complaint from a downstream user.
So my recommendation is: leave Patchwork++ single-threaded for now, keep the code comment + this issue as the reference if anyone asks. Reopen the discussion if a user reports Patchwork++ being CPU-bound in practice.
What would change my mind
Open to revisiting if any of these happen:
- A user reports Patchwork++ being too slow on a sensor we care about (very-dense multi-LiDAR fused clouds where
N > 1M is common, or 20 Hz LiDARs on a constrained CPU like a Jetson where we need both speed and headroom).
- We get a benchmark showing thread-aware allocator (jemalloc or tbbmalloc_proxy) materially closes the gap on Patchwork++.
- The algorithm itself changes to amortise the per-patch allocations (slab allocator per worker, reusable Eigen scratch, etc.) — at that point TBB on top would likely pay off.
Cross-refs: #89, #94, #95.
Question
When running #95 (Part B of the TBB refactor) I added
tbb::parallel_forto the classic Patchwork main loop and got a clean 1.73× speedup. The same pattern applied to Patchwork++ was slower at every thread count. This issue records the measurement so future contributors don't re-do the experiment, and to consolidate the decision on whether Patchwork++ should grow a TBB mode.Setup
pip installfrom clean conda env.python/examples/bench_hz.py(median of per-framegetTimeTaken()from C++, 20-frame warmup).Numbers
Classic Patchwork (
pypatchworkpp.patchwork) — TBB helpstaskset -c 0(1 thread)Speedup: 1.73×.
Patchwork++ (
pypatchworkpp.patchworkpp) — TBB hurts, monotonicallySame TBB pattern as classic (one
tbb::parallel_forover all patches, then serial per-ring reduction for TGR / A-GLE), measured at everytasksetsize:The slowdown is monotonic and reproducible across re-runs.
Why
Profiling showed the per-patch work in Patchwork++ averages ~14 µs and is dominated by short-lived
std::vector<PointXYZ>andEigen::Matrixallocations inside R-VPF and R-GPF. Specifically per patch we allocate / deallocate:sorted_pts(full copy of the patch points).src_wo_verticals(another full copy insideextract_piecewiseground).src_tmp(yet another full copy inside the R-VPF iterations).ground_pc_(refilled and clearednum_itertimes).JacobiSVDper iteration.That's ~5-10 short-lived heap allocations per patch × 648 patches × multiple worker threads. The glibc heap allocator serialises on its central arenas, so concurrent mallocs become the bottleneck. TBB's per-task scheduling overhead (a few microseconds) then dominates the remaining useful work.
Classic Patchwork avoids this because it has no R-VPF, fewer iterations, and writes outputs into caller-owned buffers — so the per-patch allocation count is much lower.
Current implementation
Patchwork++ (
cpp/patchworkpp/src/patchworkpp.cpp) stays single-threaded in PR #95. TheestimateGroundmain loop has a long-form code comment explaining the decision so the next person who looks at this with parallelisation in mind can find it immediately. The classic Patchwork (cpp/patchwork/src/patchwork.cpp) getstbb::parallel_for. TBB is linked only by the classic library;patchworkppdoes not depend on TBB.Do we need to add TBB to Patchwork++?
Probably not, on current evidence.
tbb::parallel_foron it. Either of those is an order of magnitude more work than this PR and should only be justified by a real perf complaint from a downstream user.So my recommendation is: leave Patchwork++ single-threaded for now, keep the code comment + this issue as the reference if anyone asks. Reopen the discussion if a user reports Patchwork++ being CPU-bound in practice.
What would change my mind
Open to revisiting if any of these happen:
N > 1Mis common, or 20 Hz LiDARs on a constrained CPU like a Jetson where we need both speed and headroom).Cross-refs: #89, #94, #95.