Skip to content

Why Patchwork++ has no TBB parallelisation mode #96

@LimHyungTae

Description

@LimHyungTae

Question

When running #95 (Part B of the TBB refactor) I added tbb::parallel_for to the classic Patchwork main loop and got a clean 1.73× speedup. The same pattern applied to Patchwork++ was slower at every thread count. This issue records the measurement so future contributors don't re-do the experiment, and to consolidate the decision on whether Patchwork++ should grow a TBB mode.

Setup

Numbers

Classic Patchwork (pypatchworkpp.patchwork) — TBB helps

Configuration Median ms/frame Median Hz
taskset -c 0 (1 thread) 8.31 120.4
Default TBB scheduler (24 logical cores) 4.81 207.8

Speedup: 1.73×.

Patchwork++ (pypatchworkpp.patchworkpp) — TBB hurts, monotonically

Same TBB pattern as classic (one tbb::parallel_for over all patches, then serial per-ring reduction for TGR / A-GLE), measured at every taskset size:

Threads Median ms/frame Median Hz vs 1-thread
1 (taskset -c 0) 8.98 111.4
2 (taskset -c 0,1) 10.73 93.2 -16%
4 11.03 90.6 -19%
8 11.05 90.5 -19%
16 11.80 84.7 -24%
24 (full machine) 14.55 68.7 -38%

The slowdown is monotonic and reproducible across re-runs.

Why

Profiling showed the per-patch work in Patchwork++ averages ~14 µs and is dominated by short-lived std::vector<PointXYZ> and Eigen::Matrix allocations inside R-VPF and R-GPF. Specifically per patch we allocate / deallocate:

  1. sorted_pts (full copy of the patch points).
  2. src_wo_verticals (another full copy inside extract_piecewiseground).
  3. src_tmp (yet another full copy inside the R-VPF iterations).
  4. ground_pc_ (refilled and cleared num_iter times).
  5. A few Eigen temporaries inside JacobiSVD per iteration.

That's ~5-10 short-lived heap allocations per patch × 648 patches × multiple worker threads. The glibc heap allocator serialises on its central arenas, so concurrent mallocs become the bottleneck. TBB's per-task scheduling overhead (a few microseconds) then dominates the remaining useful work.

Classic Patchwork avoids this because it has no R-VPF, fewer iterations, and writes outputs into caller-owned buffers — so the per-patch allocation count is much lower.

Current implementation

Patchwork++ (cpp/patchworkpp/src/patchworkpp.cpp) stays single-threaded in PR #95. The estimateGround main loop has a long-form code comment explaining the decision so the next person who looks at this with parallelisation in mind can find it immediately. The classic Patchwork (cpp/patchwork/src/patchwork.cpp) gets tbb::parallel_for. TBB is linked only by the classic library; patchworkpp does not depend on TBB.

Do we need to add TBB to Patchwork++?

Probably not, on current evidence.

  • Patchwork++ single-thread on KITTI HDL-64E is already ~111 Hz on this machine. The paper reports 55 Hz on i7-7700K with TGR enabled (Sec. V.E). We are already 2× the paper.
  • For 10 Hz LiDARs the algorithm has 100 ms of budget per frame; Patchwork++ uses 9. There is no real-time pressure to fight the allocator.
  • For users who do hit a CPU-budget wall, the next move would be a thread-aware allocator (jemalloc / tbbmalloc / mimalloc), or refactoring R-VPF / R-GPF to use a slab of pre-allocated buffers per worker, before sprinkling tbb::parallel_for on it. Either of those is an order of magnitude more work than this PR and should only be justified by a real perf complaint from a downstream user.

So my recommendation is: leave Patchwork++ single-threaded for now, keep the code comment + this issue as the reference if anyone asks. Reopen the discussion if a user reports Patchwork++ being CPU-bound in practice.

What would change my mind

Open to revisiting if any of these happen:

  • A user reports Patchwork++ being too slow on a sensor we care about (very-dense multi-LiDAR fused clouds where N > 1M is common, or 20 Hz LiDARs on a constrained CPU like a Jetson where we need both speed and headroom).
  • We get a benchmark showing thread-aware allocator (jemalloc or tbbmalloc_proxy) materially closes the gap on Patchwork++.
  • The algorithm itself changes to amortise the per-patch allocations (slab allocator per worker, reusable Eigen scratch, etc.) — at that point TBB on top would likely pay off.

Cross-refs: #89, #94, #95.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions