Why Patchwork++ has no TBB parallelisation mode #96

## Question

While working on #95 (Part B of the TBB refactor), I added `tbb::parallel_for` to the classic Patchwork main loop and got a clean 1.73× speedup. The same pattern applied to Patchwork++ was **slower** than the single-threaded run at every multi-thread setting. This issue records the measurements so future contributors don't repeat the experiment, and settles whether Patchwork++ should grow a TBB mode.

## Setup

- Machine: i7-12700, 24 logical cores, Ubuntu 22.04, GCC 11.
- Dataset: KITTI seq 00, 4541 frames, HDL-64E.
- Build: PR #94 (common library) + PR #95 (TBB on classic), `pip install` from clean conda env.
- Bench harness: `python/examples/bench_hz.py` (median of per-frame `getTimeTaken()` from C++, 20-frame warmup).
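
For readers without the harness to hand, the measurement protocol can be sketched roughly as below. This is a minimal illustration, not the contents of `bench_hz.py`: `median_frame_ms` and its `process_frame` callback are hypothetical names, and the callback stands in for whatever runs one frame and returns the C++-side `getTimeTaken()` value in milliseconds.

```python
from statistics import median
from typing import Callable, Iterable, Tuple


def median_frame_ms(
    frames: Iterable[object],
    process_frame: Callable[[object], float],
    warmup: int = 20,
) -> Tuple[float, float]:
    """Return (median ms/frame, Hz at that median) after discarding warmup frames.

    `process_frame` is a hypothetical callback that runs one frame and
    returns the elapsed time for that frame in milliseconds.
    """
    times_ms = [process_frame(frame) for frame in frames]
    measured = times_ms[warmup:]  # drop the first `warmup` samples
    if not measured:
        raise ValueError("need more frames than warmup samples")
    med_ms = median(measured)
    # Hz is derived from the median time here; a harness could equally
    # report the median of the per-frame Hz values instead.
    return med_ms, 1000.0 / med_ms
```

Taking the median rather than the mean keeps a handful of slow frames (page faults, scheduler hiccups) from skewing the result.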

## Numbers

### Classic Patchwork (`pypatchworkpp.patchwork`) — TBB helps

| Configuration | Median ms/frame | Median Hz |
|---|---|---|
| `taskset -c 0` (1 thread) | 8.31 | 120.4 |
| Default TBB scheduler (24 logical cores) | **4.81** | **207.8** |

Speedup: **1.73×**.

### Patchwork++ (`pypatchworkpp.patchworkpp`) — TBB hurts, monotonically

Same TBB pattern as classic (one `tbb::parallel_for` over all patches, then serial per-ring reduction for TGR / A-GLE), measured at every `taskset` size:

| Threads | Median ms/frame | Median Hz | Hz vs 1 thread |
|---|---|---|---|
| 1 (`taskset -c 0`) | 8.98 | **111.4** | baseline |
| 2 (`taskset -c 0,1`) | 10.73 | 93.2 | -16% |
| 4 | 11.03 | 90.6 | -19% |
| 8 | 11.05 | 90.5 | -19% |
| 16 | 11.80 | 84.7 | -24% |
| 24 (full machine) | 14.55 | 68.7 | -38% |

The slowdown is monotonic and reproducible across re-runs.
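
The percentage column in the table above is the change in throughput relative to the 1-thread row. A small sketch to recompute it from the ms/frame column (values copied from the table; `throughput_change` is a hypothetical helper name). Hz recomputed this way can differ in the last digit from the reported median Hz.

```python
# Median ms/frame per thread count, copied from the Patchwork++ table.
PATCHWORKPP_MS = {1: 8.98, 2: 10.73, 4: 11.03, 8: 11.05, 16: 11.80, 24: 14.55}


def throughput_change(ms_by_threads, baseline_threads=1):
    """Map thread count -> (Hz, percent change in Hz vs the baseline row)."""
    base_hz = 1000.0 / ms_by_threads[baseline_threads]
    rows = {}
    for threads, ms in sorted(ms_by_threads.items()):
        hz = 1000.0 / ms
        rows[threads] = (hz, 100.0 * (hz / base_hz - 1.0))
    return rows


if __name__ == "__main__":
    for threads, (hz, pct) in throughput_change(PATCHWORKPP_MS).items():
        print(f"{threads:>2} threads: {hz:6.1f} Hz ({pct:+.0f}%)")
```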

## Why

Profiling showed that the per-patch work in Patchwork++ averages ~14 µs and is dominated by short-lived `std::vector<PointXYZ>` and `Eigen::Matrix` allocations inside R-VPF and R-GPF. Specifically, for each patch we allocate and deallocate:

1. `sorted_pts` (full copy of the patch points).
2. `src_wo_verticals` (another full copy inside `extract_piecewiseground`).
3. `src_tmp` (yet another full copy inside the R-VPF iterations).
4. `ground_pc_` (refilled and cleared `num_iter` times).
5. A few Eigen temporaries inside `JacobiSVD` per iteration.

That's ~5-10 short-lived heap allocations per patch, across 648 patches, issued concurrently from several worker threads. The glibc heap allocator can serialise threads on its arena locks, so concurrent mallocs become the bottleneck. TBB's per-task scheduling overhead (a few microseconds) is then no longer small next to ~14 µs of useful work per patch.
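
As a sanity check on these figures, a back-of-envelope sketch follows. The patch count, per-patch time, and allocation range come from the text above; the 3 µs task overhead is an assumed value chosen only to illustrate "a few microseconds", and the overhead ratio applies only if each patch were scheduled as its own task.

```python
PATCHES_PER_FRAME = 648          # from the text
PER_PATCH_US = 14.0              # profiled average, from the text
ALLOCS_PER_PATCH = (5, 10)       # rough range, from the text
ASSUMED_TASK_OVERHEAD_US = 3.0   # assumption: one reading of "a few microseconds"


def frame_estimate():
    """Return (serial ms/frame, (min, max) allocations/frame, overhead ratio)."""
    serial_ms = PATCHES_PER_FRAME * PER_PATCH_US / 1000.0
    allocs = tuple(n * PATCHES_PER_FRAME for n in ALLOCS_PER_PATCH)
    # Scheduling overhead relative to useful work, one task per patch.
    overhead_ratio = ASSUMED_TASK_OVERHEAD_US / PER_PATCH_US
    return serial_ms, allocs, overhead_ratio
```

The serial estimate comes out at about 9.07 ms, close to the measured 8.98 ms single-thread median, so the per-patch loop accounts for essentially the whole frame. That leaves roughly 3,000 to 6,500 allocator round-trips per frame as the shared resource the workers compete for.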

Classic Patchwork avoids this because it has no R-VPF, fewer iterations, and writes outputs into caller-owned buffers — so the per-patch allocation count is much lower.

## Current implementation

Patchwork++ (`cpp/patchworkpp/src/patchworkpp.cpp`) **stays single-threaded** in PR #95. The `estimateGround` main loop has a long-form code comment explaining the decision so the next person who looks at this with parallelisation in mind can find it immediately. The classic Patchwork (`cpp/patchwork/src/patchwork.cpp`) gets `tbb::parallel_for`. TBB is linked only by the classic library; `patchworkpp` does not depend on TBB.

## Do we need to add TBB to Patchwork++?

**Probably not, on current evidence.**

- Patchwork++ single-threaded on KITTI HDL-64E already runs at ~111 Hz on this machine. The paper reports 55 Hz on an i7-7700K with TGR enabled (Sec. V.E), so we are at about 2× the paper's figure, though on a newer CPU.
- For 10 Hz LiDARs the algorithm has 100 ms of budget per frame; Patchwork++ uses about 9 ms. There is no real-time pressure to fight the allocator.
- For users who do hit a CPU-budget wall, the next move would be a thread-aware allocator (jemalloc / tbbmalloc / mimalloc), or refactoring R-VPF / R-GPF to use a slab of pre-allocated buffers per worker, **before** sprinkling `tbb::parallel_for` on it. Either of those is an order of magnitude more work than this PR and should only be justified by a real perf complaint from a downstream user.
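
The budget argument is simple arithmetic; a small sketch (hypothetical helper name) makes it easy to redo for other sensor rates. The 8.98 ms input is the median measured on this desktop machine, so the result says nothing about slower CPUs.

```python
def budget_fraction(frame_ms: float, sensor_hz: float) -> float:
    """Fraction of the per-frame time budget consumed at a given sensor rate."""
    if frame_ms < 0 or sensor_hz <= 0:
        raise ValueError("frame_ms must be >= 0 and sensor_hz must be > 0")
    budget_ms = 1000.0 / sensor_hz  # e.g. 100 ms per frame at 10 Hz
    return frame_ms / budget_ms
```

With the single-thread median, `budget_fraction(8.98, 10)` is about 0.09, and the same frame time against a 20 Hz sensor is about 0.18.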

So my recommendation is: leave Patchwork++ single-threaded for now, and keep the code comment and this issue as the reference if anyone asks. Reopen the discussion if a user reports Patchwork++ being CPU-bound in practice.

## What would change my mind

Open to revisiting if any of these happen:

- A user reports Patchwork++ being too slow on a sensor we care about (very-dense multi-LiDAR fused clouds where `N > 1M` is common, or 20 Hz LiDARs on a constrained CPU like a Jetson where we need both speed and headroom).
- We get a benchmark showing that a thread-aware allocator (jemalloc or tbbmalloc_proxy) materially closes the gap on Patchwork++.
- The algorithm itself changes to amortise the per-patch allocations (slab allocator per worker, reusable Eigen scratch, etc.) — at that point TBB on top would likely pay off.
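
For anyone who wants to run the allocator experiment, one low-effort approach is to preload the allocator into an unmodified benchmark process and repeat the `taskset` sweep. The sketch below only builds the command line and environment; `bench_invocation` is a hypothetical helper, the benchmark arguments are whatever your build needs, and the `preload` path must point at a jemalloc or tbbmalloc_proxy shared object wherever it is installed on your system. It also assumes a Patchwork++ build with the experimental `tbb::parallel_for` loop, which is not what ships in PR #95.

```python
import os
from typing import Dict, List, Optional, Sequence, Tuple


def bench_invocation(
    threads: int,
    bench_argv: Sequence[str],
    preload: Optional[str] = None,
    base_env: Optional[Dict[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Build (argv, env) pinning a benchmark to CPUs 0..threads-1 via taskset."""
    if threads < 1:
        raise ValueError("threads must be >= 1")
    cpu_list = ",".join(str(cpu) for cpu in range(threads))
    argv = ["taskset", "-c", cpu_list, *bench_argv]
    # Copy the environment so the caller's mapping is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    if preload is not None:
        env["LD_PRELOAD"] = preload  # swap the allocator without rebuilding
    return argv, env
```

Each `(argv, env)` pair can be passed to `subprocess.run(argv, env=env, check=True)`, once with `preload=None` and once with the allocator library, for the same thread counts as in the table above.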

Cross-refs: #89, #94, #95.
