-
Notifications
You must be signed in to change notification settings - Fork 56
Description
Summary
The LastUsedVertexHintGenerator causes severe thread contention when performing natural neighbor interpolation in parallel, resulting in performance degradation compared to per-thread cloned triangulations. Other generators, such as HierarchyHintGenerator, can be executed in parallel without blocking, but I'd like the best of both worlds (if possible).
Use Case
I'm performing parallel natural neighbor interpolation for some geospatial data processing, where:
- Multiple threads interpolate independent image regions simultaneously
- Each thread creates its own
NaturalNeighborinstance viatriangulation.natural_neighbor() - The triangulation is read-only during interpolation
- Workload involves thousands of interpolations per second across 16+ threads
Problem Description
When sharing a single DelaunayTriangulation reference across threads:
std::thread::scope(|s| {
for thread_id in 0..NUM_THREADS {
let triangulation = &shared_triangulation; // Shared reference
s.spawn(move || {
let nn = triangulation.natural_neighbor();
// Severe contention here - way slower than expected
for point in points {
nn.interpolate(|v| v.data().value, point);
}
});
}
});Performance: no speedup compared to sequential (i.e. single-threaded).
When cloning the triangulation per thread:
std::thread::scope(|s| {
for thread_id in 0..NUM_THREADS {
let triangulation = shared_triangulation.clone(); // Per-thread copy
s.spawn(move || {
let nn = triangulation.natural_neighbor();
// Perfect scaling - no contention
for point in points {
nn.interpolate(|v| v.data().value, point);
}
});
}
});Performance: Expected linear speedup with thread count. The issue, though, is that cloning that the triangulation has a huge memory footprint (due to cloning vertices, edges etc).
Root Cause Analysis
The issue is in LastUsedVertexHintGenerator:
pub struct LastUsedVertexHintGenerator {
index: AtomicUsize, // <- Shared atomic causing contention
}Every interpolate() call triggers atomic operations:
triangulation.locate(point)callshint_generator().get_hint()→ Atomic load- After location, calls
hint_generator().notify_vertex_lookup()→ Atomic store
With 16 threads performing thousands of interpolations/second, this creates massive atomic contention on the shared AtomicUsize.
Evidence
- Execution times when using the shared reference to the triangulation (multithreaded) are the same as the execution times when running single-threaded
- Cloning eliminates contention (each clone has its own
AtomicUsize)
Impact
This severely impacts any application using spade for parallel spatial interpolation (i.e. computer graphics, scientific computing, etc)
Proposed Solutions
Option 1: Thread-Safe, non-blocking API for the LastUsedVertexHintGenerator
This is what I'd prefer, but not sure how to implement. My idea would be to clone the Hint Generator on a per-thread basis, but without cloning the other triagulation data (e.g. vertices, edges etc).
Option 2: Hint Generator suitable for Parallel Contexts (e.g. HierarchyHintGenerator)
This is what I'm currently doing, but I can notice some impact compared to cloning the triangulation and using LastUsedVertexHintGenerator.
Option 3: Bypass Hint Generator API
Add methods like locate_without_hint() or interpolate_parallel() that skip hint generation entirely. I haven't tried this yet, though.
Environment
- rustc version: 1.86.0
- Spade version: 2.13.1
- Platform: aarch64/macOS
Workaround
Currently I'm using the HierarchyHintGenerator, but it underperforms in terms of lookup performance as noted in the docs.
Expected Behavior
Thread-parallel interpolation should scale linearly with core count when using shared read-only triangulations, matching the performance of per-thread cloned triangulations.