Skip to content

Thread Contention in LastUsedVertexHintGenerator Severely Impacts Parallel Interpolation Performance #136

@vini-fda

Description

@vini-fda

Summary

The LastUsedVertexHintGenerator causes severe thread contention when performing natural neighbor interpolation in parallel, resulting in performance degradation compared to per-thread cloned triangulations. Other generators, such as HierarchyHintGenerator, can be executed in parallel without blocking, but I'd like the best of both worlds (if possible).

Use Case

I'm performing parallel natural neighbor interpolation for some geospatial data processing, where:

  • Multiple threads interpolate independent image regions simultaneously
  • Each thread creates its own NaturalNeighbor instance via triangulation.natural_neighbor()
  • The triangulation is read-only during interpolation
  • Workload involves thousands of interpolations per second across 16+ threads

Problem Description

When sharing a single DelaunayTriangulation reference across threads:

std::thread::scope(|s| {
    for thread_id in 0..NUM_THREADS {
        let triangulation = &shared_triangulation;  // Shared reference
        s.spawn(move || {
            let nn = triangulation.natural_neighbor();
            // Severe contention here - way slower than expected
            for point in points {
                nn.interpolate(|v| v.data().value, point);
            }
        });
    }
});

Performance: no speedup compared to sequential (i.e. single-threaded).

When cloning the triangulation per thread:

std::thread::scope(|s| {
    for thread_id in 0..NUM_THREADS {
        let triangulation = shared_triangulation.clone();  // Per-thread copy
        s.spawn(move || {
            let nn = triangulation.natural_neighbor();
            // Perfect scaling - no contention
            for point in points {
                nn.interpolate(|v| v.data().value, point);
            }
        });
    }
});

Performance: Expected linear speedup with thread count. The issue, though, is that cloning that the triangulation has a huge memory footprint (due to cloning vertices, edges etc).

Root Cause Analysis

The issue is in LastUsedVertexHintGenerator:

pub struct LastUsedVertexHintGenerator {
    index: AtomicUsize,  // <- Shared atomic causing contention
}

Every interpolate() call triggers atomic operations:

  1. triangulation.locate(point) calls hint_generator().get_hint()Atomic load
  2. After location, calls hint_generator().notify_vertex_lookup()Atomic store

With 16 threads performing thousands of interpolations/second, this creates massive atomic contention on the shared AtomicUsize.

Evidence

  1. Execution times when using the shared reference to the triangulation (multithreaded) are the same as the execution times when running single-threaded
  2. Cloning eliminates contention (each clone has its own AtomicUsize)

Impact

This severely impacts any application using spade for parallel spatial interpolation (i.e. computer graphics, scientific computing, etc)

Proposed Solutions

Option 1: Thread-Safe, non-blocking API for the LastUsedVertexHintGenerator

This is what I'd prefer, but not sure how to implement. My idea would be to clone the Hint Generator on a per-thread basis, but without cloning the other triagulation data (e.g. vertices, edges etc).

Option 2: Hint Generator suitable for Parallel Contexts (e.g. HierarchyHintGenerator)

This is what I'm currently doing, but I can notice some impact compared to cloning the triangulation and using LastUsedVertexHintGenerator.

Option 3: Bypass Hint Generator API

Add methods like locate_without_hint() or interpolate_parallel() that skip hint generation entirely. I haven't tried this yet, though.

Environment

  • rustc version: 1.86.0
  • Spade version: 2.13.1
  • Platform: aarch64/macOS

Workaround

Currently I'm using the HierarchyHintGenerator, but it underperforms in terms of lookup performance as noted in the docs.

Expected Behavior

Thread-parallel interpolation should scale linearly with core count when using shared read-only triangulations, matching the performance of per-thread cloned triangulations.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions