Thread Contention in `LastUsedVertexHintGenerator` Severely Impacts Parallel Interpolation Performance

### Summary

The `LastUsedVertexHintGenerator` causes severe thread contention when performing natural neighbor interpolation in parallel, resulting in performance degradation compared to per-thread cloned triangulations. Other generators, such as `HierarchyHintGenerator`, can be executed in parallel without blocking, but I'd like the best of both worlds (if possible).

### Use Case

I'm performing parallel natural neighbor interpolation for some geospatial data processing, where:
- Multiple threads interpolate independent image regions simultaneously
- Each thread creates its own `NaturalNeighbor` instance via `triangulation.natural_neighbor()`
- The triangulation is **read-only** during interpolation
- Workload involves thousands of interpolations per second across 16+ threads

### Problem Description

When sharing a single `DelaunayTriangulation` reference across threads:
```rust
std::thread::scope(|s| {
    for thread_id in 0..NUM_THREADS {
        let triangulation = &shared_triangulation;  // Shared reference
        s.spawn(move || {
            let nn = triangulation.natural_neighbor();
            // Severe contention here - way slower than expected
            for point in points {
                nn.interpolate(|v| v.data().value, point);
            }
        });
    }
});
```

**Performance**: no speedup compared to sequential (i.e. single-threaded).

When cloning the triangulation per thread:
```rust
std::thread::scope(|s| {
    for thread_id in 0..NUM_THREADS {
        let triangulation = shared_triangulation.clone();  // Per-thread copy
        s.spawn(move || {
            let nn = triangulation.natural_neighbor();
            // Perfect scaling - no contention
            for point in points {
                nn.interpolate(|v| v.data().value, point);
            }
        });
    }
});
```

**Performance**: Expected linear speedup with thread count. The issue, though, is that cloning that the triangulation has a huge memory footprint (due to cloning vertices, edges etc).

### Root Cause Analysis

The issue is in `LastUsedVertexHintGenerator`:

```rust
pub struct LastUsedVertexHintGenerator {
    index: AtomicUsize,  // <- Shared atomic causing contention
}
```

**Every `interpolate()` call triggers atomic operations:**

1. `triangulation.locate(point)` calls `hint_generator().get_hint()` → **Atomic load**
2. After location, calls `hint_generator().notify_vertex_lookup()` → **Atomic store**

With 16 threads performing thousands of interpolations/second, this creates massive atomic contention on the shared `AtomicUsize`.

### Evidence

1. **Execution times when using the shared reference to the triangulation (multithreaded) are the same as the execution times when running single-threaded**
2. **Cloning eliminates contention** (each clone has its own `AtomicUsize`)

### Impact

This severely impacts any application using spade for parallel spatial interpolation (i.e. computer graphics, scientific computing, etc)

### Proposed Solutions

#### Option 1: Thread-Safe, non-blocking API for the `LastUsedVertexHintGenerator`

This is what I'd prefer, but not sure how to implement. My idea would be to clone the Hint Generator on a per-thread basis, but without cloning the other triagulation data (e.g. vertices, edges etc).

#### Option 2: Hint Generator suitable for Parallel Contexts (e.g. `HierarchyHintGenerator`)

This is what I'm currently doing, but I can notice some impact compared to cloning the triangulation and using `LastUsedVertexHintGenerator`.

#### Option 3: Bypass Hint Generator API
Add methods like `locate_without_hint()` or `interpolate_parallel()` that skip hint generation entirely. I haven't tried this yet, though.

### Environment

- **rustc version**: 1.86.0
- **Spade version**: 2.13.1
- **Platform**: aarch64/macOS

### Workaround

Currently I'm using the [`HierarchyHintGenerator`](https://docs.rs/spade/2.13.1/spade/type.HierarchyHintGenerator.html), but it underperforms in terms of lookup performance as noted in the docs. 

### Expected Behavior

Thread-parallel interpolation should scale linearly with core count when using shared read-only triangulations, matching the performance of per-thread cloned triangulations.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Thread Contention in `LastUsedVertexHintGenerator` Severely Impacts Parallel Interpolation Performance #136

Summary

Use Case

Problem Description

Root Cause Analysis

Evidence

Impact

Proposed Solutions

Option 1: Thread-Safe, non-blocking API for the `LastUsedVertexHintGenerator`

Option 2: Hint Generator suitable for Parallel Contexts (e.g. `HierarchyHintGenerator`)

Option 3: Bypass Hint Generator API

Environment

Workaround

Expected Behavior

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Thread Contention in LastUsedVertexHintGenerator Severely Impacts Parallel Interpolation Performance #136

Description

Summary

Use Case

Problem Description

Root Cause Analysis

Evidence

Impact

Proposed Solutions

Option 1: Thread-Safe, non-blocking API for the LastUsedVertexHintGenerator

Option 2: Hint Generator suitable for Parallel Contexts (e.g. HierarchyHintGenerator)

Option 3: Bypass Hint Generator API

Environment

Workaround

Expected Behavior

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

Thread Contention in `LastUsedVertexHintGenerator` Severely Impacts Parallel Interpolation Performance #136

Option 1: Thread-Safe, non-blocking API for the `LastUsedVertexHintGenerator`

Option 2: Hint Generator suitable for Parallel Contexts (e.g. `HierarchyHintGenerator`)