aerospike-client-go/v8 nodeStats.updateOrInsert consumes 44-60% of legacy CPU during mainnet IBD

## Summary

The `bsv-blockchain/aerospike-client-go/v8@v8.7.1-bsv3` fork's per-record stats tracking (`nodeStats.updateOrInsert`) consumes 42–58% of the legacy service's CPU during mainnet IBD (44–60% when measured at the enclosing `batchCommandOperate.Execute`). Every record in every batch response triggers an atomic-map update. Map contention scales with batch size and concurrency, so stats bookkeeping dominates a hot path that should be doing useful work.

## Observed

30-second CPU profiles, mainnet IBD on 2026-06-01:

**eu-2 legacy** — sync rate ~6.6s/block, mixed block sizes:

```
44.64s total samples
- 43.88% github.com/bsv-blockchain/aerospike-client-go/v8.batchCommandOperate.Execute
  - 43.37% baseMultiCommand.parseResult
    - 42.47% batchCommandOperate.nsIter
      - 42.32% nodeStats.updateOrInsert
        - 39.20% nodeStats.updateOrInsert-range1
          - 20.99% atomic/map.(*Map[int,uint64]).Set
```

**eu-3 legacy** — sync rate ~3.3s/block, smaller average blocks:

```
39.98s total samples
- 59.60% batchCommandOperate.Execute
  - 58.75% parseResult
    - 57.80% nsIter
      - 57.60% nodeStats.updateOrInsert
        - 54.65% nodeStats.updateOrInsert-range1
          - 30.47% atomic/map.Set
```

Both hosts running `teranode v0.15.2-beta-4` (commit `e125d1ef8`), aerospike client `v8.7.1-bsv3`.

## What the hot path is doing

Per the call graph: for every record returned in a batch response, the parse loop (`parseResult`) calls `nsIter`, which calls `nodeStats.updateOrInsert(...)`, which calls the internal atomic map's `Set`. With 30+ concurrent batch operations (per-service batchers, partition workers, parallel txMap processing) and thousands of records per batch, the atomic map becomes the contention bottleneck: `atomic/map.Set` alone accounts for 21–30% of legacy's CPU (20.99% on eu-2, 30.47% on eu-3).
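The pattern can be modelled in a few lines of Go. This is a minimal sketch, not the fork's code: `lockedMap`, `parseBatchPerRecord` and `runBatches` are hypothetical names, and it assumes the internal atomic map is a plain Go map behind a single lock, which has not been confirmed against the fork's source. The `Map[int,uint64]` key/value types come from the profile; what the key represents is not known here.

```go
package main

import (
	"fmt"
	"sync"
)

// lockedMap is a hypothetical stand-in for the client's internal atomic map:
// a plain Go map guarded by one mutex. The real type may differ.
type lockedMap struct {
	mu    sync.Mutex
	m     map[int]uint64
	locks uint64 // lock acquisitions, tracked for illustration only
}

func newLockedMap() *lockedMap { return &lockedMap{m: make(map[int]uint64)} }

// Add increments the counter stored under key by delta.
func (l *lockedMap) Add(key int, delta uint64) {
	l.mu.Lock()
	l.m[key] += delta
	l.locks++
	l.mu.Unlock()
}

// parseBatchPerRecord models the shape seen in the profile: one shared-map
// update for every record in the batch response.
func parseBatchPerRecord(stats *lockedMap, keys []int) {
	for _, k := range keys {
		stats.Add(k, 1)
	}
}

// runBatches parses `batches` responses of `records` records each, one
// goroutine per batch, all sharing a single stats map.
func runBatches(batches, records int) *lockedMap {
	stats := newLockedMap()
	keys := make([]int, records) // all zero: every record maps to key 0
	var wg sync.WaitGroup
	for i := 0; i < batches; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			parseBatchPerRecord(stats, keys)
		}()
	}
	wg.Wait()
	return stats
}

func main() {
	stats := runBatches(32, 1024)
	fmt.Println("records counted:", stats.m[0])    // 32768
	fmt.Println("lock acquisitions:", stats.locks) // 32768
}
```

In this model the number of contended lock acquisitions equals the number of records, so it grows with both batch size and the number of batches in flight.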

## Why this hurts more on BSV/teranode than typical aerospike workloads

- **Block-driven burst pattern.** Each block triggers `len(txs) × inputs` reads (batch-decorate previous outputs) + `len(txs)` writes (createUtxos) + `existing-tx-count` writes (SetMinedMulti merge). For a typical mainnet block this is tens of thousands of records flowing through the same batch-response parse loop (`baseMultiCommand.parseResult` in the profiles above).
- **High concurrency, single namespace.** Many batches from different goroutines all hammer the same per-node stats map.
- **Stats are per-record, not per-batch.** A 1024-record batch incurs 1024 atomic-map updates.

For a typical workload (smaller batches, mixed namespaces, fewer concurrent batchers) this overhead would be hidden by network/disk latency. Teranode's workload exposes it because the network is local (single-node aerospike on the same host) and the batches are huge.
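To put a number on the burst pattern, the per-block record count from the first bullet can be written out as arithmetic. The block parameters below are hypothetical and chosen only to illustrate the formula; they are not measurements from eu-2 or eu-3, and `blockRecords` is an illustrative helper, not part of teranode.

```go
package main

import "fmt"

// blockRecords applies the per-block formula from the list above:
// reads  = len(txs) × inputs per tx  (batch-decorate previous outputs)
// writes = len(txs) + existing txs   (createUtxos + SetMinedMulti merge)
func blockRecords(txs, inputsPerTx, existingTxs int) (reads, writes int) {
	reads = txs * inputsPerTx
	writes = txs + existingTxs
	return reads, writes
}

func main() {
	// Hypothetical block: 10,000 txs, 2 inputs each, 1,000 txs already stored.
	reads, writes := blockRecords(10_000, 2, 1_000)
	fmt.Println("reads:", reads)        // 20000
	fmt.Println("writes:", writes)      // 11000
	fmt.Println("total:", reads+writes) // 31000
}
```

With per-record stats, each of those records costs one shared-map update, so the hypothetical block above would issue on the order of 31,000 contended updates.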

## Fix directions

In rough order of difficulty / impact:

1. **Sample stats instead of recording every record.** Update on the first record per batch (or 1-in-N) and extrapolate. Most stats consumers (latency histograms, health monitoring) don't need per-record granularity.

2. **Sharded counters with periodic aggregation.** Replace the global atomic map with a fixed number of shards (or one set of counters per worker goroutine) and aggregate on read. Sharding removes most of the cross-goroutine contention; per-worker counters remove it entirely.

3. **Optionally disable per-record stats for high-throughput paths.** Expose a client config flag (`DisablePerRecordStats: true`); teranode would set it for the legacy/pruner clients.

4. **Push stats update to the end of the batch loop.** If correctness allows: accumulate counts locally and issue a single map update per batch instead of one per record. Cuts atomic ops by a factor of `batch_size`.

5. **Audit whether `updateOrInsert` is using the right structure.** If the map only ever has a small bounded set of keys (one per cluster node?), a fixed-size array indexed by node ID would be cheaper than an atomic hashmap.
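Directions 2 and 4 combine naturally, and a rough sketch follows. All names here (`shardedStats`, `parseBatchLocal`, `numShards`, `run`) are hypothetical and not taken from the client. The sketch assumes the stats are simple additive counters keyed by `int`, so that summing per-batch totals gives the same result as per-record updates; that assumption would need checking against what `updateOrInsert` actually records.

```go
package main

import (
	"fmt"
	"sync"
)

const numShards = 16 // hypothetical shard count

// statsShard is one independently locked slice of the stats.
type statsShard struct {
	mu sync.Mutex
	m  map[int]uint64
}

// shardedStats spreads writers over several shards and sums them on read.
type shardedStats struct {
	shards [numShards]statsShard
}

func newShardedStats() *shardedStats {
	s := &shardedStats{}
	for i := range s.shards {
		s.shards[i].m = make(map[int]uint64)
	}
	return s
}

// flush merges one batch's local counts into the shard chosen by workerID.
func (s *shardedStats) flush(workerID int, local map[int]uint64) {
	sh := &s.shards[uint(workerID)%numShards]
	sh.mu.Lock()
	for k, v := range local {
		sh.m[k] += v
	}
	sh.mu.Unlock()
}

// snapshot aggregates all shards; this is the (rare) read path.
func (s *shardedStats) snapshot() map[int]uint64 {
	out := make(map[int]uint64)
	for i := range s.shards {
		sh := &s.shards[i]
		sh.mu.Lock()
		for k, v := range sh.m {
			out[k] += v
		}
		sh.mu.Unlock()
	}
	return out
}

// parseBatchLocal counts records in a goroutine-local map, touching no
// shared state inside the loop, then flushes once per batch.
func parseBatchLocal(s *shardedStats, workerID int, keys []int) {
	local := make(map[int]uint64)
	for _, k := range keys {
		local[k]++
	}
	s.flush(workerID, local)
}

// run parses `batches` responses of `records` records each, one goroutine
// per batch, with record keys alternating between 0 and 1.
func run(batches, records int) map[int]uint64 {
	stats := newShardedStats()
	keys := make([]int, records)
	for i := range keys {
		keys[i] = i % 2
	}
	var wg sync.WaitGroup
	for w := 0; w < batches; w++ {
		wg.Add(1)
		go func(workerID int) {
			defer wg.Done()
			parseBatchLocal(stats, workerID, keys)
		}(w)
	}
	wg.Wait()
	return stats.snapshot()
}

func main() {
	fmt.Println(run(32, 1024)) // map[0:16384 1:16384]
}
```

Each batch takes one lock instead of one per record, and concurrent batches mostly land on different shards. The trade-off is that readers pay for the aggregation, which is acceptable if stats are read far less often than they are written.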

## Verification

Once a candidate fix lands:

- [ ] Pre/post CPU profile (30s each) on a busy legacy: % time under `baseMultiCommand.parseResult` should drop from ~43% (eu-2) / ~59% (eu-3) to <10%
- [ ] Throughput: blocks/second sync rate on eu-2 / eu-3 should rise proportionally
- [ ] Latency: batch round-trip time should drop (or stay flat if network-bound)
- [ ] Race-test with `-race` to confirm no new correctness issues

## Captured profiles (local)

- `probe/eu2-mainnet-sync-2026-06-01/legacy/legacy-cpu30s.pb.gz`
- `probe/eu3-mainnet-sync-2026-06-01/eu3-prof-1780321364/legacy-cpu30s.pb.gz`

Available on request.

## Related

- #936 — `createUtxos` chunking (fixed). Without it, batch sizes were even larger and this hotspot would be worse.
- #941 — connection pool size. Larger pool = more concurrent batches in flight = more contention on the same stats map.
- #953 — spend circuit breaker counts wrong errors. Different layer, but in the same client.

## Context: where the fork sits

The fork is `github.com/bsv-blockchain/aerospike-client-go/v8` at tag `v8.7.1-bsv3`. The `-bsv3` suffix suggests local modifications layered on top of upstream `aerospike/aerospike-client-go/v8`. Worth confirming whether `nodeStats.updateOrInsert` was added by the BSV fork or is also in upstream — that determines whether the fix goes here or needs upstream coordination.

