aerospike-client-go/v8 nodeStats.updateOrInsert consumes 44-60% of legacy CPU during mainnet IBD #1001

@oskarszoon

Summary

The bsv-blockchain/aerospike-client-go/v8@v8.7.1-bsv3 fork's per-record stats tracking (nodeStats.updateOrInsert) accounts for 42–58% of the legacy service's CPU samples during mainnet IBD (44–60% for the enclosing batchCommandOperate.Execute call). Every record in every batch response triggers an atomic-map update; contention on that map scales with batch size and concurrency and dominates a hot path that should be doing useful work.

Observed

30-second CPU profiles, mainnet IBD on 2026-06-01:

eu-2 legacy — sync rate ~6.6s/block, mixed block sizes:

44.64s total samples
- 43.88% github.com/bsv-blockchain/aerospike-client-go/v8.batchCommandOperate.Execute
  - 43.37% baseMultiCommand.parseResult
    - 42.47% batchCommandOperate.nsIter
      - 42.32% nodeStats.updateOrInsert
        - 39.20% nodeStats.updateOrInsert-range1
          - 20.99% atomic/map.(*Map[int,uint64]).Set

eu-3 legacy — sync rate ~3.3s/block, smaller average blocks:

39.98s total samples
- 59.60% batchCommandOperate.Execute
  - 58.75% parseResult
    - 57.80% nsIter
      - 57.60% nodeStats.updateOrInsert
        - 54.65% nodeStats.updateOrInsert-range1
          - 30.47% atomic/map.Set

Both hosts running teranode v0.15.2-beta-4 (commit e125d1ef8), aerospike client v8.7.1-bsv3.

What the hot path is doing

Per the call graph, the parse loop calls nsIter for every record returned in a batch response; nsIter calls nodeStats.updateOrInsert(...), which in turn calls Set on the internal atomic map. With 30+ concurrent batch operations (per-service batchers, partition workers, parallel txMap processing) and thousands of records per batch, that map becomes the contention bottleneck: atomic/map.Set alone accounts for 21–30% of legacy's CPU samples (20.99% on eu-2, 30.47% on eu-3).
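To make the pattern concrete, here is a minimal model of it, not the client's code. `lockedMap` is a hypothetical stand-in for the internal atomic map (assumed here to be lock-guarded, which I have not verified against the fork), keyed by int with uint64 values as the `Map[int,uint64]` frame in the profile suggests. What the key actually represents is not confirmed.

```go
package main

import (
	"fmt"
	"sync"
)

// lockedMap is a hypothetical stand-in for the client's internal atomic map.
// Assumption: updates are serialized by a lock shared across all goroutines.
type lockedMap struct {
	mu sync.Mutex
	m  map[int]uint64
}

// add increments one counter under the shared lock.
func (l *lockedMap) add(key int, delta uint64) {
	l.mu.Lock()
	l.m[key] += delta
	l.mu.Unlock()
}

// perRecord simulates the reported pattern: each worker parses one batch and
// performs one shared-map update per record. It returns the number of
// lock acquisitions on the shared map.
func perRecord(stats *lockedMap, workers, recordsPerBatch int) int {
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := 0; r < recordsPerBatch; r++ {
				stats.add(0, 1) // same key for every record in this model
			}
		}()
	}
	wg.Wait()
	return workers * recordsPerBatch
}

func main() {
	stats := &lockedMap{m: map[int]uint64{}}
	ops := perRecord(stats, 32, 1024)
	fmt.Println(ops, stats.m[0]) // prints "32768 32768"
}
```

In this model, 32 concurrent batches of 1024 records produce 32768 acquisitions of a single lock, which is the shape of contention the profiles point at.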

Why this hurts more on BSV/teranode than typical aerospike workloads

  • Block-driven burst pattern. Each block triggers len(txs) × inputs reads (batch-decorate previous outputs) + len(txs) writes (createUtxos) + existing-tx-count writes (SetMinedMulti merge). For a typical mainnet block this is tens of thousands of records flowing through one parseRecordResults loop.
  • High concurrency, single namespace. Many batches from different goroutines all hammer the same per-node stats map.
  • Stats are per-record, not per-batch. So a 1024-record batch incurs 1024 atomic ops to the map.

For a typical workload (smaller batches, mixed namespaces, fewer concurrent batchers) this would be hidden by network/disk latency. Teranode's workload exposes it because the network is local (single-node aerospike on the same host) and the batches are huge.
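As a rough worked example of the per-block formula above: the function below just adds up the three terms. The input numbers in `main` are illustrative placeholders, not measurements from eu-2 or eu-3, and `estimate` / `blockLoad` are names made up for this sketch.

```go
package main

import "fmt"

// blockLoad holds the estimated number of records per block, split by the
// three operations named in the issue.
type blockLoad struct {
	Reads        int // previous-output lookups: len(txs) x inputs
	Creates      int // createUtxos writes: len(txs)
	MinedUpdates int // SetMinedMulti writes: existing tx count
}

// Total is the number of records flowing through the batch parse loop,
// and therefore the number of per-record stats updates.
func (b blockLoad) Total() int {
	return b.Reads + b.Creates + b.MinedUpdates
}

// estimate applies the formula from the issue text.
func estimate(txs, avgInputs, existingTxs int) blockLoad {
	return blockLoad{
		Reads:        txs * avgInputs,
		Creates:      txs,
		MinedUpdates: existingTxs,
	}
}

func main() {
	// Hypothetical block: 10,000 txs, 2 inputs each, all already known.
	load := estimate(10000, 2, 10000)
	fmt.Println(load.Total()) // prints "40000"
}
```

With per-record stats, every one of those records costs one update on the shared map.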

Fix directions

In rough order of difficulty / impact:

  1. Sample stats instead of recording every record. Update on the first record per batch (or 1-in-N) and extrapolate. Most stats consumers (latency histograms, health monitoring) don't need per-record granularity.

  2. Sharded counters with periodic aggregation. Replace the shared atomic map with one counter per goroutine (or per fixed shard count), aggregate on read. Strictly per-goroutine counters remove the cross-goroutine contention entirely; a fixed shard count reduces it in proportion to the number of shards.

  3. Optionally disable per-record stats for high-throughput paths. Expose a client config flag (DisablePerRecordStats: true); teranode would set it for the legacy/pruner clients.

  4. Push stats update to the end of the batch loop. If correctness allows: accumulate counts locally, then do a single map update per batch instead of one per record. Cuts shared-map operations by a factor of roughly batch_size.

  5. Audit whether updateOrInsert is using the right structure. If the map only ever has a small bounded set of keys (one per cluster node?), a fixed-size array indexed by node ID would be cheaper than an atomic hashmap.
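A minimal sketch of direction 4 (batch-local accumulation with one flush), assuming stats are plain counters keyed by int. `sharedStats`, `merge`, and `parseBatch` are hypothetical names for illustration and do not correspond to anything in the client. The `lockOps` field exists only so the example can show how many times the shared lock was taken.

```go
package main

import (
	"fmt"
	"sync"
)

// sharedStats is a hypothetical stand-in for the shared stats map.
type sharedStats struct {
	mu      sync.Mutex
	counts  map[int]uint64
	lockOps int // number of lock acquisitions, for demonstration only
}

// merge folds one batch's local counts into the shared map under one lock.
func (s *sharedStats) merge(local map[int]uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.lockOps++
	for k, v := range local {
		s.counts[k] += v
	}
}

// parseBatch counts per-record results in a goroutine-local map and flushes
// once at the end, instead of touching the shared map per record.
func parseBatch(s *sharedStats, resultCodes []int) {
	local := make(map[int]uint64, 4)
	for _, rc := range resultCodes {
		local[rc]++
	}
	s.merge(local)
}

func main() {
	s := &sharedStats{counts: map[int]uint64{}}
	batch := make([]int, 1024) // 1024 records, all with key 0

	var wg sync.WaitGroup
	for i := 0; i < 32; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			parseBatch(s, batch) // batch is only read, so sharing it is safe
		}()
	}
	wg.Wait()

	fmt.Println(s.lockOps, s.counts[0]) // prints "32 32768"
}
```

The totals match per-record counting, but the shared lock is taken 32 times instead of 32768. The trade-off is that readers see a batch's counts only after the batch finishes, which is the "if correctness allows" caveat in item 4.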

Verification

Once a candidate fix lands:

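  • Pre/post CPU profile (30s each) on a busy legacy: % time under the batch parse path (baseMultiCommand.parseResult in the profiles above) should drop from ~43% (eu-2) / ~59% (eu-3) to <10%
  • Throughput: sync rate on eu-2 / eu-3 (currently ~6.6s/block and ~3.3s/block) should improve roughly in proportion to the CPU freed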
  • Latency: batch round-trip time should drop (or stay flat if network-bound)
  • Race-test with -race to confirm no new correctness issues

Captured profiles (local)

  • probe/eu2-mainnet-sync-2026-06-01/legacy/legacy-cpu30s.pb.gz
  • probe/eu3-mainnet-sync-2026-06-01/eu3-prof-1780321364/legacy-cpu30s.pb.gz

Available on request.

Related

Context: where the fork sits

The fork is github.com/bsv-blockchain/aerospike-client-go/v8 at tag v8.7.1-bsv3. The -bsv3 suffix suggests local modifications layered on top of upstream aerospike/aerospike-client-go/v8. Worth confirming whether nodeStats.updateOrInsert was added by the BSV fork or is also in upstream — that determines whether the fix goes here or needs upstream coordination.

Labels: bug
