Summary
The bsv-blockchain/aerospike-client-go/v8@v8.7.1-bsv3 fork's per-record stats tracking (nodeStats.updateOrInsert) consumes 44–60% of the legacy service's CPU during mainnet IBD. Every record in every batch response triggers an atomic-map update; map contention scales with batch size and concurrency, dominating the hot path that should be doing useful work.
Observed
30-second CPU profiles, mainnet IBD on 2026-06-01:
eu-2 legacy — sync rate ~6.6s/block, mixed block sizes:
44.64s total samples
- 43.88% github.com/bsv-blockchain/aerospike-client-go/v8.batchCommandOperate.Execute
- 43.37% baseMultiCommand.parseResult
- 42.47% batchCommandOperate.nsIter
- 42.32% nodeStats.updateOrInsert
- 39.20% nodeStats.updateOrInsert-range1
- 20.99% atomic/map.(*Map[int,uint64]).Set
eu-3 legacy — sync rate ~3.3s/block, smaller average blocks:
39.98s total samples
- 59.60% batchCommandOperate.Execute
- 58.75% parseResult
- 57.80% nsIter
- 57.60% nodeStats.updateOrInsert
- 54.65% nodeStats.updateOrInsert-range1
- 30.47% atomic/map.Set
Both hosts running teranode v0.15.2-beta-4 (commit e125d1ef8), aerospike client v8.7.1-bsv3.
What the hot path is doing
Per the call graph: for every record returned in a batch response, the parse loop calls nsIter, which calls nodeStats.updateOrInsert(...), which calls into the internal atomic-map Set. With 30+ concurrent batch operations (per-service batchers, partition workers, parallel txMap processing) and thousands of records per batch, the atomic map becomes the contention bottleneck — atomic/map.Set alone is 20–30% of legacy's CPU.
Why this hurts more on BSV/teranode than typical aerospike workloads
- Block-driven burst pattern. Each block triggers len(txs) × inputs reads (batch-decorate previous outputs) + len(txs) writes (createUtxos) + existing-tx-count writes (SetMinedMulti merge). For a typical mainnet block this is tens of thousands of records flowing through one parseRecordResults loop.
- High concurrency, single namespace. Many batches from different goroutines all hammer the same per-node stats map.
- Stats are per-record, not per-batch. So a 1024-record batch incurs 1024 atomic ops to the map.
For a typical workload (smaller batches, mixed namespaces, fewer concurrent batchers) this would be hidden by network/disk latency. Teranode's workload exposes it because the network is local (single-node aerospike on the same host) and the batches are huge.
Fix directions
In rough order of difficulty / impact:
- Sample stats instead of recording every record. Update on the first record per batch (or 1-in-N) and extrapolate. Most stats consumers (latency histograms, health monitoring) don't need per-record granularity.
- Sharded counters with periodic aggregation. Replace the global atomic map with one counter per goroutine (or a fixed number of shards) and aggregate on read. With one counter per goroutine this removes the cross-goroutine atomic contention entirely.
- Optionally disable per-record stats for high-throughput paths. Expose a client config flag (DisablePerRecordStats: true); teranode would set it for the legacy/pruner clients.
- Push stats update to the end of the batch loop. If correctness allows, accumulate counts locally and issue a single map update per batch instead of one per record. Cuts atomic ops by a factor of batch_size.
- Audit whether updateOrInsert is using the right structure. If the map only ever has a small bounded set of keys (one per cluster node?), a fixed-size array indexed by node ID would be cheaper than an atomic hashmap.
Verification
Once a candidate fix lands:
- batchCommandOperate.parseRecordResults should drop from ~58% to <10%
- Run with -race to confirm no new correctness issues
Captured profiles (local)
probe/eu2-mainnet-sync-2026-06-01/legacy/legacy-cpu30s.pb.gz
probe/eu3-mainnet-sync-2026-06-01/eu3-prof-1780321364/legacy-cpu30s.pb.gz
Available on request.
Related
- createUtxos chunking (fixed). Without it, batch sizes were even larger and this hotspot would be worse.
Context: where the fork sits
The fork is github.com/bsv-blockchain/aerospike-client-go/v8 at tag v8.7.1-bsv3. The -bsv3 suffix suggests local modifications layered on top of upstream aerospike/aerospike-client-go/v8. Worth confirming whether nodeStats.updateOrInsert was added by the BSV fork or is also in upstream; that determines whether the fix goes here or needs upstream coordination.