Export RocksDB internal metrics via caller-driven…#3296

Open
awatts73 wants to merge 2 commits into staging from feature/rocksdb-metrics

awatts73 commented Jun 8, 2026

Motivation

RocksDB internals are currently invisible to operators. This became apparent during a recent incident where a major compaction caused a ~1 TB disk spike over two hours — with no metrics to explain what was happening in real time.

This PR adds the key RocksDB properties to Prometheus, gated behind the existing metrics feature with zero overhead when disabled.

What's added

Compaction pressure — early warning for write stalls:

| Metric | Description |
| --- | --- |
| `snarkvm_rocksdb_compaction_pending` | 1 if compaction is queued but not yet running |
| `snarkvm_rocksdb_estimate_pending_compaction_bytes` | Bytes waiting to be compacted; a rising trend signals backpressure before a stall hits |
| `snarkvm_rocksdb_num_running_compactions` | Active background compaction threads |
| `snarkvm_rocksdb_num_running_flushes` | Active memtable flushes |
| `snarkvm_rocksdb_mem_table_flush_pending` | 1 if a flush is queued but not yet started |

Disk footprint — explains why disk grows unexpectedly:

| Metric | Description |
| --- | --- |
| `snarkvm_rocksdb_total_sst_files_size_bytes` | All SST files on disk, including those pending deletion |
| `snarkvm_rocksdb_live_sst_files_size_bytes` | Live (referenced) SST files only |

The gap between total and live is disk held by open snapshots or checkpoints blocking file deletion. This gap is what caused the recent disk spike.
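
As a rough illustration of how a dashboard or alert could derive that gap from the two gauges, here is a minimal sketch. The function names are hypothetical and not part of this PR; the only assumption is that both gauge values are available as `u64` byte counts.

```rust
/// Bytes held by SST files that are still on disk but no longer live.
/// `total_sst_bytes` and `live_sst_bytes` are assumed to be the values of
/// the two gauges described above (hypothetical helper, not from the PR).
pub fn obsolete_sst_bytes(total_sst_bytes: u64, live_sst_bytes: u64) -> u64 {
    // The two gauges are sampled separately, so a transient reading where
    // live exceeds total is clamped to zero instead of underflowing.
    total_sst_bytes.saturating_sub(live_sst_bytes)
}

/// Share of on-disk SST bytes that is not live, in the range 0.0..=1.0.
pub fn obsolete_fraction(total_sst_bytes: u64, live_sst_bytes: u64) -> f64 {
    if total_sst_bytes == 0 {
        return 0.0;
    }
    obsolete_sst_bytes(total_sst_bytes, live_sst_bytes) as f64 / total_sst_bytes as f64
}
```

A sustained, growing value here, together with a non-zero snapshot count, would point at snapshots or checkpoints pinning old files.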

General state:

| Metric | Description |
| --- | --- |
| `snarkvm_rocksdb_num_snapshots` | Open snapshots; non-zero is what causes the total - live gap |
| `snarkvm_rocksdb_estimate_num_keys` | Estimated key count |

Per-level SST file counts (snarkvm_rocksdb_num_files_at_level0 through snarkvm_rocksdb_num_files_at_level6):

L0 is the critical one: writes land here first. When the L0 file count climbs, RocksDB throttles writes and eventually stalls them. Levels 1–6 confirm that compaction is draining the LSM tree normally.
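
To make the L0 reading concrete, the sketch below classifies an L0 file count against RocksDB's three L0 triggers. The default values used (4, 20, 36) are, to my understanding, the upstream RocksDB defaults for `level0_file_num_compaction_trigger`, `level0_slowdown_writes_trigger`, and `level0_stop_writes_trigger`; the options this node actually runs with may differ, and the type and function names are hypothetical.

```rust
/// Coarse interpretation of the L0 file-count gauge (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum L0Pressure {
    Normal,
    CompactionDue,
    Throttled,
    Stalled,
}

/// L0 triggers. The defaults below are assumed to match upstream RocksDB;
/// a node with tuned options should pass its own values.
pub struct L0Triggers {
    pub compaction: u64,
    pub slowdown: u64,
    pub stop: u64,
}

impl Default for L0Triggers {
    fn default() -> Self {
        Self { compaction: 4, slowdown: 20, stop: 36 }
    }
}

/// Maps an L0 file count to the most severe trigger it has reached.
pub fn classify_l0(files_at_l0: u64, triggers: &L0Triggers) -> L0Pressure {
    if files_at_l0 >= triggers.stop {
        L0Pressure::Stalled
    } else if files_at_l0 >= triggers.slowdown {
        L0Pressure::Throttled
    } else if files_at_l0 >= triggers.compaction {
        L0Pressure::CompactionDue
    } else {
        L0Pressure::Normal
    }
}
```

An alert on the `Throttled` band would fire before writes stop entirely.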

Implementation

All values come from RocksDB's in-memory property counters (DB::property_int_value) — no disk I/O, no Statistics overhead.
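
The shape of such a property read can be sketched without the real database. Everything below is an illustration, not the PR's code: the `IntProperties` trait is a hypothetical stand-in for the database handle (the real `property_int_value` in the `rocksdb` crate returns a `Result` wrapping an `Option<u64>`), and the pairing of RocksDB property names with gauge names is my assumption about how the metrics above are wired.

```rust
use std::collections::BTreeMap;

/// Hypothetical abstraction over a handle that can answer integer
/// property lookups; simplified to `Option<u64>` for the sketch.
pub trait IntProperties {
    fn property_int_value(&self, name: &str) -> Option<u64>;
}

/// Assumed (RocksDB property, exported gauge) pairs; a subset for brevity.
pub const GAUGES: &[(&str, &str)] = &[
    ("rocksdb.compaction-pending", "snarkvm_rocksdb_compaction_pending"),
    ("rocksdb.estimate-pending-compaction-bytes", "snarkvm_rocksdb_estimate_pending_compaction_bytes"),
    ("rocksdb.total-sst-files-size", "snarkvm_rocksdb_total_sst_files_size_bytes"),
    ("rocksdb.live-sst-files-size", "snarkvm_rocksdb_live_sst_files_size_bytes"),
    ("rocksdb.num-snapshots", "snarkvm_rocksdb_num_snapshots"),
];

/// Reads every known property once and returns gauge name -> value.
/// Properties the handle cannot answer are skipped rather than zeroed.
pub fn collect_gauges<P: IntProperties>(db: &P) -> BTreeMap<&'static str, u64> {
    GAUGES
        .iter()
        .filter_map(|&(property, gauge)| db.property_int_value(property).map(|v| (gauge, v)))
        .collect()
}

/// In-memory fake used to exercise the sketch without a database.
pub struct FakeDb(pub BTreeMap<String, u64>);

impl IntProperties for FakeDb {
    fn property_int_value(&self, name: &str) -> Option<u64> {
        self.0.get(name).copied()
    }
}
```

Skipping unanswered properties keeps a missing value distinguishable from a genuine zero on the dashboard.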

Rather than spawning a background thread inside library code, RocksDB::export_rocksdb_metrics() is a plain public method. The caller decides when to poll. The delegation chain follows the same pattern as backup_database:

RocksDB::export_rocksdb_metrics()
  ↑ InnerDataMap::export_rocksdb_metrics()
    ↑ BlockDB::export_rocksdb_metrics()
      ↑ BlockStore<N, BlockDB<N>>::export_rocksdb_metrics()   ← called by snarkOS
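
The chain above amounts to each layer forwarding the call to the one it wraps. A stripped-down sketch, with hypothetical field names, no network generic `N`, and a counter standing in for the real property reads:

```rust
use std::cell::Cell;

/// Stand-in for the RocksDB wrapper: counts exports instead of reading
/// real properties (illustrative only).
pub struct RocksDb {
    exports: Cell<u32>,
}

impl RocksDb {
    pub fn export_rocksdb_metrics(&self) {
        self.exports.set(self.exports.get() + 1);
    }
}

/// Each outer layer simply forwards to the layer it owns.
pub struct InnerDataMap {
    database: RocksDb,
}

impl InnerDataMap {
    pub fn export_rocksdb_metrics(&self) {
        self.database.export_rocksdb_metrics();
    }
}

pub struct BlockDb {
    id_map: InnerDataMap,
}

impl BlockDb {
    pub fn export_rocksdb_metrics(&self) {
        self.id_map.export_rocksdb_metrics();
    }
}

pub struct BlockStore {
    storage: BlockDb,
}

impl BlockStore {
    pub fn new() -> Self {
        let database = RocksDb { exports: Cell::new(0) };
        Self { storage: BlockDb { id_map: InnerDataMap { database } } }
    }

    /// Entry point the node would call; no thread is spawned anywhere.
    pub fn export_rocksdb_metrics(&self) {
        self.storage.export_rocksdb_metrics();
    }

    /// Test-only accessor for the sketch.
    pub fn export_count(&self) -> u32 {
        self.storage.id_map.database.exports.get()
    }
}
```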

Feature propagation: snarkvm/metrics → snarkvm-ledger/metrics → snarkvm-ledger-store/metrics.

A companion snarkOS PR calls export_rocksdb_metrics() every ~15 s from the existing auto-checkpoint polling loop (no new thread).
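
For readers wondering how an existing loop can drive the export without a dedicated thread, one possible shape is a small "is it due yet" check evaluated on each loop iteration. This is a sketch under assumptions: `MetricsTicker` is a hypothetical helper, and the companion PR may well do this differently.

```rust
use std::time::{Duration, Instant};

/// Tracks when metrics were last exported so that a loop running at its
/// own cadence can export roughly once per `interval` (hypothetical).
pub struct MetricsTicker {
    interval: Duration,
    last_export: Option<Instant>,
}

impl MetricsTicker {
    pub fn new(interval: Duration) -> Self {
        Self { interval, last_export: None }
    }

    /// Returns true if an export is due at `now` and records `now` as the
    /// last export time. The first call is always due.
    pub fn is_due(&mut self, now: Instant) -> bool {
        match self.last_export {
            Some(last) if now.duration_since(last) < self.interval => false,
            _ => {
                self.last_export = Some(now);
                true
            }
        }
    }
}
```

Inside the polling loop the caller would do something like `if ticker.is_due(Instant::now()) { store.export_rocksdb_metrics(); }`, where `store` is whatever handle exposes the export method.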

Test Plan

cargo check -p snarkvm-ledger-store --features rocks,metrics

… polling

Adds RocksDB property-based metrics (compaction pressure, SST file sizes,
snapshot count, per-level file counts) gated behind the existing `metrics`
feature. No background threads: a new `export_rocksdb_metrics()` public
method on `RocksDB`, `BlockDB`, and `BlockStore<N, BlockDB<N>>` lets the
caller decide when to poll (e.g. the auto-checkpoint loop in snarkOS).

Propagates `snarkvm-ledger-store/metrics` through `snarkvm-ledger/metrics`
so the feature chain is complete when snarkOS enables `snarkvm/metrics`.

raychu86 commented Jun 8, 2026

Collaborator

I would want this to be its own feature flag for now.

This would allow us to test this feature properly, before we set it to be part of the default metrics feature.

vicsn commented Jun 9, 2026

Collaborator

Alternatively, out of an abundance of caution, roll this out on a testnet client and validator before merging, since we are not yet able to simulate such large ledgers in our test environment.

I don't think mainnet state will be materially different, so testnet should be sufficient.

vicsn commented Jun 9, 2026

Collaborator

Most of the failing CI was spurious; I pushed an fmt fix.

vicsn commented Jun 9, 2026

Collaborator

Testnet-based snarkOS branch for you to deploy and test: https://github.com/ProvableHQ/snarkOS/tree/update_snarkvm_rocksdb_metrics
