feat: add blocked peer metrics #708

petarjuki7 · 2025-10-23T13:16:36Z

Issue Addressed

Addresses Issue #653

Proposed Changes

Adds a few more metrics about blocked peers.
Tracks the number of blocked peers, and how many times a particular blocked peer tried to dial us.

Additional Info

Can add more metrics if we think anything else is particularly useful

claude-code-actions-sigp · 2025-10-23T13:16:56Z

Claude finished @petarjuki7's task —— View job

Code Review Complete

Summary

This PR adds metrics for tracking blocked peers in the network layer. The implementation is straightforward and functionally correct, but there are several considerations around metric design, naming consistency, and observability best practices.

Key Findings

Positive Aspects:

Clean integration with existing BlockingManager code
Proper metric increment/decrement in block/unblock operations
Comprehensive test coverage already exists for the blocking logic
Follows existing patterns for metric declaration using LazyLock

Areas for Improvement:

High Cardinality Risk ⚠️ - The PEER_BLOCKED_INBOUND_CONNECTIONS metric uses peer IDs as labels, which creates a new time series for each unique blocked peer. In large networks, this could impact Prometheus performance. Consider whether aggregate metrics would suffice.
Metric Naming Inconsistency - The new blocked_peers_connecting metric doesn't follow the libp2p_ prefix convention used by other metrics in the file.
Semantic Clarity - The blocked_peers_connecting metric name suggests active connections, but it's actually a monotonically increasing counter of connection attempts (only reset on peer unblock). Consider naming it to reflect cumulative attempts rather than current state.
Missing Error Handling - Metric operations don't handle or log errors from the LazyLock<Result<T>> pattern.
Minor Documentation Gaps - Metric descriptions could be more specific about gauge vs counter semantics.

Detailed Review

See inline comments for specific technical details on:

Metric increment timing and semantics (blocking.rs:117)
High cardinality concerns with peer ID labels (metrics.rs:19)
Naming consistency issues (metrics.rs:18)
Error handling patterns (blocking.rs:44)
Potential race conditions during unblock (blocking.rs:58)
Documentation improvements (metrics.rs:12)

Testing

The existing test coverage in blocking.rs is comprehensive (lines 173-414), covering:

Basic block/unblock operations
Timeout-based unblocking
Multiple peer scenarios
Edge cases like double blocking

Recommendation: Consider adding a test that verifies the metrics are updated correctly during block/unblock operations.

Pre-Merge Checklist

Before merging, ensure you've run:

make cargo-fmt      # Format code
make lint           # Run clippy
make test           # Run test suite

Recommendations

High Priority:

Consider the cardinality implications of per-peer metrics - evaluate if aggregate metrics meet your observability needs
Decide on metric naming: either make it cumulative (libp2p_blocked_peer_connection_attempts_total) or implement proper increment/decrement for "current" semantics

Medium Priority:
3. Align metric names with the libp2p_ prefix convention
4. Add metric update verification to tests
5. Improve metric descriptions for clarity

Low Priority:
6. Add error logging for metric operations
7. Document the intended semantics of the per-peer connection metric

Note: I cannot run the linter, formatter, or tests directly due to tool access limitations. Please run these commands locally before merging.

anchor/network/src/peer_manager/blocking.rs

anchor/network/src/metrics.rs

claude-code-actions-sigp · 2025-10-23T13:19:06Z

anchor/network/src/metrics.rs

-        "Count of successful handshakes by number of matching subnets",
-        &["match_count"],
+        "blocked_peers_connecting",
+        "Count of blocked peers trying to reconnect",


High Cardinality Concern

Using peer_id as a label creates high cardinality, which can cause performance issues in Prometheus:

Each unique blocked peer creates a new time series

In a large network, this could result in thousands of time series

High cardinality can slow down queries and increase memory usage

Considerations:

Is per-peer granularity necessary? Consider if aggregate metrics (total blocked connection attempts) would suffice

If per-peer tracking is essential, consider:

Implementing a limit on tracked peers

Using a shorter peer ID representation (first 8 chars)

Adding documentation about the cardinality implications

Consider whether this data might be better suited for structured logs rather than metrics

The aggregate PEERS_BLOCKED gauge already tracks the count of blocked peers, which may be sufficient for most monitoring needs.

Not sure if its an issue.

anchor/network/src/peer_manager/blocking.rs

anchor/network/src/metrics.rs

petarjuki7 requested review from AgeManning, diegomrsantos and dknopik October 23, 2025 13:16

petarjuki7 self-assigned this Oct 23, 2025