Skip to content

Conversation

@petarjuki7
Copy link
Member

Issue Addressed

Addresses Issue #653

Proposed Changes

Adds a few more metrics about blocked peers.
Tracks the number of blocked peers, and how many times a particular blocked peer tried to dial us.

Additional Info

Can add more metrics if we think anything else is particularly useful

@claude-code-actions-sigp
Copy link

claude-code-actions-sigp bot commented Oct 23, 2025

Claude finished @petarjuki7's task —— View job


Code Review Complete

Summary

This PR adds metrics for tracking blocked peers in the network layer. The implementation is straightforward and functionally correct, but there are several considerations around metric design, naming consistency, and observability best practices.

Key Findings

Positive Aspects:

  • Clean integration with existing BlockingManager code
  • Proper metric increment/decrement in block/unblock operations
  • Comprehensive test coverage already exists for the blocking logic
  • Follows existing patterns for metric declaration using LazyLock

Areas for Improvement:

  1. High Cardinality Risk ⚠️ - The PEER_BLOCKED_INBOUND_CONNECTIONS metric uses peer IDs as labels, which creates a new time series for each unique blocked peer. In large networks, this could impact Prometheus performance. Consider whether aggregate metrics would suffice.

  2. Metric Naming Inconsistency - The new blocked_peers_connecting metric doesn't follow the libp2p_ prefix convention used by other metrics in the file.

  3. Semantic Clarity - The blocked_peers_connecting metric name suggests active connections, but it's actually a monotonically increasing counter of connection attempts (only reset on peer unblock). Consider naming it to reflect cumulative attempts rather than current state.

  4. Missing Error Handling - Metric operations don't handle or log errors from the LazyLock<Result<T>> pattern.

  5. Minor Documentation Gaps - Metric descriptions could be more specific about gauge vs counter semantics.

Detailed Review

See inline comments for specific technical details on:

  • Metric increment timing and semantics (blocking.rs:117)
  • High cardinality concerns with peer ID labels (metrics.rs:19)
  • Naming consistency issues (metrics.rs:18)
  • Error handling patterns (blocking.rs:44)
  • Potential race conditions during unblock (blocking.rs:58)
  • Documentation improvements (metrics.rs:12)

Testing

The existing test coverage in blocking.rs is comprehensive (lines 173-414), covering:

  • Basic block/unblock operations
  • Timeout-based unblocking
  • Multiple peer scenarios
  • Edge cases like double blocking

Recommendation: Consider adding a test that verifies the metrics are updated correctly during block/unblock operations.

Pre-Merge Checklist

Before merging, ensure you've run:

make cargo-fmt      # Format code
make lint           # Run clippy
make test           # Run test suite

Recommendations

High Priority:

  1. Consider the cardinality implications of per-peer metrics - evaluate if aggregate metrics meet your observability needs
  2. Decide on metric naming: either make it cumulative (libp2p_blocked_peer_connection_attempts_total) or implement proper increment/decrement for "current" semantics

Medium Priority:
3. Align metric names with the libp2p_ prefix convention
4. Add metric update verification to tests
5. Improve metric descriptions for clarity

Low Priority:
6. Add error logging for metric operations
7. Document the intended semantics of the per-peer connection metric


Note: I cannot run the linter, formatter, or tests directly due to tool access limitations. Please run these commands locally before merging.


"Count of successful handshakes by number of matching subnets",
&["match_count"],
"blocked_peers_connecting",
"Count of blocked peers trying to reconnect",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High Cardinality Concern

Using peer_id as a label creates high cardinality, which can cause performance issues in Prometheus:

  • Each unique blocked peer creates a new time series
  • In a large network, this could result in thousands of time series
  • High cardinality can slow down queries and increase memory usage

Considerations:

  1. Is per-peer granularity necessary? Consider if aggregate metrics (total blocked connection attempts) would suffice
  2. If per-peer tracking is essential, consider:
    • Implementing a limit on tracked peers
    • Using a shorter peer ID representation (first 8 chars)
    • Adding documentation about the cardinality implications
  3. Consider whether this data might be better suited for structured logs rather than metrics

The aggregate PEERS_BLOCKED gauge already tracks the count of blocked peers, which may be sufficient for most monitoring needs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if its an issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant