Add Circuit Breaker metrics to OpenTelemetry instrumentation #150

@jether2011

Problem Statement

OJP currently implements a built-in circuit breaker feature that protects the system by blocking failing queries after a threshold is reached. However, there are no dedicated metrics exposed for monitoring circuit breaker state changes and behavior, making it difficult for operators to:

  • Detect when circuits are tripping in production
  • Understand the frequency and patterns of circuit breaker activations
  • Correlate circuit breaker events with application performance issues
  • Set up alerting for circuit breaker state transitions

Current State

  • Circuit breaker functionality exists in ojp-server (as documented in presentation slides showing 5 circuit breaker states)
  • OpenTelemetry instrumentation is already integrated (ADR-005, telemetry documentation)
  • Prometheus metrics endpoint is exposed on port 9159
  • No circuit breaker-specific metrics are currently collected or exposed

Proposed Solution

Add circuit breaker metrics to the existing OpenTelemetry instrumentation:

Metrics to Add:

  1. Circuit State Gauge (ojp.circuit_breaker.state)
    • Type: Gauge (0=closed, 1=open, 2=half-open)
    • Labels: query_hash, datasource
    • Tracks the current state of each monitored query's circuit

  2. Circuit Transitions Counter (ojp.circuit_breaker.transitions.total)
    • Type: Counter
    • Labels: from_state, to_state, query_hash, datasource
    • Increments on every state transition (e.g., closed→open, open→half-open)

  3. Circuit Trips Counter (ojp.circuit_breaker.trips.total)
    • Type: Counter
    • Labels: query_hash, datasource, reason
    • Increments each time a circuit opens due to failures

  4. Time in Open State (ojp.circuit_breaker.open_duration.seconds)
    • Type: Histogram
    • Labels: query_hash, datasource
    • Records how long circuits remain open before attempting recovery

  5. Failed Attempts While Open (ojp.circuit_breaker.blocked_calls.total)
    • Type: Counter
    • Labels: query_hash, datasource
    • Counts requests blocked due to open circuit
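
To make the intended semantics of the five metrics concrete, here is a minimal sketch of what the circuit breaker would record at each event. It uses only the JDK, with in-memory maps standing in for the OpenTelemetry instruments, so it is not the proposed implementation. The class and method names (`CircuitBreakerMetricsSketch`, `onTransition`, `onBlockedCall`), the series-key format, and the `reason` value used in the usage note are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Circuit states with the numeric encoding proposed for ojp.circuit_breaker.state. */
enum CircuitState {
    CLOSED(0), OPEN(1), HALF_OPEN(2);

    final int gaugeValue;

    CircuitState(int gaugeValue) {
        this.gaugeValue = gaugeValue;
    }
}

/** Hypothetical in-memory stand-in for the OpenTelemetry instruments. */
class CircuitBreakerMetricsSketch {
    // Counter series keyed by "metric{labels}"; LongAdder keeps increments cheap under contention.
    final Map<String, LongAdder> counters = new ConcurrentHashMap<>();
    // Latest gauge value per query_hash/datasource pair.
    final Map<String, Integer> stateGauge = new ConcurrentHashMap<>();
    // Nanosecond timestamp of when each circuit last opened.
    private final Map<String, Long> openedAtNanos = new ConcurrentHashMap<>();
    // Last observed open duration in seconds; a real histogram would bucket these values.
    final Map<String, Double> lastOpenDurationSeconds = new ConcurrentHashMap<>();

    private void increment(String series) {
        counters.computeIfAbsent(series, k -> new LongAdder()).increment();
    }

    long count(String series) {
        LongAdder adder = counters.get(series);
        return adder == null ? 0 : adder.sum();
    }

    /** Called on every state change; nowNanos is a parameter so the sketch is easy to test. */
    void onTransition(String queryHash, String datasource,
                      CircuitState from, CircuitState to, String reason, long nowNanos) {
        String key = queryHash + "," + datasource;
        stateGauge.put(key, to.gaugeValue);                            // state gauge
        increment("transitions{" + key + "," + from + "," + to + "}"); // transitions counter
        if (to == CircuitState.OPEN) {
            increment("trips{" + key + "," + reason + "}");            // trips counter
            openedAtNanos.put(key, nowNanos);
        } else if (from == CircuitState.OPEN) {
            Long openedAt = openedAtNanos.remove(key);
            if (openedAt != null) {                                    // open-duration histogram
                lastOpenDurationSeconds.put(key, (nowNanos - openedAt) / 1_000_000_000.0);
            }
        }
    }

    /** Called when a request is rejected because the circuit is open. */
    void onBlockedCall(String queryHash, String datasource) {
        increment("blocked_calls{" + queryHash + "," + datasource + "}");
    }
}
```

For example, calling `onTransition("q1", "db", CircuitState.CLOSED, CircuitState.OPEN, "failure_threshold", 0L)` sets the gauge for `q1,db` to 1 and increments both the transitions and trips series. A later open→half-open transition records the elapsed open time.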

Benefits

  • Operational Visibility: Real-time insight into circuit breaker behavior
  • Proactive Monitoring: Set up alerts for circuit trips to catch issues early
  • Performance Analysis: Correlate circuit breaker activity with query performance
  • Capacity Planning: Understand which queries are problematic and need optimization
  • Debugging: Easier troubleshooting of application resilience patterns

Implementation Notes

  • Reuse the existing OpenTelemetry integration (the project already uses io.opentelemetry:opentelemetry-api)
  • Expose the metrics via the existing Prometheus endpoint (:9159/metrics)
  • Respect the ojp.opentelemetry.enabled configuration flag
  • Add example Grafana dashboard queries to the documentation
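
One way to honor the ojp.opentelemetry.enabled flag without adding a check to every circuit breaker call is to choose a no-op recorder once at startup. The sketch below illustrates that pattern under assumptions: the `CircuitBreakerMetrics` interface, its `create` factory, and `CountingMetrics` are hypothetical names, and how OJP actually reads the flag is not shown.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical recording interface the circuit breaker would call. */
interface CircuitBreakerMetrics {
    void recordTrip(String queryHash, String datasource);

    /** Shared do-nothing instance used when telemetry is disabled. */
    CircuitBreakerMetrics NOOP = (queryHash, datasource) -> { };

    /** Picks the implementation once, so the hot path carries no flag check. */
    static CircuitBreakerMetrics create(boolean telemetryEnabled, CircuitBreakerMetrics real) {
        return telemetryEnabled ? real : NOOP;
    }
}

/** Stand-in for an OpenTelemetry-backed implementation; it only counts calls. */
class CountingMetrics implements CircuitBreakerMetrics {
    final AtomicLong trips = new AtomicLong();

    @Override
    public void recordTrip(String queryHash, String datasource) {
        trips.incrementAndGet();
    }
}
```

With `create(false, real)` the breaker receives `NOOP` and `recordTrip` does nothing. With `create(true, real)` the same call reaches the real recorder.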

Configuration

No new configuration properties are required; the metrics will be collected automatically whenever OpenTelemetry is enabled.

Documentation Updates Needed

  • Update documents/telemetry/README.md with new circuit breaker metrics
  • Add example Prometheus queries for common monitoring scenarios
  • Include Grafana dashboard snippet for circuit breaker visualization
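
Prometheus queries in the documentation will need the Prometheus form of each metric name, since Prometheus metric names may only contain letters, digits, underscores, and colons. As a rough guide, assuming the exporter replaces every disallowed character with an underscore, the mapping can be sketched as follows. `PrometheusNames.sanitize` is a hypothetical helper, not part of OJP or OpenTelemetry. Exporters may also append unit or type suffixes, so the final names should be confirmed against the actual :9159/metrics output.

```java
/** Hypothetical helper that approximates how an OpenTelemetry name appears in Prometheus. */
class PrometheusNames {
    static String sanitize(String otelName) {
        // Characters outside [a-zA-Z0-9_:] are not valid in Prometheus metric names.
        return otelName.replaceAll("[^a-zA-Z0-9_:]", "_");
    }
}
```

Under that assumption, `PrometheusNames.sanitize("ojp.circuit_breaker.state")` yields `ojp_circuit_breaker_state`.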

Acceptance Criteria

  • Circuit breaker state changes emit metrics
  • Metrics are exposed via Prometheus endpoint
  • Metrics include all relevant labels (query_hash, datasource)
  • Documentation updated with metric descriptions and examples
  • Example Grafana queries provided
  • Metrics respect ojp.opentelemetry.enabled flag
  • No performance regression in circuit breaker operations

Related Documentation

  • [Telemetry Documentation](documents/telemetry/README.md)
  • [ADR-005: Use OpenTelemetry](documents/ADRs/adr-005-use-opentelemetry.md)
  • [Circuit Breaker Presentation Slides](Proxy Power_ Boosting Java App Performance with Open J Proxy.pdf) - Slides 12-16
