Problem Statement
OJP currently implements a built-in circuit breaker feature that protects the system by blocking failing queries after a threshold is reached. However, there are no dedicated metrics exposed for monitoring circuit breaker state changes and behavior, making it difficult for operators to:
- Detect when circuits are tripping in production
- Understand the frequency and patterns of circuit breaker activations
- Correlate circuit breaker events with application performance issues
- Set up alerting for circuit breaker state transitions
Current State
- Circuit breaker functionality exists in ojp-server (as documented in presentation slides showing 5 circuit breaker states)
- OpenTelemetry instrumentation is already integrated (ADR-005, telemetry documentation)
- Prometheus metrics endpoint is exposed on port 9159
- No circuit breaker-specific metrics are currently collected or exposed
Proposed Solution
Add circuit breaker metrics to the existing OpenTelemetry instrumentation:
Metrics to Add:
- Circuit State Gauge (ojp.circuit_breaker.state)
  - Type: Gauge (0=closed, 1=open, 2=half-open)
  - Labels: query_hash, datasource
  - Tracks the current state of each monitored query's circuit
- Circuit Transitions Counter (ojp.circuit_breaker.transitions.total)
  - Type: Counter
  - Labels: from_state, to_state, query_hash, datasource
  - Increments on every state transition (e.g., closed→open, open→half-open)
- Circuit Trips Counter (ojp.circuit_breaker.trips.total)
  - Type: Counter
  - Labels: query_hash, datasource, reason
  - Increments each time a circuit opens due to failures
- Time in Open State (ojp.circuit_breaker.open_duration.seconds)
  - Type: Histogram
  - Labels: query_hash, datasource
  - Records how long circuits remain open before attempting recovery
- Failed Attempts While Open (ojp.circuit_breaker.blocked_calls.total)
  - Type: Counter
  - Labels: query_hash, datasource
  - Counts requests blocked due to open circuit
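To make the intended semantics concrete, here is a minimal, dependency-free sketch of how a single state transition could feed the five metrics listed above. The `InMemoryCircuitMetrics` class, its method names, and the `failure_threshold` reason value are hypothetical; the in-memory maps only stand in for the real OpenTelemetry instruments, and the sketch assumes that any transition into the open state (including half-open→open) counts as a trip.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Numeric encoding proposed for the ojp.circuit_breaker.state gauge. */
enum CircuitState {
    CLOSED(0), OPEN(1), HALF_OPEN(2);

    final int gaugeValue;

    CircuitState(int gaugeValue) { this.gaugeValue = gaugeValue; }

    /** Label value for from_state / to_state, e.g. "half-open". */
    String label() { return name().toLowerCase(Locale.ROOT).replace('_', '-'); }
}

/** Hypothetical single-threaded stand-in for the OpenTelemetry instruments. */
class InMemoryCircuitMetrics {
    final Map<String, Long> counters = new HashMap<>();
    final Map<String, Integer> stateGauge = new HashMap<>();
    final List<Double> openDurationsSeconds = new ArrayList<>();
    private final Map<String, Long> openedAtNanos = new HashMap<>();

    /** Label set shared by every metric: query_hash and datasource. */
    static String circuitKey(String queryHash, String datasource) {
        return "query_hash=" + queryHash + ",datasource=" + datasource;
    }

    long count(String series) { return counters.getOrDefault(series, 0L); }

    private void increment(String series) { counters.merge(series, 1L, Long::sum); }

    void onTransition(String queryHash, String datasource,
                      CircuitState from, CircuitState to, String reason, long nowNanos) {
        String key = circuitKey(queryHash, datasource);
        stateGauge.put(key, to.gaugeValue);
        increment("ojp.circuit_breaker.transitions.total{from_state=" + from.label()
                + ",to_state=" + to.label() + "," + key + "}");
        if (to == CircuitState.OPEN) {
            // Assumption: every transition into OPEN is a trip.
            increment("ojp.circuit_breaker.trips.total{" + key + ",reason=" + reason + "}");
            openedAtNanos.put(key, nowNanos);
        } else if (from == CircuitState.OPEN) {
            // Leaving OPEN: record how long the circuit stayed open, in seconds.
            Long openedAt = openedAtNanos.remove(key);
            if (openedAt != null) {
                openDurationsSeconds.add((nowNanos - openedAt) / 1e9);
            }
        }
    }

    void onBlockedCall(String queryHash, String datasource) {
        increment("ojp.circuit_breaker.blocked_calls.total{"
                + circuitKey(queryHash, datasource) + "}");
    }
}

class CircuitMetricsDemo {
    public static void main(String[] args) {
        InMemoryCircuitMetrics metrics = new InMemoryCircuitMetrics();
        long start = System.nanoTime();
        metrics.onTransition("q1", "ds1", CircuitState.CLOSED, CircuitState.OPEN,
                "failure_threshold", start);
        metrics.onBlockedCall("q1", "ds1");
        metrics.onTransition("q1", "ds1", CircuitState.OPEN, CircuitState.HALF_OPEN,
                "", start + 2_000_000_000L);
        metrics.counters.forEach((series, value) -> System.out.println(series + " " + value));
    }
}
```

In a real implementation each `increment` call would presumably become an `add(1, attributes)` on a counter, and the duration would be recorded on a histogram; the point here is only which event updates which metric.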
Benefits
- Operational Visibility: Real-time insight into circuit breaker behavior
- Proactive Monitoring: Set up alerts for circuit trips to catch issues early
- Performance Analysis: Correlate circuit breaker activity with query performance
- Capacity Planning: Understand which queries are problematic and need optimization
- Debugging: Easier troubleshooting of application resilience patterns
Implementation Notes
- Leverage existing OpenTelemetry integration (already using io.opentelemetry:opentelemetry-api)
- Metrics should be exposed via the existing Prometheus endpoint (:9159/metrics)
- Should respect the ojp.opentelemetry.enabled configuration flag
- Add example Grafana dashboard queries in documentation
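One way to respect the ojp.opentelemetry.enabled flag is to pick a no-op recorder when telemetry is off, so the circuit breaker code never has to check the flag itself. The sketch below is only an illustration: `BlockedCallRecorder`, `CircuitMetricsFactory`, and the use of `java.util.Properties` as the configuration source are hypothetical, and treating a missing value as disabled is an assumption of this sketch, not a statement about OJP's actual default.

```java
import java.util.Properties;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical recorder for ojp.circuit_breaker.blocked_calls.total. */
interface BlockedCallRecorder {
    void recordBlockedCall(String queryHash, String datasource);

    long total();
}

/** Used when telemetry is disabled: accepts calls and records nothing. */
final class NoopBlockedCallRecorder implements BlockedCallRecorder {
    @Override
    public void recordBlockedCall(String queryHash, String datasource) { }

    @Override
    public long total() { return 0; }
}

/** Used when telemetry is enabled; a plain counter stands in for the real instrument. */
final class CountingBlockedCallRecorder implements BlockedCallRecorder {
    private final AtomicLong blocked = new AtomicLong();

    @Override
    public void recordBlockedCall(String queryHash, String datasource) {
        blocked.incrementAndGet();
    }

    @Override
    public long total() { return blocked.get(); }
}

final class CircuitMetricsFactory {
    static final String ENABLED_KEY = "ojp.opentelemetry.enabled";

    private CircuitMetricsFactory() { }

    /** Assumption: a missing or non-"true" value means telemetry is disabled. */
    static BlockedCallRecorder create(Properties config) {
        boolean enabled = Boolean.parseBoolean(config.getProperty(ENABLED_KEY));
        return enabled ? new CountingBlockedCallRecorder() : new NoopBlockedCallRecorder();
    }
}

class FlagGatingDemo {
    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty(CircuitMetricsFactory.ENABLED_KEY, "true");
        BlockedCallRecorder recorder = CircuitMetricsFactory.create(config);
        recorder.recordBlockedCall("q1", "ds1");
        System.out.println("blocked calls recorded: " + recorder.total());
    }
}
```

Choosing the implementation once at startup keeps the hot path free of flag checks, which matters because blocked calls can arrive at a high rate while a circuit is open.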
Configuration
No new configuration properties are required; the metrics will be collected automatically whenever OpenTelemetry is enabled.
Documentation Updates Needed
- Update documents/telemetry/README.md with new circuit breaker metrics
- Add example Prometheus queries for common monitoring scenarios
- Include Grafana dashboard snippet for circuit breaker visualization
Acceptance Criteria
Related Documentation
- [Telemetry Documentation](documents/telemetry/README.md)
- [ADR-005: Use OpenTelemetry](documents/ADRs/adr-005-use-opentelemetry.md)
- [Circuit Breaker Presentation Slides](Proxy Power_ Boosting Java App Performance with Open J Proxy.pdf) - Slides 12-16