refactor: standardize Prometheus metric naming conventions (part 1) #3035
Walkthrough
Renames inflight metrics to use an _total suffix across frontend and work handler; adds a new CLIENT_DISCONNECTS_TOTAL metric for frontend. Updates code to use the new constants and registers the new metric. Adjusts documentation and tests to reflect renamed metrics and added metrics.
Pre-merge checks
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 5
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
deploy/metrics/README.md (2 hunks)
lib/llm/src/http/service/metrics.rs (2 hunks)
lib/runtime/examples/system_metrics/README.md (2 hunks)
lib/runtime/examples/system_metrics/tests/integration_test.rs (2 hunks)
lib/runtime/src/metrics.rs (1 hunks)
lib/runtime/src/metrics/prometheus_names.rs (2 hunks)
lib/runtime/src/pipeline/network/ingress/push_handler.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-11T03:24:47.746Z
Learnt from: kthui
PR: ai-dynamo/dynamo#3004
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:271-277
Timestamp: 2025-09-11T03:24:47.746Z
Learning: In lib/runtime/src/pipeline/network/ingress/push_handler.rs, the maintainer prefers to keep the existing error comparison logic using format!("{:?}", err) == STREAM_ERR_MSG unchanged until proper error types are implemented, even though it has technical debt. Avoid suggesting changes to working legacy code that will be refactored later.
Applied to files:
lib/runtime/src/pipeline/network/ingress/push_handler.rs
🧬 Code graph analysis (1)
lib/runtime/src/metrics.rs (1)
lib/runtime/src/metrics/prometheus_names.rs (1): build_component_metric_name (390-394)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Build and Test - sglang
- GitHub Check: Build and Test - vllm
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
🔇 Additional comments (2)
lib/runtime/examples/system_metrics/tests/integration_test.rs (1)
121-123: Comments reference inflight_requests_total; keep consistent with final naming decision.
If we revert to a gauge named inflight_requests, update these comment references accordingly.
Also applies to: 146-147
lib/runtime/src/metrics.rs (1)
1634-1637: Post-activity check uses INFLIGHT_REQUESTS_TOTAL; adjust if inflight gauge is renamed.
Update to work_handler::INFLIGHT_REQUESTS if you adopt the gauge name fix.
- build_component_metric_name(work_handler::INFLIGHT_REQUESTS_TOTAL),
+ build_component_metric_name(work_handler::INFLIGHT_REQUESTS),
Code Rabbit making some interesting points.
191b201 to 6298aff
6298aff to 2d4ecf4
- Rename connections_total to current_connections (gauge for active connections)
- Rename client_disconnects_total to disconnected_clients_total (better ordering)
- Rename PROCESSING_TIME_MS_TOTAL to PROCESSING_MS_TOTAL (more concise)
- Apply unit_aggregation pattern: AVG_PROCESSING_MS -> PROCESSING_MS_AVG
- Sync ComponentNatsServerPrometheusMetrics variable names with metric constants
- Update documentation with comprehensive naming transformation rules
- Add units _messages and _connections to naming conventions
- Update all code references, documentation, and test comments consistently

These changes follow Prometheus best practices by distinguishing gauge vs counter metrics and using consistent {unit}_{aggregation} naming patterns.

Signed-off-by: Keiven Chang <[email protected]>
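As a rough illustration of the {unit}_{aggregation} pattern named in this commit, the rename AVG_PROCESSING_MS -> PROCESSING_MS_AVG can be sketched as moving a leading aggregation token to the end of the name. The helper `to_unit_aggregation` and its list of aggregation tokens are hypothetical and are not taken from the PR; the actual change renames the constants by hand.

```rust
/// Hypothetical helper (not part of the PR): move a leading aggregation
/// token to the end so the name reads {unit}_{aggregation}.
fn to_unit_aggregation(name: &str) -> String {
    // Aggregation tokens assumed for illustration only.
    const AGGREGATIONS: [&str; 4] = ["AVG", "MIN", "MAX", "SUM"];

    match name.split_once('_') {
        // Only rewrite when the first token is a known aggregation.
        Some((head, rest)) if AGGREGATIONS.contains(&head) && !rest.is_empty() => {
            format!("{}_{}", rest, head)
        }
        // Anything else is returned unchanged.
        _ => name.to_string(),
    }
}

fn main() {
    println!("{}", to_unit_aggregation("AVG_PROCESSING_MS"));
    println!("{}", to_unit_aggregation("PROCESSING_MS_TOTAL"));
}
```

Names that already end with their aggregation, such as PROCESSING_MS_TOTAL, pass through untouched.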
2d4ecf4 to 7a707c7
- Revert INFLIGHT_REQUESTS back to INFLIGHT_REQUESTS_TOTAL in prometheus_names.rs
- Update all code references to use INFLIGHT_REQUESTS_TOTAL constant
- Update documentation to reflect inflight_requests_total metric names
- Maintain consistency across frontend and work handler metrics
- Fix compilation errors after constant name changes

Signed-off-by: Keiven Chang <[email protected]>
- Update conventions to clarify gauges (up/down metrics) should not have _total suffix
- Revert INFLIGHT_REQUESTS_TOTAL back to INFLIGHT_REQUESTS in prometheus_names.rs
- Add comments explaining gauge vs counter distinction for inflight metrics
- Update all code references to use INFLIGHT_REQUESTS constant
- Update documentation to use inflight_requests without _total suffix
- Maintain consistency: counters use _total, gauges do not

Gauges measure current state (can go up/down), counters measure cumulative totals (only increase).

Signed-off-by: Keiven Chang <[email protected]>
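The suffix rule stated in this commit (counters use _total, gauges do not) can be sketched as a small standard-library check. `MetricKind` and `follows_suffix_rule` are hypothetical names used only for illustration; the PR itself applies the rule by renaming constants in prometheus_names.rs, not through a validator like this.

```rust
/// Hypothetical classification of a metric, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum MetricKind {
    /// Cumulative total that only increases.
    Counter,
    /// Current state that can go up or down.
    Gauge,
}

/// Returns true when `name` follows the convention from the commit:
/// counters end in `_total`, gauges do not.
fn follows_suffix_rule(kind: MetricKind, name: &str) -> bool {
    let has_total = name.ends_with("_total");
    match kind {
        MetricKind::Counter => has_total,
        MetricKind::Gauge => !has_total,
    }
}

fn main() {
    // inflight_requests is a gauge, so the _total suffix is dropped.
    println!("{}", follows_suffix_rule(MetricKind::Gauge, "inflight_requests"));
    println!("{}", follows_suffix_rule(MetricKind::Gauge, "inflight_requests_total"));
    // Illustrative counter name; the exact string value is an assumption.
    println!("{}", follows_suffix_rule(MetricKind::Counter, "processing_ms_total"));
}
```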
… metric)

- Update DISCONNECTED_CLIENTS_TOTAL to DISCONNECTED_CLIENTS in prometheus_names.rs
- Remove _total suffix since this is a gauge metric (current state) not a counter
- Update documentation to clarify this tracks current disconnected clients count
- Gauges measure current state that can go up/down, counters measure cumulative totals
- Keep implementation as IntGauge since the metric can increase/decrease over time

Signed-off-by: Keiven Chang <[email protected]>
- Update QUEUED_REQUESTS constant from queued_requests_total to queued_requests
- Remove _total suffix since this tracks current queue size (gauge), not cumulative total
- Update documentation in deploy/metrics/README.md with new metric name
- Update example curl commands to show correct metric name
- Gauges measure current state that can go up/down, not cumulative totals

Signed-off-by: Keiven Chang <[email protected]>
…ments

- Add note to ComponentNatsServerPrometheusMetrics explaining why _total metrics use IntGauge
- Clarify that we copy counter values from underlying services rather than increment directly
- Update test comments to use inflight_requests instead of inflight_requests_total
- This explains why Counter-named metrics are implemented as Gauges (need set() method)

Signed-off-by: Keiven Chang <[email protected]>
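To make the point about set() concrete, here is a minimal standard-library sketch of mirroring a cumulative total from another source. `MirroredTotal` is a hypothetical stand-in for a gauge-like metric and is not the type used in ComponentNatsServerPrometheusMetrics; the snapshot values are made up for the example.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for a gauge-like metric that exposes `set()`.
/// The value is a cumulative total owned by another service, so it is
/// overwritten with each snapshot instead of being incremented locally.
struct MirroredTotal {
    value: AtomicU64,
}

impl MirroredTotal {
    fn new() -> Self {
        Self { value: AtomicU64::new(0) }
    }

    /// Overwrite the stored value with the latest upstream total.
    fn set(&self, v: u64) {
        self.value.store(v, Ordering::Relaxed);
    }

    fn get(&self) -> u64 {
        self.value.load(Ordering::Relaxed)
    }
}

fn main() {
    let processing_ms_total = MirroredTotal::new();

    // Made-up snapshots of an upstream cumulative total.
    for snapshot in [120u64, 340, 910] {
        processing_ms_total.set(snapshot);
    }

    println!("{}", processing_ms_total.get());
}
```

An increment-only counter would have to compute a delta for every snapshot, whereas copying the upstream total is a single set() call, which matches the rationale given in the commit message.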
Overview:
This PR makes Prometheus metric naming conventions more consistent across the codebase by applying standardized patterns and improving gauge vs counter distinctions.
Details:
Where should the reviewer start?
lib/runtime/src/metrics/prometheus_names.rs - Core naming convention changes and transformation rules
lib/runtime/src/service.rs - Variable name synchronization with metric constants