@keivenchang keivenchang commented Sep 15, 2025

Overview:

This PR makes Prometheus metric naming conventions more consistent across the codebase by applying standardized patterns and improving gauge vs counter distinctions.

Details:

  • Rename connections_total to current_connections (gauge for active connections)
  • Rename client_disconnects_total to disconnected_clients_total (better ordering)
  • Rename PROCESSING_TIME_MS_TOTAL to PROCESSING_MS_TOTAL (more concise)
  • Apply unit_aggregation pattern: AVG_PROCESSING_MS -> PROCESSING_MS_AVG
  • Sync ComponentNatsServerPrometheusMetrics variable names with metric constants
  • Update documentation with comprehensive naming transformation rules
  • Add units _messages and _connections to naming conventions
  • Update all code references, documentation, and test comments consistently

Where should the reviewer start?

lib/runtime/src/metrics/prometheus_names.rs - Core naming convention changes and transformation rules
lib/runtime/src/service.rs - Variable name synchronization with metric constants


coderabbitai bot commented Sep 15, 2025

Walkthrough

Renames inflight metrics to use an _total suffix across frontend and work handler; adds a new CLIENT_DISCONNECTS_TOTAL metric for frontend. Updates code to use the new constants and registers the new metric. Adjusts documentation and tests to reflect renamed metrics and added metrics.

Changes

Cohort / File(s) Summary of changes
Docs: Metrics READMEs
deploy/metrics/README.md, lib/runtime/examples/system_metrics/README.md
Update metric names: inflight → inflight_requests_total; document new/expanded metrics; refresh sample Prometheus outputs.
Frontend service metrics
lib/llm/src/http/service/metrics.rs
Rename inflight gauge to INFLIGHT_REQUESTS_TOTAL; add and register CLIENT_DISCONNECTS_TOTAL gauge; update docs/comments; no control-flow changes.
Runtime metric constants
lib/runtime/src/metrics/prometheus_names.rs
Rename INFLIGHT_REQUESTS → INFLIGHT_REQUESTS_TOTAL in frontend_service and work_handler; add CLIENT_DISCONNECTS_TOTAL in frontend_service with doc comment.
Runtime metrics usage/tests
lib/runtime/src/metrics.rs, lib/runtime/examples/system_metrics/tests/integration_test.rs
Update references and test expectations from inflight_requests to inflight_requests_total; comments and assertions adjusted; logic unchanged.
Work handler pipeline
lib/runtime/src/pipeline/network/ingress/push_handler.rs
Switch gauge creation to work_handler::INFLIGHT_REQUESTS_TOTAL; no other modifications.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

I count the clouds of calls that roll,
From inflight now to totals whole.
A nibble, hop—disconnects noted,
Bytes and tokens duly quoted.
Prom labels trimmed, my ears held high—
Metrics march; the graphs comply.
Thump-thump: dashboards never lie.

Pre-merge checks

❌ Failed checks (1 warning)
Check name: Description Check
Status: ⚠️ Warning
Explanation: The PR description includes Overview, Details, and "Where should the reviewer start?" sections but omits the required "Related Issues" section and, more importantly, its listed renames and transformations do not match the actual diffs. For example, the description claims renames such as connections_total → current_connections and client_disconnects_total → disconnected_clients_total, whereas the changes in the repository rename inflight_requests → inflight_requests_total and add a CLIENT_DISCONNECTS_TOTAL constant, update README/test comments, and adjust prometheus name constants. Because the description is inconsistent with both the template and the concrete file changes, it is not sufficiently accurate for reviewers.
Resolution: Please update the PR description to include the "Related Issues" section (or state none), and correct the "Details" to exactly list the actual changes (e.g., inflight_requests → inflight_requests_total, addition of CLIENT_DISCONNECTS_TOTAL, README and test comment updates, and files changed). Also adjust the "Where should the reviewer start?" file list to match the modified files (for example lib/runtime/src/metrics/prometheus_names.rs, lib/llm/src/http/service/metrics.rs, deploy/metrics/README.md, lib/runtime/examples/system_metrics/README.md) and indicate whether the other naming changes mentioned are planned in follow-up PRs.
✅ Passed checks (2 passed)
Check name: Title Check
Status: ✅ Passed
Explanation: The title "refactor: standardize Prometheus metric naming conventions (part 1)" succinctly and accurately summarizes the primary change, standardizing Prometheus metric names, and indicates this is the first in a series. It is concise, follows conventional commit-style prefixing, and contains no unrelated details or noise.

Check name: Docstring Coverage
Status: ✅ Passed
Explanation: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2a61e29 and 191b201.

📒 Files selected for processing (7)
  • deploy/metrics/README.md (2 hunks)
  • lib/llm/src/http/service/metrics.rs (2 hunks)
  • lib/runtime/examples/system_metrics/README.md (2 hunks)
  • lib/runtime/examples/system_metrics/tests/integration_test.rs (2 hunks)
  • lib/runtime/src/metrics.rs (1 hunks)
  • lib/runtime/src/metrics/prometheus_names.rs (2 hunks)
  • lib/runtime/src/pipeline/network/ingress/push_handler.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-11T03:24:47.746Z
Learnt from: kthui
PR: ai-dynamo/dynamo#3004
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:271-277
Timestamp: 2025-09-11T03:24:47.746Z
Learning: In lib/runtime/src/pipeline/network/ingress/push_handler.rs, the maintainer prefers to keep the existing error comparison logic using format!("{:?}", err) == STREAM_ERR_MSG unchanged until proper error types are implemented, even though it has technical debt. Avoid suggesting changes to working legacy code that will be refactored later.

Applied to files:

  • lib/runtime/src/pipeline/network/ingress/push_handler.rs
🧬 Code graph analysis (1)
lib/runtime/src/metrics.rs (1)
lib/runtime/src/metrics/prometheus_names.rs (1)
  • build_component_metric_name (390-394)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Build and Test - sglang
  • GitHub Check: Build and Test - vllm
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
🔇 Additional comments (2)
lib/runtime/examples/system_metrics/tests/integration_test.rs (1)

121-123: Comments reference inflight_requests_total; keep consistent with final naming decision.

If we revert to a gauge named inflight_requests, update these comment references accordingly.

Also applies to: 146-147

lib/runtime/src/metrics.rs (1)

1634-1637: Post-activity check uses INFLIGHT_REQUESTS_TOTAL; adjust if inflight gauge is renamed.

Update to work_handler::INFLIGHT_REQUESTS if you adopt the gauge name fix.

-                build_component_metric_name(work_handler::INFLIGHT_REQUESTS_TOTAL),
+                build_component_metric_name(work_handler::INFLIGHT_REQUESTS),

@grahamking

Code Rabbit making some interesting points.

- Rename connections_total to current_connections (gauge for active connections)
- Rename client_disconnects_total to disconnected_clients_total (better ordering)
- Rename PROCESSING_TIME_MS_TOTAL to PROCESSING_MS_TOTAL (more concise)
- Apply unit_aggregation pattern: AVG_PROCESSING_MS -> PROCESSING_MS_AVG
- Sync ComponentNatsServerPrometheusMetrics variable names with metric constants
- Update documentation with comprehensive naming transformation rules
- Add units _messages and _connections to naming conventions
- Update all code references, documentation, and test comments consistently

These changes follow Prometheus best practices by distinguishing gauge vs
counter metrics and using consistent {unit}_{aggregation} naming patterns.

Signed-off-by: Keiven Chang <[email protected]>
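
To illustrate the {unit}_{aggregation} pattern named in the commit message above (AVG_PROCESSING_MS -> PROCESSING_MS_AVG), here is a minimal sketch. It is not code from this PR: `to_unit_aggregation` and the `AGGREGATIONS` list are hypothetical names chosen for the example, and the set of aggregation words is an assumption.

```rust
/// Aggregation words recognized by this sketch (an assumption, not a list from the PR).
const AGGREGATIONS: [&str; 3] = ["avg", "min", "max"];

/// Hypothetical helper: moves a leading aggregation word to the end of a
/// lowercase metric name, so "avg_processing_ms" becomes "processing_ms_avg".
/// Names that do not start with a recognized aggregation are returned unchanged.
pub fn to_unit_aggregation(name: &str) -> String {
    for agg in AGGREGATIONS.iter() {
        // Require the aggregation word to be followed by an underscore so that
        // names such as "maximum_x" are not treated as "max" + "imum_x".
        let rest = name.strip_prefix(*agg).and_then(|r| r.strip_prefix('_'));
        if let Some(rest) = rest {
            if !rest.is_empty() {
                return format!("{}_{}", rest, agg);
            }
        }
    }
    name.to_string()
}
```

A name already in the target form, such as "processing_ms_avg", passes through unchanged, so the helper can be applied more than once without further effect.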
@keivenchang keivenchang force-pushed the keivenchang/prometheus_names_sync branch from 2d4ecf4 to 7a707c7 Compare September 19, 2025 18:08
- Revert INFLIGHT_REQUESTS back to INFLIGHT_REQUESTS_TOTAL in prometheus_names.rs
- Update all code references to use INFLIGHT_REQUESTS_TOTAL constant
- Update documentation to reflect inflight_requests_total metric names
- Maintain consistency across frontend and work handler metrics
- Fix compilation errors after constant name changes

Signed-off-by: Keiven Chang <[email protected]>
- Update conventions to clarify gauges (up/down metrics) should not have _total suffix
- Revert INFLIGHT_REQUESTS_TOTAL back to INFLIGHT_REQUESTS in prometheus_names.rs
- Add comments explaining gauge vs counter distinction for inflight metrics
- Update all code references to use INFLIGHT_REQUESTS constant
- Update documentation to use inflight_requests without _total suffix
- Maintain consistency: counters use _total, gauges do not

Gauges measure current state (can go up/down), counters measure cumulative totals (only increase).

Signed-off-by: Keiven Chang <[email protected]>
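
The rule in the commit above (counters use _total, gauges do not) could be checked mechanically. The following is only a sketch under that stated rule: `MetricKind` and `follows_total_convention` are hypothetical names and do not appear in this PR.

```rust
/// The two metric kinds distinguished in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MetricKind {
    /// Cumulative value that only increases.
    Counter,
    /// Current state that can go up or down.
    Gauge,
}

/// Hypothetical check: returns true when `name` matches the convention
/// "counters end in `_total`, gauges do not".
pub fn follows_total_convention(name: &str, kind: MetricKind) -> bool {
    let has_total = name.ends_with("_total");
    match kind {
        MetricKind::Counter => has_total,
        MetricKind::Gauge => !has_total,
    }
}
```

Under this check, `inflight_requests` is accepted as a gauge name and `inflight_requests_total` is rejected for a gauge, which matches the revert described in this commit.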
… metric)

- Update DISCONNECTED_CLIENTS_TOTAL to DISCONNECTED_CLIENTS in prometheus_names.rs
- Remove _total suffix since this is a gauge metric (current state) not a counter
- Update documentation to clarify this tracks current disconnected clients count
- Gauges measure current state that can go up/down, counters measure cumulative totals
- Keep implementation as IntGauge since the metric can increase/decrease over time

Signed-off-by: Keiven Chang <[email protected]>
- Update QUEUED_REQUESTS constant from queued_requests_total to queued_requests
- Remove _total suffix since this tracks current queue size (gauge), not cumulative total
- Update documentation in deploy/metrics/README.md with new metric name
- Update example curl commands to show correct metric name
- Gauges measure current state that can go up/down, not cumulative totals

Signed-off-by: Keiven Chang <[email protected]>
…ments

- Add note to ComponentNatsServerPrometheusMetrics explaining why _total metrics use IntGauge
- Clarify that we copy counter values from underlying services rather than increment directly
- Update test comments to use inflight_requests instead of inflight_requests_total
- This explains why Counter-named metrics are implemented as Gauges (need set() method)

Signed-off-by: Keiven Chang <[email protected]>
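
The note above says the _total metrics are copied from an underlying service rather than incremented directly, which is why they need set(). A minimal std-only sketch of that mirroring pattern follows. `MirrorGauge`, `ServiceStats`, and `mirror_stats` are hypothetical stand-ins for illustration, not the types used in ComponentNatsServerPrometheusMetrics.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Hypothetical stand-in for an integer gauge: it exposes `set`, so an
/// externally computed value can be written as-is.
#[derive(Default)]
pub struct MirrorGauge {
    value: AtomicI64,
}

impl MirrorGauge {
    pub fn set(&self, v: i64) {
        self.value.store(v, Ordering::Relaxed);
    }

    pub fn get(&self) -> i64 {
        self.value.load(Ordering::Relaxed)
    }
}

/// Hypothetical snapshot of cumulative totals reported by an underlying service.
pub struct ServiceStats {
    pub requests_total: u64,
    pub errors_total: u64,
}

/// Converts a cumulative u64 total to i64, clamping at i64::MAX instead of wrapping.
fn clamp_to_i64(v: u64) -> i64 {
    v.min(i64::MAX as u64) as i64
}

/// Copies the service's cumulative totals into the exported metrics.
/// The source already did the counting, so the exported value is overwritten
/// with `set` on every scrape rather than incremented.
pub fn mirror_stats(stats: &ServiceStats, requests_total: &MirrorGauge, errors_total: &MirrorGauge) {
    requests_total.set(clamp_to_i64(stats.requests_total));
    errors_total.set(clamp_to_i64(stats.errors_total));
}
```

The exported names keep the _total suffix because the values are still cumulative from the reader's point of view. Only the storage type differs.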