-
Notifications
You must be signed in to change notification settings - Fork 646
feat: DIS-678 kvbm modularity: standalone metrics endpoint #3433
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughIntroduces standalone KVBM metrics mode served on a single port (default 6881) gated by environment variables. Updates Rust leaders/recorders and Python connectors to branch on this mode, removes worker-side metric instrumentation, adjusts Grafana queries and panels, drops Prometheus leader scrape, and revises docs accordingly. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant User as User/Operator
participant Env as Env Vars
participant Py as Python Connector (Leader)
participant Rust as Rust Leader
participant Reg as KvbmMetricsRegistry
participant Srv as Metrics HTTP Server
participant Prom as Prometheus
User->>Env: Set DYN_KVBM_METRICS_STANDALONE [=1] and optional DYN_KVBM_METRICS_PORT
Py->>Env: Read standalone flag
alt Standalone enabled
Py->>Rust: KvConnectorLeader::new(..., metrics_standalone=true)
Rust->>Env: parse_dyn_kvbm_metrics_port() (default 6881)
Rust->>Reg: Init registry and counters (kvbm_* prefixed)
Reg-->>Rust: Registry handle
Rust->>Srv: Spawn /metrics on selected port (background)
Prom-->>Srv: Scrape /metrics
Srv-->>Prom: Expose kvbm_* metrics
else Namespace mode
Py->>Rust: KvConnectorLeader::new(..., metrics_standalone=false)
Rust->>Rust: Use namespace-based metrics
end
Note over Rust,Srv: Drop KvbmMetrics triggers graceful shutdown of server
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
Pre-merge checks❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
lib/bindings/python/src/dynamo/llm/utils.py (1)
36-44
: Rename helper to fix typoLine 36 spells the helper
is_standslone_kvbm_metrics_enabled
, which is easy to misread and will propagate the typo across future call sites. Rename it (and the imports) tois_standalone_kvbm_metrics_enabled
while the surface area is still tiny.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
deploy/metrics/grafana_dashboards/grafana-kvbm-dashboard.json
(7 hunks)deploy/metrics/prometheus.yml
(0 hunks)docs/guides/run_kvbm_in_trtllm.md
(2 hunks)docs/guides/run_kvbm_in_vllm.md
(1 hunks)lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs
(5 hunks)lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs
(2 hunks)lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_leader.rs
(4 hunks)lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_worker.rs
(0 hunks)lib/bindings/python/rust/llm/block_manager/vllm/connector/worker.rs
(0 hunks)lib/bindings/python/src/dynamo/llm/trtllm_integration/connector/kvbm_connector_leader.py
(2 hunks)lib/bindings/python/src/dynamo/llm/trtllm_integration/connector/kvbm_connector_worker.py
(0 hunks)lib/bindings/python/src/dynamo/llm/utils.py
(1 hunks)lib/bindings/python/src/dynamo/llm/vllm_integration/connector_leader.py
(2 hunks)lib/bindings/python/src/dynamo/llm/vllm_integration/connector_worker.py
(0 hunks)lib/llm/src/block_manager/metrics_kvbm.rs
(3 hunks)
💤 Files with no reviewable changes (5)
- lib/bindings/python/src/dynamo/llm/vllm_integration/connector_worker.py
- lib/bindings/python/rust/llm/block_manager/vllm/connector/worker.rs
- lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_worker.rs
- lib/bindings/python/src/dynamo/llm/trtllm_integration/connector/kvbm_connector_worker.py
- deploy/metrics/prometheus.yml
🧰 Additional context used
🧬 Code graph analysis (6)
lib/bindings/python/src/dynamo/llm/vllm_integration/connector_leader.py (1)
lib/bindings/python/src/dynamo/llm/utils.py (2)
find_and_set_available_port_from_env
(9-31)is_standslone_kvbm_metrics_enabled
(36-44)
lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_leader.rs (4)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (3)
parse_dyn_kvbm_metrics_port
(637-651)new
(91-184)new
(559-590)lib/llm/src/block_manager/metrics_kvbm.rs (3)
new_with_standalone
(85-163)default
(241-243)new
(41-81)lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs (1)
new
(88-206)lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (7)
new
(190-227)new
(340-371)new
(1085-1098)new
(1109-1122)new
(1132-1142)new
(1458-1463)new
(1467-1472)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (3)
lib/llm/src/block_manager/metrics_kvbm.rs (4)
new_with_standalone
(85-163)default
(241-243)new
(41-81)new
(211-216)lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs (1)
new
(88-206)lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_leader.rs (2)
new
(65-148)new
(455-473)
lib/llm/src/block_manager/metrics_kvbm.rs (3)
lib/runtime/src/metrics/prometheus_names.rs (1)
sanitize_prometheus_name
(381-407)lib/runtime/src/pipeline/context.rs (1)
registry
(237-239)lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (2)
new
(91-184)new
(559-590)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs (2)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (3)
parse_dyn_kvbm_metrics_port
(637-651)new
(91-184)new
(559-590)lib/llm/src/block_manager/metrics_kvbm.rs (4)
new_with_standalone
(85-163)default
(241-243)new
(41-81)new
(211-216)
lib/bindings/python/src/dynamo/llm/trtllm_integration/connector/kvbm_connector_leader.py (1)
lib/bindings/python/src/dynamo/llm/utils.py (2)
find_and_set_available_port_from_env
(9-31)is_standslone_kvbm_metrics_enabled
(36-44)
🪛 markdownlint-cli2 (0.18.1)
docs/guides/run_kvbm_in_vllm.md
91-91: Bare URL used
(MD034, no-bare-urls)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: Build and Test - dynamo
- GitHub Check: tests (lib/runtime/examples)
- GitHub Check: tests (launch/dynamo-run)
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: clippy (.)
- GitHub Check: tests (.)
- GitHub Check: tests (lib/bindings/python)
- GitHub Check: clippy (launch/dynamo-run)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the metric names, can you use the common prometheus_names.rs?
/ok to test 30d911d |
lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_leader.rs
Outdated
Show resolved
Hide resolved
lib/bindings/python/src/dynamo/llm/vllm_integration/connector_leader.py
Outdated
Show resolved
Hide resolved
/ok to test 4b9b9dd |
sure, added |
/ok to test 3416649 |
3416649
to
51a52d4
Compare
Signed-off-by: richardhuo-nv <[email protected]> metrics implementation Signed-off-by: richardhuo-nv <[email protected]> add env for metrics Signed-off-by: richardhuo-nv <[email protected]> add port handling Signed-off-by: richardhuo-nv <[email protected]> add port handling Signed-off-by: richardhuo-nv <[email protected]> fix fix fix fix fix Signed-off-by: richardhuo-nv <[email protected]>
Signed-off-by: richardhuo-nv <[email protected]>
Signed-off-by: richardhuo-nv <[email protected]>
Signed-off-by: richardhuo-nv <[email protected]>
Signed-off-by: richardhuo-nv <[email protected]> fix port Signed-off-by: richardhuo-nv <[email protected]> fix docs Signed-off-by: richardhuo-nv <[email protected]> fix port Signed-off-by: richardhuo-nv <[email protected]>
Signed-off-by: richardhuo-nv <[email protected]>
Signed-off-by: richardhuo-nv <[email protected]>
Signed-off-by: richardhuo-nv <[email protected]>
51a52d4
to
e6607d8
Compare
/ok to test e6607d8 |
Signed-off-by: richardhuo-nv <[email protected]>
/ok to test 765c697 |
Signed-off-by: richardhuo-nv <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm from deploy
prespective
Overview:
KVBM need a standalone metrics port and registry to achieve the modularity. So that user could bypass the distributed runtime requirements for getting the metrics.
Also, removed the kvbm worker metrics, since it will cause port conflict issue when we have multiple TPs.
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit