Open
Description
Background
The current live statistics system uses a custom implementation where:
- Agents collect metrics via StatContext and store them in Valkey (Redis) with short TTLs (8-120 seconds)
- Manager serves these stats through GraphQL live_stat fields (kernel.live_stat, agent.live_stat)
- Data is serialized with msgpack and queried via ValkeyStatClient
Meanwhile, a separate Prometheus integration (ContainerMetricService) already exists. It queries metrics from Prometheus but is not used for the live_stat API.
Problem
- Duplicate data paths: Metrics are sent to both Valkey (for live stats) and Prometheus (for monitoring/dashboards)
- No historical data: Valkey-based stats are ephemeral with short TTLs, making trend analysis impossible
- Maintenance burden: Two separate metric pipelines to maintain
- Inconsistency risk: Valkey stats and Prometheus metrics may diverge
Proposed Solution
Replace the Valkey-based live stats system with a Prometheus-backed implementation:
- Agent side: Remove Valkey stat publishing; ensure metrics are exported to Prometheus
- Manager side: Modify live_stat GraphQL resolvers to query Prometheus via ContainerMetricService instead of ValkeyStatClient
- Deprecate: Phase out ValkeyStatClient for statistics (keep ValkeyLiveClient for service discovery)
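To make the manager-side change concrete, here is a minimal sketch of how a live_stat resolver might turn a Prometheus instant-query response into a flat stat mapping. It is not the ContainerMetricService implementation: the metric name `backendai_container_utilization`, the labels `kernel_id` and `container_metric_name`, and both helper functions are assumptions for illustration. Only the `/api/v1/query` endpoint and the response envelope come from the Prometheus HTTP API.

```python
import json
from urllib.parse import urlencode

# Hypothetical metric and label names; the real ones depend on what the agent exports.
METRIC = "backendai_container_utilization"
KERNEL_LABEL = "kernel_id"
NAME_LABEL = "container_metric_name"


def build_instant_query_url(base_url: str, kernel_id: str) -> str:
    """Build a Prometheus instant-query URL selecting one kernel's samples."""
    promql = f'{METRIC}{{{KERNEL_LABEL}="{kernel_id}"}}'
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def to_live_stat(response_body: str) -> dict[str, float]:
    """Convert an instant-query JSON response into {metric_name: latest_value}."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error')}")
    stats: dict[str, float] = {}
    for sample in payload["data"]["result"]:
        name = sample["metric"].get(NAME_LABEL, "unknown")
        # Instant vectors carry a [unix_timestamp, "value-as-string"] pair.
        _timestamp, value = sample["value"]
        stats[name] = float(value)
    return stats
```

A resolver built this way would fetch the URL with whatever HTTP client the manager already uses and return the mapping in place of the msgpack payload read from Valkey. Existing clients may expect a different live_stat shape, so a compatibility layer would likely be needed.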
Benefits
- Single source of truth: All metrics flow through Prometheus
- Historical queries: Access to time-series data for trends and analysis
- Ecosystem integration: Native compatibility with Grafana, alerting, etc.
- Reduced complexity: Eliminate Valkey stat storage and TTL management
- Lower Valkey load: Remove high-frequency stat read/write operations
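The historical-query benefit can be illustrated with a range query, which the Valkey path cannot answer once a key's TTL expires. The sketch below only builds the request URL for the Prometheus `/api/v1/query_range` endpoint. The helper name and the example PromQL expression are hypothetical.

```python
from urllib.parse import urlencode


def build_range_query_url(
    base_url: str, promql: str, start: float, end: float, step_seconds: int
) -> str:
    """Build a Prometheus range-query URL covering [start, end] in Unix seconds."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    params = {
        "query": promql,
        "start": start,
        "end": end,
        "step": f"{step_seconds}s",  # resolution of the returned series
    }
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"
```

For example, calling it with a one-hour window and a 60-second step would request roughly one sample per minute for each matching series.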
JIRA Issue: BA-4039