[RFC] OpenViking Metrics Architecture Design / OpenViking 指标体系设计 #1219
Review from an external contributor 🐻‍❄️

Overall a solid RFC: the problem diagnosis is accurate and the recommendations are reasonable. A few suggestions to strengthen it:

1. Missing performance budget discussion. Multi-tenancy plus label dimension expansion leading to cardinality explosion is a classic Prometheus pitfall; a section discussing this budget explicitly would help.
2. Gauge refresh strategy needs nuance. "Refresh on scrape" works for simple metrics, but not for every data source. Suggest a scrape-triggered + local cache (TTL 5-10 s) hybrid approach.
3. Metric naming convention not specified. The RFC identifies "missing unified naming convention" as Problem 3, but doesn't propose one. Suggest defining it directly.
4. Missing migration strategy. Renaming existing metrics is a breaking change, so a migration plan should be part of the RFC.
Summary: 7.5/10. Clear structure and accurate problem identification, but lacking implementation details (performance budget, naming convention, migration plan). Good enough to spark discussion, but it needs another iteration before it is directly implementable. Great work on the bilingual format, which suits an open-source project targeting both Chinese and international contributors.
[RFC] OpenViking Metrics Architecture
🔍 Review Items
- Should `/metrics` remain limited to system/runtime metrics, while business analytics metrics continue to stay in `/api/v1/stats`? Recommended: Yes
- Should metrics support an `account_id` dimension? Recommended: Supported, but disabled by default and enabled only for low-cardinality metrics
- Should `Summary` be supported in the first release? Recommended: Support only Counter / Gauge / Histogram in the first release; do not introduce Summary yet

Confirmed Assumptions
- The `BaseObserver` family is responsible for reading instantaneous state and does not store historical metrics.
- `/metrics` is intended for Prometheus and similar monitoring systems, and should focus on runtime health and service quality.
- `/api/v1/stats` is intended for business analytics and content-quality analysis. It does not need Prometheus compatibility and should not sacrifice readability for scrape-oriented access patterns.

1. Background and Current State
1.1 Current Observation Entrypoints
OpenViking currently exposes three observation-related entrypoints:
- `/api/v1/observer/*`: `ObserverService` assembles `QueueObserver`, `VikingDBObserver`, `ModelsObserver`, `LockObserver`, and `RetrievalObserver`
- `/api/v1/stats/*`: `StatsAggregator` dynamically queries memory categories, hotness, staleness, and session extraction results
- `/metrics`: `PrometheusObserver.render_metrics()` produces the text output

1.2 Current Code Analysis
Based on the current codebase, the current state can be summarized as follows:
| Component | Location | Notes |
|---|---|---|
| `BaseObserver` | `openviking/storage/observers/base_observer.py` | Defines the `get_status_table / is_healthy / has_errors` abstract interface |
| `ObserverService` | `openviking/service/debug_service.py` | Assembles the concrete observers |
| `PrometheusObserver` | `openviking/storage/observers/prometheus_observer.py` | Renders the `/metrics` text |
| `RetrievalStatsCollector` | `openviking/retrieve/retrieval_stats.py` | Writes into `PrometheusObserver` |
| `collection_schemas` embedding path | `openviking/storage/collection_schemas.py` | Writes into `PrometheusObserver` |
| `VLMBase.update_token_usage` | `openviking/models/vlm/base.py` | Records VLM call metrics |
| `/metrics` router | `openviking/server/routers/metrics.py` | Directly takes the rendered result from `app.state.prometheus_observer` |

1.3 Prometheus Metrics Already Supported
Today, `/metrics` only exposes the following groups of metrics:

| Metric | Source |
|---|---|
| `openviking_retrieval_requests_total` | `RetrievalStatsCollector.record_query()` |
| `openviking_retrieval_latency_seconds` | `RetrievalStatsCollector.record_query()` |
| `openviking_embedding_requests_total` | embedding path in `collection_schemas` |
| `openviking_embedding_latency_seconds` | embedding path in `collection_schemas` |
| `openviking_vlm_calls_total` | `VLMBase.update_token_usage()` |
| `openviking_vlm_call_duration_seconds` | `VLMBase.update_token_usage()` |
| `openviking_cache_hits_total{level=...}` | `PrometheusObserver.record_cache_hit()` |
| `openviking_cache_misses_total{level=...}` | `PrometheusObserver.record_cache_miss()` |

These metrics prove that the pipeline already works end to end, but the coverage is still narrow and mostly limited to the retrieval / embedding / VLM event flows.
1.4 Main Problems
Problem 1: `PrometheusObserver` breaks consistency inside the `Observer` family

Other Observer implementations read instantaneous state on demand and hold no history. By contrast, `PrometheusObserver`:

- accumulates historical metric state in memory;
- exposes `record_*` write APIs;
- renders the Prometheus exposition text itself.

This means it is not really an Observer. It is a mixed "metric registry + Prometheus exporter".
Problem 2: Strong coupling between collection and export layers
Current retrieval, embedding, and VLM instrumentation all depend directly on `get_prometheus_observer()`, hard-wiring every instrumentation point to one concrete exporter implementation.

Problem 3: Incomplete metric type coverage
Only Counter and Histogram exist today. Gauge is still missing as a first-class type, and cache metrics carry only a single `level` label.

Problem 4: Insufficient multi-tenant observability
OpenViking is a multi-tenant system, but current metrics are almost entirely process-level totals, leaving per-tenant behavior invisible.
At the same time, once dimensions expand to `user_id / session_id / resource_uri`, the system will quickly run into cardinality explosion. Strict label boundaries are therefore required.

Problem 5: `/metrics` and `/api/v1/stats` are easy to confuse

The memory category, hotness, and staleness outputs provided by `StatsAggregator` are much closer to analytical statistics than to natural inputs for a high-frequency Prometheus scraping system. Moving such queries directly into `/metrics` would create several problems. The design must therefore make a hard separation between online metrics and analytical statistics.
2. Goals and Non-goals
2.1 Goals
- Return `Observer` to its intended role: instantaneous state observation.
- Build a unified `MetricRegistry`.
- Provide systematic Gauge / Counter / Histogram support for `/metrics`.
- Clarify the boundary between `/metrics` and `/api/v1/stats`.

2.2 Non-goals
3. New Architecture Overview
3.1 Core Principles
This section defines the foundational principles that must continue to hold across the metrics design. They constrain the later abstract layering, module boundaries, telemetry relationships, and the responsibilities of external observation interfaces.
Principle A: Abstract responsibilities first, concrete types later
At the overview layer, only the following four core abstractions are kept: `MetricDataSource`, `BaseMetricCollector`, `MetricRegistry`, and `BaseMetricExporter`.

Principle B: Decouple data sources from the metrics system
Existing Observer outputs, telemetry, TaskTracker state, and business event instrumentation are all “data origins”, but they are not the metrics system itself.
Therefore, the metrics system treats these origins only as data sources. This split prevents retrieval, embedding, VLM, and similar business code from directly coupling to any one exporter.
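To make the decoupling concrete, here is a minimal Python sketch of business code emitting events through an abstraction instead of calling `get_prometheus_observer()` directly. All names here (`MetricSink`, `MetricEvent`, `record_retrieval`) are illustrative, not existing OpenViking APIs:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class MetricEvent:
    """One completed business event, e.g. a retrieval request."""
    name: str
    value: float = 1.0
    labels: dict = field(default_factory=dict)


class MetricSink(ABC):
    """Write-side abstraction; business code depends only on this."""
    @abstractmethod
    def emit(self, event: MetricEvent) -> None: ...


class InMemorySink(MetricSink):
    """Test double; a real sink would forward into the Collector layer."""
    def __init__(self):
        self.events = []

    def emit(self, event: MetricEvent) -> None:
        self.events.append(event)


def record_retrieval(sink: MetricSink, status: str, seconds: float) -> None:
    # Business code emits events; it never imports a Prometheus exporter.
    sink.emit(MetricEvent("openviking_retrieval_requests_total", 1.0, {"status": status}))
    sink.emit(MetricEvent("openviking_retrieval_latency_seconds", seconds, {"status": status}))


sink = InMemorySink()
record_retrieval(sink, "ok", 0.12)
```

Swapping the exporter then never touches `record_retrieval`; only the sink implementation changes.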
Principle C: Registry is the single source of truth for metrics
`MetricRegistry` acts as the sole in-process source of truth for metrics, responsible for unified reads, writes, and constraints.

The "current view" here only means reading registry state at access time. It does not introduce an extra Snapshot architecture layer.
Principle D: Exporter is responsible only for protocol export
Exporter is responsible only for protocol output, not for metric semantics.
Exporter is not responsible for metric semantics, collection timing, or storage. It is only responsible for reading the registry and serializing the result to `/metrics` or other monitoring backends.

Prometheus is therefore only the first concrete implementation, not the center of the overall metrics architecture.
Principle E: Preserve the responsibility boundaries of the three observation entrypoints
`/metrics`, `/api/v1/observer/*`, and `/api/v1/stats/*` should continue to coexist, but with a clear division of responsibilities:

- `/metrics` is machine-oriented, emphasizing low cardinality, low cost, and sustainable aggregation;
- `/api/v1/observer/*` is human-oriented, emphasizing readability of instantaneous state;
- `/api/v1/stats/*` is analytics-oriented, allowing heavier queries and statistical logic.

These three interfaces may share some data sources, but they must not share one output model.
3.2 Abstract Layering
This section gives the abstract main pipeline of the metrics system and keeps only the minimal stable roles, without expanding into concrete implementations too early.
The abstract main pipeline is:
```mermaid
graph LR
    A["MetricDataSource"]
    B["BaseMetricCollector"]
    C["MetricRegistry"]
    D["BaseMetricExporter"]
    A --> B
    B --> C
    C --> D
```

The responsibilities of the four abstract roles are:
- `MetricDataSource`: the origin of observable data
- `BaseMetricCollector`: semantic mapping of inputs into metric writes
- `MetricRegistry`: unified in-process metric storage
- `BaseMetricExporter`: protocol export

The data flow corresponding to this abstract pipeline is:

1. `MetricDataSource` provides instantaneous state, request-end summaries, or runtime events;
2. `BaseMetricCollector` maps these inputs into Counter / Gauge / Histogram and other unified metrics;
3. `MetricRegistry` stores all current in-process metric state;
4. `BaseMetricExporter` reads the registry when needed and outputs to external systems.

3.3 Design Boundaries
After the abstract layering is defined, explicit boundaries are still required so that responsibilities do not leak back across layers:

- Business state may only be touched and read through a `MetricDataSource`.
- `Collector`, `MetricRegistry`, and `Exporter` must not directly access business services, business storage, or business context.
- The metrics system only holds metric state in `MetricRegistry`, without owning real business state.
- Refresh happens before `/metrics` is scraped. Reads at that point must be limited to lightweight snapshots, existing aggregated results, or lightweight probe results.

3.4 Directory and Module Boundaries
Once the above boundaries are in place, the module layout must remain consistent with them, instead of mixing responsibilities back together through directory structure. To avoid continued blending with `storage/observers/`, the new metrics system should live under `openviking/metrics/`, grouped by registry / collectors / exporters / naming / bootstrap concerns.

Suggested logical groups:

- `registry`: the unified metric store
- `collectors`: Event / State / DomainStats / Probe collectors
- `exporters`: protocol exporters (Prometheus first)
- `naming`: naming and label rules
- `bootstrap`: startup assembly

3.5 Relationship with Existing Telemetry
Operation telemetry and metrics are not two mutually replacing systems. Telemetry remains a request-level structured summary used for per-call explanation and troubleshooting. Metrics only extract a whitelist of low-cardinality fields for continuous scraping, aggregation, and alerting.
Operation telemetry already carries many valuable fields, such as `duration_ms`, `tokens.*`, `vector.*`, `queue.*`, `semantic_nodes.*`, `memory.extract.*`, and `errors.*`.

However, not all fields are suitable for `/metrics`, so only a whitelist is metricized: `duration_ms`, `tokens.total / llm / embedding`, `vector.searches / scored / returned`, `queue.*`, `semantic_nodes.*`, and `memory.extract.*`. A field such as `errors.message` stays outside the whitelist because of its unbounded cardinality.

3.6 Responsibility Boundaries of `/metrics`, `/api/v1/observer`, and `/api/v1/stats`

The three external observation interfaces may share some data sources, but they do not share a single output model, nor should a single interface be expected to carry all observation needs. Making this boundary explicit avoids future drift between machine scraping, human diagnosis, and business analytics.

4. Core Design Details
4.1 `MetricDataSource` Design

At the implementation layer, the abstract role `MetricDataSource` should land as a unified base class `BaseMetricDataSource`, further divided into four intermediate abstractions: `EventMetricDataSource`, `StateMetricDataSource`, `DomainStatsMetricDataSource`, and `ProbeMetricDataSource`. These four are not just logical tags; each corresponds to a different data access contract and refresh pattern, so the architecture should distinguish them explicitly.
In OpenViking, considering the current functional coverage and existing code sampling points, a two-layer structure of “unified base class + intermediate contract layer” is recommended. The inheritance relationship is:
```mermaid
graph LR
    A["BaseMetricDataSource"]
    B["EventMetricDataSource"]
    C["StateMetricDataSource"]
    D["DomainStatsMetricDataSource"]
    E["ProbeMetricDataSource"]
    A --> B
    A --> C
    A --> D
    A --> E
```

The data access contracts of these four intermediate abstractions are:
- `EventMetricDataSource`: delivers request-end or event-completion records (push)
- `StateMetricDataSource`: exposes instantaneous state that is sampled on demand (pull)
- `DomainStatsMetricDataSource`: exposes existing pre-aggregated domain statistics (pull)
- `ProbeMetricDataSource`: runs lightweight active checks against dependencies

`Observer`, Telemetry, TaskTracker, HTTP Router, and all business services keep their existing roles. The metrics system only treats them as data sources and does not transform them into exporters or collectors.
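A minimal sketch of the four contracts as Python ABCs. The method names (`drain_events`, `read_state`, `read_stats`, `probe`) are assumptions for illustration, not the proposed API:

```python
from abc import ABC, abstractmethod


class BaseMetricDataSource(ABC):
    """Unified base: a read-only origin of observable data."""


class EventMetricDataSource(BaseMetricDataSource):
    """Push model: hands over records of completed events."""
    @abstractmethod
    def drain_events(self) -> list: ...


class StateMetricDataSource(BaseMetricDataSource):
    """Pull model: returns instantaneous state when sampled."""
    @abstractmethod
    def read_state(self) -> dict: ...


class DomainStatsMetricDataSource(BaseMetricDataSource):
    """Pull model: returns pre-aggregated domain statistics."""
    @abstractmethod
    def read_stats(self) -> dict: ...


class ProbeMetricDataSource(BaseMetricDataSource):
    """Active check: returns a boolean health result for a dependency."""
    @abstractmethod
    def probe(self) -> bool: ...


class QueueDepthSource(StateMetricDataSource):
    """Example state source wrapping an in-process queue."""
    def __init__(self, queue):
        self._queue = queue

    def read_state(self) -> dict:
        return {"queue_pending": len(self._queue)}
```

A concrete source implements exactly one contract, which keeps its refresh pattern unambiguous for the matching collector.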
4.2 `BaseMetricCollector` Design

Collectors sit at the semantic mapping layer of the metrics system. They receive different kinds of observable input and convert them into stable, unified writes into `MetricRegistry`. Since `MetricDataSource` is divided into Event, State, DomainStats, and Probe categories, the Collector side should adopt the same layered organization instead of remaining in a simplified Event / State-only model.

Under this design, a Collector is no longer just a "metric writer". It becomes the unified convergence layer: on one side it shields upstream differences in access patterns and update cadence, and on the other side it exposes consistent write semantics to the registry. This keeps future extension points concentrated in the Collector layer rather than scattering source-specific logic into registry or exporter layers.
The recommended organization is “base class + four child abstractions”:
```mermaid
graph LR
    A["BaseMetricCollector"]
    B["EventMetricCollector"]
    C["StateMetricCollector"]
    D["DomainStatsMetricCollector"]
    E["ProbeMetricCollector"]
    A --> B
    A --> C
    A --> D
    A --> E
```

The responsibility split is:
- `BaseMetricCollector`: shared helpers and registry access
- `EventMetricCollector`: consumes completed events, incrementing Counters and observing Histograms
- `StateMetricCollector`: samples instantaneous state; refreshed before each `/metrics` scrape
- `DomainStatsMetricCollector`: converts pre-aggregated domain statistics into low-cardinality metrics
- `ProbeMetricCollector`: turns probe results into health Gauges

The primary mapping between Collector and DataSource is:
- `EventMetricDataSource` → `EventMetricCollector`
- `StateMetricDataSource` → `StateMetricCollector`
- `DomainStatsMetricDataSource` → `DomainStatsMetricCollector`
- `ProbeMetricDataSource` → `ProbeMetricCollector`

Design rationale: after introducing the four Collector categories, the semantic boundaries of the whole system become clearer.
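As an illustration of the State path, a toy `StateMetricCollector` that is refreshed before a scrape and writes Gauges through the registry's unified API. All names here are hypothetical; the registry is a minimal stub:

```python
class FakeRegistry:
    """Stub standing in for MetricRegistry; stores Gauge values only."""
    def __init__(self):
        self.gauges = {}

    def set_gauge(self, name, value, labels=None):
        key = (name, tuple(sorted((labels or {}).items())))
        self.gauges[key] = value


class StateMetricCollector:
    """Samples a state source and maps its fields to Gauge writes."""
    def __init__(self, registry, read_state):
        self.registry = registry
        self.read_state = read_state  # callable returning a dict of state

    def refresh(self):
        # Called by the exporter before each /metrics scrape.
        for field, value in self.read_state().items():
            self.registry.set_gauge(f"openviking_{field}", value)


registry = FakeRegistry()
collector = StateMetricCollector(registry, lambda: {"queue_pending": 4})
collector.refresh()
```

The collector owns the name mapping; the source only reports raw state, and the registry only stores the result.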
4.3 `MetricRegistry` Design

`MetricRegistry` is the stable center of the whole system. It handles unified registration, validation, storage, and reads, but it does not decide collection timing and does not perform protocol export. It maintains a unified external interface and does not split separate write APIs for Event / State / DomainStats / Probe; semantic branching belongs in the Collector layer.

Among the capabilities `MetricRegistry` must satisfy, one constraint is explicit: `/metrics` scraping must not hold heavy locks for long.

Registry only solves "how metrics are stored uniformly". It does not solve collection timing or protocol export.
The unified interface strategy is:
- Counter and Histogram writes go through `inc_counter` or `observe_histogram`;
- Gauge writes go through `set_gauge`.

In other words, Registry does not know whether the current write comes from an event, state, stats, or probe. It only knows which metric type, name, labels, and value are being written.
This allows the registry to remain the stable core of the metrics system without changing frequently with collector or exporter evolution.
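A compact sketch of what such a registry could look like. The method names follow the unified interface strategy above; locking granularity and validation are deliberately simplified and are not the proposed implementation:

```python
import threading
from collections import defaultdict


class MetricRegistry:
    """Single in-process source of truth with a unified write API."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = defaultdict(float)
        self._gauges = {}
        self._histograms = defaultdict(list)

    @staticmethod
    def _key(name, labels):
        # Canonical (name, sorted-labels) key, independent of label order.
        return (name, tuple(sorted((labels or {}).items())))

    def inc_counter(self, name, value=1.0, labels=None):
        with self._lock:
            self._counters[self._key(name, labels)] += value

    def set_gauge(self, name, value, labels=None):
        with self._lock:
            self._gauges[self._key(name, labels)] = value

    def observe_histogram(self, name, value, labels=None):
        with self._lock:
            self._histograms[self._key(name, labels)].append(value)

    def snapshot(self):
        # Cheap copy so exporters never hold the lock while serializing.
        with self._lock:
            return {
                "counters": dict(self._counters),
                "gauges": dict(self._gauges),
                "histograms": {k: list(v) for k, v in self._histograms.items()},
            }
```

The `snapshot()` copy is what keeps scrapes from holding the write lock while text serialization runs.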
4.4 `BaseMetricExporter` Design

Exporter sits at the downstream end of the metrics system and is only responsible for converting current registry state into external protocols. Like Registry, Exporter follows a unified-interface principle: it depends only on unified read APIs and does not care whether the data came from Event, State, DomainStats, or Probe flows.
The recommended inheritance structure is:
```mermaid
graph TB
    A["BaseMetricExporter"]
    B["PrometheusExporter"]
    C["OtelExporter"]
    D["InfluxDBExporter"]
    A --> B
    A --> C
    A --> D
```

Each exporter is positioned as follows:
- `PrometheusExporter`: the first-release implementation, producing the `/metrics` exposition output
- `OtelExporter`: a possible future OpenTelemetry exporter
- `InfluxDBExporter`: a possible future InfluxDB exporter

The unified read strategy is that every exporter reads the same registry state through the same read APIs.
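A toy `PrometheusExporter` illustrating the read-only contract: it serializes a registry snapshot into exposition-style text and computes no metric semantics. The snapshot layout is an assumption matching the registry sketch earlier, not the final interface:

```python
class PrometheusExporter:
    """Serializes a registry snapshot; never writes or aggregates."""

    def __init__(self, registry):
        self.registry = registry  # anything exposing snapshot()

    @staticmethod
    def _fmt(name, labels, value):
        if labels:
            body = ",".join(f'{k}="{v}"' for k, v in labels)
            return f"{name}{{{body}}} {value}"
        return f"{name} {value}"

    def export(self) -> str:
        snap = self.registry.snapshot()
        lines = []
        for (name, labels), value in sorted(snap["counters"].items()):
            lines.append(self._fmt(name, labels, value))
        for (name, labels), value in sorted(snap["gauges"].items()):
            lines.append(self._fmt(name, labels, value))
        return "\n".join(lines) + "\n"


class StubRegistry:
    """Fixed snapshot for demonstration."""
    def snapshot(self):
        return {
            "counters": {("openviking_retrieval_requests_total", (("status", "ok"),)): 3.0},
            "gauges": {("openviking_queue_pending", ()): 4},
        }


text = PrometheusExporter(StubRegistry()).export()
```

A production version would also emit `# HELP` / `# TYPE` lines and histogram buckets, but those are still pure serialization concerns.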
4.5 Prometheus Runtime Data Path
This section describes the end-to-end runtime path from a `/metrics` request entering the service to the final response, and clarifies the boundary of refresh-before-export. The recommended approach is: when Prometheus scrapes `/metrics`, required collectors are refreshed first, then the unified registry snapshot is read, and finally the exporter serializes the result into protocol text.

Recommended flow when Prometheus scrapes `/metrics`:

```mermaid
sequenceDiagram
    participant P as Prometheus
    participant R as /metrics Router
    participant E as PrometheusExporter
    participant C as StateCollectors
    participant G as MetricRegistry
    P->>R: GET /metrics
    R->>E: export()
    E->>C: refresh()
    C->>G: set Gauge values
    E->>G: read unified metric snapshot
    G-->>E: metric samples
    E-->>R: text exposition
    R-->>P: 200 text/plain
```

StateCollectors are refreshed before export so that Gauge values reflect instantaneous state at scrape time rather than stale samples.
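The scrape path above can be sketched in a few lines of Python. Class names are illustrative, and a real router would additionally set the Prometheus content-type header:

```python
class MetricsEndpoint:
    """Scrape path: refresh state collectors first, then export."""

    def __init__(self, exporter, state_collectors):
        self.exporter = exporter
        self.state_collectors = state_collectors

    def handle_scrape(self) -> str:
        for collector in self.state_collectors:
            collector.refresh()        # set fresh Gauge values in the registry
        return self.exporter.export()  # serialize the unified snapshot


# Minimal stubs recording call order for demonstration.
calls = []


class _Collector:
    def refresh(self):
        calls.append("refresh")


class _Exporter:
    def export(self):
        calls.append("export")
        return "openviking_queue_pending 0\n"


endpoint = MetricsEndpoint(_Exporter(), [_Collector(), _Collector()])
body = endpoint.handle_scrape()
```

The ordering guarantee (all refreshes strictly before the snapshot read) is the whole point of this layer.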
5. Metric Strategy
5.1 Metric Mapping Overview
Before detailing metric lists, the mapping “metric family → DataSource → Collector → metric type” should be unified into one view. This prevents later lists from showing only metric names without their collection paths. Every first-release metric should be traceable to a concrete DataSource and Collector. If a metric cannot explain its input source or collection responsibility, it should not be included in the first-release scope.
| DataSource | Collector |
|---|---|
| `HttpRequestLifecycleDataSource` | `EventMetricCollector` (can later be refined into `HTTPCollector`) |
| `RetrievalStatsDataSource`, retrieval completion events | `RetrievalCollector` |
| `ModelUsageDataSource`, model-call events | `VLMCollector`, `EmbeddingCollector`, `ModelUsageCollector` |
| `ResourceIngestionEventDataSource` | `EventMetricCollector` (can later be refined into `ResourceIngestionCollector`) |
| `SessionLifecycleDataSource`, `TaskStateDataSource`, `QueuePipelineStateDataSource` | `TaskTrackerCollector`, `QueueCollector` |
| `ObserverStateDataSource`, `QueuePipelineStateDataSource` | `ObserverHealthCollector`, `LockCollector`, `VikingDBCollector` |
| `EncryptionEventDataSource`, `EncryptionProbeDataSource` | `EncryptionCollector`, `EncryptionProbeCollector` |
| `*ProbeDataSource` | `*ProbeCollector` |
| operation telemetry summaries | `TelemetryBridgeCollector` |

5.2 Metric Object Model
The first-release metric object model should stay compact and stable. The first release uniformly supports exactly three metric types: Counter, Gauge, and Histogram. `Summary` is intentionally excluded from this scope to avoid unnecessary complexity and semantic overlap.

5.3 Label Strategy
The core of label strategy is not “express as much business information as possible”, but finding a stable balance between observability value and cardinality risk. For this reason, label design must follow a “low-cardinality first” principle and only allow a small, enumerable, controllable set of labels by default.
Allowed Common Labels
| Label | Example values |
|---|---|
| `operation` | `search.find`, `resources.add_resource` |
| `status` | `ok` / `error` |
| `queue` | `Embedding` / `Semantic` / other queue names |
| `level` | `L0` / `L1` / `L2` |
| `context_type` | `memory` / `resource` |
| `component` | |
| `task_type` | `session_commit` |
| `account_id` | |
| `provider` | |
| `model_name` | |

Tenant Label Strategy
`account_id` is only recommended for a small set of low-cardinality metrics, and it stays disabled by default, in line with the review items above. Guardrails are recommended to keep this label strictly bounded.
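One possible guardrail, sketched in Python, is to sanitize labels against the allowed set before any registry write. The sets below mirror this section's tables; whether to drop silently or raise in strict mode is a policy choice, and `sanitize_labels` is a hypothetical helper name:

```python
ALLOWED_LABELS = {
    "operation", "status", "queue", "level", "context_type",
    "component", "task_type", "account_id", "provider", "model_name",
}
# Known high-cardinality keys that must never reach the registry.
FORBIDDEN_LABELS = {"user_id", "session_id", "resource_uri", "error_message", "query"}


def sanitize_labels(labels: dict) -> dict:
    """Keep only whitelisted labels; drop forbidden and unknown keys."""
    clean = {}
    for key, value in labels.items():
        if key in FORBIDDEN_LABELS:
            continue  # silently drop; a strict mode could raise instead
        if key in ALLOWED_LABELS:
            clean[key] = str(value)
    return clean
```

Placing this check inside the registry write path makes the cardinality budget enforceable rather than advisory.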
High-cardinality Labels to Avoid
- `user_id`
- `session_id`
- `resource_uri`
- `error_message`
- `query`

5.4 Naming Convention
The goal of the naming convention is to keep the same class of metrics expressed consistently across collectors and flows, reducing naming drift and semantic ambiguity. All metric names therefore follow the `openviking_<domain>_<metric>_<unit>` template, with additional constraints for Counter / Histogram / Health metrics.

Recommended unified metric naming: `openviking_<domain>_<metric>_<unit>`
Examples:

- `openviking_retrieval_requests_total`
- `openviking_queue_pending`
- `openviking_task_running`
- `openviking_operation_duration_seconds`

Design requirements:
- Counter names end with `_total`;
- duration metrics use `_seconds` as the unit;
- health metrics are `0/1` numeric values.

5.5 Metrics Explicitly Excluded from `/metrics`

The following outputs are better kept in `/api/v1/stats` or telemetry JSON: memory health, staleness, session extraction, and similar analytical statistics.

5.6 Bucket Strategy
All latency Histograms should use a unified second-based bucket strategy to make cross-module comparison easier:
- `0.005 / 0.01 / 0.025`
- `0.05 / 0.1 / 0.25`
- `0.5 / 1.0 / 2.5`
- `5.0 / 10.0 / 30.0`

If a particular path needs a custom bucket configuration, that should be configured at the metric definition layer rather than hardcoded in business instrumentation.
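Expressed as a shared constant, with a small helper showing Prometheus-style cumulative (`le`) bucket semantics. The constant name is illustrative:

```python
# Unified second-based latency buckets from this RFC, shared by all
# latency Histograms so dashboards can compare modules directly.
LATENCY_BUCKETS_SECONDS = (
    0.005, 0.01, 0.025,
    0.05, 0.1, 0.25,
    0.5, 1.0, 2.5,
    5.0, 10.0, 30.0,
)


def bucket_counts(observations, buckets=LATENCY_BUCKETS_SECONDS):
    """Cumulative per-bucket counts: each entry counts values <= that bound.
    Values above the last bound only appear in the implicit +Inf bucket."""
    return [sum(1 for v in observations if v <= le) for le in buckets]
```

Keeping the tuple in one module is what makes "configured at the metric definition layer" enforceable.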
5.7 Telemetry Metricization Details
For operation telemetry, only fields that already exist at completion time and do not create high cardinality are extracted. This is aligned with the current capabilities in `docs/zh/guides/05-observability.md`, `docs/zh/guides/07-operation-telemetry.md`, and `openviking/telemetry/operation.py`.

| Telemetry field | Metric |
|---|---|
| `summary.operation` + `summary.status` | `openviking_operation_requests_total` |
| `summary.duration_ms` | `openviking_operation_duration_seconds` |
| `summary.tokens.total` | `openviking_operation_tokens_total{token_type="all"}` |
| `summary.tokens.llm.input` | `openviking_operation_tokens_total{token_type="llm_input"}` |
| `summary.tokens.llm.output` | `openviking_operation_tokens_total{token_type="llm_output"}` |
| `summary.tokens.embedding.total` | `openviking_operation_tokens_total{token_type="embedding"}` |
| `summary.vector.searches` | `openviking_vector_searches_total` |
| `summary.vector.scored` | `openviking_vector_scored_total` |
| `summary.vector.passed` | `openviking_vector_passed_total` |
| `summary.vector.returned` | `openviking_vector_returned_total` |
| `summary.vector.scanned` | `openviking_vector_scanned_total` |
| `summary.memory.extracted` | `openviking_memory_extracted_total` |
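A sketch of how a `TelemetryBridgeCollector` might apply this mapping. The field paths follow the table above; the registry write API and the nesting of the `summary` dict are assumptions, and only a few token fields are shown:

```python
def metricize_summary(summary: dict, registry) -> None:
    """Extract whitelisted fields from one telemetry summary into metrics."""
    op_labels = {"operation": summary["operation"], "status": summary["status"]}
    registry.inc_counter("openviking_operation_requests_total", 1.0, op_labels)
    # duration_ms is converted to seconds here, at the bridge, never downstream.
    registry.observe_histogram(
        "openviking_operation_duration_seconds",
        summary["duration_ms"] / 1000.0,
        op_labels,
    )
    tokens = summary.get("tokens", {})
    for token_type, value in (
        ("all", tokens.get("total", 0)),
        ("llm_input", tokens.get("llm", {}).get("input", 0)),
        ("llm_output", tokens.get("llm", {}).get("output", 0)),
        ("embedding", tokens.get("embedding", {}).get("total", 0)),
    ):
        if value:
            registry.inc_counter(
                "openviking_operation_tokens_total", value,
                {**op_labels, "token_type": token_type},
            )


class _Recorder:
    """Stub registry recording writes for demonstration."""
    def __init__(self):
        self.writes = []

    def inc_counter(self, name, value, labels):
        self.writes.append(("counter", name, value))

    def observe_histogram(self, name, value, labels):
        self.writes.append(("histogram", name, value))


rec = _Recorder()
metricize_summary(
    {"operation": "search.find", "status": "ok", "duration_ms": 250,
     "tokens": {"total": 10, "llm": {"input": 6, "output": 4}}},
    rec,
)
```

Because the bridge only reads fields present at completion time, it never blocks the request path.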
6. Final Recommendations

Recommended Conclusion
- Remove `PrometheusObserver` completely from the `Observer` family.
- Build a unified `MetricRegistry + Collector + Exporter` system under `openviking/metrics/`.
- Adopt the four intermediate contracts `Event / State / DomainStats / Probe`; split concrete probe subclasses by dependency type.
- Keep `/metrics` focused on low-cardinality, low-cost online monitoring metrics; continue to keep memory health, staleness, session extraction, and similar analytical outputs in `/api/v1/stats`.

Expected Benefits
- The interface boundaries of `observer`, `stats`, and `metrics` become clear.