[RFC] OpenViking Metrics Architecture Design / OpenViking 指标体系设计 #1219
Review from an external contributor 🐻‍❄️

Overall a solid RFC: the problem diagnosis is accurate and the recommendations are reasonable. A few suggestions to strengthen it:

1. Missing performance budget discussion. Multi-tenancy plus label dimension expansion leading to cardinality explosion is a classic Prometheus pitfall; a section discussing this budget explicitly would help.
2. Gauge refresh strategy needs nuance. "Refresh on scrape" works for simple metrics, but not for every data source. Suggest a scrape-triggered + local cache (TTL 5-10 s) hybrid approach.
3. Metric naming convention not specified. The RFC identifies "missing unified naming convention" as Problem 3, but doesn't propose one. Suggest defining it directly.
4. Missing migration strategy. Renaming existing metrics is a breaking change, so a migration plan should be part of the RFC.
Summary: 7.5/10. Clear structure and accurate problem identification, but lacking implementation details (performance budget, naming convention, migration plan). Good enough to spark discussion, but it needs another iteration before it is directly implementable. Great work on the bilingual format, which suits an open-source project targeting both Chinese and international contributors.
[RFC] OpenViking Metrics Architecture
🔍 Review Items
- Should `/metrics` remain limited to system/runtime metrics, while business analytics metrics continue to stay in `/api/v1/stats`? Recommended: Yes
- Should metrics support an `account_id` dimension? Recommended: Supported, but disabled by default and enabled only for low-cardinality metrics
- Should `Summary` be supported in the first release? Recommended: Support only Counter / Gauge / Histogram in the first release; do not introduce Summary yet

Confirmed Assumptions
- The `BaseObserver` family is responsible for reading instantaneous state and does not store historical metrics.
- `/metrics` is intended for Prometheus and similar monitoring systems, and should focus on runtime health and service quality.
- `/api/v1/stats` is intended for business analytics and content-quality analysis. It does not need Prometheus compatibility and should not sacrifice readability for scrape-oriented access patterns.

1. Background and Current State
1.1 Current Observation Entrypoints
OpenViking currently exposes three observation-related entrypoints:
- `/api/v1/observer/*`: `ObserverService` assembles `QueueObserver`, `VikingDBObserver`, `ModelsObserver`, `LockObserver`, and `RetrievalObserver`
- `/api/v1/stats/*`: `StatsAggregator` dynamically queries memory categories, hotness, staleness, and session extraction results
- `/metrics`: `PrometheusObserver.render_metrics()` produces the text output

1.2 Current Code Analysis
Based on the current codebase, the current state can be summarized as follows:
| Component | Location | Notes |
|---|---|---|
| `BaseObserver` | `openviking/storage/observers/base_observer.py` | Defines the `get_status_table / is_healthy / has_errors` abstract interface |
| `ObserverService` | `openviking/service/debug_service.py` | Assembles the concrete observers |
| `PrometheusObserver` | `openviking/storage/observers/prometheus_observer.py` | Renders the `/metrics` text |
| `RetrievalStatsCollector` | `openviking/retrieve/retrieval_stats.py` | Writes into `PrometheusObserver` |
| `collection_schemas` embedding path | `openviking/storage/collection_schemas.py` | Writes into `PrometheusObserver` |
| `VLMBase.update_token_usage` | `openviking/models/vlm/base.py` | Records VLM call metrics |
| `/metrics` router | `openviking/server/routers/metrics.py` | Directly takes the rendered result from `app.state.prometheus_observer` |

1.3 Prometheus Metrics Already Supported
Today, `/metrics` only exposes the following groups of metrics:

| Metric | Source |
|---|---|
| `openviking_retrieval_requests_total` | `RetrievalStatsCollector.record_query()` |
| `openviking_retrieval_latency_seconds` | `RetrievalStatsCollector.record_query()` |
| `openviking_embedding_requests_total` | embedding path in `collection_schemas` |
| `openviking_embedding_latency_seconds` | embedding path in `collection_schemas` |
| `openviking_vlm_calls_total` | `VLMBase.update_token_usage()` |
| `openviking_vlm_call_duration_seconds` | `VLMBase.update_token_usage()` |
| `openviking_cache_hits_total{level=...}` | `PrometheusObserver.record_cache_hit()` |
| `openviking_cache_misses_total{level=...}` | `PrometheusObserver.record_cache_miss()` |

These metrics prove that the pipeline already works end to end, but the coverage is still narrow and mostly limited to the retrieval / embedding / VLM event flows.
1.4 Main Problems
Problem 1: `PrometheusObserver` breaks consistency inside the `Observer` family

Other Observer implementations read instantaneous state on demand and hold no history. By contrast, `PrometheusObserver`:

- accumulates historical metric state in memory;
- exposes `record_*` write APIs;
- renders the Prometheus exposition text itself.

This means it is not really an Observer. It is a mixed "metric registry + Prometheus exporter".
Problem 2: Strong coupling between collection and export layers
Current retrieval, embedding, and VLM instrumentation all depend directly on `get_prometheus_observer()`, hard-wiring every instrumentation point to one concrete exporter implementation.

Problem 3: Incomplete metric type coverage
Only Counter and Histogram exist today. Gauge is still missing as a first-class type, and cache metrics carry only a single `level` label.

Problem 4: Insufficient multi-tenant observability
OpenViking is a multi-tenant system, but current metrics are almost entirely process-level totals, leaving per-tenant behavior invisible.
At the same time, once dimensions expand to `user_id / session_id / resource_uri`, the system will quickly run into cardinality explosion. Strict label boundaries are therefore required.

Problem 5: `/metrics` and `/api/v1/stats` are easy to confuse

The memory category, hotness, and staleness outputs provided by `StatsAggregator` are much closer to analytical statistics than to natural inputs for a high-frequency Prometheus scraping system. Moving such queries directly into `/metrics` would create several problems. The design must therefore make a hard separation between online metrics and analytical statistics.
2. Goals and Non-goals
2.1 Goals
- Return `Observer` to its intended role: instantaneous state observation.
- Build a unified `MetricRegistry`.
- Provide systematic Gauge / Counter / Histogram support for `/metrics`.
- Clarify the boundary between `/metrics` and `/api/v1/stats`.

2.2 Non-goals
3. New Architecture Overview
3.1 Core Principles
This section defines the foundational principles that must continue to hold across the metrics design. They constrain the later abstract layering, module boundaries, telemetry relationships, and the responsibilities of external observation interfaces.
Principle A: Abstract responsibilities first, concrete types later
At the overview layer, only the following four core abstractions are kept: `MetricDataSource`, `BaseMetricCollector`, `MetricRegistry`, and `BaseMetricExporter`.

Principle B: Decouple data sources from the metrics system
Existing Observer outputs, telemetry, TaskTracker state, and business event instrumentation are all “data origins”, but they are not the metrics system itself.
Therefore, the metrics system treats these origins only as data sources. This split prevents retrieval, embedding, VLM, and similar business code from directly coupling to any one exporter.
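To make the decoupling concrete, here is a minimal Python sketch of business code emitting events through an abstraction instead of calling `get_prometheus_observer()` directly. All names here (`MetricSink`, `MetricEvent`, `record_retrieval`) are illustrative, not existing OpenViking APIs:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class MetricEvent:
    """One completed business event, e.g. a retrieval request."""
    name: str
    value: float = 1.0
    labels: dict = field(default_factory=dict)


class MetricSink(ABC):
    """Write-side abstraction; business code depends only on this."""
    @abstractmethod
    def emit(self, event: MetricEvent) -> None: ...


class InMemorySink(MetricSink):
    """Test double; a real sink would forward into the Collector layer."""
    def __init__(self):
        self.events = []

    def emit(self, event: MetricEvent) -> None:
        self.events.append(event)


def record_retrieval(sink: MetricSink, status: str, seconds: float) -> None:
    # Business code emits events; it never imports a Prometheus exporter.
    sink.emit(MetricEvent("openviking_retrieval_requests_total", 1.0, {"status": status}))
    sink.emit(MetricEvent("openviking_retrieval_latency_seconds", seconds, {"status": status}))


sink = InMemorySink()
record_retrieval(sink, "ok", 0.12)
```

Swapping the exporter then never touches `record_retrieval`; only the sink implementation changes.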
Principle C: Registry is the single source of truth for metrics
`MetricRegistry` acts as the sole in-process source of truth for metrics, responsible for unified reads, writes, and constraints.

The "current view" here only means reading registry state at access time. It does not introduce an extra Snapshot architecture layer.
Principle D: Exporter is responsible only for protocol export
Exporter is responsible only for protocol output, not for metric semantics.
Exporter is not responsible for metric semantics, collection timing, or storage. It is only responsible for reading the registry and serializing the result to `/metrics` or other monitoring backends.

Prometheus is therefore only the first concrete implementation, not the center of the overall metrics architecture.
Principle E: Preserve the responsibility boundaries of the three observation entrypoints
`/metrics`, `/api/v1/observer/*`, and `/api/v1/stats/*` should continue to coexist, but with a clear division of responsibilities:

- `/metrics` is machine-oriented, emphasizing low cardinality, low cost, and sustainable aggregation;
- `/api/v1/observer/*` is human-oriented, emphasizing readability of instantaneous state;
- `/api/v1/stats/*` is analytics-oriented, allowing heavier queries and statistical logic.

These three interfaces may share some data sources, but they must not share one output model.
3.2 Abstract Layering
This section gives the abstract main pipeline of the metrics system and keeps only the minimal stable roles, without expanding into concrete implementations too early.
The abstract main pipeline is:
```mermaid
graph LR
    A["MetricDataSource"]
    B["BaseMetricCollector"]
    C["MetricRegistry"]
    D["BaseMetricExporter"]
    A --> B
    B --> C
    C --> D
```

The responsibilities of the four abstract roles are:
- `MetricDataSource`: the origin of observable data
- `BaseMetricCollector`: semantic mapping of inputs into metric writes
- `MetricRegistry`: unified in-process metric storage
- `BaseMetricExporter`: protocol export

The data flow corresponding to this abstract pipeline is:

1. `MetricDataSource` provides instantaneous state, request-end summaries, or runtime events;
2. `BaseMetricCollector` maps these inputs into Counter / Gauge / Histogram and other unified metrics;
3. `MetricRegistry` stores all current in-process metric state;
4. `BaseMetricExporter` reads the registry when needed and outputs to external systems.

3.3 Design Boundaries
After the abstract layering is defined, explicit boundaries are still required so that responsibilities do not leak back across layers:

- Business state may only be touched and read through a `MetricDataSource`.
- `Collector`, `MetricRegistry`, and `Exporter` must not directly access business services, business storage, or business context.
- The metrics system only holds metric state in `MetricRegistry`, without owning real business state.
- Refresh happens before `/metrics` is scraped. Reads at that point must be limited to lightweight snapshots, existing aggregated results, or lightweight probe results.

3.4 Directory and Module Boundaries
Once the above boundaries are in place, the module layout must remain consistent with them, instead of mixing responsibilities back together through directory structure. To avoid continued blending with `storage/observers/`, the new metrics system should live under `openviking/metrics/`, grouped by registry / collectors / exporters / naming / bootstrap concerns.

Suggested logical groups:

- `registry`: the unified metric store
- `collectors`: Event / State / DomainStats / Probe collectors
- `exporters`: protocol exporters (Prometheus first)
- `naming`: naming and label rules
- `bootstrap`: startup assembly

3.5 Relationship with Existing Telemetry
Operation telemetry and metrics are not two mutually replacing systems. Telemetry remains a request-level structured summary used for per-call explanation and troubleshooting. Metrics only extract a whitelist of low-cardinality fields for continuous scraping, aggregation, and alerting.
Operation telemetry already carries many valuable fields, such as `duration_ms`, `tokens.*`, `vector.*`, `queue.*`, `semantic_nodes.*`, `memory.extract.*`, and `errors.*`.

However, not all fields are suitable for `/metrics`, so only a whitelist is metricized: `duration_ms`, `tokens.total / llm / embedding`, `vector.searches / scored / returned`, `queue.*`, `semantic_nodes.*`, and `memory.extract.*`. A field such as `errors.message` stays outside the whitelist because of its unbounded cardinality.

3.6 Responsibility Boundaries of `/metrics`, `/api/v1/observer`, and `/api/v1/stats`

The three external observation interfaces may share some data sources, but they do not share a single output model, nor should a single interface be expected to carry all observation needs. Making this boundary explicit avoids future drift between machine scraping, human diagnosis, and business analytics.

4. Core Design Details
4.1 `MetricDataSource` Design

At the implementation layer, the abstract role `MetricDataSource` should land as a unified base class `BaseMetricDataSource`, further divided into four intermediate abstractions: `EventMetricDataSource`, `StateMetricDataSource`, `DomainStatsMetricDataSource`, and `ProbeMetricDataSource`. These four are not just logical tags; each corresponds to a different data access contract and refresh pattern, so the architecture should distinguish them explicitly.
In OpenViking, considering the current functional coverage and existing code sampling points, a two-layer structure of “unified base class + intermediate contract layer” is recommended. The inheritance relationship is:
```mermaid
graph LR
    A["BaseMetricDataSource"]
    B["EventMetricDataSource"]
    C["StateMetricDataSource"]
    D["DomainStatsMetricDataSource"]
    E["ProbeMetricDataSource"]
    A --> B
    A --> C
    A --> D
    A --> E
```

The data access contracts of these four intermediate abstractions are:
- `EventMetricDataSource`: delivers request-end or event-completion records (push)
- `StateMetricDataSource`: exposes instantaneous state that is sampled on demand (pull)
- `DomainStatsMetricDataSource`: exposes existing pre-aggregated domain statistics (pull)
- `ProbeMetricDataSource`: runs lightweight active checks against dependencies

`Observer`, Telemetry, TaskTracker, HTTP Router, and all business services keep their existing roles. The metrics system only treats them as data sources and does not transform them into exporters or collectors.
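A minimal sketch of the four contracts as Python ABCs. The method names (`drain_events`, `read_state`, `read_stats`, `probe`) are assumptions for illustration, not the proposed API:

```python
from abc import ABC, abstractmethod


class BaseMetricDataSource(ABC):
    """Unified base: a read-only origin of observable data."""


class EventMetricDataSource(BaseMetricDataSource):
    """Push model: hands over records of completed events."""
    @abstractmethod
    def drain_events(self) -> list: ...


class StateMetricDataSource(BaseMetricDataSource):
    """Pull model: returns instantaneous state when sampled."""
    @abstractmethod
    def read_state(self) -> dict: ...


class DomainStatsMetricDataSource(BaseMetricDataSource):
    """Pull model: returns pre-aggregated domain statistics."""
    @abstractmethod
    def read_stats(self) -> dict: ...


class ProbeMetricDataSource(BaseMetricDataSource):
    """Active check: returns a boolean health result for a dependency."""
    @abstractmethod
    def probe(self) -> bool: ...


class QueueDepthSource(StateMetricDataSource):
    """Example state source wrapping an in-process queue."""
    def __init__(self, queue):
        self._queue = queue

    def read_state(self) -> dict:
        return {"queue_pending": len(self._queue)}
```

A concrete source implements exactly one contract, which keeps its refresh pattern unambiguous for the matching collector.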
4.2 `BaseMetricCollector` Design

Collectors sit at the semantic mapping layer of the metrics system. They receive different kinds of observable input and convert them into stable, unified writes into `MetricRegistry`. Since `MetricDataSource` is divided into Event, State, DomainStats, and Probe categories, the Collector side should adopt the same layered organization instead of remaining in a simplified Event / State-only model.

Under this design, a Collector is no longer just a "metric writer". It becomes the unified convergence layer: on one side it shields upstream differences in access patterns and update cadence, and on the other side it exposes consistent write semantics to the registry. This keeps future extension points concentrated in the Collector layer rather than scattering source-specific logic into registry or exporter layers.
The recommended organization is “base class + four child abstractions”:
```mermaid
graph LR
    A["BaseMetricCollector"]
    B["EventMetricCollector"]
    C["StateMetricCollector"]
    D["DomainStatsMetricCollector"]
    E["ProbeMetricCollector"]
    A --> B
    A --> C
    A --> D
    A --> E
```

The responsibility split is:
- `BaseMetricCollector`: shared helpers and registry access
- `EventMetricCollector`: consumes completed events, incrementing Counters and observing Histograms
- `StateMetricCollector`: samples instantaneous state; refreshed before each `/metrics` scrape
- `DomainStatsMetricCollector`: converts pre-aggregated domain statistics into low-cardinality metrics
- `ProbeMetricCollector`: turns probe results into health Gauges

The primary mapping between Collector and DataSource is:
- `EventMetricDataSource` → `EventMetricCollector`
- `StateMetricDataSource` → `StateMetricCollector`
- `DomainStatsMetricDataSource` → `DomainStatsMetricCollector`
- `ProbeMetricDataSource` → `ProbeMetricCollector`

Design rationale: after introducing the four Collector categories, the semantic boundaries of the whole system become clearer.
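As an illustration of the State path, a toy `StateMetricCollector` that is refreshed before a scrape and writes Gauges through the registry's unified API. All names here are hypothetical; the registry is a minimal stub:

```python
class FakeRegistry:
    """Stub standing in for MetricRegistry; stores Gauge values only."""
    def __init__(self):
        self.gauges = {}

    def set_gauge(self, name, value, labels=None):
        key = (name, tuple(sorted((labels or {}).items())))
        self.gauges[key] = value


class StateMetricCollector:
    """Samples a state source and maps its fields to Gauge writes."""
    def __init__(self, registry, read_state):
        self.registry = registry
        self.read_state = read_state  # callable returning a dict of state

    def refresh(self):
        # Called by the exporter before each /metrics scrape.
        for field, value in self.read_state().items():
            self.registry.set_gauge(f"openviking_{field}", value)


registry = FakeRegistry()
collector = StateMetricCollector(registry, lambda: {"queue_pending": 4})
collector.refresh()
```

The collector owns the name mapping; the source only reports raw state, and the registry only stores the result.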
4.3 `MetricRegistry` Design

`MetricRegistry` is the stable center of the whole system. It handles unified registration, validation, storage, and reads, but it does not decide collection timing and does not perform protocol export. It maintains a unified external interface and does not split separate write APIs for Event / State / DomainStats / Probe; semantic branching belongs in the Collector layer.

Among the capabilities `MetricRegistry` must satisfy, one constraint is explicit: `/metrics` scraping must not hold heavy locks for long.

Registry only solves "how metrics are stored uniformly". It does not solve collection timing or protocol export.
The unified interface strategy is:
- Counter and Histogram writes go through `inc_counter` or `observe_histogram`;
- Gauge writes go through `set_gauge`.

In other words, Registry does not know whether the current write comes from an event, state, stats, or probe. It only knows which metric type, name, labels, and value are being written.
This allows the registry to remain the stable core of the metrics system without changing frequently with collector or exporter evolution.
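A compact sketch of what such a registry could look like. The method names follow the unified interface strategy above; locking granularity and validation are deliberately simplified and are not the proposed implementation:

```python
import threading
from collections import defaultdict


class MetricRegistry:
    """Single in-process source of truth with a unified write API."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = defaultdict(float)
        self._gauges = {}
        self._histograms = defaultdict(list)

    @staticmethod
    def _key(name, labels):
        # Canonical (name, sorted-labels) key, independent of label order.
        return (name, tuple(sorted((labels or {}).items())))

    def inc_counter(self, name, value=1.0, labels=None):
        with self._lock:
            self._counters[self._key(name, labels)] += value

    def set_gauge(self, name, value, labels=None):
        with self._lock:
            self._gauges[self._key(name, labels)] = value

    def observe_histogram(self, name, value, labels=None):
        with self._lock:
            self._histograms[self._key(name, labels)].append(value)

    def snapshot(self):
        # Cheap copy so exporters never hold the lock while serializing.
        with self._lock:
            return {
                "counters": dict(self._counters),
                "gauges": dict(self._gauges),
                "histograms": {k: list(v) for k, v in self._histograms.items()},
            }
```

The `snapshot()` copy is what keeps scrapes from holding the write lock while text serialization runs.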
4.4 `BaseMetricExporter` Design

Exporter sits at the downstream end of the metrics system and is only responsible for converting current registry state into external protocols. Like Registry, Exporter follows a unified-interface principle: it depends only on unified read APIs and does not care whether the data came from Event, State, DomainStats, or Probe flows.
The recommended inheritance structure is:
```mermaid
graph TB
    A["BaseMetricExporter"]
    B["PrometheusExporter"]
    C["OtelExporter"]
    D["InfluxDBExporter"]
    A --> B
    A --> C
    A --> D
```

Each exporter is positioned as follows:
- `PrometheusExporter`: the first-release implementation, producing the `/metrics` exposition output
- `OtelExporter`: a possible future OpenTelemetry exporter
- `InfluxDBExporter`: a possible future InfluxDB exporter

The unified read strategy is that every exporter reads the same registry state through the same read APIs.
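A toy `PrometheusExporter` illustrating the read-only contract: it serializes a registry snapshot into exposition-style text and computes no metric semantics. The snapshot layout is an assumption matching the registry sketch earlier, not the final interface:

```python
class PrometheusExporter:
    """Serializes a registry snapshot; never writes or aggregates."""

    def __init__(self, registry):
        self.registry = registry  # anything exposing snapshot()

    @staticmethod
    def _fmt(name, labels, value):
        if labels:
            body = ",".join(f'{k}="{v}"' for k, v in labels)
            return f"{name}{{{body}}} {value}"
        return f"{name} {value}"

    def export(self) -> str:
        snap = self.registry.snapshot()
        lines = []
        for (name, labels), value in sorted(snap["counters"].items()):
            lines.append(self._fmt(name, labels, value))
        for (name, labels), value in sorted(snap["gauges"].items()):
            lines.append(self._fmt(name, labels, value))
        return "\n".join(lines) + "\n"


class StubRegistry:
    """Fixed snapshot for demonstration."""
    def snapshot(self):
        return {
            "counters": {("openviking_retrieval_requests_total", (("status", "ok"),)): 3.0},
            "gauges": {("openviking_queue_pending", ()): 4},
        }


text = PrometheusExporter(StubRegistry()).export()
```

A production version would also emit `# HELP` / `# TYPE` lines and histogram buckets, but those are still pure serialization concerns.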
4.5 Prometheus Runtime Data Path
This section describes the end-to-end runtime path from a `/metrics` request entering the service to the final response, and clarifies the boundary of refresh-before-export. The recommended approach is: when Prometheus scrapes `/metrics`, required collectors are refreshed first, then the unified registry snapshot is read, and finally the exporter serializes the result into protocol text.

Recommended flow when Prometheus scrapes `/metrics`:

```mermaid
sequenceDiagram
    participant P as Prometheus
    participant R as /metrics Router
    participant E as PrometheusExporter
    participant C as StateCollectors
    participant G as MetricRegistry
    P->>R: GET /metrics
    R->>E: export()
    E->>C: refresh()
    C->>G: set Gauge values
    E->>G: read unified metric snapshot
    G-->>E: metric samples
    E-->>R: text exposition
    R-->>P: 200 text/plain
```

StateCollectors are refreshed before export so that Gauge values reflect instantaneous state at scrape time rather than stale samples.
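The scrape path above can be sketched in a few lines of Python. Class names are illustrative, and a real router would additionally set the Prometheus content-type header:

```python
class MetricsEndpoint:
    """Scrape path: refresh state collectors first, then export."""

    def __init__(self, exporter, state_collectors):
        self.exporter = exporter
        self.state_collectors = state_collectors

    def handle_scrape(self) -> str:
        for collector in self.state_collectors:
            collector.refresh()        # set fresh Gauge values in the registry
        return self.exporter.export()  # serialize the unified snapshot


# Minimal stubs recording call order for demonstration.
calls = []


class _Collector:
    def refresh(self):
        calls.append("refresh")


class _Exporter:
    def export(self):
        calls.append("export")
        return "openviking_queue_pending 0\n"


endpoint = MetricsEndpoint(_Exporter(), [_Collector(), _Collector()])
body = endpoint.handle_scrape()
```

The ordering guarantee (all refreshes strictly before the snapshot read) is the whole point of this layer.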
5. Metric Strategy
5.1 Metric Mapping Overview
Before detailing metric lists, the mapping “metric family → DataSource → Collector → metric type” should be unified into one view. This prevents later lists from showing only metric names without their collection paths. Every first-release metric should be traceable to a concrete DataSource and Collector. If a metric cannot explain its input source or collection responsibility, it should not be included in the first-release scope.
| DataSource | Collector |
|---|---|
| `HttpRequestLifecycleDataSource` | `EventMetricCollector` (can later be refined into `HTTPCollector`) |
| `RetrievalStatsDataSource`, retrieval completion events | `RetrievalCollector` |
| `ModelUsageDataSource`, model-call events | `VLMCollector`, `EmbeddingCollector`, `ModelUsageCollector` |
| `ResourceIngestionEventDataSource` | `EventMetricCollector` (can later be refined into `ResourceIngestionCollector`) |
| `SessionLifecycleDataSource`, `TaskStateDataSource`, `QueuePipelineStateDataSource` | `TaskTrackerCollector`, `QueueCollector` |
| `ObserverStateDataSource`, `QueuePipelineStateDataSource` | `ObserverHealthCollector`, `LockCollector`, `VikingDBCollector` |
| `EncryptionEventDataSource`, `EncryptionProbeDataSource` | `EncryptionCollector`, `EncryptionProbeCollector` |
| `*ProbeDataSource` | `*ProbeCollector` |
| operation telemetry summaries | `TelemetryBridgeCollector` |

5.2 Metric Object Model
The first-release metric object model should stay compact and stable. The first release uniformly supports exactly three metric types: Counter, Gauge, and Histogram. `Summary` is intentionally excluded from this scope to avoid unnecessary complexity and semantic overlap.

5.3 Label Strategy
The core of label strategy is not “express as much business information as possible”, but finding a stable balance between observability value and cardinality risk. For this reason, label design must follow a “low-cardinality first” principle and only allow a small, enumerable, controllable set of labels by default.
Allowed Common Labels
| Label | Example values |
|---|---|
| `operation` | `search.find`, `resources.add_resource` |
| `status` | `ok` / `error` |
| `queue` | `Embedding` / `Semantic` / other queue names |
| `level` | `L0` / `L1` / `L2` |
| `context_type` | `memory` / `resource` |
| `component` | |
| `task_type` | `session_commit` |
| `account_id` | |
| `provider` | |
| `model_name` | |

Tenant Label Strategy
`account_id` is only recommended for a small set of low-cardinality metrics, and it stays disabled by default, in line with the review items above. Guardrails are recommended to keep this label strictly bounded.
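One possible guardrail, sketched in Python, is to sanitize labels against the allowed set before any registry write. The sets below mirror this section's tables; whether to drop silently or raise in strict mode is a policy choice, and `sanitize_labels` is a hypothetical helper name:

```python
ALLOWED_LABELS = {
    "operation", "status", "queue", "level", "context_type",
    "component", "task_type", "account_id", "provider", "model_name",
}
# Known high-cardinality keys that must never reach the registry.
FORBIDDEN_LABELS = {"user_id", "session_id", "resource_uri", "error_message", "query"}


def sanitize_labels(labels: dict) -> dict:
    """Keep only whitelisted labels; drop forbidden and unknown keys."""
    clean = {}
    for key, value in labels.items():
        if key in FORBIDDEN_LABELS:
            continue  # silently drop; a strict mode could raise instead
        if key in ALLOWED_LABELS:
            clean[key] = str(value)
    return clean
```

Placing this check inside the registry write path makes the cardinality budget enforceable rather than advisory.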
High-cardinality Labels to Avoid
- `user_id`
- `session_id`
- `resource_uri`
- `error_message`
- `query`

5.4 Naming Convention
The goal of the naming convention is to keep the same class of metrics expressed consistently across collectors and flows, reducing naming drift and semantic ambiguity. All metric names therefore follow the `openviking_<domain>_<metric>_<unit>` template, with additional constraints for Counter / Histogram / Health metrics.

Recommended unified metric naming: `openviking_<domain>_<metric>_<unit>`
Examples:

- `openviking_retrieval_requests_total`
- `openviking_queue_pending`
- `openviking_task_running`
- `openviking_operation_duration_seconds`

Design requirements:
- Counter names end with `_total`;
- duration metrics use `_seconds` as the unit;
- health metrics are `0/1` numeric values.

5.5 Metrics Explicitly Excluded from `/metrics`

The following outputs are better kept in `/api/v1/stats` or telemetry JSON: memory health, staleness, session extraction, and similar analytical statistics.

5.6 Bucket Strategy
All latency Histograms should use a unified second-based bucket strategy to make cross-module comparison easier:
- `0.005 / 0.01 / 0.025`
- `0.05 / 0.1 / 0.25`
- `0.5 / 1.0 / 2.5`
- `5.0 / 10.0 / 30.0`

If a particular path needs a custom bucket configuration, that should be configured at the metric definition layer rather than hardcoded in business instrumentation.
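Expressed as a shared constant, with a small helper showing Prometheus-style cumulative (`le`) bucket semantics. The constant name is illustrative:

```python
# Unified second-based latency buckets from this RFC, shared by all
# latency Histograms so dashboards can compare modules directly.
LATENCY_BUCKETS_SECONDS = (
    0.005, 0.01, 0.025,
    0.05, 0.1, 0.25,
    0.5, 1.0, 2.5,
    5.0, 10.0, 30.0,
)


def bucket_counts(observations, buckets=LATENCY_BUCKETS_SECONDS):
    """Cumulative per-bucket counts: each entry counts values <= that bound.
    Values above the last bound only appear in the implicit +Inf bucket."""
    return [sum(1 for v in observations if v <= le) for le in buckets]
```

Keeping the tuple in one module is what makes "configured at the metric definition layer" enforceable.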
5.7 Telemetry Metricization Details
For operation telemetry, only fields that already exist at completion time and do not create high cardinality are extracted. This is aligned with the current capabilities in `docs/zh/guides/05-observability.md`, `docs/zh/guides/07-operation-telemetry.md`, and `openviking/telemetry/operation.py`.

| Telemetry field | Metric |
|---|---|
| `summary.operation` + `summary.status` | `openviking_operation_requests_total` |
| `summary.duration_ms` | `openviking_operation_duration_seconds` |
| `summary.tokens.total` | `openviking_operation_tokens_total{token_type="all"}` |
| `summary.tokens.llm.input` | `openviking_operation_tokens_total{token_type="llm_input"}` |
| `summary.tokens.llm.output` | `openviking_operation_tokens_total{token_type="llm_output"}` |
| `summary.tokens.embedding.total` | `openviking_operation_tokens_total{token_type="embedding"}` |
| `summary.vector.searches` | `openviking_vector_searches_total` |
| `summary.vector.scored` | `openviking_vector_scored_total` |
| `summary.vector.passed` | `openviking_vector_passed_total` |
| `summary.vector.returned` | `openviking_vector_returned_total` |
| `summary.vector.scanned` | `openviking_vector_scanned_total` |
| `summary.memory.extracted` | `openviking_memory_extracted_total` |
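A sketch of how a `TelemetryBridgeCollector` might apply this mapping. The field paths follow the table above; the registry write API and the nesting of the `summary` dict are assumptions, and only a few token fields are shown:

```python
def metricize_summary(summary: dict, registry) -> None:
    """Extract whitelisted fields from one telemetry summary into metrics."""
    op_labels = {"operation": summary["operation"], "status": summary["status"]}
    registry.inc_counter("openviking_operation_requests_total", 1.0, op_labels)
    # duration_ms is converted to seconds here, at the bridge, never downstream.
    registry.observe_histogram(
        "openviking_operation_duration_seconds",
        summary["duration_ms"] / 1000.0,
        op_labels,
    )
    tokens = summary.get("tokens", {})
    for token_type, value in (
        ("all", tokens.get("total", 0)),
        ("llm_input", tokens.get("llm", {}).get("input", 0)),
        ("llm_output", tokens.get("llm", {}).get("output", 0)),
        ("embedding", tokens.get("embedding", {}).get("total", 0)),
    ):
        if value:
            registry.inc_counter(
                "openviking_operation_tokens_total", value,
                {**op_labels, "token_type": token_type},
            )


class _Recorder:
    """Stub registry recording writes for demonstration."""
    def __init__(self):
        self.writes = []

    def inc_counter(self, name, value, labels):
        self.writes.append(("counter", name, value))

    def observe_histogram(self, name, value, labels):
        self.writes.append(("histogram", name, value))


rec = _Recorder()
metricize_summary(
    {"operation": "search.find", "status": "ok", "duration_ms": 250,
     "tokens": {"total": 10, "llm": {"input": 6, "output": 4}}},
    rec,
)
```

Because the bridge only reads fields present at completion time, it never blocks the request path.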
6. Final Recommendations

Recommended Conclusion
- Remove `PrometheusObserver` completely from the `Observer` family.
- Build a unified `MetricRegistry + Collector + Exporter` system under `openviking/metrics/`.
- Adopt the four intermediate contracts `Event / State / DomainStats / Probe`; split concrete probe subclasses by dependency type.
- Keep `/metrics` focused on low-cardinality, low-cost online monitoring metrics; continue to keep memory health, staleness, session extraction, and similar analytical outputs in `/api/v1/stats`.

Expected Benefits
- The interface boundaries of `observer`, `stats`, and `metrics` become clear.