Merged
Commits
44 commits
b844ccc
Remove redundant metrics
lquerel Dec 31, 2025
5b379ae
Add SEMANTIC_CONVENTIONS_GUIDE.md
lquerel Dec 31, 2025
2d01d29
Organize telemetry documentation
lquerel Jan 4, 2026
f70bfb3
Update entity-model.md
lquerel Jan 4, 2026
6ec0052
Update entity-model.md
lquerel Jan 4, 2026
56ef0fe
Update entity-model.md
lquerel Jan 4, 2026
e74191c
Update entity-model.md
lquerel Jan 4, 2026
44d2b98
Update README.md
lquerel Jan 4, 2026
bf8bc95
Add metrics-guide.md
lquerel Jan 4, 2026
d7e4720
Update README.md
lquerel Jan 4, 2026
a58942a
Update metrics-guide.md
lquerel Jan 5, 2026
43f5788
Update events-guide.md
lquerel Jan 6, 2026
16b8bb4
Update telemetry guides
lquerel Jan 6, 2026
e375cf1
Update telemetry README.md
lquerel Jan 6, 2026
7e7b7b2
Several updates in the document based on feedback
lquerel Jan 6, 2026
a7a14f7
Fix markdown issues
lquerel Jan 6, 2026
529047c
Add stability, compatibility, safety recommendations
lquerel Jan 6, 2026
2a72f07
Add 3 new guides on attributes, stability, and security/privacy
lquerel Jan 6, 2026
348fac4
Identify implementation gaps
lquerel Jan 6, 2026
96105e8
Remove content duplication
lquerel Jan 6, 2026
06c9b29
Clarify tracing status
lquerel Jan 6, 2026
51303e4
Fix markdown issues
lquerel Jan 6, 2026
986c300
Additional edits in the doc for improving consistency and clarity
lquerel Jan 6, 2026
c8d10bf
Unify title styles
lquerel Jan 7, 2026
161b209
Fix few missing/unclear points
lquerel Jan 7, 2026
ba507ee
Merge branch 'main' into metrics-cleanup
lquerel Jan 7, 2026
d69fb41
Few minor changes in the README.md
lquerel Jan 7, 2026
918233a
Few minor changes in the main README.md
lquerel Jan 7, 2026
4fe4bb3
Update admin endpoints security requirements
lquerel Jan 7, 2026
a493c22
Update rust/otap-dataflow/docs/telemetry/security-privacy-guide.md
lquerel Jan 8, 2026
91022dd
Merge branch 'main' into metrics-cleanup
lquerel Jan 8, 2026
57e4f67
Take into account all feedback
lquerel Jan 8, 2026
2a825de
Fix markdown issues
lquerel Jan 8, 2026
3642d55
Fix markdown issues
lquerel Jan 8, 2026
c709fd1
Fix markdown issues
lquerel Jan 8, 2026
4d676c5
Fix markdown issues
lquerel Jan 8, 2026
9831aab
Fix markdown issues
lquerel Jan 8, 2026
2bb7677
Fix markdown issues
lquerel Jan 8, 2026
bd82699
Merge branch 'main' into metrics-cleanup
lquerel Jan 8, 2026
c284533
Fix clippy issues
lquerel Jan 8, 2026
4f67d3b
Merge remote-tracking branch 'origin/metrics-cleanup' into metrics-cl…
lquerel Jan 8, 2026
3e1884f
Fix unit test
lquerel Jan 8, 2026
a5ec1ee
Fix unit test
lquerel Jan 8, 2026
87ffa22
Fix unit test
lquerel Jan 8, 2026
14 changes: 0 additions & 14 deletions rust/otap-dataflow/crates/otap/src/attributes_processor.rs
@@ -299,21 +299,12 @@ impl local::Processor<OtapPdata> for AttributesProcessor {
_ => Ok(()),
},
Message::PData(pdata) => {
if let Some(m) = self.metrics.as_mut() {
m.msgs_consumed.inc();
}

// Fast path: no actions to apply
if self.is_noop() {
let res = effect_handler
.send_message(pdata)
.await
.map_err(|e| e.into());
if res.is_ok() {
if let Some(m) = self.metrics.as_mut() {
m.msgs_forwarded.inc();
}
}
return res;
}

@@ -358,11 +349,6 @@ impl local::Processor<OtapPdata> for AttributesProcessor {
.send_message(OtapPdata::new(context, records.into()))
.await
.map_err(|e| e.into());
if res.is_ok() {
if let Some(m) = self.metrics.as_mut() {
m.msgs_forwarded.inc();
}
}
res
}
}
@@ -10,14 +10,6 @@ use otap_df_telemetry_macros::metric_set;
#[metric_set(name = "attributes.processor.metrics")]
#[derive(Debug, Default, Clone)]
pub struct AttributesProcessorMetrics {
/// PData messages consumed by this processor.
#[metric(unit = "{msg}")]
pub msgs_consumed: Counter<u64>,

/// PData messages forwarded by this processor.
#[metric(unit = "{msg}")]
pub msgs_forwarded: Counter<u64>,

Comment on lines -13 to -20
Contributor Author (lquerel): Removed because redundant with the channel
metrics.

Member: suggestion: It's easier, from a review standpoint, to keep PRs more
focused. Since this PR is adding the telemetry guidelines doc, let's stick
with that. Cleaning up metrics can be its own PR.

/// Number of failed transform attempts.
#[metric(unit = "{op}")]
pub transform_failed: Counter<u64>,
4 changes: 1 addition & 3 deletions rust/otap-dataflow/crates/otap/src/transform_processor.rs
@@ -173,7 +173,6 @@ impl Processor<OtapPdata> for TransformProcessor {
}
},
Message::PData(pdata) => {
self.metrics.msgs_consumed.inc();
let (context, payload) = pdata.into_parts();
let payload = if !self.should_process(&payload) {
// skip handling this pdata
@@ -200,8 +199,7 @@

effect_handler
.send_message(OtapPdata::new(context, payload))
.await
.inspect(|_| self.metrics.msgs_forwarded.inc())?;
.await?;
}
};

@@ -7,17 +7,9 @@ use otap_df_telemetry::instrument::Counter;
use otap_df_telemetry_macros::metric_set;

/// Metrics for the TransformProcessor node.
#[metric_set(name = "transform.processor.metrics")]
#[metric_set(name = "transform.processor")]
#[derive(Debug, Default, Clone)]
pub struct Metrics {
/// PData messages consumed by this processor.
#[metric(unit = "{msg}")]
pub msgs_consumed: Counter<u64>,

/// PData messages forwarded by this processor.
#[metric(unit = "{msg}")]
pub msgs_forwarded: Counter<u64>,

Comment on lines -13 to -20
Contributor Author (lquerel): Removed because redundant with the channel
metrics.

/// Number of messages successfully transformed.
#[metric(unit = "{msg}")]
pub msgs_transformed: Counter<u64>,
5 changes: 5 additions & 0 deletions rust/otap-dataflow/crates/telemetry/README.md
@@ -1,5 +1,7 @@
# Telemetry SDK (schema-first, multivariate, NUMA-aware)

Status: draft, under active development.

A low-overhead, NUMA-aware telemetry SDK that turns a declarative schema into a
type-safe Rust API for emitting richly structured, multivariate metrics. It is
designed for engines that run a thread-per-core and require predictable latency
@@ -62,3 +64,6 @@
- NUMA-aware aggregation.

![Architecture Phase 2](assets/Metrics%20Phase%202.svg)

Note: The recent telemetry guidelines defined in `/docs/telemetry` are
still being implemented in this SDK. Expect changes and improvements over time.

Check failure on line 69 in rust/otap-dataflow/crates/telemetry/README.md
(GitHub Actions / markdownlint):
rust/otap-dataflow/crates/telemetry/README.md:69:79 MD047/single-trailing-newline Files should end with a single newline character https://github.com/DavidAnson/markdownlint/blob/v0.38.0/doc/md047.md
272 changes: 272 additions & 0 deletions rust/otap-dataflow/docs/telemetry/README.md
@@ -0,0 +1,272 @@
# Internal telemetry documentation and policy

Status: **Draft** – under active development.

## Scope

This documentation applies to all telemetry produced by the OTAP dataflow engine
(runtime and its core libraries):

- metrics
- events (structured logs with an event name)
- traces (when implemented)
- resource metadata (service, host, container, process)

## Normative language

The key words MUST, SHOULD, and MAY are to be interpreted as normative
requirements in all the documentation within this directory.

## Overview

Internal telemetry is a first-class concern of this project. As with any complex
system, reliable operation, performance analysis, and effective debugging
require intentional and well-designed instrumentation. This document defines the
principles, guidelines, and implementation details governing the project's
internal telemetry.

We follow an **observability by design** approach: observability requirements
are defined early and evolve alongside the system itself. All entities or
components are expected to be instrumented consistently, using well-defined
schemas and conventions, so that emitted telemetry is coherent, actionable, and
suitable for long-term analysis and optimization.

This approach is structured around the following lifecycle:

1) **Set clear goals**: Define observability objectives up front. Identify which
signals are required and why.
2) **Automate**: Use tooling to derive code, documentation, tests, and schemas
from shared conventions.
3) **Validate**: Detect observability and schema issues early through CI and
automated checks, not in production.
4) **Iterate**: Refine telemetry based on real-world usage, feedback, and
evolving system requirements.

Telemetry is treated as a **stable interface** of the system. As with any public
API, backward compatibility, semantic clarity, and versioning discipline are
essential. Changes to telemetry should be intentional, reviewed, and aligned
with the overall observability model.

See the [Stability and Compatibility Guide](stability-compatibility-guide.md)
for the stability model, compatibility rules, and deprecation process.

## Goals

Internal telemetry MUST enable:

- reliable operation and incident response
- performance analysis and regression detection
- capacity planning and saturation detection
- change impact analysis (deploys, config reloads, topology changes)
- long-term trend analysis with stable schema and naming

Telemetry MUST NOT compromise:

- system safety and correctness
- performance budgets on hot paths
- confidentiality (PII, secrets, sensitive payloads)

## Core principles

The principles below define how internal telemetry is designed, implemented,
validated, and evolved in this project. They are intentionally opinionated and
serve as a shared contract between contributors, tooling, and runtime behavior.

### 1. Schema-first

All telemetry is defined **schema-first**. Entities, signals, attributes, and
their relationships MUST be described explicitly in a schema before or alongside
their implementation.

Schemas are treated as versioned artifacts and as the primary source of truth
for:

* instrumentation requirements,
* validation rules,
* documentation generation,
* and client SDK generation.

Ad hoc or implicit telemetry definitions are discouraged, as they undermine
consistency, tooling, and long-term maintainability.

### 2. Entity-centric

Telemetry is modeled around **entities**, which represent stable, identifiable
subjects of observation. Signals describe the state, behavior, or performance of
one or more entities at a given point in time.

This project favors:

* clear separation between **entity attributes** (stable context) and
**signal-specific attributes** (dynamic context),
* bounded and well-justified attribute cardinality,
* stable identifiers to support correlation across signals, restarts, and
system boundaries.

Entity modeling is a prerequisite for producing telemetry that is interpretable,
composable, and operationally useful at scale.
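
A minimal sketch of this separation in plain Rust (all type and field names
below are hypothetical, not the project's actual API):

```rust
/// Entity attributes: the stable identity of an observed subject.
/// These change rarely (typically only on restart or reconfiguration)
/// and are what signals correlate on.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct ProcessorEntity {
    pub pipeline_id: String,
    pub node_id: String,
}

/// Signal-specific attributes: dynamic context for one measurement.
/// Kept as a closed enum so cardinality stays bounded by construction.
#[derive(Clone, Copy, Debug)]
pub enum DropReason {
    BufferFull,
    Shutdown,
}

/// A signal references its entity (stable context) plus bounded
/// signal-specific context.
pub struct DropEvent {
    pub entity: ProcessorEntity,
    pub reason: DropReason,
}
```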

### 3. Type-safe and performance-focused instrumentation

The telemetry SDK is **type-safe by construction** and **performance-aware**.

Instrumentation APIs should:

* prevent invalid or non-compliant telemetry at compile time whenever possible,
* minimize overhead on hot paths,
* avoid unnecessary allocations and dynamic behavior,
* make the cost of instrumentation explicit and predictable.

Correctness, efficiency, and safety take precedence over convenience.
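
The diffs in this PR show the current shape of such an API: a `#[metric_set]`
struct whose fields are typed instruments. A minimal sketch along those lines
(the set name, field, and unit below are illustrative, not the project's
actual definitions):

```rust
use otap_df_telemetry::instrument::Counter;
use otap_df_telemetry_macros::metric_set;

/// Illustrative metric set; the name and field are examples only.
#[metric_set(name = "example.processor")]
#[derive(Debug, Default, Clone)]
pub struct ExampleProcessorMetrics {
    /// Number of failed transform attempts.
    #[metric(unit = "{op}")]
    pub transform_failed: Counter<u64>,
}
```

On the hot path, recording is then a plain method call such as
`self.metrics.transform_failed.inc()` (as seen in the transform processor diff
above): no string lookup, no allocation at the call site, and the name/unit
metadata lives in the type rather than being repeated at each use.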

### 4. Alignment with OpenTelemetry semantic conventions

This project adopts **OpenTelemetry semantic conventions** as the baseline
vocabulary and modeling framework.

Where existing conventions are sufficient, they are reused directly. Where
project-specific concepts are required, they are defined in a **custom semantic
convention registry**, aligned with OpenTelemetry principles and formats.

This registry formally describes:

* the entities relevant to the project,
* the signals emitted by the system,
* the allowed attributes, types, units, and stability guarantees.

### 5. First-class support for multivariate metrics

The internal telemetry model and SDK natively support **multivariate metric
sets**.

This enables:

* efficient sharing of attribute tuples,
* coherent modeling of related measurements,
* reduced duplication and cardinality explosion compared to naive univariate
  metrics.

Member: nit: not sure how it avoids cardinality explosions?

Contributor Author (lquerel): I did get a bit carried away on the cardinality
aspect indeed. That said, the redundancy of attributes is still greatly
reduced, in inverse proportion to the number of metrics in the set.

Regarding cardinality, I had in mind what the Collector does, for example with
memory-related metrics using the state attribute (same for mode). I had the
impression that state is a pattern often used to represent a category of
multivariate metrics.

In any case, I'm going to remove the word "cardinality" from that sentence.

Multivariate metrics are treated as a fundamental modeling capability rather
than a post-processing optimization.

Note: OTLP and OTAP protocols do not yet have first-class support for
multivariate metrics. The SDK and exporters handle the necessary translation and
encoding. We plan to contribute multivariate support to OpenTelemetry protocols
in the future. In the meantime, this project serves as a proving ground for the
concept.
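
Reusing the `#[metric_set]` macro shown in the diffs above, a sketch of what a
multivariate set looks like in practice; the grouped counters are collected
and reported under one shared attribute tuple (all names and units below are
illustrative):

```rust
use otap_df_telemetry::instrument::Counter;
use otap_df_telemetry_macros::metric_set;

/// Illustrative multivariate set: the three measurements below are
/// collected together and share a single attribute tuple, instead of
/// being three univariate metrics that each repeat the same attributes.
#[metric_set(name = "example.exporter")]
#[derive(Debug, Default, Clone)]
pub struct ExampleExporterMetrics {
    /// Batches attempted by the exporter.
    #[metric(unit = "{batch}")]
    pub batches_sent: Counter<u64>,

    /// Batches that failed to export.
    #[metric(unit = "{batch}")]
    pub batches_failed: Counter<u64>,

    /// Bytes written to the wire.
    #[metric(unit = "By")]
    pub bytes_sent: Counter<u64>,
}
```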

### 6. Tooling-driven validation and documentation with Weaver

Telemetry correctness and completeness are enforced through **tooling, not
convention alone**.

This project plans to integrate with **Weaver** to:

* validate emitted telemetry against the versioned semantic convention registry,
* perform registry compliance checks in CI,
* execute live checks during tests to ensure that expected signals are actually
produced,
* generate authoritative documentation in Markdown or HTML from the registry.

An administrative endpoint exposes the live, resolved schema at runtime to
support inspection, debugging, and tooling integration.

Contributor: This should be configurable, not always on.

Contributor Author (lquerel): This is partially covered in
security-privacy-guide.md (link below the sentence you commented), but I've
just added a more explicit point on this aspect (see 4fe4bb3).

Security and deployment guidance for this endpoint is in the
[Security and Privacy Guide](security-privacy-guide.md).

Registry compliance checks and live checks are not yet enforced in CI. See
[Implementation Gaps](implementation-gaps.md).

### 7. Automated client SDK generation (longer term)

In the longer term, the custom semantic convention registry will be used to
generate **type-safe Rust client SDKs** via Weaver.
Member: nit: we are not generating whole client SDKs, right? Just the API used
to instrument, which gets plugged into an existing SDK.

Suggested change
generate **type-safe Rust client SDKs** via Weaver.
generate **type-safe Rust instrumentation APIs** via Weaver.

Contributor Author (lquerel): Today, in the short term, yes, we are focusing
on the instrumentation side and reusing an existing SDK. In a second phase,
what Joshua is preparing is to integrate our OTAP pipeline to report our
internal telemetry. In absolute terms, this will allow us to report our
telemetry via OTLP, OTAP, and so on, without going through an existing SDK. In
the long term, I was thinking of reusing all this infrastructure to generate a
new kind of schema-driven, type-safe SDK.


The objective is to:

* eliminate manual duplication between schema and code,
* ensure strict alignment between instrumentation and specification,
* provide contributors with safe, ergonomic APIs that encode observability
rules directly in types.

This is considered a strategic investment and will be introduced incrementally.

### 8. Telemetry as a stable interface

Telemetry is treated as a **stable interface of the system**.
Refer to [Stability and Compatibility Guide](stability-compatibility-guide.md).

For items that are documented but not yet implemented or enforced, see
[Implementation Gaps](implementation-gaps.md).

## Runtime safety and failure behavior

Telemetry MUST be non-fatal and bounded:

- Export failures MUST NOT break the dataflow engine.
Member: Export failure not breaking the engine -- this is not really related
to telemetry, but a broader principle.

Contributor Author (lquerel): By this sentence, I meant, for example, that our
instrumentation API must not be blocked, in one way or another, by a defective
or congested SDK export function. If you have a rewording in mind that you
think would be more accurate, feel free to suggest it.

- Telemetry pipelines MUST use bounded buffers.
- Under pressure, the default behavior SHOULD be to drop telemetry rather than
block critical work.
- Drops SHOULD be observable via counters (by drop reason) and optionally debug
  events.

Member: 💯 love this part!

The telemetry system MUST NOT introduce deadlocks, unbounded memory growth, or
process termination.
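
A minimal sketch of this drop-not-block policy using a bounded
standard-library channel; the real SDK internals differ, and the names below
are hypothetical:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

/// Drop counter, keyed here by a single reason for brevity.
static DROPPED_BUFFER_FULL: AtomicU64 = AtomicU64::new(0);

/// Enqueue a telemetry item without ever blocking the caller.
/// On a full buffer the item is dropped and the drop is counted,
/// so backpressure never stalls the dataflow engine.
fn emit<T>(tx: &SyncSender<T>, item: T) {
    match tx.try_send(item) {
        Ok(()) => {}
        Err(TrySendError::Full(_)) => {
            DROPPED_BUFFER_FULL.fetch_add(1, Ordering::Relaxed);
        }
        Err(TrySendError::Disconnected(_)) => {
            // Collector gone: telemetry stays non-fatal, so ignore.
        }
    }
}

fn main() {
    let (tx, _rx) = sync_channel::<&str>(1024); // bounded buffer
    emit(&tx, "example event");
}
```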

## Instrumentation guides

**Instrumentation** is the act of adding telemetry signals (metrics, events,
traces) to the codebase to observe the system behavior and performance.

The [entity model](entity-model.md) defines the observed things, the "nouns" of
our system, and how signals describe them. Entities are described by attributes
that provide context to metrics, events, and traces, and a single signal can
involve multiple entities at once. **Attribute cardinality must be bounded** to
keep telemetry efficient and aggregations meaningful. Identifier stability
matters for correlation across signals and restarts; refer to the stability
guarantees in the entity model when adding new attributes.

The naming conventions, units and general guidelines are in the
[semantic conventions guide](semantic-conventions-guide.md). Contributors SHOULD
read it before introducing new telemetry.

The guides below provide a framework for defining **good, consistent, secure,
and actionable signals**. They are not an exhaustive list of every signal and
attribute in the project, but a shared reference for how to introduce and evolve
telemetry:

- [Attributes Guide](attributes-guide.md)
- [System Metrics Guide](metrics-guide.md)
- [System Events Guide](events-guide.md)
- [System Traces Draft - Not For Review](tracing-draft-not-for-review.md)
- [Stability and Compatibility Guide](stability-compatibility-guide.md)
- [Security and Privacy Guide](security-privacy-guide.md)

## Implementation details

For implementation details of the telemetry SDK, including macros, schema
handling, and the dataflow for metric collection, see the
[telemetry implementation description](/crates/telemetry/README.md).

Note: This SDK is internal to the project and optimized for our use cases. It is
not intended for public use (at least not yet). It may change without notice.

The documentation in this directory focuses on the intended design and policy
aspects of internal telemetry. The current implementation does not yet fully
realize all goals and principles described here, but it is evolving rapidly.
The [implementation gaps](implementation-gaps.md) document tracks the progress.

## Contributor workflow (minimum)

When adding or changing telemetry:

1) Update the semantic convention registry first (schema-first).
2) Regenerate documentation and code (when applicable).
3) Run CI validation when available (registry checks, live checks in tests).
4) If the change is breaking, bump the registry version and add a migration
note.

Implementation status of this workflow (what is enforced, generated, and
validated) is tracked in [implementation-gaps.md](implementation-gaps.md).
Coordinate with maintainers when making changes that are not yet
tooling-supported.