Internal Telemetry Guidelines #1727
Changes from 26 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -7,17 +7,9 @@ use otap_df_telemetry::instrument::Counter; | |
| use otap_df_telemetry_macros::metric_set; | ||
|
|
||
| /// Metrics for the TransformProcessor node. | ||
| - #[metric_set(name = "transform.processor.metrics")] | ||
| + #[metric_set(name = "transform.processor")] | ||
| #[derive(Debug, Default, Clone)] | ||
| pub struct Metrics { | ||
| - /// PData messages consumed by this processor. | ||
| - #[metric(unit = "{msg}")] | ||
| - pub msgs_consumed: Counter<u64>, | ||
|
|
||
| - /// PData messages forwarded by this processor. | ||
| - #[metric(unit = "{msg}")] | ||
| - pub msgs_forwarded: Counter<u64>, | ||
|
|
||
|
Comment on lines -13 to -20 (Contributor, Author): Removed because redundant with the channel metrics. |
||
| /// Number of messages successfully transformed. | ||
| #[metric(unit = "{msg}")] | ||
| pub msgs_transformed: Counter<u64>, | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,5 +1,7 @@ | ||
| # Telemetry SDK (schema-first, multivariate, NUMA-aware) | ||
|
|
||
| Status: draft, under active development. | ||
|
|
||
| A low-overhead, NUMA-aware telemetry SDK that turns a declarative schema into a | ||
| type-safe Rust API for emitting richly structured, multivariate metrics. It is | ||
| designed for engines that run a thread-per-core and require predictable latency | ||
|
|
@@ -62,3 +64,6 @@ | |
| - NUMA-aware aggregation. | ||
|
|
||
|  | ||
|
|
||
| Note: The recent telemetry guidelines defined in `/docs/telemetry` are | ||
| still being implemented in this SDK. Expect changes and improvements over time. | ||
|
Check failure on line 69 in rust/otap-dataflow/crates/telemetry/README.md
|
||
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,272 @@ | ||||||
| # Internal telemetry documentation and policy | ||||||
|
|
||||||
| Status: **Draft** – under active development. | ||||||
|
|
||||||
| ## Scope | ||||||
|
|
||||||
| This documentation applies to all telemetry produced by the OTAP dataflow engine | ||||||
| (runtime and its core libraries): | ||||||
|
|
||||||
| - metrics | ||||||
| - events (structured logs with an event name) | ||||||
| - traces (when implemented) | ||||||
| - resource metadata (service, host, container, process) | ||||||
|
|
||||||
| ## Normative language | ||||||
|
|
||||||
| The key words MUST, SHOULD, and MAY are to be interpreted as normative | ||||||
| requirements in all the documentation within this directory. | ||||||
|
|
||||||
| ## Overview | ||||||
|
|
||||||
| Internal telemetry is a first-class concern of this project. As with any complex | ||||||
| system, reliable operation, performance analysis, and effective debugging | ||||||
| require intentional and well-designed instrumentation. This document defines the | ||||||
| principles, guidelines, and implementation details governing the project's | ||||||
| internal telemetry. | ||||||
|
|
||||||
| We follow an **observability by design** approach: observability requirements | ||||||
| are defined early and evolve alongside the system itself. All entities or | ||||||
| components are expected to be instrumented consistently, using well-defined | ||||||
| schemas and conventions, so that emitted telemetry is coherent, actionable, and | ||||||
| suitable for long-term analysis and optimization. | ||||||
|
|
||||||
| This approach is structured around the following lifecycle: | ||||||
|
|
||||||
| 1) **Set clear goals**: Define observability objectives up front. Identify which | ||||||
| signals are required and why. | ||||||
| 2) **Automate**: Use tooling to derive code, documentation, tests, and schemas | ||||||
| from shared conventions. | ||||||
| 3) **Validate**: Detect observability and schema issues early through CI and | ||||||
| automated checks, not in production. | ||||||
| 4) **Iterate**: Refine telemetry based on real-world usage, feedback, and | ||||||
| evolving system requirements. | ||||||
|
|
||||||
| Telemetry is treated as a **stable interface** of the system. As with any public | ||||||
| API, backward compatibility, semantic clarity, and versioning discipline are | ||||||
| essential. Changes to telemetry should be intentional, reviewed, and aligned | ||||||
| with the overall observability model. | ||||||
|
|
||||||
| See the [Stability and Compatibility Guide](stability-compatibility-guide.md) | ||||||
| for the stability model, compatibility rules, and deprecation process. | ||||||
|
|
||||||
| ## Goals | ||||||
|
|
||||||
| Internal telemetry MUST enable: | ||||||
|
|
||||||
| - reliable operation and incident response | ||||||
| - performance analysis and regression detection | ||||||
| - capacity planning and saturation detection | ||||||
| - change impact analysis (deploys, config reloads, topology changes) | ||||||
| - long-term trend analysis with stable schema and naming | ||||||
|
|
||||||
| Telemetry MUST NOT compromise: | ||||||
|
|
||||||
| - system safety and correctness | ||||||
| - performance budgets on hot paths | ||||||
| - confidentiality (PII, secrets, sensitive payloads) | ||||||
|
|
||||||
| ## Core principles | ||||||
|
|
||||||
| The principles below define how internal telemetry is designed, implemented, | ||||||
| validated, and evolved in this project. They are intentionally opinionated and | ||||||
| serve as a shared contract between contributors, tooling, and runtime behavior. | ||||||
|
|
||||||
| ### 1. Schema-first | ||||||
|
|
||||||
| All telemetry is defined **schema-first**. Entities, signals, attributes, and | ||||||
| their relationships MUST be described explicitly in a schema before or alongside | ||||||
| their implementation. | ||||||
|
|
||||||
| Schemas are treated as versioned artifacts and as the primary source of truth | ||||||
| for: | ||||||
|
|
||||||
| * instrumentation requirements, | ||||||
| * validation rules, | ||||||
| * documentation generation, | ||||||
| * and client SDK generation. | ||||||
|
|
||||||
| Ad hoc or implicit telemetry definitions are discouraged, as they undermine | ||||||
| consistency, tooling, and long-term maintainability. | ||||||
|
|
||||||
| ### 2. Entity-centric | ||||||
|
|
||||||
| Telemetry is modeled around **entities**, which represent stable, identifiable | ||||||
| subjects of observation. Signals describe the state, behavior, or performance of | ||||||
| one or more entities at a given point in time. | ||||||
|
|
||||||
| This project favors: | ||||||
|
|
||||||
| * clear separation between **entity attributes** (stable context) and | ||||||
| **signal-specific attributes** (dynamic context), | ||||||
| * bounded and well-justified attribute cardinality, | ||||||
| * stable identifiers to support correlation across signals, restarts, and | ||||||
| system boundaries. | ||||||
|
|
||||||
| Entity modeling is a prerequisite for producing telemetry that is interpretable, | ||||||
| composable, and operationally useful at scale. | ||||||
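
The separation between entity attributes and signal-specific attributes can be sketched as follows. This is only an illustration, not the SDK's API: `NodeEntity`, `Outcome`, and `NodeCounters` are hypothetical names, and the fields are assumptions. The point is that the stable context is stored once per entity, while the dynamic context is an enum, so the number of series per entity is bounded by construction.

```rust
/// Stable context identifying the subject of observation.
/// Hypothetical type: the field names are assumptions for illustration.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NodeEntity {
    pub pipeline_id: String,
    pub node_id: String,
}

/// Signal-specific (dynamic) context. Using an enum instead of a free-form
/// string bounds the attribute's cardinality to the number of variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Success,
    Failure,
    Refused,
}

impl Outcome {
    /// Number of variants, i.e. the cardinality bound of this attribute.
    pub const COUNT: usize = 3;

    /// Position of the variant in a fixed-size counter array.
    fn index(self) -> usize {
        self as usize
    }
}

/// Counters for a single entity, broken down by the bounded attribute.
#[derive(Debug)]
pub struct NodeCounters {
    entity: NodeEntity,
    by_outcome: [u64; Outcome::COUNT],
}

impl NodeCounters {
    /// Upper bound on the number of series this entity can produce.
    pub const MAX_SERIES: usize = Outcome::COUNT;

    pub fn new(entity: NodeEntity) -> Self {
        Self { entity, by_outcome: [0; Outcome::COUNT] }
    }

    /// Adds `n` to the counter for `outcome`; no allocation is involved.
    pub fn record(&mut self, outcome: Outcome, n: u64) {
        let slot = &mut self.by_outcome[outcome.index()];
        *slot = slot.wrapping_add(n);
    }

    pub fn get(&self, outcome: Outcome) -> u64 {
        self.by_outcome[outcome.index()]
    }

    /// The entity attributes, stored once rather than per measurement.
    pub fn entity(&self) -> &NodeEntity {
        &self.entity
    }
}
```

In this sketch, adding a new outcome requires changing the enum, which makes any cardinality increase visible in review.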
|
|
||||||
| ### 3. Type-safe and performance-focused instrumentation | ||||||
|
|
||||||
| The telemetry SDK is **type-safe by construction** and **performance-aware**. | ||||||
|
|
||||||
| Instrumentation APIs should: | ||||||
|
|
||||||
| * prevent invalid or non-compliant telemetry at compile time whenever possible, | ||||||
| * minimize overhead on hot paths, | ||||||
| * avoid unnecessary allocations and dynamic behavior, | ||||||
| * make the cost of instrumentation explicit and predictable. | ||||||
|
|
||||||
| Correctness, efficiency, and safety take precedence over convenience. | ||||||
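
As a rough illustration of "preventing invalid telemetry at compile time", the sketch below tags a counter with its unit at the type level. It is not the SDK's `Counter` type: `UnitCounter`, `Messages`, and `Bytes` are hypothetical names introduced here. The marker types are zero-sized, so the counter stays the size of a plain `u64` and the hot-path increment is a single addition.

```rust
use std::marker::PhantomData;

/// Unit markers (hypothetical). They are zero-sized and exist only in types.
#[derive(Debug, Clone, Copy)]
pub struct Messages;

#[derive(Debug, Clone, Copy)]
pub struct Bytes;

/// A counter whose unit `U` is part of its type.
#[derive(Debug, Clone, Copy)]
pub struct UnitCounter<U> {
    value: u64,
    _unit: PhantomData<U>,
}

impl<U> UnitCounter<U> {
    pub const fn new() -> Self {
        Self { value: 0, _unit: PhantomData }
    }

    /// Hot-path increment: one addition, no allocation.
    /// Wraps on overflow instead of panicking.
    #[inline]
    pub fn add(&mut self, n: u64) {
        self.value = self.value.wrapping_add(n);
    }

    pub fn get(&self) -> u64 {
        self.value
    }

    /// Merging is only defined between counters that share the unit `U`.
    pub fn merge(&mut self, other: &UnitCounter<U>) {
        self.value = self.value.wrapping_add(other.value);
    }
}
```

With these definitions, merging a `UnitCounter<Bytes>` into a `UnitCounter<Messages>` is rejected by the compiler as a type mismatch, so a unit mix-up cannot reach production.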
|
|
||||||
| ### 4. Alignment with OpenTelemetry semantic conventions | ||||||
|
|
||||||
| This project adopts **OpenTelemetry semantic conventions** as the baseline | ||||||
| vocabulary and modeling framework. | ||||||
|
|
||||||
| Where existing conventions are sufficient, they are reused directly. Where | ||||||
| project-specific concepts are required, they are defined in a **custom semantic | ||||||
| convention registry**, aligned with OpenTelemetry principles and formats. | ||||||
|
|
||||||
| This registry formally describes: | ||||||
|
|
||||||
| * the entities relevant to the project, | ||||||
| * the signals emitted by the system, | ||||||
| * the allowed attributes, types, units, and stability guarantees. | ||||||
|
|
||||||
| ### 5. First-class support for multivariate metrics | ||||||
|
|
||||||
| The internal telemetry model and SDK natively support **multivariate metric | ||||||
| sets**. | ||||||
|
|
||||||
| This enables: | ||||||
|
|
||||||
| * efficient sharing of attribute tuples, | ||||||
| * coherent modeling of related measurements, | ||||||
| * less duplication and a lower risk of cardinality explosion than with naive | ||||||
| univariate metrics. | ||||||
|
|
||||||
| Multivariate metrics are treated as a fundamental modeling capability rather | ||||||
| than a post-processing optimization. | ||||||
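
To make the difference concrete, here is a minimal sketch contrasting a multivariate record with its univariate expansion. All names are hypothetical (`Attributes`, `TransformMetrics`, `MultivariateRecord`, `UnivariatePoint`, and the `transform_errors` field are assumptions, not part of the SDK). The multivariate record carries the attribute tuple once; the univariate form repeats it for every measurement.

```rust
/// Attribute tuple shared by all measurements in the set (hypothetical fields).
#[derive(Debug, Clone, PartialEq)]
pub struct Attributes {
    pub pipeline_id: String,
    pub node_id: String,
}

/// Related measurements that are always reported together.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct TransformMetrics {
    pub msgs_transformed: u64,
    pub transform_errors: u64, // assumed field, for illustration only
}

/// Multivariate form: one attribute tuple, several values.
#[derive(Debug, Clone, PartialEq)]
pub struct MultivariateRecord {
    pub attributes: Attributes,
    pub values: Vec<(&'static str, u64)>,
}

/// Univariate form: one point per metric, each with its own attribute copy.
#[derive(Debug, Clone, PartialEq)]
pub struct UnivariatePoint {
    pub name: &'static str,
    pub attributes: Attributes,
    pub value: u64,
}

/// Builds a single record that shares `attrs` across all values.
pub fn to_multivariate(attrs: &Attributes, m: &TransformMetrics) -> MultivariateRecord {
    MultivariateRecord {
        attributes: attrs.clone(),
        values: vec![
            ("msgs_transformed", m.msgs_transformed),
            ("transform_errors", m.transform_errors),
        ],
    }
}

/// Expands a record into univariate points, duplicating the attributes
/// once per value (the kind of translation an exporter might perform).
pub fn to_univariate(record: &MultivariateRecord) -> Vec<UnivariatePoint> {
    record
        .values
        .iter()
        .map(|&(name, value)| UnivariatePoint {
            name,
            attributes: record.attributes.clone(),
            value,
        })
        .collect()
}
```

For a set with N measurements, the univariate expansion stores N copies of the attributes where the multivariate record stores one.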
|
|
||||||
| Note: OTLP and OTAP protocols do not yet have first-class support for | ||||||
| multivariate metrics. The SDK and exporters handle the necessary translation and | ||||||
| encoding. We plan to contribute multivariate support to OpenTelemetry protocols | ||||||
| in the future. In the meantime, this project serves as a proving ground for the | ||||||
| concept. | ||||||
|
|
||||||
| ### 6. Tooling-driven validation and documentation with Weaver | ||||||
|
|
||||||
| Telemetry correctness and completeness are enforced through **tooling, not | ||||||
| convention alone**. | ||||||
|
|
||||||
| This project plans to integrate with **Weaver** to: | ||||||
|
|
||||||
| * validate emitted telemetry against the versioned semantic convention registry, | ||||||
| * perform registry compliance checks in CI, | ||||||
| * execute live checks during tests to ensure that expected signals are actually | ||||||
| produced, | ||||||
| * generate authoritative documentation in Markdown or HTML from the registry. | ||||||
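
The idea behind a live check can be sketched independently of any tool. The code below does not use Weaver and does not reflect its API; `live_check` and `LiveCheckReport` are hypothetical names. It compares the signal names declared in a registry with those observed during a test run and reports both directions of mismatch.

```rust
use std::collections::BTreeSet;

/// Outcome of comparing observed signals with a registry (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct LiveCheckReport {
    /// Declared in the registry but never observed.
    pub missing: Vec<String>,
    /// Observed at runtime but not declared in the registry.
    pub undeclared: Vec<String>,
}

impl LiveCheckReport {
    /// True when every declared signal was seen and nothing else was emitted.
    pub fn is_ok(&self) -> bool {
        self.missing.is_empty() && self.undeclared.is_empty()
    }
}

/// Compares declared signal names with observed ones.
/// Both result lists are sorted, because `BTreeSet` iterates in order.
pub fn live_check(registry: &[&str], observed: &[&str]) -> LiveCheckReport {
    let declared: BTreeSet<&str> = registry.iter().copied().collect();
    let seen: BTreeSet<&str> = observed.iter().copied().collect();
    LiveCheckReport {
        missing: declared.difference(&seen).map(|s| s.to_string()).collect(),
        undeclared: seen.difference(&declared).map(|s| s.to_string()).collect(),
    }
}
```

A test harness could fail the build whenever `is_ok()` returns `false`, which is one way to catch schema drift before production.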
|
|
||||||
| An administrative endpoint exposes the live, resolved schema at runtime to | ||||||
|
Contributor: This should be configurable, not always on.
Contributor (Author): This is partially covered in |
||||||
| support inspection, debugging, and tooling integration. | ||||||
|
|
||||||
| Security and deployment guidance for this endpoint is in the | ||||||
| [Security and Privacy Guide](security-privacy-guide.md). | ||||||
|
|
||||||
| Registry compliance checks and live checks are not yet enforced in CI. See | ||||||
| [Implementation Gaps](implementation-gaps.md). | ||||||
|
|
||||||
| ### 7. Automated client SDK generation (longer term) | ||||||
|
|
||||||
| In the longer term, the custom semantic convention registry will be used to | ||||||
| generate **type-safe Rust client SDKs** via Weaver. | ||||||
|
||||||
Suggested change:
| - generate **type-safe Rust client SDKs** via Weaver. | |
| + generate **type-safe Rust instrumentation APIs** via Weaver. |
Today, in the short term, yes, we are focusing on the instrumentation side and reusing an existing SDK. In a second phase, what Joshua is preparing is to integrate our OTAP pipeline to report our internal telemetry. In absolute terms, this will allow us to report our telemetry via OTLP, OTAP, and so on, without going through an existing SDK. In the long term, I was thinking of reusing all this infrastructure to generate a new kind of schema-driven and type-safe SDKs.
Export failure not breaking the engine: this is not really specific to telemetry, it is a broader principle.
By this sentence, I meant for example that our instrumentation API must not be blocked, in one way or another, by a defective or congested SDK export function. If you have a rewording in mind that you think would be more accurate, feel free to suggest it.
This sounds similar to https://opentelemetry.io/docs/specs/otel/error-handling/
💯 love this part!
Removed because redundant with the channel metrics.
suggestion: It's easier, from a review standpoint, to keep PRs focused. Since this PR adds the telemetry guidelines doc, let's stick with that. Cleaning up metrics can be its own PR.