Internal logs architecture document #1741
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

    @@           Coverage Diff            @@
    ##             main    #1741     +/-  ##
    =========================================
      Coverage   84.17%   84.17%
    =========================================
      Files         480      480
      Lines      139186   139361    +175
    =========================================
    + Hits       117156   117312    +156
    - Misses      21496    21515     +19
      Partials      534      534
> the log record on the same core. When this fails, the configurable
> telemetry router will support options to use global logs collection
> thread, a raw logger, or do nothing (dropping the internal log
> record).
What should be the default for production if the user doesn't select any of these options? It would be good to mention that.
New default: I added new text stating the default would be an asynchronous raw console logger, like:

    service:
      telemetry:
        logs:
          level: info
          strategies:
            global: unbuffered
            engine: buffered
            internal: noop
          output: raw
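For illustration, a rough sketch of how this block could deserialize with serde/schemars (as used elsewhere in this PR). The LoggingProviders and ProviderMode names come from the later configuration commit in this thread; the exact variant set here is only a guess based on the YAML keys above, not the PR's actual definitions.

```rust
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

/// Assumed provider/strategy modes matching the YAML keys above
/// (`noop`, `unbuffered`, `buffered`); the real variant set may differ.
#[derive(Debug, Clone, Copy, Serialize, Deserialize, JsonSchema)]
#[serde(rename_all = "lowercase")]
pub enum ProviderMode {
    Noop,
    Unbuffered,
    Buffered,
}

/// Assumed shape of the `strategies:` block: one default mode per
/// thread class (global, engine, internal-telemetry).
#[derive(Debug, Clone, Serialize, Deserialize, JsonSchema)]
pub struct LoggingProviders {
    pub global: ProviderMode,
    pub engine: ProviderMode,
    pub internal: ProviderMode,
}
```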
lquerel left a comment:
I made a series of proposals that I hope go in the direction of what you want to put in place.
> no-op implementation, multi-threaded subscriber, routing to the
> same-core ITR, and/or raw logging.
>
> ## OTLP-bytes first
In our recent telemetry guidelines, for events we distinguish between entity-based attributes (stable context) and other event-specific attributes (dynamic context). Using this terminology, attributes belonging to the stable context do not need to be emitted with every event instance; instead, they are identified by a unique numeric ID that references the attributes in an attribute registry.
The "dynamic" attributes are the ones that should travel as OTLP bytes from the hot path to the cold path. If I recall our recent conversation correctly, you were arguing that building this dynamic map would take roughly the same amount of time in a classic representation as in OTLP bytes form. That seems plausible to me, provided we are careful with this micro-serialization and keep attribute values simple.
In any case, I think we should run a few benchmarks to validate all of this.
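To make the stable/dynamic split concrete, here is a purely illustrative sketch (none of these names are from the PR) of what could travel from the hot path to the cold path under this proposal:

```rust
/// Illustrative only: a hot-path log event split into stable and dynamic parts.
pub struct HotPathLogEvent {
    /// Numeric id referencing the stable (entity-based) context in an
    /// attribute registry, so those attributes are not re-emitted per event.
    pub stable_context_id: u64,
    /// Fixed scalar fields (severity, timestamp, ...) travel as plain values.
    pub severity_number: i32,
    /// Event-specific (dynamic) attributes, pre-encoded as OTLP bytes on the
    /// hot path (for example, a protobuf-encoded key/value list).
    pub dynamic_attrs_otlp: Vec<u8>,
}
```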
I'd like to take this one step at a time. Presently, the internal logging design is concerned with the mechanics for handling and encoding events without considering entity-based attributes. We will be able to add dynamic context using the appropriate numeric ID to represent "scope" in a future PR.
…d/internal_logs_arch
rust/otap-dataflow/crates/config/src/pipeline/service/telemetry/logs.rs
> instrumentation. Using the `otel_info!(effect, name, args...)` macro
> requires access to the component EffectHandler. This is considered
> first-party internal logging, and other uses of Tokio `tracing` are
> considered third-party internal logging.
I think we need to design and implement the internal logging and explicitly exclude/ignore third-party logging.
The `providers::internal` setting does just that, and it defaults to `Noop` to prevent internal telemetry pipeline components from logging.
> Noop,
>
> /// Place into a thread-local buffer.
> Buffered,
I think this should not be an option at production time. If someone wishes to buffer and batch events, that should be done in an internal telemetry pipeline (not at telemetry production time).
The internal telemetry pipeline is the buffer. The internal telemetry pipeline starts with a buffer in each engine thread.
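As a sketch of what "a buffer in each engine thread" could mean in practice (assumed names and types, not the actual implementation):

```rust
use std::cell::RefCell;

// Assumed payload type; in this design it would be the encoded log record.
type LogRecordBytes = Vec<u8>;

thread_local! {
    // Per-engine-thread buffer: the buffered mode appends here, and a
    // periodic flush hands the whole batch to the logs-collector channel.
    static LOG_BUFFER: RefCell<Vec<LogRecordBytes>> = RefCell::new(Vec::new());
}

fn buffer_log_record(record: LogRecordBytes) {
    LOG_BUFFER.with(|buf| buf.borrow_mut().push(record));
}

fn flush_log_buffer(send: impl Fn(Vec<LogRecordBytes>)) {
    LOG_BUFFER.with(|buf| {
        let batch = std::mem::take(&mut *buf.borrow_mut());
        if !batch.is_empty() {
            send(batch);
        }
    });
}
```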
> BL -->|periodic flush<br/>LogPayload::Batch| CH
> NL -.->|discarded| X[∅]
>
> CH --> LC
There is one component missing here. There might be some "subscription manager" or some component in charge of pulling the objects from the channel and sending them to the configured destination (console or an internal telemetry receiver channel).
This "LC" is the logs collector. The design documented in crates/telemetry/README.md calls for this to be NUMA-regional eventually, for now it's a global/single thread. This is one of the dedicated threads described for internal telemetry above, and it pulls from the channel and does what the OutputMode prescribes, meaning it prints to the console or sends to the internal pipeline.
> use serde::{Deserialize, Serialize};
>
> /// Internal Telemetry Receiver node URN for internal logging using OTLP bytes.
> pub const INTERNAL_TELEMETRY_RECEIVER_URN: &str = "urn:otel:otlp:telemetry:receiver";
Why here?
The controller or engine needs to know at startup which node is the ITR, and the ITR is implemented in the crates/otap area; this constant just needs to live somewhere both can depend on, and it is effectively a kind of configuration. It could go in the engine, controller, or another low-level crate.
In my opinion, this URN is ambiguous. We should have a URN that clearly expresses that this is a receiver used by our internal telemetry. I would suggest something like urn:otelcol:internal:receiver, or something along those lines.
> use schemars::JsonSchema;
> use serde::{Deserialize, Serialize};
>
> /// Internal Telemetry Receiver node URN for internal logging using OTLP bytes.
Suggestion: do not include any source code changes as part of this PR unless it is a complete POC.
I'm not clear how these changes align with the design, and there might be problems when starting the SDK threads when some objects are present.
If you and other reviewers feel that this configuration is OK, I'll be glad to remove this change and include it in the PR with the implementation.
> /// Internal logs configuration.
> #[derive(Debug, Clone, Serialize, Deserialize, JsonSchema, Default)]
> #[derive(Debug, Clone, Serialize, Deserialize, JsonSchema)]
> pub struct LogsConfig {
This configuration is more "pipeline" level than engine level. That's why I don't see this structure configured here.
I would expect the configuration to be defined per component, with defaults for the type of component. For example, the internal telemetry receiver and subsequent nodes should be Noop or console only.
Note that @lquerel also doesn't think the new configuration belongs here. I'm not sure where it belongs; this feels OK to me.
I do see this as configuration for the main function, so the global, engine, and internal aspects inside providers are for global-level, engine-level, and internal-pipeline logging configuration. As mentioned in #1741 (comment), this is something we can configure at the component level, per-tenant, etc.
> telemetry:
>   logs:
>     level: info
>     strategies:
This is more for the existing SDK configuration (service/telemetry).
We should configure this block in a separate, first-level block not coupled to the SDK configuration.
Sorry! I had renamed this field "providers" and the type "LoggingProviders" to capture the three top-level default contexts that we can configure, which are global, engine, and internal.
> strategies:
>   global: unbuffered
>   engine: buffered
>   internal: noop
This configuration should not be global to the engine, but local to a pipeline (a node, in our case?).
In the accompanying document, I tried to capture the concept that the Internal Telemetry Receiver and all of its downstream nodes are considered an Internal Telemetry pipeline. All nodes in the internal telemetry pipeline would have the Noop configuration in this example. This setting is meant to apply to all of those internal telemetry components.
I'm aware of a desire to control the telemetry provider in more fine-grained ways. For example, we might imagine overriding LogLevel at the component level; we can imagine varying the ProviderMode at that level too. It's just more-detailed configuration.
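To show what that more-detailed configuration could look like, here is a hypothetical per-component override layered on the engine-wide defaults (all type and field names are stand-ins, not the PR's types):

```rust
use std::collections::HashMap;

/// Stand-ins for the real level/mode types discussed in this thread.
#[derive(Debug, Clone, Copy)]
pub enum LogLevel { Error, Warn, Info, Debug }

#[derive(Debug, Clone, Copy)]
pub enum ProviderMode { Noop, Unbuffered, Buffered }

/// Hypothetical per-component overrides, keyed by node/component id.
/// Components without an entry fall back to the engine-wide defaults.
#[derive(Debug, Clone, Default)]
pub struct ComponentTelemetryOverrides {
    pub level: HashMap<String, LogLevel>,
    pub provider_mode: HashMap<String, ProviderMode>,
}

/// Resolve the effective level for one component.
pub fn effective_level(
    overrides: &ComponentTelemetryOverrides,
    component_id: &str,
    default_level: LogLevel,
) -> LogLevel {
    overrides.level.get(component_id).copied().unwrap_or(default_level)
}
```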
lquerel left a comment:
I made a few non-blocking comments, some of which could be easily addressed in this PR. More generally, I think we are in the process of refining the overall vision for this Internal Telemetry System. I should also have a PR coming soon that will help move us further toward our final design.
> /// Internal logs configuration.
> #[derive(Debug, Clone, Serialize, Deserialize, JsonSchema, Default)]
> #[derive(Debug, Clone, Serialize, Deserialize, JsonSchema)]
> pub struct LogsConfig {
In the future, I expect a lot of changes at this level. Right now we are talking about LogConfig, but in the long run this will be a configuration for the entire internal telemetry pipeline/system and for all internally used signal types (metric set, events, ...).
I can see this being acceptable as an intermediate step for now, but as I mentioned, I don’t think this configuration is 1) general enough, or 2) in the right place.
@lquerel this position in the YAML is the same as the OTel Collector's service::telemetry::logs with the level field. I'm open to placing it anywhere you and @andborja can agree on. The resource is definitely used for both logs and metrics; I would expect other signals' internal configuration to mirror this.
> ITR components and any of the connected processor and exporter
> components reachable from ITR source nodes.
IMO, the ITR(s) are part of the Internal Telemetry System (ITS), which is the combination of registries (entity and metric set) and an internal otap-based telemetry pipeline.
> @@ -0,0 +1,252 @@
> # Internal Telemetry Logging Pipeline
In the near future, I hope that this internal telemetry logging pipeline will become more general and integrate metrics and other types of signals as well.
Definitely!
> output: raw
> ```
> ```mermaid
I'm thinking of something a little simpler:
#1769
Thank you @andborja.
For the record, I think these visions are compatible!
The part you have as "Event manager (configurable)" in the diagram is the Internal Telemetry Receiver component. There is one instance in the configuration. As the engine starts, we will have one of these (global) or one per NUMA region, and they can have any configuration we all agree on. I have no configuration so far, and (as my next PR will show) in my current draft the ITR uses send_message to the default out-port.
As for the part where you have multiple internal pipelines: that is already part of the OTAP-Dataflow graph structure, namely the idea that any component, including the ITR, can have multiple output destinations and policies allowing various behaviors in the pipeline-level configuration. So, if the ITR configuration for the "event manager" as you describe it can name the out-ports that it wants to send to, I think we'll have essentially the same diagram.
@lquerel please see if this aligns with your thinking: the idea of a single ITR in the OTAP-dataflow pipeline configuration with some sort of policy that it uses to send to one or more out-ports in the config. The ITR configuration is a mapping from something to the out-port name, that is.
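As a sketch only (field names are placeholders; nothing here is settled in the discussion above), such a mapping could be as simple as:

```rust
use std::collections::HashMap;

/// Hypothetical ITR routing policy: map some selector (signal type, severity,
/// source component, ...) to a named out-port in the pipeline configuration.
/// The selector is left as a plain string because it is still an open question.
#[derive(Debug, Clone, Default)]
pub struct ItrRoutingPolicy {
    /// Out-port used when no rule matches.
    pub default_out_port: String,
    /// Selector -> out-port name.
    pub routes: HashMap<String, String>,
}

impl ItrRoutingPolicy {
    /// Resolve the out-port for a given selector, falling back to the default.
    pub fn out_port_for(&self, selector: &str) -> &str {
        self.routes
            .get(selector)
            .map(String::as_str)
            .unwrap_or(self.default_out_port.as_str())
    }
}
```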
What I think is different is the hard/specific separation and decoupling of the production of the internal telemetry and its routing and consumption.
The ITR is not the same as the event manager. The event manager is part of the reception and routing of the messages produced by any component in the engine. The ITR is yet another receiver, so it is part of the "consumption" block.
A view with this separation simplifies several downstream definitions, including the configuration of the different components, and even whether an ITR id is needed at all on the "production" side.
The channels used here are not the same as the graph queues, so the out-ports and other concepts don't have the same meaning. The ITR is not a processor; it is a receiver, so it does not read the incoming messages from a "port". They have their own flume channel.
Co-authored-by: Laurent Quérel <[email protected]>
Co-authored-by: Laurent Quérel <[email protected]>
…w into jmacd/internal_logs_arch
> The OTAP Dataflow engine has dedicated macros, and every component is
> configured with an internal telemetry SDK meant for primary
> instrumentation. Using the `otel_info!(effect, name, args...)` macro
> requires access to the component EffectHandler. This is considered
I think we need a separate object (not the EffectHandler) to hold the logger. The effect handler is instantiated per component, and it is even a different data structure depending on the component type (the receiver effect handler is different from the processor effect handler).
I think we need something like a "request context": the same instance through the whole pipeline path, and a different one per request.
That is the object that should keep the instance of the logger (and later the TelemetrySettings).
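A minimal sketch of that idea, assuming nothing about the engine's actual types (the trait and struct names here are placeholders):

```rust
use std::sync::Arc;

/// Placeholder for whatever logger abstraction is ultimately used.
pub trait InternalLogger: Send + Sync {
    fn log(&self, record: &[u8]);
}

/// Hypothetical request context: created once per request at the first
/// receiver and carried, unchanged, through the whole pipeline path.
#[derive(Clone)]
pub struct RequestContext {
    pub logger: Arc<dyn InternalLogger>,
    // later: TelemetrySettings, trace context, ...
}
```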
This sounds possible, and we already have a request-level Context object used for Ack/Nack handling. This is a step above the kind of protection we need here, however, so I would evaluate that objective on its own. Why should we want per-request-context telemetry settings, instead of per-component settings with isolated internal pipelines using entirely different choices without runtime dispatch?
Is this request context available at the receiver level (where the message is received/produced)? It should be something that exists from the beginning, at the first receiver.
Technically it should be configured per "pipeline", that is, a sequence of components for the internal telemetry. This concept does not exist today: we have a graph of components, and you don't really know whether they are part of the same "pipeline".
Since we have nodes that are linked in sequence, we need to set the logger at the ITR, and then the rest of the nodes handling the same request should use the same configuration.
For example, if the node graph is:

    ITR -> Processor1
    OTLPReceiver -> Processor1

it would be like two pipelines pointing to the same processor, one internal.
The request coming from the ITR to the Processor1 should use Noop (by default), and the other should use the default logger (sending telemetry through the channel).
In general, the logger to use depends on the request, if it is part of an internal telemetry pipeline or not.
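A sketch of that request-dependent choice (illustrative names only; Noop and the channel-based default logger are the behaviors described above):

```rust
/// Where a request entered the graph.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RequestOrigin {
    /// Produced by the Internal Telemetry Receiver (internal pipeline).
    InternalTelemetryReceiver,
    /// Produced by a regular receiver such as the OTLP receiver.
    ExternalReceiver,
}

/// How components log while handling the request.
#[derive(Debug, Clone, Copy)]
pub enum LoggerChoice {
    /// Default for internal-telemetry requests, to avoid feedback loops.
    Noop,
    /// Default elsewhere: send log records through the telemetry channel.
    Channel,
}

fn logger_for(origin: RequestOrigin) -> LoggerChoice {
    match origin {
        RequestOrigin::InternalTelemetryReceiver => LoggerChoice::Noop,
        RequestOrigin::ExternalReceiver => LoggerChoice::Channel,
    }
}
```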
f50fc81
Part of #1771. Part of #1736. Follows #1741.

This `raw_error!` macro is different from the others in `internal_events.rs` in two ways:

1. Supports the complete Tokio `tracing` syntax, including display and debug formatters
2. Bypasses the Tokio global dispatch and subscriber, calling into the raw logging layer

The use of `tracing`'s `valueset!` macro is key to supporting the whole syntax for the other `otel_XXX!` macros.

Test log statement prints:

```
2026-01-15T20:59:42.100Z ERROR otap_df_telemetry::internal_events::tests::raw error message (crates/telemetry/src/internal_events.rs:171): [error=ConfigurationError("bad config")]
```
… setup (#1795)

Part of #1771. Part of #1736. As documented in #1741. ~Updates that document to match this change reflecting the prototype in #1771.~ Revised relative to #1771.

Adds LoggingProviders (choice of default logging provider for global, engine, and internal-telemetry threads). Adds ProviderMode with names to select instrumentation behavior, with `its` referring to the internal telemetry system.

Note: These settings are not ideally placed. They arguably belong in the top-level settings, or with the observed_state settings. However, since logging is configured with resource and level, which are currently part of the service::telemetry config area, we use that structure. After the bulk of #1736 is finished we can restructure.
Document the approach we will take for routing internal logs, see #1736.