
Conversation

@jmacd
Contributor

@jmacd jmacd commented Jan 8, 2026

Document the approach we will take for routing internal logs, see #1736.

@github-actions github-actions bot added the rust Pull requests that update Rust code label Jan 8, 2026
@jmacd jmacd marked this pull request as ready for review January 8, 2026 00:32
@jmacd jmacd requested a review from a team as a code owner January 8, 2026 00:32
@codecov

codecov bot commented Jan 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.17%. Comparing base (25bae32) to head (a694088).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff            @@
##             main    #1741    +/-   ##
========================================
  Coverage   84.17%   84.17%            
========================================
  Files         480      480            
  Lines      139186   139361   +175     
========================================
+ Hits       117156   117312   +156     
- Misses      21496    21515    +19     
  Partials      534      534            
| Component | Coverage Δ |
|---|---|
| otap-dataflow | 85.41% <ø> (+<0.01%) ⬆️ |
| query_abstraction | 80.61% <ø> (ø) |
| query_engine | 90.54% <ø> (ø) |
| syslog_cef_receivers | ∅ <ø> (∅) |
| otel-arrow-go | 53.50% <ø> (ø) |

the log record on the same core. When this fails, the configurable
telemetry router will support options to use a global logs collection
thread, a raw logger, or do nothing (dropping the internal log
record).
Member

What should be the default for production if the user doesn't select any of these options? It would be good to mention that.

Contributor Author

New default: I added new text stating that the default would be an asynchronous raw console logger, like:

service:
  telemetry:
    logs:
      level: info
      strategies:
        global: unbuffered
        engine: buffered
        internal: noop
      output: raw

Contributor

@lquerel lquerel left a comment

I made a series of proposals that I hope go in the direction of what you want to put in place.

no-op implementation, multi-threaded subscriber, routing to the
same-core ITR, and/or raw logging.

## OTLP-bytes first
Contributor

In our recent telemetry guidelines, for events we distinguish between entity-based attributes (stable context) and other event-specific attributes (dynamic context). Using this terminology, attributes belonging to the stable context do not need to be emitted with every event instance; instead, they are identified by a unique numeric ID that references the attributes in an attribute registry.

The "dynamic" attributes are the ones that should travel as OTLP bytes from the hot path to the cold path. If I recall our recent conversation correctly, you were arguing that building this dynamic map would take roughly the same amount of time in a classic representation as in OTLP bytes form. That seems plausible to me, provided we are careful with this micro-serialization and keep attribute values simple.

In any case, I think we should run a few benchmarks to validate all of this.
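
To make the stable/dynamic split concrete, here is a minimal sketch; the names (`AttributeRegistry`, `InternalEvent`) are hypothetical and not part of the actual codebase. Stable-context attributes are registered once and referenced by a numeric ID, while only the dynamic attributes travel as OTLP bytes from the hot path to the cold path.

```rust
// Illustrative only: these types are assumptions, not the project's API.
use std::collections::HashMap;

/// Stable (entity-based) attributes are registered once; the index serves
/// as the numeric ID referenced by every event instance.
#[derive(Default)]
struct AttributeRegistry {
    entries: Vec<HashMap<String, String>>,
}

impl AttributeRegistry {
    fn register(&mut self, attrs: HashMap<String, String>) -> u32 {
        self.entries.push(attrs);
        (self.entries.len() - 1) as u32
    }
}

/// An event instance carries only the registry ID for its stable context,
/// plus its dynamic attributes already micro-serialized as OTLP bytes.
struct InternalEvent {
    stable_context_id: u32,
    dynamic_otlp_bytes: Vec<u8>,
}
```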

Contributor Author

I'd like to take this one step at a time. Presently, the internal logging design is concerned with the mechanics for handling and encoding events without considering entity-based attributes. We will be able to add dynamic context using the appropriate numeric ID to represent "scope" in a future PR.

@jmacd jmacd marked this pull request as draft January 9, 2026 22:58
@jmacd jmacd marked this pull request as ready for review January 12, 2026 19:11
instrumentation. Using the `otel_info!(effect, name, args...)` macro
requires access to the component EffectHandler. This is considered
first-party internal logging, and other uses of Tokio `tracing` are
considered third-party internal logging.
Contributor

I think we need to design and implement the internal logging and explicitly exclude/ignore third-party logging.

Contributor Author

The providers::internal setting does just that, and it defaults to Noop to prevent internal telemetry pipeline components from logging.

Noop,

/// Place into a thread-local buffer.
Buffered,
Contributor

I think this should not be an option at production time. If someone wishes to buffer and batch events, that should be done in an internal telemetry pipeline (not at telemetry production time).

Contributor Author

The internal telemetry pipeline is the buffer. The internal telemetry pipeline starts with a buffer in each engine thread.
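
A minimal sketch of that per-thread buffering, assuming std primitives and hypothetical record/channel types (the real design presumably uses the engine's own types):

```rust
use std::cell::RefCell;
use std::sync::mpsc::Sender;

thread_local! {
    // One buffer per engine thread; the internal telemetry pipeline
    // effectively begins here.
    static LOCAL_BUFFER: RefCell<Vec<String>> = RefCell::new(Vec::new());
}

/// Buffered mode: append to the thread-local buffer, no synchronization.
fn buffered_emit(record: String) {
    LOCAL_BUFFER.with(|buf| buf.borrow_mut().push(record));
}

/// Periodic flush: move the whole batch onto the channel read by the
/// logs collector thread.
fn flush(collector: &Sender<Vec<String>>) {
    LOCAL_BUFFER.with(|buf| {
        let batch = std::mem::take(&mut *buf.borrow_mut());
        if !batch.is_empty() {
            let _ = collector.send(batch);
        }
    });
}
```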

BL -->|periodic flush<br/>LogPayload::Batch| CH
NL -.->|discarded| X[∅]

CH --> LC
Contributor

There is one component missing here. There might be some "subscription manager" or some component in charge of pulling the objects from the channel and sending them to the configured destination (console or an internal telemetry receiver channel).

Contributor Author

This "LC" is the logs collector. The design documented in crates/telemetry/README.md calls for this to be NUMA-regional eventually, for now it's a global/single thread. This is one of the dedicated threads described for internal telemetry above, and it pulls from the channel and does what the OutputMode prescribes, meaning it prints to the console or sends to the internal pipeline.

use serde::{Deserialize, Serialize};

/// Internal Telemetry Receiver node URN for internal logging using OTLP bytes.
pub const INTERNAL_TELEMETRY_RECEIVER_URN: &str = "urn:otel:otlp:telemetry:receiver";
Contributor

why here?

Contributor Author

The controller or engine needs to know at startup which node is the ITR, and it's implemented in the crates/otap area; this just needs to be somewhere both can depend on, and it is sort of a type of configuration. It could go in the engine, controller, or another low-level crate.

Contributor

In my opinion, this URN is ambiguous. We should have a URN that clearly expresses that this is a receiver used by our internal telemetry. I would suggest something like urn:otelcol:internal:receiver, or something along those lines.

use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

/// Internal Telemetry Receiver node URN for internal logging using OTLP bytes.
Contributor

Suggestion: Do not include any source code changes as part of this PR unless it is a complete POC.
I'm not clear how these changes are aligned with the design, and there might be problems when starting the SDK threads when some objects are present.

Contributor Author

If you and other reviewers feel that this configuration is OK, I'll be glad to remove this change and include it in the PR with the implementation.

/// Internal logs configuration.
#[derive(Debug, Clone, Serialize, Deserialize, JsonSchema, Default)]
#[derive(Debug, Clone, Serialize, Deserialize, JsonSchema)]
pub struct LogsConfig {
Contributor

This configuration is more "pipeline" level than engine level. That's why I don't see this structure configured here.
I would expect the configuration to be defined per component, with defaults for the type of component. For example, the internal telemetry receiver and subsequent nodes should be Noop or console only.

Contributor Author

Note that @lquerel also doesn't think the new configuration belongs here. I'm not sure where it belongs; this feels OK to me.

I do see this as configuration for the main function, so the global, engine, and internal aspects inside providers are for global-level, engine-level, and internal-pipeline logging configuration. As mentioned in #1741 (comment), this is something we can configure at the component level, per-tenant, etc.

telemetry:
  logs:
    level: info
    strategies:
Contributor

This is more for the existing SDK configuration (service/telemetry).
We should configure this block in a separate, first-level block not coupled to the SDK configuration.

Contributor Author

Sorry! I had renamed this field "providers" and the type "LoggingProviders" to capture the three top-level default contexts that we can configure, which are global, engine, and internal.
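
Something along these lines, as a sketch of that renaming (field names follow the comment; the derive list and defaults are assumptions based on the YAML example above):

```rust
/// How a given context produces internal log records.
#[derive(Debug, Clone)]
pub enum ProviderMode {
    Unbuffered,
    Buffered,
    Noop,
}

/// Default logging providers for the three top-level contexts.
#[derive(Debug, Clone)]
pub struct LoggingProviders {
    pub global: ProviderMode,
    pub engine: ProviderMode,
    pub internal: ProviderMode,
}

impl Default for LoggingProviders {
    fn default() -> Self {
        // Mirrors the example defaults: global=unbuffered, engine=buffered,
        // internal=noop.
        Self {
            global: ProviderMode::Unbuffered,
            engine: ProviderMode::Buffered,
            internal: ProviderMode::Noop,
        }
    }
}
```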

strategies:
  global: unbuffered
  engine: buffered
  internal: noop
Contributor

This configuration should not be global to the engine, but local to a pipeline (node for our case?)

Contributor Author

In the accompanying document, I tried to capture the concept that the Internal Telemetry Receiver and all of its downstream nodes are considered an Internal Telemetry pipeline. All nodes in the internal telemetry pipeline would have the Noop configuration in this example. This setting is meant to apply to all of those internal telemetry components.

I'm aware of a desire to control the telemetry provider in more fine-grained ways. For example, we might imagine overriding LogLevel at the component level. We can imagine varying the ProviderMode at that level too; it's just more-detailed configuration.

Contributor

@lquerel lquerel left a comment

I made a few non-blocking comments, some of which could be easily addressed in this PR. More generally, I think we are in the process of refining the overall vision for this Internal Telemetry System. I should also have a PR coming soon that will help move us further toward our final design.

/// Internal logs configuration.
#[derive(Debug, Clone, Serialize, Deserialize, JsonSchema, Default)]
#[derive(Debug, Clone, Serialize, Deserialize, JsonSchema)]
pub struct LogsConfig {
Contributor

In the future, I expect a lot of changes at this level. Right now we are talking about LogConfig, but in the long run this will be a configuration for the entire internal telemetry pipeline/system and for all internally used signal types (metric set, events, ...).

I can see this being acceptable as an intermediate step for now, but as I mentioned, I don’t think this configuration is 1) general enough, or 2) in the right place.

Contributor Author

@lquerel this position in the YAML is the same as the OTel Collector's at service::telemetry::logs with the level field. I'm open to placing it anywhere you and @andborja can agree on. The resource is definitely used for both logs and metrics; I would expect other signals' internal configuration to mirror this.

Comment on lines +28 to +29
ITR components and any of the connected processor and exporter
components reachable from ITR source nodes.
Contributor

IMO, the ITR(s) are part of the Internal Telemetry System, or ITS, which is the combination of registries (entity and metric set) and an internal OTAP-based telemetry pipeline.

@@ -0,0 +1,252 @@
# Internal Telemetry Logging Pipeline
Contributor

In the near future, I hope that this internal telemetry logging pipeline will become more general and integrate metrics and other types of signals as well.

Contributor Author

Definitely!

output: raw
```
```mermaid
Contributor

I'm thinking of something a little simpler:
#1769

Contributor Author

Thank you @andborja.

For the record, I think these visions are compatible!

The part you have as "Event manager (configurable)" in the diagram is the Internal Telemetry Receiver component. There is one instance in the configuration. As the engine starts, we will have one of these (global) or one per NUMA region, and they can have any configuration we all agree on. I have no configuration so far, and (as my next PR will show) in my current draft the ITR uses send_message to the default out-port.

As for the part where you have multiple internal pipelines: that's already part of the OTAP-Dataflow graph structure, the idea that any component, including the ITR, can have multiple output destinations and policies allowing various behavior in the pipeline-level configuration. So, if the ITR configuration for the "event manager" as you describe it can name the out-ports that it wants to send to, I think we'll have essentially the same diagram.

@lquerel please see if this aligns with your thinking: the idea of a single ITR in the OTAP-dataflow pipeline configuration with some sort of policy that it uses to send to one or more out-ports in the config. The ITR configuration is a mapping from something to the out-port name, that is.
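
For illustration, that policy could be as simple as a lookup from some routing key to an out-port name (all names here are hypothetical, not the actual configuration schema):

```rust
use std::collections::HashMap;

/// Hypothetical ITR routing policy: map a key (severity, signal type, ...)
/// to the name of an out-port declared in the pipeline configuration.
struct ItrRouting {
    routes: HashMap<String, String>,
    default_out_port: String,
}

impl ItrRouting {
    fn out_port_for(&self, key: &str) -> &str {
        self.routes
            .get(key)
            .map(String::as_str)
            .unwrap_or(self.default_out_port.as_str())
    }
}
```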

Contributor

What I think is different is the hard/specific separation and decoupling of the production of the internal telemetry from its routing and consumption.
The ITR is not the same as the event manager. The event manager is part of the reception and routing of the messages produced by any component in the engine. The ITR is yet another receiver, so it is part of the "consumption" block.
A view with this separation simplifies several downstream definitions, including the configuration of the different components, and even whether an ITR id is needed on the "production" side.
The channels used here are not the same as the graph queues, so the out-ports and other concepts don't have the same meaning. The ITR is not a processor, it is a receiver, so it does not read the incoming messages from a "port". They have their own flume channel.

The OTAP Dataflow engine has dedicated macros, and every component is
configured with an internal telemetry SDK meant for primary
instrumentation. Using the `otel_info!(effect, name, args...)` macro
requires access to the component EffectHandler. This is considered
Contributor

I think we need a separate object (not the EffectHandler) to hold the logger. The effect handler is instantiated per component, and it is even a different data structure depending on the component type (the receiver effect handler is different from the processor effect handler).
I think we need something like a "request context" that is the same instance through the whole pipeline path and different per request.
That is the object that should keep the instance of the logger (and later the TelemetrySettings).

Contributor Author

This sounds possible, and we already have a request-level Context object used for Ack/Nack handling. This is a step above the kind of protection we need here, however, so I would evaluate that objective on its own. Why should we want per-request-context telemetry settings, instead of per-component settings with isolated internal pipelines using entirely different choices w/o runtime dispatch?

Contributor

Is this request context available at the receiver level (where the message is received/produced)? It should be something that is received from the beginning by the first receiver.
Technically it should be configured per "pipeline", that is, a sequence of components for the internal telemetry. This concept does not exist today. We have a graph of components where you don't really know whether they are part of the same "pipeline".
As we have nodes that are linked in sequence, we need to set the logger at the ITR, and then the rest of the nodes in the same request should use the same configuration.
For example, if the node graph is:
ITR->Processor1
OTLPReceiver->Processor1
It would be like 2 pipelines pointing to the same processor, one internal.
The request coming from the ITR to Processor1 should use Noop (by default), and the other should use the default logger (sending telemetry through the channel).
In general, the logger to use depends on the request, on whether it is part of an internal telemetry pipeline or not.
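
A toy sketch of that per-request selection; the names (RequestContext, LoggerKind) are assumed for illustration, not the project's actual types:

```rust
/// Which logger a request should use, decided by the first receiver.
#[derive(Clone, Copy)]
enum LoggerKind {
    /// Requests originating from the ITR (internal telemetry pipeline).
    Noop,
    /// Requests from ordinary receivers, e.g. an OTLP receiver.
    Channel,
}

/// Travels with the request through every node on its path.
#[derive(Clone, Copy)]
struct RequestContext {
    logger: LoggerKind,
}

fn processor1_log(ctx: &RequestContext, msg: &str) {
    // The same Processor1 code behaves differently depending on whether
    // the request entered through the ITR or through the OTLP receiver.
    match ctx.logger {
        LoggerKind::Noop => {}
        LoggerKind::Channel => eprintln!("{msg}"),
    }
}
```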

@jmacd jmacd added this pull request to the merge queue Jan 13, 2026
Merged via the queue into open-telemetry:main with commit f50fc81 Jan 13, 2026
42 of 43 checks passed
@jmacd jmacd deleted the jmacd/internal_logs_arch branch January 13, 2026 00:13
github-merge-queue bot pushed a commit that referenced this pull request Jan 16, 2026
Part of #1771.

Part of #1736.

Follows #1741.

This moves the HashMap<_, _> of nodes into a struct `PipelineNodes` and
re-uses it to parse an identical graph of `internal` nodes. This
internal graph will be used when an internal logging provider is
configured to output to an internal pipeline.
github-merge-queue bot pushed a commit that referenced this pull request Jan 16, 2026
Part of #1771.

Part of #1736.

Follows #1741.

This `raw_error!` macro is different from the others in
`internal_events.rs` in two ways:

1. Supports the complete Tokio `tracing` syntax, including display and
debug formatters
2. Bypasses the Tokio global dispatch and subscriber, calling into the
raw logging layer

The use of `tracing`'s `valueset!` macro is key to supporting the whole
syntax for the other `otel_XXX!` macros.

Test log statement prints:

```
2026-01-15T20:59:42.100Z  ERROR  otap_df_telemetry::internal_events::tests::raw error message (crates/telemetry/src/internal_events.rs:171):  [error=ConfigurationError("bad config")]
```
github-merge-queue bot pushed a commit that referenced this pull request Jan 16, 2026
… setup (#1795)

Part of #1771.

Part of #1736.

As documented in #1741.

~Updates that document to match this change reflecting the prototype in
#1771.~

Revised relative to #1771.

Adds LoggingProviders (choice of default logging provider for global,
engine, and internal-telemetry threads).
Adds ProviderMode with names to select instrumentation behavior, with
`its` referring to internal telemetry system.

Note: These settings are not ideally placed. They could also belong in
the top-level settings, or with the observed_state settings. However, since
logging is configured with resource and level, which are presently part of
the service::telemetry config area, we use that structure. After the bulk
of #1736 is finished we can restructure.