Internal logging configuration model, background thread(s) #1771
base: main
Conversation
```yaml
config:

service:
  telemetry:
```
This is more for the existing SDK configuration, in alignment with the otel collector style.
I would configure it in /telemetry/logs instead if required.
```yaml
global: immediate
engine: immediate
output: direct
```
What does this mean?
```yaml
# Internal telemetry pipeline - separate from main pipeline
# Uses hardcoded settings: single thread, no admin server
internal:
```
Suggestion: We can make it multi-graph with the same effort and risk.
Something like:
```yaml
graphs:              # array of graphs
  - internal:        # name of the graph
      # other graph level properties, including the logger type
      nodes:         # copied from the original graph structure
        telemetry:   # same block that is configured here
          # ...
```
We can keep the existing single-graph configuration as is; this would be more of an extension of the configuration.
```yaml
providers:
  global: immediate
  engine: immediate
  internal: raw  # Avoid feedback in internal pipeline
```
This should be configured in the "internal" block, not here.
```rust
/// these is the main pipeline, and one is the internal telemetry pipeline.
#[derive(Debug, Clone, Serialize, Deserialize, JsonSchema, Default)]
#[serde(transparent)]
pub struct PipelineNodes(HashMap<NodeId, Arc<NodeUserConfig>>);
```
nit: We can do something more "structured" using named attributes instead of a tuple
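For illustration only (not taken from the PR), a named-field variant could look like the sketch below. The field name `nodes` is illustrative, `NodeId` and `NodeUserConfig` are the existing project types (their imports are omitted), and whether `#[serde(transparent)]` and the derived `JsonSchema` behave identically with a single named field is worth double-checking.

```rust
use std::{collections::HashMap, sync::Arc};

use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

/// Same wire format as the tuple struct, but with a named field for clarity.
#[derive(Debug, Clone, Serialize, Deserialize, JsonSchema, Default)]
#[serde(transparent)]
pub struct PipelineNodes {
    /// Node configurations keyed by node id.
    nodes: HashMap<NodeId, Arc<NodeUserConfig>>,
}
```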
```rust
let span = otel_info_span!("internal_pipeline_thread", core.id = core_id.id);
let _guard = span.enter();

// No core pinning for internal pipeline - it's lightweight
```
I propose changing this comment and turning it into a TODO. In the long run, we will also introduce thread pinning for this internal pipeline, since an instance of our Internal Telemetry System will be deployed on each NUMA region. We will therefore need one thread per internal telemetry pipeline for each NUMA node, with thread pinning to one of the cores in the corresponding region.
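A rough sketch of what that TODO could eventually become. It uses the `core_affinity` crate; whether the engine already uses that crate for its pinning is an assumption on my part, and the function name and core-selection policy are illustrative only.

```rust
use std::thread;

/// Illustrative only: spawn the internal-pipeline thread pinned to one core
/// chosen from the NUMA region it serves.
fn spawn_internal_pipeline_pinned(preferred_core: usize) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        if let Some(cores) = core_affinity::get_core_ids() {
            if let Some(core) = cores.into_iter().find(|c| c.id == preferred_core) {
                // Pin this thread to a core in the corresponding NUMA region.
                core_affinity::set_for_current(core);
            }
        }
        // ... run the internal telemetry pipeline for this NUMA region ...
    })
}
```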
Part of #1771. Part of #1736. Follows #1741.

This `raw_error!` macro is different from the others in `internal_events.rs` in two ways:

1. It supports the complete Tokio `tracing` syntax, including display and debug formatters.
2. It bypasses the Tokio global dispatch and subscriber, calling into the raw logging layer.

The use of `tracing`'s `valueset!` macro is key to supporting the whole syntax of the other `otel_XXX!` macros. The test log statement prints:

```
2026-01-15T20:59:42.100Z ERROR otap_df_telemetry::internal_events::tests::raw error message (crates/telemetry/src/internal_events.rs:171): [error=ConfigurationError("bad config")]
```
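For context, here is a hedged guess at the kind of call site that produces output like the line above. The message, field name, and error type are assumptions; the field syntax (`?` for debug, `%` for display) is standard Tokio `tracing` syntax, which the macro is said to support.

```rust
// Placeholder error type, only for illustration.
#[derive(Debug)]
enum InternalError {
    ConfigurationError(String),
}

fn report_bad_config() {
    let err = InternalError::ConfigurationError("bad config".into());
    // Debug-format the error into an `error` field, as in `tracing`'s event macros,
    // but delivered through the raw logging layer rather than the global dispatcher.
    raw_error!(error = ?err, "error message");
}
```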
… setup (#1795) Part of #1771. Part of #1736. As documented in #1741. ~Updates that document to match this change reflecting the prototype in #1771.~ Revised relative to #1771.

Adds LoggingProviders (a choice of default logging provider for the global, engine, and internal-telemetry threads). Adds ProviderMode with names to select instrumentation behavior, with `its` referring to the internal telemetry system.

Note: These settings are not ideally placed; they arguably belong in the top-level settings, or alongside the observed_state settings. However, since logging is configured with resource and level, which currently live in the service::telemetry config area, we use that structure. After the bulk of #1736 is finished we can restructure.
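A speculative sketch of the shape this configuration might take. The derives and field layout are assumptions; only the three contexts (global, engine, internal) and the `immediate`, `raw`, and `its` mode names come from the text and config excerpts, and the fourth mode is left out because its name is not stated here.

```rust
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

/// Default logging provider per context (sketch; names follow the text above).
#[derive(Debug, Clone, Serialize, Deserialize, JsonSchema, Default)]
pub struct LoggingProviders {
    /// Threads outside the engine.
    pub global: ProviderMode,
    /// Engine (pipeline) threads.
    pub engine: ProviderMode,
    /// The internal telemetry pipeline itself.
    pub internal: ProviderMode,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize, JsonSchema, Default)]
#[serde(rename_all = "lowercase")]
pub enum ProviderMode {
    /// Log synchronously (the `immediate` value seen in the config excerpts).
    #[default]
    Immediate,
    /// Bypass dispatch and write through the raw logging layer.
    Raw,
    /// Route through the internal telemetry system.
    Its,
    // A fourth mode exists per the description; its name is omitted here.
}
```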
…#1808) Part of #1771. Part of #1736.

This is a non-functional refactoring of `opentelemetry_client.rs` into other places, which will make it clearer what changes in #1771 and what is simply being moved. It moves runtime elements into the InternalTelemetrySystem and simplifies setup for the controller, where logs and metrics were previously separated. It also moves OTel-SDK specific pieces into the `otel_sdk` module and separates the Tokio `tracing` setup.

---------

Co-authored-by: Utkarsh Umesan Pillai <[email protected]>
Too large for review!
Complete draft configuration from #1741. This establishes a default logging provider configuration that supports 4 choices (provider modes) in 3 different contexts (the global, engine, and internal-telemetry threads).
There are major caveats. Two critical new configuration fields named "providers" and "output" are added in the current service::telemetry::logs area. This area is off-spec for the OpenTelemetry declarative configuration, so these belong elsewhere. However, the code needs substantial refactoring, and refactoring further now would get in the way of this intermediate step.
The major pieces of the changeset are largely independent:
- `pub struct PipelineNodes(HashMap<NodeId, Arc<NodeUserConfig>>);` adds a second set of nodes for the internal pipeline: `nodes` (existing) and `internal` (new); refactoring to re-use validation logic (`crates/config/src/pipeline.rs`)
- Refactoring of `opentelemetry_client.rs` into a new `telemetry_runtime.rs`; here the 4 provider modes are supported, with a helper to install Tokio subscribers for certain threads (`crates/telemetry/src/logs.rs`); definition of a flume bounded channel identical to / parallel to the observed state store (TODO: future consolidation with observed_state); "direct" collector support for asynchronous console logging
- Makes it easy to produce panics: setting the internal telemetry pipeline to use OTAP batching created a nice test for what happens when the internal logging path panics or fails; this needs separate investigation, and setting the internal provider to "raw" helpfully shows what happens.
- A `raw_error!` macro that is safe to use everywhere, including inside Tokio subscribers, for logging about internal telemetry failures as an emergency

Console logs look like:
We expect to fill in scope attributes soon. I had tested with batching and format=otap to see if the OTAP representation would naturally group single statements into the resource value. In this configuration, it now prints:
This is a raw error printed by the internal telemetry pipeline.
That's the error after briefly modifying `batch_processor.rs` in this PR; otherwise it panics. Since I've modified the batch processor not to return an error, we can no longer see the situation where the internal telemetry pipeline panics. When producers write to the channel after a panic, the send fails, resulting in a `raw_error!` which tells the user on the console that the logging service has died.
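To make that failure path concrete, here is a small illustrative sketch (names and types are assumptions, not the PR's code): producers try to send into the bounded flume channel, and when the send fails because the internal pipeline thread has died, they fall back to `raw_error!` so the user still sees a console message.

```rust
/// Stand-in for whatever record type the internal channel carries (illustrative).
pub struct InternalLogRecord;

fn emit_to_internal_pipeline(
    tx: &flume::Sender<InternalLogRecord>,
    record: InternalLogRecord,
) {
    if let Err(err) = tx.try_send(record) {
        // The channel is full or disconnected (e.g. the consumer thread panicked);
        // fall back to the raw logging layer so the failure is visible on the console.
        raw_error!(error = %err, "internal telemetry pipeline unavailable; dropping record");
    }
}
```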