feat(executor): add OTel metrics for the TaskAction controller#7483

Merged
pingsutw merged 9 commits into main from taskaction-otel-metrics on Jun 12, 2026

Conversation

@pingsutw pingsutw commented Jun 8, 2026

Member

Tracking issue

Related to #7459

Why are the changes needed?

[Three screenshots, captured 2026-06-10 between 5:18 PM and 5:19 PM]

#7459 added the OpenTelemetry meter-provider plumbing (otelutils.GetMeterProvider)
and instrumented the connect RPC clients/servers, but added no business metrics for
the TaskAction controller — operators have no visibility into how many TaskAction CRDs
exist, what phase they're in, how large they are, or how long their Kubernetes API
operations take.

It also surfaced a cardinality problem: the otelconnect server interceptors tag every
rpc_server_* metric with the caller's ephemeral source port, which is unbounded.

What changes were proposed in this pull request?

1. TaskAction controller metrics

Three custom OTel instruments, registered on the executor meter provider that is
injected into the reconciler (NewTaskActionReconciler(... , meterProvider, cache) calls
registerTaskActionMetrics(provider, cachedPhaseCounter(cache))), so they flow through
the same OTLP pipeline as the RPC metrics and the helper is testable without globals:

  • taskaction.active (Int64ObservableGauge) — number of TaskAction CRDs by
    phase. Observed asynchronously via a callback that counts straight from the
    controller cache's informer indexer (indexer.List() returns the cached object
    pointers, so there is no per-collection deep-copy of every CRD — unlike client.List,
    this stays O(N) pointer reads even with many TaskActions). Off the reconcile hot path;
    runs on the SDK's collection goroutine ~once per export.
  • taskaction.crd.size_bytes (Int64Histogram) — serialized (JSON) size of a
    TaskAction CRD, recorded once per reconcile.
  • taskaction.k8s.duration (Float64Histogram, op/error attributes) —
    per-operation latency of TaskAction CRD calls to the Kubernetes API
    (get/update/status_update), timed inline at the call sites and recorded via
    recordK8sOp.

Registration short-circuits to a no-op when metrics are disabled (the meter provider is
a noop provider), and every record helper is nil-safe, so the controller degrades to no
custom metrics rather than failing setup.
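
The phase tally behind taskaction.active can be sketched in isolation. The block below is a minimal illustration, not the merged code: the taskAction struct and the unknownPhase constant are hypothetical stand-ins, and only the two rules named in the test list (empty phase counted as Unknown, nil entries skipped) are taken from this PR.

```go
package main

import "fmt"

// taskAction is a hypothetical stand-in for the real TaskAction CRD type;
// only the phase matters for the tally.
type taskAction struct {
	Phase string
}

// unknownPhase is the assumed label for objects that have no phase yet.
const unknownPhase = "Unknown"

// countByPhase tallies objects by phase. Nil entries are skipped and an
// empty phase is bucketed under unknownPhase.
func countByPhase(items []*taskAction) map[string]int64 {
	counts := make(map[string]int64)
	for _, ta := range items {
		if ta == nil {
			continue
		}
		phase := ta.Phase
		if phase == "" {
			phase = unknownPhase
		}
		counts[phase]++
	}
	return counts
}

func main() {
	items := []*taskAction{{Phase: "Running"}, {Phase: "Running"}, {Phase: ""}, nil}
	fmt.Println(countByPhase(items))
}
```

Keeping the tally a pure function of a slice is what lets a test exercise it without a cluster, a cache, or a meter provider.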

2. Reduce server RPC metric cardinality

Add otelconnect.WithoutServerPeerAttributes() to the server interceptors across all
services (runs, actions, app, internal-app, dataproxy, cache, secret, events, executor).
otelconnect otherwise tags rpc_server_* metrics/spans with net.peer.name and
net.peer.port; on the server side these are the caller's IP and ephemeral source
port, so the port is a new value per connection. That makes each series a one-shot
(rate() ≈ 0), explodes cardinality, and grows the cumulative OTLP payload past the
collector's default 4 MiB gRPC limit. This drops those two attributes from server-side
telemetry only; client-side telemetry (where net.peer.* is the server's fixed address)
is untouched, and no metric is removed — only two high-cardinality labels.
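
The cardinality effect can be shown with a toy model. This is only a sketch under stated assumptions: seriesKey is a hypothetical, simplified label set, the IP address and port range are made up, and clearing the two fields merely mimics what dropping net.peer.name and net.peer.port does to the number of distinct series.

```go
package main

import "fmt"

// seriesKey is a hypothetical, simplified label set for one rpc_server_* series.
type seriesKey struct {
	method   string
	peerName string
	peerPort int
}

// countSeries returns how many distinct series a set of requests produces.
// When dropPeer is true the peer labels are cleared first.
func countSeries(reqs []seriesKey, dropPeer bool) int {
	seen := make(map[seriesKey]struct{})
	for _, r := range reqs {
		if dropPeer {
			r.peerName, r.peerPort = "", 0
		}
		seen[r] = struct{}{}
	}
	return len(seen)
}

// simulate builds n requests to one method from one caller, each arriving
// from a fresh ephemeral source port (address and ports are illustrative).
func simulate(n int) []seriesKey {
	reqs := make([]seriesKey, 0, n)
	for i := 0; i < n; i++ {
		reqs = append(reqs, seriesKey{method: "Record", peerName: "10.0.0.7", peerPort: 32768 + i})
	}
	return reqs
}

func main() {
	reqs := simulate(1000)
	fmt.Println("with peer labels:   ", countSeries(reqs, false))
	fmt.Println("without peer labels:", countSeries(reqs, true))
}
```

In this model every request creates its own series while the port label is present, and all of them collapse into a single series once it is dropped.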

Deliberately not re-implemented (already provided)

| Already available | Source |
| --- | --- |
| reconcile rate/errors, workqueue work/queue duration | controller-runtime metrics server (:10254) |
| event → event-proxy send latency | otelconnect-wrapped events client: rpc_client_duration{rpc_method="Record"} |

How was this patch tested?

  • Unit tests (metrics_test.go): TestRegisterTaskActionMetrics (gauge observes
    the injected phase counter), TestCountByPhase (pure tally: empty phase → Unknown,
    nil entries skipped), TestObserveCRDSize (records and is nil-safe), TestRecordK8sOp
    (records latency and is nil-safe).
  • go build ./..., go vet, and gofmt clean.
  • Verified live on a dev cluster: the taskaction_* metrics flow through an OTel
    collector → Prometheus and render in Grafana next to the controller-runtime and RPC
    panels.

Labels

added

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Copilot AI review requested due to automatic review settings June 8, 2026 22:26
@github-actions github-actions Bot added the flyte2 label Jun 8, 2026
@pingsutw pingsutw self-assigned this Jun 8, 2026
@pingsutw pingsutw added this to the V2 GA milestone Jun 8, 2026
@pingsutw pingsutw marked this pull request as draft June 8, 2026 22:28

Copilot AI left a comment

Contributor
Pull request overview

Adds OpenTelemetry business metrics to the executor’s TaskAction controller to improve operator visibility into TaskAction CRD volume, phases, and CRD sizes, integrating with the existing otelutils.GetMeterProvider("executor") pipeline.

Changes:

  • Registers OTel instruments for TaskAction metrics: an active-by-phase observable gauge and a CRD-size histogram.
  • Records TaskAction CRD JSON size once per reconcile.
  • Adds unit tests validating metric registration and nil-safe CRD size observation.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| executor/pkg/controller/taskaction_controller.go | Initializes controller metrics and records CRD size during reconcile. |
| executor/pkg/controller/metrics.go | Implements OTel metric registration and observation logic for TaskAction controller. |
| executor/pkg/controller/metrics_test.go | Adds unit tests for metric registration and CRD-size observation behavior. |

Comment thread executor/pkg/controller/metrics.go Outdated
Copilot AI review requested due to automatic review settings June 8, 2026 22:37
@pingsutw pingsutw force-pushed the taskaction-otel-metrics branch from 2fc6796 to 81e9c1f Compare June 8, 2026 22:37

Copilot AI left a comment

Contributor
Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

Comment thread executor/pkg/controller/metrics.go Outdated
Comment thread executor/setup.go
pingsutw added 2 commits June 10, 2026 12:22
Build on the meter provider from #7459 to instrument the TaskAction controller
with two metrics the framework doesn't provide:

- taskaction.active (Int64ObservableGauge): number of TaskAction CRDs by plugin
  phase, observed asynchronously by listing from the controller cache.
- taskaction.crd.size_bytes (Int64Histogram): serialized size of a TaskAction
  CRD, recorded per reconcile.

Both are emitted via otelutils.GetMeterProvider("executor"), so they flow
through the same OTLP pipeline as the RPC metrics. Registration is non-fatal
(degrades to no custom metrics) and observeCRDSize is nil-safe.

Reconcile rate/latency, workqueue depth, and k8s API r/w latency are already
exposed by the controller-runtime metrics server; event-proxy send latency is
already captured by the otelconnect-wrapped events client.

Signed-off-by: Kevin Su <pingsutw@apache.org>
- taskaction.k8s.duration{op,error}: per-operation latency (get / update /
  status_update) of TaskAction CRD calls to the Kubernetes API, recorded via
  thin timed wrappers around the embedded client. Fills the gap left by the
  controller-runtime metrics server, which does not register
  rest_client_request_duration_seconds in this version.

- Add otelconnect.WithoutServerPeerAttributes() to every server-side interceptor
  (runs, actions, app, internal-app, dataproxy, cache, secret, events, executor).
  otelconnect otherwise tags rpc_server_* metrics with net.peer.port (the client's
  ephemeral source port), so each series is hit once -> rate() is always ~0 and the
  cumulative OTLP payload grows unbounded (exceeds the collector's default 4MiB).

Signed-off-by: Kevin Su <pingsutw@apache.org>
@pingsutw pingsutw force-pushed the taskaction-otel-metrics branch from 81e9c1f to 9af0d96 Compare June 10, 2026 19:23
…of wrappers

Apply review feedback on the TaskAction OTel metrics:

- Inject the meter provider into registerTaskActionMetrics /
  NewTaskActionReconciler instead of hardcoding "executor" in metrics.go.
  The name lives next to otelutils.RegisterProvidersWithContext in
  executor/setup.go, so a rename can no longer silently route the metrics
  to the noop provider.
- Replace the crdGet/crdUpdate/crdStatusUpdate wrappers with a
  client.Client decorator (newInstrumentedClient) embedded in the
  reconciler. Call sites use the idiomatic r.Get/r.Update/r.Status().Update
  again and cannot accidentally bypass the timing; non-TaskAction objects
  pass through untimed. Op labels become typed constants.
- With a real (injectable) provider, tests now assert recorded data via a
  manual reader: CRD size sum, op/error labels on taskaction.k8s.duration,
  gauge phase counts through the async callback, and pass-through
  behaviour of the instrumented client.

Signed-off-by: Kevin Su <pingsutw@apache.org>
Copilot AI review requested due to automatic review settings June 10, 2026 19:48

Copilot AI left a comment

Contributor
Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.

Comment thread executor/pkg/controller/metrics.go
Comment thread executor/pkg/controller/metrics.go Outdated
Comment thread executor/pkg/controller/metrics.go Outdated
Comment thread executor/pkg/controller/metrics.go
Comment on lines +11 to +18
// newInstrumentedClient wraps c so TaskAction CRD operations (Get, Update, and
// Status().Update) are timed under taskaction.k8s.duration. Operations on other
// object types pass through untimed. When metrics registration failed (m == nil)
// it returns c unchanged, so callers can wrap unconditionally.
//
// The reconciler embeds the wrapped client, which makes the instrumentation
// structural: call sites use the idiomatic r.Get/r.Update/r.Status().Update and
// cannot accidentally bypass the timing.
Comment thread app/setup.go
pingsutw and others added 2 commits June 10, 2026 13:55
Drop the instrumentedClient wrapper in favour of explicit inline timing at
each call site: start := time.Now() before the operation, then
metrics.recordK8sOp(ctx, op, start, err) right after. The timing is now
visible where it happens, with no hidden interception behind the embedded
client. recordK8sOp stays nil-receiver-safe so struct-literal constructed
reconcilers (tests) and failed registration degrade to no-ops.

Signed-off-by: Kevin Su <pingsutw@apache.org>
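
The inline-timing pattern described in this commit message can be sketched with the standard library alone. Everything here is a hypothetical stand-in: the metrics struct appends to a slice where the real helper would record to the taskaction.k8s.duration histogram, and getObject replaces an actual r.Get call.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// sample is one recorded observation. A real implementation would record to
// an OTel histogram instead of appending to a slice.
type sample struct {
	op      string
	failed  bool
	seconds float64
}

// metrics is a hypothetical stand-in for the controller's metrics holder.
type metrics struct {
	samples []sample
}

// recordK8sOp records how long op took since start. It is safe to call on a
// nil receiver, in which case it does nothing.
func (m *metrics) recordK8sOp(_ context.Context, op string, start time.Time, err error) {
	if m == nil {
		return
	}
	m.samples = append(m.samples, sample{op: op, failed: err != nil, seconds: time.Since(start).Seconds()})
}

// getObject stands in for a Kubernetes API call such as r.Get.
func getObject(fail bool) error {
	if fail {
		return errors.New("not found")
	}
	return nil
}

// demo times one failing call inline and also exercises the nil receiver.
func demo() []sample {
	ctx := context.Background()
	m := &metrics{}

	// Inline timing: capture start, run the operation, record right after.
	start := time.Now()
	err := getObject(true)
	m.recordK8sOp(ctx, "get", start, err)

	// A nil *metrics (for example after failed registration) is a no-op.
	var disabled *metrics
	disabled.recordK8sOp(ctx, "get", time.Now(), nil)

	return m.samples
}

func main() {
	s := demo()
	fmt.Printf("%d sample(s), op=%s, failed=%v\n", len(s), s[0].op, s[0].failed)
}
```

The trade-off against a client decorator is the one the commit message names: the timing is visible at each call site, at the cost of relying on every call site to include it.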
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Kevin Su <pingsutw@gmail.com>
Copilot AI review requested due to automatic review settings June 10, 2026 22:32

Copilot AI left a comment

Contributor
Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.

Comment thread executor/pkg/controller/metrics.go
Comment thread executor/pkg/controller/metrics.go Outdated
Comment thread executor/pkg/controller/metrics.go Outdated
Comment thread executor/pkg/controller/metrics.go
Comment thread executor/setup.go
Comment on lines 140 to 144
otelInterceptor, err := otelconnect.NewInterceptor(
	otelconnect.WithTracerProvider(otelutils.GetTracerProvider(otelServiceName)),
	otelconnect.WithMeterProvider(otelutils.GetMeterProvider(otelServiceName)),
	otelconnect.WithoutServerPeerAttributes(),
)
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Kevin Su <pingsutw@gmail.com>
Copilot AI review requested due to automatic review settings June 10, 2026 22:41
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Kevin Su <pingsutw@gmail.com>

Copilot AI left a comment

Contributor
Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

Comment thread executor/pkg/controller/metrics_test.go Outdated
Comment on lines +3 to +25
import (
"context"
"encoding/json"
"errors"
"testing"
"time"

"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
"go.opentelemetry.io/otel/attribute"
sdkmetric "go.opentelemetry.io/otel/sdk/metric"
"go.opentelemetry.io/otel/sdk/metric/metricdata"
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/runtime"
"sigs.k8s.io/controller-runtime/pkg/client/fake"

flyteorgv1 "github.com/flyteorg/flyte/v2/executor/api/v1"
)

// newTestMetricsScheme builds a scheme with the TaskAction CRD and core types
// (the latter for asserting non-TaskAction operations pass through untimed).
func newTestMetricsScheme(t *testing.T) *runtime.Scheme {
A "code review suggestions" commit dropped the `func (m *taskActionMetrics)
observeCRDSize(...)` line, leaving its body at package scope and breaking the
build. Restore the signature.

Signed-off-by: Kevin Su <pingsutw@apache.org>
Copilot AI review requested due to automatic review settings June 10, 2026 23:30

Copilot AI left a comment

Contributor
Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

Comment thread executor/pkg/controller/metrics.go
Comment thread executor/pkg/controller/metrics.go Outdated
@pingsutw pingsutw marked this pull request as ready for review June 11, 2026 00:10
…eep-copy)

The taskaction.active gauge listed CRDs via client.List on every metrics
collection, which deep-copies every TaskAction out of the informer cache.
At high CRD counts that is O(N) full-object copies plus GC churn each cycle.

Read straight from the SharedIndexInformer's indexer instead: indexer.List()
returns the cached object pointers without copying. Extract a pure countByPhase
tally and inject it via cachedPhaseCounter(mgr.GetCache()), which also decouples
registerTaskActionMetrics from the client and simplifies its test.

Signed-off-by: Kevin Su <pingsutw@apache.org>
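
The difference between a copying list and a pointer list, and the way the counter is injected, can be illustrated with a toy cache. This is a sketch under assumptions, not the controller code: the cache type, listCopies, and listPointers are hypothetical stand-ins for client.List and the informer indexer's List, and cachedPhaseCounter here takes that toy cache rather than a controller-runtime cache.

```go
package main

import "fmt"

// taskAction is a hypothetical stand-in for the TaskAction CRD.
type taskAction struct {
	Phase string
	Spec  []byte
}

// cache is a toy in-memory store of object pointers.
type cache struct {
	objs []*taskAction
}

// listCopies mimics a client.List-style read: every object is deep-copied out.
func (c *cache) listCopies() []taskAction {
	out := make([]taskAction, 0, len(c.objs))
	for _, o := range c.objs {
		out = append(out, taskAction{Phase: o.Phase, Spec: append([]byte(nil), o.Spec...)})
	}
	return out
}

// listPointers mimics an indexer-style read: only the pointers are handed out.
func (c *cache) listPointers() []*taskAction {
	out := make([]*taskAction, len(c.objs))
	copy(out, c.objs)
	return out
}

// phaseCounter is the function type injected into metric registration.
type phaseCounter func() map[string]int64

// cachedPhaseCounter returns a phaseCounter that reads the cached pointers on
// every call without copying any object.
func cachedPhaseCounter(c *cache) phaseCounter {
	return func() map[string]int64 {
		counts := make(map[string]int64)
		for _, ta := range c.listPointers() {
			if ta != nil {
				counts[ta.Phase]++
			}
		}
		return counts
	}
}

func main() {
	c := &cache{objs: []*taskAction{
		{Phase: "Queued", Spec: make([]byte, 1<<10)},
		{Phase: "Running", Spec: make([]byte, 1<<10)},
		{Phase: "Running", Spec: make([]byte, 1<<10)},
	}}
	count := cachedPhaseCounter(c)
	fmt.Println(count())
	fmt.Println("pointer list shares objects:", c.listPointers()[0] == c.objs[0])
	fmt.Println("copy list duplicates spec:  ", &c.listCopies()[0].Spec[0] != &c.objs[0].Spec[0])
}
```

Because registration only sees a func() map[string]int64, a test can pass any closure in place of the cache-backed one.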
Comment on lines +141 to +156
		return countByPhase(items)
	}
}

// observeCRDSize records the serialized size of a TaskAction CRD. No-op when
// custom metrics are disabled (m == nil).
func (m *taskActionMetrics) observeCRDSize(ctx context.Context, ta *flyteorgv1.TaskAction) {
	if m == nil || m.crdSizeBytes == nil {
		return
	}
	if b, err := json.Marshal(ta); err == nil {
		m.crdSizeBytes.Record(ctx, int64(len(b)))
	}
}

// recordK8sOp records the latency of a Kubernetes API operation against the

@pingsutw pingsutw Jun 11, 2026

Member Author
I'm going to run some load tests and will remove this if it causes too much overhead.

Member
This will definitely cause a good amount of load. We find that (de)serialization is most of the CPU load in propeller v1.

metrics, err := registerTaskActionMetrics(meterProvider, cachedPhaseCounter(cache))
if err != nil {
	// Non-fatal: degrade to no custom metrics rather than failing controller setup.
	log.Log.Error(err, "failed to register TaskAction OTel metrics")

Member
You could argue it should fail. I think something really bad would have to happen for this to error.

@pingsutw pingsutw merged commit cbb57af into main Jun 12, 2026
21 checks passed
@pingsutw pingsutw deleted the taskaction-otel-metrics branch June 12, 2026 16:27
3 participants