[flyte2] Instrument the actions service (watcher metrics + dropped-updates counter) #7450

@pingsutw

Part of #7445. Depends on #7446 (the /metrics endpoint + initialized Scope must exist first).

Summary

Instrument the actions service with Prometheus metrics: implement the existing dropped-updates counter TODO, and add throughput / latency / queue-depth metrics for the TaskAction watcher.

Background

The actions service is already partly wired for metrics — it just has nothing to plug into yet:

  • actions/setup.go:39 already passes sc.Scope into NewActionsClient(...).
  • actions/k8s/client.go:91 already uses scope.NewSubScope("actions_filter") for the dedup bloom filter.
  • actions/k8s/client.go:65 has an explicit TODO: // TODO: add a prometheus counter for dropped updates when metrics are wired up.

Note on the metrics scope: When run via the unified manager (manager/cmd/main.go:75), sc.Scope is already initialized (promutils.NewScope("flyte")) before actions.Setup runs, so the bloom-filter sub-scope at client.go:91 works and there is no panic. This issue depends on #7446 because #7446 mounts the /metrics endpoint; without it, the metrics added here are registered in the default registry but never exposed to a scrape. #7446 also initializes sc.Scope at the framework level, which makes the standalone actions/cmd/main.go binary safe as well. That path currently leaves sc.Scope nil, so the scope.NewSubScope(...) call at client.go:90-91 would panic there, because RecordFilterSize defaults to 1 << 23, which is greater than 0, so the filter sub-scope is created by default.
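
To make the failure mode in the note concrete, here is a minimal sketch of why a nil scope panics and how a guard would surface it as an error instead. Everything in it is a stand-in: `Scope`, `namedScope`, and `filterScope` are hypothetical names, the `":"` separator is illustrative, and the real code uses promutils.Scope. It is not part of the requested change, since #7446 is expected to initialize the scope.

```go
package main

import "errors"

// Scope is a hypothetical stand-in for the small part of promutils.Scope
// that matters here.
type Scope interface {
	NewSubScope(name string) Scope
}

// namedScope is a trivial Scope that only tracks its own name.
type namedScope struct{ name string }

func (s namedScope) NewSubScope(name string) Scope {
	return namedScope{name: s.name + ":" + name}
}

// filterScope mirrors the shape of the logic described in the note: a
// sub-scope is requested only when the record filter is enabled (size > 0).
// Calling parent.NewSubScope on a nil interface would panic, so the nil case
// is reported as an error instead.
func filterScope(parent Scope, recordFilterSize int) (Scope, error) {
	if recordFilterSize <= 0 {
		return nil, nil // filter disabled, no scope needed
	}
	if parent == nil {
		return nil, errors.New("metrics scope is nil; initialize it before building the client")
	}
	return parent.NewSubScope("actions_filter"), nil
}
```

With a default-sized filter (1 << 23) and a nil parent, `filterScope` returns an error where an unguarded call would panic.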

What to do

Using the Scope available on ActionsClient (passed in via NewActionsClient), add metrics under a dedicated sub-scope (e.g. scope.NewSubScope("watcher")):

  1. Dropped updates counter — implement the TODO at actions/k8s/client.go:65. Increment a counter whenever a watch update is dropped (e.g. buffer full / channel send would block).
  2. Watcher throughput — counter of TaskAction events processed, labeled by result (success/error).
  3. Processing latency — a timer/histogram around per-event handling in the watch worker loop.
  4. Queue/buffer depth — a gauge for the watch buffer occupancy (config WatchBufferSize), updated as events are enqueued/dequeued (or sampled periodically).
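
The four metrics above could be wired roughly as follows. This is a sketch under stated assumptions, not the actual client.go code: `counter`, `gauge`, and `timer` are stdlib stand-ins for the Prometheus types (real code would create them from the promutils Scope), updates are modeled as plain strings, and `watcherMetrics`, `enqueue`, `drain`, and `handleUpdate` are hypothetical names.

```go
package main

import (
	"errors"
	"sync/atomic"
	"time"
)

// counter, gauge, and timer are stdlib stand-ins so the sketch compiles
// without external modules.
type counter struct{ n atomic.Int64 }

func (c *counter) Inc()         { c.n.Add(1) }
func (c *counter) Value() int64 { return c.n.Load() }

type gauge struct{ v atomic.Int64 }

func (g *gauge) Set(v int64)  { g.v.Store(v) }
func (g *gauge) Value() int64 { return g.v.Load() }

type timer struct{ count, sumNanos atomic.Int64 }

func (t *timer) Observe(d time.Duration) {
	t.count.Add(1)
	t.sumNanos.Add(int64(d))
}

// watcherMetrics groups the four metrics from the list above.
type watcherMetrics struct {
	dropped     counter // updates discarded because the buffer was full
	success     counter // events handled without error (result="success")
	failure     counter // events whose handler failed (result="error")
	latency     timer   // per-event handling time
	bufferDepth gauge   // number of updates currently buffered
}

// enqueue attempts a non-blocking send. If the buffer is full, the update is
// dropped and counted rather than blocking the watch loop.
func enqueue(buf chan string, update string, m *watcherMetrics) bool {
	select {
	case buf <- update:
		m.bufferDepth.Set(int64(len(buf)))
		return true
	default:
		m.dropped.Inc()
		return false
	}
}

// handleUpdate is a placeholder handler that rejects empty updates.
func handleUpdate(u string) error {
	if u == "" {
		return errors.New("empty update")
	}
	return nil
}

// drain handles every update currently buffered, recording the result and
// latency for each one and refreshing the depth gauge after each receive.
func drain(buf chan string, handle func(string) error, m *watcherMetrics) {
	for {
		select {
		case u := <-buf:
			m.bufferDepth.Set(int64(len(buf)))
			start := time.Now()
			err := handle(u)
			m.latency.Observe(time.Since(start))
			if err != nil {
				m.failure.Inc()
			} else {
				m.success.Inc()
			}
		default:
			return // buffer is empty
		}
	}
}
```

Updating the gauge on every send and receive keeps it exact but adds a write to the hot path; sampling `len(buf)` periodically, as the list allows, trades precision for less overhead.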

Acceptance criteria

  • /metrics exposes a dropped-updates counter, watcher event throughput (by result), processing latency, and buffer depth for the actions service.
  • The TODO at actions/k8s/client.go:65 is implemented and removed.
  • Metrics are created once under a dedicated sub-scope (no Prometheus duplicate-registration panics).
  • A unit test verifies the dropped-updates counter increments when an update is dropped, and that the throughput counter increments on event processing.
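
The "created once" criterion can be illustrated with a small stand-in registry. This is a sketch only: `registry` imitates the panic-on-duplicate behavior of Prometheus registration, and `client`, `newClient`, `startWatch`, and the metric names are hypothetical, not the names used in the actions service.

```go
package main

import "fmt"

// registry is a stand-in for a Prometheus registerer: it panics when the
// same metric name is registered twice.
type registry struct{ names map[string]bool }

func newRegistry() *registry { return &registry{names: map[string]bool{}} }

func (r *registry) mustRegister(name string) {
	if r.names[name] {
		panic(fmt.Sprintf("metric already registered: %s", name))
	}
	r.names[name] = true
}

// watcherMetricNames are hypothetical names under a "watcher" sub-scope.
var watcherMetricNames = []string{
	"watcher:dropped_updates",
	"watcher:events_processed",
	"watcher:process_latency",
	"watcher:buffer_depth",
}

// client owns its metrics; they are registered exactly once, in the
// constructor.
type client struct{ reg *registry }

func newClient(reg *registry) *client {
	for _, name := range watcherMetricNames {
		reg.mustRegister(name)
	}
	return &client{reg: reg}
}

// startWatch may be called repeatedly (for example on reconnect) because it
// only uses metrics created in newClient and never registers new ones.
// It returns the number of registered metrics so callers can check that the
// count does not grow.
func (c *client) startWatch() int {
	return len(c.reg.names)
}
```

Registering inside the constructor rather than inside the watch loop means restarts of the watcher reuse the same metrics, while constructing a second client against the same registry still fails loudly.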

Pointers

  • actions/k8s/client.go — the watcher, worker loop, buffer, and the dropped-updates TODO (line 65); constructor NewActionsClient (line 77) already receives a promutils.Scope.
  • actions/setup.go:31-40 — where NewActionsClient is constructed with sc.Scope.
  • flytestdlib/promutils/scope.go: Scope helpers (MustNewCounter, MustNewGauge, MustNewStopWatch, NewSubScope).
