[flyte2] Instrument the runs service reconcilers (abort-reconciler) with Prometheus metrics #7449

@pingsutw

Description

Part of #7445. Depends on #7446 (the /metrics endpoint + Scope must exist first).

Summary

Add Prometheus metrics to the runs service background reconcilers (starting with the abort reconciler) to observe queue depth, processing throughput, retries, and failures.

Background

runs/service/abort_reconciler.go runs as a background worker (registered in runs/setup.go via sc.AddWorker("abort-reconciler", ...)). It has a worker pool, a bounded queue (QueueSize: 1000), and retry logic (MaxAttempts, InitialDelay, MaxDelay). None of this is currently observable via metrics.

What to do

  1. Thread the metrics Scope introduced in #7446 ([flyte2] Add /metrics endpoint and initialize metrics Scope in the app framework) into service.NewAbortReconciler(...) by extending its config or constructor.
  2. Emit metrics such as:
    • current queue depth / pending items (gauge)
    • items processed (counter, labeled by success/failure)
    • retries / attempts (counter)
    • per-item processing latency (timer/histogram)
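
As a rough illustration of the list above, here is a minimal, stdlib-only sketch of where each metric could be updated around a single item. Everything in it is hypothetical: `counter`, `gauge`, and `timer` are stand-ins for whatever the real Scope helpers return, and `reconcilerMetrics`, `processItem`, and `failNTimes` are illustrative names, not code from runs/service/abort_reconciler.go. Backoff between attempts is left out.

```go
package main

import (
	"errors"
	"sync/atomic"
	"time"
)

// counter and gauge are hypothetical stand-ins for the Prometheus types that
// the real Scope helpers return; they keep this sketch stdlib-only.
type counter struct{ n atomic.Int64 }

func (c *counter) Inc()         { c.n.Add(1) }
func (c *counter) Value() int64 { return c.n.Load() }

type gauge struct{ v atomic.Int64 }

func (g *gauge) Set(v int64)  { g.v.Store(v) }
func (g *gauge) Value() int64 { return g.v.Load() }

// timer accumulates observed durations, loosely mimicking a stopwatch metric.
type timer struct {
	count atomic.Int64
	total atomic.Int64 // nanoseconds
}

func (t *timer) Observe(d time.Duration) {
	t.count.Add(1)
	t.total.Add(int64(d))
}
func (t *timer) Count() int64 { return t.count.Load() }

// reconcilerMetrics groups the four metrics listed above (names are illustrative).
type reconcilerMetrics struct {
	QueueDepth gauge
	Succeeded  counter
	Failed     counter
	Retries    counter
	Latency    timer
}

// processItem runs fn up to maxAttempts times (assumed >= 1) and records the
// outcome. Backoff between attempts is omitted to keep the sketch short.
func processItem(m *reconcilerMetrics, queueLen, maxAttempts int, fn func() error) error {
	m.QueueDepth.Set(int64(queueLen))
	start := time.Now()
	defer func() { m.Latency.Observe(time.Since(start)) }()

	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if attempt > 1 {
			m.Retries.Inc() // every attempt after the first counts as a retry
		}
		if err = fn(); err == nil {
			m.Succeeded.Inc()
			return nil
		}
	}
	m.Failed.Inc() // all attempts exhausted
	return err
}

// failNTimes returns a work function that fails its first n calls; it exists
// only to exercise processItem in examples and tests.
func failNTimes(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return errors.New("transient abort failure")
		}
		return nil
	}
}
```

For example, `processItem(m, 5, 3, failNTimes(2))` would record two retries and one success. Whether success and failure end up as two counters or one labeled counter is a design choice left to the implementation.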

Acceptance criteria

  • /metrics exposes abort-reconciler queue depth, processed count (success/failure), retry count, and processing latency.
  • Metrics use a dedicated sub-scope, e.g. scope.NewSubScope("abort_reconciler"), created once.
  • A unit test verifies that processing an item updates the relevant counters/gauges.
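
The "created once" criterion can be illustrated with a small stdlib-only model. This is a sketch under assumptions, not the promutils API: `scope`, `registry`, `newRootScope`, `MustRegister`, and `reconcilerMetricNames` are hypothetical stand-ins, the ":" name delimiter and the metric names are assumed, and the panic on duplicate registration models how Must-style helpers typically behave.

```go
package main

import "fmt"

// registry is a hypothetical stand-in for a Prometheus registerer: it rejects
// a metric name that has already been registered.
type registry struct{ names map[string]bool }

// scope is a hypothetical stand-in for the metrics Scope; only the naming and
// registration behaviour relevant here is modelled.
type scope struct {
	prefix string
	reg    *registry
}

func newRootScope(name string) scope {
	return scope{prefix: name + ":", reg: &registry{names: map[string]bool{}}}
}

// NewSubScope returns a child scope whose metric names extend the parent prefix.
func (s scope) NewSubScope(name string) scope {
	return scope{prefix: s.prefix + name + ":", reg: s.reg}
}

// MustRegister panics on a duplicate name, as Must-style helpers typically do.
func (s scope) MustRegister(name string) string {
	full := s.prefix + name
	if s.reg.names[full] {
		panic(fmt.Sprintf("duplicate metric %q", full))
	}
	s.reg.names[full] = true
	return full
}

// reconcilerMetricNames registers every reconciler metric under one sub-scope.
// It is meant to be called once, from the constructor, never per item.
func reconcilerMetricNames(root scope) []string {
	sub := root.NewSubScope("abort_reconciler")
	return []string{
		sub.MustRegister("queue_depth"),
		sub.MustRegister("processed_success"),
		sub.MustRegister("processed_failure"),
		sub.MustRegister("retries"),
		sub.MustRegister("processing_latency"),
	}
}
```

In this model, calling `reconcilerMetricNames` a second time on the same root panics, which is why building the metrics in the constructor and reusing them in the run loop is the safer shape.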

Pointers

  • runs/service/abort_reconciler.go — the reconciler implementation and its run loop.
  • runs/setup.go:64-73 — where NewAbortReconciler is constructed and registered as a worker.
  • flytestdlib/promutils/scope.go: Scope helpers (MustNewGauge, MustNewCounter, MustNewStopWatch, NewSubScope).

Notes for contributors
