
fix(emitter): P1-1 - Add observability for silent metric query errors #177

Open
peatey wants to merge 8 commits into develop from fix/p1-1-silent-metric-query-errors

Conversation

peatey commented Apr 13, 2026

Contributor

Phase 1 Implementation:

  • Add FailedQueries field to MetricsSnapshot struct to track failed queries
  • Create awaitWithLog helper function that logs errors and increments Prometheus counter
  • Add finops_agent_metric_query_failures_total counter metric with query_name label
  • Replace all 100+ bare Await() calls in snapshotMetrics with awaitWithLog
  • Failed queries are now logged as warnings and tracked in metrics
  • MetricsSnapshot now includes list of failed query names for downstream processing

This provides visibility into which Prometheus queries are failing and prevents silent bad cost calculations for actively running containers.

Addresses: P1-1 Silent Metric Query Error Discards
File: pkg/emitter/snapshot.go:413-511
Pipeline: Kubecost allocation and asset (all resolutions)
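
The helper described above might look roughly like the sketch below. It is not the code from pkg/emitter/snapshot.go: the `queryFuture` interface, the `staticFuture` type, the result type, and the `failureCounter` stand-in (used in place of the real Prometheus client so the example runs with only the standard library) are all assumptions made for illustration. Only the names `awaitWithLog`, `FailedQueries`, and the `query_name` label come from the description.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"sync"
)

// queryFuture is a hypothetical stand-in for the future type whose
// Await() calls are being wrapped; the real interface may differ.
type queryFuture interface {
	Await() ([]float64, error)
}

// failureCounter imitates a counter keyed by a query_name label. In the PR
// this role is played by finops_agent_metric_query_failures_total.
type failureCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

func newFailureCounter() *failureCounter {
	return &failureCounter{counts: make(map[string]int)}
}

// Inc adds one failure for the given query name.
func (c *failureCounter) Inc(queryName string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[queryName]++
}

// Get returns the number of failures recorded for the given query name.
func (c *failureCounter) Get(queryName string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[queryName]
}

// awaitWithLog awaits the future. On error it logs a warning, increments the
// counter, and appends the query name to failed. The (possibly nil) result is
// returned either way, so call sites keep the shape of a bare Await().
func awaitWithLog(f queryFuture, name string, c *failureCounter, failed *[]string) []float64 {
	res, err := f.Await()
	if err != nil {
		log.Printf("WARN: metric query %q failed: %v", name, err)
		c.Inc(name)
		*failed = append(*failed, name)
	}
	return res
}

// staticFuture is an illustrative future that returns a fixed result or error.
type staticFuture struct {
	res []float64
	err error
}

func (s staticFuture) Await() ([]float64, error) { return s.res, s.err }

// errTimeout is a made-up error used to simulate a failing query.
var errTimeout = errors.New("query timed out")

func main() {
	counter := newFailureCounter()
	var failed []string // would back MetricsSnapshot.FailedQueries

	used := awaitWithLog(staticFuture{res: []float64{0.5}}, "pvUsedAverage", counter, &failed)
	active := awaitWithLog(staticFuture{err: errTimeout}, "pvActiveMinutes", counter, &failed)

	fmt.Println(len(used), len(active), failed, counter.Get("pvActiveMinutes"))
}
```

Passing the counter and the slice explicitly keeps the sketch testable; the actual change may instead use a package-level metric.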

Copilot AI left a comment


Pull request overview

Adds observability for previously-silent Prometheus/OpenCost metrics query failures during snapshot generation, so failures can be detected via logs, Prometheus counters, and included in the produced snapshot data.

Changes:

  • Introduces a Prometheus counter (finops_agent_metric_query_failures_total) labeled by query_name and an awaitWithLog helper to log and count query failures.
  • Replaces the large set of bare Await() calls in snapshotMetrics with awaitWithLog, collecting failed query names.
  • Extends MetricsSnapshot with a FailedQueries field for downstream consumers.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| pkg/emitter/snapshot.go | Registers a failure counter, adds awaitWithLog, and captures/logs failed query names while awaiting metric query futures. |
| pkg/emitter/emitter.go | Extends MetricsSnapshot with FailedQueries to surface failed query names to downstream processing. |


Comment thread pkg/emitter/snapshot.go Outdated
Comment thread pkg/emitter/snapshot.go
- All query names now match their variable names (consistent camelCase)
- Makes dashboards, alerts, and downstream processing easier
- Prevents confusion with inconsistent casing (e.g., 'PVActiveMinutes' vs 'pvUsedAverage')
- Improves observability and maintainability
peatey added 4 commits April 13, 2026 12:04
- Add tests for successful query execution
- Add tests for failed query error handling
- Add tests for multiple failures tracking
- Add tests for mixed success/failure scenarios
- Add test for Prometheus counter registration and increment
- Use mock queryFuture interface for testability
- All tests verify logging, counter increments, and FailedQueries tracking

Validates P1-1 Phase 1 observability implementation
…etheus counter

The FailedQueries []string field was added but has no downstream consumers.
The Prometheus counter (finops_agent_metric_query_failures_total) provides
all necessary observability for alerting and dashboards without the memory
overhead of storing query names in every snapshot.

Changes:
- Removed FailedQueries field from MetricsSnapshot struct
- Simplified awaitWithLog to only log and increment counter
- Updated all 100+ call sites to remove unused parameter
- Updated all tests to focus on counter validation
- Reduces memory footprint per snapshot

Phase 2 can re-add FailedQueries when there's a concrete use case for it.
…fast behavior

Added comment to document that Phase 1 provides observability (logs + Prometheus
counter) but maintains existing fail-fast behavior via grp.HasErrors() check.

The awaitWithLog function logs and counts each individual query failure, but
if ANY query in the group fails, the entire snapshot is still discarded via
the existing grp.HasErrors() check. This is intentional for Phase 1.

Phase 2 would modify the error handling to allow partial results based on
failure counts/types, enabling the Kubecost allocation pipeline to decide
whether a partial snapshot is acceptable.
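
The interaction between per-query observability and the group-level fail-fast check can be illustrated with the sketch below. Only the name `HasErrors` comes from the commit message; the `queryGroup`, `queryResult`, and `snapshot` types and the `buildSnapshot` function are hypothetical stand-ins, not the types in pkg/emitter.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// queryGroup is a hypothetical stand-in for the grp value mentioned above.
type queryGroup struct {
	errs []error
}

// record stores a non-nil error on the group.
func (g *queryGroup) record(err error) {
	if err != nil {
		g.errs = append(g.errs, err)
	}
}

// HasErrors reports whether any query in the group failed.
func (g *queryGroup) HasErrors() bool { return len(g.errs) > 0 }

// queryResult is an illustrative container for one query's outcome.
type queryResult struct {
	value float64
	err   error
}

// snapshot is an illustrative, heavily simplified snapshot.
type snapshot struct {
	values map[string]float64
}

// errUnavailable is a made-up error used to simulate a failing query.
var errUnavailable = errors.New("prometheus unavailable")

// buildSnapshot logs and counts every individual failure, but still discards
// the whole snapshot if any query failed (the Phase 1 behavior described).
func buildSnapshot(names []string, results map[string]queryResult, failures map[string]int) (*snapshot, error) {
	grp := &queryGroup{}
	snap := &snapshot{values: make(map[string]float64)}

	for _, name := range names {
		r := results[name]
		if r.err != nil {
			log.Printf("WARN: metric query %q failed: %v", name, r.err)
			failures[name]++
			grp.record(r.err)
			continue
		}
		snap.values[name] = r.value
	}

	if grp.HasErrors() {
		return nil, fmt.Errorf("%d of %d metric queries failed; discarding snapshot", len(grp.errs), len(names))
	}
	return snap, nil
}

func main() {
	names := []string{"pvUsedAverage", "pvActiveMinutes"}
	results := map[string]queryResult{
		"pvUsedAverage":   {value: 0.5},
		"pvActiveMinutes": {err: errUnavailable},
	}
	failures := make(map[string]int)

	snap, err := buildSnapshot(names, results, failures)
	fmt.Println(snap == nil, err, failures)
}
```

A Phase 2 change, as described, would replace the final `HasErrors` check with a policy that inspects the failure counts or types before deciding whether to keep a partial snapshot.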
…nit tests

Restored original integration tests (TestSnapshottingMetricsStaggeredWindows,
TestSnapshottingTemporaryCache) that were accidentally removed. These tests
exercise the full snapshot pipeline end-to-end through mock data sources.

The new awaitWithLog unit tests (5 test cases) complement these integration
tests by providing focused coverage of the error logging and metrics behavior.

Also added .bob/ to .gitignore to prevent session metadata from being committed.

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 3 changed files in this pull request and generated 4 comments.



Comment thread pkg/emitter/snapshot.go
Comment thread pkg/emitter/snapshot_test.go Outdated
Comment thread pkg/emitter/snapshot_test.go Outdated
Comment thread pkg/emitter/snapshot.go
peatey and others added 2 commits April 13, 2026 12:20
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>