fix(emitter): P1-1 - Add observability for silent metric query errors#177
Open
peatey wants to merge 8 commits into
Open
fix(emitter): P1-1 - Add observability for silent metric query errors#177peatey wants to merge 8 commits into
peatey wants to merge 8 commits into
Conversation
Phase 1 Implementation: - Add FailedQueries field to MetricsSnapshot struct to track failed queries - Create awaitWithLog helper function that logs errors and increments Prometheus counter - Add finops_agent_metric_query_failures_total counter metric with query_name label - Replace all 100+ bare Await() calls in snapshotMetrics with awaitWithLog - Failed queries are now logged as warnings and tracked in metrics - MetricsSnapshot now includes list of failed query names for downstream processing This provides visibility into which Prometheus queries are failing and prevents silent bash cost calculations for actively running containers. Addresses: P1-1 Silent Metric Query Error Discards File: pkg/emitter/snapshot.go:413-511 Pipeline: Kubecost allocation and asset (all resolutions)
There was a problem hiding this comment.
Pull request overview
Adds observability for previously-silent Prometheus/OpenCost metrics query failures during snapshot generation, so failures can be detected via logs, Prometheus counters, and included in the produced snapshot data.
Changes:
- Introduces a Prometheus counter (
finops_agent_metric_query_failures_total) labeled byquery_nameand anawaitWithLoghelper to log and count query failures. - Replaces the large set of bare
Await()calls insnapshotMetricswithawaitWithLog, collecting failed query names. - Extends
MetricsSnapshotwith aFailedQueriesfield for downstream consumers.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/emitter/snapshot.go | Registers a failure counter, adds awaitWithLog, and captures/logs failed query names while awaiting metric query futures. |
| pkg/emitter/emitter.go | Extends MetricsSnapshot with FailedQueries to surface failed query names to downstream processing. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- All query names now match their variable names (consistent camelCase) - Makes dashboards, alerts, and downstream processing easier - Prevents confusion with inconsistent casing (e.g., 'PVActiveMinutes' vs 'pvUsedAverage') - Improves observability and maintainability
- Add tests for successful query execution - Add tests for failed query error handling - Add tests for multiple failures tracking - Add tests for mixed success/failure scenarios - Add test for Prometheus counter registration and increment - Use mock queryFuture interface for testability - All tests verify logging, counter increments, and FailedQueries tracking Validates P1-1 Phase 1 observability implementation
…etheus counter The FailedQueries []string field was added but has no downstream consumers. The Prometheus counter (finops_agent_metric_query_failures_total) provides all necessary observability for alerting and dashboards without the memory overhead of storing query names in every snapshot. Changes: - Removed FailedQueries field from MetricsSnapshot struct - Simplified awaitWithLog to only log and increment counter - Updated all 100+ call sites to remove unused parameter - Updated all tests to focus on counter validation - Reduces memory footprint per snapshot Phase 2 can re-add FailedQueries when there's a concrete use case for it.
…fast behavior Added comment to document that Phase 1 provides observability (logs + Prometheus counter) but maintains existing fail-fast behavior via grp.HasErrors() check. The awaitWithLog function logs and counts each individual query failure, but if ANY query in the group fails, the entire snapshot is still discarded via the existing grp.HasErrors() check. This is intentional for Phase 1. Phase 2 would modify the error handling to allow partial results based on failure counts/types, enabling the Kubecost allocation pipeline to decide whether a partial snapshot is acceptable.
…nit tests Restored original integration tests (TestSnapshottingMetricsStaggeredWindows, TestSnapshottingTemporaryCache) that were accidentally removed. These tests exercise the full snapshot pipeline end-to-end through mock data sources. The new awaitWithLog unit tests (5 test cases) complement these integration tests by providing focused coverage of the error logging and metrics behavior. Also added .bob/ to .gitignore to prevent session metadata from being committed.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 3 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Phase 1 Implementation:
This provides visibility into which Prometheus queries are failing and prevents silent bash cost calculations for actively running containers.
Addresses: P1-1 Silent Metric Query Error Discards
File: pkg/emitter/snapshot.go:413-511
Pipeline: Kubecost allocation and asset (all resolutions)