Skip to content

feat(health): suppress unchanged success-only reports within configurable interval#2284

Open
rsd-darshan wants to merge 12 commits into
NVIDIA:mainfrom
rsd-darshan:feat/variable-frequency-health-reports
Open

feat(health): suppress unchanged success-only reports within configurable interval#2284
rsd-darshan wants to merge 12 commits into
NVIDIA:mainfrom
rsd-darshan:feat/variable-frequency-health-reports

Conversation

@rsd-darshan

@rsd-darshan rsd-darshan commented Jun 6, 2026

Copy link
Copy Markdown

Summary

Fixes #1811

Health clients sent reports to core on every collection tick regardless of whether content had changed. For stable machines, the vast majority of reports are success-only and carry identical data — causing unnecessary API and network load even though the database already deduplicates at storage time via hash_without_timestamps.

  • Add a suppress_unchanged_interval config field (default 5m) to HealthReportSinkConfig
  • Add a per-key last-sent cache in HealthReportSink that hashes report content (excluding timestamps) and skips re-enqueuing a success-only report whose content is unchanged within the interval
  • Reports that contain any alert bypass suppression entirely and are always forwarded immediately
  • Set suppress_unchanged_interval = null to disable suppression

What changes

crates/health/src/config.rs

  • New field suppress_unchanged_interval: Option<Duration> on HealthReportSinkConfig, serialized with humantime_serde (e.g. "5m"), defaults to Some(300s)

crates/health/src/sink/health_report.rs

  • content_hash() — deterministic hash over sorted successes and alerts, excluding timestamps; mirrors the DB-side hash_without_timestamps approach
  • LastSent — lightweight per-key struct tracking content hash and send time
  • Suppression gate in handle_event before queue.save_latest
  • 4 new unit tests: suppress within interval, alerts bypass suppression, changed content forwarded, suppression disabled

Test plan

  • cargo test -p carbide-health --lib — all 147 tests pass
  • Deploy to a staging environment and confirm API call rate drops for stable machines
  • Confirm machines with active alerts still receive timely report updates

@rsd-darshan rsd-darshan requested a review from a team as a code owner June 6, 2026 16:05
@copy-pr-bot

copy-pr-bot Bot commented Jun 6, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3bc3fbb4-915a-414f-aad5-1455f8e4625c

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

…able interval

Health clients sent reports to core on every collection tick regardless of
whether the content had changed. For stable machines the vast majority of
reports are success-only and carry identical data, causing unnecessary API
load even though the database deduplicates at storage time.

Add a per-key last-sent cache in HealthReportSink that hashes report content
(excluding timestamps, mirroring the DB-side hash_without_timestamps logic)
and skips re-enqueuing a success-only report whose hash is unchanged within
the configured suppress_unchanged_interval. Reports that contain any alert
are always forwarded immediately. The interval defaults to 5 minutes and can
be set to null to disable suppression entirely.

Signed-off-by: rsd-darshan <poudeldarshan44@gmail.com>
@rsd-darshan rsd-darshan force-pushed the feat/variable-frequency-health-reports branch from a872b57 to 4760297 Compare June 6, 2026 16:28
@yoks

yoks commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

Thanks @rsd-darshan can you also please add quick criterion test for hashing function? As it sits on hot path we may need to be sure it is not overly heavy on CPU.

@yoks

yoks commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

/ok to test 055260e

@github-actions

github-actions Bot commented Jun 7, 2026

Copy link
Copy Markdown

@rsd-darshan

Copy link
Copy Markdown
Author

Thanks for the feedback! Added a criterion benchmark for the hashing function in the latest commit — it covers five report shapes: empty, small/large success-only, and small/large with alerts, to capture the realistic common case as well as worst-case scenarios.

@yoks

yoks commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

Thanks, can you post results here? Latest CI run also had lint issues.

@yoks

yoks commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

There is also one minor problem. HashMap with last seen report would grow without bounds over lifetime if report keys will drift (e.g. hardware added/removed). So maybe we can think of struct which would evict stale keys (hours old) or evict them on schedule.

@rsd-darshan

Copy link
Copy Markdown
Author

Benchmark results (Apple M-series, release mode):

Report shape Time
empty ~32 ns
small success-only (8 probes) ~1.27 µs
small with alerts (4 alerts) ~1.90 µs
large with alerts (64 alerts) ~52 µs
large success-only (256 probes) ~110 µs

The common case (small success-only reports) is ~1.3 µs per hash. The hashing only runs on success-only reports, so the alert path is unaffected. The large_success_only case (256 probes) at ~110 µs is an extreme outlier unlikely in practice.

Unbounded map growth — good catch. Fixed in the latest commit: entries older than 2 × suppress_unchanged_interval are evicted on each insert, keeping the map bounded to the set of recently active machines.

Benchmarks the report hashing function across five report shapes:
empty, small/large success-only, and small/large with alerts. Covers
the realistic common case (success-only) as well as alert-heavy reports
to surface any CPU cost at scale.

Signed-off-by: rsd-darshan <poudeldarshan44@gmail.com>
The last_sent cache would grow without bound if machine keys drift over
the lifetime of the process (e.g. hardware added and removed). Evict
entries older than 2× the suppress interval on every insert so the map
stays bounded to the set of recently active machines.

Also fix the criterion benchmark for content_hash to build the sink's
HealthReport directly instead of converting from the external type.

Signed-off-by: rsd-darshan <poudeldarshan44@gmail.com>
@rsd-darshan rsd-darshan force-pushed the feat/variable-frequency-health-reports branch from 7736892 to b17108f Compare June 7, 2026 08:53
@rsd-darshan

Copy link
Copy Markdown
Author

@yoks both points addressed in the latest commit — stale entry eviction is in, and benchmark results are in the comment above. Let me know if anything else needs changing!

@ajf

ajf commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

/ok to test 4108f43

@rsd-darshan rsd-darshan requested a review from ajf June 8, 2026 16:55

@yoks yoks left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please address remaining comments to the review.

Thanks!

Comment thread crates/health/src/sink/health_report.rs Outdated
Comment thread crates/health/src/sink/health_report.rs Outdated
Comment thread crates/health/src/sink/health_report.rs Outdated
Comment thread crates/health/src/sink/health_report.rs
Comment thread crates/health/src/sink/health_report.rs Outdated
- Throttle last_sent eviction: track last_evicted timestamp and only
  run retain() when 2x the suppress interval has elapsed, avoiding
  an O(n) scan on every insert
- Remove alert hashing dead code: content_hash only runs on success-only
  reports so the alert branch was unreachable
- Clear last_sent entry on alert so the first all-clear success after
  an alert is never throttled
- Add make_sink() helper in tests to reduce boilerplate
- Rewrite report_with_alerts_never_suppressed test to actually verify
  the alert-then-recovery sequence

Signed-off-by: rsd-darshan <poudeldarshan44@gmail.com>
@rsd-darshan

Copy link
Copy Markdown
Author

All review points addressed in the latest commit:

  • Eviction throttling — introduced LastSentCache that tracks last_evicted and only runs retain() when 2× suppress_interval has elapsed, so it's O(n) at most once per interval rather than on every insert
  • Dead code — removed the alert-path hashing from content_hash; it only runs on success-only reports so the alert branch was unreachable
  • Alert clears suppression — added cache.entries.remove(&key) on the alert path so the first all-clear success after an alert is never throttled
  • Alert test — rewrote report_with_alerts_never_suppressed to verify the full alert → recovery sequence rather than just checking the dedup queue

The nit about updating last_sent before confirming the send succeeded is noted — happy to address that too if you'd like, though it would require threading the result back from the worker.

@rsd-darshan rsd-darshan requested a review from yoks June 9, 2026 03:59
@yoks

yoks commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

/ok to test db3e903

@yoks

yoks commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

/ok to test ddd7ee3

Signed-off-by: rsd-darshan <poudeldarshan44@gmail.com>
@rsd-darshan rsd-darshan requested a review from yoks June 9, 2026 08:36
@rsd-darshan

Copy link
Copy Markdown
Author

Fixed the collapsible_if clippy lint in the latest commit — could someone run /ok to test on 9bfa3781 to trigger CI?

@yoks

yoks commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

/ok to test 9bfa378

Signed-off-by: rsd-darshan <susanpdl77@gmail.com>
Signed-off-by: rsd-darshan <poudeldarshan44@gmail.com>
@rsd-darshan

Copy link
Copy Markdown
Author

The only change in the latest commit is a mechanical rustfmt reformat — no logic was touched. Could either of you re-approve when you get a chance? Thanks!

@yoks

yoks commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

/ok to test 3475806

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: health clients (like hw-health) should send health-reports to core with variable frequency

3 participants