Skip historical sweep when collectors resume after a gap (closes #892) #904

Merged
erikdarlingdata merged 1 commit into dev from feature/892-resume-gap-detection
Apr 28, 2026
@erikdarlingdata
Owner

Summary

After an Off preset, Agent stoppage, or server reboot, the first run of `query_stats` / `procedure_stats` / `query_store` would historically sweep the entire backlog of cumulative state into the delta tables in one shot. On `query_stats` that was enough to blow tempdb overnight (the original pain in #885).

These three collectors now do a lightweight check after computing their normal cutoff: if `MAX(config.collection_log.collection_time)` for their own `collector_name` (status `SUCCESS`) is older than 5× the configured `frequency_minutes` (floored at 30 min), the cutoff is clamped to `SYSDATETIME()` so only forward-going data is collected on the resume run.
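
The check described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the PR: the table variable stands in for `config.collection_log`, the column name `collection_status` and the variable names (`@cutoff_time`, `@last_success`, `@gap_minutes`) are assumptions, and the "normal cutoff" is a made-up value. How the real procs handle a collector with no successful run in the log is not stated here, so the sketch simply leaves the cutoff alone in that case.

```sql
/* Stand-in for config.collection_log (column names partly assumed). */
DECLARE @collection_log table
(
    collector_name sysname NOT NULL,
    collection_time datetime2(7) NOT NULL,
    collection_status nvarchar(20) NOT NULL /* hypothetical column name */
);

INSERT @collection_log
    (collector_name, collection_time, collection_status)
VALUES
    (N'query_stats', DATEADD(MINUTE, -182, SYSDATETIME()), N'SUCCESS'),
    (N'query_stats', DATEADD(MINUTE, -5, SYSDATETIME()), N'ERROR');

DECLARE
    @frequency_minutes int = 1,
    /* Pretend this is the cutoff the collector computed normally. */
    @cutoff_time datetime2(7) = DATEADD(MINUTE, -200, SYSDATETIME()),
    @last_success datetime2(7),
    @threshold_minutes int,
    @gap_minutes int;

/* Last successful run for this collector only; failed runs are ignored. */
SELECT
    @last_success = MAX(cl.collection_time)
FROM @collection_log AS cl
WHERE cl.collector_name = N'query_stats'
AND   cl.collection_status = N'SUCCESS';

/* 5x the frequency, floored at 30; NULL or 0 falls through to 30. */
SELECT
    @threshold_minutes =
        CASE
            WHEN @frequency_minutes * 5 > 30
            THEN @frequency_minutes * 5
            ELSE 30
        END;

SELECT
    @gap_minutes = DATEDIFF(MINUTE, @last_success, SYSDATETIME());

/* A NULL gap (no prior success) fails this test, so the cutoff is kept. */
IF @gap_minutes > @threshold_minutes
BEGIN
    SET @cutoff_time = SYSDATETIME();

    RAISERROR
    (
        N'Resume detected: %d-minute gap exceeds %d-minute threshold. Skipping historical sweep.',
        0, 1, @gap_minutes, @threshold_minutes
    ) WITH NOWAIT;
END;

SELECT
    cutoff_time = @cutoff_time,
    gap_minutes = @gap_minutes,
    threshold_minutes = @threshold_minutes;
```

With the sample rows, the only `SUCCESS` entry is about three hours old, so the gap exceeds the 30-minute floor and the cutoff is moved to the current time. The more recent `ERROR` row does not count as a successful run and has no effect.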

What's NOT touched

  • XE-backed collectors (`blocked_process_xml`, `deadlock_xml`, `system_health`, `default_trace`, `trace_analysis`) are already bounded by their `@minutes_back` / `@hours_back` parameters — verified by reading each. No catch-up problem.
  • Snapshot collectors (`wait_stats`, `file_io_stats`, `memory_grant_stats`, `plan_cache_stats`, etc.) insert one row per run regardless of gap.
  • No schema changes. No new column on `config.collection_schedule`, no upgrade folder needed.

Threshold logic

```
threshold_minutes = MAX(@frequency_minutes * 5, 30)
-- with NULL/0 frequency safely defaulting to 30
```

| Configured freq | Threshold |
|---|---|
| 1–5 min | 30 min (floor) |
| 10 min | 50 min |
| 30 min | 150 min |
| 60 min (custom) | 300 min |

The 5× multiplier leaves breathing room for slipped cycles; the 30-minute floor keeps 1-minute collectors from false-firing on Agent restarts.
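
As a quick sanity check on the table above, the threshold expression can be evaluated for a handful of frequencies in one query. This is only a sketch of the arithmetic; the derived table and its column alias are illustrative, and the `CASE` form is an assumption about how a two-argument maximum might be written, chosen because `GREATEST` is only available from SQL Server 2022 onward and the procs target 2016 and up.

```sql
SELECT
    v.frequency_minutes,
    threshold_minutes =
        CASE
            /* A NULL frequency makes this comparison unknown, so ELSE applies. */
            WHEN v.frequency_minutes * 5 > 30
            THEN v.frequency_minutes * 5
            ELSE 30
        END
FROM
(
    VALUES
        (CAST(NULL AS int)),
        (0),
        (1),
        (5),
        (10),
        (30),
        (60)
) AS v (frequency_minutes);
```

NULL, 0, 1, and 5 all land on the 30-minute floor (5 × 5 = 25 is still below it), while 10, 30, and 60 produce 50, 150, and 300 minutes.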

Test plan

  • All three procs apply cleanly on sql2016/2017/2019/2022/2025.
  • On sql2019, synthetic 3-hour gap (achieved by transactionally backdating collection_log + ROLLBACK) → heuristic fires for all three: "Resume detected: 182-minute gap exceeds 30-minute threshold. Skipping historical sweep."
  • Normal run with no gap → heuristic stays silent.
  • Verified XE collectors and snapshot collectors are not affected.
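
The synthetic-gap technique from the test plan can be sketched like this. It is an illustration of the backdate-then-rollback idea only, not the script used for this PR: a temp table stands in for `config.collection_log`, the `collection_status` column name is an assumption, and the place where the collector procedure would be executed is left as a comment because its name is not given here. A temp table is used instead of a table variable because table variables do not participate in transaction rollback.

```sql
/* Stand-in for config.collection_log (column names partly assumed). */
CREATE TABLE #collection_log
(
    collector_name sysname NOT NULL,
    collection_time datetime2(7) NOT NULL,
    collection_status nvarchar(20) NOT NULL /* hypothetical column name */
);

INSERT #collection_log
    (collector_name, collection_time, collection_status)
VALUES
    (N'query_stats', DATEADD(MINUTE, -2, SYSDATETIME()), N'SUCCESS');

BEGIN TRANSACTION;

    /* Backdate the log so the last success looks about 3 hours old. */
    UPDATE cl
    SET cl.collection_time = DATEADD(HOUR, -3, cl.collection_time)
    FROM #collection_log AS cl
    WHERE cl.collector_name = N'query_stats';

    /* The collector procedure under test would be executed here. */

    SELECT
        gap_minutes_inside_transaction =
            DATEDIFF(MINUTE, MAX(cl.collection_time), SYSDATETIME())
    FROM #collection_log AS cl
    WHERE cl.collector_name = N'query_stats'
    AND   cl.collection_status = N'SUCCESS';

/* Undo the backdating so the log is left exactly as it was. */
ROLLBACK TRANSACTION;

SELECT
    gap_minutes_after_rollback =
        DATEDIFF(MINUTE, MAX(cl.collection_time), SYSDATETIME())
FROM #collection_log AS cl
WHERE cl.collector_name = N'query_stats'
AND   cl.collection_status = N'SUCCESS';

DROP TABLE #collection_log;
```

Inside the transaction the gap reads as roughly three hours, which is well past the 30-minute floor; after the rollback it returns to a couple of minutes. One thing to keep in mind with this approach is that anything the collector writes inside the same transaction is rolled back too.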

Closes #892.

When the Off preset, an Agent stoppage, or a server reboot pauses
collection for hours, the next run of query_stats / procedure_stats /
query_store would dump everything that accumulated during the gap into
their delta tables in one go. On query_stats specifically (issue #885),
that was enough to blow tempdb overnight.

Each of the three procs now reads MAX(config.collection_log.collection_time)
for its own collector_name (where status = SUCCESS) right after computing
the normal cutoff. If the gap to now exceeds 5x the configured frequency
(or 30 minutes, whichever is larger), it clamps the cutoff to SYSDATETIME()
so only forward-going data is collected on the resume run. NULL/0
frequency_minutes safely floors to 30 minutes.

XE-backed collectors (blocked_process_xml, deadlock_xml, system_health,
default_trace, trace_analysis) are bounded by their own @minutes_back /
@hours_back parameters and don't have the catch-up problem, so they're
left alone. Snapshot collectors (wait_stats, file_io_stats, etc) insert
one row per run regardless of gap and were never at risk.

Verified on sql2016/2017/2019/2022/2025: all three procs deploy cleanly,
heuristic fires on a 3-hour synthetic gap, stays quiet on normal runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erikdarlingdata erikdarlingdata merged commit 96b708f into dev Apr 28, 2026
7 checks passed
@erikdarlingdata erikdarlingdata deleted the feature/892-resume-gap-detection branch April 28, 2026 01:30