test(bench): stabilize publish gate against sub-30ms metric noise#1093
Conversation
Two changes that together unblock the v3.10.0 publish gate without masking real regressions: 1. scripts/incremental-benchmark.ts: add WARMUP_RUNS=2 to noop and 1-file rebuild phases, bump RUNS 3→5. Mirrors the warmup-discard methodology added to query-benchmark.ts in #1077. A 3-sample median with no warmup pulls cold-start outliers (rusqlite statement cache, OS page cache, NAPI/tree-sitter init) into the timing window — exactly the dynamic that produced +44–52% CI swings on 54ms baselines. 2. tests/benchmarks/regression-guard.test.ts: - Add 'fnDeps depth 1' to NOISY_METRICS. Verified via git log v3.9.6..HEAD that read_queries.rs (fn_deps Rust impl), dependencies.ts (fnDepsData JS wrapper), native_db.rs (schema/indexes), and src/db/ are unchanged since v3.9.6. CI's +44% on the 28.7ms baseline is +13ms absolute — same noise-floor amplification pattern as the existing noisy entries. - Add 3.10.0:1-file rebuild to KNOWN_REGRESSIONS with documented analysis mirroring the existing 3.10.0:No-op rebuild entry. PR #1085 already brought local 1-file from ~108ms to ~60ms; CI sees 75–82ms = local + ~20ms shared-runner overhead. Against the v3.9.6 54ms baseline that fluctuates +46–52% across consecutive runs of identical code. Exempt for this release; remove once 3.11.0+ data confirms stabilization under the new methodology. Verification: simulated the actual failing CI numbers (fnDeps d1=41.4, 1-file=82) as a synthetic 3.10.0 entry — all 17 regression-guard tests pass. Defensive check with a fabricated real regression (fnDeps depth 3 → 120ms) still correctly fails with +312%, confirming the gate has not been weakened against genuine perf regressions outside the noise floor. Closes #1076.
|
Claude finished @carlos-alm's task —— View job PR Review: Benchmark StabilizationReviewing PR #1093 for benchmark noise fixes and regression guard exemptions. Root Cause AnalysisProblem Being Solved: CI benchmark instability on sub-30ms metrics causing false regression alerts that block v3.10.0 publish gate. The issue is that small absolute timing variations (±10-13ms) on sub-30ms baselines translate to large percentage swings (+44-52%) that exceed the 25% regression threshold. Why This Approach: Two-pronged fix targeting both methodology and recognition of CI variance reality:
Risk Assessment: LOW RISK - The changes do not weaken regression detection for meaningful performance changes. The 50% threshold for sub-30ms metrics still catches real regressions while filtering CI jitter noise. Version-specific exemptions are self-expiring. Backlog Compliance
Critical AnalysisMethodology Changes (SOUND):
Exemptions (JUSTIFIED):
Quality Safeguards:
Documentation Quality:
Potential ConcernsMinor: The 50% threshold for noisy metrics is quite wide, but given the empirical data showing +44-60% swings on unchanged code, this appears calibrated correctly for CI reality. Monitoring: The exemptions should be tracked - but the existing stale-entry detection provides this automatically. Final Recommendation
The PR unblocks the publish gate without masking genuine performance regressions - exactly what's needed for infrastructure stability. |
Greptile SummaryThis PR stabilizes the v3.10.0 publish gate by addressing two sources of CI benchmark flakiness: cold-start outliers in the incremental benchmark script and overly narrow regression thresholds for metrics with sub-30ms baselines on shared runners.
Confidence Score: 5/5Safe to merge — the changes narrow the scope of benchmark exemptions to two well-documented, empirically justified cases and improve measurement methodology without touching any production code paths. All changes are confined to benchmark tooling and the regression-guard test. Warmup loops are placed correctly in every timing path (both parent and worker), the full-build section correctly omits warmup since it deletes the DB before each run, and the new exemptions are scoped, documented, and self-pruning via the existing staleness check. No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[incremental-benchmark.ts starts] --> B{isWorker?}
B -- No --> C[Parent: resolve import pairs]
C --> D{parentNativeCheck?}
D -- Yes --> E[Native warmup x WARMUP_RUNS=2]
E --> F[Native timed runs x RUNS=5]
F --> G[Compute nativeBatchMs median]
D -- No --> H[JS warmup x WARMUP_RUNS=2]
G --> H
H --> I[JS timed runs x RUNS=5]
I --> J[Compute jsFallbackMs median]
J --> K[Output JSON result]
B -- Yes --> L[Worker: full build x RUNS=5]
L --> M[Compute fullBuildMs median]
M --> N[No-op warmup x WARMUP_RUNS=2]
N --> O[No-op timed runs x RUNS=5]
O --> P[Compute noopRebuildMs median]
P --> Q[1-file warmup x WARMUP_RUNS=2]
Q --> R[1-file timed runs x RUNS=5]
R --> S[Compute oneFileRebuildMs median]
S --> T[finally: restore PROBE_FILE]
Reviews (2): Last reviewed commit: "test(bench): apply warmup+RUNS=5 to pare..." | Re-trigger Greptile |
…#1093) Mirror the worker-side methodology fix to the parent process's import-resolution loop so nativeBatchMs and jsFallbackMs are not exposed to the same cold-start outlier dynamic the rest of this PR is fixing. Both metrics are sub-15ms on codegraph itself — exactly the sub-30ms band where a 3-sample median without warmup picks up rusqlite statement-cache and NAPI init jitter and produces CI-amplified false regressions. Greptile PR review feedback.
|
Addressed Greptile review feedback:
Thanks for catching this — the omission was unintentional. |
Summary
The v3.10.0 publish gate has failed across multiple consecutive runs on two metrics that share a common pattern: sub-30/100ms baselines amplified by CI shared-runner jitter into percentage swings that look like regressions but aren't.
This PR unblocks the gate without masking real regressions:
scripts/incremental-benchmark.ts— methodology fix (durable): addWARMUP_RUNS = 2to noop and 1-file rebuild phases, bumpRUNS3→5. Mirrors the warmup-discard PR fix(bench): discard warmup runs in query benchmark median #1077 added toquery-benchmark.ts. A 3-sample median with no warmup pulls cold-start outliers (rusqlite statement cache, OS page cache, NAPI/tree-sitter init) into the timing window — the same dynamic that produced +44–52% CI swings on 54ms baselines.tests/benchmarks/regression-guard.test.ts— two scoped exemptions with documented rationale:fnDeps depth 1toNOISY_METRICS(50% threshold). Verified viagit log v3.9.6..HEADthatread_queries.rs(fn_deps Rust impl),dependencies.ts(fnDepsData JS wrapper),native_db.rs(schema/indexes), andsrc/db/are byte-for-byte unchanged since v3.9.6. CI's +44% on the 28.7ms baseline is +13ms absolute — same noise-floor pattern as the existingNo-op rebuild/1-file rebuildentries.3.10.0:1-file rebuildtoKNOWN_REGRESSIONSwith empirical analysis mirroring the existing3.10.0:No-op rebuildentry. PR perf(native): skip backfill on clean incrementals + bench guard tuning #1085 already brought local 1-file from ~108ms to ~60ms; CI sees 75–82ms = local + ~20ms shared-runner overhead. Against v3.9.6's 54ms baseline that fluctuates +46–52% across consecutive runs of identical code. One-off exemption; entry becomes stale and gets pruned once 3.11.0+ data confirms the new methodology stabilizes the numbers.Verification
fnDeps d1=41.4,1-file=82) as a synthetic 3.10.0 entry → all 17 regression-guard tests pass.fnDeps depth 3→ 120ms, +312%) → gate still correctly fails. The guard has not been weakened against genuine regressions outside the noise floor.npx biome check tests/benchmarks/regression-guard.test.tsclean. Scripts dir is out of biome scope (intentional).Test plan
workflow_dispatchrunnpm test(they should all skip —RUN_REGRESSION_GUARDis only set in the publish workflow)3.10.0:1-file rebuildentry can be removed (its presence will start firing the existingKNOWN_REGRESSIONS entries are not stalewarn test once it's >1 minor version behind)Closes #1076.