test(bench): stabilize publish gate against sub-30ms metric noise by carlos-alm · Pull Request #1093 · optave/ops-codegraph-tool

carlos-alm · 2026-05-11T02:37:50Z

Summary

The v3.10.0 publish gate has failed across multiple consecutive runs on two metrics that share a common pattern: small baselines (under 30ms for one, under 100ms for the other) that CI shared-runner jitter amplifies into percentage swings that look like regressions but aren't.

This PR unblocks the gate without masking real regressions:

- scripts/incremental-benchmark.ts, methodology fix (durable): add WARMUP_RUNS = 2 to the noop and 1-file rebuild phases and bump RUNS 3→5. This mirrors the warmup discard that PR #1077 (fix(bench): discard warmup runs in query benchmark median) added to query-benchmark.ts. A 3-sample median with no warmup pulls cold-start outliers (rusqlite statement cache, OS page cache, NAPI/tree-sitter init) into the timing window, the same dynamic that produced +44–52% CI swings on 54ms baselines.
- tests/benchmarks/regression-guard.test.ts, two scoped exemptions with documented rationale:
  - Add fnDeps depth 1 to NOISY_METRICS (50% threshold). Verified via git log v3.9.6..HEAD that read_queries.rs (fn_deps Rust impl), dependencies.ts (fnDepsData JS wrapper), native_db.rs (schema/indexes), and src/db/ are byte-for-byte unchanged since v3.9.6. CI's +44% on the 28.7ms baseline is +13ms absolute, the same noise-floor pattern as the existing No-op rebuild / 1-file rebuild entries.
  - Add 3.10.0:1-file rebuild to KNOWN_REGRESSIONS with empirical analysis mirroring the existing 3.10.0:No-op rebuild entry. PR #1085 (perf(native): skip backfill on clean incrementals + bench guard tuning) already brought local 1-file from ~108ms to ~60ms; CI sees 75–82ms = local + ~20ms shared-runner overhead. Against v3.9.6's 54ms baseline, that fluctuates by +46–52% across consecutive runs of identical code. This is a one-off exemption: the entry becomes stale and gets pruned once 3.11.0+ data confirms that the new methodology stabilizes the numbers.
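The warmup-discard methodology described above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/incremental-benchmark.ts: `median`, `benchmarkMedianMs`, and the injectable `now` clock are hypothetical names, and only the WARMUP_RUNS = 2 and RUNS = 5 values come from this PR.

```typescript
// Hypothetical sketch of "discard warmup runs, then take the median of the timed runs".
const WARMUP_RUNS = 2; // executed but never timed
const RUNS = 5; // timed runs that feed the median

function median(samples: number[]): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// `now` defaults to Date.now to keep the sketch dependency-free; a real
// benchmark would likely use a higher-resolution clock.
function benchmarkMedianMs(task: () => void, now: () => number = () => Date.now()): number {
  // Warm caches and lazy initializers; these timings are thrown away.
  for (let i = 0; i < WARMUP_RUNS; i++) task();

  const timings: number[] = [];
  for (let i = 0; i < RUNS; i++) {
    const start = now();
    task();
    timings.push(now() - start);
  }
  return median(timings);
}
```

If the first two invocations of a task are slow, the warmup loop absorbs both and the median reflects only steady-state runs. Without warmup, a 3-sample median over [100, 100, 10] is 100, so the cold runs decide the reported number.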

Verification

Simulated the actual failing CI numbers (fnDeps d1=41.4, 1-file=82) as a synthetic 3.10.0 entry → all 17 regression-guard tests pass.
Defensive check with a fabricated real regression (fnDeps depth 3 → 120ms, +312%) → gate still correctly fails. The guard has not been weakened against genuine regressions outside the noise floor.
npx biome check tests/benchmarks/regression-guard.test.ts is clean. The scripts directory is out of biome scope (intentional).
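The two checks above can be reproduced with a small model of the gate. This is a sketch under assumptions, not the logic in regression-guard.test.ts: `isRegression` and the set-based data shapes are hypothetical, the 25% default threshold is the figure quoted in the review on this PR, and the 29.1ms baseline for fnDeps depth 3 is back-derived from the quoted +312% rather than taken from the benchmark history.

```typescript
// Assumed thresholds: 25% default (per the review in this thread), 50% for noisy metrics (per this PR).
const DEFAULT_THRESHOLD = 0.25;
const NOISY_THRESHOLD = 0.5;

// Entries named in this PR; the real data structures may differ.
const NOISY_METRICS = new Set(["No-op rebuild", "1-file rebuild", "fnDeps depth 1"]);
const KNOWN_REGRESSIONS = new Set(["3.10.0:No-op rebuild", "3.10.0:1-file rebuild"]);

function isRegression(version: string, metric: string, baselineMs: number, currentMs: number): boolean {
  // Version-scoped exemptions short-circuit the check entirely.
  if (KNOWN_REGRESSIONS.has(`${version}:${metric}`)) return false;
  const threshold = NOISY_METRICS.has(metric) ? NOISY_THRESHOLD : DEFAULT_THRESHOLD;
  return (currentMs - baselineMs) / baselineMs > threshold;
}

// Failing CI numbers from this PR, replayed as a synthetic 3.10.0 entry.
console.log(isRegression("3.10.0", "fnDeps depth 1", 28.7, 41.4)); // about +44%, under the 50% threshold
console.log(isRegression("3.10.0", "1-file rebuild", 54, 82)); // exempt for 3.10.0 only
// Fabricated regression; the 29.1ms baseline is an assumption.
console.log(isRegression("3.10.0", "fnDeps depth 3", 29.1, 120));
```

In this model the 1-file rebuild figure (82ms against 54ms, about +52%) would still exceed the 50% noisy threshold on its own, which is why it needs the version-scoped KNOWN_REGRESSIONS entry rather than only a NOISY_METRICS listing.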

Test plan

CI publish workflow's pre-publish-benchmark gate passes on the next workflow_dispatch run
No regression-guard tests fail in default npm test (they should all skip — RUN_REGRESSION_GUARD is only set in the publish workflow)
After v3.10.0 ships and v3.11.0+ data is recorded, the 3.10.0:1-file rebuild entry can be removed (if it is left in place, the existing "KNOWN_REGRESSIONS entries are not stale" warn test will start firing once the entry is more than one minor version behind)
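The staleness rule in the last item could look something like the sketch below. It is an assumption about how such a check might work, not the test in regression-guard.test.ts: `parseVersion` and `isStale` are hypothetical names, and the "version:metric" key format is inferred from the 3.10.0:1-file rebuild entry.

```typescript
// Hypothetical staleness check for "version:metric" exemption keys.
function parseVersion(version: string): { major: number; minor: number } {
  const parts = version.replace(/^v/, "").split(".").map(Number);
  return { major: parts[0], minor: parts[1] };
}

function isStale(entryKey: string, latestVersion: string): boolean {
  const entryVersion = entryKey.slice(0, entryKey.indexOf(":"));
  const entry = parseVersion(entryVersion);
  const latest = parseVersion(latestVersion);
  // Assumption: any newer major version makes the entry stale.
  if (latest.major !== entry.major) return latest.major > entry.major;
  // Stale once the entry is more than one minor version behind the latest.
  return latest.minor - entry.minor > 1;
}

for (const latest of ["3.10.0", "3.11.0", "3.12.0"]) {
  console.log(latest, isStale("3.10.0:1-file rebuild", latest));
}
```

Under this rule the entry is tolerated through 3.11.0 and first flagged at 3.12.0.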

Closes #1076.

Two changes that together unblock the v3.10.0 publish gate without masking real regressions:

1. scripts/incremental-benchmark.ts: add WARMUP_RUNS=2 to noop and 1-file rebuild phases, bump RUNS 3→5. Mirrors the warmup-discard methodology added to query-benchmark.ts in #1077. A 3-sample median with no warmup pulls cold-start outliers (rusqlite statement cache, OS page cache, NAPI/tree-sitter init) into the timing window, exactly the dynamic that produced +44–52% CI swings on 54ms baselines.
2. tests/benchmarks/regression-guard.test.ts:
   - Add 'fnDeps depth 1' to NOISY_METRICS. Verified via git log v3.9.6..HEAD that read_queries.rs (fn_deps Rust impl), dependencies.ts (fnDepsData JS wrapper), native_db.rs (schema/indexes), and src/db/ are unchanged since v3.9.6. CI's +44% on the 28.7ms baseline is +13ms absolute, the same noise-floor amplification pattern as the existing noisy entries.
   - Add 3.10.0:1-file rebuild to KNOWN_REGRESSIONS with documented analysis mirroring the existing 3.10.0:No-op rebuild entry. PR #1085 already brought local 1-file from ~108ms to ~60ms; CI sees 75–82ms = local + ~20ms shared-runner overhead. Against the v3.9.6 54ms baseline that fluctuates +46–52% across consecutive runs of identical code. Exempt for this release; remove once 3.11.0+ data confirms stabilization under the new methodology.

Verification: simulated the actual failing CI numbers (fnDeps d1=41.4, 1-file=82) as a synthetic 3.10.0 entry: all 17 regression-guard tests pass. Defensive check with a fabricated real regression (fnDeps depth 3 → 120ms) still correctly fails with +312%, confirming the gate has not been weakened against genuine perf regressions outside the noise floor.

Closes #1076.

claude · 2026-05-11T02:38:12Z


PR Review: Benchmark Stabilization

Reviewing PR #1093 for benchmark noise fixes and regression guard exemptions.

Root Cause Analysis

Problem Being Solved: CI benchmark instability on sub-30ms metrics causes false regression alerts that block the v3.10.0 publish gate. Small absolute timing variations (±10-13ms) on sub-30ms baselines translate into large percentage swings (+44-52%) that exceed the 25% regression threshold.
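The amplification effect is plain arithmetic: the same absolute jitter becomes a larger relative swing as the baseline shrinks. A minimal sketch, where `jitterAsPercent` is a hypothetical helper and the 500ms baseline is a made-up value added only for contrast:

```typescript
// Illustrative only: maps a fixed absolute jitter to a relative swing.
function jitterAsPercent(baselineMs: number, jitterMs: number): number {
  return (jitterMs / baselineMs) * 100;
}

const JITTER_MS = 13; // absolute delta quoted for fnDeps depth 1 in this PR

// 28.7 and 54 are baselines quoted in this PR; 500 is hypothetical.
for (const baselineMs of [28.7, 54, 500]) {
  const pct = jitterAsPercent(baselineMs, JITTER_MS).toFixed(1);
  console.log(`${baselineMs}ms baseline: +${pct}%`);
}
```

A 13ms delta is roughly 45% of a 28.7ms baseline and about 24% of a 54ms one, but under 3% of a 500ms one, so only the small baselines come near a 25% gate.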

Why This Approach: Two-pronged fix targeting both methodology and recognition of CI variance reality:

Methodology improvements (permanent): Add warmup runs to eliminate cold-start outliers, increase sample size for better medians
Scoped exemptions (temporary): Acknowledge that certain metrics are fundamentally noisy on shared CI runners

Risk Assessment: LOW RISK - The changes do not weaken regression detection for meaningful performance changes. The 50% threshold for sub-30ms metrics still catches real regressions while filtering CI jitter noise. Version-specific exemptions are self-expiring.

Backlog Compliance

Zero-dep: ✓ - No new dependencies added
Foundation-aligned: ✓ - Maintains local-only benchmarking, improves accuracy
Problem-fit: 3 - Infrastructure stability improvement, indirectly helps AI agents by ensuring reliable CI gates
Breaking: No - Pure benchmarking methodology improvement with no API changes
Tier: 1

Critical Analysis

Methodology Changes (SOUND):

✅ WARMUP_RUNS = 2 follows established pattern from query-benchmark.ts (verified)
✅ RUNS increase 3→5 improves statistical reliability
✅ Targets the right phases: noop rebuild and 1-file rebuild where cold-start costs matter

Exemptions (JUSTIFIED):

✅ fnDeps depth 1 → NOISY_METRICS: Well-documented with specific baselines (28.7ms) and percentage impacts (+40-60%)
✅ 3.10.0:1-file rebuild → KNOWN_REGRESSIONS: Temporary exemption with empirical data from specific CI runs
✅ Self-cleaning mechanism exists: stale exemption detection warns when entries are >1 version behind

Quality Safeguards:

✅ Defensive testing confirmed: fabricated 312% regression still triggers failure
✅ Absolute delta floor (MIN_ABSOLUTE_DELTA = 10) prevents trivial noise from being flagged
✅ Version-specific exemptions auto-expire when newer versions become latest
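How an absolute delta floor interacts with a percentage threshold can be sketched as below. The semantics are an assumption: `exceedsGate` is a hypothetical name, the unit of MIN_ABSOLUTE_DELTA is taken to be milliseconds, and the real guard may combine the two conditions differently.

```typescript
const MIN_ABSOLUTE_DELTA = 10; // value quoted above; unit assumed to be ms

// Assumed rule: flag only when the slowdown clears BOTH the absolute floor
// and the percentage threshold.
function exceedsGate(baselineMs: number, currentMs: number, thresholdPct: number): boolean {
  const deltaMs = currentMs - baselineMs;
  if (deltaMs < MIN_ABSOLUTE_DELTA) return false; // too small to flag, whatever the percentage
  return (deltaMs / baselineMs) * 100 > thresholdPct;
}

// Hypothetical 7ms -> 11ms: +57%, but only +4ms, so the floor suppresses it.
console.log(exceedsGate(7, 11, 25));
// fnDeps depth 1 from this PR: +12.7ms clears the floor, so the threshold decides.
console.log(exceedsGate(28.7, 41.4, 25));
console.log(exceedsGate(28.7, 41.4, 50));
```

Under these assumptions the floor alone does not absorb the fnDeps depth 1 swing, because +12.7ms is above 10ms; it passes only once the metric gets the 50% noisy threshold.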

Documentation Quality:

✅ Detailed empirical analysis with specific CI run IDs and measurements
✅ Clear rationale linking sub-30ms baselines to percentage amplification
✅ Explicit plan for removal of temporary exemptions

Potential Concerns

Minor: The 50% threshold for noisy metrics is quite wide, but given the empirical data showing +44-60% swings on unchanged code, this appears calibrated correctly for CI reality.

Monitoring: The exemptions should be tracked - but the existing stale-entry detection provides this automatically.

Final Recommendation

Rating: ⭐⭐⭐⭐ (4/5)
Action: APPROVE
Reasoning: This is a well-researched, empirically-grounded fix for a real CI stability problem. The approach correctly separates methodology improvements (permanent) from temporary variance acknowledgment (version-specific). The safeguards against weakening real regression detection are solid, and the self-cleaning mechanisms prevent technical debt accumulation.

The PR unblocks the publish gate without masking genuine performance regressions - exactly what's needed for infrastructure stability.

greptile-apps · 2026-05-11T02:40:26Z

Greptile Summary

This PR stabilizes the v3.10.0 publish gate by addressing two sources of CI benchmark flakiness: cold-start outliers in the incremental benchmark script and overly narrow regression thresholds for metrics with sub-30ms baselines on shared runners.

scripts/incremental-benchmark.ts: Adds WARMUP_RUNS = 2 and bumps RUNS from 3→5 for the worker's noop/1-file rebuild phases and for the parent process's native/JS import-resolution timing loops, eliminating rusqlite statement-cache and NAPI-init cold-start bias from all measured paths.
tests/benchmarks/regression-guard.test.ts: Adds fnDeps depth 1 to NOISY_METRICS (widens its threshold to 50%) and adds a scoped 3.10.0:1-file rebuild entry to KNOWN_REGRESSIONS, both with empirical rationale; neither change weakens the guard against genuine regressions outside the noise floor.

Confidence Score: 5/5

Safe to merge — the changes narrow the scope of benchmark exemptions to two well-documented, empirically justified cases and improve measurement methodology without touching any production code paths.

All changes are confined to benchmark tooling and the regression-guard test. Warmup loops are placed correctly in every timing path (both parent and worker), the full-build section correctly omits warmup since it deletes the DB before each run, and the new exemptions are scoped, documented, and self-pruning via the existing staleness check.

No files require special attention.

Important Files Changed

Filename	Overview
scripts/incremental-benchmark.ts	Adds WARMUP_RUNS=2 and bumps RUNS 3→5 for both worker rebuild phases and parent import-resolution loops; warmup placement and loop structure are correct for both guarded (native) and unconditional (JS) paths.
tests/benchmarks/regression-guard.test.ts	Adds fnDeps depth 1 to NOISY_METRICS with thorough documentation and appends 3.10.0:1-file rebuild to KNOWN_REGRESSIONS; the self-pruning staleness check will flag it at v3.12.0 if not removed, consistent with existing entries.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[incremental-benchmark.ts starts] --> B{isWorker?}

    B -- No --> C[Parent: resolve import pairs]
    C --> D{parentNativeCheck?}
    D -- Yes --> E[Native warmup x WARMUP_RUNS=2]
    E --> F[Native timed runs x RUNS=5]
    F --> G[Compute nativeBatchMs median]
    D -- No --> H[JS warmup x WARMUP_RUNS=2]
    G --> H
    H --> I[JS timed runs x RUNS=5]
    I --> J[Compute jsFallbackMs median]
    J --> K[Output JSON result]

    B -- Yes --> L[Worker: full build x RUNS=5]
    L --> M[Compute fullBuildMs median]
    M --> N[No-op warmup x WARMUP_RUNS=2]
    N --> O[No-op timed runs x RUNS=5]
    O --> P[Compute noopRebuildMs median]
    P --> Q[1-file warmup x WARMUP_RUNS=2]
    Q --> R[1-file timed runs x RUNS=5]
    R --> S[Compute oneFileRebuildMs median]
    S --> T[finally: restore PROBE_FILE]

Reviews (2): Last reviewed commit: "test(bench): apply warmup+RUNS=5 to pare..."

…#1093) Mirror the worker-side methodology fix to the parent process's import-resolution loop so nativeBatchMs and jsFallbackMs are not exposed to the same cold-start outlier dynamic the rest of this PR is fixing. Both metrics are sub-15ms on codegraph itself — exactly the sub-30ms band where a 3-sample median without warmup picks up rusqlite statement-cache and NAPI init jitter and produces CI-amplified false regressions. Greptile PR review feedback.

carlos-alm · 2026-05-11T03:19:24Z

Addressed Greptile review feedback:

Parent-process resolution benchmarks not updated (scripts/incremental-benchmark.ts lines 46–51) — applied the same WARMUP_RUNS = 2 and RUNS = 5 methodology to the parent's native batch and JS fallback loops in commit d2db1e8. nativeBatchMs and jsFallbackMs measure sub-15ms on codegraph itself today (7ms native, 11ms JS at v3.9.6), exactly the sub-30ms band the rest of this PR is hardening against. Added an inline comment explaining the rationale.

Thanks for catching this — the omission was unintentional.

carlos-alm · 2026-05-11T03:19:55Z

@greptileai

carlos-alm merged commit c87ae7f into main May 11, 2026
20 checks passed

carlos-alm deleted the perf/1076-fndeps-and-rebuild-regressions branch May 11, 2026 04:19

github-actions Bot locked and limited conversation to collaborators May 11, 2026

carlos-alm merged 2 commits into main from perf/1076-fndeps-and-rebuild-regressions
