test(bench): stabilize publish gate against sub-30ms metric noise #1093

Merged
carlos-alm merged 2 commits into main from perf/1076-fndeps-and-rebuild-regressions
May 11, 2026

Conversation

@carlos-alm
Contributor

Summary

The v3.10.0 publish gate has failed across multiple consecutive runs on two metrics that share a common pattern: baselines under 30ms and under 100ms, amplified by CI shared-runner jitter into percentage swings that look like regressions but aren't.

This PR unblocks the gate without masking real regressions:

  • scripts/incremental-benchmark.ts: methodology fix (durable). Add WARMUP_RUNS = 2 to the noop and 1-file rebuild phases and bump RUNS 3→5. This mirrors the warmup discard that PR #1077 (fix(bench): discard warmup runs in query benchmark median) added to query-benchmark.ts. A 3-sample median with no warmup pulls cold-start outliers (rusqlite statement cache, OS page cache, NAPI/tree-sitter init) into the timing window, which is the same dynamic that produced +44–52% CI swings on 54ms baselines.

  • tests/benchmarks/regression-guard.test.ts — two scoped exemptions with documented rationale:

    • Add fnDeps depth 1 to NOISY_METRICS (50% threshold). Verified via git log v3.9.6..HEAD that read_queries.rs (fn_deps Rust impl), dependencies.ts (fnDepsData JS wrapper), native_db.rs (schema/indexes), and src/db/ are byte-for-byte unchanged since v3.9.6. CI's +44% on the 28.7ms baseline is +13ms absolute — same noise-floor pattern as the existing No-op rebuild / 1-file rebuild entries.
    • Add 3.10.0:1-file rebuild to KNOWN_REGRESSIONS with empirical analysis mirroring the existing 3.10.0:No-op rebuild entry. PR #1085 (perf(native): skip backfill on clean incrementals + bench guard tuning) already brought local 1-file from ~108ms to ~60ms; CI sees 75–82ms = local + ~20ms shared-runner overhead. Against v3.9.6's 54ms baseline, that fluctuates +46–52% across consecutive runs of identical code. One-off exemption; the entry becomes stale and gets pruned once 3.11.0+ data confirms the new methodology stabilizes the numbers.
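
The warmup-discard methodology described above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/incremental-benchmark.ts: WARMUP_RUNS and RUNS come from this PR, while measureMedianMs, median, and the injectable now clock are hypothetical names introduced here.

```typescript
// Illustrative sketch only. WARMUP_RUNS and RUNS mirror the values named in
// this PR; the helper names and the injectable clock are assumptions.
const WARMUP_RUNS = 2;
const RUNS = 5;

// Median of a list of samples (mean of the two middle values when even).
function median(samples: number[]): number {
  const sorted = samples.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

function measureMedianMs(
  fn: () => void,
  now: () => number = () => performance.now(),
  warmupRuns: number = WARMUP_RUNS,
  runs: number = RUNS,
): number {
  // Warmup: run but do not record, so cold caches and lazy initialization
  // are paid for outside the timing window.
  for (let i = 0; i < warmupRuns; i++) fn();

  // Timed runs: only these samples feed the median.
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = now();
    fn();
    samples.push(now() - start);
  }
  return median(samples);
}
```

With made-up run costs of 200, 150, 50, 52, 48, 51, 49 (ms), a 3-sample median with no warmup would report 150, while two discarded warmups followed by five timed runs would report 50.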

Verification

  • Simulated the actual failing CI numbers (fnDeps d1=41.4, 1-file=82) as a synthetic 3.10.0 entry → all 17 regression-guard tests pass.
  • Defensive check with a fabricated real regression (fnDeps depth 3 → 120ms, +312%) → gate still correctly fails. The guard has not been weakened against genuine regressions outside the noise floor.
  • npx biome check tests/benchmarks/regression-guard.test.ts clean. Scripts dir is out of biome scope (intentional).
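
To make the verification numbers concrete, here is a rough sketch of how such a gate decision might be structured. NOISY_METRICS, KNOWN_REGRESSIONS, and MIN_ABSOLUTE_DELTA are names that appear in this PR and its reviews; the isRegression function, the way the checks are combined, and the 29.1ms baseline for fnDeps depth 3 (back-computed from the quoted +312%) are assumptions, not the contents of regression-guard.test.ts.

```typescript
// Illustrative sketch only; how the real guard combines these checks is assumed.
const DEFAULT_THRESHOLD_PCT = 25; // default threshold cited in the review
const NOISY_THRESHOLD_PCT = 50;   // widened threshold for noisy metrics
const MIN_ABSOLUTE_DELTA = 10;    // ms; smaller deltas are never flagged

const NOISY_METRICS = ["No-op rebuild", "1-file rebuild", "fnDeps depth 1"];
const KNOWN_REGRESSIONS = ["3.10.0:No-op rebuild", "3.10.0:1-file rebuild"];

function isRegression(
  version: string,
  metric: string,
  baselineMs: number,
  currentMs: number,
): boolean {
  // Version-scoped exemption: applies to this release only.
  if (KNOWN_REGRESSIONS.indexOf(version + ":" + metric) !== -1) return false;

  // Absolute floor: ignore deltas too small to be meaningful.
  const deltaMs = currentMs - baselineMs;
  if (deltaMs < MIN_ABSOLUTE_DELTA) return false;

  // Percentage check against the metric's threshold.
  const pct = (deltaMs / baselineMs) * 100;
  const threshold =
    NOISY_METRICS.indexOf(metric) !== -1 ? NOISY_THRESHOLD_PCT : DEFAULT_THRESHOLD_PCT;
  return pct > threshold;
}

// Failing CI numbers from this PR: 28.7 → 41.4 is +12.7ms (about +44%),
// which is under the 50% noisy threshold.
const fnDepsD1Flagged = isRegression("3.10.0", "fnDeps depth 1", 28.7, 41.4);
// 54 → 82 is about +52%, over 50%, so it passes only through the exemption.
const oneFileFlagged = isRegression("3.10.0", "1-file rebuild", 54, 82);
// Fabricated regression: 29.1 (assumed baseline) → 120 is about +312%.
const fnDepsD3Flagged = isRegression("3.10.0", "fnDeps depth 3", 29.1, 120);
```

Under these assumptions the first two calls return false and the third returns true, which matches the pass and fail outcomes reported above.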

Test plan

  • CI publish workflow's pre-publish-benchmark gate passes on the next workflow_dispatch run
  • No regression-guard tests fail in default npm test (they should all skip — RUN_REGRESSION_GUARD is only set in the publish workflow)
  • After v3.10.0 ships and v3.11.0+ data is recorded, the 3.10.0:1-file rebuild entry can be removed (if left in place, it will trip the existing "KNOWN_REGRESSIONS entries are not stale" warning test once it is more than one minor version behind)
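
The staleness rule in the last item can be sketched as below. This is a hypothetical illustration of "more than one minor version behind"; the helper names, the key parsing, and the treatment of major-version bumps are assumptions and may differ from the actual warning test.

```typescript
// Illustrative sketch only; not the actual staleness test.
function parseVersion(version: string): { major: number; minor: number } {
  const parts = version.replace(/^v/, "").split(".");
  return { major: parseInt(parts[0], 10), minor: parseInt(parts[1], 10) };
}

// entryKey has the form "<version>:<metric>", e.g. "3.10.0:1-file rebuild".
function isStaleEntry(entryKey: string, latestVersion: string): boolean {
  const entry = parseVersion(entryKey.split(":")[0]);
  const latest = parseVersion(latestVersion);
  // Assumption: any entry from an older major version counts as stale.
  if (latest.major !== entry.major) return latest.major > entry.major;
  // Stale once the entry is more than one minor version behind.
  return latest.minor - entry.minor > 1;
}
```

Under this reading, a 3.10.0 entry is still fresh while 3.11.0 is the latest version and becomes stale at 3.12.0.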

Closes #1076.

Two changes that together unblock the v3.10.0 publish gate without masking
real regressions:

1. scripts/incremental-benchmark.ts: add WARMUP_RUNS=2 to noop and 1-file
   rebuild phases, bump RUNS 3→5. Mirrors the warmup-discard methodology
   added to query-benchmark.ts in #1077. A 3-sample median with no warmup
   pulls cold-start outliers (rusqlite statement cache, OS page cache,
   NAPI/tree-sitter init) into the timing window — exactly the dynamic
   that produced +44–52% CI swings on 54ms baselines.

2. tests/benchmarks/regression-guard.test.ts:
   - Add 'fnDeps depth 1' to NOISY_METRICS. Verified via git log v3.9.6..HEAD
     that read_queries.rs (fn_deps Rust impl), dependencies.ts (fnDepsData
     JS wrapper), native_db.rs (schema/indexes), and src/db/ are unchanged
     since v3.9.6. CI's +44% on the 28.7ms baseline is +13ms absolute —
     same noise-floor amplification pattern as the existing noisy entries.
   - Add 3.10.0:1-file rebuild to KNOWN_REGRESSIONS with documented analysis
     mirroring the existing 3.10.0:No-op rebuild entry. PR #1085 already
     brought local 1-file from ~108ms to ~60ms; CI sees 75–82ms = local +
     ~20ms shared-runner overhead. Against the v3.9.6 54ms baseline that
     fluctuates +46–52% across consecutive runs of identical code. Exempt
     for this release; remove once 3.11.0+ data confirms stabilization
     under the new methodology.

Verification: simulated the actual failing CI numbers (fnDeps d1=41.4,
1-file=82) as a synthetic 3.10.0 entry — all 17 regression-guard tests
pass. Defensive check with a fabricated real regression (fnDeps depth 3
→ 120ms) still correctly fails with +312%, confirming the gate has not
been weakened against genuine perf regressions outside the noise floor.

Closes #1076.
@claude

claude Bot commented May 11, 2026

Claude finished @carlos-alm's task


PR Review: Benchmark Stabilization

Reviewing PR #1093 for benchmark noise fixes and regression guard exemptions.

Root Cause Analysis

Problem Being Solved: CI benchmark instability on sub-30ms metrics causing false regression alerts that block v3.10.0 publish gate. The issue is that small absolute timing variations (±10-13ms) on sub-30ms baselines translate to large percentage swings (+44-52%) that exceed the 25% regression threshold.

Why This Approach: Two-pronged fix targeting both methodology and recognition of CI variance reality:

  1. Methodology improvements (permanent): Add warmup runs to eliminate cold-start outliers, increase sample size for better medians
  2. Scoped exemptions (temporary): Acknowledge that certain metrics are fundamentally noisy on shared CI runners

Risk Assessment: LOW RISK - The changes do not weaken regression detection for meaningful performance changes. The 50% threshold for sub-30ms metrics still catches real regressions while filtering CI jitter noise. Version-specific exemptions are self-expiring.

Backlog Compliance

  • Zero-dep: ✓ - No new dependencies added
  • Foundation-aligned: ✓ - Maintains local-only benchmarking, improves accuracy
  • Problem-fit: 3 - Infrastructure stability improvement, indirectly helps AI agents by ensuring reliable CI gates
  • Breaking: No - Pure benchmarking methodology improvement with no API changes
  • Tier: 1

Critical Analysis

Methodology Changes (SOUND):

  • WARMUP_RUNS = 2 follows established pattern from query-benchmark.ts (verified)
  • RUNS increase 3→5 improves statistical reliability
  • ✅ Targets the right phases: noop rebuild and 1-file rebuild where cold-start costs matter

Exemptions (JUSTIFIED):

  • fnDeps depth 1 → NOISY_METRICS: Well-documented with specific baselines (28.7ms) and percentage impacts (+40-60%)
  • 3.10.0:1-file rebuild → KNOWN_REGRESSIONS: Temporary exemption with empirical data from specific CI runs
  • ✅ Self-cleaning mechanism exists: stale exemption detection warns when entries are >1 version behind

Quality Safeguards:

  • ✅ Defensive testing confirmed: fabricated 312% regression still triggers failure
  • ✅ Absolute delta floor (MIN_ABSOLUTE_DELTA = 10) prevents trivial noise from being flagged
  • ✅ Version-specific exemptions auto-expire when newer versions become latest

Documentation Quality:

  • ✅ Detailed empirical analysis with specific CI run IDs and measurements
  • ✅ Clear rationale linking sub-30ms baselines to percentage amplification
  • ✅ Explicit plan for removal of temporary exemptions

Potential Concerns

Minor: The 50% threshold for noisy metrics is quite wide, but given the empirical data showing +44-60% swings on unchanged code, this appears calibrated correctly for CI reality.

Monitoring: The exemptions should be tracked - but the existing stale-entry detection provides this automatically.

Final Recommendation

  • Rating: ⭐⭐⭐⭐ (4/5)
  • Action: APPROVE
  • Reasoning: This is a well-researched, empirically-grounded fix for a real CI stability problem. The approach correctly separates methodology improvements (permanent) from temporary variance acknowledgment (version-specific). The safeguards against weakening real regression detection are solid, and the self-cleaning mechanisms prevent technical debt accumulation.

The PR unblocks the publish gate without masking genuine performance regressions - exactly what's needed for infrastructure stability.


@greptile-apps
Contributor

greptile-apps Bot commented May 11, 2026

Greptile Summary

This PR stabilizes the v3.10.0 publish gate by addressing two sources of CI benchmark flakiness: cold-start outliers in the incremental benchmark script and overly narrow regression thresholds for metrics with sub-30ms baselines on shared runners.

  • scripts/incremental-benchmark.ts: Adds WARMUP_RUNS = 2 and bumps RUNS from 3→5 for the worker's noop/1-file rebuild phases and for the parent process's native/JS import-resolution timing loops, eliminating rusqlite statement-cache and NAPI-init cold-start bias from all measured paths.
  • tests/benchmarks/regression-guard.test.ts: Adds fnDeps depth 1 to NOISY_METRICS (widens its threshold to 50%) and adds a scoped 3.10.0:1-file rebuild entry to KNOWN_REGRESSIONS, both with empirical rationale; neither change weakens the guard against genuine regressions outside the noise floor.

Confidence Score: 5/5

Safe to merge — the changes narrow the scope of benchmark exemptions to two well-documented, empirically justified cases and improve measurement methodology without touching any production code paths.

All changes are confined to benchmark tooling and the regression-guard test. Warmup loops are placed correctly in every timing path (both parent and worker), the full-build section correctly omits warmup since it deletes the DB before each run, and the new exemptions are scoped, documented, and self-pruning via the existing staleness check.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| scripts/incremental-benchmark.ts | Adds WARMUP_RUNS=2 and bumps RUNS 3→5 for both worker rebuild phases and parent import-resolution loops; warmup placement and loop structure are correct for both guarded (native) and unconditional (JS) paths. |
| tests/benchmarks/regression-guard.test.ts | Adds fnDeps depth 1 to NOISY_METRICS with thorough documentation and appends 3.10.0:1-file rebuild to KNOWN_REGRESSIONS; the self-pruning staleness check will flag it at v3.12.0 if not removed, consistent with existing entries. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[incremental-benchmark.ts starts] --> B{isWorker?}

    B -- No --> C[Parent: resolve import pairs]
    C --> D{parentNativeCheck?}
    D -- Yes --> E[Native warmup x WARMUP_RUNS=2]
    E --> F[Native timed runs x RUNS=5]
    F --> G[Compute nativeBatchMs median]
    D -- No --> H[JS warmup x WARMUP_RUNS=2]
    G --> H
    H --> I[JS timed runs x RUNS=5]
    I --> J[Compute jsFallbackMs median]
    J --> K[Output JSON result]

    B -- Yes --> L[Worker: full build x RUNS=5]
    L --> M[Compute fullBuildMs median]
    M --> N[No-op warmup x WARMUP_RUNS=2]
    N --> O[No-op timed runs x RUNS=5]
    O --> P[Compute noopRebuildMs median]
    P --> Q[1-file warmup x WARMUP_RUNS=2]
    Q --> R[1-file timed runs x RUNS=5]
    R --> S[Compute oneFileRebuildMs median]
    S --> T[finally: restore PROBE_FILE]
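
The worker branch of the flowchart can be sketched roughly as follows. This is a hedged illustration of the phase ordering only: the WorkerSteps callbacks, timeMs, medianOf, and runWorkerPhases are hypothetical stand-ins, not functions from scripts/incremental-benchmark.ts.

```typescript
// Illustrative sketch only; the callbacks stand in for the real build steps.
interface WorkerSteps {
  deleteDb: () => void;
  build: () => void;            // full or incremental, depending on DB state
  touchProbeFile: () => void;   // modify one file to force a 1-file rebuild
  restoreProbeFile: () => void; // undo the probe modification
}

interface PhaseResult {
  fullBuildMs: number;
  noopRebuildMs: number;
  oneFileRebuildMs: number;
}

function timeMs(fn: () => void): number {
  const start = performance.now();
  fn();
  return performance.now() - start;
}

function medianOf(samples: number[]): number {
  const sorted = samples.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function runWorkerPhases(steps: WorkerSteps, warmupRuns = 2, runs = 5): PhaseResult {
  // Full build: no warmup, because the DB is deleted before every run and
  // each sample is cold by design.
  const full: number[] = [];
  for (let i = 0; i < runs; i++) {
    steps.deleteDb();
    full.push(timeMs(steps.build));
  }

  // No-op rebuild: discard warmup runs, then time.
  for (let i = 0; i < warmupRuns; i++) steps.build();
  const noop: number[] = [];
  for (let i = 0; i < runs; i++) noop.push(timeMs(steps.build));

  // 1-file rebuild: same pattern; the probe file is restored even on failure.
  const oneFile: number[] = [];
  try {
    for (let i = 0; i < warmupRuns; i++) {
      steps.touchProbeFile();
      steps.build();
    }
    for (let i = 0; i < runs; i++) {
      steps.touchProbeFile();
      oneFile.push(timeMs(steps.build));
    }
  } finally {
    steps.restoreProbeFile();
  }

  return {
    fullBuildMs: medianOf(full),
    noopRebuildMs: medianOf(noop),
    oneFileRebuildMs: medianOf(oneFile),
  };
}
```

Keeping the restore step in a finally block means a failed rebuild does not leave the probe file modified for later runs.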

Reviews (2): Last reviewed commit: "test(bench): apply warmup+RUNS=5 to pare..."

…#1093)

Mirror the worker-side methodology fix to the parent process's import-resolution
loop so nativeBatchMs and jsFallbackMs are not exposed to the same cold-start
outlier dynamic the rest of this PR is fixing. Both metrics are sub-15ms on
codegraph itself — exactly the sub-30ms band where a 3-sample median without
warmup picks up rusqlite statement-cache and NAPI init jitter and produces
CI-amplified false regressions.

Greptile PR review feedback.
@carlos-alm
Contributor Author

Addressed Greptile review feedback:

  • Parent-process resolution benchmarks not updated (scripts/incremental-benchmark.ts lines 46–51) — applied the same WARMUP_RUNS = 2 and RUNS = 5 methodology to the parent's native batch and JS fallback loops in commit d2db1e8. nativeBatchMs and jsFallbackMs measure sub-15ms on codegraph itself today (7ms native, 11ms JS at v3.9.6), exactly the sub-30ms band the rest of this PR is hardening against. Added an inline comment explaining the rationale.

Thanks for catching this — the omission was unintentional.

@carlos-alm
Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit c87ae7f into main May 11, 2026
20 checks passed
@carlos-alm carlos-alm deleted the perf/1076-fndeps-and-rebuild-regressions branch May 11, 2026 04:19
@github-actions github-actions Bot locked and limited conversation to collaborators May 11, 2026

Development

Successfully merging this pull request may close these issues.

query benchmark: fnDeps depth 1 regresses ~70% (28.7ms → 48.6ms) on publish gate
