ci: run pre-publish benchmark gate on every PR#1072
Conversation
Mirrors the regression gate from publish.yml so PRs catch performance regressions before merge instead of at release time. The gate measures the PR-built native artifact as version "dev", appends to the benchmark history files in-job, and fails if any metric regresses beyond threshold vs the most recent release baseline. Reuses the existing native-host-build artifact (linux-x64), runs in parallel with the rest of CI, and preserves CODEGRAPH_FAST_SKIP_DIAG so fast-skip rejection reasons surface in CI logs when issues happen. JSON results upload on always() for post-failure debugging.
|
Claude finished @carlos-alm's task —— View job Reviewing PR: ci: run pre-publish benchmark gate on every PRTodo List:
Root Cause AnalysisProblem Being Solved: Performance regressions were only being caught at release time via the pre-publish benchmark gate in Why This Approach: Moving the gate to run on every PR means regressions are caught immediately when introduced, while the diagnostic context is still fresh and the fix is cheap. The recent diagnostics improvements (#1067, #1066) make the gate logs genuinely useful for debugging. Risk Assessment: This is a defensive change that adds quality gates without removing existing safeguards. The approach mirrors the existing proven gate from Backlog Compliance
Critical ConcernsNone identified. This is a well-executed defensive improvement. Strengths:
Technical Implementation Quality:
Minor Considerations:
Final Recommendation
|
Greptile SummaryThis PR ports the
Confidence Score: 5/5Safe to merge — the benchmark job is correctly gated to PR events only, all regression-guard logic paths now include Both previously identified silent failures in the regression guard are addressed: No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A([PR opened / push event]) --> B{event_name?}
B -- pull_request --> C[native-host-build]
B -- push to main --> D[lint / test / typecheck\naudit / verify-imports\nrust-check / parity]
C --> E[pre-publish-benchmark\nif: pull_request]
C --> D
E --> E1[Download native-host-ubuntu-latest]
E1 --> E2[npm install + build TS]
E2 --> E3[Run build benchmark\n→ benchmark-result.json]
E3 --> E4[Run resolution benchmark\n→ resolution-result.json]
E4 --> E5[Gate on resolution thresholds\nvitest verbose]
E5 --> E6[Run tracer validation]
E6 --> E7[Merge resolution into build result]
E7 --> E8[Run query benchmark]
E8 --> E9[Run incremental benchmark]
E9 --> E10[Update build / query / incremental reports]
E10 --> E11[Regression guard\nRUN_REGRESSION_GUARD=1]
E11 --> E12[Upload JSON artifacts\nif: always]
D --> F[ci-pipeline\nif: always]
E12 --> F
B -- push to main\nskips benchmark --> F
F --> G{any failure\nor cancelled?}
G -- yes --> H([❌ PR blocked])
G -- no --> I([✅ PR gates pass])
Reviews (13): Last reviewed commit: "Merge branch 'main' into ci/regression-g..." | Re-trigger Greptile |
| - name: Run build benchmark | ||
| run: | | ||
| STRIP_FLAG=$(node -e "const [M]=process.versions.node.split('.').map(Number); console.log(M>=23?'--strip-types':'--experimental-strip-types')") | ||
| node $STRIP_FLAG --import ./scripts/ts-resolve-loader.js scripts/benchmark.ts --version dev --dist > benchmark-result.json | ||
|
|
||
| - name: Run resolution benchmark | ||
| run: | | ||
| STRIP_FLAG=$(node -e "const [M]=process.versions.node.split('.').map(Number); console.log(M>=23?'--strip-types':'--experimental-strip-types')") | ||
| node $STRIP_FLAG --import ./scripts/ts-resolve-loader.js scripts/resolution-benchmark.ts --version dev --dist > resolution-result.json | ||
|
|
||
| - name: Gate on resolution thresholds | ||
| timeout-minutes: 30 | ||
| run: npx vitest run tests/benchmarks/resolution/resolution-benchmark.test.ts --reporter=verbose | ||
|
|
||
| - name: Run tracer validation (same-file edge recall) | ||
| timeout-minutes: 10 | ||
| run: npx vitest run tests/benchmarks/resolution/tracer/tracer-validation.test.ts --reporter=verbose | ||
|
|
||
| - name: Merge resolution into build result | ||
| run: | | ||
| node -e " | ||
| const fs = require('fs'); | ||
| const build = JSON.parse(fs.readFileSync('benchmark-result.json', 'utf8')); | ||
| const resolution = JSON.parse(fs.readFileSync('resolution-result.json', 'utf8')); | ||
| build.resolution = resolution; | ||
| fs.writeFileSync('benchmark-result.json', JSON.stringify(build, null, 2)); | ||
| " | ||
|
|
||
| - name: Run query benchmark | ||
| run: | | ||
| STRIP_FLAG=$(node -e "const [M]=process.versions.node.split('.').map(Number); console.log(M>=23?'--strip-types':'--experimental-strip-types')") | ||
| node $STRIP_FLAG --import ./scripts/ts-resolve-loader.js scripts/query-benchmark.ts --version dev --dist > query-benchmark-result.json | ||
|
|
||
| - name: Run incremental benchmark | ||
| run: | | ||
| STRIP_FLAG=$(node -e "const [M]=process.versions.node.split('.').map(Number); console.log(M>=23?'--strip-types':'--experimental-strip-types')") | ||
| node $STRIP_FLAG --import ./scripts/ts-resolve-loader.js scripts/incremental-benchmark.ts --version dev --dist > incremental-benchmark-result.json |
There was a problem hiding this comment.
Missing timeouts on benchmark execution steps
The four benchmark run steps (Run build benchmark, Run resolution benchmark, Run query benchmark, Run incremental benchmark) have no timeout-minutes, unlike Gate on resolution thresholds (30 min) and Run tracer validation (10 min). If any benchmark script hangs — for example due to an unresponsive subprocess or an infinite loop in new code — the job will occupy the runner for up to 6 hours before GitHub kills it, blocking the ci-pipeline gate for the entire duration. The equivalent steps in publish.yml share the same gap. Adding per-step timeouts (e.g., 15–20 min each) keeps a stuck job from becoming a multi-hour blocker.
There was a problem hiding this comment.
Fixed in 9aca6db — added timeout-minutes: 20 to all four benchmark execution steps (build / resolution / query / incremental) so a hung script can't hold the runner for the GitHub default 6 hours. Same caveat carries over to publish.yml's equivalent steps; left untouched here per one-PR-one-concern (will follow up if needed).
…1072) The benchmark regression guard fails on the same root cause as the existing 3.9.6 known regressions ('Build ms/file', 'No-op rebuild'): PR #1016 expanded AST_TYPE_MAPS from 3 to 23 languages, causing zero-AST-row files to return astNodes: undefined and trigger a full-corpus re-parse. The regression shows up in INCREMENTAL-BENCHMARKS under the 'Full build' label as well, which was missed when PR #1038 added the other 3.9.6 entries. Both engines regressed: native 2148 -> 2986 (+39%), wasm 7563 -> 14036 (+86%). Fixed by PR #1038; reclears with v3.9.7+ benchmark data. Refs #1036, #1037, #1038.
Address Greptile review feedback: - Add 'if: github.event_name == "pull_request"' to skip the gate on push-to-main. The merged PR already passed the gate on this same diff, so re-running on the merge commit doubles CI minutes per landed change with no new signal. Mirrors publish.yml's 'if: github.event_name != "push"' skip. - Add per-step 'timeout-minutes: 20' to the four benchmark execution steps (build/resolution/query/incremental). Without explicit timeouts, a hung script would hold the runner for the GitHub default (6 hours), blocking the ci-pipeline gate for the entire duration.
|
Addressed Greptile review feedback in 9aca6db and fda3c4f:
Also fixed the failing CI:
|
parseSemver('dev') returns null, so effectiveGap('dev', anyRelease) is
Infinity — the > MAX_VERSION_GAP check silently rejected every dev →
release pairing and findLatestPair fell through to compare the two most
recent real releases. The per-PR regression gate was running on static
historical data instead of the PR's own dev numbers.
Bypass the gap check when latestVersion === 'dev' so dev is always
compared against the most recent comparable release. Real releases
still respect MAX_VERSION_GAP to avoid stale baselines.
|
Addressed Greptile's outstanding concern about The bug was real — Fix: in |
… is dev (#1072) KNOWN_REGRESSIONS keys are anchored to the release where the regression was first observed (e.g. '3.9.6:No-op rebuild'). When the per-PR gate runs 'dev' as latest, lookups using 'dev:Foo' never match — defeating the exemption mechanism for every documented regression. Fall back to the baseline (previous) version's key when latest is 'dev', so a single '3.9.6:Foo' entry covers both '3.9.6 vs 3.9.5' (release-time) and 'dev vs 3.9.6' (per-PR) until the next release clears the regression and the entry is pruned by the existing stale-entry test.
The native 1-file rebuild regressed from 78ms to ~116ms (build) and 54ms to ~81ms (incremental) when #1069 made backfillNativeDroppedFiles run on every successful orchestrator pass — including incrementals — to repair file_hashes/nodes rows for unsupported-extension files. #1070 fixed the orchestrator side, but the JS-side call stayed unconditional, wasting ~45ms per incremental on the codegraph corpus. Fix is tracked in PR #1082, which gates the backfill call on `isFullBuild || removedCount > 0`. Adding the known-regression entry lets this PR's gate pass while #1082 ships through review; the existing stale-entry test will warn once 3.9.7 lands and the entry stays past its useful life.
|
Pushed two follow-up commits to fix the failing CI on this PR's own gate:
Pre-publish benchmark gate is now passing. |
The resolveEntries filter excluded 'dev' from the candidate pool, so nativeBatchMs / jsFallbackMs regressions introduced by a PR would silently pass the gate. Switch to findLatestPair (which already pairs dev → most-recent comparable release) so the import-resolution sub-benchmark gets the same per-PR coverage as the rest of the gate.
|
Addressed Greptile's outstanding concern from the May 6 review summary about the import-resolution sub-benchmark in f5f8372:
Fix: switched the resolve sub-benchmark from the dev-excluding Also resolved the merge conflict with main (9194394). The conflict was in |
CI run 25708925467 surfaced fnDeps depth 1 at 24.7 → 44.9ms (+82%) on the per-PR gate, above the 50% NOISY_METRIC_THRESHOLD. The fn_deps Rust implementation and JS wrapper are byte-for-byte unchanged since v3.9.6 (already documented in the NOISY_METRICS docstring); the ~20ms absolute delta is the shared-runner noise floor on a sub-30ms baseline. Same pattern as the existing 3.10.0:No-op rebuild / 3.10.0:1-file rebuild entries; remove once 3.11.0+ data confirms stabilization under the warmup + 5-sample methodology.
Summary
pre-publish-benchmarkjob frompublish.ymlintoci.ymlso the regression gate runs on every PR instead of only at release time.native-host-buildartifact (linux-x64), runs in parallel with the rest of CI, and is wired intoci-pipelineneeds so a regression fails the PR.CODEGRAPH_FAST_SKIP_DIAG=1and the verbose vitest reporters so when something regresses, the logs already pinpoint the cause without re-running locally.Why
Until now, the regression gate only ran inside
publish.yml's pre-publish path (stable releases /workflow_dispatch). Regressions could land onmainand were only caught at the next release. The recent diagnostics work (#1067, #1066 follow-ups) makes the gate's logs genuinely useful — moving it onto every PR puts that signal in front of the change that introduced the regression, while it's still cheap to fix.How it differs from publish.yml's gate
--version devinstead of a semver.update-benchmark-report.tstreatsdevas a rolling entry (replaces previousdev); the regression-guard'sfindLatestPaircomparesdevagainst the most recent non-skipped release.native-host-ubuntu-latestfrom CI's existing matrix instead ofnative-linux-x64from publish's matrix — no new Rust build added.record-benchmarksjob inbenchmark.yml.if: always(): failed gates are debuggable without re-running the suite locally.Test plan
lint/typecheck/test(only depends onnative-host-build).CODEGRAPH_FAST_SKIP_DIAG=1, verbose reporters, JSON artifact attached).