ci(release): gate npm publish on benchmark regressions #1040
Conversation
Claude finished @carlos-alm's task — View job

**PR Review: ci(release): gate npm publish on benchmark regressions**

**Root Cause Analysis**

*Problem Being Solved:* The regression guard previously ran after releases were published to npm, making it ineffective at blocking bad releases. When v3.9.6 shipped with regressions, the guard only fired after the docs PR merged, blocking subsequent dev work without ever having prevented the problematic release.

*Why This Approach:* Moving benchmark measurement and regression testing into the publish workflow's dependency chain means the `publish` job cannot run until the guard passes, so a regression fails the workflow before npm ever sees the new version.

*Risk Assessment:* Low risk. The changes are well-isolated, maintain existing behavior for the happy path, and the conditional test gating prevents false failures during development.

**Backlog Compliance**
**Critical Concerns**

None identified. This is a well-executed infrastructure improvement.

**Code Quality Assessment**

- Workflow Design: Excellent
- Test Gating Implementation: Solid
- Configuration Changes: Appropriate
- Error Handling: Adequate

**Minor Observations**
Final Recommendation
The PR demonstrates good engineering judgment in restructuring the benchmark pipeline to create an effective regression gate that can actually block problematic releases.
**Greptile Summary**

This PR restructures the benchmark regression guard so it runs as a pre-publish gate in `publish.yml`.
**Confidence Score: 4/5** — Safe to merge after re-adding the omitted resolution and tracer test gates to the pre-publish job. One P1 finding: the resolution threshold tests and tracer validation that previously gated releases are absent from the new pre-publish job, creating a hole where accuracy regressions can ship undetected. All other logic (regression guard gating publish, artifact hand-off, PR consolidation, untracked-file detection) is sound.
Important Files Changed
**Sequence Diagram**

```mermaid
sequenceDiagram
    participant Release as Release Event
    participant BuildNative as build-native
    participant PreBench as pre-publish-benchmark
    participant Publish as publish (npm)
    participant BenchWF as benchmark.yml / record-benchmarks
    Release->>BuildNative: trigger
    BuildNative-->>PreBench: native artifact (linux-x64)
    PreBench->>PreBench: run build / query / incremental benchmarks
    PreBench->>PreBench: update history files
    PreBench->>PreBench: regression guard (RUN_REGRESSION_GUARD=1)
    alt regression detected
        PreBench-->>Publish: job fails — publish skipped
        Note over BenchWF: workflow_run conclusion != success → record-benchmarks skipped
    else no regression
        PreBench-->>Publish: upload benchmark-files + benchmark-results-json artifacts
        Publish->>Publish: npm publish
        Publish-->>BenchWF: workflow_run conclusion=success
        BenchWF->>BenchWF: download benchmark-files artifact
        BenchWF->>BenchWF: git diff / ls-files check
        BenchWF->>BenchWF: open single combined PR to main
        BenchWF->>BenchWF: engine-parity gate (soft signal)
    end
```
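The `workflow_run` hand-off the diagram relies on could be expressed roughly like this in benchmark.yml (a sketch under assumed names, not the file's actual contents):

```yaml
# Hypothetical sketch: record-benchmarks keys off the publish workflow's
# conclusion, so an aborted publish (regression gate fired) opens no PR.
on:
  workflow_run:
    workflows: ["Publish"]
    types: [completed]

jobs:
  record-benchmarks:
    # Skipped automatically unless the publish run succeeded
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
```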
```yaml
- uses: actions/setup-node@v6
  with:
    node-version: "22"
```
**Missing npm cache in `pre-publish-benchmark`**

`actions/setup-node@v6` is configured without `cache: "npm"`, so every run re-downloads all npm dependencies from scratch. All other jobs in the workflow that call `npm install` use `cache: "npm"`. For a release-gating job this adds unnecessary minutes to every stable release.
```diff
 - uses: actions/setup-node@v6
   with:
     node-version: "22"
+    cache: "npm"
```
Fixed in 1d50950 — added `cache: "npm"` to the `actions/setup-node@v6` step in `pre-publish-benchmark`, matching the convention used by every other job in this workflow.
```yaml
  uses: actions/upload-artifact@v7
  with:
    name: benchmark-files
    path: |
      generated/benchmarks/BUILD-BENCHMARKS.md
      generated/benchmarks/QUERY-BENCHMARKS.md
      generated/benchmarks/INCREMENTAL-BENCHMARKS.md
      README.md
    if-no-files-found: error

# Raw JSON used by post-publish soft-signal jobs (e.g. engine-parity
# gate in the Benchmark workflow). Separated from the history-files
# artifact because consumers read different shapes.
- name: Upload benchmark JSON results
  uses: actions/upload-artifact@v7
  with:
    name: benchmark-results-json
    path: |
      benchmark-result.json
      query-benchmark-result.json
      incremental-benchmark-result.json
    if-no-files-found: error
```
**`upload-artifact@v7` / `download-artifact@v8` version mismatch**

`publish.yml` uploads with `actions/upload-artifact@v7` while `benchmark.yml` downloads with `actions/download-artifact@v8`. Both are past the v4 threshold that switched to the new artifact service backend, so they are functionally compatible today, but pinning them to the same major version avoids any future surprise if a protocol change lands in v8 upload before v7 catches up. Consider aligning both to `@v8` (or whichever is current).
Not actionable as proposed — `actions/upload-artifact` and `actions/download-artifact` are versioned independently and have always had different majors. The current latest released majors are `upload-artifact@v7` (v7.0.1) and `download-artifact@v8` (v8.0.1); there is no `upload-artifact@v8`. So the existing pinning is the correct latest-stable pairing and is also consistent with every other workflow in this repo (ci.yml, build-native.yml, benchmark.yml, shield-license-compliance.yml). Leaving as-is.
```yaml
- name: Check for changes
  if: steps.existing.outputs.skip != 'true'
  id: changes
  run: |
    CHANGED=false
    # Detect modified tracked files
    if ! git diff --quiet HEAD -- generated/benchmarks/BUILD-BENCHMARKS.md README.md 2>/dev/null; then
      CHANGED=true
    fi
    # Detect newly created (untracked) files
    if [ -n "$(git ls-files --others --exclude-standard generated/benchmarks/BUILD-BENCHMARKS.md)" ]; then
      if ! git diff --quiet HEAD -- generated/benchmarks/ README.md 2>/dev/null; then
        CHANGED=true
      fi
    fi
    echo "changed=$CHANGED" >> "$GITHUB_OUTPUT"
```
**Untracked new files not detected by `git diff --quiet HEAD`**

The old build-benchmark, query-benchmark, and incremental-benchmark jobs each paired `git diff --quiet HEAD` with an explicit `git ls-files --others --exclude-standard` check to catch newly-created (untracked) history files. The new consolidated check dropped that second guard:

```sh
# old pattern, in each of the three removed jobs:
if [ -n "$(git ls-files --others --exclude-standard generated/benchmarks/BUILD-BENCHMARKS.md)" ]; then
  CHANGED=true
fi
```

If a benchmark history file does not yet exist in the repo (first-run for that benchmark type, or an intentionally deleted file being re-created), the download will produce an untracked file that `git diff HEAD` silently ignores, `CHANGED` stays false, and the PR step is skipped — the verified numbers never make it back to main.
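The gap is easy to reproduce in a scratch repository: `git diff --quiet HEAD` reports clean while the untracked file only shows up via `git ls-files --others` (a minimal sketch; the repo layout and file name merely mimic the workflow's):

```shell
#!/bin/sh
set -e
# Scratch repo mimicking record-benchmarks after downloading a first-run
# history file: one empty commit, then a brand-new untracked file.
dir=$(mktemp -d)
cd "$dir"
git init -q
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m init
mkdir -p generated/benchmarks
echo "new history" > generated/benchmarks/BUILD-BENCHMARKS.md

# git diff --quiet HEAD exits 0: untracked files are invisible to it
if git diff --quiet HEAD -- generated/benchmarks/; then
  echo "diff: clean"
fi

# git ls-files --others does report the file, so the paired guard is
# what catches the first-run case
if [ -n "$(git ls-files --others --exclude-standard generated/benchmarks/)" ]; then
  echo "ls-files: untracked file present"
fi
```

Both echoes fire: the diff check sees nothing while the ls-files check does, which is exactly the window the dropped guard used to close.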
```diff
 - name: Check for changes
-  if: steps.existing.outputs.skip != 'true'
   id: changes
   run: |
     CHANGED=false
-    # Detect modified tracked files
-    if ! git diff --quiet HEAD -- generated/benchmarks/BUILD-BENCHMARKS.md README.md 2>/dev/null; then
+    if ! git diff --quiet HEAD -- generated/benchmarks/ README.md 2>/dev/null; then
       CHANGED=true
     fi
-    # Detect newly created (untracked) files
-    if [ -n "$(git ls-files --others --exclude-standard generated/benchmarks/BUILD-BENCHMARKS.md)" ]; then
-      if ! git diff --quiet HEAD -- generated/benchmarks/ README.md 2>/dev/null; then
-        CHANGED=true
-      fi
+    if [ -n "$(git ls-files --others --exclude-standard generated/benchmarks/)" ]; then
+      CHANGED=true
+    fi
     echo "changed=$CHANGED" >> "$GITHUB_OUTPUT"
```
Fixed in b0389af — restored the `git ls-files --others --exclude-standard generated/benchmarks/` guard alongside the `git diff --quiet HEAD` check, so first-run/re-created history files are still committed back to main.
The regression guard previously ran inside `npm test` against the existing benchmark history files. Because those files are only updated post-publish (by the Benchmark workflow's auto-PRs), a regression introduced in vX could ship to npm before its numbers were ever recorded — and the guard would then fire on every dev push to main once the docs PR landed, blocking unrelated work without ever having prevented the bad release.

Restructure so the gate runs where it can actually block:

- New `pre-publish-benchmark` job in publish.yml (release events only): measures the just-built native artifact, writes new history entries, runs the regression guard, and uploads the modified files. The `publish` job depends on it, so a regression fails the workflow before npm sees the new version. The history files are uploaded as an artifact.
- benchmark.yml's three measurement jobs (build/query/incremental) are replaced by a single `record-benchmarks` job that downloads the pre-publish artifact and opens one PR with the verified numbers. `workflow_run.conclusion == 'success'` already gates this, so no PR is opened for an aborted publish. The engine-parity gate runs here as a soft signal (unchanged semantics). The embedding-benchmark job is unchanged — no regression guard, can't fit in pre-publish.
- The regression-guard test is gated on `RUN_REGRESSION_GUARD=1` so the default `npm test` run shows it as skipped rather than failing on history that already passed gating at release time. CI sets the env var in the pre-publish step.
Match the cache: "npm" setting used by every other job in publish.yml so release runs reuse the npm cache instead of re-downloading dependencies.
Each of the three pre-consolidation jobs paired `git diff --quiet HEAD` with `git ls-files --others --exclude-standard` so a first-run history file (or one re-created after deletion) would still be picked up. The consolidated check dropped that second guard, which would silently skip the PR step if a benchmark history file was untracked. Restore the guard so verified numbers always make it back to main.
04d86c1 to b0389af
Addressed Greptile's review feedback in b0389af and the prior 1fc92b2.
Summary
The regression guard previously ran inside `npm test` against the existing benchmark history files. Because those files are only updated post-publish (by the Benchmark workflow's auto-PRs), a regression introduced in vX could ship to npm before its numbers were ever recorded — and the guard would then fire on every dev push to main once the docs PR landed, blocking unrelated work without ever having prevented the bad release. (See the Publish #831 failure for the canonical instance: 3.9.6 shipped, then docs PRs merged, then the regression guard fired and blocked the dev pre-release for the docs commit.)

This restructures so the gate runs where it can actually block:

- New `pre-publish-benchmark` job in `publish.yml` (release events only): measures the just-built native artifact, writes new history entries, runs the regression guard, and uploads the modified files as artifacts. The `publish` job depends on it, so a regression fails the workflow before npm sees the new version.
- `benchmark.yml`'s three measurement jobs (build/query/incremental) are replaced by a single `record-benchmarks` job that downloads the pre-publish artifact and opens one PR with the verified numbers. `workflow_run.conclusion == 'success'` already gates this, so no PR is opened for an aborted publish. The engine-parity gate runs here as a soft signal (unchanged semantics).
- The `embedding-benchmark` job is unchanged — no regression guard, and its ~2.5h runtime can't fit in pre-publish.
- `tests/benchmarks/regression-guard.test.ts` is gated on `RUN_REGRESSION_GUARD=1` so the default `npm test` run shows it as skipped rather than failing on history that already passed gating at release time. The CI step sets the env var explicitly.

Test plan

- `npm test` skips the regression-guard tests (verified locally — 17 skipped, 0 failed)
- `RUN_REGRESSION_GUARD=1 npm run test:regression-guard` runs the file (verified — 17 passed against current history)
- `record-benchmarks` opens a single combined PR (replacing the previous 3 separate ones)
npm testskips the regression-guard tests (verified locally — 17 skipped, 0 failed)RUN_REGRESSION_GUARD=1 npm run test:regression-guardruns the file (verified — 17 passed against current history)record-benchmarksopens a single combined PR (replacing the previous 3 separate ones)