ci(release): gate npm publish on benchmark regressions#1040

Merged
carlos-alm merged 3 commits into main from ci/prepublish-bench-gate on May 1, 2026

Conversation

@carlos-alm
Contributor

Summary

The regression guard previously ran inside npm test against the existing benchmark history files. Because those files are only updated post-publish (by the Benchmark workflow's auto-PRs), a regression introduced in version X could ship to npm before its numbers were ever recorded. Worse, once the docs PR landed, the guard would then fire on every dev push to main, blocking unrelated work without ever having prevented the bad release. (See the Publish #831 failure for the canonical instance: 3.9.6 shipped, then the docs PRs merged, then the regression guard fired and blocked the dev pre-release for the docs commit.)

This PR restructures the pipeline so the gate runs where it can actually block a release.

  • New pre-publish-benchmark job in publish.yml (release events only): measures the just-built native artifact, writes new history entries, runs the regression guard, uploads modified files as artifacts. The publish job depends on it, so a regression fails the workflow before npm sees the new version.
  • benchmark.yml's three measurement jobs (build/query/incremental) are replaced by a single record-benchmarks job that downloads the pre-publish artifact and opens one PR with the verified numbers. workflow_run.conclusion == 'success' already gates this, so no PR is opened for an aborted publish. Engine-parity gate runs here as a soft signal (unchanged semantics). embedding-benchmark job is unchanged — no regression guard, can't fit in pre-publish (~2.5h runtime).
  • tests/benchmarks/regression-guard.test.ts is gated on RUN_REGRESSION_GUARD=1 so the default npm test run shows it as skipped rather than failing on history that already passed gating at release time. The CI step sets the env var explicitly.
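As a sketch, the gate wiring in publish.yml could look roughly like this. The job names match the description above, but the concrete steps, artifact names, and script names are illustrative assumptions, not the actual workflow contents:

```yaml
# Illustrative sketch only — step details, artifact names, and the
# bench:all script name are assumptions, not the real workflow.
pre-publish-benchmark:
  needs: build-native
  if: github.event_name != 'push'       # stable release events only
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: actions/download-artifact@v8
      with:
        name: native-linux-x64          # the just-built native artifact
    - name: Measure build / query / incremental benchmarks
      run: npm run bench:all            # hypothetical script name
    - name: Regression guard
      run: RUN_REGRESSION_GUARD=1 npm run test:regression-guard
    - uses: actions/upload-artifact@v7
      with:
        name: benchmark-files
        path: generated/benchmarks/

publish:
  needs: pre-publish-benchmark          # regression fails here before npm sees the version
```

The key point is the `needs:` edge: if the guard step fails, `pre-publish-benchmark` fails, and `publish` never runs.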

Test plan

  • npm test skips the regression-guard tests (verified locally — 17 skipped, 0 failed)
  • RUN_REGRESSION_GUARD=1 npm run test:regression-guard runs the file (verified — 17 passed against current history)
  • Both workflow files parse as valid YAML
  • CI: next stable release exercises the new pre-publish job end-to-end
  • Verify record-benchmarks opens a single combined PR (replacing the previous 3 separate ones)

@claude
Copy link
Copy Markdown

claude Bot commented Apr 30, 2026

Claude finished @carlos-alm's task.


PR Review: ci(release): gate npm publish on benchmark regressions

Root Cause Analysis

Problem Being Solved: The regression guard previously ran after releases were published to npm, making it ineffective at blocking bad releases. When v3.9.6 shipped with regressions, the guard only fired after the docs PR merged, blocking subsequent dev work without ever having prevented the problematic release.

Why This Approach: Moving benchmark measurement and regression testing into the publish workflow's pre-publish-benchmark job creates a true gate - the publish job depends on it, so regressions block npm publishing. This is architecturally sound and addresses the core timing issue.

Risk Assessment: Low risk. The changes are well-isolated, maintain existing behavior for the happy path, and the conditional test gating prevents false failures during development.

Backlog Compliance

  • Zero-dep: ✓ — No new runtime dependencies, builds on existing benchmark infrastructure
  • Foundation-aligned: ✓ — Pure CI/testing improvement, doesn't violate any core principles
  • Problem-fit: 2 — Improves release quality but tangential to core "lost AI" problem
  • Breaking: No — Purely additive changes to CI workflows
  • Tier: 1

Critical Concerns

None identified. This is a well-executed infrastructure improvement.

Code Quality Assessment

Workflow Design: Excellent

  • pre-publish-benchmark job correctly depends on build-native and runs only for non-push events (stable releases)
  • publish job properly depends on pre-publish-benchmark, creating the intended gate
  • Artifact flow is clean: pre-publish measures → uploads files → record-benchmarks downloads and creates PR

Test Gating Implementation: Solid

  • RUN_REGRESSION_GUARD=1 environment variable properly gates the test suite
  • describe.runIf() ensures tests are skipped (not failed) during normal development
  • New npm run test:regression-guard script provides explicit invocation path
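The environment-variable gate reduces to a tiny predicate. A hypothetical standalone version (the helper name is invented here, not the repo's actual code; in the test file it would be passed to vitest's `describe.runIf()`):

```typescript
// Hypothetical mirror of the describe.runIf() condition: the suite runs
// only when RUN_REGRESSION_GUARD is set to exactly "1".
function shouldRunRegressionGuard(
  env: Record<string, string | undefined>,
): boolean {
  return env.RUN_REGRESSION_GUARD === "1";
}

// In the test file this would gate the whole suite, e.g.:
//   describe.runIf(shouldRunRegressionGuard(process.env))("regression guard", () => { ... });

console.log(shouldRunRegressionGuard({ RUN_REGRESSION_GUARD: "1" })); // true
console.log(shouldRunRegressionGuard({ RUN_REGRESSION_GUARD: "" }));  // false
console.log(shouldRunRegressionGuard({}));                            // false
```

Because `describe.runIf(false)` marks the suite as skipped rather than failed, the default npm test run reports the 17 tests as skipped, exactly as the test plan describes.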

Configuration Changes: Appropriate

  • 469-line reduction in benchmark.yml eliminates duplicate measurement logic
  • 133-line addition to publish.yml adds the pre-publish gate without disrupting existing flows
  • Changes are focused and well-contained

Error Handling: Adequate

  • workflow_run.conclusion == 'success' prevents PR creation for failed publishes
  • Engine parity gate runs as intended soft signal (unchanged semantics)
  • Embedding benchmarks remain post-publish due to 2.5h runtime constraint
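A hypothetical shape for that `workflow_run` gate in benchmark.yml (workflow and artifact names are assumed; the PR's actual file is not reproduced here):

```yaml
# Sketch only — "Publish" and the artifact name are assumptions.
on:
  workflow_run:
    workflows: ["Publish"]
    types: [completed]

jobs:
  record-benchmarks:
    # No PR is opened for an aborted publish: only successful runs pass this gate.
    if: github.event.workflow_run.conclusion == 'success'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v8
        with:
          name: benchmark-files
          run-id: ${{ github.event.workflow_run.id }}
          github-token: ${{ secrets.GITHUB_TOKEN }}
```

Note that downloading an artifact from a different workflow run requires the `run-id` (and a token) on `download-artifact`, since artifacts are scoped per run.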

Minor Observations

  1. Embedding Constraint: The 2.5h embedding benchmark runtime legitimately cannot fit in the pre-publish path - this limitation is well-documented and the current post-publish approach is the correct tradeoff.

  2. Test Structure: The conditional test gating is clean and follows vitest patterns correctly. The approach prevents the confusing scenario where tests fail on already-merged benchmark data.

  3. Dependency Chain: The job dependency structure (preflight → compute-version/build-native → pre-publish-benchmark → publish) creates the intended gate without over-constraining the pipeline.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: This is a thoughtful solution to a real operational problem. The implementation is clean, well-tested, and addresses the root cause (timing) rather than symptoms. The only reason it's not 5-star is that it's infrastructure work tangential to the core value proposition - but it's excellent infrastructure work that materially improves release quality.

The PR demonstrates good engineering judgment in restructuring the benchmark pipeline to create an effective regression gate where one can actually block problematic releases.


@greptile-apps
Contributor

greptile-apps Bot commented Apr 30, 2026

Greptile Summary

This PR restructures the benchmark regression guard so it runs as a pre-publish gate in publish.yml rather than post-publish in benchmark.yml, correctly blocking bad releases before they reach npm. The three separate benchmark jobs in benchmark.yml are consolidated into a single record-benchmarks job that consumes pre-built artifacts from the publish run.

  • P1: The new pre-publish-benchmark job drops two test gates that previously blocked releases: tests/benchmarks/resolution/resolution-benchmark.test.ts (resolution accuracy thresholds) and tests/benchmarks/resolution/tracer/tracer-validation.test.ts (same-file edge recall). The Regression guard step only checks latency history files — it does not cover resolution accuracy, so an accuracy regression can now ship to npm silently.

Confidence Score: 4/5

Safe to merge after re-adding the omitted resolution and tracer test gates to pre-publish-benchmark

One P1 finding: the resolution threshold tests and tracer validation that previously gated releases are absent from the new pre-publish job, creating a hole where accuracy regressions can ship undetected. All other logic (regression guard gating publish, artifact hand-off, PR consolidation, untracked-file detection) is sound.

.github/workflows/publish.yml — the pre-publish-benchmark job needs the resolution and tracer test steps restored

Important Files Changed

Filename Overview
.github/workflows/publish.yml Adds pre-publish-benchmark job gating publish on regression guard, but omits the resolution threshold tests and tracer validation that previously blocked releases in the old build-benchmark job
.github/workflows/benchmark.yml Three measurement jobs consolidated into a single record-benchmarks job that downloads pre-built artifacts; correctly preserves untracked-file detection and engine-parity gate; workflow_dispatch removal is intentional
package.json Adds test:regression-guard script to run only the regression guard test file; straightforward addition
tests/benchmarks/regression-guard.test.ts Gates the entire describe block behind RUN_REGRESSION_GUARD=1 so default npm test skips it; clean, minimal change

Sequence Diagram

```mermaid
sequenceDiagram
    participant Release as Release Event
    participant BuildNative as build-native
    participant PreBench as pre-publish-benchmark
    participant Publish as publish (npm)
    participant BenchWF as benchmark.yml / record-benchmarks

    Release->>BuildNative: trigger
    BuildNative-->>PreBench: native artifact (linux-x64)
    PreBench->>PreBench: run build / query / incremental benchmarks
    PreBench->>PreBench: update history files
    PreBench->>PreBench: regression guard (RUN_REGRESSION_GUARD=1)
    alt regression detected
        PreBench-->>Publish: job fails — publish skipped
        Note over BenchWF: workflow_run conclusion != success → record-benchmarks skipped
    else no regression
        PreBench-->>Publish: upload benchmark-files + benchmark-results-json artifacts
        Publish->>Publish: npm publish
        Publish-->>BenchWF: workflow_run conclusion=success
        BenchWF->>BenchWF: download benchmark-files artifact
        BenchWF->>BenchWF: git diff / ls-files check
        BenchWF->>BenchWF: open single combined PR to main
        BenchWF->>BenchWF: engine-parity gate (soft signal)
    end
```

Comments Outside Diff (1)

  1. .github/workflows/publish.yml, lines 602-607

    P1 Resolution threshold gate and tracer validation silently dropped

    The old build-benchmark job ran two explicit test gates after collecting resolution data:

    1. npx vitest run tests/benchmarks/resolution/resolution-benchmark.test.ts — gates on per-language resolution accuracy thresholds
    2. npx vitest run tests/benchmarks/resolution/tracer/tracer-validation.test.ts — validates same-file edge recall

    The new pre-publish-benchmark job runs scripts/resolution-benchmark.ts to collect data and merges it into benchmark-result.json, but neither test step is present. The Regression guard step only checks history-file comparison (build/query/incremental latency) — it does not cover resolution accuracy or tracer recall. A resolution accuracy regression can now ship to npm with no gate catching it.
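To make the scope concrete: a latency-history guard of the kind described reduces to a threshold comparison like the hypothetical check below. Nothing in such a check examines resolution accuracy or tracer recall, which is why the separate test gates matter. The function name and 10% tolerance are invented for illustration; the real thresholds live in tests/benchmarks/regression-guard.test.ts:

```typescript
// Illustrative only — not the repo's actual guard logic or thresholds.
// Flags a regression when the new latency exceeds the baseline by more
// than the given fractional tolerance (default 10%).
function isRegression(
  baselineMs: number,
  currentMs: number,
  tolerance = 0.1,
): boolean {
  return currentMs > baselineMs * (1 + tolerance);
}

console.log(isRegression(100, 105)); // false — within tolerance
console.log(isRegression(100, 120)); // true  — regression
```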


Comment on lines +251 to +253
```yaml
- uses: actions/setup-node@v6
  with:
    node-version: "22"
```
Contributor


P2 Missing npm cache in pre-publish-benchmark

actions/setup-node@v6 is configured without cache: "npm", so every run re-downloads all npm dependencies from scratch. All other jobs in the workflow that call npm install use cache: "npm". For a release-gating job this adds unnecessary minutes to every stable release.

Suggested change

```diff
 - uses: actions/setup-node@v6
   with:
     node-version: "22"
+    cache: "npm"
```


Contributor Author


Fixed in 1d50950 — added cache: "npm" to the actions/setup-node@v6 step in pre-publish-benchmark, matching the convention used by every other job in this workflow.

Comment on lines +336 to +357
```yaml
  uses: actions/upload-artifact@v7
  with:
    name: benchmark-files
    path: |
      generated/benchmarks/BUILD-BENCHMARKS.md
      generated/benchmarks/QUERY-BENCHMARKS.md
      generated/benchmarks/INCREMENTAL-BENCHMARKS.md
      README.md
    if-no-files-found: error

# Raw JSON used by post-publish soft-signal jobs (e.g. engine-parity
# gate in the Benchmark workflow). Separated from the history-files
# artifact because consumers read different shapes.
- name: Upload benchmark JSON results
  uses: actions/upload-artifact@v7
  with:
    name: benchmark-results-json
    path: |
      benchmark-result.json
      query-benchmark-result.json
      incremental-benchmark-result.json
    if-no-files-found: error
```
Contributor


P2 upload-artifact@v7 / download-artifact@v8 version mismatch

publish.yml uploads with actions/upload-artifact@v7 while benchmark.yml downloads with actions/download-artifact@v8. Both are past the v4 threshold that switched to the new artifact service backend, so they are functionally compatible today, but pinning them to the same major version avoids any future surprise if a protocol change lands in v8 upload before v7 catches up. Consider aligning both to @v8 (or whichever is current).


Contributor Author


Not actionable as proposed — actions/upload-artifact and actions/download-artifact are versioned independently and have always had different majors. The current latest released majors are upload-artifact@v7 (v7.0.1) and download-artifact@v8 (v8.0.1); there is no upload-artifact@v8. So the existing pinning is the correct latest-stable pairing and is also consistent with every other workflow in this repo (ci.yml, build-native.yml, benchmark.yml, shield-license-compliance.yml). Leaving as-is.

carlos-alm added a commit that referenced this pull request May 1, 2026

Match the cache: "npm" setting used by every other job in publish.yml so release runs reuse the npm cache instead of re-downloading dependencies.
@carlos-alm
Contributor Author

@greptileai

Comment on lines 64 to 71

```yaml
- name: Check for changes
  if: steps.existing.outputs.skip != 'true'
  id: changes
  run: |
    CHANGED=false
    # Detect modified tracked files
    if ! git diff --quiet HEAD -- generated/benchmarks/BUILD-BENCHMARKS.md README.md 2>/dev/null; then
      CHANGED=true
    fi
    # Detect newly created (untracked) files
    if [ -n "$(git ls-files --others --exclude-standard generated/benchmarks/BUILD-BENCHMARKS.md)" ]; then
      if ! git diff --quiet HEAD -- generated/benchmarks/ README.md 2>/dev/null; then
        CHANGED=true
      fi
    fi
    echo "changed=$CHANGED" >> "$GITHUB_OUTPUT"
```
Contributor


P1 Untracked new files not detected by git diff --quiet HEAD

The old build-benchmark, query-benchmark, and incremental-benchmark jobs each paired git diff --quiet HEAD with an explicit git ls-files --others --exclude-standard check to catch newly-created (untracked) history files. The new consolidated check dropped that second guard:

```shell
# old pattern, in each of the three removed jobs:
if [ -n "$(git ls-files --others --exclude-standard generated/benchmarks/BUILD-BENCHMARKS.md)" ]; then
  CHANGED=true
fi
```

If a benchmark history file does not yet exist in the repo (first-run for that benchmark type, or an intentionally deleted file being re-created), the download will produce an untracked file that git diff HEAD silently ignores, CHANGED stays false, and the PR step is skipped — the verified numbers never make it back to main.

Suggested change

```diff
 - name: Check for changes
-  if: steps.existing.outputs.skip != 'true'
   id: changes
   run: |
     CHANGED=false
-    # Detect modified tracked files
-    if ! git diff --quiet HEAD -- generated/benchmarks/BUILD-BENCHMARKS.md README.md 2>/dev/null; then
-      CHANGED=true
-    fi
-    # Detect newly created (untracked) files
-    if [ -n "$(git ls-files --others --exclude-standard generated/benchmarks/BUILD-BENCHMARKS.md)" ]; then
-      if ! git diff --quiet HEAD -- generated/benchmarks/ README.md 2>/dev/null; then
-        CHANGED=true
-      fi
-    fi
+    if ! git diff --quiet HEAD -- generated/benchmarks/ README.md 2>/dev/null; then
+      CHANGED=true
+    fi
+    if [ -n "$(git ls-files --others --exclude-standard generated/benchmarks/)" ]; then
+      CHANGED=true
+    fi
     echo "changed=$CHANGED" >> "$GITHUB_OUTPUT"
```


Contributor Author


Fixed in b0389af — restored the git ls-files --others --exclude-standard generated/benchmarks/ guard alongside the git diff --quiet HEAD check, so first-run/re-created history files are still committed back to main.

carlos-alm and others added 3 commits April 30, 2026 23:41
The regression guard previously ran inside `npm test` against the existing
benchmark history files. Because those files are only updated post-publish
(by the Benchmark workflow's auto-PRs), a regression introduced in vX could
ship to npm before its numbers were ever recorded — and the guard would
then fire on every dev push to main once the docs PR landed, blocking
unrelated work without ever having prevented the bad release.

Restructure so the gate runs where it can actually block:

- New `pre-publish-benchmark` job in publish.yml (release events only):
  measures the just-built native artifact, writes new history entries,
  runs the regression guard, and uploads the modified files. The `publish`
  job depends on it, so a regression fails the workflow before npm sees
  the new version. The history files are uploaded as an artifact.

- benchmark.yml's three measurement jobs (build/query/incremental) are
  replaced by a single `record-benchmarks` job that downloads the
  pre-publish artifact and opens one PR with the verified numbers.
  `workflow_run.conclusion == 'success'` already gates this, so no PR
  is opened for an aborted publish. The engine-parity gate runs here
  as a soft signal (unchanged semantics). The embedding-benchmark job
  is unchanged — no regression guard, can't fit in pre-publish.

- The regression-guard test is gated on `RUN_REGRESSION_GUARD=1` so the
  default `npm test` run shows it as skipped rather than failing on
  history that already passed gating at release time. CI sets the env
  var in the pre-publish step.

Match the cache: "npm" setting used by every other job in publish.yml so
release runs reuse the npm cache instead of re-downloading dependencies.

Each of the three pre-consolidation jobs paired `git diff --quiet HEAD`
with `git ls-files --others --exclude-standard` so a first-run history
file (or one re-created after deletion) would still be picked up. The
consolidated check dropped that second guard, which would silently skip
the PR step if a benchmark history file was untracked. Restore the
guard so verified numbers always make it back to main.
@carlos-alm carlos-alm force-pushed the ci/prepublish-bench-gate branch from 04d86c1 to b0389af on May 1, 2026 05:42
@carlos-alm
Contributor Author

Addressed Greptile's review feedback in b0389af + the prior 1fc92b2:

  • P1 untracked-file detection (benchmark.yml): restored the git ls-files --others --exclude-standard generated/benchmarks/ guard alongside the git diff --quiet HEAD check. First-run/re-created history files will once again trigger the PR step.
  • Minor — npm install flags (publish.yml): not changing. The "every other npm install passes those flags" framing isn't accurate for this workflow — publish.yml has bare npm install calls at lines 50 (build-native), 376 (publish-platform-packages), and 555 (publish). The new pre-publish-benchmark step at line 273 matches the existing pattern in this file. Worth a separate PR if we want to standardize, but out of scope here.
  • Commit-message format: force-pushed to fix a body line in the prior commit that exceeded commitlint's 100-char limit. Same content, just wrapped.

@carlos-alm carlos-alm merged commit 20b8596 into main May 1, 2026
26 checks passed
@carlos-alm carlos-alm deleted the ci/prepublish-bench-gate branch May 1, 2026 06:22
@github-actions github-actions Bot locked and limited conversation to collaborators May 1, 2026