ci(release): gate npm publish on benchmark regressions#1040

Merged
carlos-alm merged 3 commits into main from ci/prepublish-bench-gate on May 1, 2026

Conversation

@carlos-alm
Contributor

Summary

The regression guard previously ran inside npm test against the existing benchmark history files. Because those files are only updated post-publish (by the Benchmark workflow's auto-PRs), a regression introduced in version X could ship to npm before its numbers were ever recorded. Worse, once the docs PR landed, the guard would then fire on every dev push to main, blocking unrelated work without ever having prevented the bad release. (See the Publish #831 failure for the canonical instance: 3.9.6 shipped, then the docs PRs merged, then the regression guard fired and blocked the dev pre-release for the docs commit.)

This PR restructures the pipeline so the gate runs where it can actually block a release.

  • New pre-publish-benchmark job in publish.yml (release events only): measures the just-built native artifact, writes new history entries, runs the regression guard, uploads modified files as artifacts. The publish job depends on it, so a regression fails the workflow before npm sees the new version.
  • benchmark.yml's three measurement jobs (build/query/incremental) are replaced by a single record-benchmarks job that downloads the pre-publish artifact and opens one PR with the verified numbers. workflow_run.conclusion == 'success' already gates this, so no PR is opened for an aborted publish. Engine-parity gate runs here as a soft signal (unchanged semantics). embedding-benchmark job is unchanged — no regression guard, can't fit in pre-publish (~2.5h runtime).
  • tests/benchmarks/regression-guard.test.ts is gated on RUN_REGRESSION_GUARD=1 so the default npm test run shows it as skipped rather than failing on history that already passed gating at release time. The CI step sets the env var explicitly.
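As a sketch, the gate wiring in publish.yml could look roughly like this. The job names match the description above, but the concrete steps, artifact names, and script names are illustrative assumptions, not the actual workflow contents:

```yaml
# Illustrative sketch only — step details, artifact names, and the
# bench:all script name are assumptions, not the real workflow.
pre-publish-benchmark:
  needs: build-native
  if: github.event_name != 'push'       # stable release events only
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: actions/download-artifact@v8
      with:
        name: native-linux-x64          # the just-built native artifact
    - name: Measure build / query / incremental benchmarks
      run: npm run bench:all            # hypothetical script name
    - name: Regression guard
      run: RUN_REGRESSION_GUARD=1 npm run test:regression-guard
    - uses: actions/upload-artifact@v7
      with:
        name: benchmark-files
        path: generated/benchmarks/

publish:
  needs: pre-publish-benchmark          # regression fails here before npm sees the version
```

The key point is the `needs:` edge: if the guard step fails, `pre-publish-benchmark` fails, and `publish` never runs.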

Test plan

  • npm test skips the regression-guard tests (verified locally — 17 skipped, 0 failed)
  • RUN_REGRESSION_GUARD=1 npm run test:regression-guard runs the file (verified — 17 passed against current history)
  • Both workflow files parse as valid YAML
  • CI: next stable release exercises the new pre-publish job end-to-end
  • Verify record-benchmarks opens a single combined PR (replacing the previous 3 separate ones)

@claude
Copy link
Copy Markdown

claude Bot commented Apr 30, 2026

Claude finished @carlos-alm's task.


PR Review: ci(release): gate npm publish on benchmark regressions

Root Cause Analysis

Problem Being Solved: The regression guard previously ran after releases were published to npm, making it ineffective at blocking bad releases. When v3.9.6 shipped with regressions, the guard only fired after the docs PR merged, blocking subsequent dev work without ever having prevented the problematic release.

Why This Approach: Moving benchmark measurement and regression testing into the publish workflow's pre-publish-benchmark job creates a true gate - the publish job depends on it, so regressions block npm publishing. This is architecturally sound and addresses the core timing issue.

Risk Assessment: Low risk. The changes are well-isolated, maintain existing behavior for the happy path, and the conditional test gating prevents false failures during development.

Backlog Compliance

  • Zero-dep: ✓ — No new runtime dependencies, builds on existing benchmark infrastructure
  • Foundation-aligned: ✓ — Pure CI/testing improvement, doesn't violate any core principles
  • Problem-fit: 2 — Improves release quality but tangential to core "lost AI" problem
  • Breaking: No — Purely additive changes to CI workflows
  • Tier: 1

Critical Concerns

None identified. This is a well-executed infrastructure improvement.

Code Quality Assessment

Workflow Design: Excellent

  • pre-publish-benchmark job correctly depends on build-native and runs only for non-push events (stable releases)
  • publish job properly depends on pre-publish-benchmark, creating the intended gate
  • Artifact flow is clean: pre-publish measures → uploads files → record-benchmarks downloads and creates PR

Test Gating Implementation: Solid

  • RUN_REGRESSION_GUARD=1 environment variable properly gates the test suite
  • describe.runIf() ensures tests are skipped (not failed) during normal development
  • New npm run test:regression-guard script provides explicit invocation path
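The environment-variable gate reduces to a tiny predicate. A hypothetical standalone version (the helper name is invented here, not the repo's actual code; in the test file it would be passed to vitest's `describe.runIf()`):

```typescript
// Hypothetical mirror of the describe.runIf() condition: the suite runs
// only when RUN_REGRESSION_GUARD is set to exactly "1".
function shouldRunRegressionGuard(
  env: Record<string, string | undefined>,
): boolean {
  return env.RUN_REGRESSION_GUARD === "1";
}

// In the test file this would gate the whole suite, e.g.:
//   describe.runIf(shouldRunRegressionGuard(process.env))("regression guard", () => { ... });

console.log(shouldRunRegressionGuard({ RUN_REGRESSION_GUARD: "1" })); // true
console.log(shouldRunRegressionGuard({ RUN_REGRESSION_GUARD: "" }));  // false
console.log(shouldRunRegressionGuard({}));                            // false
```

Because `describe.runIf(false)` marks the suite as skipped rather than failed, the default npm test run reports the 17 tests as skipped, exactly as the test plan describes.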

Configuration Changes: Appropriate

  • 469-line reduction in benchmark.yml eliminates duplicate measurement logic
  • 133-line addition to publish.yml adds the pre-publish gate without disrupting existing flows
  • Changes are focused and well-contained

Error Handling: Adequate

  • workflow_run.conclusion == 'success' prevents PR creation for failed publishes
  • Engine parity gate runs as intended soft signal (unchanged semantics)
  • Embedding benchmarks remain post-publish due to 2.5h runtime constraint
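A hypothetical shape for that `workflow_run` gate in benchmark.yml (workflow and artifact names are assumed; the PR's actual file is not reproduced here):

```yaml
# Sketch only — "Publish" and the artifact name are assumptions.
on:
  workflow_run:
    workflows: ["Publish"]
    types: [completed]

jobs:
  record-benchmarks:
    # No PR is opened for an aborted publish: only successful runs pass this gate.
    if: github.event.workflow_run.conclusion == 'success'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v8
        with:
          name: benchmark-files
          run-id: ${{ github.event.workflow_run.id }}
          github-token: ${{ secrets.GITHUB_TOKEN }}
```

Note that downloading an artifact from a different workflow run requires the `run-id` (and a token) on `download-artifact`, since artifacts are scoped per run.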

Minor Observations

  1. Embedding Constraint: The 2.5h embedding benchmark runtime legitimately cannot fit in the pre-publish path - this limitation is well-documented and the current post-publish approach is the correct tradeoff.

  2. Test Structure: The conditional test gating is clean and follows vitest patterns correctly. The approach prevents the confusing scenario where tests fail on already-merged benchmark data.

  3. Dependency Chain: The job dependency structure (preflight → compute-version/build-native → pre-publish-benchmark → publish) creates the intended gate without over-constraining the pipeline.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: This is a thoughtful solution to a real operational problem. The implementation is clean, well-tested, and addresses the root cause (timing) rather than symptoms. The only reason it's not 5-star is that it's infrastructure work tangential to the core value proposition - but it's excellent infrastructure work that materially improves release quality.

The PR demonstrates good engineering judgment in restructuring the benchmark pipeline to create an effective regression gate where one can actually block problematic releases.


@greptile-apps
Contributor

greptile-apps Bot commented Apr 30, 2026

Greptile Summary

This PR restructures the benchmark regression guard so it runs as a pre-publish gate in publish.yml rather than post-publish in benchmark.yml, correctly blocking bad releases before they reach npm. The three separate benchmark jobs in benchmark.yml are consolidated into a single record-benchmarks job that consumes pre-built artifacts from the publish run.

  • P1: The new pre-publish-benchmark job drops two test gates that previously blocked releases: tests/benchmarks/resolution/resolution-benchmark.test.ts (resolution accuracy thresholds) and tests/benchmarks/resolution/tracer/tracer-validation.test.ts (same-file edge recall). The Regression guard step only checks latency history files — it does not cover resolution accuracy, so an accuracy regression can now ship to npm silently.

Confidence Score: 4/5

Safe to merge after re-adding the omitted resolution and tracer test gates to pre-publish-benchmark

One P1 finding: the resolution threshold tests and tracer validation that previously gated releases are absent from the new pre-publish job, creating a hole where accuracy regressions can ship undetected. All other logic (regression guard gating publish, artifact hand-off, PR consolidation, untracked-file detection) is sound.

.github/workflows/publish.yml — the pre-publish-benchmark job needs the resolution and tracer test steps restored

Important Files Changed

Filename Overview
.github/workflows/publish.yml Adds pre-publish-benchmark job gating publish on regression guard, but omits the resolution threshold tests and tracer validation that previously blocked releases in the old build-benchmark job
.github/workflows/benchmark.yml Three measurement jobs consolidated into a single record-benchmarks job that downloads pre-built artifacts; correctly preserves untracked-file detection and engine-parity gate; workflow_dispatch removal is intentional
package.json Adds test:regression-guard script to run only the regression guard test file; straightforward addition
tests/benchmarks/regression-guard.test.ts Gates the entire describe block behind RUN_REGRESSION_GUARD=1 so default npm test skips it; clean, minimal change

Sequence Diagram

```mermaid
sequenceDiagram
    participant Release as Release Event
    participant BuildNative as build-native
    participant PreBench as pre-publish-benchmark
    participant Publish as publish (npm)
    participant BenchWF as benchmark.yml / record-benchmarks

    Release->>BuildNative: trigger
    BuildNative-->>PreBench: native artifact (linux-x64)
    PreBench->>PreBench: run build / query / incremental benchmarks
    PreBench->>PreBench: update history files
    PreBench->>PreBench: regression guard (RUN_REGRESSION_GUARD=1)
    alt regression detected
        PreBench-->>Publish: job fails — publish skipped
        Note over BenchWF: workflow_run conclusion != success → record-benchmarks skipped
    else no regression
        PreBench-->>Publish: upload benchmark-files + benchmark-results-json artifacts
        Publish->>Publish: npm publish
        Publish-->>BenchWF: workflow_run conclusion=success
        BenchWF->>BenchWF: download benchmark-files artifact
        BenchWF->>BenchWF: git diff / ls-files check
        BenchWF->>BenchWF: open single combined PR to main
        BenchWF->>BenchWF: engine-parity gate (soft signal)
    end
```

Comments Outside Diff (1)

  1. .github/workflows/publish.yml, lines 602-607

    P1 Resolution threshold gate and tracer validation silently dropped

    The old build-benchmark job ran two explicit test gates after collecting resolution data:

    1. npx vitest run tests/benchmarks/resolution/resolution-benchmark.test.ts — gates on per-language resolution accuracy thresholds
    2. npx vitest run tests/benchmarks/resolution/tracer/tracer-validation.test.ts — validates same-file edge recall

    The new pre-publish-benchmark job runs scripts/resolution-benchmark.ts to collect data and merges it into benchmark-result.json, but neither test step is present. The Regression guard step only checks history-file comparison (build/query/incremental latency) — it does not cover resolution accuracy or tracer recall. A resolution accuracy regression can now ship to npm with no gate catching it.
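To make the scope concrete: a latency-history guard of the kind described reduces to a threshold comparison like the hypothetical check below. Nothing in such a check examines resolution accuracy or tracer recall, which is why the separate test gates matter. The function name and 10% tolerance are invented for illustration; the real thresholds live in tests/benchmarks/regression-guard.test.ts:

```typescript
// Illustrative only — not the repo's actual guard logic or thresholds.
// Flags a regression when the new latency exceeds the baseline by more
// than the given fractional tolerance (default 10%).
function isRegression(
  baselineMs: number,
  currentMs: number,
  tolerance = 0.1,
): boolean {
  return currentMs > baselineMs * (1 + tolerance);
}

console.log(isRegression(100, 105)); // false — within tolerance
console.log(isRegression(100, 120)); // true  — regression
```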


Comment on lines +251 to +253
```yaml
- uses: actions/setup-node@v6
  with:
    node-version: "22"
```
Contributor


P2 Missing npm cache in pre-publish-benchmark

actions/setup-node@v6 is configured without cache: "npm", so every run re-downloads all npm dependencies from scratch. All other jobs in the workflow that call npm install use cache: "npm". For a release-gating job this adds unnecessary minutes to every stable release.

Suggested change

```diff
 - uses: actions/setup-node@v6
   with:
     node-version: "22"
+    cache: "npm"
```


Contributor Author


Fixed in 1d50950 — added cache: "npm" to the actions/setup-node@v6 step in pre-publish-benchmark, matching the convention used by every other job in this workflow.

Comment on lines +336 to +357
```yaml
  uses: actions/upload-artifact@v7
  with:
    name: benchmark-files
    path: |
      generated/benchmarks/BUILD-BENCHMARKS.md
      generated/benchmarks/QUERY-BENCHMARKS.md
      generated/benchmarks/INCREMENTAL-BENCHMARKS.md
      README.md
    if-no-files-found: error

# Raw JSON used by post-publish soft-signal jobs (e.g. engine-parity
# gate in the Benchmark workflow). Separated from the history-files
# artifact because consumers read different shapes.
- name: Upload benchmark JSON results
  uses: actions/upload-artifact@v7
  with:
    name: benchmark-results-json
    path: |
      benchmark-result.json
      query-benchmark-result.json
      incremental-benchmark-result.json
    if-no-files-found: error
```
Contributor


P2 upload-artifact@v7 / download-artifact@v8 version mismatch

publish.yml uploads with actions/upload-artifact@v7 while benchmark.yml downloads with actions/download-artifact@v8. Both are past the v4 threshold that switched to the new artifact service backend, so they are functionally compatible today, but pinning them to the same major version avoids any future surprise if a protocol change lands in v8 upload before v7 catches up. Consider aligning both to @v8 (or whichever is current).


Contributor Author


Not actionable as proposed — actions/upload-artifact and actions/download-artifact are versioned independently and have always had different majors. The current latest released majors are upload-artifact@v7 (v7.0.1) and download-artifact@v8 (v8.0.1); there is no upload-artifact@v8. So the existing pinning is the correct latest-stable pairing and is also consistent with every other workflow in this repo (ci.yml, build-native.yml, benchmark.yml, shield-license-compliance.yml). Leaving as-is.

carlos-alm added a commit that referenced this pull request May 1, 2026

Match the cache: "npm" setting used by every other job in publish.yml so release runs reuse the npm cache instead of re-downloading dependencies.
@carlos-alm
Contributor Author

@greptileai

Comment on lines 64 to 71

```yaml
- name: Check for changes
  if: steps.existing.outputs.skip != 'true'
  id: changes
  run: |
    CHANGED=false
    # Detect modified tracked files
    if ! git diff --quiet HEAD -- generated/benchmarks/BUILD-BENCHMARKS.md README.md 2>/dev/null; then
      CHANGED=true
    fi
    # Detect newly created (untracked) files
    if [ -n "$(git ls-files --others --exclude-standard generated/benchmarks/BUILD-BENCHMARKS.md)" ]; then
      if ! git diff --quiet HEAD -- generated/benchmarks/ README.md 2>/dev/null; then
        CHANGED=true
      fi
    fi
    echo "changed=$CHANGED" >> "$GITHUB_OUTPUT"
```
Contributor


P1 Untracked new files not detected by git diff --quiet HEAD

The old build-benchmark, query-benchmark, and incremental-benchmark jobs each paired git diff --quiet HEAD with an explicit git ls-files --others --exclude-standard check to catch newly-created (untracked) history files. The new consolidated check dropped that second guard:

```shell
# old pattern, in each of the three removed jobs:
if [ -n "$(git ls-files --others --exclude-standard generated/benchmarks/BUILD-BENCHMARKS.md)" ]; then
  CHANGED=true
fi
```

If a benchmark history file does not yet exist in the repo (first-run for that benchmark type, or an intentionally deleted file being re-created), the download will produce an untracked file that git diff HEAD silently ignores, CHANGED stays false, and the PR step is skipped — the verified numbers never make it back to main.

Suggested change

```diff
 - name: Check for changes
-  if: steps.existing.outputs.skip != 'true'
   id: changes
   run: |
     CHANGED=false
-    # Detect modified tracked files
-    if ! git diff --quiet HEAD -- generated/benchmarks/BUILD-BENCHMARKS.md README.md 2>/dev/null; then
-      CHANGED=true
-    fi
-    # Detect newly created (untracked) files
-    if [ -n "$(git ls-files --others --exclude-standard generated/benchmarks/BUILD-BENCHMARKS.md)" ]; then
-      if ! git diff --quiet HEAD -- generated/benchmarks/ README.md 2>/dev/null; then
-        CHANGED=true
-      fi
-    fi
+    if ! git diff --quiet HEAD -- generated/benchmarks/ README.md 2>/dev/null; then
+      CHANGED=true
+    fi
+    if [ -n "$(git ls-files --others --exclude-standard generated/benchmarks/)" ]; then
+      CHANGED=true
+    fi
     echo "changed=$CHANGED" >> "$GITHUB_OUTPUT"
```


Contributor Author


Fixed in b0389af — restored the git ls-files --others --exclude-standard generated/benchmarks/ guard alongside the git diff --quiet HEAD check, so first-run/re-created history files are still committed back to main.

carlos-alm and others added 3 commits April 30, 2026 23:41
The regression guard previously ran inside `npm test` against the existing
benchmark history files. Because those files are only updated post-publish
(by the Benchmark workflow's auto-PRs), a regression introduced in vX could
ship to npm before its numbers were ever recorded — and the guard would
then fire on every dev push to main once the docs PR landed, blocking
unrelated work without ever having prevented the bad release.

Restructure so the gate runs where it can actually block:

- New `pre-publish-benchmark` job in publish.yml (release events only):
  measures the just-built native artifact, writes new history entries,
  runs the regression guard, and uploads the modified files. The `publish`
  job depends on it, so a regression fails the workflow before npm sees
  the new version. The history files are uploaded as an artifact.

- benchmark.yml's three measurement jobs (build/query/incremental) are
  replaced by a single `record-benchmarks` job that downloads the
  pre-publish artifact and opens one PR with the verified numbers.
  `workflow_run.conclusion == 'success'` already gates this, so no PR
  is opened for an aborted publish. The engine-parity gate runs here
  as a soft signal (unchanged semantics). The embedding-benchmark job
  is unchanged — no regression guard, can't fit in pre-publish.

- The regression-guard test is gated on `RUN_REGRESSION_GUARD=1` so the
  default `npm test` run shows it as skipped rather than failing on
  history that already passed gating at release time. CI sets the env
  var in the pre-publish step.

Match the cache: "npm" setting used by every other job in publish.yml so
release runs reuse the npm cache instead of re-downloading dependencies.

Each of the three pre-consolidation jobs paired `git diff --quiet HEAD`
with `git ls-files --others --exclude-standard` so a first-run history
file (or one re-created after deletion) would still be picked up. The
consolidated check dropped that second guard, which would silently skip
the PR step if a benchmark history file was untracked. Restore the
guard so verified numbers always make it back to main.
@carlos-alm carlos-alm force-pushed the ci/prepublish-bench-gate branch from 04d86c1 to b0389af on May 1, 2026 05:42
@carlos-alm
Contributor Author

Addressed Greptile's review feedback in b0389af + the prior 1fc92b2:

  • P1 untracked-file detection (benchmark.yml): restored the git ls-files --others --exclude-standard generated/benchmarks/ guard alongside the git diff --quiet HEAD check. First-run/re-created history files will once again trigger the PR step.
  • Minor — npm install flags (publish.yml): not changing. The "every other npm install passes those flags" framing isn't accurate for this workflow — publish.yml has bare npm install calls at lines 50 (build-native), 376 (publish-platform-packages), and 555 (publish). The new pre-publish-benchmark step at line 273 matches the existing pattern in this file. Worth a separate PR if we want to standardize, but out of scope here.
  • Commit-message format: force-pushed to fix a body line in the prior commit that exceeded commitlint's 100-char limit. Same content, just wrapped.

@carlos-alm carlos-alm merged commit 20b8596 into main May 1, 2026
26 checks passed
@carlos-alm carlos-alm deleted the ci/prepublish-bench-gate branch May 1, 2026 06:22
@github-actions github-actions Bot locked and limited conversation to collaborators May 1, 2026