Merged
7 changes: 7 additions & 0 deletions .github/workflows/benchmark.yml
@@ -198,6 +198,13 @@ jobs:
--body "Automated build benchmark update for **${VERSION}** from workflow run [#${{ github.run_number }}](${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }})."
fi

# Engine-parity gate: runs AFTER the doc PR is created so the PR still
# records raw benchmark data even when parity regresses. The job status
# going red alerts maintainers; the linked issues describe each threshold.
- name: Engine parity gate
if: steps.existing.outputs.skip != 'true'
run: node scripts/benchmark-parity-gate.mjs benchmark-result.json

embedding-benchmark:
runs-on: ubuntu-latest
# 7 models x 30 min each = 210 min worst-case; symbols are sampled to 1500 so
121 changes: 121 additions & 0 deletions scripts/benchmark-parity-gate.mjs
@@ -0,0 +1,121 @@
#!/usr/bin/env node
/**
* Engine parity gate — runs after the release build benchmark.
*
* Reads the merged benchmark-result.json (contains `wasm` and `native` blocks)
* and fails the workflow if the gap between engines breaches a documented
* threshold. A failure here doesn't block the release (benchmark runs *after*
* Publish completes); it surfaces regressions to maintainers via the workflow's
* red status and writes a summary to $GITHUB_STEP_SUMMARY.
*
* Thresholds reference the parity bugs open against v3.9.5:
* - #1010 DB size / excess ast_nodes
* - #1011 Native orchestrator drops files
* - #1012 Native 1-file incremental runs globally
* - #1013 Native full-build edges/roles phases
*
* Each threshold fires only when BOTH engines produced results. If one engine
* failed, we leave the gate passing so the rest of the workflow (doc PR,
* artifact upload) still runs, and a separate "both engines ran" check flags
* the missing engine.
*/
import fs from 'node:fs';
import path from 'node:path';

const resultFile = process.argv[2];
if (!resultFile) {
console.error('Usage: benchmark-parity-gate.mjs <benchmark-result.json>');
process.exit(2);
}

const result = JSON.parse(fs.readFileSync(resultFile, 'utf8'));
Contributor:
P2 readFileSync throws an opaque stack trace on missing file

If benchmark-result.json doesn't exist (e.g., the build benchmark step was skipped by a prior failure), Node throws a raw ENOENT stack trace. Wrapping this in a try/catch gives maintainers a cleaner diagnostic message that matches the script's own error-reporting style.

Suggested change
}
const result = JSON.parse(fs.readFileSync(resultFile, 'utf8'));
let result;
try {
result = JSON.parse(fs.readFileSync(resultFile, 'utf8'));
} catch (err) {
console.error(`Failed to read ${resultFile}: ${err.message}`);
process.exit(2);
}
const { wasm, native, version } = result;
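For reference, a minimal standalone sketch of the failure mode the suggestion addresses (the helper name and file path here are hypothetical, not part of the script):

```javascript
import fs from 'node:fs';

// Without a try/catch, a missing file surfaces as a raw ENOENT stack trace.
// Wrapped, it becomes a single-line diagnostic in the script's own style.
function readResult(file) {
  try {
    return JSON.parse(fs.readFileSync(file, 'utf8'));
  } catch (err) {
    console.error(`Failed to read ${file}: ${err.message}`);
    return null; // the real script would process.exit(2) here instead
  }
}

console.log(readResult('no-such-benchmark-result.json')); // null, after a clean one-line diagnostic on stderr
```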

Contributor Author:
Fixed in b789411: readFileSync/JSON.parse are now wrapped in try/catch, and the script exits 2 with a clean "Failed to read <file>: <err.message>" diagnostic instead of a raw ENOENT stack trace.

const { wasm, native, version } = result;

const summaryFile = process.env.GITHUB_STEP_SUMMARY;
const writeSummary = (text) => {
if (summaryFile) fs.appendFileSync(summaryFile, text);
};

function line(s = '') {
console.log(s);
writeSummary(`${s}\n`);
}

line(`## Engine parity gate — v${version}`);
line('');

if (!wasm || !native) {
const missing = [!wasm && 'wasm', !native && 'native'].filter(Boolean).join(', ');
line(`**FAIL:** missing engine result for: ${missing}. Benchmark cannot assert parity.`);
process.exit(1);
Contributor:
P1 Missing-engine case exits 1, contradicting the inline comment

The script-level JSDoc on line 19 explicitly states: "If one engine failed, we leave the gate passing…" but process.exit(1) on line 50 does the opposite — it fails the gate. Because the gate runs after the doc PR step, a red job here won't block any data from being recorded, but it will fire a false alarm in every run where only one engine finishes (e.g., during the window right before #1009 is fully deployed).

Suggested change
line(`## Engine parity gate — v${version}`);
line('');
if (!wasm || !native) {
const missing = [!wasm && 'wasm', !native && 'native'].filter(Boolean).join(', ');
line(`**FAIL:** missing engine result for: ${missing}. Benchmark cannot assert parity.`);
process.exit(1);
if (!wasm || !native) {
const missing = [!wasm && 'wasm', !native && 'native'].filter(Boolean).join(', ');
line(`**SKIP:** missing engine result for: ${missing}. Cannot assert parity — gate passes.`);
process.exit(0);
}

Contributor Author:
Fixed in b789411 — missing-engine branch now logs SKIP and exits 0, matching the documented pass-through intent in the file-level JSDoc.

}

// ── Thresholds ─────────────────────────────────────────────────────────
// Each entry:
// name — human-readable label
// actual — computed metric
// limit — ceiling; actual must be ≤ limit
// formatter — how to render the value
// tracks — related issue link shown on failure
const checks = [
{
Comment on lines +62 to +66

Contributor:
P2 File-set gap check doesn't guard against undefined fields

If either wasm.files or native.files is undefined (e.g., a partial result), Math.abs(undefined - undefined) produces NaN. NaN <= 2 is false, so the check silently fails as a false positive with no indication in the summary of why. Worth adding a ?? 0 fallback consistent with how the timing checks handle missing data.

Suggested change
// limit — ceiling; actual must be ≤ limit
// formatter — how to render the value
// tracks — related issue link shown on failure
const checks = [
{
{
name: 'File-set gap (|wasm − native|)',
actual: Math.abs((wasm.files ?? 0) - (native.files ?? 0)),
limit: 2,
formatter: (v) => String(v),
tracks: '#1011',
},
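The NaN behaviour the comment describes can be seen in a few lines (a standalone sketch, not the script itself; the sample values are made up):

```javascript
// Arithmetic on undefined yields NaN, and every comparison against NaN is
// false, so an unguarded `NaN <= 2` check reads as "over limit" with no
// explanation. The ?? 0 fallback keeps the metric a real, reportable number.
const partial = {};             // e.g. a partial result with no `files` field
const full = { files: 1500 };

const unguarded = Math.abs(partial.files - full.files);             // NaN
const guarded = Math.abs((partial.files ?? 0) - (full.files ?? 0)); // 1500

console.log(Number.isNaN(unguarded)); // true
console.log(unguarded <= 2);          // false: gate fails with no real signal
console.log(guarded);                 // 1500: still fails, but visibly and explainably
```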

Contributor Author:
Fixed in b789411 — file-set gap now uses Math.abs((wasm.files ?? 0) - (native.files ?? 0)) so undefined fields no longer collapse to NaN.

name: 'File-set gap (|wasm − native|)',
actual: Math.abs(wasm.files - native.files),
limit: 2,
formatter: (v) => String(v),
tracks: '#1011',
Comment on lines +67 to +71

Contributor:
P1 wasm.dbSizeBytes has no zero-guard, unlike every timing check

All four timing-based ratio checks defensively use Math.max(... ?? 1, 1) as the denominator. The DB-size ratio does not. If wasm.dbSizeBytes is 0 (empty DB from a partial wasm run), the result is Infinity, which fails the <= 1.02 threshold — a false positive. If it is undefined, the result is NaN, which also fails.

Suggested change
name: 'File-set gap (|wasm − native|)',
actual: Math.abs(wasm.files - native.files),
limit: 2,
formatter: (v) => String(v),
tracks: '#1011',
{
name: 'DB size ratio (native / wasm)',
actual: native.dbSizeBytes / Math.max(wasm.dbSizeBytes ?? 1, 1),
limit: 1.02,
formatter: (v) => v.toFixed(3),
tracks: '#1010',
},
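Both failure modes, and how the suggested guard neutralizes them, can be demonstrated in isolation (a standalone sketch with made-up byte counts):

```javascript
// A zero denominator yields Infinity; an undefined one yields NaN. Both fail
// a `<= 1.02` threshold opaquely. The `?? 1` handles undefined and the
// Math.max(..., 1) handles zero, matching the timing checks' pattern.
const wasmZero = { dbSizeBytes: 0 };     // empty DB from a partial wasm run
const wasmMissing = {};                  // field absent entirely
const nativeRes = { dbSizeBytes: 50_000_000 };

console.log(nativeRes.dbSizeBytes / wasmZero.dbSizeBytes);    // Infinity
console.log(nativeRes.dbSizeBytes / wasmMissing.dbSizeBytes); // NaN

const denom = (w) => Math.max(w.dbSizeBytes ?? 1, 1);
console.log(nativeRes.dbSizeBytes / denom(wasmZero));    // 50000000: huge but finite
console.log(nativeRes.dbSizeBytes / denom(wasmMissing)); // 50000000: renderable by toFixed(3)
```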

Contributor Author:
Fixed in b789411 — DB size ratio now uses (native.dbSizeBytes ?? 0) / Math.max(wasm.dbSizeBytes ?? 1, 1), matching the zero-guard pattern used by all four timing checks. Verified locally with wasm.dbSizeBytes = 0 — no more Infinity.

},
{
name: 'DB size ratio (native / wasm)',
actual: native.dbSizeBytes / wasm.dbSizeBytes,
limit: 1.02,
formatter: (v) => v.toFixed(3),
tracks: '#1010',
},
{
name: 'Full-build edges-phase ratio',
actual: (native.phases?.edgesMs ?? 0) / Math.max(wasm.phases?.edgesMs ?? 1, 1),
limit: 1.3,
formatter: (v) => v.toFixed(2),
tracks: '#1013',
},
{
name: 'Full-build roles-phase ratio',
actual: (native.phases?.rolesMs ?? 0) / Math.max(wasm.phases?.rolesMs ?? 1, 1),
limit: 1.3,
formatter: (v) => v.toFixed(2),
tracks: '#1013',
},
{
name: '1-file incremental ratio',
actual:
(native.oneFileRebuildMs ?? 0) /
Math.max(wasm.oneFileRebuildMs ?? 1, 1),
limit: 1.5,
formatter: (v) => v.toFixed(2),
tracks: '#1012',
},
];

line('| Check | Actual | Limit | Status | Tracks |');
line('|---|---:|---:|---|---|');

let failed = 0;
for (const c of checks) {
const ok = c.actual <= c.limit;
if (!ok) failed++;
const status = ok ? ':white_check_mark: pass' : ':x: **fail**';
line(
`| ${c.name} | ${c.formatter(c.actual)} | ${c.formatter(c.limit)} | ${status} | ${c.tracks} |`,
);
}

line('');
if (failed > 0) {
line(
`**${failed} parity check(s) failed.** See linked issues for root-cause tracking; the benchmark doc PR (if opened) captures the raw numbers.`,
);
process.exit(1);
}

line('All parity checks passed.');