Lock your AI app's behavior — golden datasets, LLM-as-judge, and structural assertions in CI.
- Why Goldset?
- Install
- Quickstart
- Three runners
- GitHub Action
- API reference
- Architecture
- Setup guide
- Contributing
Most AI eval tools are dashboards with no CI integration. Goldset is the opposite:
- Evals as code: they sit next to your app in the same repo
- GitHub Action native: PR diff comments, merge-blocking on regression
- Provider agnostic: plug in any llm: (input) => Promise<string>
- Three orthogonal runners: each catches a different failure mode
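
The provider-agnostic contract above is an async function from an input string to an output string. The following is a minimal sketch assuming nothing beyond that signature; `LLM`, `cannedLLM`, and `withRetry` are hypothetical names used for illustration and are not part of Goldset's API.

```typescript
// Hypothetical alias for the signature quoted above.
type LLM = (input: string) => Promise<string>;

// A deterministic stub: useful for exercising evals without network calls.
function cannedLLM(answers: Record<string, string>, fallback = ''): LLM {
  return async (input) => answers[input] ?? fallback;
}

// Any provider SDK can be adapted by wrapping it in the same shape.
// This wrapper retries a flaky LLM a fixed number of times.
function withRetry(llm: LLM, attempts = 3): LLM {
  return async (input) => {
    let lastError: unknown;
    for (let i = 0; i < attempts; i++) {
      try {
        return await llm(input);
      } catch (err) {
        lastError = err; // remember the failure and try again
      }
    }
    throw lastError;
  };
}

const myLLM = withRetry(cannedLLM({ 'How do I get a refund?': 'Email support.' }));
```

Because wrappers keep the same signature, they compose freely, and a stub like this can stand in for a real provider while the eval files are being written.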
| Runner | What it catches | Best for |
|---|---|---|
| goldenDataset | Output drifted from canonical answer | FAQ, refusal correctness, deterministic Q&A |
| llmJudge | Behavior regression on open-ended outputs | Tone, helpfulness, brand voice, language matching |
| structural | Output shape broke | Function calling, structured generation, JSON schema |
npm install @ykstormsorg/goldset

Peer dependency (not installed automatically):

npm install tsx # for running .eval.ts files directly

// evals/customer-support.eval.ts
import { goldenDataset, llmJudge, structural, runEval } from '@ykstormsorg/goldset'
import { myLLM, myJudge } from '../src/llm'
const golden = await goldenDataset(
  [
    { id: 'refund-q', input: 'How do I get a refund?', expected: 'Email support@...' },
    { id: 'shipping-q', input: 'Where is my order?', expected: 'Track at track.example.com/...' },
  ],
  { llm: myLLM, threshold: 0.85 }
)
const judged = await llmJudge(
  [
    { id: 'hindi-q', input: 'Bopal mein 2BHK?', expected: 'Hindi response' },
    { id: 'french-q', input: 'Comment ça marche?', expected: 'French response' },
  ],
  {
    llm: myLLM,
    judge: myJudge,
    rubric: 'Score 5 if response is in the same language as the input. 0 if not.',
    passThreshold: 3,
  }
)
const shape = await structural(
  [{ id: 'tool-q', input: 'lookup order #42' }],
  {
    llm: myLLM,
    assertions: [
      { type: 'json-schema', schema: { type: 'object', properties: { orderId: { type: 'string' } } } },
      { type: 'tool-call-shape', toolName: 'lookupOrder', argCount: 1 },
    ],
  }
)
// runEval prints a human summary (or JSON with `--output json`, as the
// GitHub Action does) and exits non-zero if any runner failed.
await runEval(golden, judged, shape)

npx tsx evals/customer-support.eval.ts

Output:
✓ goldenDataset: 2/2 passed
✓ llmJudge: 2/2 passed
✓ structural: 1/1 passed
# .github/workflows/eval.yml
name: Goldset Eval
on:
  pull_request:
    branches: [main]
permissions:
  contents: read
  pull-requests: write
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - run: npm ci
      - uses: ykstorm/goldset@v1
        with:
          eval-dir: evals
          judge-provider: none # or openai | anthropic
          fail-on-regression: true
          comment-on-pr: true
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          # OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }} # if judge-provider: openai

The Action runs every *.eval.ts under eval-dir with npx tsx <file> --output json,
writes a combined goldset-results.json, posts (or updates) a PR comment with a
results table and a delta-vs-base section, and fails the check if any eval
fails or regresses against the base branch, which gates the merge.
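
The fail-or-regress decision described above can be sketched as a comparison between base-branch and PR results. This is an illustration only: the `RunnerResult` shape, the `compare` function, and the choice of "pass rate dropped" as the definition of a regression are assumptions, and the real goldset-results.json schema and the Action's logic may differ.

```typescript
// Hypothetical shape; the real goldset-results.json schema may differ.
interface RunnerResult {
  name: string;
  passed: number;
  total: number;
}

interface Verdict {
  failures: string[];    // runners with at least one failing case on the PR
  regressions: string[]; // runners whose pass rate dropped vs the base branch
  ok: boolean;           // false means the check should fail
}

function passRate(r: RunnerResult): number {
  return r.total === 0 ? 1 : r.passed / r.total;
}

function compare(base: RunnerResult[], head: RunnerResult[]): Verdict {
  const baseByName = new Map(base.map((r) => [r.name, r] as const));
  const failures = head.filter((r) => r.passed < r.total).map((r) => r.name);
  const regressions = head
    .filter((r) => {
      const b = baseByName.get(r.name);
      // A runner that is new on the PR has no baseline, so it cannot regress.
      return b !== undefined && passRate(r) < passRate(b);
    })
    .map((r) => r.name);
  return { failures, regressions, ok: failures.length === 0 && regressions.length === 0 };
}

const verdict = compare(
  [{ name: 'goldenDataset', passed: 2, total: 2 }, { name: 'llmJudge', passed: 2, total: 2 }],
  [{ name: 'goldenDataset', passed: 2, total: 2 }, { name: 'llmJudge', passed: 1, total: 2 }],
);
console.log(verdict);
```

Under this simplified model every regression is also a failure, so the separate regressions list matters mainly for what a delta section would report.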
| Input | Default | Description |
|---|---|---|
| eval-dir | evals | Directory containing *.eval.ts files |
| judge-provider | none | openai \| anthropic \| none. Exposes the matching key (already on the runner env) to your eval as GOLDSET_JUDGE_PROVIDER |
| fail-on-regression | true | Fail the check if any eval regresses vs the base branch |
| comment-on-pr | true | Post/update a results + delta comment on the PR |
Outputs: results-path, passed, failed, total, all-passed.
| Runner | Function | Catches |
|---|---|---|
| Golden dataset | goldenDataset(cases, { llm, threshold }) | Output drifted from the canonical answer (Levenshtein similarity vs a threshold) |
| LLM-as-judge | llmJudge(cases, { llm, judge, rubric }) | Behavior regression on open-ended outputs (a second LLM scores against a rubric) |
| Structural | structural(cases, { llm, assertions }) | Output shape broke (JSON schema, regex, substring, tool-call shape) |
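
To make "Levenshtein similarity vs a threshold" concrete, here is a minimal sketch of one common way to compute it. The normalization (1 minus edit distance divided by the longer string's length) is an assumption, as are the helper names `levenshtein`, `similarity`, and `passes`; Goldset's own scoring may normalize differently.

```typescript
// Classic dynamic-programming edit distance, keeping only two rows.
// Compares UTF-16 code units, which is adequate for a sketch.
function levenshtein(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost, // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed normalization: 1 means identical, 0 means nothing in common.
function similarity(actual: string, expected: string): number {
  const maxLen = Math.max(actual.length, expected.length);
  return maxLen === 0 ? 1 : 1 - levenshtein(actual, expected) / maxLen;
}

// A case passes when the similarity reaches the threshold.
function passes(actual: string, expected: string, threshold: number): boolean {
  return similarity(actual, expected) >= threshold;
}

console.log(levenshtein('kitten', 'sitting')); // 3
console.log(passes('Track at track.example.com!', 'Track at track.example.com', 0.85)); // true
```

A threshold such as 0.85 tolerates small wording or punctuation changes while still flagging answers that diverge substantially from the canonical text.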
See the full API reference.
Apache-2.0