Skip to content

ykstorm/goldset

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

20 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Goldset

Lock your AI app's behavior — golden datasets, LLM-as-judge, and structural assertions in CI.

npm CI License Docs


Contents


Why Goldset?

Most AI eval tools are dashboards with no CI integration. Goldset is the opposite:

  • Evals as code — sit next to your app in the same repo
  • GitHub Action native — PR diff comments, merge-blocking on regression
  • Provider agnostic — plug in any llm: (input) => Promise<string>
  • Three orthogonal runners — catch three completely different failure modes
Runner What it catches Best for
goldenDataset Output drifted from canonical answer FAQ, refusal correctness, deterministic Q&A
llmJudge Behavior regression on open-ended outputs Tone, helpfulness, brand voice, language matching
structural Output shape broke Function calling, structured generation, JSON schema

Install

npm install @ykstormsorg/goldset

Peer dependency (not installed automatically):

npm install tsx   # for running .eval.ts files directly

Quickstart

1. Create an eval file

// evals/customer-support.eval.ts
import { goldenDataset, llmJudge, structural, runEval } from '@ykstormsorg/goldset'
import { myLLM, myJudge } from '../src/llm'

const golden = await goldenDataset(
  [
    { id: 'refund-q', input: 'How do I get a refund?', expected: 'Email support@...' },
    { id: 'shipping-q', input: 'Where is my order?', expected: 'Track at track.example.com/...' },
  ],
  { llm: myLLM, threshold: 0.85 }
)

const judged = await llmJudge(
  [
    { id: 'hindi-q', input: 'Bopal mein 2BHK?', expected: 'Hindi response' },
    { id: 'french-q', input: 'Comment ça marche?', expected: 'French response' },
  ],
  {
    llm: myLLM,
    judge: myJudge,
    rubric: 'Score 5 if response is in the same language as the input. 0 if not.',
    passThreshold: 3,
  }
)

const shape = await structural(
  [{ id: 'tool-q', input: 'lookup order #42' }],
  {
    llm: myLLM,
    assertions: [
      { type: 'json-schema', schema: { type: 'object', properties: { orderId: { type: 'string' } } } },
      { type: 'tool-call-shape', toolName: 'lookupOrder', argCount: 1 },
    ],
  }
)

// runEval prints a human summary (or JSON with `--output json`, as the
// GitHub Action does) and exits non-zero if any runner failed.
await runEval(golden, judged, shape)

2. Run locally

npx tsx evals/customer-support.eval.ts

Output:

✓ goldenDataset: 2/2 passed
✓ llmJudge: 2/2 passed
✓ structural: 1/1 passed

3. Add to CI

# .github/workflows/eval.yml
name: Goldset Eval

on:
  pull_request:
    branches: [main]

permissions:
  contents: read
  pull-requests: write

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - run: npm ci
      - uses: ykstorm/goldset@v1
        with:
          eval-dir: evals
          judge-provider: none   # or openai | anthropic
          fail-on-regression: true
          comment-on-pr: true
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          # OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}   # if judge-provider: openai

The Action runs every *.eval.ts under eval-dir with npx tsx <file> --output json, writes a combined goldset-results.json, posts (or updates) a PR comment with a results table and a delta-vs-base section, and fails the check if any eval fails or regresses against the base branch — gating the merge.

GitHub Action

Input Default Description
eval-dir evals Directory containing *.eval.ts files
judge-provider none openai | anthropic | none. Exposes the matching key (already on the runner env) to your eval as GOLDSET_JUDGE_PROVIDER
fail-on-regression true Fail the check if any eval regresses vs the base branch
comment-on-pr true Post/update a results + delta comment on the PR

Outputs: results-path, passed, failed, total, all-passed.


Three runners

Runner Function Catches
Golden dataset goldenDataset(cases, { llm, threshold }) Output drifted from the canonical answer (Levenshtein similarity vs a threshold)
LLM-as-judge llmJudge(cases, { llm, judge, rubric }) Behavior regression on open-ended outputs (a second LLM scores against a rubric)
Structural structural(cases, { llm, assertions }) Output shape broke (JSON schema, regex, substring, tool-call shape)

See the full API reference.

License

Apache-2.0

Packages

 
 
 

Contributors