Goldset

Lock your AI app's behavior — golden datasets, LLM-as-judge, and structural assertions in CI.

Why Goldset?

Most AI eval tools are dashboards with no CI integration. Goldset is the opposite:

Evals as code — sit next to your app in the same repo
GitHub Action native — PR diff comments, merge-blocking on regression
Provider agnostic — plug in any llm: (input) => Promise<string>
Three orthogonal runners — catch three completely different failure modes

Runner	What it catches	Best for
`goldenDataset`	Output drifted from canonical answer	FAQ, refusal correctness, deterministic Q&A
`llmJudge`	Behavior regression on open-ended outputs	Tone, helpfulness, brand voice, language matching
`structural`	Output shape broke	Function calling, structured generation, JSON schema

Install

npm install @ykstormsorg/goldset

Peer dependency (not installed automatically):

npm install tsx   # for running .eval.ts files directly

Quickstart

1. Create an eval file

// evals/customer-support.eval.ts
import { goldenDataset, llmJudge, structural, runEval } from '@ykstormsorg/goldset'
import { myLLM, myJudge } from '../src/llm'

const golden = await goldenDataset(
  [
    { id: 'refund-q', input: 'How do I get a refund?', expected: 'Email support@...' },
    { id: 'shipping-q', input: 'Where is my order?', expected: 'Track at track.example.com/...' },
  ],
  { llm: myLLM, threshold: 0.85 }
)

const judged = await llmJudge(
  [
    { id: 'hindi-q', input: 'Bopal mein 2BHK?', expected: 'Hindi response' },
    { id: 'french-q', input: 'Comment ça marche?', expected: 'French response' },
  ],
  {
    llm: myLLM,
    judge: myJudge,
    rubric: 'Score 5 if response is in the same language as the input. 0 if not.',
    passThreshold: 3,
  }
)

const shape = await structural(
  [{ id: 'tool-q', input: 'lookup order #42' }],
  {
    llm: myLLM,
    assertions: [
      { type: 'json-schema', schema: { type: 'object', properties: { orderId: { type: 'string' } } } },
      { type: 'tool-call-shape', toolName: 'lookupOrder', argCount: 1 },
    ],
  }
)

// runEval prints a human summary (or JSON with `--output json`, as the
// GitHub Action does) and exits non-zero if any runner failed.
await runEval(golden, judged, shape)

2. Run locally

npx tsx evals/customer-support.eval.ts

Output:

✓ goldenDataset: 2/2 passed
✓ llmJudge: 2/2 passed
✓ structural: 1/1 passed

3. Add to CI

# .github/workflows/eval.yml
name: Goldset Eval

on:
  pull_request:
    branches: [main]

permissions:
  contents: read
  pull-requests: write

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - run: npm ci
      - uses: ykstorm/goldset@v1
        with:
          eval-dir: evals
          judge-provider: none   # or openai | anthropic
          fail-on-regression: true
          comment-on-pr: true
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          # OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}   # if judge-provider: openai

The Action runs every *.eval.ts under eval-dir with npx tsx <file> --output json, writes a combined goldset-results.json, posts (or updates) a PR comment with a results table and a delta-vs-base section, and fails the check if any eval fails or regresses against the base branch — gating the merge.

GitHub Action

Input	Default	Description
`eval-dir`	`evals`	Directory containing `*.eval.ts` files
`judge-provider`	`none`	`openai` \| `anthropic` \| `none`. Exposes the matching key (already on the runner env) to your eval as `GOLDSET_JUDGE_PROVIDER`
`fail-on-regression`	`true`	Fail the check if any eval regresses vs the base branch
`comment-on-pr`	`true`	Post/update a results + delta comment on the PR

Outputs: results-path, passed, failed, total, all-passed.

Three runners

Runner	Function	Catches
Golden dataset	`goldenDataset(cases, { llm, threshold })`	Output drifted from the canonical answer (Levenshtein similarity vs a threshold)
LLM-as-judge	`llmJudge(cases, { llm, judge, rubric })`	Behavior regression on open-ended outputs (a second LLM scores against a rubric)
Structural	`structural(cases, { llm, assertions })`	Output shape broke (JSON schema, regex, substring, tool-call shape)

See the full API reference.

License

Apache-2.0

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
.github		.github
action		action
dist		dist
docs		docs
evals		evals
examples/customer-support		examples/customer-support
fixtures/evals		fixtures/evals
src		src
tests		tests
.eslintrc.json		.eslintrc.json
.gitignore		.gitignore
ARCHITECTURE.md		ARCHITECTURE.md
CHANGELOG.md		CHANGELOG.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
DEPLOY.md		DEPLOY.md
LICENSE		LICENSE
README.md		README.md
ROADMAP.md		ROADMAP.md
SECURITY.md		SECURITY.md
SPEC.md		SPEC.md
action.yml		action.yml
eslint.config.mjs		eslint.config.mjs
package-lock.json		package-lock.json
package.json		package.json
tsconfig.check.json		tsconfig.check.json
tsconfig.json		tsconfig.json
tsup.config.ts		tsup.config.ts
vitest.config.ts		vitest.config.ts

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Goldset

Contents

Why Goldset?

Install

Quickstart

1. Create an eval file

2. Run locally

3. Add to CI

GitHub Action

Three runners

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Goldset

Contents

Why Goldset?

Install

Quickstart

1. Create an eval file

2. Run locally

3. Add to CI

GitHub Action

Three runners

License

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages