5 changes: 5 additions & 0 deletions kits/llm-eval-harness/.gitignore
@@ -0,0 +1,5 @@
.lamatic/
node_modules/
.next/
.env
.env.local
81 changes: 81 additions & 0 deletions kits/llm-eval-harness/README.md
@@ -0,0 +1,81 @@
# LLM Eval Harness

A ready-to-deploy kit that scores an LLM prompt against a **golden set** using an **LLM-as-judge**, then applies a **CI-style pass/fail gate** — so you can catch quality regressions *before* they ship.

> Point it at any system prompt, give it a handful of test cases with expected criteria, and it tells you whether the prompt's outputs are faithful, relevant, and correct — with a single GATE PASSED / GATE FAILED verdict.

---

## The problem

When you ship an LLM feature and then tweak a prompt or swap a model, output quality can silently regress — a small wording change makes the model hallucinate, over-promise, or drift off-task, and you don't find out until a user does. Eyeballing a few outputs doesn't scale and isn't repeatable.

Teams solve this with an **evaluation harness**: a fixed set of representative inputs (a *golden set*), an automated grader, and a quality bar that must be met to ship. This kit packages that pattern as a hosted, reusable tool on Lamatic.

## The approach

For each case in the golden set, the kit runs two flows:

1. **`run-target`**: sends the system prompt under test plus the case input to an LLM and captures the output (the *system under test*).
2. **`judge`**: an LLM-as-judge scores that output against the case's `criteria` (and optional `reference`) on three dimensions, **0–5** each:
   - **Faithfulness**: is every claim grounded? Hallucination is penalised hard; it acts as a veto.
   - **Relevancy**: does it actually address the input?
   - **Correctness**: does it satisfy the case criteria?

The app aggregates the per-case verdicts into a **pass rate** and compares it to a threshold you set (default **90%**) to produce the gate. A case **passes** only if `overall ≥ 3.5` **and** `faithfulness ≥ 3`.

```
golden case ──▶ run-target (LLM) ──▶ output ──▶ judge (LLM-as-judge) ──▶ {scores, pass, reasoning}
all cases ──▶ pass rate vs threshold ──▶ GATE PASS / FAIL
```
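
The pass rule and gate described above can be sketched in TypeScript. This is a minimal illustration, not the kit's own code: the `Scores` type and the `overallOf`, `casePasses`, and `gate` helpers are hypothetical names. Two things are assumptions: `overall` is taken as the mean of the three dimensions, and an empty run is treated as a failed gate. The README states neither.

```typescript
// Hypothetical types and helpers for illustration; not taken from the kit's source.
type Scores = { faithfulness: number; relevancy: number; correctness: number };

// Assumption: overall is the arithmetic mean of the three 0-5 dimension scores.
function overallOf(s: Scores): number {
  return (s.faithfulness + s.relevancy + s.correctness) / 3;
}

// A case passes only if overall >= 3.5 AND faithfulness >= 3 (the veto).
function casePasses(s: Scores): boolean {
  return overallOf(s) >= 3.5 && s.faithfulness >= 3;
}

// The gate compares the pass rate (as a percentage) with the threshold.
function gate(cases: Scores[], thresholdPct = 90): { passRate: number; passed: boolean } {
  // Assumption: a run with no cases cannot pass the gate.
  if (cases.length === 0) return { passRate: 0, passed: false };
  const passedCount = cases.filter(casePasses).length;
  const passRate = (passedCount / cases.length) * 100;
  return { passRate, passed: passRate >= thresholdPct };
}
```

The faithfulness check is separate from the `overall` check on purpose: an answer scoring 2 / 5 / 5 averages 4.0 under the assumed mean, yet still fails because of the veto.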

## Results

- Runs entirely on **Lamatic flows** (Groq `llama-3.3-70b-versatile`, temperature 0 to keep scoring as repeatable as possible).
- The judge reliably **distinguishes good from bad output** — e.g. it fails a support reply that invents a refund against a "final-sale is non-refundable" policy (faithfulness 0), and passes a correct, grounded reply.
- Per-case results are expandable to show the generated output and the judge's reasoning, so a failure tells you *why*.

## Tradeoffs & assumptions

- **Single provider (v1):** the flows use Groq. Lamatic stores model credentials at the project level, so multi-provider / bring-your-own-key was deliberately scoped out of v1 — runtime credential injection is a security tradeoff worth doing properly rather than quickly.
- **App-side loop:** the golden set is iterated in the Next.js server action (3 cases concurrently) rather than inside one flow, which keeps the flows simple and lets the UI surface per-case progress and errors.
- **Gate recomputed in code:** `overall` and `pass` are recomputed from the judge's dimension scores in the app, so the gate is deterministic and not dependent on the model's own arithmetic.
- **Defensive parsing:** judge output is tolerant of code fences and minor formatting; run-target output is HTML-entity-decoded before scoring.
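
As a rough sketch of the defensive-parsing bullet, the two helpers below tolerate a fenced or chatty judge reply and decode a handful of HTML entities. `parseJudgeJson` and `decodeBasicEntities` are hypothetical names, and the entity list is deliberately small; the kit's actual implementation is not shown in this README and may differ.

```typescript
// Hypothetical helpers for illustration; the kit's real parsing may differ.

// The fence marker is built at runtime so this sample does not contain a literal fence.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE, "i");

// Accept raw JSON, JSON inside a code fence, or JSON surrounded by stray text.
function parseJudgeJson(raw: string): Record<string, unknown> {
  const fenced = raw.match(FENCED_BLOCK);
  const candidate = (fenced ? fenced[1] : raw).trim();
  // Keep only the outermost {...} span.
  const start = candidate.indexOf("{");
  const end = candidate.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("Judge output contains no JSON object");
  }
  return JSON.parse(candidate.slice(start, end + 1)) as Record<string, unknown>;
}

// Decode numeric entities and five common named ones in a single pass,
// so already-escaped text such as "&amp;lt;" is only decoded once.
function decodeBasicEntities(text: string): string {
  const named: Record<string, string> = { amp: "&", lt: "<", gt: ">", quot: '"', apos: "'" };
  return text.replace(
    /&(?:#(\d+)|(amp|lt|gt|quot|apos));/g,
    (match: string, code?: string, name?: string) => {
      if (code !== undefined) {
        const point = Number(code);
        // Leave out-of-range code points untouched instead of throwing.
        return point <= 0x10ffff ? String.fromCodePoint(point) : match;
      }
      return named[name as string];
    },
  );
}
```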

---

## Flows

| Flow | Input | Output |
|------|-------|--------|
| `judge` | `{ input, output, criteria, reference? }` | `{ faithfulness, relevancy, correctness, overall, pass, reasoning }` |
| `run-target` | `{ systemPrompt, input }` | `{ answer }` (the generated output under test) |
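
The contracts in the table can also be written down as TypeScript types. The sketch below is illustrative only: the interface and function names are hypothetical, and the field types (strings, numbers, a boolean) are assumptions, since the table lists field names but not their types.

```typescript
// Illustrative types mirroring the Flows table; names and field types are assumed.
interface RunTargetInput { systemPrompt: string; input: string }
interface RunTargetOutput { answer: string }
interface JudgeInput { input: string; output: string; criteria: string; reference?: string }
interface JudgeOutput {
  faithfulness: number;
  relevancy: number;
  correctness: number;
  overall: number;
  pass: boolean;
  reasoning: string;
}

// Build the run-target request for one case.
function toRunTargetInput(systemPrompt: string, testCase: { input: string }): RunTargetInput {
  return { systemPrompt, input: testCase.input };
}

// Feed the run-target answer into the judge request; a missing reference becomes "".
function toJudgeInput(
  testCase: { input: string; criteria: string; reference?: string },
  target: RunTargetOutput,
): JudgeInput {
  return {
    input: testCase.input,
    output: target.answer,
    criteria: testCase.criteria,
    reference: testCase.reference ?? "",
  };
}

// Each dimension score is expected to be a number in the 0-5 range.
const isScore = (v: unknown): v is number => typeof v === "number" && v >= 0 && v <= 5;

// Runtime check that a parsed judge reply has the documented shape.
function isJudgeOutput(value: unknown): value is JudgeOutput {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    isScore(v.faithfulness) &&
    isScore(v.relevancy) &&
    isScore(v.correctness) &&
    isScore(v.overall) &&
    typeof v.pass === "boolean" &&
    typeof v.reasoning === "string"
  );
}
```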

## Setup

```bash
cd kits/llm-eval-harness/apps
cp .env.example .env.local # then fill in the values below
npm install
npm run dev # http://localhost:3000
```

### Environment variables

| Variable | Where to find it |
|----------|------------------|
| `JUDGE_FLOW` | Deploy the `judge` flow in Lamatic Studio → copy its Flow ID |
| `RUN_TARGET_FLOW` | Deploy the `run-target` flow → copy its Flow ID |
| `LAMATIC_API_URL` | Studio → Settings / API |
| `LAMATIC_PROJECT_ID` | Studio → Project settings |
| `LAMATIC_API_KEY` | Studio → API Keys |

## Usage

1. Paste the **system prompt** you want to evaluate.
2. Provide a **golden set** as JSON — an array of `{ input, criteria, reference? }`.
3. Set a **gate threshold** (default 90%).
4. Click **Run evaluation** — or **Load example** to try a support-agent scenario.
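
For step 2, a golden set might look like the `EXAMPLE_GOLDEN_SET` string below, with a small validator for the `{ input, criteria, reference? }` shape. Both are hypothetical: the two cases are made up for illustration (loosely following the support-agent scenario), `parseGoldenSet` is not a function from the kit, and treating every field as a string is an assumption.

```typescript
// Made-up golden set for illustration; not the kit's built-in example.
const EXAMPLE_GOLDEN_SET = JSON.stringify([
  {
    input: "Can I get a refund on a final-sale item?",
    criteria: "Must state that final-sale items are non-refundable and must not promise a refund.",
    reference: "Final-sale items are non-refundable.",
  },
  {
    input: "How do I reset my password?",
    criteria: "Must explain how to reset the password without inventing account details.",
  },
]);

interface GoldenCase { input: string; criteria: string; reference?: string }

// Hypothetical validator: parses the pasted JSON and checks each case's shape.
function parseGoldenSet(json: string): GoldenCase[] {
  const data: unknown = JSON.parse(json);
  if (!Array.isArray(data) || data.length === 0) {
    throw new Error("Golden set must be a non-empty JSON array");
  }
  return data.map((item: unknown, i: number): GoldenCase => {
    if (typeof item !== "object" || item === null) {
      throw new Error(`Case ${i}: expected an object`);
    }
    const { input, criteria, reference } = item as Record<string, unknown>;
    if (typeof input !== "string" || typeof criteria !== "string") {
      throw new Error(`Case ${i}: "input" and "criteria" must be strings`);
    }
    if (reference !== undefined && typeof reference !== "string") {
      throw new Error(`Case ${i}: "reference" must be a string when present`);
    }
    return reference === undefined ? { input, criteria } : { input, criteria, reference };
  });
}
```

Validating before the run starts means a malformed case is reported once, up front, rather than surfacing as a per-case error halfway through an evaluation.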

Built on [Lamatic](https://lamatic.ai).
52 changes: 52 additions & 0 deletions kits/llm-eval-harness/agent.md
@@ -0,0 +1,52 @@
# LLM Eval Harness

## Overview
The LLM Eval Harness is a quality-gate agent for other LLM features. Given a system prompt and a golden set of test cases, it runs each case through the prompt-under-test and then grades the output with an LLM-as-judge across faithfulness, relevancy, and correctness, returning per-case scores and a single pass/fail gate. It is invoked by a Next.js web UI that calls two Lamatic flows and aggregates the verdicts. It depends on Lamatic's hosted runtime, project credentials, and a connected text-generation provider (Groq).

## Purpose
Prompt and model changes can silently regress output quality — a reworded instruction starts hallucinating, over-promising, or drifting off-task. This agent makes that measurable and repeatable: a fixed golden set plus an automated judge plus a quality threshold, so a regression is caught as a failed gate rather than by a user. It generalises the eval-harness pattern (golden sets + LLM-as-judge + CI gate) into a hosted, reusable tool.

## Flows

### `judge`
- **Trigger:** API request with `{ input, output, criteria, reference? }`.
- **Processing:** a single LLM node (Groq `llama-3.3-70b-versatile`, temperature 0) acts as a strict evaluation judge using the system prompt in `prompts/`. It scores the candidate `output` against the `criteria` and optional `reference`.
- **Response:** JSON `{ faithfulness, relevancy, correctness, overall, pass, reasoning }`, each dimension 0–5.
- **When to use:** to score one already-generated output against case criteria.
- **Dependencies:** Groq text model credential.

### `run-target`
- **Trigger:** API request with `{ systemPrompt, input }`.
- **Processing:** a single LLM node runs `systemPrompt` (system) + `input` (user) — this is the *system under test*.
- **Response:** `{ answer }`, the generated output.
- **When to use:** to produce the output that `judge` then scores.
- **Dependencies:** Groq text model credential.

## Guardrails
- The `judge` only scores; it never completes the user's task or rewrites the output.
- It does not reward length, confidence, formatting, or politeness — an eloquent but unsupported answer scores low on faithfulness.
- Faithfulness is a veto: a hallucinated or contradicting answer fails regardless of other scores.
- Scoring runs at temperature 0 so that identical inputs yield near-identical scores; hosted inference can still vary slightly between runs.

## Integration Reference
- **Lamatic API runtime** — hosts and executes both flows. Requires `LAMATIC_API_URL`, `LAMATIC_PROJECT_ID`, `LAMATIC_API_KEY` in the calling app.
- **Groq (text generation)** — backs both LLM nodes; configured as a model credential in Lamatic Studio.

## Environment Setup
- `JUDGE_FLOW` — deployed `judge` flow ID, called by the app.
- `RUN_TARGET_FLOW` — deployed `run-target` flow ID, called by the app.
- `LAMATIC_API_URL`, `LAMATIC_PROJECT_ID`, `LAMATIC_API_KEY` — Lamatic project credentials used by the app to invoke the flows.

## Quickstart
1. Build and deploy the `judge` and `run-target` flows in Lamatic Studio; copy their Flow IDs.
2. In `apps/`, copy `.env.example` to `.env.local` and fill in the flow IDs + Lamatic credentials.
3. `npm install && npm run dev`, open `http://localhost:3000`.
4. Paste a system prompt + a golden set (or click **Load example**) and run.

## Common Failure Modes
| Symptom | Likely cause | Fix |
|---|---|---|
| Judge scores look random | Model too small or temperature not 0 | Use `llama-3.3-70b-versatile`, set temperature 0 |
| "No answer returned from flow" | Wrong flow ID or response mapping | Verify `JUDGE_FLOW`/`RUN_TARGET_FLOW` and that the response maps `answer` |
| Auth error on run | Missing/invalid Lamatic credentials | Check `LAMATIC_API_URL`, `LAMATIC_PROJECT_ID`, and `LAMATIC_API_KEY` in `.env.local` |
| A case shows "error" | run-target or judge failed for that input | Expand the row; the run continues for other cases |
8 changes: 8 additions & 0 deletions kits/llm-eval-harness/apps/.env.example
@@ -0,0 +1,8 @@
# Deployed Lamatic flow IDs (Studio → deploy the flow → copy its Flow ID)
JUDGE_FLOW="your-judge-flow-id"
RUN_TARGET_FLOW="your-run-target-flow-id"

# Lamatic project credentials (Studio → Settings / API)
LAMATIC_API_URL="https://your-project.lamatic.dev"
LAMATIC_PROJECT_ID="your-project-id"
LAMATIC_API_KEY="your-lamatic-api-key"
29 changes: 29 additions & 0 deletions kits/llm-eval-harness/apps/.gitignore
@@ -0,0 +1,29 @@
# See https://help.github.com/articles/ignoring-files/ for more about ignoring files.

# dependencies
/node_modules

# next.js
/.next/
/out/

# production
/build

# debug
npm-debug.log*
yarn-debug.log*
yarn-error.log*
.pnpm-debug.log*

# env files
.env
.env.local
.env*.local

# vercel
.vercel

# typescript
*.tsbuildinfo
next-env.d.ts
34 changes: 34 additions & 0 deletions kits/llm-eval-harness/apps/README.md
@@ -0,0 +1,34 @@
# LLM Eval Harness — App

Next.js front end for the **LLM Eval Harness** kit. It calls two Lamatic flows
(`run-target` and `judge`) to score a system prompt against a golden set and
render a CI-style pass/fail gate.

See the [kit README](../README.md) for the full overview.

## Run locally

```bash
cp .env.example .env.local # fill in flow IDs + Lamatic credentials
npm install
npm run dev # http://localhost:3000
```

## Environment variables

| Variable | Source |
|----------|--------|
| `JUDGE_FLOW` | Deployed `judge` flow ID (Lamatic Studio) |
| `RUN_TARGET_FLOW` | Deployed `run-target` flow ID |
| `LAMATIC_API_URL` | Studio → Settings / API |
| `LAMATIC_PROJECT_ID` | Studio → Project settings |
| `LAMATIC_API_KEY` | Studio → API Keys |

## Structure

- `actions/orchestrate.ts` — server action: per-case `run-target` → `judge` loop, aggregation, gate
- `lib/lamatic-client.ts` — Lamatic SDK client + flow IDs from env
- `lib/eval.ts` — judge-output parsing, HTML decode, gate computation, bounded concurrency
- `lib/types.ts` — shared data contracts
- `components/gate-banner.tsx`, `components/results-table.tsx` — results UI
- `app/page.tsx` — the harness UI
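
The bounded concurrency mentioned for `lib/eval.ts` can be pictured as a small worker pool. The sketch below matches the call shape `mapWithConcurrency(items, limit, fn)` used by the server action, but the body is an assumption written for illustration; the kit's own implementation is not shown here and may differ.

```typescript
// Illustrative worker-pool version; the kit's own mapWithConcurrency may differ.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item

  // Each worker repeatedly claims the next index and processes it.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claimed synchronously, so no two workers share an index
      results[i] = await fn(items[i], i);
    }
  }

  // Start at most `limit` workers (at least one), then wait for all of them.
  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // results keep the input order
}
```

In this sketch a rejected `fn` call rejects the whole batch, so it pairs best with a callback that catches its own errors and returns a per-item error result.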
84 changes: 84 additions & 0 deletions kits/llm-eval-harness/apps/actions/orchestrate.ts
@@ -0,0 +1,84 @@
"use server"

import lamaticConfig from "../../lamatic.config"
import { getLamaticClient } from "@/lib/lamatic-client"
import { computeAggregate, decodeHtmlEntities, mapWithConcurrency, parseJudgeResult } from "@/lib/eval"
import type { CaseResult, GoldenCase, RunAggregate } from "@/lib/types"

// Bounded concurrency keeps large golden sets from tripping Groq rate limits.
const CONCURRENCY = 3

/** Resolve a deployed flow ID from the kit's lamatic.config step definitions. */
function resolveFlowId(stepId: string): string {
  const step = lamaticConfig.steps.find((s) => s.id === stepId)
  if (!step?.envKey) {
    throw new Error(`lamatic.config has no step "${stepId}" with an envKey`)
  }
  const value = process.env[step.envKey]
  if (!value) {
    throw new Error(`Missing environment variable "${step.envKey}" for flow "${stepId}"`)
  }
  return value
}

/** Execute a flow and pull the `answer` field out of the Lamatic response. */
async function getAnswer(flowId: string, inputs: Record<string, unknown>): Promise<unknown> {
  const resData = await getLamaticClient().executeFlow(flowId, inputs)
  const envelope = resData as { result?: { answer?: unknown }; answer?: unknown }
  const answer = envelope?.result?.answer ?? envelope?.answer
  if (answer === undefined || answer === null) {
    throw new Error("No answer returned from flow")
  }
  return answer
}

/** Run one golden case through run-target, then score it with the judge. */
async function evaluateCase(
  systemPrompt: string,
  testCase: GoldenCase,
  flows: { judge: string; runTarget: string },
): Promise<CaseResult> {
  try {
    const rawOutput = await getAnswer(flows.runTarget, { systemPrompt, input: testCase.input })
    const output = decodeHtmlEntities(typeof rawOutput === "string" ? rawOutput : JSON.stringify(rawOutput))

    const rawJudge = await getAnswer(flows.judge, {
      input: testCase.input,
      output,
      criteria: testCase.criteria,
      reference: testCase.reference ?? "",
    })

    return { case: testCase, output, judge: parseJudgeResult(rawJudge) }
  } catch (error) {
    return {
      case: testCase,
      output: "",
      judge: null,
      error: error instanceof Error ? error.message : "Evaluation failed",
    }
  }
}

/** Evaluate a system prompt against a golden set and return the gate verdict. */
export async function runEvaluation(
  systemPrompt: string,
  cases: GoldenCase[],
  threshold: number,
): Promise<{ success: boolean; data?: RunAggregate; error?: string }> {
  try {
    if (!systemPrompt.trim()) throw new Error("A system prompt is required")
    if (!Array.isArray(cases) || cases.length === 0) throw new Error("Provide at least one test case")
    if (!Number.isFinite(threshold) || threshold < 0 || threshold > 100) {
      throw new Error("Threshold must be a number between 0 and 100")
    }

    const flows = { judge: resolveFlowId("judge"), runTarget: resolveFlowId("run-target") }
    const results = await mapWithConcurrency(cases, CONCURRENCY, (testCase) =>
      evaluateCase(systemPrompt, testCase, flows),
    )
    return { success: true, data: computeAggregate(results, threshold) }
  } catch (error) {
    return { success: false, error: error instanceof Error ? error.message : "Evaluation failed" }
  }
}