5 changes: 5 additions & 0 deletions kits/llm-eval-harness/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
.lamatic/
node_modules/
.next/
.env
.env.local
81 changes: 81 additions & 0 deletions kits/llm-eval-harness/README.md
@@ -0,0 +1,81 @@
# LLM Eval Harness

A ready-to-deploy kit that scores an LLM prompt against a **golden set** using an **LLM-as-judge**, then applies a **CI-style pass/fail gate** — so you can catch quality regressions *before* they ship.

> Point it at any system prompt, give it a handful of test cases with expected criteria, and it tells you whether the prompt's outputs are faithful, relevant, and correct — with a single GATE PASSED / GATE FAILED verdict.

---

## The problem

When you ship an LLM feature and then tweak a prompt or swap a model, output quality can silently regress — a small wording change makes the model hallucinate, over-promise, or drift off-task, and you don't find out until a user does. Eyeballing a few outputs doesn't scale and isn't repeatable.

Teams solve this with an **evaluation harness**: a fixed set of representative inputs (a *golden set*), an automated grader, and a quality bar that must be met to ship. This kit packages that pattern as a hosted, reusable tool on Lamatic.

## The approach

For each case in the golden set, the kit runs two flows:

1. **`run-target`** — sends your system-prompt-under-test + the case input to an LLM and captures the output (the *system under test*).
2. **`judge`** — an LLM-as-judge scores that output against the case's `criteria` (and optional `reference`) on three dimensions, **0–5** each:
   - **Faithfulness** — is every claim grounded? (hallucination is penalised hard — it's a veto)
   - **Relevancy** — does it actually address the input?
   - **Correctness** — does it satisfy the case criteria?

The app aggregates the per-case verdicts into a **pass rate** and compares it to a threshold you set (default **90%**) to produce the gate. A case **passes** only if `overall ≥ 3.5` **and** `faithfulness ≥ 3`.

```
golden case ──▶ run-target (LLM) ──▶ output ──▶ judge (LLM-as-judge) ──▶ {scores, pass, reasoning}
all cases ──▶ pass rate vs threshold ──▶ GATE PASS / FAIL
```
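
The pass rule and gate described above can be sketched in a few lines. This is a minimal sketch, not the kit's `lib/eval.ts`: it assumes `overall` is the unweighted mean of the three dimension scores (the README does not state the formula), and the names `DimensionScores`, `caseVerdict`, and `gateVerdict` are hypothetical.

```typescript
// Hypothetical shapes and names; the kit's real implementation lives in lib/eval.ts.
interface DimensionScores {
  faithfulness: number // 0-5
  relevancy: number // 0-5
  correctness: number // 0-5
}

const OVERALL_MIN = 3.5
const FAITHFULNESS_MIN = 3

// ASSUMPTION: overall is the unweighted mean of the three dimensions.
function caseVerdict(scores: DimensionScores): { overall: number; pass: boolean } {
  const overall = (scores.faithfulness + scores.relevancy + scores.correctness) / 3
  // Faithfulness acts as a veto: a low score fails the case even when overall is high.
  const pass = overall >= OVERALL_MIN && scores.faithfulness >= FAITHFULNESS_MIN
  return { overall, pass }
}

// The gate passes when the share of passing cases meets the threshold (in percent).
function gateVerdict(
  passes: boolean[],
  thresholdPercent = 90,
): { passRate: number; gatePassed: boolean } {
  const passed = passes.filter(Boolean).length
  // Multiply before dividing to avoid floating-point drift on exact percentages.
  const passRate = passes.length === 0 ? 0 : (passed * 100) / passes.length
  return { passRate, gatePassed: passes.length > 0 && passRate >= thresholdPercent }
}
```

Under these assumptions, scores of 2 / 5 / 5 average to 4.0 yet the case still fails, because faithfulness is below 3.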

## Results

- Runs entirely on **Lamatic flows** (Groq `llama-3.3-70b-versatile`, temperature 0 to keep scoring as repeatable as possible).
- The judge reliably **distinguishes good from bad output** — e.g. it fails a support reply that invents a refund against a "final-sale is non-refundable" policy (faithfulness 0), and passes a correct, grounded reply.
- Per-case results are expandable to show the generated output and the judge's reasoning, so a failure tells you *why*.

## Tradeoffs & assumptions

- **Single provider (v1):** the flows use Groq. Lamatic stores model credentials at the project level, so multi-provider / bring-your-own-key was deliberately scoped out of v1 — runtime credential injection is a security tradeoff worth doing properly rather than quickly.
- **App-side loop:** the golden set is iterated in the Next.js server action (3 cases concurrently) rather than inside one flow, which keeps the flows simple and lets the UI surface per-case progress and errors.
- **Gate recomputed in code:** `overall` and `pass` are recomputed from the judge's dimension scores in the app, so the gate is deterministic and not dependent on the model's own arithmetic.
- **Defensive parsing:** judge output is tolerant of code fences and minor formatting; run-target output is HTML-entity-decoded before scoring.
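
The defensive-parsing point can be illustrated with a short sketch. It is not the kit's code: the helper names `extractJsonObject`, `clampScore`, and `decodeBasicEntities` are hypothetical, and the small entity table is an assumption about which entities matter in practice.

```typescript
// Hypothetical helpers; the kit's actual parsing lives in lib/eval.ts.

// Pull the outermost {...} span out of a reply that may be wrapped in a code fence or prose.
function extractJsonObject(raw: string): Record<string, unknown> {
  const start = raw.indexOf("{")
  const end = raw.lastIndexOf("}")
  if (start === -1 || end <= start) throw new Error("No JSON object found in judge output")
  const parsed: unknown = JSON.parse(raw.slice(start, end + 1))
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("Judge output is not a JSON object")
  }
  return parsed as Record<string, unknown>
}

// Accept a number or a numeric string and clamp it to the 0-5 range.
function clampScore(value: unknown): number {
  const n =
    typeof value === "number"
      ? value
      : typeof value === "string" && value.trim() !== ""
        ? Number(value)
        : NaN
  if (!Number.isFinite(n)) throw new Error(`Invalid score: ${String(value)}`)
  return Math.min(5, Math.max(0, n))
}

// Decode a handful of common HTML entities in a single pass.
// ASSUMPTION: this small set is enough for the outputs being scored.
function decodeBasicEntities(text: string): string {
  const entities: Record<string, string> = {
    "&lt;": "<",
    "&gt;": ">",
    "&quot;": '"',
    "&#39;": "'",
    "&amp;": "&",
  }
  return text.replace(/&(?:lt|gt|quot|#39|amp);/g, (m) => entities[m] ?? m)
}
```

Decoding in a single pass means an already-escaped sequence such as `&amp;lt;` becomes `&lt;` rather than being decoded twice.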

---

## Flows

| Flow | Input | Output |
|------|-------|--------|
| `judge` | `{ input, output, criteria, reference? }` | `{ faithfulness, relevancy, correctness, overall, pass, reasoning }` |
| `run-target` | `{ systemPrompt, input }` | `{ answer }` (the generated output under test) |

## Setup

```bash
cd kits/llm-eval-harness/apps
cp .env.example .env.local # then fill in the values below
npm install
npm run dev # http://localhost:3000
```

### Environment variables

| Variable | Where to find it |
|----------|------------------|
| `JUDGE_FLOW` | Deploy the `judge` flow in Lamatic Studio → copy its Flow ID |
| `RUN_TARGET_FLOW` | Deploy the `run-target` flow → copy its Flow ID |
| `LAMATIC_API_URL` | Studio → Settings / API |
| `LAMATIC_PROJECT_ID` | Studio → Project settings |
| `LAMATIC_API_KEY` | Studio → API Keys |
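
A quick check for the variables in the table above might look like the following. This is only a sketch: `REQUIRED_ENV` and `missingEnv` are hypothetical names, not part of the kit, and the function takes the environment as a plain object so it does not depend on any particular runtime.

```typescript
// Hypothetical helper; the variable names are taken from the table above.
const REQUIRED_ENV = [
  "JUDGE_FLOW",
  "RUN_TARGET_FLOW",
  "LAMATIC_API_URL",
  "LAMATIC_PROJECT_ID",
  "LAMATIC_API_KEY",
] as const

// Return the names of required variables that are unset or blank.
function missingEnv(env: Record<string, string | undefined>): string[] {
  return REQUIRED_ENV.filter((name) => !env[name]?.trim())
}
```

In a Node runtime this could be called as `missingEnv(process.env)` before the first evaluation, so a missing value is reported by name rather than surfacing as an auth or flow error.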

## Usage

1. Paste the **system prompt** you want to evaluate.
2. Provide a **golden set** as JSON — an array of `{ input, criteria, reference? }`.
3. Set a **gate threshold** (default 90%).
4. Click **Run evaluation** — or **Load example** to try a support-agent scenario.
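
For step 2, a golden set and a basic shape check could look like this. The case text is an invented example in the spirit of the support-agent scenario, and `parseGoldenSet` is a hypothetical validator, not a function the kit is known to export.

```typescript
// Hypothetical validator; GoldenCase mirrors the { input, criteria, reference? } shape above.
interface GoldenCase {
  input: string
  criteria: string
  reference?: string
}

// Invented example data for illustration only.
const exampleGoldenSet = JSON.stringify([
  {
    input: "Can I get a refund on a final-sale item?",
    criteria: "States that final-sale items are non-refundable and does not promise a refund.",
    reference: "Final-sale items are non-refundable.",
  },
])

// Parse a JSON string and verify that every entry has the expected fields.
function parseGoldenSet(json: string): GoldenCase[] {
  const data: unknown = JSON.parse(json)
  if (!Array.isArray(data) || data.length === 0) {
    throw new Error("Golden set must be a non-empty JSON array")
  }
  return data.map((item: unknown, i: number): GoldenCase => {
    const c = item as Partial<Record<keyof GoldenCase, unknown>> | null
    if (!c || typeof c.input !== "string" || typeof c.criteria !== "string") {
      throw new Error(`Case ${i}: "input" and "criteria" must be strings`)
    }
    if (c.reference !== undefined && typeof c.reference !== "string") {
      throw new Error(`Case ${i}: "reference" must be a string when present`)
    }
    return c.reference === undefined
      ? { input: c.input, criteria: c.criteria }
      : { input: c.input, criteria: c.criteria, reference: c.reference }
  })
}
```

Validating up front keeps a malformed case from being discovered only after several flow calls have already run.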

Built on [Lamatic](https://lamatic.ai).
52 changes: 52 additions & 0 deletions kits/llm-eval-harness/agent.md
@@ -0,0 +1,52 @@
# LLM Eval Harness

## Overview
The LLM Eval Harness is a quality-gate agent for other LLM features. Given a system prompt and a golden set of test cases, it runs each case through the prompt-under-test and then grades the output with an LLM-as-judge across faithfulness, relevancy, and correctness, returning per-case scores and a single pass/fail gate. It is invoked by a Next.js web UI that calls two Lamatic flows and aggregates the verdicts. It depends on Lamatic's hosted runtime, project credentials, and a connected text-generation provider (Groq).

## Purpose
Prompt and model changes can silently regress output quality — a reworded instruction starts hallucinating, over-promising, or drifting off-task. This agent makes that measurable and repeatable: a fixed golden set plus an automated judge plus a quality threshold, so a regression is caught as a failed gate rather than by a user. It generalises the eval-harness pattern (golden sets + LLM-as-judge + CI gate) into a hosted, reusable tool.

## Flows

### `judge`
- **Trigger:** API request with `{ input, output, criteria, reference? }`.
- **Processing:** a single LLM node (Groq `llama-3.3-70b-versatile`, temperature 0) acts as a strict evaluation judge using the system prompt in `prompts/`. It scores the candidate `output` against the `criteria` and optional `reference`.
- **Response:** JSON `{ faithfulness, relevancy, correctness, overall, pass, reasoning }`, each dimension 0–5.
- **When to use:** to score one already-generated output against case criteria.
- **Dependencies:** Groq text model credential.

### `run-target`
- **Trigger:** API request with `{ systemPrompt, input }`.
- **Processing:** a single LLM node runs `systemPrompt` (system) + `input` (user) — this is the *system under test*.
- **Response:** `{ answer }`, the generated output.
- **When to use:** to produce the output that `judge` then scores.
- **Dependencies:** Groq text model credential.

## Guardrails
- The `judge` only scores; it never completes the user's task or rewrites the output.
- It does not reward length, confidence, formatting, or politeness — an eloquent but unsupported answer scores low on faithfulness.
- Faithfulness is a veto: a hallucinated or contradicting answer fails regardless of other scores.
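- Scoring is configured for repeatability (temperature 0); identical inputs should yield the same scores, though hosted inference does not strictly guarantee identical output on every run.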

## Integration Reference
- **Lamatic API runtime** — hosts and executes both flows. Requires `LAMATIC_API_URL`, `LAMATIC_PROJECT_ID`, `LAMATIC_API_KEY` in the calling app.
- **Groq (text generation)** — backs both LLM nodes; configured as a model credential in Lamatic Studio.

## Environment Setup
- `JUDGE_FLOW` — deployed `judge` flow ID, called by the app.
- `RUN_TARGET_FLOW` — deployed `run-target` flow ID, called by the app.
- `LAMATIC_API_URL`, `LAMATIC_PROJECT_ID`, `LAMATIC_API_KEY` — Lamatic project credentials used by the app to invoke the flows.

## Quickstart
1. Build and deploy the `judge` and `run-target` flows in Lamatic Studio; copy their Flow IDs.
2. In `apps/`, copy `.env.example` to `.env.local` and fill in the flow IDs + Lamatic credentials.
3. `npm install && npm run dev`, open `http://localhost:3000`.
4. Paste a system prompt + a golden set (or click **Load example**) and run.

## Common Failure Modes
| Symptom | Likely cause | Fix |
|---|---|---|
| Judge scores look random | Model too small or temperature not 0 | Use `llama-3.3-70b-versatile`, set temperature 0 |
| "No answer returned from flow" | Wrong flow ID or response mapping | Verify `JUDGE_FLOW`/`RUN_TARGET_FLOW` and that the response maps `answer` |
| Auth error on run | Missing/invalid Lamatic credentials | Check `LAMATIC_API_*` in `.env.local` |
| A case shows "error" | run-target or judge failed for that input | Expand the row; the run continues for other cases |
8 changes: 8 additions & 0 deletions kits/llm-eval-harness/apps/.env.example
@@ -0,0 +1,8 @@
# Deployed Lamatic flow IDs (Studio → deploy the flow → copy its Flow ID)
JUDGE_FLOW="your-judge-flow-id"
RUN_TARGET_FLOW="your-run-target-flow-id"

# Lamatic project credentials (Studio → Settings / API)
LAMATIC_API_URL="https://your-project.lamatic.dev"
LAMATIC_PROJECT_ID="your-project-id"
LAMATIC_API_KEY="your-lamatic-api-key"
29 changes: 29 additions & 0 deletions kits/llm-eval-harness/apps/.gitignore
@@ -0,0 +1,29 @@
# See https://help.github.com/articles/ignoring-files/ for more about ignoring files.

# dependencies
/node_modules

# next.js
/.next/
/out/

# production
/build

# debug
npm-debug.log*
yarn-debug.log*
yarn-error.log*
.pnpm-debug.log*

# env files
.env
.env.local
.env*.local

# vercel
.vercel

# typescript
*.tsbuildinfo
next-env.d.ts
34 changes: 34 additions & 0 deletions kits/llm-eval-harness/apps/README.md
@@ -0,0 +1,34 @@
# LLM Eval Harness — App

Next.js front end for the **LLM Eval Harness** kit. It calls two Lamatic flows
(`run-target` and `judge`) to score a system prompt against a golden set and
render a CI-style pass/fail gate.

See the [kit README](../README.md) for the full overview.

## Run locally

```bash
cp .env.example .env.local # fill in flow IDs + Lamatic credentials
npm install
npm run dev # http://localhost:3000
```

## Environment variables

| Variable | Source |
|----------|--------|
| `JUDGE_FLOW` | Deployed `judge` flow ID (Lamatic Studio) |
| `RUN_TARGET_FLOW` | Deployed `run-target` flow ID |
| `LAMATIC_API_URL` | Studio → Settings / API |
| `LAMATIC_PROJECT_ID` | Studio → Project settings |
| `LAMATIC_API_KEY` | Studio → API Keys |

## Structure

- `actions/orchestrate.ts` — server action: per-case `run-target` → `judge` loop, aggregation, gate
- `lib/lamatic-client.ts` — Lamatic SDK client + flow IDs from env
- `lib/eval.ts` — judge-output parsing, HTML decode, gate computation, bounded concurrency
- `lib/types.ts` — shared data contracts
- `components/gate-banner.tsx`, `components/results-table.tsx` — results UI
- `app/page.tsx` — the harness UI
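
The bounded concurrency mentioned for `lib/eval.ts` can be sketched as a small worker pool. The signature below is inferred from how `mapWithConcurrency(cases, CONCURRENCY, fn)` is called in `actions/orchestrate.ts`; the body is an assumption about one reasonable way to implement it, not the kit's actual code.

```typescript
// Hypothetical sketch of a bounded-concurrency map; the kit's version lives in lib/eval.ts.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length)
  let next = 0

  // Each worker repeatedly claims the next unprocessed index until none remain.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++
      results[index] = await fn(items[index] as T, index)
    }
  }

  // Never start more workers than there are items, and always start at least one.
  const workerCount = Math.max(1, Math.min(limit, items.length))
  await Promise.all(Array.from({ length: workerCount }, () => worker()))
  return results
}
```

Results are written by index, so output order matches input order regardless of which call finishes first. In this sketch a rejected `fn` rejects the whole map, which is why it pairs well with a per-item function that catches its own errors and returns them as data.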
84 changes: 84 additions & 0 deletions kits/llm-eval-harness/apps/actions/orchestrate.ts
@@ -0,0 +1,84 @@
"use server"

import lamaticConfig from "../../lamatic.config"
import { getLamaticClient } from "@/lib/lamatic-client"
import { computeAggregate, decodeHtmlEntities, mapWithConcurrency, parseJudgeResult } from "@/lib/eval"
import type { CaseResult, GoldenCase, RunAggregate } from "@/lib/types"

// Bounded concurrency keeps large golden sets from tripping Groq rate limits.
const CONCURRENCY = 3

/** Resolve a deployed flow ID from the kit's lamatic.config step definitions. */
function resolveFlowId(stepId: string): string {
  const step = lamaticConfig.steps.find((s) => s.id === stepId)
  if (!step?.envKey) {
    throw new Error(`lamatic.config has no step "${stepId}" with an envKey`)
  }
  const value = process.env[step.envKey]
  if (!value) {
    throw new Error(`Missing environment variable "${step.envKey}" for flow "${stepId}"`)
  }
  return value
}

/** Execute a flow and pull the `answer` field out of the Lamatic response. */
async function getAnswer(flowId: string, inputs: Record<string, unknown>): Promise<unknown> {
  const resData = await getLamaticClient().executeFlow(flowId, inputs)
  const envelope = resData as { result?: { answer?: unknown }; answer?: unknown }
  const answer = envelope?.result?.answer ?? envelope?.answer
  if (answer === undefined || answer === null) {
    throw new Error("No answer returned from flow")
  }
  return answer
}

/** Run one golden case through run-target, then score it with the judge. */
async function evaluateCase(
  systemPrompt: string,
  testCase: GoldenCase,
  flows: { judge: string; runTarget: string },
): Promise<CaseResult> {
  try {
    const rawOutput = await getAnswer(flows.runTarget, { systemPrompt, input: testCase.input })
    const output = decodeHtmlEntities(typeof rawOutput === "string" ? rawOutput : JSON.stringify(rawOutput))

    const rawJudge = await getAnswer(flows.judge, {
      input: testCase.input,
      output,
      criteria: testCase.criteria,
      reference: testCase.reference ?? "",
    })

    return { case: testCase, output, judge: parseJudgeResult(rawJudge) }
  } catch (error) {
    return {
      case: testCase,
      output: "",
      judge: null,
      error: error instanceof Error ? error.message : "Evaluation failed",
    }
  }
}

/** Evaluate a system prompt against a golden set and return the gate verdict. */
export async function runEvaluation(
  systemPrompt: string,
  cases: GoldenCase[],
  threshold: number,
): Promise<{ success: boolean; data?: RunAggregate; error?: string }> {
  try {
    if (!systemPrompt.trim()) throw new Error("A system prompt is required")
    if (!Array.isArray(cases) || cases.length === 0) throw new Error("Provide at least one test case")
    if (!Number.isFinite(threshold) || threshold < 0 || threshold > 100) {
      throw new Error("Threshold must be a number between 0 and 100")
    }

    const flows = { judge: resolveFlowId("judge"), runTarget: resolveFlowId("run-target") }
    const results = await mapWithConcurrency(cases, CONCURRENCY, (testCase) =>
      evaluateCase(systemPrompt, testCase, flows),
    )
    return { success: true, data: computeAggregate(results, threshold) }
  } catch (error) {
    return { success: false, error: error instanceof Error ? error.message : "Evaluation failed" }
  }
}