The Judge Reliability Harness (JRH) orchestrates end-to-end evaluations of automated judges. It standardizes how datasets are prepared, perturbations are generated, and model outputs are rescored so teams can compare judge reliability across tasks. The harness coordinates data ingestion, perturbation pipelines, automated grading, and reporting into reproducible runs.
LLMs are increasingly used as judges to score, rank, or classify AI outputs in evaluations. Human evaluation yields high-quality judgments but is expensive and difficult to scale, which has motivated the widespread use of LLMs as judges in place of human annotators. However, the reliability of the judge system configuration, including the LLM judge model, rubric, and prompt templates, is rarely evaluated or measured in a systematic manner, or reported alongside benchmark evaluation results. Point estimates of agreement with human raters on small validation sets provide limited assurance about how a judge will respond to realistic variations in inputs, such as changes in formatting, paraphrasing, verbosity, or sampling parameters. This gap between the central role of LLM judges and the limited tools available to characterize their reliability makes it difficult for practitioners and decision makers to know how much confidence to place in AI evaluation results.
We introduce the Judge Reliability Harness (JRH), an open source library that generates validation suites for any LLM judge on both agentic and free-response benchmarks. JRH generates reliability tests that measure grading accuracy via label-flipped responses, invariance to formatting and paraphrasing, susceptibility to verbosity bias, stochastic stability under repeated sampling, and calibration across an ordinal grading scale. JRH includes a human-in-the-loop review process for the generated reliability tests, with a user interface that gives reviewers full control to accept, reject, or edit each test. Across a range of candidate judges, it aggregates pass rates, confidence intervals, and cost curves into standardized reports. By making reliability testing configurable, reproducible, and inexpensive, JRH aims to support more transparent and trustworthy use of LLM judges in both research and deployment contexts.
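The reports mentioned above aggregate pass rates with confidence intervals. As a generic illustration of that kind of aggregation (not JRH's internal API; the function below is our own), here is a minimal Python sketch that turns a list of per-test pass/fail outcomes into a pass rate with a Wilson 95% confidence interval:

```python
import math

def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (95% CI by default)."""
    if total == 0:
        return (0.0, 0.0)
    p = passes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (center - half, center + half)

# Hypothetical example: a judge passes 87 of 100 formatting-invariance tests.
outcomes = [True] * 87 + [False] * 13
passes, total = sum(outcomes), len(outcomes)
low, high = wilson_interval(passes, total)
print(f"pass rate = {passes / total:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```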
Clone the project:
git clone https://github.com/RANDCorporation/judge-reliability-harness.git
Move into the project:
cd judge-reliability-harness
Set up a virtual environment:
python -m venv venv
Activate it:
- Linux/macOS: source venv/bin/activate
- Windows (Command Prompt): venv\Scripts\activate
- Windows (PowerShell): venv\Scripts\Activate.ps1
Install dependencies:
uv sync --extra dev --native-tls
Set environment variables:
- Linux/macOS:
  echo -e "OPENAI_API_KEY=your_api_key_here\nOPENAI_ORG_ID=your_org_id_here" > .env
- Windows (PowerShell):
  "OPENAI_API_KEY=your_api_key_here" | Out-File -Encoding utf8 .env
  "OPENAI_ORG_ID=your_org_id_here" | Out-File -Encoding utf8 -Append .env
- By default, this runs the binary harmbench program:
  python -m main
- This runs the persuade program:
  python -m main ./inputs/configs/config_persuade.yml
See default options here: ./src/configs/default_config.yml
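If you need to run several configurations back to back, one option is to call the documented command in a loop. The wrapper below is our own convenience sketch, not a JRH feature; it assumes it is executed from the project root:

```python
import subprocess
import sys
from pathlib import Path

# Run `python -m main <config>` for every YAML config under ./inputs/configs/.
# Assumes the current working directory is the project root.
for config in sorted(Path("./inputs/configs").glob("*.yml")):
    print(f"--- running {config} ---")
    subprocess.run([sys.executable, "-m", "main", str(config)], check=True)
```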
Step 1: Upload a new data set for a project named "PROJECT_NAME"
- Place the new data set into ./inputs/data/PROJECT_NAME/data.csv
- Put the new rubric into ./inputs/data/PROJECT_NAME/rubric.txt
- Put the new instructions into ./inputs/data/PROJECT_NAME/instructions.txt (a quick layout check is sketched after this list)
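As that pre-flight check, the hypothetical snippet below verifies that the three expected files exist before you build a config around the project (our own helper, not part of JRH):

```python
from pathlib import Path

project = "PROJECT_NAME"  # replace with your project name
data_dir = Path("./inputs/data") / project

# Step 1 expects exactly these three files in the project's data directory.
expected = ["data.csv", "rubric.txt", "instructions.txt"]
missing = [name for name in expected if not (data_dir / name).is_file()]
print("Layout OK" if not missing else f"Missing from {data_dir}: {', '.join(missing)}")
```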
Step 2: Create a new config file
cp ./src/configs/default_config.yml ./inputs/configs/custom_config.yml
- Edit the YAML file with your desired configuration. Note that the fields present in config_persuade.yml are REQUIRED.
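Because the fields in config_persuade.yml are required, one way to catch omissions early is to compare top-level keys between that file and your custom config. This check is our own sketch, assuming PyYAML is available in your environment and that both files are YAML mappings:

```python
import yaml  # assumes PyYAML is installed in the project environment

# Every top-level key in config_persuade.yml is REQUIRED, so flag any
# that the custom config does not set.
with open("./inputs/configs/config_persuade.yml") as f:
    required_keys = set(yaml.safe_load(f) or {})
with open("./inputs/configs/custom_config.yml") as f:
    custom_keys = set(yaml.safe_load(f) or {})

missing = sorted(required_keys - custom_keys)
print("All required keys present" if not missing else f"Missing keys: {missing}")
```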
Step 3: Run with custom options
python -m main ./inputs/configs/custom_config.yml
The data will appear in the output_dir specified by your YAML file. By default, this is:
- Results: ./outputs/PROJECT_NAME_TIMESTAMP/report.md
- Perturbations: ./outputs/PROJECT_NAME_TIMESTAMP/TEST_NAME_perturbations.csv
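To locate the newest run programmatically, a small sketch like the one below (ours, not part of JRH) finds the most recently modified output directory and lists the report and perturbation files it contains:

```python
from pathlib import Path

# Find the most recently modified run directory under ./outputs/.
runs = [p for p in Path("./outputs").iterdir() if p.is_dir()]
if not runs:
    raise SystemExit("No runs found under ./outputs/")
latest = max(runs, key=lambda p: p.stat().st_mtime)

print(f"Latest run: {latest}")
print(f"Report: {latest / 'report.md'}")
for csv_path in sorted(latest.glob("*_perturbations.csv")):
    print(f"Perturbations: {csv_path}")
```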
To run all the unit tests in the project, use the following command from the project root:
pytest

We use ruff as our linter. Before committing code, run these commands and fix any linter issues:
ruff format
ruff check --fix
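If you prefer to run the whole pre-commit routine in one go, a small helper like the following (our own convenience sketch, not part of the repo) chains the commands above and stops at the first failure:

```python
import subprocess

# Run the formatter, the linter with autofix, and the test suite in order;
# check=True raises an error as soon as one of them fails.
for cmd in (["ruff", "format"], ["ruff", "check", "--fix"], ["pytest"]):
    print(f"$ {' '.join(cmd)}")
    subprocess.run(cmd, check=True)
```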