Human evaluation platform for multilingual RAG chatbot responses. Supports Legal, NLP/AI, and Web Fallback domains with structured per-domain criteria.
- Admin Panel to manage Q&A samples directly from the frontend (/admin).
- Syntax Highlighting for chatbot answers containing code blocks.
- Docker Compose support for easy local setup.
The easiest way to run both backend and frontend locally with persistent data:
# Start the services in the background
docker-compose up -d
# Check logs
docker-compose logs -f
- Frontend: http://localhost:3000
- Backend API: http://localhost:8000
- Backend Docs: http://localhost:8000/docs
- Admin Panel: http://localhost:3000/admin (Default password: rag_eval_admin_2026)
To stop the services:
docker-compose down
Data is stored persistently in the Docker Compose managed volume eval-data.
If you prefer to run services manually:
cd backend
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
# Create .env from example and configure
cp .env.example .env
# Important: Set your RESEND_API_KEY in .env to enable email notifications
uvicorn main:app --reload --port 8000
cd frontend
npm install
npm run dev
Open: http://localhost:3000
- Push the full repo to GitHub
- Go to https://render.com → New → Web Service
- Connect your repo, set Root Directory to backend
- Build command: pip install -r requirements.txt
- Start command: uvicorn main:app --host 0.0.0.0 --port $PORT
- Add environment variables:
ADMIN_EMAIL = your@email.com
ADMIN_PASSWORD = strong_password_here
RESEND_API_KEY = re_xxxxxxxxxxxx
FROM_EMAIL = noreply@rag-eval.resend.dev
CORS_ORIGINS = https://your-app.vercel.app,http://localhost:3000
EVALUATIONS_FILE = evaluations.json
SAMPLES_FILE = samples.json
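As a rough illustration of how the variables above could be consumed, here is a minimal Python sketch. It is not the backend's actual code: the `Settings` class, `parse_origins`, `load_settings`, and the fallback defaults are all assumptions made for the example. Only the variable names come from the list above.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the environment variables listed above."""
    admin_email: str
    admin_password: str
    resend_api_key: Optional[str]
    from_email: str
    cors_origins: List[str]
    evaluations_file: str
    samples_file: str


def parse_origins(raw: str) -> List[str]:
    """Split a comma-separated CORS_ORIGINS value, dropping blanks and spaces."""
    return [origin.strip() for origin in raw.split(",") if origin.strip()]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping; the defaults here are assumptions."""
    return Settings(
        admin_email=env["ADMIN_EMAIL"],
        admin_password=env["ADMIN_PASSWORD"],
        # An empty string is treated the same as an unset key.
        resend_api_key=env.get("RESEND_API_KEY") or None,
        from_email=env.get("FROM_EMAIL", "noreply@rag-eval.resend.dev"),
        cors_origins=parse_origins(env.get("CORS_ORIGINS", "")),
        evaluations_file=env.get("EVALUATIONS_FILE", "evaluations.json"),
        samples_file=env.get("SAMPLES_FILE", "samples.json"),
    )
```

Note that CORS_ORIGINS is a single comma-separated string, so every origin that should reach the API (the Vercel URL and any local dev URL) has to be listed in that one value.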
- Push to GitHub, then import the project in Vercel.
- OR, use the Vercel CLI:
cd frontend
npm install -g vercel
vercel deploy
In the Vercel dashboard → Settings → Environment Variables:
NEXT_PUBLIC_API_URL = https://your-backend.onrender.com
Redeploy after setting the variable.
- Sign up at https://resend.com
- Go to API Keys → Create API Key
- Copy the key → set as RESEND_API_KEY in the backend .env
- (Optional) Add your domain in Resend → update FROM_EMAIL
If RESEND_API_KEY is not set, emails are logged to console instead (good for local dev).
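The fallback described above can be pictured roughly like this. It is a sketch under assumptions, not the project's mailer: `notify` and the injected `send` callable are hypothetical names, and the real call to Resend is deliberately left out.

```python
from typing import Callable, Mapping


def notify(
    env: Mapping[str, str],
    to: str,
    subject: str,
    body: str,
    send: Callable[[str, str, str, str], None],
) -> bool:
    """Send an email if RESEND_API_KEY is set, otherwise print it.

    `send` is a hypothetical callable taking (api_key, to, subject, body);
    it stands in for whatever code actually talks to Resend.
    Returns True if `send` was called, False if the email was only logged.
    """
    api_key = env.get("RESEND_API_KEY")
    if not api_key:
        # Local-dev path: no key configured, so print the message instead.
        print(f"[email not sent] to={to} subject={subject}\n{body}")
        return False
    send(api_key, to, subject, body)
    return True
```

Injecting `send` keeps the logging path testable without network access or a real key.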
Navigate to http://your-app.com/admin to access the Admin Panel.
Login with your ADMIN_PASSWORD to:
- Create new evaluation samples
- Edit existing queries, sources, and answers
- Delete outdated samples
- Filter samples by domain
Export evaluations:
GET /api/evaluations/export?password=YOUR_ADMIN_PASSWORD
This returns a CSV containing all fields: evaluator info, scores per criterion, notes, timestamp.
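A small Python sketch for calling the export endpoint from a script. The helper names `export_url` and `parse_export` are made up for this example, and the column names in the returned CSV are not documented here, so the parser simply returns whatever headers the server sends.

```python
import csv
import io
from typing import Dict, List
from urllib.parse import urlencode


def export_url(base_url: str, admin_password: str) -> str:
    """Build the export URL, URL-encoding the password for the query string."""
    query = urlencode({"password": admin_password})
    return f"{base_url.rstrip('/')}/api/evaluations/export?{query}"


def parse_export(csv_text: str) -> List[Dict[str, str]]:
    """Turn the CSV body into one dict per evaluation row, keyed by header."""
    return list(csv.DictReader(io.StringIO(csv_text)))


# Usage (needs a running backend, so it is left as a comment):
#   from urllib.request import urlopen
#   with urlopen(export_url("http://localhost:8000", "YOUR_ADMIN_PASSWORD")) as resp:
#       rows = parse_export(resp.read().decode("utf-8"))
```

Encoding the password matters if it contains characters such as `&` or spaces, which would otherwise break the query string.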
Manage Samples:
GET /api/admin/samples
POST /api/admin/samples
PUT /api/admin/samples/{id}
DELETE /api/admin/samples/{id}
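The four routes above can be wrapped in one small helper. This is an illustrative sketch only: `sample_request` is a hypothetical name, the payload fields are whatever the backend expects, and authentication is left out because this README does not say how the admin routes receive the password.

```python
import json
from typing import Any, Mapping, Optional
from urllib.request import Request


def sample_request(
    base_url: str,
    method: str,
    sample_id: Optional[str] = None,
    payload: Optional[Mapping[str, Any]] = None,
) -> Request:
    """Build (but do not send) a request for the sample-management routes."""
    path = "/api/admin/samples"
    if method in ("PUT", "DELETE"):
        # These two routes address a single sample: /api/admin/samples/{id}
        if sample_id is None:
            raise ValueError(f"{method} needs a sample id")
        path += f"/{sample_id}"
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return Request(base_url.rstrip("/") + path, data=data, headers=headers, method=method)


# Usage (needs a running backend, so it is left as a comment):
#   from urllib.request import urlopen
#   with urlopen(sample_request("http://localhost:8000", "GET")) as resp:
#       samples = json.load(resp)
```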
Legal:
- Legal Grounding: Real article numbers / law references cited.
- Factual Precision: Exact values (durations, ages, penalties) included.
- No Hallucination: No invented quotes, fake journals, wrong law IDs.
- Scope Discipline: No unsolicited compliance advice added.
- Corpus Gap Honesty: Admits missing law rather than inventing obligations.
NLP/AI:
- Technical Accuracy: Correct concepts, model names, architectures.
- Source Grounding: Traceable to a paper/doc in corpus.
- Arabic NLP Specificity: Addresses Arabic-specific challenges when relevant.
- Depth vs Padding: Mechanisms explained, no "En résumé" repetition.
- No Fabrication: No invented APIs, fake benchmarks, hallucinated code.
Web Fallback:
- Source Credibility: Real, verifiable URLs, not merely plausible-sounding ones.
- Fallback Justification: Web search was actually needed.
- Technical Validity: Code provided uses real API signatures.
Edit frontend/lib/criteria.ts:
export const CRITERIA: Record<Domain, Criterion[]> = {
legal: [
// add new criterion here
{
key: "my_new_criterion",
label: "My Criterion Label",
description: "What this measures...",
},
],
// ...
};
And add the corresponding key to your scoring expectations in backend/models.py if you want server-side validation.
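For the server-side half, a check along these lines would reject submissions that are out of sync with the frontend criteria. This is a sketch only and does not reflect the real contents of backend/models.py: `EXPECTED_KEYS` and `validate_scores` are hypothetical, and every criterion key except `my_new_criterion` is a guess derived from the criterion labels.

```python
from typing import Any, Dict, List, Mapping, Set

# Hypothetical mapping of domain -> criterion keys the server accepts.
# Keys other than "my_new_criterion" are guessed from the labels.
EXPECTED_KEYS: Dict[str, Set[str]] = {
    "legal": {
        "legal_grounding",
        "factual_precision",
        "no_hallucination",
        "scope_discipline",
        "corpus_gap_honesty",
        "my_new_criterion",  # the key added in frontend/lib/criteria.ts
    },
}


def validate_scores(domain: str, scores: Mapping[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the keys match."""
    expected = EXPECTED_KEYS.get(domain)
    if expected is None:
        return [f"unknown domain: {domain}"]
    problems: List[str] = []
    missing = sorted(expected - set(scores))
    extra = sorted(set(scores) - expected)
    if missing:
        problems.append("missing criteria: " + ", ".join(missing))
    if extra:
        problems.append("unexpected criteria: " + ", ".join(extra))
    return problems
```

The check covers key names only. The allowed score range is not stated in this README, so it is left out rather than guessed.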