RAG Evaluation Platform

Human evaluation platform for multilingual RAG chatbot responses. Supports Legal, NLP/AI, and Web Fallback domains with structured per-domain criteria.

What's New

Admin Panel to manage Q&A samples directly from the frontend (/admin).
Syntax Highlighting for chatbot answers containing code blocks.
Docker Compose support for easy local setup.

Local Development (Docker Compose - Recommended)

The easiest way to run both backend and frontend locally with persistent data:

# Start the services in the background
docker-compose up -d

# Check logs
docker-compose logs -f

Frontend: http://localhost:3000
Backend API: http://localhost:8000
Backend Docs: http://localhost:8000/docs
Admin Panel: http://localhost:3000/admin (Default password: rag_eval_admin_2026)

To stop the services:

docker-compose down

Data persists in the eval-data volume managed by Docker Compose.

Local Development (Manual Setup)

If you prefer to run services manually:

1. Backend

cd backend
python -m venv .venv
source .venv/bin/activate       # Windows: .venv\Scripts\activate
pip install -r requirements.txt

# Create .env from example and configure
cp .env.example .env
# Important: Set your RESEND_API_KEY in .env to enable email notifications

uvicorn main:app --reload --port 8000

2. Frontend

cd frontend
npm install

npm run dev

Open: http://localhost:3000

Deploy

Backend → Render (recommended, free tier)

Push the full repo to GitHub
Go to https://render.com → New → Web Service
Connect your repo, set Root Directory to backend
Build command: pip install -r requirements.txt
Start command: uvicorn main:app --host 0.0.0.0 --port $PORT
Add environment variables:

ADMIN_EMAIL        = your@email.com
ADMIN_PASSWORD     = strong_password_here
RESEND_API_KEY     = re_xxxxxxxxxxxx
FROM_EMAIL         = noreply@rag-eval.resend.dev
CORS_ORIGINS       = https://your-app.vercel.app,http://localhost:3000
EVALUATIONS_FILE   = evaluations.json
SAMPLES_FILE       = samples.json
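CORS_ORIGINS above is a comma-separated list. The sketch below shows one way such a value could be split into individual origins. The helper name parse_cors_origins and the local-dev default are assumptions for illustration; the actual backend may read the variable differently.

```python
import os


def parse_cors_origins(raw: str) -> list[str]:
    """Split a comma-separated origins string, trimming whitespace and dropping blanks."""
    return [origin.strip() for origin in raw.split(",") if origin.strip()]


# Hypothetical usage: fall back to the local frontend when the variable is unset.
origins = parse_cors_origins(os.environ.get("CORS_ORIGINS", "http://localhost:3000"))
```

Trimming whitespace means a value written as "https://a.example, http://localhost:3000" still yields two clean origins.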

Frontend → Vercel (recommended, free)

Push to GitHub, then import the project in Vercel.
Alternatively, use the Vercel CLI:

cd frontend
npm install -g vercel
vercel deploy

In Vercel dashboard → Settings → Environment Variables:

NEXT_PUBLIC_API_URL = https://your-backend.onrender.com

Redeploy after setting the variable.

Email Setup (Resend — free 100 emails/day)

Sign up at https://resend.com
Go to API Keys → Create API Key
Copy the key → set as RESEND_API_KEY in backend .env
(Optional) Add your domain in Resend → update FROM_EMAIL

If RESEND_API_KEY is not set, emails are logged to console instead (good for local dev).
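The fallback described above could look roughly like the following. This is a sketch, not the backend's actual code: the function name notify is hypothetical, and the real sender is passed in as a callable so that no Resend client signature is assumed.

```python
import logging
import os
from typing import Callable, Optional

logger = logging.getLogger("rag_eval.email")


def notify(
    subject: str,
    body: str,
    send: Callable[[str, str], None],
    api_key: Optional[str] = None,
) -> str:
    """Send via `send` when an API key is present; otherwise log the message."""
    # An explicit api_key argument wins; otherwise read the environment.
    key = api_key if api_key is not None else os.environ.get("RESEND_API_KEY", "")
    if not key:
        logger.info("Email not sent (no RESEND_API_KEY): %s\n%s", subject, body)
        return "logged"
    send(subject, body)
    return "sent"
```

Returning "logged" or "sent" makes the chosen path easy to check in local development.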

Admin Panel & Sample Management

Navigate to http://your-app.com/admin to access the Admin Panel. Log in with your ADMIN_PASSWORD to:

Create new evaluation samples
Edit existing queries, sources, and answers
Delete outdated samples
Filter samples by domain

Admin API Endpoints

Export evaluations:

GET /api/evaluations/export?password=YOUR_ADMIN_PASSWORD

This returns a CSV file containing all fields: evaluator info, per-criterion scores, notes, and timestamp.
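As an illustration, the export could be downloaded and parsed with Python's standard library. The function names are hypothetical, and because the CSV column names are not listed here, the sketch simply uses whatever header row the file contains.

```python
import csv
import io
import urllib.parse
import urllib.request


def export_url(base_url: str, password: str) -> str:
    """Build the export URL, percent-encoding the password."""
    query = urllib.parse.urlencode({"password": password})
    return f"{base_url.rstrip('/')}/api/evaluations/export?{query}"


def parse_export(csv_text: str) -> list[dict[str, str]]:
    """Parse CSV text into one dict per evaluation row, keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def fetch_export(base_url: str, password: str) -> list[dict[str, str]]:
    """Download and parse the export (requires a running backend)."""
    with urllib.request.urlopen(export_url(base_url, password)) as resp:
        return parse_export(resp.read().decode("utf-8"))
```

For a local setup, fetch_export("http://localhost:8000", "YOUR_ADMIN_PASSWORD") would return the rows as dictionaries.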

Manage Samples:

GET    /api/admin/samples
POST   /api/admin/samples
PUT    /api/admin/samples/{id}
DELETE /api/admin/samples/{id}
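A minimal sketch of building requests for these endpoints with the standard library follows. The helper name sample_request and the JSON payload shape are assumptions. Authentication is left out on purpose, since how the admin password is supplied to these routes is not documented here.

```python
import json
import urllib.request
from typing import Optional


def sample_request(
    base_url: str,
    method: str,
    sample_id: Optional[str] = None,
    payload: Optional[dict] = None,
) -> urllib.request.Request:
    """Build (but do not send) a request for one of the sample endpoints."""
    url = f"{base_url.rstrip('/')}/api/admin/samples"
    if sample_id is not None:
        url = f"{url}/{sample_id}"
    # Only POST and PUT normally carry a body; a JSON body is an assumption.
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    # Authentication is intentionally omitted (mechanism not documented here).
    return urllib.request.Request(url, data=data, headers=headers, method=method)
```

The returned object can be passed to urllib.request.urlopen once the required credentials have been added.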

Evaluation Criteria

⚖ Legal (5 criteria)

Legal Grounding: Real article numbers / law references cited.
Factual Precision: Exact values (durations, ages, penalties) included.
No Hallucination: No invented quotes, fake journals, wrong law IDs.
Scope Discipline: No unsolicited compliance advice added.
Corpus Gap Honesty: Admits missing law rather than inventing obligations.

🧠 NLP/AI (5 criteria)

Technical Accuracy: Correct concepts, model names, architectures.
Source Grounding: Traceable to a paper/doc in corpus.
Arabic NLP Specificity: Addresses Arabic-specific challenges when relevant.
Depth vs Padding: Mechanisms explained, no "En résumé" repetition.
No Fabrication: No invented APIs, fake benchmarks, hallucinated code.

🌐 Web Fallback (3 criteria)

Source Credibility: Real, verifiable URLs, not merely plausible-sounding ones.
Fallback Justification: Web search was actually needed.
Technical Validity: Code provided uses real API signatures.

Adding Custom Criteria

Edit frontend/lib/criteria.ts:

export const CRITERIA: Record<Domain, Criterion[]> = {
  legal: [
    // add new criterion here
    {
      key: "my_new_criterion",
      label: "My Criterion Label",
      description: "What this measures...",
    },
  ],
  // ...
};

Also add the corresponding key to the scoring expectations in backend/models.py if you want server-side validation.
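Server-side validation of criterion keys might be sketched as below. This is an assumption about how such a check could be written, not the contents of backend/models.py: EXPECTED_KEYS and validate_scores are hypothetical names, and the only key shown is the placeholder from the example above.

```python
# Hypothetical mapping; the real keys are defined in frontend/lib/criteria.ts.
EXPECTED_KEYS: dict[str, set[str]] = {
    "legal": {"my_new_criterion"},
}


def validate_scores(domain: str, scores: dict[str, int]) -> list[str]:
    """Return a list of problems; an empty list means the keys match exactly."""
    if domain not in EXPECTED_KEYS:
        return [f"unknown domain: {domain}"]
    expected = EXPECTED_KEYS[domain]
    submitted = set(scores)
    missing = sorted(expected - submitted)
    extra = sorted(submitted - expected)
    return [f"missing: {k}" for k in missing] + [f"unexpected: {k}" for k in extra]
```

Keeping the expected keys in one mapping means a new criterion needs only one extra entry on the backend to stay in sync with the frontend.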
