berkal749/rag-eva


RAG Evaluation Platform

Human evaluation platform for multilingual RAG chatbot responses. Supports Legal, NLP/AI, and Web Fallback domains with structured per-domain criteria.


What's New

  • Admin Panel to manage Q&A samples directly from the frontend (/admin).
  • Syntax Highlighting for chatbot answers containing code blocks.
  • Docker Compose support for easy local setup.

Local Development (Docker Compose - Recommended)

The easiest way to run both backend and frontend locally with persistent data:

# Start the services in the background
docker-compose up -d

# Check logs
docker-compose logs -f

To stop the services:

docker-compose down

Data persists in the eval-data volume managed by Docker Compose, so it survives stopping and restarting the services.


Local Development (Manual Setup)

If you prefer to run services manually:

1. Backend

cd backend
python -m venv .venv
source .venv/bin/activate       # Windows: .venv\Scripts\activate
pip install -r requirements.txt

# Create .env from example and configure
cp .env.example .env
# Important: Set your RESEND_API_KEY in .env to enable email notifications

uvicorn main:app --reload --port 8000

2. Frontend

cd frontend
npm install

npm run dev

Open: http://localhost:3000


Deploy

Backend → Render (recommended, free tier)

  1. Push the full repo to GitHub
  2. Go to https://render.com → New → Web Service
  3. Connect your repo, set Root Directory to backend
  4. Build command: pip install -r requirements.txt
  5. Start command: uvicorn main:app --host 0.0.0.0 --port $PORT
  6. Add environment variables:
ADMIN_EMAIL        = your@email.com
ADMIN_PASSWORD     = strong_password_here
RESEND_API_KEY     = re_xxxxxxxxxxxx
FROM_EMAIL         = noreply@rag-eval.resend.dev
CORS_ORIGINS       = https://your-app.vercel.app,http://localhost:3000
EVALUATIONS_FILE   = evaluations.json
SAMPLES_FILE       = samples.json
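
As a rough illustration of how these variables might be consumed, the sketch below reads them from an environment mapping and splits CORS_ORIGINS on commas. The function name load_settings, the returned dictionary keys, and the fallback file names are assumptions made for this example, not the backend's actual code.

```python
import os


def load_settings(env=None):
    """Read backend settings from an environment mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    # CORS_ORIGINS is a comma-separated list; drop blanks and surrounding spaces.
    origins = [o.strip() for o in env.get("CORS_ORIGINS", "").split(",") if o.strip()]
    return {
        "cors_origins": origins,
        # Fallback file names below are assumed, taken from the example values above.
        "evaluations_file": env.get("EVALUATIONS_FILE", "evaluations.json"),
        "samples_file": env.get("SAMPLES_FILE", "samples.json"),
        # Treat an empty key the same as a missing one.
        "resend_api_key": env.get("RESEND_API_KEY") or None,
    }
```

Passing a plain dictionary instead of relying on os.environ keeps the parsing easy to check locally before deploying.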

Frontend → Vercel (recommended, free)

  1. Push to GitHub, then import the project in Vercel.
  2. Or use the Vercel CLI:
cd frontend
npm install -g vercel
vercel deploy

In the Vercel dashboard, go to Settings → Environment Variables and set:

NEXT_PUBLIC_API_URL = https://your-backend.onrender.com

Redeploy after setting the variable.


Email Setup (Resend, free tier: 100 emails/day)

  1. Sign up at https://resend.com
  2. Go to API Keys → Create API Key
  3. Copy the key → set as RESEND_API_KEY in backend .env
  4. (Optional) Add your domain in Resend → update FROM_EMAIL

If RESEND_API_KEY is not set, emails are logged to console instead (good for local dev).
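
The console fallback described above could look roughly like the following sketch. The names deliver_email and send are hypothetical: send stands in for whatever function wraps the real email provider, and nothing here is taken from the backend's actual implementation.

```python
import logging
import os

logger = logging.getLogger("rag_eval.email")


def deliver_email(to, subject, body, send=None, api_key=None):
    """Send through `send` when an API key is configured; otherwise log the message.

    `send` is a hypothetical callable taking (api_key, to, subject, body).
    """
    if api_key is None:
        api_key = os.environ.get("RESEND_API_KEY")
    if not api_key or send is None:
        # No key (or no sender wired in): log instead of sending.
        logger.info("Email to %s | %s\n%s", to, subject, body)
        return "logged"
    send(api_key, to, subject, body)
    return "sent"
```

Returning "logged" or "sent" makes it simple to confirm in local development which path was taken.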


Admin Panel & Sample Management

Navigate to http://your-app.com/admin to access the Admin Panel. Log in with your ADMIN_PASSWORD to:

  • Create new evaluation samples
  • Edit existing queries, sources, and answers
  • Delete outdated samples
  • Filter samples by domain

Admin API Endpoints

Export evaluations:

GET /api/evaluations/export?password=YOUR_ADMIN_PASSWORD

This returns a CSV containing all fields: evaluator info, scores per criterion, notes, timestamp.
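
To make the shape of such an export concrete, here is a minimal sketch that flattens evaluation records into CSV text with one column per criterion. The record layout (evaluator_name, evaluator_email, domain, a nested scores mapping, notes, timestamp) is an assumption for illustration; the real column names may differ.

```python
import csv
import io


def evaluations_to_csv(evaluations):
    """Flatten evaluation records into CSV text, one column per criterion score."""
    # Collect every criterion key seen so records from different domains share one header.
    score_keys = sorted({k for ev in evaluations for k in ev.get("scores", {})})
    fields = ["evaluator_name", "evaluator_email", "domain", *score_keys, "notes", "timestamp"]
    buf = io.StringIO()
    # restval fills criteria a record lacks; extrasaction ignores unexpected keys.
    writer = csv.DictWriter(buf, fieldnames=fields, restval="", extrasaction="ignore")
    writer.writeheader()
    for ev in evaluations:
        row = {k: v for k, v in ev.items() if k != "scores"}
        row.update(ev.get("scores", {}))
        writer.writerow(row)
    return buf.getvalue()
```

Because domains have different criteria, a record leaves blank any score column that belongs to another domain.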

Manage Samples:

GET    /api/admin/samples
POST   /api/admin/samples
PUT    /api/admin/samples/{id}
DELETE /api/admin/samples/{id}
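
The four routes above map onto a small request builder like the sketch below. It only constructs the request and does not send it. The helper name sample_request and the payload fields are hypothetical, and authentication is left out because this README does not say how the admin password is passed to these endpoints.

```python
import json
from urllib.request import Request


def sample_request(base_url, method, sample_id=None, payload=None):
    """Build (but do not send) a request for one of the four sample endpoints."""
    # Collection routes have no id; PUT and DELETE target /api/admin/samples/{id}.
    path = "/api/admin/samples" + (f"/{sample_id}" if sample_id is not None else "")
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return Request(base_url.rstrip("/") + path, data=data, headers=headers, method=method)


# Example: an update for sample 3 (field name "query" is an assumption).
req = sample_request("http://localhost:8000", "PUT", sample_id=3, payload={"query": "..."})
```

The resulting object can be passed to urllib.request.urlopen once the required credentials have been added.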

Evaluation Criteria

⚖ Legal (5 criteria)

  1. Legal Grounding: Real article numbers / law references cited.
  2. Factual Precision: Exact values (durations, ages, penalties) included.
  3. No Hallucination: No invented quotes, fake journals, wrong law IDs.
  4. Scope Discipline: No unsolicited compliance advice added.
  5. Corpus Gap Honesty: Admits missing law rather than inventing obligations.

🧠 NLP/AI (5 criteria)

  1. Technical Accuracy: Correct concepts, model names, architectures.
  2. Source Grounding: Traceable to a paper/doc in corpus.
  3. Arabic NLP Specificity: Addresses Arabic-specific challenges when relevant.
  4. Depth vs Padding: Mechanisms explained, no "En résumé" repetition.
  5. No Fabrication: No invented APIs, fake benchmarks, hallucinated code.

🌐 Web Fallback (3 criteria)

  1. Source Credibility: Real, verifiable URLs, not merely plausible-sounding ones.
  2. Fallback Justification: Web search was actually needed.
  3. Technical Validity: Code provided uses real API signatures.

Adding Custom Criteria

Edit frontend/lib/criteria.ts:

export const CRITERIA: Record<Domain, Criterion[]> = {
  legal: [
    // add new criterion here
    {
      key: "my_new_criterion",
      label: "My Criterion Label",
      description: "What this measures...",
    },
  ],
  // ...
};

If you want server-side validation, also add the corresponding key to the scoring expectations in backend/models.py.
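
A server-side check of this kind might be sketched as follows. The ALLOWED_KEYS table and the check_scores function are hypothetical and do not reflect the actual contents of backend/models.py; the key legal_grounding is likewise an assumed name.

```python
# Hypothetical mirror of the keys defined in frontend/lib/criteria.ts.
ALLOWED_KEYS = {
    "legal": {"legal_grounding", "my_new_criterion"},
}


def check_scores(domain, scores):
    """Return a list of problems; an empty list means the submission is valid."""
    allowed = ALLOWED_KEYS.get(domain)
    if allowed is None:
        return [f"unknown domain: {domain}"]
    # Report keys the server does not know about, then keys that were left out.
    problems = [f"unknown criterion: {k}" for k in sorted(set(scores) - allowed)]
    problems += [f"missing criterion: {k}" for k in sorted(allowed - set(scores))]
    return problems
```

Keeping the allowed keys in one table means a criterion added on the frontend only needs one matching entry on the backend.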
