
Added evaluation readme #82

Merged
rkritika1508 merged 5 commits into main from feat/evaluation-readme
Apr 10, 2026

Conversation

@rkritika1508
Collaborator

@rkritika1508 rkritika1508 commented Apr 1, 2026

Summary

Target issue is #77.
Explain the motivation for making this change. What existing problem does the pull request solve?
We evaluate each validator differently and use a different dataset for each one. The evaluations folder should therefore have its own markdown file covering the evaluation scripts, the datasets, how to execute the scripts, how to interpret the metrics, etc.

Checklist

Before submitting a pull request, please ensure that you mark these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code that is tested and has test cases.

Notes

Please add here if any other information is required for the reviewer.

Summary by CodeRabbit

  • Documentation
    • Consolidated and expanded evaluation docs: moved detailed evaluation workflow into a dedicated evaluation guide and simplified top-level instructions to link there. Covers offline and end-to-end multi-validator evaluation, prerequisites and dataset placement, how to run validators, generated outputs (predictions and metrics), and guidance for interpreting classification, PII, topic-relevance, and performance metrics.

@rkritika1508 rkritika1508 changed the title added evaluation readme Added evaluation readme Apr 1, 2026
@coderabbitai

coderabbitai bot commented Apr 1, 2026

📝 Walkthrough

Consolidated evaluation docs by removing detailed "Running evaluation tests" from backend/README.md and adding a comprehensive offline evaluation guide at backend/app/evaluation/README.md describing folder layout, prerequisites, validator runs, datasets, outputs, and metrics interpretation.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Backend README change: `backend/README.md` | Removed detailed "Running evaluation tests" instructions and redirected readers to the new evaluation README. |
| Evaluation documentation (new): `backend/app/evaluation/README.md` | Added comprehensive offline evaluation guide: folder layout, local prerequisites, dataset filenames/locations, per-validator `run.py` invocation details, `scripts/run_all_evaluations.sh` usage, ban-list and topic-relevance specifics, multi-validator live-API run, expected `predictions.csv`/`metrics.json` outputs, and metrics interpretation. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes


Suggested labels

enhancement

Suggested reviewers

  • nishika26
  • AkhileshNegi

Poem

🥕📘 I hopped through docs with tidy delight,
Moved tests to their room and set them right.
Validators hum in a neat little row,
Datasets lined up, metrics all aglow.
A rabbit’s applause for documentation bright!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ❓ Inconclusive | The title "Added evaluation readme" is vague and generic, using a minimal descriptor that doesn't specify which evaluation component or the key purpose of the documentation change. | Consider a more specific title like "Add comprehensive evaluation guide for validator suites" or "Document offline evaluation workflow and validator setup" to better convey the scope and purpose of the new evaluation documentation. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@rkritika1508 rkritika1508 self-assigned this Apr 1, 2026
@rkritika1508 rkritika1508 linked an issue Apr 1, 2026 that may be closed by this pull request
@rkritika1508 rkritika1508 added the documentation (Improvements or additions to documentation) and ready-for-review labels Apr 1, 2026
@rkritika1508 rkritika1508 moved this to To Do in Kaapi-dev Apr 1, 2026
@rkritika1508 rkritika1508 moved this from To Do to In Progress in Kaapi-dev Apr 1, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
backend/app/evaluation/README.md (1)

99-103: Standardize run commands to uv run python for consistency.

The guide mixes virtualenv activation guidance with python3 invocations, while earlier it states scripts are run via uv run python. Using one convention avoids interpreter mismatch.

♻️ Suggested doc update

````diff
-```bash
-python3 app/evaluation/<validator_folder>/run.py
-```
+```bash
+uv run python app/evaluation/<validator_folder>/run.py
+```
````

Also applies to: 127-129, 158-159, 186-187, 215-216, 251-252

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/evaluation/README.md` around lines 99 - 103, Replace occurrences
of the direct python3 run command with the standardized uv run python
invocation: change instances like "python3
app/evaluation/<validator_folder>/run.py" to "uv run python
app/evaluation/<validator_folder>/run.py" in the README examples (the block
shown at lines 99-103) and the other listed examples (lines 127-129, 158-159,
186-187, 215-216, 251-252) to ensure consistency with the earlier stated "uv run
python" convention.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/evaluation/README.md`:
- Around line 301-306: The README currently shows running
app/evaluation/multiple_validators/run.py with a plaintext --auth_token which
can leak secrets; update the example to use an environment variable (e.g.,
export AUTH_TOKEN="<your-token>") and pass it into the script as --auth_token
"$AUTH_TOKEN" or describe using a secure prompt/secret manager instead, and
update the README snippet around run.py and the --auth_token usage to recommend
env var injection rather than hardcoding tokens on the command line.
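
The recommendation above can be sketched as a shell snippet. `AUTH_TOKEN` is an assumed variable name and the `run.py` path is quoted from the comment, so treat this as illustrative rather than the project's documented usage:

```shell
# Set the token once in the environment (or have a secret manager inject it)
# instead of typing it on the command line, where it would land in shell history.
export AUTH_TOKEN="example-secret"

# The evaluation run would then reference the variable rather than a literal token:
#   uv run python app/evaluation/multiple_validators/run.py --auth_token "$AUTH_TOKEN"
echo "token length: ${#AUTH_TOKEN}"
```

The same pattern also keeps the token out of shared terminal recordings and CI logs that echo commands.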

---

Nitpick comments:
In `@backend/app/evaluation/README.md`:
- Around line 99-103: Replace occurrences of the direct python3 run command with
the standardized uv run python invocation: change instances like "python3
app/evaluation/<validator_folder>/run.py" to "uv run python
app/evaluation/<validator_folder>/run.py" in the README examples (the block
shown at lines 99-103) and the other listed examples (lines 127-129, 158-159,
186-187, 215-216, 251-252) to ensure consistency with the earlier stated "uv run
python" convention.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 234340a6-da44-465f-80bc-0bcdcf204b9d

📥 Commits

Reviewing files that changed from the base of the PR and between 791820f and deee37b.

📒 Files selected for processing (2)
  • backend/README.md
  • backend/app/evaluation/README.md

### Setup

1. Ensure `GUARDRAILS_API_URL` is set in your `.env` file (see `.env.example`). Optionally set `GUARDRAILS_TIMEOUT_SECONDS` (default: `60`).
2. Ensure the API is running and accessible at the configured URL.
Contributor


API? do you mean the server?

Collaborator Author


The guardrails endpoint
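
As a minimal sketch of the `.env` entries the setup step refers to, assuming placeholder values (only the variable names and the `60` default come from the setup step; the URL is invented for illustration):

```shell
# Placeholder guardrails endpoint; replace with the URL your server actually listens on.
GUARDRAILS_API_URL=http://localhost:8000
# Optional; 60 is the documented default.
GUARDRAILS_TIMEOUT_SECONDS=60
```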


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/evaluation/README.md`:
- Around line 7-52: The README's fenced code blocks in
backend/app/evaluation/README.md are missing language identifiers (triggering
MD040); update each triple-backtick block (e.g., the directory tree block
starting with "backend/app/evaluation/" and the multiple outputs examples like
"outputs/lexical_slur/predictions.csv", "outputs/pii_remover/metrics.json",
"outputs/ban_list/<name>-metrics.json",
"outputs/topic_relevance/<domain>-metrics.json", and the
multi_validator_whatsapp outputs) to include appropriate languages (use text for
file/path listings, json for .json snippets, bash for command examples) so all
shown blocks have a language tag. Ensure every affected block mentioned in the
comment (around lines 127-130, 155-158, 185-188, 214-217, 248-251, 316-318) is
updated.
- Line 312: The documentation currently uses a code span with a trailing space
around the Bearer prefix (`` `Bearer ` ``) which violates MD038; update the
README text that describes the `--auth_token` argument to use a code span
without the trailing space (`` `Bearer` ``) so the phrase reads "without the
`Bearer` prefix" and ensure the `--auth_token` reference remains unchanged.
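
As an illustration of the MD040 fix described above, an opening fence gains a language tag; the paths here are quoted from the comment, and `text` is the tag the comment suggests for path listings:

````markdown
```text
outputs/lexical_slur/predictions.csv
outputs/pii_remover/metrics.json
```
````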

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e9f89636-228e-402b-9d93-f886daa6ce9b

📥 Commits

Reviewing files that changed from the base of the PR and between deee37b and 4855059.

📒 Files selected for processing (1)
  • backend/app/evaluation/README.md

@nishika26 nishika26 moved this from In Progress to In Review in Kaapi-dev Apr 10, 2026
@rkritika1508 rkritika1508 merged commit 7e7fce6 into main Apr 10, 2026
2 checks passed
@rkritika1508 rkritika1508 deleted the feat/evaluation-readme branch April 10, 2026 06:45
@github-project-automation github-project-automation bot moved this from In Review to Closed in Kaapi-dev Apr 10, 2026
rkritika1508 added a commit that referenced this pull request Apr 10, 2026

Labels

documentation (Improvements or additions to documentation), ready-for-review

Projects

Status: Closed

Development

Successfully merging this pull request may close these issues.

Add a separate markdown file for evaluation

3 participants