
Toxicity Detection validators #80

Merged
rkritika1508 merged 15 commits into main from feat/toxicity-hub-validators
Apr 10, 2026

Conversation

Collaborator

@rkritika1508 rkritika1508 commented Apr 1, 2026

Summary

Target issue is #81.
As we expand safety coverage in our validation pipeline, we need stronger and more layered defenses against harmful content. The current system relies primarily on rule-based and lexical validators, which are effective but have limitations:

  • They may miss nuanced or context-dependent harmful content (e.g., indirect violence, coded language).
  • They may overfit to keyword matching, leading to false positives or missed cases.
  • They lack a model-level understanding of intent and semantics.

The validators introduced in this PR aim to mitigate the following categories of harm:

  • Violence / Hate Speech – abusive, threatening, or discriminatory content
  • Sexual Content – explicit or inappropriate material
  • Criminal Planning / Weapons – instructions or facilitation of illegal activity
  • Self-harm Encouragement – harmful mental health content
  • Profanity / Toxic Language – offensive or inappropriate language

This PR introduces two complementary validators to address these gaps:

LlamaGuard 7B
Uses the Meta AI LlamaGuard-7B model via Guardrails Hub to classify text as safe or unsafe.
Evaluates content against configurable safety policies:

  • Violence / hate
  • Sexual content
  • Criminal planning
  • Weapons
  • Illegal drugs
  • Self-harm encouragement

Profanity Free

  • Detects profanity using a linear SVM model (alt-profanity-check)
  • Fails validation if profane content is detected
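The two validators can be combined in a single run request. A hedged sketch of such a payload follows; the field names (`input`, `validators`) are assumptions based on the documented config options, and the actual schema lives in backend/app/schemas/guardrail_config.py:

```python
# Hypothetical request payload combining both validators; field names are
# illustrative, not the repository's confirmed schema.
payload = {
    "input": "User text to screen before it reaches the model.",
    "validators": [
        {
            "type": "llamaguard_7b",
            "policies": ["no_violence_hate", "no_encourage_self_harm"],
            "on_fail": "exception",  # raise on unsafe content
        },
        {
            "type": "profanity_free",
            "on_fail": "fix",  # strip/repair profane content instead of failing
        },
    ],
}
```

Layering a model-based check (LlamaGuard) with a lexical one (profanity_free) covers both nuanced intent and surface-level language.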

Additional Changes

  • Added unit tests for each validator
  • Added integration tests for validator combinations
  • Updated API documentation to include:
    • New validator types
    • Configuration options
    • Policy controls

Checklist

Before submitting a pull request, please ensure that you complete these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested the changes.
  • If you've fixed a bug or added code, ensure it is covered by test cases.

Notes

Add here any other information the reviewer may need.

Summary by CodeRabbit

  • New Features

    • Added LlamaGuard 7B validator (configurable human-readable policies) and Profanity Free validator (automatic profanity handling with configurable on-fail behavior).
  • Bug Fixes

    • Validator logging no longer attempts to persist when no validation result exists.
  • Documentation

    • API and runtime docs updated to list the new validators, policy options, default behaviors, and configuration examples.
  • Tests

    • Added unit and integration tests covering the new validators and on-fail behaviors.

@coderabbitai

coderabbitai bot commented Apr 1, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds two Guardrails Hub validators—llamaguard_7b and profanity_free—including config classes, schema and enum updates, manifest entries, tests, docs, a runtime dependency, and a small API change to skip logging when a validator returns no result.

Changes

  • Documentation — backend/app/api/API_USAGE.md, backend/app/core/validators/README.md, backend/app/api/docs/guardrails/run_guardrails.md: Documented new validators (llamaguard_7b, profanity_free), config fields (policies, on_fail), default strategies, and run-time behavior.
  • New Validator Configs — backend/app/core/validators/config/llamaguard_7b_safety_validator_config.py, backend/app/core/validators/config/profanity_free_safety_validator_config.py: Added LlamaGuard7BSafetyValidatorConfig (with policy name→code mapping and resolver) and ProfanityFreeSafetyValidatorConfig; both implement build() to instantiate Hub validators.
  • Schema & Core — backend/app/schemas/guardrail_config.py, backend/app/core/enum.py, backend/app/core/validators/validators.json: Extended the ValidatorConfigItem discriminator union; added ValidatorType members (llm_critic, llamaguard_7b, profanity_free); appended the new validators to the manifest.
  • API route — backend/app/api/routes/guardrails.py: add_validator_logs() now skips creating/persisting logs when log.validation_result is None.
  • Tests — backend/app/tests/test_toxicity_hub_validators.py, backend/app/tests/test_guardrails_api_integration.py: Added unit tests for the new config classes (Pydantic validation, build behavior, policy mapping, on_fail semantics) and integration tests covering single and combined validator flows and on_fail variants.
  • Dependencies — backend/pyproject.toml: Added huggingface-hub>=1.5.0,<2.0 to runtime dependencies.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant API as API Server
    participant ValidatorManager as Validator Config / Builder
    participant GuardrailsHub as Guardrails Hub (remote/local)
    participant DB as DB / Logs

    Client->>API: POST /api/v1/guardrails/run (input + configs)
    API->>ValidatorManager: Parse configs (discriminator by type)
    ValidatorManager->>ValidatorManager: Build validator instances (LlamaGuard7B, ProfanityFree)
    API->>GuardrailsHub: Execute validator(s) (pass policies/on_fail)
    GuardrailsHub-->>API: Validation result (PassResult / FailResult / None)
    alt validation_result is None
        API->>DB: (skip) no log created
    else validation_result present
        API->>DB: Persist ValidatorLog (result, on_fail)
    end
    API-->>Client: Response (success, output, metadata)
```
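The skip/persist branch in the diagram corresponds to the small API change in add_validator_logs(): when a validator returns no result, no log row is created. A minimal sketch follows; the class fields and the in-memory "db" stand-in are assumptions for illustration, not the repository's actual persistence code:

```python
# Illustrative sketch of the logging guard: logs without a validation
# result are skipped rather than persisted. Names are hypothetical.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ValidatorLog:
    validator: str
    validation_result: Optional[str]  # e.g. "pass", "fail", or None


def add_validator_logs(logs, db):
    """Persist only logs that carry a validation result."""
    persisted = []
    for log in logs:
        if log.validation_result is None:
            continue  # skip: nothing to record for this validator
        db.append(log)  # stand-in for a real DB session add/commit
        persisted.append(log)
    return persisted
```

Guarding before persistence avoids writing empty rows when a validator short-circuits without producing a PassResult or FailResult.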

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • nishika26
  • AkhileshNegi

Poem

🐰 I hopped through code with tiny paws,
Two new guards to mind the laws,
Llama checks and policies arrayed,
Profanity trimmed, no mess made,
Cheers from me — a tidy deploy hooray! 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 9.09%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title 'Toxicity Detection validators' directly matches the PR's main objective of adding toxicity/profanity detection validators (LlamaGuard7B and ProfanityFree) to the guardrails pipeline. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (4)
backend/app/core/validators/README.md (1)

377-377: Minor: Consider hyphenating "Profanity-Free" for consistency.

The section title uses "Profanity Free" but compound adjectives typically use hyphens.

📝 Suggested fix
-### 9) Profanity Free Validator (`profanity_free`)
+### 9) Profanity-Free Validator (`profanity_free`)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/core/validators/README.md` at line 377, The section title
"Profanity Free Validator (`profanity_free`)" should use a hyphen for the
compound adjective; update the heading to "Profanity-Free Validator
(`profanity_free`)" so it reads consistently with other compound-adjective
headings and matches the validator identifier `profanity_free`.
backend/app/core/validators/config/llamaguard_7b_safety_validator_config.py (1)

1-1: Consider using built-in list instead of typing.List.

Python 3.10+ (the project's minimum version) supports generic type hints directly on built-in types. Using list[str] instead of List[str] is the modern approach and aligns with Ruff's UP035 rule.

♻️ Suggested refactor
-from typing import List, Literal, Optional
+from typing import Literal, Optional

And on line 10:

-    policies: Optional[List[str]] = None
+    policies: Optional[list[str]] = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/core/validators/config/llamaguard_7b_safety_validator_config.py`
at line 1, Replace usages of typing.List with the built-in generic type: remove
List from the import list in the top import line and change all type annotations
that use List (e.g., List[str], List[int], etc.) to use list[...] (e.g.,
list[str]) in llamaguard_7b_safety_validator_config.py; ensure you update the
import statement to keep only Literal and Optional (or any other still-used
typing names) and scan the file for any remaining occurrences of the symbol List
to convert them to the built-in list form.
backend/app/tests/test_toxicity_hub_validators.py (2)

37-148: Consider parameterizing repeated assertion patterns to reduce drift.

There is substantial duplication across classes (default/custom build args, on_fail mapping, invalid-on_fail, wrong type, extra fields). Converting repeated cases to pytest.mark.parametrize + shared helper would reduce maintenance cost and make future validator additions safer.

Refactor sketch
+@pytest.mark.parametrize(
+    "config_cls,type_value,patch_target,kwargs,expected_kwargs",
+    [
+        (NSFWTextSafetyValidatorConfig, "nsfw_text", _NSFW_PATCH, {}, {"threshold": 0.8, "validation_method": "sentence", "device": "cpu", "model_name": "michellejieli/NSFW_text_classifier"}),
+        (ToxicLanguageSafetyValidatorConfig, "toxic_language", _TOXIC_PATCH, {}, {"threshold": 0.5, "validation_method": "sentence", "device": "cpu", "model_name": "unbiased-small"}),
+    ],
+)
+def test_build_forwards_expected_kwargs(config_cls, type_value, patch_target, kwargs, expected_kwargs):
+    config = config_cls(type=type_value, **kwargs)
+    with patch(patch_target) as mock_validator:
+        config.build()
+    _, actual = mock_validator.call_args
+    for k, v in expected_kwargs.items():
+        assert actual[k] == v

Also applies to: 155-277, 284-362, 369-504

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/test_toxicity_hub_validators.py` around lines 37 - 148,
Refactor the repeated tests for LlamaGuard7BSafetyValidatorConfig by converting
duplicate assertion patterns into parametrized tests: create
pytest.mark.parametrize cases for policies (None, [], ["O1"], all_policies), for
on_fail mapping ("fix"/"exception"/"rephrase"/invalid) and for schema validation
(wrong type literal, extra fields), and replace the repeated with
patch(_LLAMAGUARD_PATCH) calls with a small helper that builds the config and
returns mock_validator.call_args; update assertions to use that helper and
reference LlamaGuard7BSafetyValidatorConfig, _LLAMAGUARD_PATCH, OnFailAction,
and pytest.mark.parametrize so each behavior (default/custom policies, on_fail
resolution, invalid on_fail, wrong type, extra fields) is covered by
parametrized cases instead of duplicated test methods.

187-203: Add explicit out-of-range threshold tests (-0.01 / 1.01).

You currently validate numeric type and boundary inclusion (0.0, 1.0), but there is no assertion that out-of-range values are rejected. If thresholds are intended to be constrained to [0, 1], add negative and above-one cases to lock that contract.

Also applies to: 401-421, 274-277, 502-504

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/test_toxicity_hub_validators.py` around lines 187 - 203,
Add tests that assert out-of-range thresholds are rejected: create two new tests
(e.g., test_build_with_threshold_below_zero and
test_build_with_threshold_above_one) that instantiate
NSFWTextSafetyValidatorConfig with threshold=-0.01 and threshold=1.01
respectively, patch _NSFW_PATCH as in the existing tests, and assert that
calling config.build() raises a validation exception (use ValueError or the
specific validation exception your code uses). Apply the same pattern to the
other validator test blocks mentioned (the ranges around lines 274-277, 401-421,
and 502-504) so each validator verifies thresholds outside [0,1] are rejected.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9e3faf93-8398-459b-a53a-fa511b15fc40

📥 Commits

Reviewing files that changed from the base of the PR and between 791820f and 650369c.

📒 Files selected for processing (9)
  • backend/app/api/API_USAGE.md
  • backend/app/core/validators/README.md
  • backend/app/core/validators/config/llamaguard_7b_safety_validator_config.py
  • backend/app/core/validators/config/nsfw_text_safety_validator_config.py
  • backend/app/core/validators/config/profanity_free_safety_validator_config.py
  • backend/app/core/validators/config/toxic_language_safety_validator_config.py
  • backend/app/core/validators/validators.json
  • backend/app/schemas/guardrail_config.py
  • backend/app/tests/test_toxicity_hub_validators.py

@dennyabrain dennyabrain marked this pull request as draft April 1, 2026 04:31
@rkritika1508 rkritika1508 marked this pull request as ready for review April 1, 2026 04:32
@rkritika1508 rkritika1508 marked this pull request as draft April 1, 2026 04:48
@rkritika1508 rkritika1508 marked this pull request as ready for review April 7, 2026 09:03
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/core/validators/README.md`:
- Around line 399-430: Update the README wording for the Profanity Free
Validator (the section titled "Profanity Free Validator" referring to
profanity_free_safety_validator_config.py and hub://guardrails/profanity_free)
to use hyphenated compound adjectives where appropriate: change phrases like
"Profanity Free" to "Profanity-Free", "model based" to "model-based", and
"matching based" to "matching-based" (and apply the same hyphenation to similar
compound forms such as "first-pass" if inconsistent) so the grammar is tightened
throughout that validator's documentation.

In `@backend/app/tests/test_guardrails_api_integration.py`:
- Around line 345-364: The docstrings for the tests are mislabelled against the
validator policy mapping: update the docstring in
test_input_guardrails_with_llamaguard_7b_geography_policy (the first test) and
the docstring in test_input_guardrails_with_llamaguard_7b_violence_policy (the
next test) so the policy identifiers (O2 vs O3) correctly describe the active
policy in each test and match the validator guide; locate the two functions by
their names and swap or rewrite the inline descriptions so the geography test
references the geography policy and the violence/sex-crimes test references the
correct O2/O3 label.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 96811ae3-73c9-4853-b1f7-191d9ac0af0c

📥 Commits

Reviewing files that changed from the base of the PR and between 949647d and 141e5fc.

📒 Files selected for processing (8)
  • backend/app/api/API_USAGE.md
  • backend/app/api/routes/guardrails.py
  • backend/app/core/enum.py
  • backend/app/core/validators/README.md
  • backend/app/core/validators/validators.json
  • backend/app/schemas/guardrail_config.py
  • backend/app/tests/test_guardrails_api_integration.py
  • backend/app/tests/test_toxicity_hub_validators.py
✅ Files skipped from review due to trivial changes (1)
  • backend/app/core/enum.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • backend/app/schemas/guardrail_config.py
  • backend/app/core/validators/validators.json
  • backend/app/api/API_USAGE.md

@rkritika1508 rkritika1508 self-assigned this Apr 7, 2026
@rkritika1508 rkritika1508 linked an issue Apr 7, 2026 that may be closed by this pull request
@rkritika1508 rkritika1508 added the enhancement (New feature or request) and ready-for-review labels Apr 7, 2026
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
backend/app/api/docs/guardrails/run_guardrails.md (1)

11-20: Add blank line after the table to comply with Markdown formatting standards.

The table at lines 13-20 should be followed by a blank line before the next content (line 21).

📝 Proposed fix
   | `no_illegal_drugs`          | No illegal drugs                 |
   | `no_encourage_self_harm`    | No encouragement of self-harm    |
+
 - `rephrase_needed=true` means the system could not safely auto-fix the input/output and wants the user to retry with a rephrased query.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/api/docs/guardrails/run_guardrails.md` around lines 11 - 20, Add
a single blank line after the policies markdown table in the llamaguard_7b
section so the table (the rows starting with `no_violence_hate` through
`no_encourage_self_harm`) is followed by an empty line before the next
paragraph; update the run_guardrails.md content to insert one newline after the
table to comply with Markdown formatting standards.
backend/app/core/validators/config/llamaguard_7b_safety_validator_config.py (1)

1-1: Use modern list type hint instead of typing.List.

typing.List is deprecated since Python 3.9. Use the built-in list for type hints.

♻️ Proposed fix
-from typing import List, Literal, Optional
+from typing import Literal, Optional

And update the type hints:

-    policies: Optional[List[str]] = None
+    policies: Optional[list[str]] = None

-    def _resolve_policies(self) -> Optional[List[str]]:
+    def _resolve_policies(self) -> Optional[list[str]]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/core/validators/config/llamaguard_7b_safety_validator_config.py`
at line 1, Replace deprecated typing.List with the built-in list: remove List
from the import line in llamaguard_7b_safety_validator_config.py and update any
type annotations that reference List[...] to use list[...]; keep Literal and
Optional imports as-is (or import them from typing if still used) and ensure all
occurrences (e.g., in class attributes, function signatures or return types) are
converted to the modern list[...] form.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/core/validators/config/llamaguard_7b_safety_validator_config.py`:
- Around line 21-32: The _resolve_policies() currently only maps human-readable
names via POLICY_NAME_MAP and rejects raw codes like "O1", so update
_resolve_policies (in llamaguard_7b_safety_validator_config.py) to accept both
forms: for each policy, first check POLICY_NAME_MAP.get(policy.lower()) and if
that returns None, then check if the policy (uppercased) matches a raw policy
code (e.g., "O1".."O6") and if so append the uppercased code unchanged;
otherwise raise the same ValueError. This keeps existing mapping behavior while
allowing tests that pass raw codes to succeed.
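The fix described above — accepting both human-readable names and raw O1–O6 codes — can be sketched as follows. The map contents and function name are hypothetical; the real resolver lives in llamaguard_7b_safety_validator_config.py and may differ:

```python
import re

# Hypothetical name-to-code map; the repository's actual mapping may differ.
POLICY_NAME_MAP = {
    "no_violence_hate": "O1",
    "no_illegal_drugs": "O5",
    "no_encourage_self_harm": "O6",
}

_RAW_CODE = re.compile(r"^O[1-6]$")


def resolve_policy(policy: str) -> str:
    """Accept either a human-readable policy name or a raw O1-O6 code."""
    code = POLICY_NAME_MAP.get(policy.lower())
    if code is not None:
        return code  # mapped from a human-readable name
    upper = policy.upper()
    if _RAW_CODE.match(upper):
        return upper  # raw codes pass through unchanged
    raise ValueError(f"Unknown LlamaGuard policy: {policy!r}")
```

Checking the name map first preserves existing behavior; the raw-code fallback only triggers for inputs the map does not recognize.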


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4662bec1-1cd2-4eb3-84cc-265fb76badcb

📥 Commits

Reviewing files that changed from the base of the PR and between 141e5fc and 74f8a82.

📒 Files selected for processing (4)
  • backend/app/api/docs/guardrails/run_guardrails.md
  • backend/app/core/validators/README.md
  • backend/app/core/validators/config/llamaguard_7b_safety_validator_config.py
  • backend/app/tests/test_guardrails_api_integration.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/app/tests/test_guardrails_api_integration.py

@rkritika1508 rkritika1508 merged commit 60f5067 into main Apr 10, 2026
2 checks passed
@rkritika1508 rkritika1508 deleted the feat/toxicity-hub-validators branch April 10, 2026 11:57
rkritika1508 added a commit that referenced this pull request Apr 10, 2026
Co-authored-by: dennyabrain <denny.george90@gmail.com>
@coderabbitai coderabbitai bot mentioned this pull request Apr 10, 2026
rkritika1508 added a commit that referenced this pull request Apr 10, 2026
Co-authored-by: dennyabrain <denny.george90@gmail.com>

Labels

enhancement (New feature or request), ready-for-review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add toxicity validators from Guardrails Hub

3 participants