
Added NSFW text validator #83

Open
rkritika1508 wants to merge 41 commits into main from feat/toxicity-huggingface-model

Conversation


@rkritika1508 rkritika1508 commented Apr 2, 2026

Summary

Target issue is #84
This PR introduces a new validator: nsfw_text, powered by Guardrails AI, with support for custom HuggingFace models.

We configure it to use:
textdetox/xlmr-large-toxicity-classifier

This significantly improves our ability to detect multilingual, code-mixed, and context-aware unsafe content, addressing gaps in the current validator stack.

Our existing validator suite (profanity, slur detection, toxic language, etc.) primarily relies on:

  • rule-based approaches
  • keyword matching
  • lighter ML models

While effective for straightforward cases, they fail in more complex real-world scenarios, such as:

  • multilingual inputs (Hindi + English mix, etc.)
  • paraphrased toxicity (non-explicit abusive phrasing)
  • regional slang / dialects
  • implicit or contextual NSFW content

Existing Validators (Context)

To make this PR self-contained, here’s a quick overview of what we already use and where they fall short:

| Validator | What it does | Limitation |
|---|---|---|
| Profanity Free | Detects explicit swear words | Misses implicit or paraphrased toxicity |
| NSFW (existing basic) | Flags explicit content | Not robust for nuanced or mixed-language input |
| Llama Guard (policy) | Policy-based filtering | Not specialized for fine-grained NSFW detection |

We need a strong model-based classifier to complement these.

What This PR Adds

1. New Validator: nsfw_text

File: backend/app/core/validators/config/nsfw_text_safety_validator_config.py

2. Custom Model Integration

Instead of using default models, we configure: textdetox/xlmr-large-toxicity-classifier

This model is:

  • Multilingual (based on XLM-R)
  • Handles code-mixed inputs well
  • Better contextual understanding vs keyword-based systems

Design Decision

Previous Approach (Discarded)

  • Directly calling HuggingFace models separately
  • Added extra dependency + integration overhead

Current Approach

  • Use Guardrails NSFWText validator
  • Plug in custom HuggingFace model via model_name

Benefits

  • No additional abstraction layer
  • Fully compatible with existing Guardrails pipeline
  • Centralized validator management
  • Easy to swap models in future
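As a rough sketch of the approach above, here is a minimal stand-in for the nsfw_text config class. The field names, defaults, and `build()` shape are illustrative assumptions based on this description, not the actual code in `nsfw_text_safety_validator_config.py`:

```python
# Hypothetical stand-in for the nsfw_text validator config described above.
# threshold/on_fail defaults are assumptions; only the model name comes
# from this PR's description.
from dataclasses import dataclass


@dataclass
class NSFWTextSafetyValidatorConfig:
    type: str = "nsfw_text"
    threshold: float = 0.8
    model_name: str = "textdetox/xlmr-large-toxicity-classifier"
    on_fail: str = "exception"

    def build(self) -> dict:
        # In the real config, build() would instantiate the Guardrails
        # NSFWText validator; here we just return the kwargs it would pass.
        return {
            "threshold": self.threshold,
            "model_name": self.model_name,
            "on_fail": self.on_fail,
        }


cfg = NSFWTextSafetyValidatorConfig()
```

Because the model is selected via `model_name`, swapping in a different HuggingFace classifier later is a one-line config change rather than a new integration.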

Checklist

Before submitting a pull request, please ensure that you complete these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested the change.
  • If you've fixed a bug or added code, ensure it is covered by test cases.

Notes

Add any other information the reviewer may need here.

Summary by CodeRabbit

  • New Features

    • Added three new safety validators: LlamaGuard7B for unsafe content detection, ProfanityFree for profanity filtering, and NSFWText for explicit content detection with configurable thresholds.
    • New validator configuration options for fine-tuning safety policies and behaviors.
  • Documentation

    • Updated API documentation to reflect newly supported validator types.
  • Tests

    • Comprehensive test coverage for new validators and their failure handling modes.
  • Chores

    • Added dependencies for model support and pre-cached safety models in Docker image.


coderabbitai bot commented Apr 2, 2026

Warning

Rate limit exceeded

@rkritika1508 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 8 minutes and 52 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 8 minutes and 52 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 627f67ce-c49b-44f7-be99-1e2a04435692

📥 Commits

Reviewing files that changed from the base of the PR and between 60f5067 and 7264771.

📒 Files selected for processing (10)
  • backend/Dockerfile
  • backend/app/api/API_USAGE.md
  • backend/app/core/enum.py
  • backend/app/core/validators/README.md
  • backend/app/core/validators/config/nsfw_text_safety_validator_config.py
  • backend/app/core/validators/validators.json
  • backend/app/schemas/guardrail_config.py
  • backend/app/tests/test_guardrails_api_integration.py
  • backend/app/tests/test_toxicity_hub_validators.py
  • backend/pyproject.toml
📝 Walkthrough

Walkthrough

This PR adds support for three new toxicity/safety validators (llamaguard_7b, profanity_free, and nsfw_text) by introducing configuration classes, updating the validator registry, extending the enum, pre-caching the NSFW model in Docker, and adding comprehensive integration and unit tests.

Changes

  • Docker Configuration (backend/Dockerfile): Defines the HF_HOME=/app/hf_cache environment variable and adds a build step to pre-download/cache the textdetox/xlmr-large-toxicity-classifier model.
  • Validator Configurations (backend/app/core/validators/config/llamaguard_7b_safety_validator_config.py, backend/app/core/validators/config/nsfw_text_safety_validator_config.py): Introduces new config classes for LlamaGuard7B and NSFWText validators with policy mapping, parameter validation, and build() methods.
  • Validator Registry & Enum (backend/app/core/enum.py, backend/app/core/validators/validators.json): Extends the ValidatorType enum with LLMCritic, LlamaGuard7B, ProfanityFree, and NSFWText; adds corresponding entries to the validators manifest (contains unresolved merge conflicts for NSFWText).
  • Schema Integration (backend/app/schemas/guardrail_config.py): Updates the polymorphic ValidatorConfigItem union to include the new validator config types (contains unresolved merge conflicts).
  • Dependencies (backend/pyproject.toml): Adds huggingface-hub>=1.5.0,<2.0 and configures a custom PyTorch CPU-only index; contains unresolved merge conflicts for transformers and torch versions.
  • Documentation (backend/app/api/API_USAGE.md, backend/app/core/validators/README.md): Updates the API documentation and validator README to include the new validator types, policies, and configuration details (README contains unresolved merge conflicts for nsfw_text).
  • Integration Tests (backend/app/tests/test_guardrails_api_integration.py): Adds ~10+ test cases covering profanity_free, llamaguard_7b, nsfw_text, and validator combinations with various on_fail behaviors; contains unresolved merge conflicts for nsfw_text tests.
  • Unit Tests (backend/app/tests/test_toxicity_hub_validators.py): New test suite validating config construction, policy mapping, on_fail resolution, and error handling for validator configs; contains unresolved merge conflicts around nsfw_text tests.
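The enum extension in the registry cohort above can be sketched as follows. Member names are assumptions inferred from the walkthrough text, and existing members are elided:

```python
# Hypothetical sketch of the ValidatorType extension described in the
# walkthrough; the exact member spellings in backend/app/core/enum.py
# may differ.
from enum import Enum


class ValidatorType(str, Enum):
    # ...existing members elided...
    LLM_CRITIC = "llm_critic"
    LLAMAGUARD_7B = "llamaguard_7b"
    PROFANITY_FREE = "profanity_free"
    NSFW_TEXT = "nsfw_text"
```

Because the enum values double as wire values, `ValidatorType("nsfw_text")` resolves incoming payload strings directly, which is why a missing member breaks API validation.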

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Note: This diff contains multiple unresolved Git merge conflicts across 7+ files (enum.py, validators.json, README.md, guardrail_config.py, pyproject.toml, and test files), which significantly impacts review complexity. Conflicts must be resolved before merge. Additionally, the heterogeneity of changes (Docker, enums, configs, schemas, tests, docs) and logic density (policy mapping, validator instantiation) warrant elevated scrutiny.

Possibly related PRs

  • PR #80: Adds the same Guardrails Hub validators (LlamaGuard7B, ProfanityFree) across overlapping files (config classes, validators.json, enum.py, schemas, tests).
  • PR #39: Modifies the ValidatorType enum in backend/app/core/enum.py by introducing different validator members.
  • PR #20: Implements the "rephrase" on_fail feature including test coverage and on_fail behavior wiring.

Suggested reviewers

  • nishika26
  • AkhileshNegi
  • dennyabrain

Poem

🐇 Three validators hop into the fold,
llamaguard and profanity tales told,
While nsfw guards toxicity away,
Pre-cached models in Docker's display,
Toxicity checked, safe content's the way! 🛡️✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Title check ⚠️ Warning: The PR title 'Added NSFW text validator' is misleading given the presence of unresolved merge conflicts and incomplete changes throughout the codebase. Resolution: Resolve all merge conflicts in enum.py, validators.json, guardrail_config.py, test files, and pyproject.toml before merging. Update the title or resolve conflicts to accurately reflect the final state of the PR.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Docstring Coverage ✅ Passed: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@rkritika1508 rkritika1508 self-assigned this Apr 7, 2026
@rkritika1508 rkritika1508 added enhancement New feature or request ready-for-review labels Apr 7, 2026
@rkritika1508 rkritika1508 linked an issue Apr 7, 2026 that may be closed by this pull request
Base automatically changed from feat/toxicity-hub-validators to main April 10, 2026 11:57

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 18

🧹 Nitpick comments (3)
backend/scripts/install_guardrails_from_hub.sh (1)

9-9: Consider the implications of enabling remote inferencing by default.

Changing ENABLE_REMOTE_INFERENCING default from "false" to "true" means data will be sent to Guardrails AI's remote inference endpoints by default. While the AI summary indicates this aligns with updated documentation, ensure this is an intentional decision given:

  1. Privacy: User input data may be transmitted to external services
  2. Latency: Remote calls add network overhead
  3. Cost: Remote inferencing may incur API usage charges

If this is intentional for the new NSFW/toxicity validators, consider documenting why remote inferencing is now preferred in the script comments.
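A minimal sketch of the opt-in pattern suggested here, assuming the variable is declared near the top of install_guardrails_from_hub.sh:

```shell
# Hedged sketch; only ENABLE_REMOTE_INFERENCING comes from the review
# comment, the surrounding script is not shown here.
# NOTE: enabling remote inferencing sends user input to external
# Guardrails AI endpoints (privacy, latency, and cost implications).
# Callers must opt in explicitly, e.g.:
#   ENABLE_REMOTE_INFERENCING=true ./install_guardrails_from_hub.sh
ENABLE_REMOTE_INFERENCING="${ENABLE_REMOTE_INFERENCING:-false}"
echo "remote inferencing: ${ENABLE_REMOTE_INFERENCING}"
```

Keeping "false" as the fallback makes the remote data flow an explicit decision rather than a silent default.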

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/scripts/install_guardrails_from_hub.sh` at line 9, The default for
ENABLE_REMOTE_INFERENCING was changed to "true", which causes user data to be
sent to Guardrails AI remote endpoints by default; either revert the default
back to "false" or, if leaving it enabled intentionally for NSFW/toxicity
validators, add a clear comment above the ENABLE_REMOTE_INFERENCING declaration
and update relevant docs explaining privacy/latency/cost implications and that
enabling remote inferencing sends user input to external services—refer to the
ENABLE_REMOTE_INFERENCING variable in this script and the
install_guardrails_from_hub.sh header for where to add the explanatory comment
and link to documentation or opt-in instructions.
backend/app/api/routes/guardrails.py (1)

186-190: Only the first validator's metadata is returned.

When multiple validators have validator_metadata set (e.g., multiple validators failed with empty fix_value), only the first one's metadata is surfaced to the API response. This may obscure which validators contributed to the outcome.

Consider aggregating metadata from all validators that set it, or at minimum document this "first-wins" behavior:

♻️ Optional: Aggregate all validator metadata
-            meta = next(
-                (v.validator_metadata for v in validators if v.validator_metadata),
-                None,
-            )
+            all_meta = [v.validator_metadata for v in validators if v.validator_metadata]
+            meta = {"validators": all_meta} if all_meta else None
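The difference between the two behaviours can be illustrated with stand-in objects (the `Validator` dataclass here is hypothetical, not the project's class):

```python
# Toy comparison of "first-wins" vs aggregated validator metadata.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Validator:
    validator_metadata: Optional[dict] = None


validators = [
    Validator({"reason": "profanity"}),
    Validator(None),
    Validator({"reason": "nsfw"}),
]

# Current behaviour: only the first non-empty metadata survives.
first = next(
    (v.validator_metadata for v in validators if v.validator_metadata),
    None,
)

# Suggested behaviour: aggregate all non-empty metadata entries.
all_meta = [v.validator_metadata for v in validators if v.validator_metadata]
meta = {"validators": all_meta} if all_meta else None
```

With the first-wins version the "nsfw" entry is silently dropped; the aggregated version preserves both.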
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/api/routes/guardrails.py` around lines 186 - 190, The current
code uses next(...) to pick only the first non-empty validator_metadata from
validators and passes it as meta to APIResponse.success_response, which hides
other validators' metadata; replace that logic in the function that builds the
response (referencing validators, validator_metadata, meta, response_model and
the call to APIResponse.success_response) with aggregation: collect all
non-empty v.validator_metadata entries (e.g., into a list or merged dict
depending on schema), deduplicate/merge as appropriate, and pass the aggregated
metadata to APIResponse.success_response instead of the single-first value (or
explicitly document the first-wins behavior if aggregation is undesired).
backend/README.md (1)

215-215: Use descriptive link text instead of "here".

The link text "here" is non-descriptive. Screen readers and users benefit from meaningful link text that describes the destination.

📝 Suggested improvement
-1. Ensure that the `.env` file contains the correct value for `GUARDRAILS_HUB_API_KEY`. The key can be fetched from [here](https://hub.guardrailsai.com/keys).
+1. Ensure that the `.env` file contains the correct value for `GUARDRAILS_HUB_API_KEY`. The key can be fetched from the [Guardrails Hub API Keys page](https://hub.guardrailsai.com/keys).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/README.md` at line 215, Replace the non-descriptive link text "here"
with meaningful descriptive text that explains the destination (for example
"Guardrails Hub API keys page" or "Guardrails Hub API key management") in the
README sentence about GUARDRAILS_HUB_API_KEY so the link reads like "The key can
be fetched from Guardrails Hub API keys page" and improves accessibility and
clarity.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/api/API_USAGE.md`:
- Around line 103-107: Remove the unresolved merge conflict markers and update
the `type=` filter list to include `nsfw_text` (i.e., ensure the line reads with
`type=uli_slur_match|pii_remover|gender_assumption_bias|ban_list|llm_critic|topic_relevance|llamaguard_7b|profanity_free|nsfw_text`),
keeping the rest of the documentation text unchanged so the PR objective (adding
nsfw_text) is reflected.
- Around line 468-471: Resolve the unresolved merge conflict in the validator
types list by removing the Git conflict markers (<<<<<<<, =======, >>>>>>>) and
ensuring the `nsfw_text` entry is present in the list of validators; update the
block that currently contains the conflict markers so it reads a clean list that
includes `nsfw_text` alongside the other validator type names.

In `@backend/app/core/enum.py`:
- Around line 38-41: Remove the git conflict markers and restore the missing
enum member by keeping the line defining NSFWText = "nsfw_text" inside the
appropriate Enum (the ValidatorType/enum class in backend/app/core/enum.py);
delete the "<<<<<<< HEAD", "=======", and ">>>>>>> ..." lines so the file is
valid Python and the NSFWText member is present for
ValidatorBase/ValidatorUpdate and API validation to work.

In `@backend/app/core/validators/config/base_validator_config.py`:
- Around line 22-36: Remove the unresolved merge conflict markers and ensure the
branch that sets validator_metadata also returns the empty string; specifically,
inside the _on_fix method where validator_metadata is assigned (using self.type
in the reason), keep the version that includes the return "" immediately after
setting validator_metadata so _on_fix returns "" instead of falling through to
return the falsy fix_value; remove the conflict markers (<<<<<<<, =======,
>>>>>>>) and any duplicate blocks so the method is a single consistent
implementation.

In `@backend/app/core/validators/config/llamaguard_7b_safety_validator_config.py`:
- Around line 5-9: There is an unresolved merge conflict and an unused import:
remove the conflict markers (<<<<<<< HEAD, =======, >>>>>>> ...) and delete the
unused import of GuardrailOnFail so only the necessary import remains (keep
BaseValidatorConfig). Ensure the import block is clean and linter-safe and run
tests/linters to confirm no unused-import errors remain.

In `@backend/app/core/validators/README.md`:
- Around line 17-20: The Supported Validators section in
backend/app/core/validators/README.md contains unresolved merge conflict
markers; remove the conflict markers (<<<<<<<, =======, >>>>>>>) and restore the
`nsfw_text` entry (source: `hub://guardrails/nsfw_text`) so the README lists
`nsfw_text` as intended in that section.
- Around line 416-458: Resolve the unresolved merge conflict in the README's
NSFW Text Validator section by removing the conflict markers (<<<<<<< HEAD,
=======, >>>>>>> 60f5067) and keeping the HEAD content under the "NSFW Text
Validator (`nsfw_text`)" heading (the full explanatory block including Config,
Source, Parameters, Notes/limitations and Recommendation). Ensure there are no
leftover conflict markers and the section reads as a single cohesive
paragraph/block as in the HEAD version.
- Around line 541-545: Remove the unresolved merge conflict markers (<<<<<<<
HEAD, =======, >>>>>>> ...) from the Related Files section in the README and
ensure the file list contains the intended entries; replace the conflict block
with a clean list that includes both `nsfw_text_safety_validator_config.py` and
`profanity_free_safety_validator_config.py` (or only the correct one if project
intends a single entry), making the Related Files section syntactically valid
Markdown.

In `@backend/app/core/validators/validators.json`:
- Around line 42-50: Resolve the Git merge conflict in validators.json by
removing the conflict markers (<<<<<<< HEAD, =======, >>>>>>>) and keeping the
HEAD entry for the new validator object (the "nsfw_text" object with "type":
"nsfw_text", "version": "0.1.0", "source": "hub://guardrails/nsfw_text"); ensure
the surrounding JSON array/object punctuation is corrected (commas/braces) so
the file is valid JSON and the validators manifest loads successfully.
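Once resolved, the kept HEAD entry would read as follows (values taken verbatim from this comment):

```json
{
  "type": "nsfw_text",
  "version": "0.1.0",
  "source": "hub://guardrails/nsfw_text"
}
```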

In `@backend/app/schemas/guardrail_config.py`:
- Around line 30-38: There is an unresolved merge conflict in
guardrail_config.py around the imports: remove the conflict markers (<<<<<<<
HEAD and >>>>>>>) and restore the intended imports so both
NSFWTextSafetyValidatorConfig and ProfanityFreeSafetyValidatorConfig are
imported; ensure the import block contains NSFWTextSafetyValidatorConfig (as
requested) alongside ProfanityFreeSafetyValidatorConfig and that there are no
leftover conflict tokens.
- Around line 47-55: The discriminated union list for the validator configs has
an unresolved merge conflict that omitted NSFWTextSafetyValidatorConfig,
breaking recognition of payloads with type "nsfw_text"; fix by removing the
conflict markers and ensuring NSFWTextSafetyValidatorConfig is included
alongside LlamaGuard7BSafetyValidatorConfig, ProfanityFreeSafetyValidatorConfig,
and TopicRelevanceSafetyValidatorConfig in the union used with
Field(discriminator="type") so Pydantic can properly dispatch on type.
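The effect of the discriminated union can be illustrated with a toy dispatch table. The config classes below are stand-ins; Pydantic's `Field(discriminator="type")` performs the equivalent lookup automatically:

```python
# Toy dispatch-on-"type" showing why nsfw_text must appear in the union:
# payloads whose "type" is not registered cannot be parsed at all.
from dataclasses import dataclass


@dataclass
class NSFWTextConfig:
    type: str = "nsfw_text"
    threshold: float = 0.8


@dataclass
class ProfanityFreeConfig:
    type: str = "profanity_free"


REGISTRY = {
    "nsfw_text": NSFWTextConfig,
    "profanity_free": ProfanityFreeConfig,
}


def parse_validator(payload: dict):
    # Mirrors discriminated dispatch: unknown "type" values fail fast.
    cls = REGISTRY[payload["type"]]
    return cls(**payload)


item = parse_validator({"type": "nsfw_text", "threshold": 0.5})
```

Omitting `NSFWTextSafetyValidatorConfig` from the real union is the same as deleting the `"nsfw_text"` registry entry above: every nsfw_text payload is rejected.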

In `@backend/app/tests/test_guardrails_api_integration.py`:
- Around line 327-388: The file contains unresolved Git merge conflict markers
(<<<<<<<, =======, >>>>>>>) around the NSFWText test block which breaks pytest
import; remove the conflict markers and reconcile the duplicated sections so
only one coherent set of tests remains (keep the intended tests like
test_input_guardrails_with_nsfw_text_on_explicit_content,
test_input_guardrails_with_nsfw_text_with_low_threshold,
test_input_guardrails_with_nsfw_text_exception_action and any other tests
referenced later), ensuring there are no leftover markers elsewhere (notably the
similar region flagged at lines ~438-537) and that the module imports and test
function names are intact.
- Around line 349-366: The test
test_input_guardrails_with_nsfw_text_with_low_threshold currently uses the same
explicit prompt as the default-threshold test so it doesn't prove threshold
handling; replace the input with a borderline/ambiguous NSFW sample (e.g.,
suggestive but not explicit) and assert that the default-threshold validator
(the other test) returns success while this test, which posts the same
borderline input but with
{"type":"nsfw_text","threshold":0.1,"on_fail":"exception"} to VALIDATE_API_PATH
(using the same request_id, organization_id, project_id) returns failure —
ensuring different outcomes prove the threshold parameter is respected.
- Around line 229-245: The test
test_input_guardrails_with_profanity_free_on_profane_text currently only asserts
that the sanitized value differs from the original, which can hide a broken fix
path where fix_value falls back to "" in base_validator_config.py; update the
assertions to (1) assert that body["data"][SAFE_TEXT_FIELD] is a non-empty
string and (2) assert that known profane tokens from the original input (e.g.,
"damn", "fucking") are not present in body["data"][SAFE_TEXT_FIELD] so the fix
is both usable and actually removed profanities.

In `@backend/app/tests/test_toxicity_hub_validators.py`:
- Around line 13-27: Remove the unresolved merge markers and restore the HEAD
version: keep the import of NSFWTextSafetyValidatorConfig and the _NSFW_PATCH
assignment, retain the _LLAMAGUARD_PATCH constant, and delete the conflict lines
(<<<<<<<, =======, >>>>>>>). Ensure the module defines
NSFWTextSafetyValidatorConfig, _LLAMAGUARD_PATCH (value
"app.core.validators.config.llamaguard_7b_safety_validator_config.LlamaGuard7B")
and _NSFW_PATCH (value
"app.core.validators.config.nsfw_text_safety_validator_config.NSFWText") exactly
as in HEAD so imports and patch constants resolve.
- Around line 294-451: Resolve the merge conflict by removing the conflict
markers and keeping the HEAD version of the TestNSFWTextSafetyValidatorConfig
class (delete the >>>>>>> and <<<<<<< blocks and the redundant empty region),
and update the failing expectation in test_build_with_defaults to assert
model_name == "textdetox/xlmr-large-toxicity-classifier" so the test matches the
config default; ensure all tests in TestNSFWTextSafetyValidatorConfig (including
test_build_with_defaults) refer to the correct default model_name and that there
are no leftover conflict markers anywhere in that class.

In `@backend/pyproject.toml`:
- Around line 53-61: Resolve the merge conflict in pyproject.toml so the
CPU-only PyTorch source config is activated: ensure the dependency line for
torch remains as "torch>=2.0.0" in the [tool.poetry.dependencies] (or equivalent
dependencies section) and keep the [tool.uv.sources] and [[tool.uv.index]]
blocks intact (the entries torch = [{ index = "pytorch-cpu" }] and the
pytorch-cpu index URL). Remove conflict markers and ensure only the incoming
CPU-wheel index configuration and the torch>=2.0.0 dependency coexist so the
project pulls CPU-only wheels.
- Around line 35-39: The pyproject.toml contains unresolved git conflict markers
(<<<<<<< HEAD, =======, >>>>>>>) around the dependency block; remove the
conflict markers and restore the required dependencies by keeping
"transformers>=5.0.0" and "torch>=2.0.0" in the [project] / dependencies list so
dependency resolution succeeds and the Dockerfile and NSFWText validator that
rely on transformers/torch continue to work.

---

Nitpick comments:
In `@backend/app/api/routes/guardrails.py`:
- Around line 186-190: The current code uses next(...) to pick only the first
non-empty validator_metadata from validators and passes it as meta to
APIResponse.success_response, which hides other validators' metadata; replace
that logic in the function that builds the response (referencing validators,
validator_metadata, meta, response_model and the call to
APIResponse.success_response) with aggregation: collect all non-empty
v.validator_metadata entries (e.g., into a list or merged dict depending on
schema), deduplicate/merge as appropriate, and pass the aggregated metadata to
APIResponse.success_response instead of the single-first value (or explicitly
document the first-wins behavior if aggregation is undesired).

In `@backend/README.md`:
- Line 215: Replace the non-descriptive link text "here" with meaningful
descriptive text that explains the destination (for example "Guardrails Hub API
keys page" or "Guardrails Hub API key management") in the README sentence about
GUARDRAILS_HUB_API_KEY so the link reads like "The key can be fetched from
Guardrails Hub API keys page" and improves accessibility and clarity.

In `@backend/scripts/install_guardrails_from_hub.sh`:
- Line 9: The default for ENABLE_REMOTE_INFERENCING was changed to "true", which
causes user data to be sent to Guardrails AI remote endpoints by default; either
revert the default back to "false" or, if leaving it enabled intentionally for
NSFW/toxicity validators, add a clear comment above the
ENABLE_REMOTE_INFERENCING declaration and update relevant docs explaining
privacy/latency/cost implications and that enabling remote inferencing sends
user input to external services—refer to the ENABLE_REMOTE_INFERENCING variable
in this script and the install_guardrails_from_hub.sh header for where to add
the explanatory comment and link to documentation or opt-in instructions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0055de2d-430a-49bd-94b5-0c5735342722

📥 Commits

Reviewing files that changed from the base of the PR and between 60f5067 and 59dcd2d.

📒 Files selected for processing (17)
  • backend/Dockerfile
  • backend/README.md
  • backend/app/api/API_USAGE.md
  • backend/app/api/docs/guardrails/run_guardrails.md
  • backend/app/api/routes/guardrails.py
  • backend/app/core/enum.py
  • backend/app/core/validators/README.md
  • backend/app/core/validators/config/base_validator_config.py
  • backend/app/core/validators/config/llamaguard_7b_safety_validator_config.py
  • backend/app/core/validators/config/nsfw_text_safety_validator_config.py
  • backend/app/core/validators/config/profanity_free_safety_validator_config.py
  • backend/app/core/validators/validators.json
  • backend/app/schemas/guardrail_config.py
  • backend/app/tests/test_guardrails_api_integration.py
  • backend/app/tests/test_toxicity_hub_validators.py
  • backend/pyproject.toml
  • backend/scripts/install_guardrails_from_hub.sh

Co-authored-by: dennyabrain <denny.george90@gmail.com>
@rkritika1508 rkritika1508 force-pushed the feat/toxicity-huggingface-model branch from 59dcd2d to 31af2f6 Compare April 10, 2026 15:15
rkritika1508 and others added 3 commits April 10, 2026 20:48
Resolved conflicts keeping nsfw_text validator features from this branch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…fier

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@rkritika1508 rkritika1508 force-pushed the feat/toxicity-huggingface-model branch from 354cb64 to 5b2fe3b Compare April 10, 2026 15:30
rkritika1508 and others added 2 commits April 10, 2026 21:02
…t model

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@rkritika1508
Collaborator Author

@coderabbitai full review


coderabbitai bot commented Apr 10, 2026

✅ Actions performed

Full review triggered.


Labels

enhancement New feature or request ready-for-review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add toxicity validator for HuggingFace models

3 participants