
Added NSFW text validator #83

Open
rkritika1508 wants to merge 41 commits into main from feat/toxicity-huggingface-model

Conversation


@rkritika1508 rkritika1508 commented Apr 2, 2026

Summary

Target issue is #84
This PR introduces a new validator: nsfw_text, powered by Guardrails AI, with support for custom HuggingFace models.

We configure it to use:
textdetox/xlmr-large-toxicity-classifier

This significantly improves our ability to detect multilingual, code-mixed, and context-aware unsafe content, addressing gaps in the current validator stack.

Our existing validator suite (profanity, slur detection, toxic language, etc.) primarily relies on:

  • rule-based approaches
  • keyword matching
  • lighter ML models

While effective for straightforward cases, they fail in more complex real-world scenarios, such as:

  • multilingual inputs (Hindi + English mix, etc.)
  • paraphrased toxicity (non-explicit abusive phrasing)
  • regional slang / dialects
  • implicit or contextual NSFW content

Existing Validators (Context)

To make this PR self-contained, here’s a quick overview of what we already use and where they fall short:

| Validator | What it does | Limitation |
|---|---|---|
| Profanity Free | Detects explicit swear words | Misses implicit or paraphrased toxicity |
| NSFW (existing basic) | Flags explicit content | Not robust for nuanced or mixed-language input |
| Llama Guard (policy) | Policy-based filtering | Not specialized for fine-grained NSFW detection |

We need a strong model-based classifier to complement these.

What This PR Adds

1. New Validator: nsfw_text

File: backend/app/core/validators/config/nsfw_text_safety_validator_config.py

2. Custom Model Integration

Instead of using default models, we configure: textdetox/xlmr-large-toxicity-classifier

This model is:

  • Multilingual (based on XLM-R)
  • Handles code-mixed inputs well
  • Better contextual understanding vs keyword-based systems

Design Decision

Previous Approach (Discarded)

  • Directly calling HuggingFace models separately
  • Added extra dependency + integration overhead

Current Approach

  • Use Guardrails NSFWText validator
  • Plug in custom HuggingFace model via model_name

Benefits

  • No additional abstraction layer
  • Fully compatible with existing Guardrails pipeline
  • Centralized validator management
  • Easy to swap models in future
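As a rough sketch of the approach above, here is a minimal stand-in for the nsfw_text config class. The field names, defaults, and `build()` shape are illustrative assumptions based on this description, not the actual code in `nsfw_text_safety_validator_config.py`:

```python
# Hypothetical stand-in for the nsfw_text validator config described above.
# threshold/on_fail defaults are assumptions; only the model name comes
# from this PR's description.
from dataclasses import dataclass


@dataclass
class NSFWTextSafetyValidatorConfig:
    type: str = "nsfw_text"
    threshold: float = 0.8
    model_name: str = "textdetox/xlmr-large-toxicity-classifier"
    on_fail: str = "exception"

    def build(self) -> dict:
        # In the real config, build() would instantiate the Guardrails
        # NSFWText validator; here we just return the kwargs it would pass.
        return {
            "threshold": self.threshold,
            "model_name": self.model_name,
            "on_fail": self.on_fail,
        }


cfg = NSFWTextSafetyValidatorConfig()
```

Because the model is selected via `model_name`, swapping in a different HuggingFace classifier later is a one-line config change rather than a new integration.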

Checklist

Before submitting a pull request, please ensure that you complete these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested the change.
  • If you've fixed a bug or added code, ensure it is covered by test cases.

Notes

Add any other information the reviewer may need here.

Summary by CodeRabbit

  • New Features

    • Added three new safety validators: LlamaGuard7B for unsafe content detection, ProfanityFree for profanity filtering, and NSFWText for explicit content detection with configurable thresholds.
    • New validator configuration options for fine-tuning safety policies and behaviors.
  • Documentation

    • Updated API documentation to reflect newly supported validator types.
  • Tests

    • Comprehensive test coverage for new validators and their failure handling modes.
  • Chores

    • Added dependencies for model support and pre-cached safety models in Docker image.


coderabbitai bot commented Apr 2, 2026

Warning

Rate limit exceeded

@rkritika1508 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 8 minutes and 52 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 8 minutes and 52 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 627f67ce-c49b-44f7-be99-1e2a04435692

📥 Commits

Reviewing files that changed from the base of the PR and between 60f5067 and 7264771.

📒 Files selected for processing (10)
  • backend/Dockerfile
  • backend/app/api/API_USAGE.md
  • backend/app/core/enum.py
  • backend/app/core/validators/README.md
  • backend/app/core/validators/config/nsfw_text_safety_validator_config.py
  • backend/app/core/validators/validators.json
  • backend/app/schemas/guardrail_config.py
  • backend/app/tests/test_guardrails_api_integration.py
  • backend/app/tests/test_toxicity_hub_validators.py
  • backend/pyproject.toml
📝 Walkthrough

Walkthrough

This PR adds support for three new toxicity/safety validators (llamaguard_7b, profanity_free, and nsfw_text) by introducing configuration classes, updating the validator registry, extending the enum, pre-caching the NSFW model in Docker, and adding comprehensive integration and unit tests.

Changes

  • Docker Configuration (backend/Dockerfile): Defines the HF_HOME=/app/hf_cache environment variable and adds a build step to pre-download/cache the textdetox/xlmr-large-toxicity-classifier model.
  • Validator Configurations (backend/app/core/validators/config/llamaguard_7b_safety_validator_config.py, backend/app/core/validators/config/nsfw_text_safety_validator_config.py): Introduces new config classes for LlamaGuard7B and NSFWText validators with policy mapping, parameter validation, and build() methods.
  • Validator Registry & Enum (backend/app/core/enum.py, backend/app/core/validators/validators.json): Extends the ValidatorType enum with LLMCritic, LlamaGuard7B, ProfanityFree, and NSFWText; adds corresponding entries to the validators manifest (contains unresolved merge conflicts for NSFWText).
  • Schema Integration (backend/app/schemas/guardrail_config.py): Updates the polymorphic ValidatorConfigItem union to include the new validator config types (contains unresolved merge conflicts).
  • Dependencies (backend/pyproject.toml): Adds huggingface-hub>=1.5.0,<2.0 and configures a custom PyTorch CPU-only index; contains unresolved merge conflicts for transformers and torch versions.
  • Documentation (backend/app/api/API_USAGE.md, backend/app/core/validators/README.md): Updates the API documentation and validator README to include the new validator types, policies, and configuration details (README contains unresolved merge conflicts for nsfw_text).
  • Integration Tests (backend/app/tests/test_guardrails_api_integration.py): Adds ~10+ test cases covering profanity_free, llamaguard_7b, nsfw_text, and validator combinations with various on_fail behaviors; contains unresolved merge conflicts for nsfw_text tests.
  • Unit Tests (backend/app/tests/test_toxicity_hub_validators.py): New test suite validating config construction, policy mapping, on_fail resolution, and error handling for validator configs; contains unresolved merge conflicts around nsfw_text tests.
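The enum extension in the registry cohort above can be sketched as follows. Member names are assumptions inferred from the walkthrough text, and existing members are elided:

```python
# Hypothetical sketch of the ValidatorType extension described in the
# walkthrough; the exact member spellings in backend/app/core/enum.py
# may differ.
from enum import Enum


class ValidatorType(str, Enum):
    # ...existing members elided...
    LLM_CRITIC = "llm_critic"
    LLAMAGUARD_7B = "llamaguard_7b"
    PROFANITY_FREE = "profanity_free"
    NSFW_TEXT = "nsfw_text"
```

Because the enum values double as wire values, `ValidatorType("nsfw_text")` resolves incoming payload strings directly, which is why a missing member breaks API validation.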

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Note: This diff contains multiple unresolved Git merge conflicts across 7+ files (enum.py, validators.json, README.md, guardrail_config.py, pyproject.toml, and test files), which significantly impacts review complexity. Conflicts must be resolved before merge. Additionally, the heterogeneity of changes (Docker, enums, configs, schemas, tests, docs) and logic density (policy mapping, validator instantiation) warrant elevated scrutiny.

Possibly related PRs

  • PR #80: Adds the same Guardrails Hub validators (LlamaGuard7B, ProfanityFree) across overlapping files (config classes, validators.json, enum.py, schemas, tests).
  • PR #39: Modifies the ValidatorType enum in backend/app/core/enum.py by introducing different validator members.
  • PR #20: Implements the "rephrase" on_fail feature including test coverage and on_fail behavior wiring.

Suggested reviewers

  • nishika26
  • AkhileshNegi
  • dennyabrain

Poem

🐇 Three validators hop into the fold,
llamaguard and profanity tales told,
While nsfw guards toxicity away,
Pre-cached models in Docker's display,
Toxicity checked, safe content's the way! 🛡️✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Title check ⚠️ Warning: The PR title 'Added NSFW text validator' is misleading given the presence of unresolved merge conflicts and incomplete changes throughout the codebase. Resolution: Resolve all merge conflicts in enum.py, validators.json, guardrail_config.py, test files, and pyproject.toml before merging. Update the title or resolve conflicts to accurately reflect the final state of the PR.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Docstring Coverage ✅ Passed: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@rkritika1508 rkritika1508 self-assigned this Apr 7, 2026
@rkritika1508 rkritika1508 added enhancement New feature or request ready-for-review labels Apr 7, 2026
@rkritika1508 rkritika1508 linked an issue Apr 7, 2026 that may be closed by this pull request
Base automatically changed from feat/toxicity-hub-validators to main April 10, 2026 11:57

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 18

🧹 Nitpick comments (3)
backend/scripts/install_guardrails_from_hub.sh (1)

9-9: Consider the implications of enabling remote inferencing by default.

Changing ENABLE_REMOTE_INFERENCING default from "false" to "true" means data will be sent to Guardrails AI's remote inference endpoints by default. While the AI summary indicates this aligns with updated documentation, ensure this is an intentional decision given:

  1. Privacy: User input data may be transmitted to external services
  2. Latency: Remote calls add network overhead
  3. Cost: Remote inferencing may incur API usage charges

If this is intentional for the new NSFW/toxicity validators, consider documenting why remote inferencing is now preferred in the script comments.
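A minimal sketch of the opt-in pattern suggested here, assuming the variable is declared near the top of install_guardrails_from_hub.sh:

```shell
# Hedged sketch; only ENABLE_REMOTE_INFERENCING comes from the review
# comment, the surrounding script is not shown here.
# NOTE: enabling remote inferencing sends user input to external
# Guardrails AI endpoints (privacy, latency, and cost implications).
# Callers must opt in explicitly, e.g.:
#   ENABLE_REMOTE_INFERENCING=true ./install_guardrails_from_hub.sh
ENABLE_REMOTE_INFERENCING="${ENABLE_REMOTE_INFERENCING:-false}"
echo "remote inferencing: ${ENABLE_REMOTE_INFERENCING}"
```

Keeping "false" as the fallback makes the remote data flow an explicit decision rather than a silent default.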

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/scripts/install_guardrails_from_hub.sh` at line 9, The default for
ENABLE_REMOTE_INFERENCING was changed to "true", which causes user data to be
sent to Guardrails AI remote endpoints by default; either revert the default
back to "false" or, if leaving it enabled intentionally for NSFW/toxicity
validators, add a clear comment above the ENABLE_REMOTE_INFERENCING declaration
and update relevant docs explaining privacy/latency/cost implications and that
enabling remote inferencing sends user input to external services—refer to the
ENABLE_REMOTE_INFERENCING variable in this script and the
install_guardrails_from_hub.sh header for where to add the explanatory comment
and link to documentation or opt-in instructions.
backend/app/api/routes/guardrails.py (1)

186-190: Only the first validator's metadata is returned.

When multiple validators have validator_metadata set (e.g., multiple validators failed with empty fix_value), only the first one's metadata is surfaced to the API response. This may obscure which validators contributed to the outcome.

Consider aggregating metadata from all validators that set it, or at minimum document this "first-wins" behavior:

♻️ Optional: Aggregate all validator metadata
-            meta = next(
-                (v.validator_metadata for v in validators if v.validator_metadata),
-                None,
-            )
+            all_meta = [v.validator_metadata for v in validators if v.validator_metadata]
+            meta = {"validators": all_meta} if all_meta else None
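The difference between the two behaviours can be illustrated with stand-in objects (the `Validator` dataclass here is hypothetical, not the project's class):

```python
# Toy comparison of "first-wins" vs aggregated validator metadata.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Validator:
    validator_metadata: Optional[dict] = None


validators = [
    Validator({"reason": "profanity"}),
    Validator(None),
    Validator({"reason": "nsfw"}),
]

# Current behaviour: only the first non-empty metadata survives.
first = next(
    (v.validator_metadata for v in validators if v.validator_metadata),
    None,
)

# Suggested behaviour: aggregate all non-empty metadata entries.
all_meta = [v.validator_metadata for v in validators if v.validator_metadata]
meta = {"validators": all_meta} if all_meta else None
```

With the first-wins version the "nsfw" entry is silently dropped; the aggregated version preserves both.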
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/api/routes/guardrails.py` around lines 186 - 190, The current
code uses next(...) to pick only the first non-empty validator_metadata from
validators and passes it as meta to APIResponse.success_response, which hides
other validators' metadata; replace that logic in the function that builds the
response (referencing validators, validator_metadata, meta, response_model and
the call to APIResponse.success_response) with aggregation: collect all
non-empty v.validator_metadata entries (e.g., into a list or merged dict
depending on schema), deduplicate/merge as appropriate, and pass the aggregated
metadata to APIResponse.success_response instead of the single-first value (or
explicitly document the first-wins behavior if aggregation is undesired).
backend/README.md (1)

215-215: Use descriptive link text instead of "here".

The link text "here" is non-descriptive. Screen readers and users benefit from meaningful link text that describes the destination.

📝 Suggested improvement
-1. Ensure that the `.env` file contains the correct value for `GUARDRAILS_HUB_API_KEY`. The key can be fetched from [here](https://hub.guardrailsai.com/keys).
+1. Ensure that the `.env` file contains the correct value for `GUARDRAILS_HUB_API_KEY`. The key can be fetched from the [Guardrails Hub API Keys page](https://hub.guardrailsai.com/keys).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/README.md` at line 215, Replace the non-descriptive link text "here"
with meaningful descriptive text that explains the destination (for example
"Guardrails Hub API keys page" or "Guardrails Hub API key management") in the
README sentence about GUARDRAILS_HUB_API_KEY so the link reads like "The key can
be fetched from Guardrails Hub API keys page" and improves accessibility and
clarity.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/api/API_USAGE.md`:
- Around line 103-107: Remove the unresolved merge conflict markers and update
the `type=` filter list to include `nsfw_text` (i.e., ensure the line reads with
`type=uli_slur_match|pii_remover|gender_assumption_bias|ban_list|llm_critic|topic_relevance|llamaguard_7b|profanity_free|nsfw_text`),
keeping the rest of the documentation text unchanged so the PR objective (adding
nsfw_text) is reflected.
- Around line 468-471: Resolve the unresolved merge conflict in the validator
types list by removing the Git conflict markers (<<<<<<<, =======, >>>>>>>) and
ensuring the `nsfw_text` entry is present in the list of validators; update the
block that currently contains the conflict markers so it reads a clean list that
includes `nsfw_text` alongside the other validator type names.

In `@backend/app/core/enum.py`:
- Around line 38-41: Remove the git conflict markers and restore the missing
enum member by keeping the line defining NSFWText = "nsfw_text" inside the
appropriate Enum (the ValidatorType/enum class in backend/app/core/enum.py);
delete the "<<<<<<< HEAD", "=======", and ">>>>>>> ..." lines so the file is
valid Python and the NSFWText member is present for
ValidatorBase/ValidatorUpdate and API validation to work.

In `@backend/app/core/validators/config/base_validator_config.py`:
- Around line 22-36: Remove the unresolved merge conflict markers and ensure the
branch that sets validator_metadata also returns the empty string; specifically,
inside the _on_fix method where validator_metadata is assigned (using self.type
in the reason), keep the version that includes the return "" immediately after
setting validator_metadata so _on_fix returns "" instead of falling through to
return the falsy fix_value; remove the conflict markers (<<<<<<<, =======,
>>>>>>>) and any duplicate blocks so the method is a single consistent
implementation.

In `@backend/app/core/validators/config/llamaguard_7b_safety_validator_config.py`:
- Around line 5-9: There is an unresolved merge conflict and an unused import:
remove the conflict markers (<<<<<<< HEAD, =======, >>>>>>> ...) and delete the
unused import of GuardrailOnFail so only the necessary import remains (keep
BaseValidatorConfig). Ensure the import block is clean and linter-safe and run
tests/linters to confirm no unused-import errors remain.

In `@backend/app/core/validators/README.md`:
- Around line 17-20: The Supported Validators section in
backend/app/core/validators/README.md contains unresolved merge conflict
markers; remove the conflict markers (<<<<<<<, =======, >>>>>>>) and restore the
`nsfw_text` entry (source: `hub://guardrails/nsfw_text`) so the README lists
`nsfw_text` as intended in that section.
- Around line 416-458: Resolve the unresolved merge conflict in the README's
NSFW Text Validator section by removing the conflict markers (<<<<<<< HEAD,
=======, >>>>>>> 60f5067) and keeping the HEAD content under the "NSFW Text
Validator (`nsfw_text`)" heading (the full explanatory block including Config,
Source, Parameters, Notes/limitations and Recommendation). Ensure there are no
leftover conflict markers and the section reads as a single cohesive
paragraph/block as in the HEAD version.
- Around line 541-545: Remove the unresolved merge conflict markers (<<<<<<<
HEAD, =======, >>>>>>> ...) from the Related Files section in the README and
ensure the file list contains the intended entries; replace the conflict block
with a clean list that includes both `nsfw_text_safety_validator_config.py` and
`profanity_free_safety_validator_config.py` (or only the correct one if project
intends a single entry), making the Related Files section syntactically valid
Markdown.

In `@backend/app/core/validators/validators.json`:
- Around line 42-50: Resolve the Git merge conflict in validators.json by
removing the conflict markers (<<<<<<< HEAD, =======, >>>>>>>) and keeping the
HEAD entry for the new validator object (the "nsfw_text" object with "type":
"nsfw_text", "version": "0.1.0", "source": "hub://guardrails/nsfw_text"); ensure
the surrounding JSON array/object punctuation is corrected (commas/braces) so
the file is valid JSON and the validators manifest loads successfully.
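Once resolved, the kept HEAD entry would read as follows (values taken verbatim from this comment):

```json
{
  "type": "nsfw_text",
  "version": "0.1.0",
  "source": "hub://guardrails/nsfw_text"
}
```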

In `@backend/app/schemas/guardrail_config.py`:
- Around line 30-38: There is an unresolved merge conflict in
guardrail_config.py around the imports: remove the conflict markers (<<<<<<<
HEAD and >>>>>>>) and restore the intended imports so both
NSFWTextSafetyValidatorConfig and ProfanityFreeSafetyValidatorConfig are
imported; ensure the import block contains NSFWTextSafetyValidatorConfig (as
requested) alongside ProfanityFreeSafetyValidatorConfig and that there are no
leftover conflict tokens.
- Around line 47-55: The discriminated union list for the validator configs has
an unresolved merge conflict that omitted NSFWTextSafetyValidatorConfig,
breaking recognition of payloads with type "nsfw_text"; fix by removing the
conflict markers and ensuring NSFWTextSafetyValidatorConfig is included
alongside LlamaGuard7BSafetyValidatorConfig, ProfanityFreeSafetyValidatorConfig,
and TopicRelevanceSafetyValidatorConfig in the union used with
Field(discriminator="type") so Pydantic can properly dispatch on type.
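The effect of the discriminated union can be illustrated with a toy dispatch table. The config classes below are stand-ins; Pydantic's `Field(discriminator="type")` performs the equivalent lookup automatically:

```python
# Toy dispatch-on-"type" showing why nsfw_text must appear in the union:
# payloads whose "type" is not registered cannot be parsed at all.
from dataclasses import dataclass


@dataclass
class NSFWTextConfig:
    type: str = "nsfw_text"
    threshold: float = 0.8


@dataclass
class ProfanityFreeConfig:
    type: str = "profanity_free"


REGISTRY = {
    "nsfw_text": NSFWTextConfig,
    "profanity_free": ProfanityFreeConfig,
}


def parse_validator(payload: dict):
    # Mirrors discriminated dispatch: unknown "type" values fail fast.
    cls = REGISTRY[payload["type"]]
    return cls(**payload)


item = parse_validator({"type": "nsfw_text", "threshold": 0.5})
```

Omitting `NSFWTextSafetyValidatorConfig` from the real union is the same as deleting the `"nsfw_text"` registry entry above: every nsfw_text payload is rejected.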

In `@backend/app/tests/test_guardrails_api_integration.py`:
- Around line 327-388: The file contains unresolved Git merge conflict markers
(<<<<<<<, =======, >>>>>>>) around the NSFWText test block which breaks pytest
import; remove the conflict markers and reconcile the duplicated sections so
only one coherent set of tests remains (keep the intended tests like
test_input_guardrails_with_nsfw_text_on_explicit_content,
test_input_guardrails_with_nsfw_text_with_low_threshold,
test_input_guardrails_with_nsfw_text_exception_action and any other tests
referenced later), ensuring there are no leftover markers elsewhere (notably the
similar region flagged at lines ~438-537) and that the module imports and test
function names are intact.
- Around line 349-366: The test
test_input_guardrails_with_nsfw_text_with_low_threshold currently uses the same
explicit prompt as the default-threshold test so it doesn't prove threshold
handling; replace the input with a borderline/ambiguous NSFW sample (e.g.,
suggestive but not explicit) and assert that the default-threshold validator
(the other test) returns success while this test, which posts the same
borderline input but with
{"type":"nsfw_text","threshold":0.1,"on_fail":"exception"} to VALIDATE_API_PATH
(using the same request_id, organization_id, project_id) returns failure —
ensuring different outcomes prove the threshold parameter is respected.
- Around line 229-245: The test
test_input_guardrails_with_profanity_free_on_profane_text currently only asserts
that the sanitized value differs from the original, which can hide a broken fix
path where fix_value falls back to "" in base_validator_config.py; update the
assertions to (1) assert that body["data"][SAFE_TEXT_FIELD] is a non-empty
string and (2) assert that known profane tokens from the original input (e.g.,
"damn", "fucking") are not present in body["data"][SAFE_TEXT_FIELD] so the fix
is both usable and actually removed profanities.

In `@backend/app/tests/test_toxicity_hub_validators.py`:
- Around line 13-27: Remove the unresolved merge markers and restore the HEAD
version: keep the import of NSFWTextSafetyValidatorConfig and the _NSFW_PATCH
assignment, retain the _LLAMAGUARD_PATCH constant, and delete the conflict lines
(<<<<<<<, =======, >>>>>>>). Ensure the module defines
NSFWTextSafetyValidatorConfig, _LLAMAGUARD_PATCH (value
"app.core.validators.config.llamaguard_7b_safety_validator_config.LlamaGuard7B")
and _NSFW_PATCH (value
"app.core.validators.config.nsfw_text_safety_validator_config.NSFWText") exactly
as in HEAD so imports and patch constants resolve.
- Around line 294-451: Resolve the merge conflict by removing the conflict
markers and keeping the HEAD version of the TestNSFWTextSafetyValidatorConfig
class (delete the >>>>>>> and <<<<<<< blocks and the redundant empty region),
and update the failing expectation in test_build_with_defaults to assert
model_name == "textdetox/xlmr-large-toxicity-classifier" so the test matches the
config default; ensure all tests in TestNSFWTextSafetyValidatorConfig (including
test_build_with_defaults) refer to the correct default model_name and that there
are no leftover conflict markers anywhere in that class.

In `@backend/pyproject.toml`:
- Around line 53-61: Resolve the merge conflict in pyproject.toml so the
CPU-only PyTorch source config is activated: ensure the dependency line for
torch remains as "torch>=2.0.0" in the [tool.poetry.dependencies] (or equivalent
dependencies section) and keep the [tool.uv.sources] and [[tool.uv.index]]
blocks intact (the entries torch = [{ index = "pytorch-cpu" }] and the
pytorch-cpu index URL). Remove conflict markers and ensure only the incoming
CPU-wheel index configuration and the torch>=2.0.0 dependency coexist so the
project pulls CPU-only wheels.
- Around line 35-39: The pyproject.toml contains unresolved git conflict markers
(<<<<<<< HEAD, =======, >>>>>>>) around the dependency block; remove the
conflict markers and restore the required dependencies by keeping
"transformers>=5.0.0" and "torch>=2.0.0" in the [project] / dependencies list so
dependency resolution succeeds and the Dockerfile and NSFWText validator that
rely on transformers/torch continue to work.

---

Nitpick comments:
In `@backend/app/api/routes/guardrails.py`:
- Around line 186-190: The current code uses next(...) to pick only the first
non-empty validator_metadata from validators and passes it as meta to
APIResponse.success_response, which hides other validators' metadata; replace
that logic in the function that builds the response (referencing validators,
validator_metadata, meta, response_model and the call to
APIResponse.success_response) with aggregation: collect all non-empty
v.validator_metadata entries (e.g., into a list or merged dict depending on
schema), deduplicate/merge as appropriate, and pass the aggregated metadata to
APIResponse.success_response instead of the single-first value (or explicitly
document the first-wins behavior if aggregation is undesired).

In `@backend/README.md`:
- Line 215: Replace the non-descriptive link text "here" with meaningful
descriptive text that explains the destination (for example "Guardrails Hub API
keys page" or "Guardrails Hub API key management") in the README sentence about
GUARDRAILS_HUB_API_KEY so the link reads like "The key can be fetched from
Guardrails Hub API keys page" and improves accessibility and clarity.

In `@backend/scripts/install_guardrails_from_hub.sh`:
- Line 9: The default for ENABLE_REMOTE_INFERENCING was changed to "true", which
causes user data to be sent to Guardrails AI remote endpoints by default; either
revert the default back to "false" or, if leaving it enabled intentionally for
NSFW/toxicity validators, add a clear comment above the
ENABLE_REMOTE_INFERENCING declaration and update relevant docs explaining
privacy/latency/cost implications and that enabling remote inferencing sends
user input to external services—refer to the ENABLE_REMOTE_INFERENCING variable
in this script and the install_guardrails_from_hub.sh header for where to add
the explanatory comment and link to documentation or opt-in instructions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0055de2d-430a-49bd-94b5-0c5735342722

📥 Commits

Reviewing files that changed from the base of the PR and between 60f5067 and 59dcd2d.

📒 Files selected for processing (17)
  • backend/Dockerfile
  • backend/README.md
  • backend/app/api/API_USAGE.md
  • backend/app/api/docs/guardrails/run_guardrails.md
  • backend/app/api/routes/guardrails.py
  • backend/app/core/enum.py
  • backend/app/core/validators/README.md
  • backend/app/core/validators/config/base_validator_config.py
  • backend/app/core/validators/config/llamaguard_7b_safety_validator_config.py
  • backend/app/core/validators/config/nsfw_text_safety_validator_config.py
  • backend/app/core/validators/config/profanity_free_safety_validator_config.py
  • backend/app/core/validators/validators.json
  • backend/app/schemas/guardrail_config.py
  • backend/app/tests/test_guardrails_api_integration.py
  • backend/app/tests/test_toxicity_hub_validators.py
  • backend/pyproject.toml
  • backend/scripts/install_guardrails_from_hub.sh

Co-authored-by: dennyabrain <denny.george90@gmail.com>
@rkritika1508 rkritika1508 force-pushed the feat/toxicity-huggingface-model branch from 59dcd2d to 31af2f6 Compare April 10, 2026 15:15
rkritika1508 and others added 3 commits April 10, 2026 20:48
Resolved conflicts keeping nsfw_text validator features from this branch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…fier

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@rkritika1508 rkritika1508 force-pushed the feat/toxicity-huggingface-model branch from 354cb64 to 5b2fe3b Compare April 10, 2026 15:30
rkritika1508 and others added 2 commits April 10, 2026 21:02
…t model

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@rkritika1508
Collaborator Author

@coderabbitai full review


coderabbitai bot commented Apr 10, 2026

✅ Actions performed

Full review triggered.


Labels

enhancement New feature or request ready-for-review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add toxicity validator for HuggingFace models

3 participants