feat: add Agent Threat Rules (ATR) detection library rail#1996
feat: add Agent Threat Rules (ATR) detection library rail#1996Oxygen56 wants to merge 5 commits into
Conversation
Implements a new input rail for detecting agent-specific threats following the ATR framework. The rail evaluates user messages against the bundled ATR rule set via the pyatr package. - Lazy-imports pyatr with clear install instructions - Configurable severity threshold (default: critical, high) - Colang 1.0 and 2.0 flow definitions - 16 tests (13 unit + 3 E2E) with mocked pyatr engine Fixes NVIDIA-NeMo#1991 Signed-off-by: Oxygen56 <1391083091@qq.com> Signed-off-by: Oxygen <1391083091@qq.com>
📝 WalkthroughWalkthroughThis PR introduces a new Agent Threat Rules (ATR) detection library rail that scans user input for AI-agent attacks using the optional ChangesATR Detection Library Rail
Sequence DiagramsequenceDiagram
participant User
participant Flow as atr check input
participant Action as atr_detection
participant Engine as ATREngine
participant Response
User->>Flow: user input
Flow->>Action: evaluate text
Action->>Engine: create AgentEvent & evaluate
Engine-->>Action: severity-labeled matches
Action->>Action: filter by config severities
alt is_threat == true
alt exceptions enabled
Action-->>Response: ATRDetectionRailException
else
Action->>User: "Threat detected: rule1, rule2"
Flow->>Flow: stop
end
else
Flow->>Response: continue processing
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 6✅ Passed checks (6 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@nemoguardrails/library/atr/actions.py`:
- Around line 149-150: The loop that validates severities currently calls
sev.lower() unguarded and will raise AttributeError for non-string items; update
the validation (in the loop that iterates "for sev in severities" in
nemoguardrails/library/atr/actions.py) to first check that sev is a str (e.g.,
if not isinstance(sev, str): raise ValueError(...)) before calling sev.lower(),
then verify sev.lower() is in VALID_SEVERITIES and raise a ValueError with a
clear message if not.
In `@tests/test_atr_detection.py`:
- Around line 281-309: The tests simulate user input but never execute the
bot/app generation, so ATR blocking/pass-through isn't actually validated;
update both tests (using TestChat) to call the code path that triggers ATR
evaluation (e.g., invoke chat.bot(...) or chat.app.generate(...) after queuing
input) instead of only using chat >> "…", ensuring the mocked
_ATREngine.evaluate is exercised and the response is produced and asserted
(refer to TestChat, chat.bot, chat.app.generate and _ATREngine.evaluate to
locate the spots to change).
- Around line 284-286: The tests patch only _ATREngine but _evaluate_atr() also
constructs _AgentEvent, which is None when pyatr isn't installed; update the E2E
test blocks (the ones currently patching "_ATREngine" in
tests/test_atr_detection.py around the blocks at lines ~284, ~296, ~327) to
patch "_AgentEvent" as well (e.g., with
patch("nemoguardrails.library.atr.actions._AgentEvent") and set its return_value
appropriately) so that both _ATREngine and _AgentEvent are mocked in pyatr-less
environments and event creation won't raise.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2e5da617-cc6d-473a-98d2-3ac844dd08cd
📒 Files selected for processing (5)
nemoguardrails/library/atr/__init__.pynemoguardrails/library/atr/actions.pynemoguardrails/library/atr/flows.conemoguardrails/library/atr/flows.v1.cotests/test_atr_detection.py
Greptile SummaryAdds an Agent Threat Rules (ATR) input rail that evaluates user messages against the
|
| Filename | Overview |
|---|---|
| nemoguardrails/library/atr/actions.py | Core ATR detection action with lazy pyatr import, severity filtering, and module-level engine cache; has a missing-config hard-error that differs from peer rails, and an unguarded engine-init race under thread-pool executors. |
| nemoguardrails/library/atr/flows.co | Colang 1.0 input-rail flow; declares variables with $ sigil but Jinja template references them without $, diverging from the injection_detection reference pattern and risking an empty rule-list in the blocked message. |
| nemoguardrails/library/atr/flows.v1.co | Colang 2.0 flow; correctly uses $ variables and bot say; mirrors the injection_detection pattern cleanly. |
| nemoguardrails/rails/llm/config.py | Adds extra="allow" to RailsConfigData so arbitrary rail config keys (like atr_detection) are accepted without a Pydantic field; minimal and safe change. |
| tests/test_atr_detection.py | Comprehensive unit and E2E tests with full pyatr mocking; covers config validation, severity filtering, and blocked/allowed message paths. |
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 3
nemoguardrails/library/atr/actions.py:127-145
**`ValueError` on missing config prevents opt-in use of the action**
`_validate_atr_config` raises `ValueError` when the `atr_detection` section is absent from the config, making the section mandatory. Every other rail in the library (e.g. `jailbreak_detection`, `injection_detection`) silently uses defaults when no config section is present. A user who wants to try ATR detection without any YAML edits will get a confusing `ValueError` instead of working detection with sensible defaults. Consider returning `DEFAULT_SEVERITIES` when the section is missing rather than treating its absence as a hard error.
### Issue 2 of 3
nemoguardrails/library/atr/actions.py:289-290
The `_cached_engine` global is read and written without any synchronisation. In environments that run the async action inside a thread-pool executor, two threads can both observe `_cached_engine is None` simultaneously and both construct a new `_ATREngine()`, discarding one after loading rules from disk. This is wasteful and potentially inconsistent if the rule-load itself is stateful. A `threading.Lock` around the creation block eliminates the race.
```suggestion
if _cached_engine is None:
import threading
_engine_lock = getattr(atr_detection, "_engine_lock", None)
if _engine_lock is None:
atr_detection._engine_lock = threading.Lock()
with atr_detection._engine_lock:
if _cached_engine is None:
_cached_engine = _ATREngine()
```
### Issue 3 of 3
nemoguardrails/library/atr/flows.co:7-8
The Jinja template refers to `response.detections` but the Colang variable is `$response`. The `injection_detection/flows.co` reference implementation declares variables without the `$` sigil (`response = await ...`) and uses the same template. If the Colang 1.0 runtime only populates the Jinja context from non-`$` locals, the template will silently produce an empty string for the rule list. Aligning the declaration style with the reference avoids this ambiguity.
```suggestion
response = await ATRDetectionAction(text=$user_message)
join_separator = ", "
```
Reviews (5): Last reviewed commit: "fix: handle AttributeError when pyatr AP..." | Re-trigger Greptile
- Colang 2.0: use instead of bare for refusal message (silent failure in Colang 2.0 otherwise) - Make DEFAULT_SEVERITIES and VALID_SEVERITIES frozenset (immutable) - Add isinstance(str) guard before calling sev.lower() in validation - Remove unreachable None-branch in _extract_atr_config; guard defensively for callers that bypass validation - Cache ATREngine at module level to avoid per-request rule loading - E2E tests: also patch _AgentEvent so tests work in pyatr-less CI - Add _reset_cache fixture for test isolation with cached engine Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
All four issues from the Greptile review have been addressed in commit
The review was based on commit |
…t assertions - Add missing Set import to typing imports (line 35 of actions.py) - Fix test_clean_input_passes_through to actually assert pass-through behavior using chat << instead of bare chat >> - Fix test_threat_input_is_blocked to use chat << for proper execution instead of checking chat.history[-1] which is never populated Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Add model_config = ConfigDict(extra="allow") to RailsConfigData so extra fields like atr_detection are preserved and accessible via getattr - Auto-fix ruff/ruff-format issues: unused imports, line length, import ordering, with-statement formatting
Summary
Implements the Agent Threat Rules (ATR) detection library rail as proposed in #1991.
Adds a new input rail that evaluates user messages against the bundled ATR rule set via the
pyatrpackage, covering: prompt injection, jailbreak, tool poisoning, MCP attacks, skill compromise, and more.Design
injection_detectionrail patternpyatrwith a clearpip install pyatrhintrails.config.atr_detection.severities(default: critical, high)enable_rails_exceptionsis true: raisesATRDetectionRailExceptionFiles
nemoguardrails/library/atr/__init__.py— License headernemoguardrails/library/atr/actions.py— Core detection action with config validation, severity filtering, lazy pyatr importnemoguardrails/library/atr/flows.co— Colang 1.0 input-rail flownemoguardrails/library/atr/flows.v1.co— Colang 2.0 input-rail flowtests/test_atr_detection.py— 16 tests (13 unit + 3 E2E)Fixes #1991
Summary by CodeRabbit
Release Notes
New Features
Tests