openai
diff --git a/docs/benchmarking/NSFW_roc_curve.png b/docs/benchmarking/NSFW_roc_curve.png
193 KB
diff --git a/docs/benchmarking/alignment_roc_curves.png b/docs/benchmarking/alignment_roc_curves.png
-24 KB
diff --git a/docs/benchmarking/nsfw.md b/docs/benchmarking/nsfw.md
Lines changed: 0 additions & 31 deletions
diff --git a/docs/ref/checks/nsfw.md b/docs/ref/checks/nsfw.md
Lines changed: 6 additions & 4 deletions
diff --git a/docs/ref/checks/prompt_injection_detection.md b/docs/ref/checks/prompt_injection_detection.md
Lines changed: 10 additions & 10 deletions
diff --git a/mkdocs.yml b/mkdocs.yml
Lines changed: 2 additions & 1 deletion
@@ -82,10 +82,12 @@ This benchmark evaluates model performance on a balanced set of social media pos
 
 | Model         | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
 |--------------|---------|-------------|-------------|-------------|-----------------|
-| gpt-4.1      | 0.989   | 0.976       | 0.962       | 0.962       | 0.717           |
-| gpt-4.1-mini (default) | 0.984   | 0.977       | 0.977       | 0.943       | 0.653           |
-| gpt-4.1-nano | 0.952   | 0.972       | 0.823       | 0.823       | 0.429           |
-| gpt-4o-mini  | 0.965   | 0.977       | 0.955       | 0.945       | 0.842           |
+| gpt-5        | 0.9532  | 0.9195      | 0.9096      | 0.9068      | 0.0339          |
+| gpt-5-mini   | 0.9629  | 0.9321      | 0.9168      | 0.9149      | 0.0998          |
+| gpt-5-nano   | 0.9600  | 0.9297      | 0.9216      | 0.9175      | 0.1078          |
+| gpt-4.1      | 0.9603  | 0.9312      | 0.9249      | 0.9192      | 0.0439          |
+| gpt-4.1-mini (default) | 0.9520  | 0.9180      | 0.9130      | 0.9049      | 0.0459          |
+| gpt-4.1-nano | 0.9502  | 0.9262      | 0.9094      | 0.9043      | 0.0379          |
 
 **Notes:**
 
 
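The table columns above are threshold-sweep metrics. The diff does not say how they were computed, so the following is only a minimal sketch under a common convention: `Prec@R=r` is taken as the best precision among thresholds whose recall is at least `r`, and `Recall@FPR=f` as the best recall among thresholds whose false positive rate is at most `f`. The function names (`sweep_thresholds`, `precision_at_recall`, `recall_at_fpr`) are hypothetical and not part of the documented library.

```python
from typing import List, Sequence, Tuple


def sweep_thresholds(
    labels: Sequence[int], scores: Sequence[float]
) -> List[Tuple[float, float, float]]:
    """Return (precision, recall, fpr) at every distinct score threshold.

    A sample is predicted positive when its score is >= the threshold.
    Labels are 1 (positive) or 0 (negative).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    # Walk the samples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before recording a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((tp / (tp + fp), tp / positives, fp / negatives))
    return points


def precision_at_recall(
    labels: Sequence[int], scores: Sequence[float], target_recall: float
) -> float:
    """Best precision among thresholds with recall >= target_recall (assumed definition)."""
    points = sweep_thresholds(labels, scores)
    return max(p for p, r, _ in points if r >= target_recall)


def recall_at_fpr(
    labels: Sequence[int], scores: Sequence[float], max_fpr: float
) -> float:
    """Best recall among thresholds with FPR <= max_fpr (assumed definition)."""
    points = sweep_thresholds(labels, scores)
    return max((r for _, r, f in points if f <= max_fpr), default=0.0)


if __name__ == "__main__":
    # Made-up toy data, not taken from the benchmark.
    toy_labels = [1, 1, 0, 1, 0, 0, 1, 0]
    toy_scores = [0.95, 0.9, 0.85, 0.8, 0.6, 0.4, 0.3, 0.1]
    print(precision_at_recall(toy_labels, toy_scores, 0.80))
    print(recall_at_fpr(toy_labels, toy_scores, 0.01))
```

Under this definition, `Recall@FPR=0.01` drops to zero whenever the highest-scoring sample is a negative and the negative set is too small for one false positive to stay under the 1% budget.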
@@ -65,6 +65,7 @@ Returns a `GuardrailResult` with the following `info` dictionary:
     "observation": "The assistant is calling get_weather function with location parameter",
     "flagged": false,
     "confidence": 0.1,
+    "evidence": null,
     "threshold": 0.7,
     "user_goal": "What's the weather in Tokyo?",
     "action": [
@@ -81,6 +82,7 @@ Returns a `GuardrailResult` with the following `info` dictionary:
 - **`observation`**: What the AI action is doing
 - **`flagged`**: Whether the action is misaligned (boolean)
 - **`confidence`**: Confidence score (0.0 to 1.0) that the action is misaligned
+- **`evidence`**: Specific evidence from conversation history that supports the decision (null when aligned)
 - **`threshold`**: The confidence threshold that was configured
 - **`user_goal`**: The tracked user intent from conversation
 - **`action`**: The list of function calls or tool outputs analyzed for alignment
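As a rough illustration of how the fields listed above might be consumed, here is a minimal sketch that formats an `info` dictionary into a one-line summary. It works on a plain `dict` shaped like the documented example and does not call the guardrails library. The helper name `describe_alignment_check` is hypothetical, and the `action` list is left out because its contents are elided in the diff.

```python
from typing import Any, Dict


def describe_alignment_check(info: Dict[str, Any]) -> str:
    """Build a one-line summary from a prompt-injection `info` dictionary."""
    status = "MISALIGNED" if info["flagged"] else "aligned"
    summary = (
        f"{status} (confidence {info['confidence']:.2f}, "
        f"threshold {info['threshold']:.2f}): {info['observation']}"
    )
    # `evidence` is documented as null (None in Python) when the action is aligned.
    evidence = info.get("evidence")
    if evidence is not None:
        summary += f" | evidence: {evidence}"
    return summary


if __name__ == "__main__":
    # Values copied from the documented example; `action` omitted (elided in the source).
    example_info = {
        "observation": "The assistant is calling get_weather function with location parameter",
        "flagged": False,
        "confidence": 0.1,
        "evidence": None,
        "threshold": 0.7,
        "user_goal": "What's the weather in Tokyo?",
    }
    print(describe_alignment_check(example_info))
```

Reading `evidence` with `.get` keeps the helper tolerant of results produced before the field was added.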
@@ -92,10 +94,8 @@ Returns a `GuardrailResult` with the following `info` dictionary:
 
 This benchmark evaluates model performance on agent conversation traces:
 
-- **Synthetic dataset**: 1,000 samples with 500 positive cases (50% prevalence) simulating realistic agent traces
-- **AgentDojo dataset**: 1,046 samples from AgentDojo's workspace, travel, banking, and Slack suite combined with the "important_instructions" attack (949 positive cases, 97 negative samples)
-- **Test scenarios**: Multi-turn conversations with function calls and tool outputs across realistic workplace domains
-- **Misalignment examples**: Unrelated function calls, harmful operations, and data leakage
+- **[AgentDojo dataset](https://github.com/ethz-spylab/agentdojo)**: 1,046 samples generated by running AgentDojo's benchmark script on the workspace, travel, banking, and Slack suites combined with the "important_instructions" attack (949 positive cases, 97 negative samples)
+- **Internal synthetic dataset**: 537 positive cases simulating realistic, multi-turn agent conversation traces
 
 **Example of misaligned conversation:**
 
@@ -113,12 +113,12 @@ This benchmark evaluates model performance on agent conversation traces:
 
 | Model         | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
 |---------------|---------|-------------|-------------|-------------|-----------------|
-| gpt-5         | 0.9604  | 0.998       | 0.995       | 0.963       | 0.431           |
-| gpt-5-mini    | 0.9796  | 0.999       | 0.999       | 0.966       | 0.000           |
-| gpt-5-nano    | 0.8651  | 0.963       | 0.963       | 0.951       | 0.056           |
-| gpt-4.1       | 0.9846  | 0.998       | 0.998       | 0.998       | 0.000           |
-| gpt-4.1-mini (default) | 0.9728  | 0.995       | 0.995       | 0.995       | 0.000           |
-| gpt-4.1-nano  | 0.8677  | 0.974       | 0.974       | 0.974       | 0.000           |
+| gpt-5         | 0.9931  | 0.9992      | 0.9992      | 0.9992      | 0.5845          |
+| gpt-5-mini    | 0.9536  | 0.9951      | 0.9951      | 0.9951      | 0.0000          |
+| gpt-5-nano    | 0.9283  | 0.9913      | 0.9913      | 0.9717      | 0.0350          |
+| gpt-4.1       | 0.9794  | 0.9973      | 0.9973      | 0.9973      | 0.0000          |
+| gpt-4.1-mini (default) | 0.9865  | 0.9986      | 0.9986      | 0.9986      | 0.0000          |
+| gpt-4.1-nano  | 0.9142  | 0.9948      | 0.9948      | 0.9387      | 0.0000          |
 
 **Notes:**
 
 
@@ -38,13 +38,14 @@ nav:
     - "Streaming vs Blocking": streaming_output.md
     - Tripwires: tripwires.md
     - Checks:
-        - Prompt Injection Detection: ref/checks/prompt_injection_detection.md
         - Contains PII: ref/checks/pii.md
         - Custom Prompt Check: ref/checks/custom_prompt_check.md
         - Hallucination Detection: ref/checks/hallucination_detection.md
         - Jailbreak Detection: ref/checks/jailbreak.md
         - Moderation: ref/checks/moderation.md
+        - NSFW: ref/checks/nsfw.md
         - Off Topic Prompts: ref/checks/off_topic_prompts.md
+        - Prompt Injection Detection: ref/checks/prompt_injection_detection.md
         - URL Filter: ref/checks/urls.md
     - Evaluation Tool: evals.md
   - API Reference: