Commit 2057972

Merge branch 'main' into dev/steven/pii_update
2 parents 7a0a13f + c8cae92 commit 2057972

14 files changed: +480 -144 lines changed

docs/benchmarking/nsfw.md

Lines changed: 0 additions & 31 deletions
This file was deleted.

docs/ref/checks/nsfw.md

Lines changed: 6 additions & 4 deletions
@@ -82,10 +82,12 @@ This benchmark evaluates model performance on a balanced set of social media pos

 | Model | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
 |--------------|---------|-------------|-------------|-------------|-----------------|
-| gpt-4.1 | 0.989 | 0.976 | 0.962 | 0.962 | 0.717 |
-| gpt-4.1-mini (default) | 0.984 | 0.977 | 0.977 | 0.943 | 0.653 |
-| gpt-4.1-nano | 0.952 | 0.972 | 0.823 | 0.823 | 0.429 |
-| gpt-4o-mini | 0.965 | 0.977 | 0.955 | 0.945 | 0.842 |
+| gpt-5 | 0.9532 | 0.9195 | 0.9096 | 0.9068 | 0.0339 |
+| gpt-5-mini | 0.9629 | 0.9321 | 0.9168 | 0.9149 | 0.0998 |
+| gpt-5-nano | 0.9600 | 0.9297 | 0.9216 | 0.9175 | 0.1078 |
+| gpt-4.1 | 0.9603 | 0.9312 | 0.9249 | 0.9192 | 0.0439 |
+| gpt-4.1-mini (default) | 0.9520 | 0.9180 | 0.9130 | 0.9049 | 0.0459 |
+| gpt-4.1-nano | 0.9502 | 0.9262 | 0.9094 | 0.9043 | 0.0379 |

 **Notes:**
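
Reviewer aside on the table columns: the sketch below shows one way the `Prec@R` and `Recall@FPR` columns could be computed from raw scores. It assumes the usual definitions (best precision among thresholds that reach the target recall, and best recall among thresholds that keep the false positive rate under the cap). The function names and the toy data are hypothetical and are not taken from the repository's evaluation tool.

```python
def _counts(labels, scores, threshold):
    """Return (true positives, false positives) when predicting score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    return tp, fp


def precision_at_recall(labels, scores, target_recall):
    """Best precision over all thresholds whose recall is at least target_recall."""
    positives = sum(labels)
    best = 0.0
    for threshold in sorted(set(scores)):
        tp, fp = _counts(labels, scores, threshold)
        if tp + fp > 0 and tp / positives >= target_recall:
            best = max(best, tp / (tp + fp))
    return best


def recall_at_fpr(labels, scores, max_fpr):
    """Best recall over all thresholds whose false positive rate is at most max_fpr."""
    positives = sum(labels)
    negatives = len(labels) - positives
    best = 0.0
    for threshold in sorted(set(scores)):
        tp, fp = _counts(labels, scores, threshold)
        if fp / negatives <= max_fpr:
            best = max(best, tp / positives)
    return best


# Toy data (hypothetical): 4 positive and 4 negative samples.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

print(precision_at_recall(labels, scores, 0.80))  # 0.8
print(recall_at_fpr(labels, scores, 0.01))        # 0.75
```

Under these definitions, a low `Recall@FPR=0.01` alongside a high ROC AUC is possible when a few negatives receive very high scores, since the 1% cap forces the threshold above them.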

docs/ref/checks/prompt_injection_detection.md

Lines changed: 10 additions & 10 deletions
@@ -65,6 +65,7 @@ Returns a `GuardrailResult` with the following `info` dictionary:
 "observation": "The assistant is calling get_weather function with location parameter",
 "flagged": false,
 "confidence": 0.1,
+"evidence": null,
 "threshold": 0.7,
 "user_goal": "What's the weather in Tokyo?",
 "action": [
@@ -81,6 +82,7 @@ Returns a `GuardrailResult` with the following `info` dictionary:
 - **`observation`**: What the AI action is doing
 - **`flagged`**: Whether the action is misaligned (boolean)
 - **`confidence`**: Confidence score (0.0 to 1.0) that the action is misaligned
+- **`evidence`**: Specific evidence from conversation history that supports the decision (null when aligned)
 - **`threshold`**: The confidence threshold that was configured
 - **`user_goal`**: The tracked user intent from conversation
 - **`action`**: The list of function calls or tool outputs analyzed for alignment
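
Reviewer aside on the new `evidence` field: because it is null when the action is aligned, consumers of the `info` dictionary need to handle the missing case. The sketch below is a minimal illustration that works on a plain dictionary shaped like the documented example. `summarize_info` is a hypothetical helper, not part of the library, and the flagged sample values are invented for illustration.

```python
from typing import Any


def summarize_info(info: dict[str, Any]) -> str:
    """Build a one-line log message from an `info` dictionary (hypothetical helper)."""
    status = "misaligned" if info.get("flagged") else "aligned"
    message = (
        f"{status} (confidence {info['confidence']:.2f}, "
        f"threshold {info['threshold']:.2f})"
    )
    # `evidence` is documented as null (None in Python) when the action is aligned.
    evidence = info.get("evidence")
    if evidence is not None:
        message += f"; evidence: {evidence}"
    return message


# Shaped like the documented example; `action` is omitted for brevity.
aligned = {
    "observation": "The assistant is calling get_weather function with location parameter",
    "flagged": False,
    "confidence": 0.1,
    "evidence": None,
    "threshold": 0.7,
    "user_goal": "What's the weather in Tokyo?",
}

# Invented values, for illustration only.
flagged = {
    "flagged": True,
    "confidence": 0.9,
    "evidence": "tool output asked the assistant to forward emails",
    "threshold": 0.7,
}

print(summarize_info(aligned))  # aligned (confidence 0.10, threshold 0.70)
print(summarize_info(flagged))
```

Using `info.get("evidence")` also keeps the helper working against results produced before this change, where the key is absent.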
@@ -92,10 +94,8 @@ Returns a `GuardrailResult` with the following `info` dictionary:

 This benchmark evaluates model performance on agent conversation traces:

-- **Synthetic dataset**: 1,000 samples with 500 positive cases (50% prevalence) simulating realistic agent traces
-- **AgentDojo dataset**: 1,046 samples from AgentDojo's workspace, travel, banking, and Slack suite combined with the "important_instructions" attack (949 positive cases, 97 negative samples)
-- **Test scenarios**: Multi-turn conversations with function calls and tool outputs across realistic workplace domains
-- **Misalignment examples**: Unrelated function calls, harmful operations, and data leakage
+- **[AgentDojo dataset](https://github.com/ethz-spylab/agentdojo)**: 1,046 samples generated from running AgentDojo's benchmark script on workspace, travel, banking, and Slack suite combined with the "important_instructions" attack (949 positive cases, 97 negative samples)
+- **Internal synthetic dataset**: 537 positive cases simulating realistic, multi-turn agent conversation traces

 **Example of misaligned conversation:**

@@ -113,12 +113,12 @@ This benchmark evaluates model performance on agent conversation traces:

 | Model | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
 |---------------|---------|-------------|-------------|-------------|-----------------|
-| gpt-5 | 0.9604 | 0.998 | 0.995 | 0.963 | 0.431 |
-| gpt-5-mini | 0.9796 | 0.999 | 0.999 | 0.966 | 0.000 |
-| gpt-5-nano | 0.8651 | 0.963 | 0.963 | 0.951 | 0.056 |
-| gpt-4.1 | 0.9846 | 0.998 | 0.998 | 0.998 | 0.000 |
-| gpt-4.1-mini (default) | 0.9728 | 0.995 | 0.995 | 0.995 | 0.000 |
-| gpt-4.1-nano | 0.8677 | 0.974 | 0.974 | 0.974 | 0.000 |
+| gpt-5 | 0.9931 | 0.9992 | 0.9992 | 0.9992 | 0.5845 |
+| gpt-5-mini | 0.9536 | 0.9951 | 0.9951 | 0.9951 | 0.0000 |
+| gpt-5-nano | 0.9283 | 0.9913 | 0.9913 | 0.9717 | 0.0350 |
+| gpt-4.1 | 0.9794 | 0.9973 | 0.9973 | 0.9973 | 0.0000 |
+| gpt-4.1-mini (default) | 0.9865 | 0.9986 | 0.9986 | 0.9986 | 0.0000 |
+| gpt-4.1-nano | 0.9142 | 0.9948 | 0.9948 | 0.9387 | 0.0000 |

 **Notes:**

mkdocs.yml

Lines changed: 2 additions & 1 deletion
@@ -38,13 +38,14 @@ nav:
 - "Streaming vs Blocking": streaming_output.md
 - Tripwires: tripwires.md
 - Checks:
-- Prompt Injection Detection: ref/checks/prompt_injection_detection.md
 - Contains PII: ref/checks/pii.md
 - Custom Prompt Check: ref/checks/custom_prompt_check.md
 - Hallucination Detection: ref/checks/hallucination_detection.md
 - Jailbreak Detection: ref/checks/jailbreak.md
 - Moderation: ref/checks/moderation.md
+- NSFW: ref/checks/nsfw.md
 - Off Topic Prompts: ref/checks/off_topic_prompts.md
+- Prompt Injection Detection: ref/checks/prompt_injection_detection.md
 - URL Filter: ref/checks/urls.md
 - Evaluation Tool: evals.md
 - API Reference:
