
Commit 9a5a72b

Merge branch 'main' of github.com:openai/openai-guardrails-python into dev/steven/pii_updates

2 parents: d992fc5 + d2ba595

26 files changed: +163 −82 lines
Binary files changed: −9.25 KB, −46.4 KB, −89.2 KB, −80.3 KB

docs/evals.md

Lines changed: 1 addition & 1 deletion

@@ -27,7 +27,7 @@ guardrails-evals \
   --config-path guardrails_config.json \
   --dataset-path data.jsonl \
   --mode benchmark \
-  --models gpt-5 gpt-5-mini gpt-5-nano
+  --models gpt-5 gpt-5-mini
 ```

 Test with included demo files in our [github repository](https://github.com/openai/openai-guardrails-python/tree/main/src/guardrails/evals/eval_demo)

docs/ref/checks/hallucination_detection.md

Lines changed: 2 additions & 16 deletions

@@ -173,10 +173,8 @@ The statements cover various types of factual claims including:
 |--------------|---------|-------------|-------------|-------------|
 | gpt-5 | 0.854 | 0.732 | 0.686 | 0.670 |
 | gpt-5-mini | 0.934 | 0.813 | 0.813 | 0.770 |
-| gpt-5-nano | 0.566 | 0.540 | 0.540 | 0.533 |
 | gpt-4.1 | 0.870 | 0.785 | 0.785 | 0.785 |
 | gpt-4.1-mini (default) | 0.876 | 0.806 | 0.789 | 0.789 |
-| gpt-4.1-nano | 0.537 | 0.526 | 0.526 | 0.526 |

 **Notes:**
 - ROC AUC: Area under the ROC curve (higher is better)

@@ -190,10 +188,8 @@ The following table shows latency measurements for each model using the hallucin
 |--------------|--------------|--------------|
 | gpt-5 | 34,135 | 525,854 |
 | gpt-5-mini | 23,013 | 59,316 |
-| gpt-5-nano | 17,079 | 26,317 |
 | gpt-4.1 | 7,126 | 33,464 |
 | gpt-4.1-mini (default) | 7,069 | 43,174 |
-| gpt-4.1-nano | 4,809 | 6,869 |

 - **TTC P50**: Median time to completion (50% of requests complete within this time)
 - **TTC P95**: 95th percentile time to completion (95% of requests complete within this time)

@@ -215,10 +211,8 @@ In addition to the above evaluations which use a 3 MB sized vector store, the ha
 |--------------|---------------------|----------------------|---------------------|---------------------------|
 | gpt-5 | 28,762 / 396,472 | 34,135 / 525,854 | 37,104 / 75,684 | 40,909 / 645,025 |
 | gpt-5-mini | 19,240 / 39,526 | 23,013 / 59,316 | 24,217 / 65,904 | 37,314 / 118,564 |
-| gpt-5-nano | 13,436 / 22,032 | 17,079 / 26,317 | 17,843 / 35,639 | 21,724 / 37,062 |
 | gpt-4.1 | 7,437 / 15,721 | 7,126 / 33,464 | 6,993 / 30,315 | 6,688 / 127,481 |
 | gpt-4.1-mini (default) | 6,661 / 14,827 | 7,069 / 43,174 | 7,032 / 46,354 | 7,374 / 37,769 |
-| gpt-4.1-nano | 4,296 / 6,378 | 4,809 / 6,869 | 4,171 / 6,609 | 4,650 / 6,201 |

 - **Vector store size impact varies by model**: GPT-4.1 series shows minimal latency impact across vector store sizes, while GPT-5 series shows significant increases.

@@ -238,10 +232,6 @@ In addition to the above evaluations which use a 3 MB sized vector store, the ha
 | | Medium (3 MB) | 0.934 | 0.813 | 0.813 | 0.770 |
 | | Large (11 MB) | 0.919 | 0.817 | 0.817 | 0.817 |
 | | Extra Large (105 MB) | 0.909 | 0.793 | 0.793 | 0.711 |
-| **gpt-5-nano** | Small (1 MB) | 0.590 | 0.547 | 0.545 | 0.536 |
-| | Medium (3 MB) | 0.566 | 0.540 | 0.540 | 0.533 |
-| | Large (11 MB) | 0.564 | 0.534 | 0.532 | 0.507 |
-| | Extra Large (105 MB) | 0.603 | 0.570 | 0.558 | 0.550 |
 | **gpt-4.1** | Small (1 MB) | 0.907 | 0.839 | 0.839 | 0.839 |
 | | Medium (3 MB) | 0.870 | 0.785 | 0.785 | 0.785 |
 | | Large (11 MB) | 0.846 | 0.753 | 0.753 | 0.753 |

@@ -250,15 +240,11 @@ In addition to the above evaluations which use a 3 MB sized vector store, the ha
 | | Medium (3 MB) | 0.876 | 0.806 | 0.789 | 0.789 |
 | | Large (11 MB) | 0.862 | 0.791 | 0.757 | 0.757 |
 | | Extra Large (105 MB) | 0.802 | 0.722 | 0.722 | 0.722 |
-| **gpt-4.1-nano** | Small (1 MB) | 0.605 | 0.528 | 0.528 | 0.528 |
-| | Medium (3 MB) | 0.537 | 0.526 | 0.526 | 0.526 |
-| | Large (11 MB) | 0.618 | 0.531 | 0.531 | 0.531 |
-| | Extra Large (105 MB) | 0.636 | 0.528 | 0.528 | 0.528 |

 **Key Insights:**

 - **Best Performance**: gpt-5-mini consistently achieves the highest ROC AUC scores across all vector store sizes (0.909-0.939)
-- **Best Latency**: gpt-4.1-nano shows the most consistent and lowest latency across all scales (4,171-4,809ms P50) but shows poor performance
+- **Best Latency**: gpt-4.1-mini (default) provides the lowest median latencies while maintaining strong accuracy
 - **Most Stable**: gpt-4.1-mini (default) maintains relatively stable performance across vector store sizes with good accuracy-latency balance
 - **Scale Sensitivity**: gpt-5 shows the most variability in performance across vector store sizes, with performance dropping significantly at larger scales
 - **Performance vs Scale**: Most models show decreasing performance as vector store size increases, with gpt-5-mini being the most resilient

@@ -268,4 +254,4 @@ In addition to the above evaluations which use a 3 MB sized vector store, the ha
 - **Signal-to-noise ratio degradation**: Larger vector stores contain more irrelevant documents that may not be relevant to the specific factual claims being validated
 - **Semantic search limitations**: File search retrieves semantically similar documents, but with a large diverse knowledge source, these may not always be factually relevant
 - **Document quality matters more than quantity**: The relevance and accuracy of documents is more important than the total number of documents
-- **Performance plateaus**: Beyond a certain size (11 MB), the performance impact becomes less severe
+- **Performance plateaus**: Beyond a certain size (11 MB), the performance impact becomes less severe
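The ROC AUC metric reported throughout these tables is the area under the ROC curve, which equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties counting half). A minimal illustrative sketch of that rank-statistic view, with made-up scores (this is not code from the guardrails evaluation harness):

```python
def roc_auc(scores_pos: list[float], scores_neg: list[float]) -> float:
    """AUC as the probability a random positive outscores a random negative.

    Ties contribute 0.5, matching the usual Mann-Whitney formulation.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Toy data: a detector that perfectly separates the classes scores AUC 1.0.
perfect = roc_auc([0.9, 0.8, 0.7], [0.3, 0.1])
```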

docs/ref/checks/jailbreak.md

Lines changed: 4 additions & 8 deletions

@@ -91,23 +91,19 @@ This benchmark evaluates model performance on a diverse set of prompts:

 | Model | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
 |--------------|---------|-------------|-------------|-------------|-----------------|
-| gpt-5 | 0.979 | 0.973 | 0.970 | 0.970 | 0.733 |
-| gpt-5-mini | 0.954 | 0.990 | 0.900 | 0.900 | 0.768 |
-| gpt-5-nano | 0.962 | 0.973 | 0.967 | 0.965 | 0.048 |
-| gpt-4.1 | 0.990 | 1.000 | 1.000 | 0.984 | 0.946 |
-| gpt-4.1-mini (default) | 0.982 | 0.992 | 0.992 | 0.954 | 0.444 |
-| gpt-4.1-nano | 0.934 | 0.924 | 0.924 | 0.848 | 0.000 |
+| gpt-5 | 0.982 | 0.984 | 0.977 | 0.977 | 0.743 |
+| gpt-5-mini | 0.980 | 0.980 | 0.976 | 0.975 | 0.734 |
+| gpt-4.1 | 0.979 | 0.975 | 0.975 | 0.975 | 0.661 |
+| gpt-4.1-mini (default) | 0.979 | 0.974 | 0.972 | 0.972 | 0.654 |

 #### Latency Performance

 | Model | TTC P50 (ms) | TTC P95 (ms) |
 |--------------|--------------|--------------|
 | gpt-5 | 4,569 | 7,256 |
 | gpt-5-mini | 5,019 | 9,212 |
-| gpt-5-nano | 4,702 | 6,739 |
 | gpt-4.1 | 841 | 1,861 |
 | gpt-4.1-mini | 749 | 1,291 |
-| gpt-4.1-nano | 683 | 890 |

 **Notes:**

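The Prec@R columns in these benchmark tables report the best precision attainable at any score threshold whose recall meets the stated level. A small sketch of that computation on toy labels and scores (illustrative only, not the evaluation harness itself):

```python
def precision_at_recall(labels: list[int], scores: list[float],
                        target_recall: float) -> float:
    """Best precision over all thresholds achieving recall >= target_recall.

    labels: 1 for ground-truth positives, 0 otherwise; scores: model confidences.
    Assumes at least one positive label.
    """
    # Sweep thresholds from the highest score down; each prefix of the
    # sorted list is the set of examples flagged at that threshold.
    pairs = sorted(zip(scores, labels), reverse=True)
    total_pos = sum(labels)
    tp = fp = 0
    best = 0.0
    for _score, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        if tp / total_pos >= target_recall:
            best = max(best, tp / (tp + fp))
    return best

# Toy example: recall 1.0 forces one false positive in, so precision is 2/3.
toy = precision_at_recall([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6], 1.0)
```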
docs/ref/checks/nsfw.md

Lines changed: 4 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -80,12 +80,10 @@ This benchmark evaluates model performance on a balanced set of social media pos
8080

8181
| Model | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
8282
|--------------|---------|-------------|-------------|-------------|-----------------|
83-
| gpt-5 | 0.9532 | 0.9195 | 0.9096 | 0.9068 | 0.0339 |
84-
| gpt-5-mini | 0.9629 | 0.9321 | 0.9168 | 0.9149 | 0.0998 |
85-
| gpt-5-nano | 0.9600 | 0.9297 | 0.9216 | 0.9175 | 0.1078 |
86-
| gpt-4.1 | 0.9603 | 0.9312 | 0.9249 | 0.9192 | 0.0439 |
87-
| gpt-4.1-mini (default) | 0.9520 | 0.9180 | 0.9130 | 0.9049 | 0.0459 |
88-
| gpt-4.1-nano | 0.9502 | 0.9262 | 0.9094 | 0.9043 | 0.0379 |
83+
| gpt-5 | 0.953 | 0.919 | 0.910 | 0.907 | 0.034 |
84+
| gpt-5-mini | 0.963 | 0.932 | 0.917 | 0.915 | 0.100 |
85+
| gpt-4.1 | 0.960 | 0.931 | 0.925 | 0.919 | 0.044 |
86+
| gpt-4.1-mini (default) | 0.952 | 0.918 | 0.913 | 0.905 | 0.046 |
8987

9088
**Notes:**
9189

docs/ref/checks/prompt_injection_detection.md

Lines changed: 4 additions & 8 deletions

@@ -109,12 +109,10 @@ This benchmark evaluates model performance on agent conversation traces:

 | Model | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
 |---------------|---------|-------------|-------------|-------------|-----------------|
-| gpt-5 | 0.9931 | 0.9992 | 0.9992 | 0.9992 | 0.5845 |
-| gpt-5-mini | 0.9536 | 0.9951 | 0.9951 | 0.9951 | 0.0000 |
-| gpt-5-nano | 0.9283 | 0.9913 | 0.9913 | 0.9717 | 0.0350 |
-| gpt-4.1 | 0.9794 | 0.9973 | 0.9973 | 0.9973 | 0.0000 |
-| gpt-4.1-mini (default) | 0.9865 | 0.9986 | 0.9986 | 0.9986 | 0.0000 |
-| gpt-4.1-nano | 0.9142 | 0.9948 | 0.9948 | 0.9387 | 0.0000 |
+| gpt-5 | 0.993 | 0.999 | 0.999 | 0.999 | 0.584 |
+| gpt-5-mini | 0.954 | 0.995 | 0.995 | 0.995 | 0.000 |
+| gpt-4.1 | 0.979 | 0.997 | 0.997 | 0.997 | 0.000 |
+| gpt-4.1-mini (default) | 0.987 | 0.999 | 0.999 | 0.999 | 0.000 |

 **Notes:**

@@ -126,12 +124,10 @@ This benchmark evaluates model performance on agent conversation traces:

 | Model | TTC P50 (ms) | TTC P95 (ms) |
 |---------------|--------------|--------------|
-| gpt-4.1-nano | 1,159 | 2,534 |
 | gpt-4.1-mini (default) | 1,481 | 2,563 |
 | gpt-4.1 | 1,742 | 2,296 |
 | gpt-5 | 3,994 | 6,654 |
 | gpt-5-mini | 5,895 | 9,031 |
-| gpt-5-nano | 5,911 | 10,134 |

 - **TTC P50**: Median time to completion (50% of requests complete within this time)
 - **TTC P95**: 95th percentile time to completion (95% of requests complete within this time)
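The TTC P50/P95 figures in these latency tables follow the standard percentile definitions stated above. A minimal sketch of how such figures could be computed from raw completion times using the standard library (the sample data is made up for illustration):

```python
import statistics

def ttc_percentiles(latencies_ms: list[float]) -> tuple[float, float]:
    """Median (P50) and 95th-percentile (P95) time to completion."""
    p50 = statistics.median(latencies_ms)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = statistics.quantiles(latencies_ms, n=100)[94]
    return p50, p95

# Illustrative only: 100 fake completion times of 1..100 ms.
samples = [float(i) for i in range(1, 101)]
p50, p95 = ttc_percentiles(samples)
```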

examples/basic/agents_sdk.py

Lines changed: 1 addition & 1 deletion

@@ -33,7 +33,7 @@
 {
     "name": "Custom Prompt Check",
     "config": {
-        "model": "gpt-4.1-nano-2025-04-14",
+        "model": "gpt-4.1-mini-2025-04-14",
         "confidence_threshold": 0.7,
         "system_prompt_details": "Check if the text contains any math problems.",
 },
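The hunk above swaps the model pinned in the example's guardrail config. A minimal, self-contained sketch that parses an equivalent JSON fragment and sanity-checks its fields (field names come from the hunk; the validation logic is illustrative and is not the library's own):

```python
import json

# The guardrail config fragment from the hunk, with the updated model pin.
snippet = """
{
  "name": "Custom Prompt Check",
  "config": {
    "model": "gpt-4.1-mini-2025-04-14",
    "confidence_threshold": 0.7,
    "system_prompt_details": "Check if the text contains any math problems."
  }
}
"""

check = json.loads(snippet)
cfg = check["config"]
# Illustrative sanity checks: threshold must be a probability in [0, 1].
assert 0.0 <= cfg["confidence_threshold"] <= 1.0
```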
