| framework | dataset | metric_family | auc | best_f1 | spearman | n | mean_latency_ms | total_cost_usd | rank |
|---|---|---|---|---|---|---|---|---|---|
| checkllm | halubench | hallucination | 0.783 | 0.796 | 0.544 | 200 | 2415 | 0.0343 | 1 |
| deepeval | halubench | hallucination | 0.553 | 0.701 | 0.151 | 200 | 4457 | 0.0000 | 3 |
| promptfoo | halubench | hallucination | 0.753 | 0.791 | 0.510 | 200 | 1802 | 0.0292 | 2 |
| deepeval | ragtruth | context_relevance | 0.435 | 0.854 | -0.100 | 200 | 20572 | 0.0000 | 3 |
| promptfoo | ragtruth | context_relevance | 0.500 | 0.854 | nan | 200 | 1364 | 0.0423 | 2 |
| checkllm | ragtruth | context_relevance | 0.565 | 0.856 | 0.125 | 200 | 2351 | 0.0623 | 1 |
| checkllm | ragtruth | faithfulness | 0.754 | 0.861 | 0.424 | 200 | 11878 | 0.0613 | 1 |
| deepeval | ragtruth | faithfulness | 0.631 | 0.854 | 0.205 | 200 | 17191 | 0.0000 | 2 |
| promptfoo | ragtruth | faithfulness | 0.534 | 0.856 | 0.090 | 200 | 1693 | 0.0441 | 3 |
| checkllm | ragtruth | hallucination | 0.663 | 0.871 | 0.398 | 200 | 2728 | 0.0442 | 1 |
| deepeval | ragtruth | hallucination | 0.588 | 0.869 | 0.311 | 200 | 3669 | 0.0000 | 2 |
| promptfoo | ragtruth | hallucination | 0.513 | 0.855 | 0.081 | 200 | 1602 | 0.0441 | 3 |
| checkllm | truthfulqa | answer_relevancy | 0.546 | 0.667 | 0.085 | 400 | 6643 | 0.0213 | 1 |
| deepeval | truthfulqa | answer_relevancy | 0.438 | 0.667 | -0.122 | 400 | 30596 | 0.0000 | 2 |
| promptfoo | truthfulqa | answer_relevancy | 0.392 | 0.667 | -0.233 | 400 | 1176 | 0.0247 | 3 |

- Judge model: `gpt-4o-mini`, run with 8-way concurrency and per-command `--budget-usd 5.0` caps.
- DeepEval cost column reports $0.00 because the DeepEval adapter does not expose token usage through its metric API; the real API spend is roughly proportional to CheckLLM's reported cost for the same family.
- Ragas is omitted. Importing `ragas` pulls in `torch`, which hangs on Windows in this environment, so the Ragas column is left empty in the current publish. Unit tests cover the Ragas adapter offline.
- JailbreakBench is omitted from this run (Scenario A). The family `jailbreak_resistance` is only supported by promptfoo today, the `JBB-Behaviors` dataset ships no LLM-under-test answers (only harmful goals), and a meaningful comparison requires generating target-model responses before grading. Tracked in `docs/benchmarks/enhancements/remaining-gaps.md`.
- TruthfulQA is scored as a balanced binary task. Each source row emits a `best_answer` sample (label 1.0) and an `incorrect_answers[0]` sample (label 0.0), so ROC-AUC is well-defined. `--limit 200` yields 400 graded samples per framework.
- RAGTruth `context_relevance` is scored answer-aware for CheckLLM. The retrieved context alone does not carry a retrieval-relevance label, so CheckLLM folds the system answer into the judge prompt and grades whether the context precisely justifies that answer. DeepEval and promptfoo keep their original context-only semantics.
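
The TruthfulQA note above can be illustrated with a minimal sketch. It is not the benchmark's actual code: the function names `expand_truthfulqa_row` and `roc_auc`, the `question` field, and the judge scores are all hypothetical, chosen only to show why emitting one positive and one negative sample per source row keeps ROC-AUC well-defined. ROC-AUC is computed here with the standard pairwise (Mann-Whitney) formulation rather than any particular library.

```python
from typing import Dict, List


def expand_truthfulqa_row(row: Dict) -> List[Dict]:
    """Turn one source row into one positive and one negative sample.

    Assumes the row has `best_answer` and `incorrect_answers` fields as named
    in the note above; the `question` key is an assumption for illustration.
    """
    return [
        {"question": row["question"], "answer": row["best_answer"], "label": 1.0},
        {"question": row["question"], "answer": row["incorrect_answers"][0], "label": 0.0},
    ]


def roc_auc(labels: List[float], scores: List[float]) -> float:
    """Pairwise ROC-AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Raises if only one class is present, which is
    the case the balanced expansion is designed to avoid.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1.0]
    neg = [s for l, s in zip(labels, scores) if l == 0.0]
    if not pos or not neg:
        raise ValueError("ROC-AUC is undefined when only one class is present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Two hypothetical source rows yield four graded samples (2 rows x 2 samples).
rows = [
    {"question": "q1", "best_answer": "a", "incorrect_answers": ["b", "c"]},
    {"question": "q2", "best_answer": "d", "incorrect_answers": ["e"]},
]
samples = [s for row in rows for s in expand_truthfulqa_row(row)]
labels = [s["label"] for s in samples]

# Hypothetical judge scores, one per sample, in the same order as `samples`.
scores = [0.9, 0.2, 0.4, 0.6]

print(len(samples))             # 4
print(roc_auc(labels, scores))  # 0.75
```

With this expansion every batch holds equal numbers of positive and negative labels, which is consistent with `--limit 200` producing 400 graded samples per framework.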