# Evaluate your model with Inspect-AI

Pick the right benchmark with our benchmark finder: search by language, task type, dataset name, or keywords.

> [!WARNING]
> Not all tasks are compatible with inspect-ai's API yet; we are working on converting all of them!

<iframe
  src="https://openevals-open-benchmark-index.hf.space"
  frameborder="0"
  width="850"
  height="450"
></iframe>

Once you've chosen a benchmark, run it with `lighteval eval`. Below are examples for common setups.
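
Each task spec follows the pattern `suite|task|num_fewshot`: the suite the task ships in, the task name (optionally with a subset after a colon, as in `gpqa:diamond`), and the number of few-shot examples to prepend to the prompt.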

### Examples

1. Evaluate a model via Hugging Face Inference Providers.

```bash
lighteval eval "hf-inference-providers/openai/gpt-oss-20b" "lighteval|gpqa:diamond|0"
```

2. Run multiple evals at the same time.

```bash
lighteval eval "hf-inference-providers/openai/gpt-oss-20b" "lighteval|gpqa:diamond|0,lighteval|aime25|0"
```

3. Compare providers for the same model.

```bash
lighteval eval \
  hf-inference-providers/openai/gpt-oss-20b:fireworks-ai \
  hf-inference-providers/openai/gpt-oss-20b:together \
  hf-inference-providers/openai/gpt-oss-20b:nebius \
  "lighteval|gpqa:diamond|0"
```
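
The `:fireworks-ai`, `:together`, and `:nebius` suffixes pin each run to a specific Inference Provider, so the same model can be benchmarked across providers in a single command.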

4. Evaluate a vLLM or SGLang model.

```bash
lighteval eval vllm/HuggingFaceTB/SmolLM-135M-Instruct "lighteval|gpqa:diamond|0"
```
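
For SGLang, the command is the same apart from the backend prefix (a minimal sketch, assuming the same `backend/model-id` convention as above):

```bash
lighteval eval sglang/HuggingFaceTB/SmolLM-135M-Instruct "lighteval|gpqa:diamond|0"
```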

5. See the impact of few-shot examples on your model.

```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|gsm8k|0,lighteval|gsm8k|5"
```

6. Tune connection settings for a custom server.

```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|gsm8k|0" \
  --max-connections 50 \
  --timeout 30 \
  --retry-on-error 1 \
  --max-retries 1 \
  --max-samples 10
```
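
Here `--max-connections` caps concurrent requests to the server, `--timeout` bounds each request, `--retry-on-error` and `--max-retries` control how failed requests are retried, and `--max-samples 10` evaluates only the first 10 samples, which makes this a cheap smoke test before a full run.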

7. Use multiple epochs for more reliable results.

```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|aime25|0" --epochs 16 --epochs-reducer "pass_at_4"
```
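
With `--epochs 16`, each sample is generated 16 times and the `pass_at_4` reducer collapses those runs into a pass@4 score, smoothing out the variance of a single sample.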

8. Push to the Hub to share results.

```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|hle|0" \
  --bundle-dir gpt-oss-bundle \
  --repo-id OpenEvals/evals \
  --max-samples 100
```

Resulting Space:

<iframe
  src="https://openevals-evals.static.hf.space"
  frameborder="0"
  width="850"
  height="450"
></iframe>

9. Change model behaviour.

You can use any argument defined in inspect-ai's API, for example the sampling temperature.

```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|aime25|0" --temperature 0.1
```
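
Other generation parameters should work the same way; for instance (a sketch assuming inspect-ai's `top_p` and `max_tokens` settings are exposed as CLI flags the way `--temperature` is):

```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|aime25|0" --top-p 0.9 --max-tokens 2048
```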

10. Use `--model-args` to pass any provider-specific argument.

```bash
lighteval eval google/gemini-2.5-pro "lighteval|aime25|0" --model-args location=us-east5
```

```bash
lighteval eval openai/gpt-4o "lighteval|gpqa:diamond|0" --model-args service_tier=flex,client_timeout=1200
```
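
`--model-args` accepts a comma-separated list of `key=value` pairs that are forwarded to the underlying provider client, so provider-specific settings don't need dedicated CLI flags.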

LightEval prints a per-model results table:

```
Completed all tasks in 'lighteval-logs' successfully

| Model                                 |gpqa|gpqa:diamond|
|---------------------------------------|---:|-----------:|
|vllm/HuggingFaceTB/SmolLM-135M-Instruct|0.01|        0.01|

results saved to lighteval-logs
run "inspect view --log-dir lighteval-logs" to view the results
```
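
As the log output suggests, you can browse individual samples, messages, and scores in inspect-ai's viewer:

```bash
inspect view --log-dir lighteval-logs
```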