
Commit 880bebe

NathanHB and Copilot authored
Adds inspectai (#1022)
Adds inspect-ai as a backend for lighteval! Offloading backend implementation and maintenance to inspect-ai allows for:

- better logs
- better parallelization
- easier task additions

Tasks compatible with inspect-ai so far (eventually all tasks will be compatible):

- gpqa (few-shot compatible)
- ifeval
- hle
- gsm8k (few-shot compatible)
- agieval
- aime24, aime25

### Run Llama-3.1-8B using all providers on `hf-inference-providers` on `gpqa`, `agieval` and `aime25`

```
lighteval eval hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras \
    hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai \
    hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai \
    hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita \
    hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius \
    hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova \
    hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway \
    hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nscale \
    "lighteval|gpqa|0,lighteval|agieval|0,lighteval|aime25|0" \
    --max-connections 50 --timeout 30 --retry-on-error 1 --max-retries 5 --epochs 1 --max-samples 1
```

Result:

```
| Model |agieval|aime25|gpqa|
|----------------------------------------------------------------------|------:|-----:|---:|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras | 0.53| 0|0.33|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai| 0.71| 1|0.75|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai | 0.71| 0|0.25|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius | 0.53| 0|0.20|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita | 0.65| 0|0.75|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova | 0.71| 0|0.25|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway | 0.35| 0|0.25|
```

### Compare the effect of few-shot examples on gsm8k

```
lighteval eval hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras \
    hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai \
    hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai \
    hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita \
    hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius \
    hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova \
    hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway \
    hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nscale \
    "lighteval|gsm8k|0,lighteval|gsm8k|3" \
    --max-connections 50 --timeout 30 --retry-on-error 1 --max-retries 5 --epochs 1 --max-samples 1
```

```
| Model |gsm8k|gsm8k_3_shots|
|----------------------------------------------------------------------|----:|------------:|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras | 0.6| 0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai| 0.7| 0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai | 0.7| 0.8|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius | 0.6| 0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita | 0.5| 0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova | 0.7| 0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway | 0.4| 0.8|
```

---------

Co-authored-by: Copilot <[email protected]>
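To make the "easier task additions" point concrete, here is a minimal, hypothetical sketch of a task written directly against inspect-ai's public API (`Task`, `Sample`, `generate`, `match`). This is not lighteval's actual task code; the toy dataset and the model string are invented for illustration, and running it requires an API key for whichever provider you pick.

```python
# Minimal sketch of an inspect-ai task (illustrative only, not lighteval's code).
from inspect_ai import Task, task
from inspect_ai import eval as inspect_eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate


@task
def toy_addition():
    return Task(
        # A one-sample toy dataset; real tasks map a Hugging Face dataset into Samples.
        dataset=[Sample(input="What is 2 + 2? Answer with a number only.", target="4")],
        solver=generate(),  # a single generation step, no few-shot prompt
        scorer=match(),     # string-match the target against the model output
    )


if __name__ == "__main__":
    # Any inspect-ai model string works here; "openai/gpt-4o-mini" is just an example.
    inspect_eval(toy_addition(), model="openai/gpt-4o-mini")
```

In this scheme the generation loop, parallelization, retries, and logging live in inspect-ai, so a task definition reduces largely to mapping a dataset into `Sample`s and choosing a solver and scorer.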
1 parent 3cd31fd commit 880bebe

File tree: 21 files changed (+1181 lines added, -52 lines removed)


README.md — 4 additions, 5 deletions

@@ -25,7 +25,7 @@
 <a href="https://huggingface.co/docs/lighteval/main/en/index" target="_blank">
 <img alt="Documentation" src="https://img.shields.io/badge/Documentation-4F4F4F?style=for-the-badge&logo=readthedocs&logoColor=white" />
 </a>
-<a href="https://huggingface.co/spaces/SaylorTwift/benchmark_finder" target="_blank">
+<a href="https://huggingface.co/spaces/OpenEvals/open_benchmark_index" target="_blank">
 <img alt="Open Benchmark Index" src="https://img.shields.io/badge/Open%20Benchmark%20Index-4F4F4F?style=for-the-badge&logo=huggingface&logoColor=white" />
 </a>
 </p>
@@ -44,7 +44,7 @@ sample-by-sample results* to debug and see how your models stack-up.

 Lighteval supports **1000+ evaluation tasks** across multiple domains and
 languages. Use [this
-space](https://huggingface.co/spaces/SaylorTwift/benchmark_finder) to find what
+space](https://huggingface.co/spaces/OpenEvals/open_benchmark_index) to find what
 you need, or, here's an overview of some *popular benchmarks*:


@@ -107,6 +107,7 @@ huggingface-cli login

 Lighteval offers the following entry points for model evaluation:

+- `lighteval eval`: Evaluate models using [inspect-ai](https://inspect.aisi.org.uk/) as a backend (preferred).
 - `lighteval accelerate`: Evaluate models on CPU or one or more GPUs using [🤗
 Accelerate](https://github.com/huggingface/accelerate)
 - `lighteval nanotron`: Evaluate models in distributed settings using [⚡️
@@ -126,9 +127,7 @@ Did not find what you need ? You can always make your custom model API by follow
 Here's a **quick command** to evaluate using the *Accelerate backend*:

 ```shell
-lighteval accelerate \
-"model_name=gpt2" \
-"leaderboard|truthfulqa:mc|0"
+lighteval eval "hf-inference-providers/openai/gpt-oss-20b" "lighteval|gpqa:diamond|0"
 ```

 Or use the **Python API** to run a model *already loaded in memory*!

docs/source/_toctree.yml — 2 additions, 0 deletions

@@ -7,6 +7,8 @@
     title: Quicktour
   title: Getting started
 - sections:
+  - local: inspect-ai
+    title: Examples using Inspect-AI
   - local: saving-and-reading-results
     title: Save and read results
   - local: caching

docs/source/available-tasks.mdx — 7 additions, 5 deletions

@@ -1,28 +1,30 @@
+# Available tasks

+Browse and inspect tasks available in LightEval.
 <iframe
-src="https://saylortwift-benchmark-finder.hf.space"
+src="https://openevals-benchmark-finder.hf.space"
 frameborder="0"
 width="850"
 height="450"
 ></iframe>



-You can get a list of all available tasks by running:
+List all tasks:

 ```bash
 lighteval tasks list
 ```

-### Inspect Specific Tasks
+### Inspect specific tasks

-You can inspect a specific task to see its configuration, metrics, and requirements by running:
+Inspect a task to view its config, metrics, and requirements:

 ```bash
 lighteval tasks inspect <task_name>
 ```

-For example:
+Example:
 ```bash
 lighteval tasks inspect "lighteval|truthfulqa:mc|0"
 ```

docs/source/index.mdx — 23 additions, 19 deletions

@@ -9,6 +9,7 @@ and see how your models stack up.

 ### 🚀 **Multi-Backend Support**
 Evaluate your models using the most popular and efficient inference backends:
+- `eval`: Use [inspect-ai](https://inspect.aisi.org.uk/) as a backend to evaluate and inspect your models! (preferred way)
 - `transformers`: Evaluate models on CPU or one or more GPUs using [🤗
 Accelerate](https://github.com/huggingface/transformers)
 - `nanotron`: Evaluate models in distributed settings using [⚡️
@@ -45,26 +46,29 @@ pip install lighteval

 ### Basic Usage

-```bash
-# Evaluate a model using Transformers backend
-lighteval accelerate \
-"model_name=openai-community/gpt2" \
-"leaderboard|truthfulqa:mc|0"
-```
+#### Find a task
+
+<iframe
+src="https://openevals-open-benchmark-index.hf.space"
+frameborder="0"
+width="850"
+height="450"
+></iframe>

-### Save Results
+#### Run your benchmark and push details to the hub

 ```bash
-# Save locally
-lighteval accelerate \
-"model_name=openai-community/gpt2" \
-"leaderboard|truthfulqa:mc|0" \
---output-dir ./results
-
-# Push to Hugging Face Hub
-lighteval accelerate \
-"model_name=openai-community/gpt2" \
-"leaderboard|truthfulqa:mc|0" \
---push-to-hub \
---results-org your-username
+lighteval eval "hf-inference-providers/openai/gpt-oss-20b" \
+"lighteval|gpqa:diamond|0" \
+--bundle-dir gpt-oss-bundle \
+--repo-id OpenEvals/evals
 ```
+
+Resulting Space:
+
+<iframe
+src="https://openevals-evals.static.hf.space"
+frameborder="0"
+width="850"
+height="450"
+></iframe>

docs/source/inspect-ai.mdx — new file, 120 additions

# Evaluate your model with Inspect-AI

Pick the right benchmarks with our benchmark finder:
search by language, task type, dataset name, or keywords.

> [!WARNING]
> Not all tasks are compatible with inspect-ai's API yet; we are working on converting all of them!


<iframe
src="https://openevals-open-benchmark-index.hf.space"
frameborder="0"
width="850"
height="450"
></iframe>

Once you've chosen a benchmark, run it with `lighteval eval`. Below are examples for common setups.

### Examples

1. Evaluate a model via Hugging Face Inference Providers.

```bash
lighteval eval "hf-inference-providers/openai/gpt-oss-20b" "lighteval|gpqa:diamond|0"
```

2. Run multiple evals at the same time.

```bash
lighteval eval "hf-inference-providers/openai/gpt-oss-20b" "lighteval|gpqa:diamond|0,lighteval|aime25|0"
```

3. Compare providers for the same model.

```bash
lighteval eval \
hf-inference-providers/openai/gpt-oss-20b:fireworks-ai \
hf-inference-providers/openai/gpt-oss-20b:together \
hf-inference-providers/openai/gpt-oss-20b:nebius \
"lighteval|gpqa:diamond|0"
```

4. Evaluate a vLLM or SGLang model.

```bash
lighteval eval vllm/HuggingFaceTB/SmolLM-135M-Instruct "lighteval|gpqa:diamond|0"
```

5. See the impact of few-shot examples on your model.

```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|gsm8k|0,lighteval|gsm8k|5"
```

6. Optimize custom server connections.

```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|gsm8k|0" \
--max-connections 50 \
--timeout 30 \
--retry-on-error 1 \
--max-retries 1 \
--max-samples 10
```

7. Use multiple epochs for more reliable results.

```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|aime25|0" --epochs 16 --epochs-reducer "pass_at_4"
```

8. Push to the Hub to share results.

```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|hle|0" \
--bundle-dir gpt-oss-bundle \
--repo-id OpenEvals/evals \
--max-samples 100
```

Resulting Space:

<iframe
src="https://openevals-evals.static.hf.space"
frameborder="0"
width="850"
height="450"
></iframe>

9. Change model behaviour.

You can use any argument defined in inspect-ai's API.

```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|aime25|0" --temperature 0.1
```

10. Use `--model-args` to pass any provider-specific argument.

```bash
lighteval eval google/gemini-2.5-pro "lighteval|aime25|0" --model-args location=us-east5
```

```bash
lighteval eval openai/gpt-4o "lighteval|gpqa:diamond|0" --model-args service_tier=flex,client_timeout=1200
```


LightEval prints a per-model results table:

```
Completed all tasks in 'lighteval-logs' successfully

| Model |gpqa|gpqa:diamond|
|---------------------------------------|---:|-----------:|
|vllm/HuggingFaceTB/SmolLM-135M-Instruct|0.01| 0.01|

results saved to lighteval-logs
run "inspect view --log-dir lighteval-logs" to view the results
```
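Beyond `inspect view`, the log directory shown in the sample output above can also be read programmatically. The sketch below uses inspect-ai's log API (`read_eval_log`); the directory name follows the output above, and the field accesses assume inspect-ai's current `EvalLog` schema, so adjust for your version.

```python
# Sketch: summarize the logs a `lighteval eval` run leaves behind, without the viewer UI.
# Assumes logs were written to "lighteval-logs"; depending on the inspect-ai
# version, log files end in .eval or .json.
from pathlib import Path

from inspect_ai.log import read_eval_log

for log_file in sorted(Path("lighteval-logs").glob("*.eval")):
    log = read_eval_log(str(log_file))
    print(log.eval.model, log.eval.task, log.status)
    if log.results is not None:
        for score in log.results.scores:
            metrics = {name: metric.value for name, metric in score.metrics.items()}
            print("  ", score.name, metrics)
```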

docs/source/quicktour.mdx — 1 addition, 1 deletion

@@ -11,7 +11,7 @@ Lighteval can be used with several different commands, each optimized for differ
 ## Find your benchmark

 <iframe
-src="https://saylortwift-benchmark-finder.hf.space"
+src="https://openevals-open-benchmark-index.hf.space"
 frameborder="0"
 width="850"
 height="450"

pyproject.toml — 1 addition, 0 deletions

@@ -57,6 +57,7 @@ keywords = ["evaluation", "nlp", "llm"]
 dependencies = [
   # Base dependencies
   "transformers>=4.54.0",
+  "inspect-ai",
   "accelerate",
   "huggingface_hub[hf_xet]>=0.30.2",
   "torch>=2.0,<3.0",

src/lighteval/__main__.py — 2 additions, 0 deletions

@@ -29,6 +29,7 @@
 import lighteval.main_baseline
 import lighteval.main_custom
 import lighteval.main_endpoint
+import lighteval.main_inspect
 import lighteval.main_nanotron
 import lighteval.main_sglang
 import lighteval.main_tasks
@@ -69,6 +70,7 @@
 app.command(rich_help_panel="Evaluation Backends")(lighteval.main_vllm.vllm)
 app.command(rich_help_panel="Evaluation Backends")(lighteval.main_custom.custom)
 app.command(rich_help_panel="Evaluation Backends")(lighteval.main_sglang.sglang)
+app.command(rich_help_panel="Evaluation Backends")(lighteval.main_inspect.eval)
 app.add_typer(
     lighteval.main_endpoint.app,
     name="endpoint",
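For readers unfamiliar with the registration pattern above: passing a plain function to `app.command(...)` turns it into a subcommand of the Typer app, which is how `main_inspect.eval` surfaces as `lighteval eval`. Below is a toy sketch of the same pattern; it is not lighteval's actual code, and the function signature is invented for illustration.

```python
# toy_cli.py — illustrates the Typer registration style used in __main__.py above.
import typer

app = typer.Typer()


def eval(model: str, tasks: str, max_samples: int = 10):
    """Toy stand-in for lighteval.main_inspect.eval."""
    print(f"would evaluate {model} on {tasks} (max_samples={max_samples})")


def tasks_list():
    """Second dummy command, so `eval` is addressed by name like in lighteval."""
    print("would list tasks")


# Same call style as `app.command(rich_help_panel=...)(lighteval.main_inspect.eval)`.
app.command(rich_help_panel="Evaluation Backends")(eval)
app.command()(tasks_list)

if __name__ == "__main__":
    # e.g. python toy_cli.py eval "hf-inference-providers/openai/gpt-oss-20b" "lighteval|gsm8k|0"
    app()
```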

0 commit comments
