I am trying to reproduce the results on the Hugging Face Open LLM Leaderboard, but I found some inconsistencies between the results generated by the harness and the results shown on the leaderboard. For example, for meta-llama/Meta-Llama-3.1-70B-Instruct, the average BBH acc_norm is 0.6915466064919285 in the raw results generated by the harness (see the first attached figure), but the leaderboard shows 55.93%. I am wondering how the 55.93% is calculated from 0.6915466064919285, thanks!
Similarly, for MMLU-Pro, the raw result is 0.5309175531914894, but the leaderboard shows 47.88%.
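My current guess is that the leaderboard renormalizes the raw score against each benchmark's random-chance baseline (so that random guessing maps to 0 and a perfect score to 100). The sketch below is just that guess, not the leaderboard's actual code; the baseline values (0.1 for MMLU-Pro's 10 answer options, roughly 0.3 as an effective average for BBH) are my assumptions.

```python
# Sketch of the normalization I suspect the leaderboard applies:
# subtract the random-chance baseline and rescale to 0-100.

def normalize(raw_score: float, random_baseline: float) -> float:
    """Map random chance to 0 and a perfect score to 100."""
    return 100.0 * (raw_score - random_baseline) / (1.0 - random_baseline)

# MMLU-Pro has 10 answer options, so a random baseline of 0.1 would give:
print(normalize(0.5309175531914894, 0.1))  # ~47.88, matching the leaderboard

# BBH subtasks have different numbers of choices, so the leaderboard may
# normalize each subtask before averaging; with an effective baseline of
# roughly 0.3 the numbers come out close to the reported value:
print(normalize(0.6915466064919285, 0.3))  # ~55.9
```

If this is indeed how the leaderboard numbers are derived, it would be great to have the exact per-task baselines confirmed.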