
How to exactly reproduce the results on the openllm leaderboard? #2583

@Zilinghan

Description


Hi,

I am trying to reproduce the results on the Open LLM Leaderboard on Hugging Face, but I found some inconsistencies between the results generated by the harness and the results shown on the leaderboard. For example, for meta-llama/Meta-Llama-3.1-70B-Instruct, the average BBH acc_norm in the raw results generated by the harness is 0.6915466064919285 (see the first attached figure), but the leaderboard shows 55.93%. I am wondering how the 55.93% is calculated from 0.6915466064919285, thanks!

[Screenshots: raw harness results file vs. the score shown on the leaderboard]

Similarly, for MMLU-Pro, the raw result is 0.5309175531914894, but the leaderboard shows 47.88%.
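My current guess, assuming the leaderboard applies its documented baseline normalization (rescaling so that a random-guess score maps to 0 and a perfect score to 1), is something like:

```python
def normalize(raw_score: float, random_baseline: float) -> float:
    """Rescale a raw accuracy so the random-guess baseline maps to 0.0
    and a perfect score maps to 1.0."""
    return (raw_score - random_baseline) / (1.0 - random_baseline)

# MMLU-Pro has 10 answer options, so random guessing scores 0.1.
raw_mmlu_pro = 0.5309175531914894
print(f"{normalize(raw_mmlu_pro, 0.1) * 100:.2f}%")  # → 47.88%
```

This reproduces the MMLU-Pro number. If I understand correctly, BBH is normalized per subtask (each subtask has its own number of choices and therefore its own random baseline) and only then averaged, which would explain why the aggregate 0.6915 does not feed this formula directly.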
