I am trying to reproduce the results on the Hugging Face Open LLM Leaderboard, but I found some inconsistencies between the results generated by the harness and the results shown on the leaderboard. For example, for meta-llama/Meta-Llama-3.1-70B-Instruct, the average BBH acc_norm is 0.6915466064919285 in the raw results generated by the harness (see the first attached figure), but the leaderboard shows 55.93%. I am wondering how the 55.93% is calculated from 0.6915466064919285, thanks!
Similarly, for MMLU-Pro, the raw result is 0.5309175531914894, but the leaderboard shows 47.88%.
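My current guess is that the leaderboard renormalizes the raw score against each benchmark's random-chance baseline (so that random guessing maps to 0 and a perfect score to 100). The sketch below is just that guess, not the leaderboard's actual code; the baseline values (0.1 for MMLU-Pro's 10 answer options, roughly 0.3 as an effective average for BBH) are my assumptions.

```python
# Sketch of the normalization I suspect the leaderboard applies:
# subtract the random-chance baseline and rescale to 0-100.

def normalize(raw_score: float, random_baseline: float) -> float:
    """Map random chance to 0 and a perfect score to 100."""
    return 100.0 * (raw_score - random_baseline) / (1.0 - random_baseline)

# MMLU-Pro has 10 answer options, so a random baseline of 0.1 would give:
print(normalize(0.5309175531914894, 0.1))  # ~47.88, matching the leaderboard

# BBH subtasks have different numbers of choices, so the leaderboard may
# normalize each subtask before averaging; with an effective baseline of
# roughly 0.3 the numbers come out close to the reported value:
print(normalize(0.6915466064919285, 0.3))  # ~55.9
```

If this is indeed how the leaderboard numbers are derived, it would be great to have the exact per-task baselines confirmed.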