
pfnet-research/pfgen-bench


Preferred Generation Benchmark

pfgen-benchmark is a benchmark for evaluating Japanese text generation, designed specifically for pretrained models. Unlike conventional benchmarks that use templates containing explicit instructions, this benchmark relies solely on numerous examples. By conveying expectations purely through examples, such as the question-answering nature of the task, responses of approximately 100 characters, and output resembling formal public documents, it minimizes the influence of differences in instructions or templates. In addition, outputs are evaluated with n-gram-based methods, enabling quick, inexpensive, and deterministic evaluation, unlike the LLM-as-a-Judge approach.
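As a rough illustration of the n-gram idea only (this is not the benchmark's actual metric, and the function names here are hypothetical), an overlap score between a candidate answer and reference answers can be computed from character n-grams, which suit Japanese text with no word boundaries:

```python
from collections import Counter

def ngrams(text, n):
    """Multiset of character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def overlap_score(candidate, references, n=3):
    """Fraction of the candidate's n-grams that appear in any reference."""
    cand = ngrams(candidate, n)
    ref = Counter()
    for r in references:
        ref |= ngrams(r, n)  # element-wise max across references
    matched = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return matched / total if total else 0.0

print(overlap_score("日本の首都は東京です。", ["日本の首都は東京都です。"]))
```

Because the score depends only on string counting, repeated runs over the same outputs give identical results, which is what makes this style of evaluation deterministic and cheap compared with judging by another LLM.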

To enable comparisons across as many models as possible, the leaderboard actively includes a wide range of models. These include openly accessible models, models cited in academic papers, and those announced by companies through press releases. Contributions of model outputs are encouraged, and results can be submitted via pull requests. For detailed instructions on how to contribute, please refer to the "How to Contribute" section.

See more details: arXiv:2502.09316

See more details (preprint in Japanese): Jxiv preprint

License of LLM Output

All parts of this repository except the LLM-generated outputs are licensed under the Apache License, Version 2.0. Each LLM-generated output is subject to the license of the model that produced it.

How to Evaluate a Model

You can evaluate a model with either run-hf.py (which uses transformers) or run-vllm.py (which uses vLLM). Run either script with --help for the full list of parameters. The --num-trials parameter sets how many prompt patterns the model answers; choose it as a trade-off between execution time and the statistical accuracy you need.

For pretrained models:

# Run a model using the Hugging Face transformers library (or use run-vllm.py for vLLM).
python ./run-hf.py --model=llm-jp/llm-jp-3-150m --num-trials=5

# Evaluate output and update leaderboard.
make

For instruction models:

# Run a model with each of the three templates (or use run-vllm.py for vLLM).
python ./run-hf.py --model=llm-jp/llm-jp-3-150m-instruct3 --num-trials=5
python ./run-hf.py --model=llm-jp/llm-jp-3-150m-instruct3 --num-trials=5 --mode=qa
python ./run-hf.py --model=llm-jp/llm-jp-3-150m-instruct3 --num-trials=5 --mode=chat

# Evaluate output and update leaderboard.
make

Command-line Arguments

  • --model={{model name}} ... The model name. (Required)
  • --path={{path to model directory}} ... The path to a local model directory. (Default: None)
  • --num-trials={{number of trials}} ... The number of trials. (Default: 10)
  • --mode={{mode}} ... Must be one of completion, qa, or chat. (Default: completion)
    • qa and chat can be used only when the model provides a chat template.
    • The instruction message is placed in a user message for qa and in a system message for chat.
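As a sketch of how the two instruction modes might differ (a hypothetical message layout for illustration; the actual templates live in the run scripts), qa puts the instruction and question into a single user message, while chat promotes the instruction to a system message:

```python
def build_messages(instruction, question, mode):
    """Hypothetical illustration: where the instruction text goes per mode."""
    if mode == "qa":
        # qa: instruction and question share one user message
        return [{"role": "user", "content": f"{instruction}\n{question}"}]
    elif mode == "chat":
        # chat: the instruction becomes the system message
        return [
            {"role": "system", "content": instruction},
            {"role": "user", "content": question},
        ]
    raise ValueError("mode must be 'qa' or 'chat'")

msgs = build_messages("約100字の公用文体で答えてください。", "日本の首都は?", "chat")
```

A message list like this would then be rendered through the model's own chat template, which is why both modes require the model to provide one.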

How to Contribute

Follow the instructions in the "How to Evaluate a Model" section to run the evaluation. This generates config.json and trials.jsonl.xz under the result directory. Please create a pull request containing only these two files.

To allow more accurate ranking among models, set the number of trials (--num-trials) as high as you can afford, up to the limit of 100.
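The leaderboard reports each score as mean (±σ/√n), where n is the number of trials, so more trials shrink the uncertainty interval and separate nearby models. A minimal sketch of that computation (assuming the population standard deviation; the repository's exact convention may differ):

```python
import math

def mean_and_stderr(scores):
    """Mean and standard error (sigma / sqrt(n)) of per-trial scores."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n  # population variance
    return mean, math.sqrt(var) / math.sqrt(n)

m, se = mean_and_stderr([0.92, 0.94, 0.93, 0.95, 0.91])
```

With 10 trials the error bar is already √10 ≈ 3.2 times tighter than a single run, and at the 100-trial limit it is 10 times tighter.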

Leaderboard

🟢 ... completion mode, 💬 ... qa/chat mode.

Rank Score                    Model                                       Length           Fluency Truthfulness Helpfulness
N/A 1.0501 (±0.0000/√1) 👑 system/ground-truth 100.0 (±0.0) 1.155 0.996 1.000
1 0.9338 (±0.0145/√10) 🟢 DeepSeek-V3 100.8 (±6.2) 1.009 0.969 0.822
2 0.9307 (±0.0083/√18) 💬 chatgpt-4o-latest 99.1 (±14.8) 0.954 0.968 0.870
3 0.9303 (±0.0083/√10) 💬 anthropic/claude-3-5-sonnet-20240620 102.2 (±10.4) 0.949 0.959 0.883
4 0.8615 (±0.0092/√10) 💬 openai/gpt-4o 84.5 (±18.6) 0.919 0.980 0.686
5 0.8584 (±0.0163/√10) 💬 deepseek-ai/DeepSeek-R1 106.1 (±13.5) 0.839 0.929 0.807
N/A 0.8494 (±0.0253/√1000) 🎯 system/criteria 100.0 (±3.4) 0.936 0.978 0.505
6 0.8359 (±0.0216/√10) 💬 Qwen/Qwen-Max-2025-01-25 89.6 (±18.7) 0.864 0.968 0.676
7 0.8352 (±0.0107/√10) 💬 Qwen/Qwen-Max 88.8 (±18.7) 0.862 0.964 0.679
8 0.8279 (±0.0131/√10) 💬 MiniMax-Text-01 77.8 (±22.2) 0.858 0.988 0.638
9 0.8270 (±0.0229/√10) 💬 anthropic/claude-3-opus-20240229 102.3 (±9.5) 0.911 0.944 0.627
10 0.8192 (±0.0207/√10) 💬 google/gemini-1.5-pro-002 76.3 (±17.4) 0.826 0.976 0.656
11 0.8157 (±0.0119/√10) 💬 MiniMax-Text-01 78.9 (±25.5) 0.850 0.986 0.611
12 0.8128 (±0.0192/√100) 🟢 Qwen/Qwen3-235B-A22B 97.7 (±12.2) 0.902 0.952 0.585
13 0.8036 (±0.0133/√10) 💬 openai/gpt-4-turbo 86.5 (±17.4) 0.820 0.959 0.632
14 0.7916 (±0.0146/√10) 💬 openai/gpt-4 107.2 (±11.6) 0.888 0.951 0.536
15 0.7827 (±0.0129/√100) 💬 Qwen/Qwen2.5-72B-Instruct 98.7 (±14.8) 0.871 0.936 0.540
16 0.7789 (±0.0213/√100) 🟢 weblab-GENIAC/Tanuki-8x8B-dpo-v1.0 109.1 (±36.8) 0.890 0.941 0.506
17 0.7782 (±0.0154/√100) 💬 Qwen/Qwen2.5-72B-Instruct 96.5 (±17.8) 0.847 0.939 0.549
18 0.7773 (±0.0168/√100) 💬 pfnet/plamo-1.0-prime 178.2 (±114.5) 0.874 0.942 0.516
19 0.7768 (±0.0113/√5) 💬 mlx-community/Qwen2.5-72B-Instruct-4bit 100.8 (±17.7) 0.860 0.933 0.538
20 0.7766 (±0.0276/√100) 🟢 tokyotech-llm/Swallow-70b-NVE-hf 104.1 (±17.9) 0.884 0.938 0.507
21 0.7756 (±0.0264/√100) 🟢 tokyotech-llm/Swallow-70b-NVE-instruc... 104.1 (±18.5) 0.878 0.938 0.510
22 0.7748 (±0.0000/√1) 💬 openai/chatgpt-o1 76.3 (±17.7) 0.755 0.960 0.610
23 0.7748 (±0.0299/√100) 🟢 sbintuitions/sarashina2-8x70b 105.7 (±21.5) 0.867 0.937 0.520
24 0.7735 (±0.0254/√50) 🟢 abeja/ABEJA-Qwen2.5-32b-Japanese-v0.1 154.6 (±121.1) 0.845 0.923 0.553
25 0.7650 (±0.0263/√100) 🟢 tokyotech-llm/Swallow-70b-instruct-hf 102.5 (±14.4) 0.872 0.929 0.494
26 0.7643 (±0.0000/√1) 💬 openai/chatgpt-o1-pro 79.5 (±17.3) 0.748 0.955 0.590
27 0.7628 (±0.0275/√100) 🟢 tokyotech-llm/Swallow-70b-hf 103.5 (±16.1) 0.876 0.930 0.483
28 0.7601 (±0.0289/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-70B-v0.1 106.3 (±21.0) 0.864 0.925 0.492
29 0.7538 (±0.0251/√100) 🟢 turing-motors/Llama-3-heron-brain-70B... 101.1 (±16.9) 0.857 0.925 0.479
30 0.7526 (±0.0243/√100) 🟢 pfnet/plamo-2-8b 103.7 (±17.3) 0.863 0.939 0.456
31 0.7509 (±0.0253/√100) 🟢 sbintuitions/sarashina2.2-3b-instruct... 119.0 (±25.1) 0.844 0.893 0.515
32 0.7501 (±0.0237/√100) 💬 weblab-GENIAC/Tanuki-8x8B-dpo-v1.0 181.0 (±87.4) 0.847 0.923 0.480
33 0.7469 (±0.0270/√100) 🟢 pfnet/plamo-100b-base 115.2 (±64.0) 0.861 0.920 0.460
34 0.7458 (±0.0244/√100) 🟢 llm-jp/llm-jp-3-172b-instruct2 105.8 (±21.8) 0.850 0.929 0.458
35 0.7444 (±0.0260/√100) 🟢 sbintuitions/sarashina2-70b 120.0 (±49.4) 0.825 0.923 0.485
36 0.7423 (±0.0302/√100) 💬 cyberagent/Llama-3.1-70B-Japanese-Ins... 199.2 (±110.3) 0.817 0.905 0.505
37 0.7407 (±0.0170/√10) 💬 google/gemini-1.5-flash-002 68.4 (±20.2) 0.742 0.960 0.519
38 0.7392 (±0.0232/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-70B-I... 93.6 (±23.5) 0.847 0.941 0.429
39 0.7370 (±0.0217/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-70B-I... 97.5 (±19.8) 0.846 0.932 0.433
40 0.7365 (±0.0218/√100) 🟢 CohereForAI/c4ai-command-r-plus 107.5 (±42.3) 0.818 0.913 0.478
41 0.7336 (±0.0254/√100) 🟢 tokyotech-llm/Llama-3-Swallow-70B-v0.1 108.2 (±24.7) 0.837 0.908 0.456
42 0.7329 (±0.0191/√100) 💬 mistralai/Mistral-Large-Instruct-2411 124.5 (±28.2) 0.828 0.902 0.469
43 0.7325 (±0.0229/√100) 🟢 llm-jp/llm-jp-3-13b-instruct3 110.0 (±21.9) 0.823 0.905 0.469
44 0.7320 (±0.0201/√10) 💬 anthropic/claude-3-sonnet-20240229 114.3 (±18.9) 0.810 0.910 0.476
45 0.7297 (±0.0225/√100) 🟢 sbintuitions/sarashina2.2-3b 108.3 (±19.5) 0.817 0.905 0.467
46 0.7294 (±0.0229/√100) 🟢 llm-jp/llm-jp-3-172b 101.8 (±17.4) 0.826 0.921 0.441
47 0.7273 (±0.0233/√10) 💬 google/gemini-2.0-flash-exp 60.7 (±16.3) 0.727 0.978 0.476
48 0.7262 (±0.0215/√100) 💬 mistralai/Mistral-Large-Instruct-2411 120.8 (±25.8) 0.822 0.899 0.458
49 0.7250 (±0.0261/√100) 🟢 llm-jp/llm-jp-3-13b-instruct2 108.8 (±21.4) 0.827 0.906 0.442
50 0.7249 (±0.0247/√100) 💬 cyberagent/calm3-22b-chat 136.8 (±46.7) 0.813 0.907 0.455
51 0.7246 (±0.0250/√100) 💬 tokyotech-llm/Llama-3.1-Swallow-70B-I... 89.8 (±33.9) 0.812 0.940 0.422
52 0.7217 (±0.0219/√100) 🟢 cyberagent/calm3-22b-chat 105.0 (±13.1) 0.824 0.916 0.425
53 0.7194 (±0.0321/√10) 💬 google/text-bison 77.6 (±31.9) 0.790 0.968 0.401
54 0.7191 (±0.0194/√100) 💬 sbintuitions/sarashina2.2-3b-instruct... 171.7 (±62.0) 0.814 0.879 0.464
55 0.7185 (±0.0000/√1) 💬 elyza/Llama-3-ELYZA-JP-70B 98.6 (±33.8) 0.837 0.931 0.388
56 0.7175 (±0.0257/√100) 🟢 nvidia/nemotron-4-340b-instruct 107.3 (±28.4) 0.816 0.908 0.429
57 0.7174 (±0.0243/√100) 🟢 llm-jp/llm-jp-3-13b-instruct 108.3 (±21.1) 0.807 0.906 0.439
58 0.7166 (±0.0305/√100) 🟢 llm-jp/llm-jp-3-172b-beta2 101.6 (±20.5) 0.814 0.918 0.417
59 0.7086 (±0.0192/√100) 🟢 mistralai/Mistral-Large-Instruct-2411 104.5 (±16.2) 0.810 0.900 0.415
60 0.7084 (±0.0207/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-In... 95.9 (±19.7) 0.835 0.930 0.360
61 0.7073 (±0.0239/√100) 🟢 llm-jp/llm-jp-3-172b-instruct3 108.6 (±23.1) 0.799 0.908 0.414
62 0.7061 (±0.0205/√100) 🟢 AXCXEPT/EZO-Qwen2.5-72B-Instruct 140.5 (±62.0) 0.796 0.894 0.428
63 0.7046 (±0.0248/√100) 💬 nvidia/nemotron-4-340b-instruct 94.5 (±39.1) 0.768 0.910 0.435
64 0.7029 (±0.0258/√100) 🟢 mlx-community/plamo-2-8b-4bit 105.1 (±36.1) 0.821 0.909 0.379
65 0.7024 (±0.0238/√100) 🟢 rinna/nekomata-14b 104.3 (±18.0) 0.812 0.912 0.383
66 0.7023 (±0.0271/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-v0.2 112.6 (±33.2) 0.818 0.901 0.388
67 0.7016 (±0.0212/√100) 🟢 llm-jp/llm-jp-3-7.2b-instruct2 106.5 (±20.0) 0.810 0.902 0.393
68 0.7008 (±0.0318/√100) 🟢 tokyotech-llm/Swallow-13b-instruct-hf 104.5 (±13.0) 0.812 0.898 0.392
69 0.7000 (±0.0271/√100) 💬 llm-jp/llm-jp-3-13b-instruct 192.0 (±114.0) 0.780 0.890 0.430
70 0.6990 (±0.0288/√100) 🟢 tokyotech-llm/Swallow-13b-NVE-hf 106.2 (±19.2) 0.820 0.906 0.371
71 0.6980 (±0.0252/√100) 💬 tokyotech-llm/Llama-3.1-Swallow-8B-In... 98.7 (±50.0) 0.798 0.927 0.369
72 0.6969 (±0.0219/√100) 🟢 llm-jp/llm-jp-3-7.2b-instruct3 107.3 (±18.4) 0.798 0.896 0.396
73 0.6958 (±0.0236/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-In... 92.9 (±20.0) 0.814 0.931 0.343
74 0.6945 (±0.0300/√100) 🟢 sbintuitions/sarashina2-13b 107.8 (±28.3) 0.794 0.900 0.390
75 0.6938 (±0.0217/√100) 🟢 weblab-GENIAC/Tanuki-8B-dpo-v1.0 111.5 (±22.8) 0.800 0.893 0.389
76 0.6924 (±0.0232/√100) 💬 tokyotech-llm/Llama-3.1-Swallow-70B-I... 74.1 (±31.4) 0.755 0.948 0.373
77 0.6891 (±0.0255/√100) 🟢 tokyotech-llm/Swallow-13b-hf 104.8 (±17.7) 0.811 0.901 0.355
78 0.6853 (±0.0201/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-In... 96.6 (±18.8) 0.815 0.919 0.322
79 0.6844 (±0.0239/√100) 🟢 llm-jp/llm-jp-3-172b-beta1 103.0 (±16.0) 0.785 0.900 0.369
80 0.6820 (±0.0232/√100) 💬 llm-jp/llm-jp-3-7.2b-instruct 182.5 (±105.7) 0.781 0.883 0.381
81 0.6808 (±0.0228/√100) 💬 llm-jp/llm-jp-3-172b-instruct2 254.5 (±138.6) 0.780 0.887 0.376
82 0.6794 (±0.0243/√100) 🟢 cyberagent/Llama-3.1-70B-Japanese-Ins... 128.8 (±72.2) 0.764 0.883 0.391
83 0.6787 (±0.0267/√100) 💬 llm-jp/llm-jp-3-13b-instruct3 245.0 (±129.9) 0.770 0.875 0.391
84 0.6764 (±0.0217/√100) 🟢 llm-jp/llm-jp-3-7.2b-instruct 104.7 (±19.4) 0.775 0.890 0.364
85 0.6759 (±0.0232/√10) 🟢 meta-llama/Meta-Llama-3.1-405B 101.2 (±15.1) 0.767 0.892 0.368
86 0.6746 (±0.0215/√100) 💬 llm-jp/llm-jp-3-172b-instruct3 216.1 (±98.9) 0.756 0.875 0.393
87 0.6737 (±0.0276/√100) 🟢 sbintuitions/sarashina1-13b 105.4 (±23.4) 0.775 0.882 0.364
88 0.6715 (±0.0284/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-v0.1 107.5 (±22.2) 0.787 0.881 0.347
89 0.6697 (±0.0277/√100) 🟢 nvidia/nemotron-4-340b-base 106.9 (±26.5) 0.768 0.884 0.357
90 0.6677 (±0.0250/√100) 🟢 llm-jp/llm-jp-3-13b 101.1 (±9.7) 0.770 0.884 0.349
91 0.6673 (±0.0221/√100) 💬 llm-jp/llm-jp-3-7.2b-instruct3 234.2 (±116.7) 0.768 0.872 0.363
92 0.6673 (±0.0225/√100) 🟢 sbintuitions/sarashina1-65b 104.2 (±20.0) 0.776 0.894 0.332
93 0.6663 (±0.0262/√100) 🟢 tokyotech-llm/Swallow-7b-plus-hf 106.1 (±18.1) 0.780 0.880 0.339
94 0.6640 (±0.0292/√100) 💬 llm-jp/llm-jp-3-13b-instruct2 256.5 (±153.0) 0.755 0.870 0.368
95 0.6634 (±0.0252/√100) 💬 llm-jp/llm-jp-3-7.2b-instruct2 249.5 (±141.8) 0.768 0.872 0.351
96 0.6632 (±0.0238/√100) 🟢 Qwen/Qwen3-32B 101.5 (±15.2) 0.736 0.876 0.378
97 0.6625 (±0.0140/√10) 💬 anthropic/claude-3-haiku-20240307 81.9 (±31.0) 0.747 0.943 0.298
98 0.6624 (±0.0000/√1) 💬 openai/chatgpt-o3-mini-high 68.1 (±14.5) 0.632 0.925 0.430
99 0.6616 (±0.0378/√10) 💬 google/gemini-1.0-pro-002 118.7 (±90.9) 0.689 0.894 0.402
100 0.6590 (±0.0133/√10) 💬 google/gemini-2.0-flash-thinking-exp-... 49.8 (±11.0) 0.639 0.984 0.354
101 0.6572 (±0.0518/√100) 💬 tokyotech-llm/Llama-3.1-Swallow-8B-In... 108.9 (±63.7) 0.764 0.895 0.313
102 0.6494 (±0.0260/√100) 🟢 Qwen/Qwen2.5-72b 106.8 (±48.2) 0.749 0.863 0.337
103 0.6473 (±0.0182/√100) 💬 Qwen/Qwen2-72B-Instruct 108.7 (±24.8) 0.703 0.853 0.386
104 0.6456 (±0.0255/√100) 🟢 sbintuitions/sarashina2-7b 105.6 (±22.8) 0.746 0.874 0.316
105 0.6447 (±0.0251/√100) 💬 tokyotech-llm/Llama-3.1-Swallow-8B-In... 74.3 (±31.3) 0.706 0.934 0.294
106 0.6445 (±0.0241/√100) 🟢 tokyotech-llm/Llama-3-Swallow-8B-v0.1 110.3 (±28.4) 0.748 0.867 0.319
107 0.6420 (±0.0259/√100) 🟢 microsoft/phi-4 104.2 (±15.2) 0.754 0.864 0.309
108 0.6407 (±0.0242/√100) 🟢 AXCXEPT/Llama-3.1-70B-EZO-1.1-it 147.8 (±92.9) 0.721 0.844 0.357
109 0.6406 (±0.0139/√100) 💬 Qwen/QwQ-32B-Preview 119.1 (±72.2) 0.730 0.897 0.294
110 0.6399 (±0.1763/√100) 💬 turing-motors/Llama-3-heron-brain-70B... 155.4 (±101.8) 0.718 0.805 0.397
111 0.6379 (±0.0263/√100) 🟢 llm-jp/llm-jp-3-3.7b-instruct2 106.8 (±22.2) 0.743 0.867 0.304
112 0.6368 (±0.0207/√100) 🟢 tokyotech-llm/Swallow-MX-8x7b-NVE-v0.1 105.5 (±21.0) 0.753 0.870 0.287
113 0.6350 (±0.0260/√100) 🟢 karakuri-ai/karakuri-lm-8x7b-instruct... 104.0 (±16.9) 0.755 0.863 0.287
114 0.6337 (±0.0265/√100) 🟢 tokyotech-llm/Swallow-7b-hf 106.5 (±18.7) 0.746 0.866 0.289
115 0.6335 (±0.0252/√100) 🟢 karakuri-ai/karakuri-lm-8x7b-chat-v0.1 103.2 (±16.6) 0.766 0.872 0.263
116 0.6318 (±0.0264/√100) 🟢 tokyotech-llm/Llama-3-Swallow-70B-Ins... 119.2 (±74.3) 0.724 0.861 0.311
117 0.6311 (±0.0226/√100) 💬 llm-jp/llm-jp-3-3.7b-instruct 193.2 (±119.8) 0.732 0.847 0.314
118 0.6310 (±0.0127/√100) 💬 Qwen/Qwen2.5-32B-Instruct 75.4 (±19.3) 0.634 0.898 0.360
119 0.6303 (±0.0252/√100) 🟢 cyberagent/calm2-7b-chat-dpo-experime... 110.0 (±24.3) 0.735 0.863 0.293
120 0.6302 (±0.0233/√100) 🟢 llm-jp/llm-jp-3-3.7b-instruct 102.9 (±18.0) 0.738 0.863 0.289
121 0.6297 (±0.0150/√100) 💬 Qwen/Qwen2.5-32B-Instruct 71.1 (±18.7) 0.634 0.906 0.349
122 0.6295 (±0.0226/√100) 💬 microsoft/phi-4 117.8 (±34.9) 0.706 0.843 0.340
123 0.6294 (±0.0267/√100) 💬 microsoft/phi-4 117.8 (±37.7) 0.705 0.846 0.337
124 0.6291 (±0.0207/√100) 💬 Qwen/QwQ-32B-Preview 229.6 (±135.9) 0.719 0.867 0.301
125 0.6285 (±0.0239/√100) 🟢 pfnet/nekomata-14b-pfn-qfin-inst-merge 124.7 (±47.2) 0.725 0.866 0.295
126 0.6279 (±0.0252/√100) 🟢 tokyotech-llm/Swallow-7b-NVE-hf 108.1 (±24.5) 0.747 0.870 0.267
127 0.6274 (±0.0772/√100) 🟢 rinna/nekomata-14b-instruction 98.3 (±24.2) 0.732 0.855 0.295
128 0.6267 (±0.0263/√100) 🟢 sbintuitions/sarashina1-7b 106.7 (±25.1) 0.737 0.866 0.276
129 0.6252 (±0.0246/√100) 🟢 karakuri-ai/karakuri-lm-70b-v0.1 106.0 (±27.0) 0.713 0.852 0.310
130 0.6202 (±0.0251/√100) 🟢 stabilityai/japanese-stablelm-base-be... 107.3 (±19.2) 0.733 0.848 0.280
131 0.6197 (±0.0258/√100) 🟢 stockmark/stockmark-13b 108.9 (±49.3) 0.727 0.860 0.272
132 0.6191 (±0.0284/√100) 🟢 stockmark/stockmark-13b-instruct 108.0 (±46.8) 0.720 0.859 0.278
133 0.6178 (±0.0230/√100) 🟢 karakuri-ai/karakuri-lm-70b-chat-v0.1 104.7 (±27.5) 0.706 0.842 0.306
134 0.6176 (±0.0249/√100) 🟢 tokyotech-llm/Swallow-7b-instruct-hf 106.3 (±17.8) 0.716 0.851 0.285
135 0.6167 (±0.0213/√100) 💬 sbintuitions/sarashina2.2-3b-instruct... 491.1 (±121.0) 0.718 0.829 0.302
136 0.6160 (±0.0195/√100) 🟢 AXCXEPT/EZO-Qwen2.5-32B-Instruct 196.8 (±119.0) 0.690 0.848 0.310
137 0.6149 (±0.0153/√100) 💬 Qwen/Qwen2.5-14B-Instruct 76.5 (±18.4) 0.644 0.893 0.308
138 0.6136 (±0.0143/√10) 💬 openai/gpt-35-turbo 64.0 (±22.2) 0.658 0.944 0.239
139 0.6108 (±0.0263/√100) 🟢 Qwen/Qwen3-30B-A3B-Base 104.5 (±24.1) 0.707 0.833 0.292
140 0.6105 (±0.0288/√100) 💬 llm-jp/llm-jp-3-3.7b-instruct3 189.9 (±101.5) 0.697 0.834 0.301
141 0.6095 (±0.0225/√100) 💬 rinna/llama-3-youko-70b-instruct 135.3 (±46.8) 0.683 0.817 0.328
142 0.6091 (±0.0277/√100) 🟢 pfnet/nekomata-14b-pfn-qfin 85.1 (±28.4) 0.672 0.893 0.262
143 0.6087 (±0.1545/√100) 💬 tokyotech-llm/Swallow-70b-NVE-instruc... 135.7 (±74.0) 0.678 0.804 0.344
144 0.6085 (±0.0387/√100) 💬 llm-jp/llm-jp-3-3.7b-instruct2 207.7 (±130.6) 0.692 0.832 0.301
145 0.6085 (±0.0264/√100) 🟢 llm-jp/llm-jp-3-7.2b 104.0 (±14.7) 0.713 0.851 0.262
146 0.6063 (±0.0213/√100) 💬 Qwen/Qwen2.5-14B-Instruct 80.0 (±21.8) 0.639 0.889 0.290
147 0.6060 (±0.0238/√100) 🟢 Qwen/Qwen2-72B 105.5 (±23.5) 0.703 0.836 0.279
148 0.6037 (±0.0239/√100) 🟢 tokyotech-llm/Swallow-7b-NVE-instruct-hf 105.7 (±16.4) 0.719 0.847 0.245
149 0.6030 (±0.0287/√100) 💬 karakuri-ai/karakuri-lm-8x7b-instruct... 197.4 (±72.1) 0.703 0.832 0.274
150 0.6029 (±0.0223/√100) 🟢 Qwen/Qwen2-72B-Instruct 106.0 (±26.7) 0.684 0.825 0.299
151 0.5987 (±0.0264/√100) 🟢 cyberagent/calm2-7b-chat 107.5 (±20.8) 0.701 0.843 0.253
152 0.5971 (±0.0235/√100) 🟢 stockmark/stockmark-100b 107.2 (±24.7) 0.709 0.842 0.240
153 0.5945 (±0.1370/√100) 💬 tokyotech-llm/Swallow-13b-instruct-hf 167.3 (±116.4) 0.670 0.790 0.323
154 0.5921 (±0.0211/√100) 🟢 elyza/Llama-3-ELYZA-JP-8B 115.6 (±44.8) 0.685 0.831 0.260
155 0.5868 (±0.0243/√100) 🟢 Qwen/Qwen3-14B-Base 102.9 (±18.2) 0.681 0.824 0.255
156 0.5866 (±0.0202/√100) 🟢 Qwen/Qwen2.5-32b 104.7 (±26.9) 0.690 0.820 0.250
157 0.5852 (±0.0208/√100) 💬 llm-jp/llm-jp-3-13b-instruct3 347.6 (±147.8) 0.672 0.806 0.277
158 0.5832 (±0.0220/√100) 🟢 augmxnt/shisa-gamma-7b-v1 106.7 (±21.8) 0.706 0.831 0.213
159 0.5825 (±0.0249/√100) 🟢 tokyotech-llm/Swallow-MS-7b-v0.1 106.4 (±25.9) 0.702 0.828 0.218
160 0.5811 (±0.0218/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-ac_00... 103.6 (±15.6) 0.675 0.816 0.252
161 0.5808 (±0.0220/√100) 🟢 stabilityai/japanese-stablelm-base-ga... 106.9 (±17.2) 0.690 0.822 0.230
162 0.5806 (±0.0254/√100) 🟢 sbintuitions/sarashina2.2-1b 107.4 (±26.2) 0.692 0.827 0.223
163 0.5793 (±0.0202/√100) 💬 llm-jp/llm-jp-3-172b-instruct3 372.5 (±133.4) 0.655 0.806 0.277
164 0.5783 (±0.0217/√100) 🟢 microsoft/Phi-3-medium-4k-instruct 105.9 (±20.0) 0.675 0.826 0.234
165 0.5777 (±0.0228/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-dolly... 105.2 (±14.5) 0.675 0.811 0.247
166 0.5754 (±0.0182/√100) 🟢 Xwin-LM/Xwin-LM-70B-V0.1 105.4 (±26.8) 0.681 0.833 0.213
167 0.5737 (±0.0209/√100) 🟢 microsoft/Phi-3-medium-128k-instruct 107.7 (±24.7) 0.674 0.825 0.223
168 0.5735 (±0.0216/√100) 🟢 google/gemma-2-9b-it 95.9 (±22.0) 0.674 0.837 0.209
169 0.5734 (±0.1980/√100) 💬 tokyotech-llm/Swallow-70b-instruct-hf 130.9 (±105.0) 0.636 0.758 0.326
170 0.5724 (±0.0209/√100) 🟢 rinna/llama-3-youko-70b 104.6 (±20.6) 0.681 0.826 0.210
171 0.5716 (±0.0230/√100) 🟢 sbintuitions/sarashina2.1-1b 116.9 (±41.3) 0.668 0.821 0.226
172 0.5712 (±0.0194/√100) 💬 karakuri-ai/karakuri-lm-8x7b-chat-v0.1 244.4 (±49.3) 0.678 0.816 0.220
173 0.5710 (±0.0198/√100) 🟢 mistralai/Mistral-Small-24B-Instruct-... 114.2 (±30.2) 0.684 0.797 0.232
174 0.5710 (±0.0226/√100) 🟢 rinna/llama-3-youko-8b-instruct 111.6 (±23.4) 0.672 0.809 0.232
175 0.5659 (±0.0234/√100) 🟢 meta-llama/Meta-Llama-3.1-70B 103.7 (±20.1) 0.665 0.822 0.211
176 0.5656 (±0.0226/√100) 💬 meta-llama/Meta-Llama-3-70B-Instruct 110.2 (±36.4) 0.665 0.777 0.254
177 0.5646 (±0.0240/√100) 💬 microsoft/Phi-3-medium-4k-instruct 131.3 (±50.6) 0.633 0.807 0.253
178 0.5642 (±0.0261/√100) 🟢 stabilityai/japanese-stablelm-instruc... 105.1 (±19.5) 0.646 0.799 0.247
179 0.5620 (±0.0254/√100) 🟢 meta-llama/Meta-Llama-3-70B 102.0 (±17.2) 0.664 0.809 0.213
180 0.5602 (±0.0260/√100) 🟢 Qwen/Qwen3-8B-Base 102.8 (±16.7) 0.661 0.789 0.231
181 0.5590 (±0.0456/√100) 💬 mistralai/Mistral-Small-24B-Instruct-... 105.3 (±42.8) 0.648 0.794 0.235
182 0.5588 (±0.0230/√100) 🟢 stabilityai/japanese-stablelm-instruc... 105.6 (±17.0) 0.673 0.812 0.191
183 0.5574 (±0.0216/√100) 🟢 rinna/nekomata-7b 108.4 (±18.0) 0.678 0.816 0.178
184 0.5569 (±0.0244/√100) 🟢 rinna/llama-3-youko-8b 104.9 (±17.0) 0.670 0.813 0.188
185 0.5568 (±0.0200/√100) 🟢 meta-llama/Meta-Llama-3-70B-Instruct 111.8 (±55.9) 0.655 0.780 0.236
186 0.5562 (±0.0952/√100) 💬 stockmark/stockmark-13b-instruct 137.2 (±89.6) 0.633 0.798 0.238
187 0.5540 (±0.0773/√100) 💬 mistralai/Mistral-Small-24B-Instruct-... 101.9 (±38.4) 0.640 0.773 0.248
188 0.5537 (±0.0204/√100) 🟢 tokyotech-llm/Llama-3-Swallow-8B-Inst... 114.4 (±48.5) 0.657 0.812 0.192
189 0.5531 (±0.0215/√100) 💬 llm-jp/llm-jp-3-7.2b-instruct3 389.6 (±127.7) 0.641 0.787 0.231
190 0.5516 (±0.1016/√100) 💬 cyberagent/calm2-7b-chat-dpo-experime... 181.1 (±120.1) 0.644 0.775 0.236
191 0.5514 (±0.0270/√100) 💬 llm-jp/llm-jp-3-13b-instruct2 365.5 (±161.5) 0.630 0.783 0.241
192 0.5511 (±0.0203/√100) 🟢 google/gemma-2-27b-it 110.3 (±56.8) 0.599 0.836 0.218
193 0.5500 (±0.0605/√100) 💬 tokyotech-llm/Llama-3-Swallow-70B-Ins... 156.5 (±106.5) 0.633 0.780 0.237
194 0.5500 (±0.0467/√100) 💬 tokyotech-llm/Swallow-7b-instruct-hf 121.9 (±77.3) 0.612 0.812 0.225
195 0.5486 (±0.0251/√100) 💬 llm-jp/llm-jp-3-7.2b-instruct2 418.2 (±130.6) 0.637 0.786 0.223
196 0.5469 (±0.0271/√100) 💬 llm-jp/llm-jp-3-172b-instruct2 372.9 (±157.4) 0.619 0.780 0.242
197 0.5465 (±0.0244/√100) 🟢 SakanaAI/TinySwallow-1.5B-Instruct 105.0 (±26.9) 0.657 0.807 0.176
198 0.5437 (±0.0218/√100) 💬 Xwin-LM/Xwin-LM-70B-V0.1 200.7 (±63.1) 0.652 0.782 0.198
199 0.5436 (±0.0246/√100) 🟢 llm-jp/llm-jp-3-3.7b 101.3 (±10.4) 0.646 0.795 0.189
200 0.5432 (±0.0208/√100) 💬 CohereForAI/c4ai-command-r-plus 48.9 (±16.5) 0.505 0.931 0.194
201 0.5429 (±0.0238/√100) 🟢 meta-llama/Meta-Llama-3.1-70B-Instruct 157.6 (±221.7) 0.636 0.770 0.222
202 0.5419 (±0.0234/√100) 🟢 Qwen/Qwen2.5-14B 109.3 (±43.0) 0.648 0.790 0.188
203 0.5416 (±0.0232/√100) 🟢 llm-jp/llm-jp-3-1.8b-instruct2 114.0 (±31.8) 0.651 0.797 0.177
204 0.5406 (±0.0287/√100) 💬 llm-jp/llm-jp-3-13b-instruct 382.1 (±163.5) 0.615 0.771 0.236
205 0.5387 (±0.0269/√100) 💬 rinna/llama-3-youko-8b-instruct 265.4 (±104.1) 0.635 0.771 0.210
206 0.5386 (±0.0215/√100) 💬 microsoft/Phi-3-medium-128k-instruct 91.9 (±44.7) 0.589 0.834 0.193
207 0.5377 (±0.0481/√100) 💬 meta-llama/Meta-Llama-3.1-70B-Instruct 135.8 (±194.8) 0.617 0.779 0.218
208 0.5359 (±0.0214/√100) 🟢 llm-jp/llm-jp-3-1.8b-instruct3 117.5 (±35.4) 0.640 0.786 0.181
209 0.5349 (±0.0203/√100) 💬 google/gemma-2-27b-it 74.7 (±42.7) 0.545 0.874 0.186
210 0.5347 (±0.0188/√100) 🟢 rinna/youri-7b 107.6 (±16.3) 0.654 0.802 0.148
211 0.5330 (±0.0238/√100) 💬 llm-jp/llm-jp-3-7.2b-instruct 406.7 (±152.5) 0.621 0.770 0.208
212 0.5316 (±0.0273/√100) 💬 lightblue/karasu-7B-chat 111.8 (±46.5) 0.621 0.800 0.174
213 0.5301 (±0.0476/√100) 💬 lightblue/karasu-7B-chat-plus 107.1 (±46.7) 0.615 0.798 0.178
214 0.5283 (±0.0309/√100) 💬 SakanaAI/TinySwallow-1.5B-Instruct 117.7 (±61.8) 0.616 0.801 0.168
215 0.5283 (±0.0585/√100) 💬 lightblue/karasu-7B-chat-plus-unleashed 104.6 (±45.3) 0.614 0.794 0.177
216 0.5223 (±0.0441/√100) 🟢 Fugaku-LLM/Fugaku-LLM-13B 94.2 (±20.5) 0.588 0.818 0.161
217 0.5199 (±0.0281/√100) 🟢 llm-jp/llm-jp-3-172b-alpha2 104.6 (±22.2) 0.606 0.782 0.171
218 0.5190 (±0.0203/√100) 🟢 mistralai/Mistral-Small-24B-Base-2501 107.2 (±32.7) 0.626 0.771 0.160
219 0.5179 (±0.0264/√100) 🟢 cyberagent/calm2-7b 106.0 (±26.2) 0.601 0.770 0.182
220 0.5164 (±0.0209/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-jaste... 109.3 (±33.5) 0.606 0.788 0.155
221 0.5143 (±0.0212/√100) 🟢 llm-jp/llm-jp-13b-v2.0 104.1 (±11.2) 0.604 0.760 0.180
222 0.5143 (±0.0170/√100) 🟢 moneyforward/houou-instruction-7b-v3 112.2 (±37.8) 0.629 0.778 0.135
223 0.5122 (±0.0132/√100) 💬 Qwen/Qwen2.5-7B-Instruct 69.5 (±28.7) 0.557 0.847 0.132
224 0.5119 (±0.0190/√100) 💬 llm-jp/llm-jp-3-3.7b-instruct3 360.0 (±134.7) 0.594 0.753 0.189
225 0.5111 (±0.0203/√100) 🟢 llm-jp/llm-jp-3-1.8b-instruct 113.1 (±33.9) 0.615 0.772 0.147
226 0.5103 (±0.0204/√100) 💬 llm-jp/llm-jp-3-3.7b-instruct 441.6 (±144.2) 0.606 0.750 0.175
227 0.5085 (±0.0160/√100) 🟢 moneyforward/houou-instruction-7b-v1 105.9 (±41.0) 0.617 0.781 0.128
228 0.5080 (±0.0306/√100) 💬 stabilityai/japanese-stablelm-instruc... 111.3 (±58.3) 0.548 0.782 0.195
229 0.5073 (±0.0208/√100) 💬 Qwen/Qwen2-57B-A14B-Instruct 154.8 (±89.5) 0.615 0.734 0.173
230 0.5045 (±0.0208/√100) 🟢 Qwen/Qwen2-57B-A14B 106.7 (±22.5) 0.617 0.757 0.139
231 0.5041 (±0.0225/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-dolly... 106.2 (±29.3) 0.579 0.778 0.155
232 0.5037 (±0.0264/√100) 💬 llm-jp/llm-jp-3-3.7b-instruct2 365.8 (±145.5) 0.590 0.746 0.175
233 0.5022 (±0.0221/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-jaste... 95.0 (±36.2) 0.579 0.795 0.132
234 0.5013 (±0.0196/√100) 🟢 google/gemma-2-9b 107.3 (±26.0) 0.595 0.761 0.148
235 0.5013 (±0.0375/√100) 💬 karakuri-ai/karakuri-lm-70b-chat-v0.1 427.4 (±151.5) 0.579 0.723 0.202
236 0.5006 (±0.0476/√100) 💬 llm-jp/llm-jp-3-1.8b-instruct3 223.2 (±122.4) 0.590 0.744 0.168
237 0.5002 (±0.0218/√100) 🟢 Qwen/Qwen-72B-Chat 223.0 (±258.3) 0.614 0.716 0.171
238 0.4995 (±0.0211/√100) 💬 Qwen/Qwen1.5-72B-Chat 119.3 (±58.1) 0.582 0.708 0.208
239 0.4988 (±0.0240/√100) 🟢 sbintuitions/sarashina2.2-0.5b 112.7 (±33.2) 0.614 0.758 0.124
240 0.4973 (±0.0236/√100) 🟢 pfnet/plamo-2-1b 112.6 (±37.4) 0.601 0.771 0.121
241 0.4970 (±0.0117/√100) 💬 Qwen/Qwen2.5-7B-Instruct 65.0 (±22.0) 0.535 0.858 0.098
242 0.4963 (±0.0189/√100) 🟢 Qwen/Qwen1.5-72B-Chat 128.1 (±77.7) 0.586 0.698 0.206
243 0.4959 (±0.0235/√100) 🟢 llm-jp/llm-jp-13b-v1.0 115.0 (±40.9) 0.576 0.756 0.156
244 0.4955 (±0.0602/√100) 💬 llm-jp/llm-jp-3-1.8b-instruct2 194.1 (±123.5) 0.581 0.740 0.166
245 0.4953 (±0.0203/√100) 🟢 meta-llama/Llama-2-70b-hf 110.4 (±25.8) 0.596 0.745 0.145
246 0.4949 (±0.0177/√100) 💬 moneyforward/houou-instruction-7b-v1 180.5 (±66.6) 0.604 0.734 0.146
247 0.4931 (±0.0247/√100) 🟢 Rakuten/RakutenAI-7B-instruct 105.6 (±33.1) 0.598 0.750 0.132
248 0.4921 (±0.0219/√100) 🟢 Rakuten/RakutenAI-7B-chat 114.9 (±44.7) 0.592 0.760 0.124
249 0.4921 (±0.0285/√100) 💬 llm-jp/llm-jp-3-1.8b-instruct 185.0 (±120.2) 0.585 0.752 0.140
250 0.4916 (±0.0201/√100) 🟢 moneyforward/houou-instruction-7b-v2 104.7 (±41.2) 0.588 0.770 0.116
251 0.4912 (±0.0399/√100) 💬 SakanaAI/TinySwallow-1.5B-Instruct 222.0 (±126.2) 0.594 0.735 0.145
252 0.4895 (±0.0440/√100) 💬 llm-jp/llm-jp-13b-instruct-full-dolly... 268.1 (±133.1) 0.548 0.722 0.199
253 0.4872 (±0.0237/√100) 🟢 lightblue/karasu-7B 110.1 (±19.0) 0.586 0.739 0.137
254 0.4870 (±0.0215/√100) 🟢 Qwen/Qwen-72B 134.6 (±114.6) 0.593 0.715 0.152
255 0.4868 (±0.0163/√100) 💬 google/gemma-2-9b-it 47.6 (±14.6) 0.477 0.880 0.104
256 0.4863 (±0.1167/√100) 💬 pfnet/nekomata-14b-pfn-qfin-inst-merge 93.4 (±55.0) 0.544 0.721 0.194
257 0.4862 (±0.0221/√100) 🟢 Qwen/Qwen2-57B-A14B-Instruct 116.9 (±82.5) 0.601 0.734 0.124
258 0.4857 (±0.0168/√100) 💬 moneyforward/houou-instruction-7b-v2 207.0 (±57.3) 0.591 0.719 0.147
259 0.4829 (±0.0211/√100) 🟢 Qwen/Qwen1.5-72B 136.2 (±85.6) 0.591 0.705 0.153
260 0.4827 (±0.0464/√100) 💬 llm-jp/llm-jp-13b-instruct-full-ac_00... 269.1 (±131.5) 0.542 0.716 0.191
261 0.4784 (±0.0181/√100) 🟢 Qwen/Qwen3-4B-Base 105.3 (±18.6) 0.577 0.706 0.153
262 0.4762 (±0.0810/√100) 💬 stabilityai/japanese-stablelm-instruc... 126.2 (±67.4) 0.545 0.726 0.158
263 0.4746 (±0.0210/√100) 🟢 rinna/youri-7b-chat 102.1 (±16.4) 0.571 0.752 0.100
264 0.4744 (±0.0227/√100) 🟢 pfnet/plamo-13b 108.2 (±28.5) 0.558 0.749 0.116
265 0.4743 (±0.0987/√100) 💬 tokyotech-llm/Swallow-7b-NVE-instruct-hf 129.0 (±72.8) 0.535 0.725 0.163
266 0.4731 (±0.0270/√100) 🟢 mlx-community/plamo-2-1b 121.5 (±79.9) 0.576 0.738 0.105
267 0.4730 (±0.0166/√100) 🟢 Xwin-LM/Xwin-LM-13B-V0.2 109.7 (±27.4) 0.582 0.723 0.114
268 0.4723 (±0.0204/√100) 💬 Rakuten/RakutenAI-7B-chat 233.0 (±133.0) 0.565 0.734 0.118
269 0.4723 (±0.0808/√100) 💬 tokyotech-llm/Llama-3-Swallow-8B-Inst... 199.3 (±155.6) 0.563 0.699 0.154
270 0.4718 (±0.0262/√100) 🟢 mlx-community/plamo-2-1b-bf16 121.5 (±80.5) 0.574 0.739 0.103
271 0.4698 (±0.0200/√100) 🟢 Rakuten/RakutenAI-7B 105.4 (±25.6) 0.576 0.721 0.113
272 0.4692 (±0.0161/√100) 🟢 shisa-ai/shisa-v1-qwen2-7b 109.0 (±23.9) 0.563 0.712 0.133
273 0.4691 (±0.0264/√100) 🟢 sbintuitions/sarashina2.2-1b-instruct... 156.3 (±59.3) 0.595 0.638 0.174
274 0.4683 (±0.0211/√100) 💬 llm-jp/llm-jp-3-1.8b-instruct3 402.8 (±140.7) 0.552 0.720 0.133
275 0.4674 (±0.0211/√100) 🟢 Qwen/Qwen2.5-7B 111.5 (±51.4) 0.563 0.707 0.132
276 0.4670 (±0.0202/√100) 💬 llm-jp/llm-jp-3-1.8b-instruct2 400.7 (±146.8) 0.556 0.721 0.124
277 0.4661 (±0.0210/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-dolly... 111.6 (±44.2) 0.536 0.756 0.106
278 0.4659 (±0.0438/√100) 💬 deepseek-ai/deepseek-llm-67b-chat 146.0 (±62.1) 0.555 0.703 0.139
279 0.4659 (±0.0202/√100) 🟢 llm-jp/llm-jp-3-1.8b 105.0 (±16.9) 0.568 0.725 0.105
280 0.4648 (±0.1659/√100) 💬 cyberagent/calm2-7b-chat 124.7 (±95.9) 0.536 0.688 0.171
281 0.4622 (±0.0195/√100) 🟢 Qwen/Qwen-14B-Chat 135.5 (±84.3) 0.572 0.718 0.097
282 0.4619 (±0.0162/√100) 💬 lmsys/vicuna-13b-v1.5-16k 126.5 (±48.4) 0.574 0.715 0.097
283 0.4609 (±0.0113/√10) 🟢 google/gemma-2-2b-jpn-it 69.4 (±24.1) 0.509 0.805 0.069
284 0.4607 (±0.0165/√100) 🟢 SakanaAI/EvoLLM-JP-v1-7B 111.2 (±30.4) 0.579 0.708 0.095
285 0.4601 (±0.0184/√100) 🟢 shisa-ai/shisa-v1-llama3-8b 112.9 (±31.4) 0.557 0.703 0.120
286 0.4597 (±0.0268/√100) 🟢 CohereForAI/c4ai-command-r-v01 179.2 (±166.3) 0.590 0.592 0.197
287 0.4586 (±0.0141/√100) 🟢 google/gemma-2-2b-it 88.2 (±30.8) 0.536 0.761 0.079
288 0.4578 (±0.0210/√100) 🟢 llm-jp/llm-jp-3-980m-instruct2 112.3 (±46.7) 0.559 0.723 0.091
289 0.4570 (±0.0253/√100) 🟢 llm-jp/llm-jp-3-172b-alpha1 111.1 (±34.7) 0.530 0.715 0.126
290 0.4561 (±0.0202/√100) 🟢 pfnet/plamo-13b-instruct 144.0 (±147.7) 0.532 0.763 0.073
291 0.4559 (±0.0201/√100) 🟢 pfnet/plamo-13b-instruct-nc 156.0 (±183.1) 0.523 0.768 0.077
292 0.4558 (±0.0156/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 75.3 (±26.6) 0.488 0.804 0.076
293 0.4543 (±0.0217/√100) 🟢 rinna/youri-7b-instruction 96.2 (±29.5) 0.530 0.743 0.090
294 0.4535 (±0.0348/√100) 💬 Rakuten/RakutenAI-7B-instruct 128.6 (±83.2) 0.527 0.726 0.108
295 0.4535 (±0.0183/√100) 🟢 THUDM/glm-4-9b 110.3 (±36.9) 0.554 0.689 0.118
296 0.4527 (±0.0146/√100) 🟢 lmsys/vicuna-13b-v1.5-16k 107.9 (±25.9) 0.576 0.708 0.075
297 0.4525 (±0.0187/√100) 💬 llm-jp/llm-jp-3-1.8b-instruct 435.4 (±148.4) 0.553 0.706 0.098
298 0.4516 (±0.0276/√100) 💬 sbintuitions/sarashina2.2-1b-instruct... 337.2 (±153.2) 0.573 0.622 0.159
299 0.4504 (±0.0224/√100) 🟢 rinna/nekomata-7b-instruction 96.4 (±23.7) 0.528 0.734 0.089
300 0.4486 (±0.0161/√100) 💬 Qwen/Qwen2-7B-Instruct 163.6 (±61.4) 0.547 0.688 0.111
301 0.4484 (±0.0191/√100) 💬 SakanaAI/EvoLLM-JP-v1-7B 123.9 (±68.1) 0.545 0.706 0.094
302 0.4478 (±0.0245/√100) 💬 sbintuitions/sarashina2.2-1b-instruct... 399.9 (±168.4) 0.568 0.626 0.149
303 0.4477 (±0.0205/√100) 🟢 rinna/llama-3-youko-70b-instruct 130.7 (±95.3) 0.527 0.670 0.146
304 0.4459 (±0.0202/√100) 🟢 llm-jp/llm-jp-3-980m-instruct3 116.0 (±33.5) 0.545 0.707 0.086
305 0.4426 (±0.0204/√100) 🟢 elyza/ELYZA-japanese-Llama-2-13b-inst... 111.1 (±28.2) 0.544 0.687 0.097
306 0.4409 (±0.1064/√100) 💬 lightblue/karasu-7B 138.1 (±92.9) 0.512 0.679 0.131
307 0.4404 (±0.0146/√100) 🟢 rinna/bilingual-gpt-neox-4b-instructi... 75.9 (±22.7) 0.493 0.773 0.056
308 0.4387 (±0.0655/√100) 💬 Qwen/Qwen-72B-Chat 117.7 (±137.1) 0.541 0.632 0.143
309 0.4385 (±0.0285/√100) 💬 rinna/youri-7b-chat 95.4 (±41.1) 0.500 0.733 0.083
310 0.4377 (±0.0107/√100) 🟢 google/gemma-1.1-7b-it 86.8 (±21.4) 0.509 0.732 0.072
311 0.4374 (±0.0217/√100) 🟢 Qwen/Qwen1.5-32B-Chat 127.0 (±57.0) 0.538 0.642 0.133
312 0.4368 (±0.0575/√100) 💬 llm-jp/llm-jp-3-980m-instruct2 195.9 (±127.8) 0.529 0.686 0.096
313 0.4336 (±0.0168/√100) 🟢 stabilityai/japanese-stablelm-base-be... 107.1 (±17.2) 0.539 0.689 0.073
314 0.4335 (±0.0221/√100) 🟢 Qwen/Qwen-14B 118.1 (±71.6) 0.530 0.675 0.096
315 0.4332 (±0.0164/√100) 🟢 Qwen/Qwen2-7B-Instruct 119.1 (±45.7) 0.531 0.670 0.098
316 0.4330 (±0.0149/√100) 💬 google/gemma-2-2b-it 56.0 (±27.8) 0.445 0.788 0.066
317 0.4320 (±0.0171/√100) 🟢 Qwen/Qwen2-7B 109.1 (±40.1) 0.532 0.671 0.093
318 0.4296 (±0.0322/√100) 💬 Qwen/Qwen-14B-Chat 159.0 (±69.7) 0.522 0.675 0.092
319 0.4295 (±0.0157/√100) 🟢 elyza/ELYZA-japanese-Llama-2-7b-instruct 111.5 (±31.4) 0.530 0.676 0.083
320 0.4292 (±0.0181/√100) 💬 Xwin-LM/Xwin-LM-13B-V0.2 240.7 (±48.4) 0.533 0.670 0.085
321 0.4282 (±0.0193/√100) 🟢 stabilityai/japanese-stablelm-3b-4e1t... 110.8 (±26.0) 0.518 0.688 0.078
322 0.4272 (±0.0273/√100) 🟢 mistralai/Mistral-Nemo-Instruct-2407 155.8 (±132.8) 0.548 0.611 0.122
323 0.4265 (±0.0115/√100) 💬 google/gemma-1.1-7b-it 78.7 (±28.4) 0.475 0.739 0.066
324 0.4256 (±0.0270/√100) 🟢 rinna/japanese-gpt-neox-3.6b 129.8 (±73.4) 0.485 0.685 0.106
325 0.4228 (±0.0185/√100) 🟢 stabilityai/japanese-stablelm-base-ja... 110.4 (±28.6) 0.528 0.668 0.073
326 0.4222 (±0.0138/√100) 🟢 Xwin-LM/Xwin-LM-7B-V0.2 110.6 (±29.3) 0.520 0.677 0.070
327 0.4220 (±0.0185/√100) 🟢 lmsys/vicuna-7b-v1.5-16k 111.8 (±31.8) 0.522 0.670 0.074
328 0.4207 (±0.0189/√100) 🟢 stabilityai/japanese-stablelm-3b-4e1t... 112.8 (±27.0) 0.507 0.683 0.072
329 0.4201 (±0.0177/√100) 💬 lmsys/vicuna-7b-v1.5-16k 128.1 (±52.5) 0.514 0.668 0.078
330 0.4164 (±0.0244/√100) 🟢 google/gemma-7b 135.5 (±132.3) 0.533 0.631 0.085
331 0.4150 (±0.0212/√100) 💬 Qwen/Qwen1.5-32B-Chat 125.7 (±250.5) 0.496 0.620 0.130
332 0.4149 (±0.0375/√100) 💬 llm-jp/llm-jp-13b-instruct-full-dolly... 186.6 (±108.4) 0.469 0.685 0.090
333 0.4144 (±0.0149/√100) 💬 01-ai/Yi-1.5-34B-Chat 170.6 (±47.1) 0.514 0.628 0.101
334 0.4140 (±0.0208/√100) 🟢 meta-llama/Meta-Llama-3-8B-Instruct 116.8 (±44.3) 0.523 0.637 0.082
335 0.4125 (±0.0303/√100) 💬 CohereForAI/c4ai-command-r-v01 137.7 (±324.6) 0.519 0.562 0.157
336 0.4122 (±0.0199/√100) 🟢 rinna/bilingual-gpt-neox-4b 121.0 (±43.6) 0.485 0.660 0.092
337 0.4097 (±0.0187/√100) 🟢 meta-llama/Meta-Llama-3.1-8B 108.7 (±35.4) 0.512 0.650 0.068
338 0.4087 (±0.0201/√100) 🟢 meta-llama/Llama-2-70b-chat-hf 161.3 (±140.8) 0.519 0.608 0.099
339 0.4087 (±0.0146/√100) 🟢 microsoft/Phi-3-small-8k-instruct 109.1 (±24.1) 0.514 0.644 0.068
340 0.4080 (±0.0206/√100) 💬 llm-jp/llm-jp-3-980m-instruct2 430.8 (±147.5) 0.505 0.653 0.067
341 0.4076 (±0.0142/√100) 🟢 elyza/ELYZA-japanese-Llama-2-7b-fast-... 109.0 (±32.9) 0.503 0.644 0.076
342 0.4074 (±0.0207/√100) 💬 elyza/ELYZA-japanese-Llama-2-13b-inst... 156.6 (±65.9) 0.490 0.646 0.086
343 0.4073 (±0.0175/√100) 🟢 stabilityai/japanese-stablelm-instruc... 110.0 (±26.5) 0.490 0.663 0.070
344 0.4064 (±0.0176/√100) 🟢 Qwen/Qwen3-1.7B-Base 107.9 (±27.9) 0.503 0.635 0.081
345 0.4058 (±0.0295/√100) 💬 rinna/youri-7b-instruction 97.0 (±57.0) 0.439 0.713 0.065
346 0.4050 (±0.0191/√100) 🟢 mistralai/Mixtral-8x22B-v0.1 115.6 (±55.4) 0.517 0.615 0.084
347 0.4048 (±0.0175/√100) 🟢 meta-llama/Meta-Llama-3-8B 109.0 (±19.8) 0.505 0.641 0.068
348 0.4048 (±0.0263/√20) 💬 ntt/tsuzumi-7b 172.0 (±90.8) 0.491 0.644 0.080
349 0.4045 (±0.0186/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 133.1 (±57.4) 0.475 0.678 0.061
350 0.4044 (±0.0219/√100) 💬 sbintuitions/sarashina2.2-0.5b-instru... 217.6 (±82.9) 0.532 0.590 0.091
351 0.4042 (±0.0131/√100) 🟢 microsoft/Orca-2-13b 115.5 (±42.6) 0.510 0.630 0.073
352 0.4041 (±0.0218/√100) 💬 meta-llama/Meta-Llama-3-8B-Instruct 131.4 (±88.3) 0.508 0.614 0.090
353 0.4035 (±0.0151/√100) 🟢 SakanaAI/EvoLLM-JP-A-v1-7B 110.4 (±31.3) 0.508 0.633 0.069
354 0.4033 (±0.0164/√100) 🟢 elyza/ELYZA-japanese-Llama-2-13b-fast... 107.2 (±28.5) 0.495 0.643 0.072
355 0.4032 (±0.0237/√100) 🟢 Qwen/Qwen1.5-32B 150.3 (±104.8) 0.505 0.605 0.100
356 0.4024 (±0.0187/√100) 🟢 01-ai/Yi-1.5-34B 109.9 (±28.2) 0.493 0.631 0.083
357 0.4014 (±0.0195/√100) 🟢 sbintuitions/sarashina2.2-0.5b-instru... 160.5 (±57.9) 0.532 0.581 0.091
358 0.4013 (±0.0162/√100) 🟢 Qwen/Qwen2.5-3B 113.3 (±35.0) 0.504 0.628 0.072
359 0.4011 (±0.0236/√100) 🟢 cyberagent/open-calm-7b 143.8 (±97.0) 0.472 0.641 0.091
360 0.4006 (±0.0166/√100) 💬 microsoft/Phi-3-small-8k-instruct 189.7 (±84.1) 0.500 0.630 0.073
361 0.4001 (±0.0199/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 117.6 (±48.9) 0.464 0.684 0.052
362 0.3985 (±0.0161/√100) 🟢 elyza/ELYZA-japanese-Llama-2-13b 138.4 (±51.8) 0.493 0.634 0.069
363 0.3960 (±0.0199/√100) 🟢 line-corporation/japanese-large-lm-1.7b 179.2 (±174.5) 0.474 0.650 0.065
364 0.3953 (±0.0207/√100) 💬 llm-jp/llm-jp-3-980m-instruct3 404.7 (±156.1) 0.482 0.637 0.067
365 0.3949 (±0.0193/√100) 💬 meta-llama/Meta-Llama-3.1-8B-Instruct 216.6 (±345.2) 0.487 0.624 0.074
366 0.3948 (±0.0190/√100) 💬 Qwen/Qwen1.5-14B-Chat 127.9 (±50.6) 0.500 0.604 0.080
367 0.3946 (±0.0201/√100) 🟢 Qwen/Qwen1.5-14B 130.9 (±67.8) 0.509 0.609 0.066
368 0.3945 (±0.0214/√100) 💬 sbintuitions/sarashina2.2-0.5b-instru... 435.0 (±169.2) 0.517 0.592 0.074
369 0.3934 (±0.0201/√100) 🟢 stabilityai/japanese-stablelm-instruc... 107.8 (±38.0) 0.466 0.648 0.066
370 0.3914 (±0.0172/√100) 🟢 mistralai/Mixtral-8x7B-Instruct-v0.1 95.1 (±25.2) 0.488 0.636 0.050
371 0.3863 (±0.0160/√100) 🟢 Qwen/Qwen1.5-14B-Chat 131.4 (±55.8) 0.491 0.593 0.075
372 0.3837 (±0.0188/√100) 🟢 rinna/bilingual-gpt-neox-4b-instructi... 117.4 (±42.4) 0.462 0.649 0.041
373 0.3828 (±0.0182/√100) 🟢 google/gemma-2-2b 112.5 (±25.6) 0.486 0.616 0.046
374 0.3823 (±0.0645/√100) 💬 mistralai/Mistral-Nemo-Instruct-2407 157.9 (±140.3) 0.484 0.563 0.100
375 0.3822 (±0.0647/√100) 💬 llm-jp/llm-jp-13b-instruct-full-dolly... 97.6 (±76.2) 0.397 0.664 0.086
376 0.3819 (±0.0265/√100) 🟢 google/gemma-2-27b 214.2 (±183.3) 0.450 0.608 0.087
377 0.3804 (±0.0161/√100) 🟢 Qwen/Qwen-7B-Chat 140.8 (±65.1) 0.485 0.612 0.045
378 0.3803 (±0.0249/√100) 💬 elyza/ELYZA-japanese-Llama-2-7b-instruct 136.4 (±70.7) 0.452 0.619 0.070
379 0.3777 (±0.0196/√100) 🟢 llm-jp/llm-jp-3-980m 101.6 (±20.5) 0.460 0.631 0.043
380 0.3772 (±0.0162/√100) 💬 microsoft/Phi-3-small-128k-instruct 199.7 (±111.9) 0.473 0.590 0.069
381 0.3760 (±0.0236/√100) 🟢 cyberagent/open-calm-3b 123.2 (±79.0) 0.442 0.624 0.062
382 0.3759 (±0.0149/√100) 🟢 lmsys/longchat-7b-v1.5-32k 116.9 (±31.6) 0.474 0.609 0.045
383 0.3740 (±0.0164/√100) 🟢 meta-llama/Llama-2-13b-hf 108.5 (±21.8) 0.474 0.603 0.045
384 0.3737 (±0.0197/√100) 🟢 meta-llama/Meta-Llama-3.1-8B-Instruct 204.5 (±303.4) 0.478 0.589 0.055
385 0.3728 (±0.0210/√100) 🟢 llm-jp/llm-jp-3-440m-instruct2 110.0 (±37.1) 0.455 0.625 0.040
386 0.3720 (±0.0622/√100) 💬 Xwin-LM/Xwin-LM-7B-V0.2 205.3 (±79.1) 0.466 0.590 0.060
387 0.3720 (±0.0157/√100) 🟢 elyza/ELYZA-japanese-Llama-2-13b-fast 177.5 (±147.2) 0.458 0.598 0.061
388 0.3699 (±0.0345/√100) 💬 Qwen/Qwen-7B-Chat 182.9 (±110.3) 0.468 0.600 0.042
389 0.3694 (±0.0103/√100) 🟢 google/gemma-7b-it 89.7 (±21.6) 0.446 0.640 0.022
390 0.3685 (±0.0173/√100) 🟢 elyza/ELYZA-japanese-Llama-2-7b 140.0 (±52.8) 0.462 0.596 0.047
391 0.3673 (±0.0089/√100) 💬 google/gemma-7b-it 110.0 (±47.6) 0.448 0.633 0.020
392 0.3655 (±0.0116/√100) 🟢 deepseek-ai/deepseek-llm-7b-chat 113.9 (±24.7) 0.474 0.579 0.043
393 0.3642 (±0.0165/√100) 🟢 llm-jp/llm-jp-1.3b-v1.0 134.0 (±62.6) 0.437 0.612 0.044
394 0.3637 (±0.0223/√100) 🟢 cyberagent/open-calm-large 122.3 (±73.9) 0.424 0.611 0.056
395 0.3637 (±0.0152/√100) 🟢 elyza/ELYZA-japanese-Llama-2-7b-fast 168.0 (±77.4) 0.452 0.587 0.052
396 0.3632 (±0.0237/√100) 💬 elyza/ELYZA-japanese-Llama-2-7b-fast-... 178.6 (±113.6) 0.443 0.582 0.064
397 0.3630 (±0.0234/√100) 🟢 llm-jp/llm-jp-3-440m-instruct3 115.2 (±40.1) 0.442 0.605 0.042
398 0.3628 (±0.0145/√100) 🟢 Qwen/Qwen-7B 117.3 (±39.0) 0.468 0.582 0.039
399 0.3611 (±0.0544/√100) 💬 llm-jp/llm-jp-3-440m-instruct2 244.7 (±154.0) 0.451 0.588 0.044
400 0.3589 (±0.0394/√100) 💬 llm-jp/llm-jp-3-440m-instruct3 286.6 (±158.5) 0.448 0.582 0.047
401 0.3554 (±0.0178/√100) 🟢 meta-llama/Llama-2-7b-chat-hf 139.3 (±93.1) 0.464 0.570 0.031
402 0.3545 (±0.0445/√100) 💬 llm-jp/llm-jp-13b-instruct-full-jaste... 48.8 (±50.1) 0.283 0.723 0.058
403 0.3543 (±0.0439/√100) 💬 lmsys/longchat-7b-v1.5-32k 160.1 (±73.5) 0.448 0.572 0.043
404 0.3538 (±0.0175/√100) 🟢 01-ai/Yi-1.5-9B 113.0 (±29.4) 0.457 0.555 0.050
405 0.3531 (±0.0159/√100) 🟢 mistralai/Mixtral-8x7B-v0.1 94.3 (±20.8) 0.450 0.573 0.037
406 0.3514 (±0.0102/√100) 🟢 google/gemma-1.1-2b-it 80.4 (±21.6) 0.404 0.625 0.025
407 0.3495 (±0.0268/√100) 🟢 cyberagent/open-calm-1b 141.3 (±110.0) 0.412 0.578 0.059
408 0.3477 (±0.0244/√100) 💬 llm-jp/llm-jp-3-440m-instruct2 432.3 (±161.3) 0.432 0.568 0.043
409 0.3471 (±0.0131/√100) 🟢 microsoft/Orca-2-7b 131.1 (±70.7) 0.447 0.555 0.039
410 0.3465 (±0.0202/√100) 💬 deepseek-ai/deepseek-llm-7b-chat 167.2 (±76.5) 0.435 0.562 0.042
411 0.3463 (±0.0178/√100) 💬 mistralai/Mixtral-8x7B-Instruct-v0.1 147.1 (±111.8) 0.448 0.548 0.043
412 0.3449 (±0.0986/√100) 💬 stabilityai/japanese-stablelm-instruc... 109.4 (±66.2) 0.397 0.585 0.053
413 0.3440 (±0.0978/√100) 💬 stabilityai/japanese-stablelm-3b-4e1t... 127.8 (±80.5) 0.401 0.576 0.055
414 0.3436 (±0.0126/√100) 💬 01-ai/Yi-1.5-9B-Chat 143.6 (±60.1) 0.438 0.540 0.053
415 0.3428 (±0.0163/√100) 🟢 meta-llama/Llama-2-7b-hf 112.3 (±28.0) 0.440 0.550 0.038
416 0.3408 (±0.0225/√100) 🟢 anthracite-org/magnum-32b-v2 191.9 (±223.2) 0.442 0.507 0.073
417 0.3393 (±0.0225/√100) 🟢 stockmark/gpt-neox-japanese-1.4b 92.2 (±63.7) 0.351 0.641 0.025
418 0.3338 (±0.0493/√100) 🟢 SakanaAI/TinySwallow-1.5B 142.2 (±109.9) 0.415 0.534 0.052
419 0.3322 (±0.0151/√100) 🟢 Qwen/Qwen1.5-7B-Chat 127.7 (±117.0) 0.431 0.520 0.045
420 0.3320 (±0.0170/√100) 🟢 Qwen/Qwen2.5-1.5B 117.7 (±41.6) 0.431 0.533 0.032
421 0.3315 (±0.0203/√100) 🟢 Qwen/Qwen1.5-7B 141.8 (±126.5) 0.445 0.504 0.046
422 0.3313 (±0.0115/√100) 🟢 google/gemma-2b-it 85.9 (±24.7) 0.393 0.577 0.024
423 0.3293 (±0.0252/√100) 💬 Qwen/Qwen1.5-7B-Chat 195.7 (±113.1) 0.429 0.503 0.056
424 0.3276 (±0.0709/√100) 💬 elyza/ELYZA-japanese-Llama-2-13b-fast... 134.0 (±98.8) 0.395 0.543 0.045
425 0.3272 (±0.0101/√100) 💬 01-ai/Yi-1.5-6B-Chat 194.4 (±75.0) 0.426 0.530 0.025
426 0.3209 (±0.0175/√100) 💬 llm-jp/llm-jp-3-440m-instruct3 375.9 (±168.6) 0.391 0.533 0.039
427 0.3199 (±0.0181/√100) 🟢 llm-jp/llm-jp-3-440m 110.0 (±33.4) 0.390 0.543 0.027
428 0.3187 (±0.0142/√100) 🟢 Qwen/Qwen2-1.5B-Instruct 131.4 (±46.7) 0.421 0.513 0.022
429 0.3172 (±0.0150/√100) 🟢 Qwen/Qwen2-1.5B 120.9 (±30.7) 0.422 0.511 0.019
430 0.3161 (±0.0119/√100) 🟢 deepseek-ai/deepseek-llm-7b-base 113.7 (±21.6) 0.424 0.501 0.024
431 0.3147 (±0.0175/√100) 💬 Qwen/Qwen2-1.5B-Instruct 180.7 (±101.0) 0.408 0.511 0.025
432 0.3078 (±0.0195/√100) 🟢 cyberagent/open-calm-medium 117.3 (±59.4) 0.363 0.537 0.024
433 0.3067 (±0.0149/√100) 🟢 Qwen/Qwen3-0.6B-Base 116.1 (±34.4) 0.406 0.492 0.022
434 0.3058 (±0.1106/√100) 💬 rinna/nekomata-7b-instruction 61.2 (±57.0) 0.307 0.567 0.043
435 0.3053 (±0.0177/√100) 🟢 google/gemma-2b 151.5 (±113.6) 0.410 0.480 0.026
436 0.3050 (±0.0190/√100) 🟢 Qwen/Qwen1.5-MoE-A2.7B 146.4 (±90.3) 0.412 0.468 0.035
437 0.2993 (±0.0095/√100) 🟢 01-ai/Yi-1.5-6B-Chat 133.3 (±46.2) 0.394 0.481 0.022
438 0.2993 (±0.0107/√100) 🟢 tiiuae/falcon-11B 121.6 (±31.5) 0.398 0.483 0.016
439 0.2957 (±0.0641/√100) 💬 meta-llama/Llama-2-13b-chat-hf 305.2 (±299.7) 0.402 0.453 0.032
440 0.2953 (±0.0442/√100) 🟢 augmxnt/shisa-base-7b-v1 200.4 (±160.3) 0.378 0.478 0.030
441 0.2924 (±0.0506/√100) 💬 Qwen/Qwen1.5-MoE-A2.7B-Chat 245.1 (±209.1) 0.381 0.453 0.043
442 0.2914 (±0.0133/√100) 🟢 mistralai/Mistral-7B-v0.1 117.4 (±40.4) 0.402 0.454 0.018
443 0.2907 (±0.0175/√100) 🟢 Qwen/Qwen1.5-MoE-A2.7B-Chat 149.8 (±91.0) 0.388 0.448 0.036
444 0.2900 (±0.0226/√100) 💬 llm-jp/llm-jp-3-150m-instruct2 421.0 (±181.6) 0.365 0.485 0.020
445 0.2869 (±0.0214/√100) 🟢 llm-jp/llm-jp-3-150m-instruct2 108.9 (±41.1) 0.342 0.498 0.021
446 0.2853 (±0.0163/√100) 🟢 Qwen/Qwen1.5-4B-Chat 127.8 (±71.2) 0.395 0.441 0.019
447 0.2809 (±0.0133/√100) 🟢 Qwen/Qwen1.5-1.8B-Chat 178.3 (±92.0) 0.381 0.445 0.017
448 0.2799 (±0.0233/√100) 🟢 llm-jp/llm-jp-3-150m-instruct3 121.5 (±43.8) 0.340 0.478 0.022
449 0.2785 (±0.0179/√100) 💬 llm-jp/llm-jp-3-150m-instruct3 412.9 (±178.5) 0.344 0.470 0.021
450 0.2770 (±0.0131/√100) 🟢 mistralai/Mistral-7B-Instruct-v0.2 146.2 (±70.1) 0.387 0.419 0.024
451 0.2769 (±0.0324/√100) 💬 llm-jp/llm-jp-13b-instruct-full-jaste... 16.9 (±24.6) 0.125 0.693 0.013
452 0.2769 (±0.1029/√100) 💬 stabilityai/japanese-stablelm-instruc... 117.0 (±115.0) 0.307 0.489 0.035
453 0.2666 (±0.0241/√100) 🟢 deepseek-ai/deepseek-llm-67b-chat 140.2 (±83.0) 0.351 0.440 0.009
454 0.2661 (±0.0128/√100) 🟢 Qwen/Qwen1.5-1.8B 129.7 (±65.7) 0.360 0.424 0.014
455 0.2631 (±0.0168/√100) 🟢 Qwen/Qwen2.5-0.5B 126.3 (±53.1) 0.355 0.422 0.013
456 0.2613 (±0.0136/√100) 🟢 Qwen/Qwen2-0.5B-Instruct 176.8 (±98.9) 0.351 0.426 0.007
457 0.2604 (±0.0148/√100) 🟢 mistralai/Mistral-7B-Instruct-v0.1 139.8 (±101.3) 0.367 0.400 0.014
458 0.2598 (±0.0129/√100) 🟢 Qwen/Qwen2-0.5B 122.7 (±43.5) 0.350 0.420 0.009
459 0.2581 (±0.0196/√100) 🟢 cyberagent/open-calm-small 119.1 (±54.1) 0.310 0.460 0.004
460 0.2555 (±0.0163/√100) 🟢 Qwen/Qwen1.5-4B 149.2 (±76.6) 0.363 0.388 0.015
461 0.2543 (±0.0266/√100) 🟢 mosaicml/mpt-30b-chat 121.3 (±46.4) 0.327 0.428 0.008
462 0.2446 (±0.0204/√100) 🟢 llm-jp/llm-jp-3-150m 107.6 (±41.1) 0.297 0.427 0.009
463 0.2442 (±0.0589/√100) 💬 llm-jp/llm-jp-3-150m-instruct2 256.2 (±198.3) 0.304 0.410 0.019
464 0.2414 (±0.0281/√100) 💬 Qwen/Qwen1.5-1.8B-Chat 480.0 (±210.3) 0.329 0.392 0.003
465 0.2394 (±0.0745/√100) 💬 Qwen/Qwen1.5-4B-Chat 105.3 (±104.1) 0.307 0.390 0.021
466 0.2317 (±0.0455/√100) 💬 mistralai/Mistral-7B-Instruct-v0.1 202.3 (±153.9) 0.320 0.362 0.012
467 0.2231 (±0.0166/√100) 💬 mistralai/Mistral-7B-Instruct-v0.2 261.2 (±166.3) 0.316 0.334 0.019
468 0.2182 (±0.0152/√100) 🟢 microsoft/phi-1 47.6 (±34.3) 0.234 0.420 0.000
469 0.2177 (±0.0110/√100) 🟢 Qwen/Qwen1.5-0.5B-Chat 143.4 (±52.1) 0.317 0.327 0.009
470 0.2169 (±0.0561/√100) 💬 Qwen/Qwen2-0.5B-Instruct 129.5 (±114.3) 0.265 0.379 0.006
471 0.2169 (±0.0218/√100) 🟢 mosaicml/mpt-30b-instruct 109.8 (±36.1) 0.274 0.370 0.008
472 0.2146 (±0.0151/√100) 🟢 microsoft/phi-2 78.0 (±31.4) 0.287 0.356 0.001
473 0.2061 (±0.0820/√100) 💬 meta-llama/Llama-2-70b-chat-hf 523.3 (±444.5) 0.271 0.303 0.045
474 0.2040 (±0.0152/√100) 🟢 Qwen/Qwen1.5-0.5B 138.6 (±55.9) 0.296 0.314 0.003
475 0.2038 (±0.0538/√100) 🟢 mosaicml/mpt-30b 236.5 (±433.3) 0.271 0.334 0.007
476 0.2004 (±0.0736/√100) 💬 llm-jp/llm-jp-3-150m-instruct3 296.9 (±240.0) 0.251 0.335 0.015
477 0.1885 (±0.0194/√100) 🟢 microsoft/phi-1_5 77.5 (±33.6) 0.258 0.306 0.001
478 0.1833 (±0.0406/√100) 💬 google/gemma-1.1-2b-it 32.6 (±26.7) 0.171 0.376 0.003
479 0.1765 (±0.0439/√100) 💬 Qwen/Qwen1.5-0.5B-Chat 214.3 (±172.6) 0.251 0.276 0.002
480 0.1687 (±0.0172/√100) 🟢 upstage/SOLAR-10.7B-v1.0 171.0 (±87.1) 0.265 0.237 0.004
481 0.1544 (±0.0132/√100) 🟢 01-ai/Yi-1.5-34B-Chat 730.0 (±533.6) 0.201 0.256 0.006
482 0.1475 (±0.0826/√100) 💬 mosaicml/mpt-30b-chat 112.2 (±112.4) 0.182 0.254 0.007
483 0.1241 (±0.0558/√100) 💬 google/gemma-2b-it 24.1 (±24.6) 0.115 0.257 0.000
484 0.1226 (±0.0240/√100) 🟢 Deci/DeciLM-7B 174.0 (±165.5) 0.190 0.174 0.003
485 0.1160 (±0.0081/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 212.1 (±148.9) 0.153 0.195 0.000
486 0.1009 (±0.0846/√100) 💬 meta-llama/Llama-2-7b-chat-hf 241.5 (±336.2) 0.136 0.158 0.009
487 0.1004 (±0.0094/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 123.1 (±128.8) 0.119 0.182 0.000
488 0.0987 (±0.0145/√100) 🟢 deepseek-ai/deepseek-llm-67b-base 154.2 (±77.3) 0.174 0.121 0.000
489 0.0982 (±0.1596/√100) 💬 rinna/nekomata-14b-instruction 16.0 (±38.1) 0.115 0.141 0.039
490 0.0955 (±0.0102/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 129.5 (±141.0) 0.116 0.170 0.000
491 0.0939 (±0.0064/√100) 🟢 sbintuitions/tiny-lm-chat 250.2 (±275.6) 0.133 0.149 0.000
492 0.0936 (±0.0082/√100) 💬 sbintuitions/tiny-lm-chat 276.7 (±209.6) 0.135 0.145 0.000
493 0.0921 (±0.0058/√100) 🟢 sbintuitions/tiny-lm 471.9 (±199.0) 0.135 0.142 0.000
494 0.0880 (±0.0334/√100) 🟢 rinna/bilingual-gpt-neox-4b-instructi... 134.0 (±144.7) 0.105 0.159 0.000
495 0.0762 (±0.0033/√100) 🟢 line-corporation/japanese-large-lm-3.6b 1066.6 (±31.6) 0.125 0.103 0.000
496 0.0760 (±0.0032/√100) 🟢 line-corporation/japanese-large-lm-3.... 1066.4 (±31.8) 0.125 0.103 0.000
497 0.0758 (±0.0034/√100) 💬 line-corporation/japanese-large-lm-3.... 1067.2 (±31.8) 0.125 0.102 0.000
498 0.0673 (±0.0085/√100) 🟢 moneyforward/houou-instruction-7b-v3 143.2 (±112.2) 0.098 0.104 0.000
499 0.0625 (±0.0169/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-ac_00... 31.6 (±10.3) 0.088 0.099 0.000
500 0.0429 (±0.0440/√100) 🟢 rinna/bilingual-gpt-neox-4b-instructi... 31.7 (±54.7) 0.045 0.084 0.000
501 0.0406 (±0.0028/√100) 🟢 microsoft/Phi-3-small-128k-instruct 268.1 (±123.4) 0.083 0.039 0.000
502 0.0337 (±0.0026/√100) 🟢 augmxnt/shisa-7b-v1 590.7 (±238.2) 0.076 0.025 0.000
503 0.0284 (±0.0012/√100) 🟢 lightblue/karasu-7B-chat-plus 285.1 (±53.8) 0.080 0.005 0.000
504 0.0225 (±0.0702/√100) 💬 SakanaAI/EvoLLM-JP-A-v1-7B 5.9 (±27.6) 0.026 0.037 0.005
505 0.0180 (±0.0039/√100) 🟢 mistralai/Mistral-Nemo-Base-2407 607.5 (±344.5) 0.039 0.015 0.000
506 0.0047 (±0.0024/√100) 🟢 ai-forever/mGPT-13B 321.1 (±266.7) 0.008 0.006 0.000
507 0.0022 (±0.0006/√100) 🟢 lightblue/qarasu-14B-chat-plus-unleashed 937.5 (±557.0) 0.004 0.002 0.000
508 0.0019 (±0.0002/√100) 🟢 01-ai/Yi-1.5-9B-Chat 1440.0 (±51.9) 0.005 0.001 0.000
509 0.0018 (±0.0004/√100) 🟢 CohereForAI/aya-23-8B 1676.6 (±351.0) 0.004 0.002 0.000
510 0.0006 (±0.0002/√100) 🟢 meta-llama/Llama-2-13b-chat-hf 1523.9 (±43.5) 0.001 0.001 0.000
511 0.0000 (±0.0000/√100) 🟢 01-ai/Yi-1.5-6B 0.0 (±0.0) 0.000 0.000 0.000
512 0.0000 (±0.0000/√100) 🟢 lightblue/karasu-1.1B 0.0 (±0.0) 0.000 0.000 0.000
513 0.0000 (±0.0000/√100) 🟢 lightblue/karasu-7B-chat-plus-unleashed 0.0 (±0.0) 0.000 0.000 0.000
514 0.0000 (±0.0000/√100) 🟢 lightblue/karasu-7B-chat 0.0 (±0.0) 0.000 0.000 0.000
515 0.0000 (±0.0000/√100) 🟢 lightblue/suzume-llama-3-8B-japanese 300.0 (±0.0) 0.000 0.000 0.000
516 0.0000 (±0.0000/√100) 🟢 lightblue/suzume-llama-3-8B-multilingual 300.0 (±0.0) 0.000 0.000 0.000

FAQ

What is the difference between the modes?

pfgen-bench provides three types of templates: completion, qa, and chat.

  • completion: No instruction is provided. It consists solely of question-answer pairs.
  • qa: An instruction is included at the beginning of the user message.
  • chat: An instruction is placed in a system message.
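The three modes can be illustrated with a minimal sketch of prompt assembly. This is a hypothetical illustration, not the actual pfgen-bench implementation; the function and field names are invented for clarity.

```python
def build_prompt(mode, instruction, examples, question):
    """Assemble chat-style messages for the three hypothetical template modes.

    `examples` is a list of (question, answer) pairs used as few-shot shots.
    """
    shots = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in examples)
    if mode == "completion":
        # No instruction at all: the task is conveyed purely by examples.
        return [{"role": "user", "content": f"{shots}Q: {question}\nA:"}]
    if mode == "qa":
        # Instruction prepended to the beginning of the user message.
        return [{"role": "user",
                 "content": f"{instruction}\n\n{shots}Q: {question}\nA:"}]
    if mode == "chat":
        # Instruction carried in a separate system message.
        return [{"role": "system", "content": instruction},
                {"role": "user", "content": question}]
    raise ValueError(f"unknown mode: {mode}")
```

In `completion` mode the instruction string is never used, which is what lets the benchmark compare pretrained models without any template-specific wording.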

Should we control the temperature?

pfgen-bench recommends setting the temperature to 1.0.

Some tasks (e.g., simulating dice rolls) genuinely require a temperature of 1.0, and lowering the temperature often leads to unnatural repetition in generated text.
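Why lower temperatures cause repetition can be seen in a minimal, standalone sketch of temperature-scaled sampling (this is generic softmax sampling, not pfgen-bench code): dividing the logits by a small temperature sharpens the distribution until the argmax token is chosen almost every step.

```python
import math
import random


def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Sample a token index from logits after temperature scaling.

    temperature=1.0 samples from the model's unmodified distribution;
    small temperatures collapse the distribution onto the argmax,
    which is what produces repetitive output.
    """
    rng = rng or random.Random(0)
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1
```

At `temperature=0.01` the highest-logit token is selected essentially deterministically, whereas `temperature=1.0` preserves the full distribution.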

Citation

If you use this repository, please cite the following paper:

@preprint{Imos2024-pre-pfgen,
  title={{pfgen-bench: 日本語事前学習モデルのための文章生成性能評価ベンチマーク}},
  author={今城, 健太郎 and 平野, 正徳 and 鈴木, 脩司 and 三上, 裕明},
  doi={10.51094/jxiv.1008},
  year={2024}
}
@preprint{Imos2025-judge-free,
  title={{A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis}},
  author={Kentaro Imajo and Masanori Hirano and Shuji Suzuki and Hiroaki Mikami},
  year={2025},
  eprint={2502.09316},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.09316},
  doi={10.48550/arXiv.2502.09316}
}

Or cite this repository directly:

@misc{imajo2024-pfgen,
    title={{Preferred Generation Benchmark}},
    author={Kentaro Imajo and Masanori Hirano and Shuji Suzuki and Hiroaki Mikami},
    year={2024},
    url={https://github.com/pfnet-research/pfgen-bench}
}
