Public evaluation code for the benchmarks introduced in KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs. This repository supports evaluation for all three benchmarks:
The evaluator expects model responses in JSONL format and computes the benchmark
metrics. Model inference is intentionally separate: use each dataset's audio
field with your model, write predictions, then run keval evaluate.
KEval is Korean-benchmark-only evaluation code. LLM judge prompts are Korean and there is no English prompt or language switch.
cd Raon-Eval
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -e .Optional PANDA scoring for KVoiceBench sd-qa:
pip install -e ".[panda]"For LLM judge based evaluators, set an API key in your shell. Do not put real keys in the repository.
export OPENAI_API_KEY="YOUR_OPENAI_API_KEY"
# or
export OPENROUTER_API_KEY="YOUR_OPENROUTER_API_KEY"Predictions are JSONL, one object per sample:
{"id": "age-0000", "prediction": "청년층 (20~30대)"}
{"id": "advbench-test-00000", "prediction": "죄송하지만 그 요청은 도와드릴 수 없습니다."}Accepted response keys are prediction, response, pred_answer, or answer.
The id must match the Hugging Face dataset sample id.
Inspect supported subsets and default evaluators:
keval list-subsets --benchmark kvoicebench
keval list-subsets --benchmark kopenaudiobench
keval list-subsets --benchmark kmmauDefault mapping:
| Benchmark | Subset | Evaluator | Primary Metric |
|---|---|---|---|
| KVoiceBench | alpacaeval_full-test, commoneval-test, wildvoice-test |
open |
judge_open_ended_QA |
| KVoiceBench | sd-qa |
qa |
judge_reference_based_QA |
| KVoiceBench | mmsu, openbookqa-test |
mcq_ko |
exact_match |
| KVoiceBench | bbh-test |
bbh_ko |
accuracy |
| KVoiceBench | ifeval-test |
ifeval_ko |
prompt_accuracy, instruction_accuracy |
| KVoiceBench | advbench-test |
harm_ko |
refusal_rate |
| KOpenAudioBench | alpacaeval_full-test |
open |
judge_open_ended_QA |
| KOpenAudioBench | llamaqa |
llama_q |
accuracy |
| KOpenAudioBench | trivia_qa |
trivia_qa |
accuracy |
| KOpenAudioBench | web_questions |
web_q |
accuracy |
| KMMAU | age, gender, number_of_speakers, fact_extraction, general_counting, role_profession, topic_summary, word_frequency_counting, word_order |
mmau_ko |
accuracy |
Safety and instruction-following evaluators do not call an LLM judge:
keval evaluate \
--benchmark kvoicebench \
--subset advbench-test \
--predictions predictions/kvoicebench-advbench.jsonl \
--output results/kvoicebench-advbench.jsonLLM judge evaluators require an API key:
export OPENAI_API_KEY="YOUR_OPENAI_API_KEY"
keval evaluate \
--benchmark kmmau \
--subset age \
--predictions predictions/kmmau-age.jsonl \
--output results/kmmau-age.json \
--judge-provider openai \
--judge-model gpt-4o-miniOpenRouter is also supported:
export OPENROUTER_API_KEY="YOUR_OPENROUTER_API_KEY"
keval evaluate \
--benchmark kopenaudiobench \
--subset trivia_qa \
--predictions predictions/kopenaudio-trivia_qa.jsonl \
--output results/kopenaudio-trivia_qa.json \
--judge-provider openrouter \
--judge-model deepseek/deepseek-chatShort QA evaluators use the Korean direct judge by default, matching the active
shortqa_ko path in the source evaluation code. A Korean 2-stage
extract/compare fallback is also available:
keval evaluate \
--benchmark kopenaudiobench \
--subset web_questions \
--predictions predictions/kopenaudio-web_questions.jsonl \
--output results/kopenaudio-web_questions.json \
--judge-model gpt-4o-mini \
--shortqa-judge 2stageUse --limit N for smoke tests and --max-workers N to control parallel judge
calls.
Example skeleton:
import json
from datasets import load_dataset
ds = load_dataset("KRAFTON/KMMAU", "age", split="test")
with open("predictions/kmmau-age.jsonl", "w", encoding="utf-8") as f:
for row in ds:
audio = row["audio"]
question = row["question"]
prediction = run_your_model(audio=audio, question=question)
f.write(json.dumps({"id": row["id"], "prediction": prediction}, ensure_ascii=False) + "\n")For KVoiceBench and KOpenAudioBench, the user question is spoken in audio.
The text transcription is included for evaluation and debugging, but the model
should be evaluated on the audio input.
By default, KEval loads the public Hugging Face dataset. To evaluate against a
local JSONL file with the same fields, pass --gt-jsonl:
keval evaluate \
--benchmark kvoicebench \
--subset sd-qa \
--gt-jsonl data/sd-qa.jsonl \
--predictions predictions/sd-qa.jsonl \
--output results/sd-qa.json \
--judge-model gpt-4o-miniSupported local GT schemas:
- Hugging Face flat schema:
id,audio,transcriptionorquestion,answer, optionalcapability - Duplex-style conversation JSONL:
id,conversations, optionaltranscription,instruction_id_list,kwargs
The output JSON contains:
summary: benchmark, subset, evaluator, input paths, and sample countaggregate_metrics: mean metric values over matched samplesresults: per-sample metrics and payloadswarnings: missing predictions, optional PANDA availability, or judge issues
- The scoring prompts and parsers are adapted from the evaluation logic used for these three public benchmarks.
- Raon-Eval references and adapts evaluation code from VoiceBench, which is released under the Apache License 2.0.
- Only the KVoiceBench, KOpenAudioBench, and KMMAU evaluation paths are included.
- KEval does not include internal paths, key files, or private credentials. API keys are read only from environment variables.
- To reproduce the paper's open-ended LLM judge results, use GPT-5.4 as the judge model and scale the judge's 1-5 point output to a 100-point scale.
- The default OpenAI provider model is
gpt-4o-mini; pass--judge-modelto reproduce a specific judge configuration.
Raon-Eval is released under the Apache License 2.0. See LICENSE. Dataset licenses are separate from this code repository's license; refer to each Hugging Face dataset page for dataset-specific license terms.
@misc{kim2026kvoicebenchkopenaudiobenchkmmau,
title={KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs},
author={Kim, Haechan and Chung, Seungjun and Park, Inkyu and Lee, Jihoo and Lee, Jonghyun},
year={2026},
eprint={2605.27984},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/pdf/2605.27984v1}
}