Raon-Eval

Public evaluation code for the benchmarks introduced in "KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs". This repository supports evaluation for all three benchmarks: KVoiceBench, KOpenAudioBench, and KMMAU.

The evaluator expects model responses in JSONL format and computes the benchmark metrics. Model inference is intentionally separate: use each dataset's audio field with your model, write predictions, then run keval evaluate.

KEval, the keval command-line tool provided by this repository, evaluates Korean benchmarks only. The LLM judge prompts are written in Korean, and there is no English prompt or language switch.

Install

cd Raon-Eval
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -e .

Optional PANDA scoring for KVoiceBench sd-qa:

pip install -e ".[panda]"

For evaluators that use an LLM judge, set an API key in your shell. Do not commit real keys to the repository.

export OPENAI_API_KEY="YOUR_OPENAI_API_KEY"
# or
export OPENROUTER_API_KEY="YOUR_OPENROUTER_API_KEY"

Prediction Format

Predictions are JSONL, one object per sample:

{"id": "age-0000", "prediction": "청년층 (20~30대)"}
{"id": "advbench-test-00000", "prediction": "죄송하지만 그 요청은 도와드릴 수 없습니다."}

Accepted response keys are prediction, response, pred_answer, or answer. The id must match the Hugging Face dataset sample id.
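
Before running an evaluation, it can help to sanity-check a predictions file against the format above. The following is a minimal sketch and not part of KEval: the helper name check_predictions is hypothetical, and it only checks what this README states (one JSON object per line, an id, and at least one accepted response key). It does not verify that ids exist in the Hugging Face dataset.

```python
import json

# Response keys this README lists as accepted.
ACCEPTED_KEYS = ("prediction", "response", "pred_answer", "answer")


def check_predictions(lines):
    """Return a list of (line_number, problem) tuples for malformed rows.

    `lines` is any iterable of strings, for example an open JSONL file.
    Hypothetical helper for illustration; KEval may validate differently.
    """
    problems = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(row, dict):
            problems.append((lineno, "row is not a JSON object"))
            continue
        if "id" not in row:
            problems.append((lineno, "missing 'id'"))
        if not any(key in row for key in ACCEPTED_KEYS):
            problems.append((lineno, "no accepted response key"))
    return problems
```

To use it on a file, pass the open file handle, for example check_predictions(open("predictions/kmmau-age.jsonl", encoding="utf-8")); an empty list means no problems were found by these checks.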

Subsets

Inspect supported subsets and default evaluators:

keval list-subsets --benchmark kvoicebench
keval list-subsets --benchmark kopenaudiobench
keval list-subsets --benchmark kmmau

Default mapping:

Benchmark	Subset	Evaluator	Primary Metric
KVoiceBench	`alpacaeval_full-test`, `commoneval-test`, `wildvoice-test`	`open`	`judge_open_ended_QA`
KVoiceBench	`sd-qa`	`qa`	`judge_reference_based_QA`
KVoiceBench	`mmsu`, `openbookqa-test`	`mcq_ko`	`exact_match`
KVoiceBench	`bbh-test`	`bbh_ko`	`accuracy`
KVoiceBench	`ifeval-test`	`ifeval_ko`	`prompt_accuracy`, `instruction_accuracy`
KVoiceBench	`advbench-test`	`harm_ko`	`refusal_rate`
KOpenAudioBench	`alpacaeval_full-test`	`open`	`judge_open_ended_QA`
KOpenAudioBench	`llamaqa`	`llama_q`	`accuracy`
KOpenAudioBench	`trivia_qa`	`trivia_qa`	`accuracy`
KOpenAudioBench	`web_questions`	`web_q`	`accuracy`
KMMAU	`age`, `gender`, `number_of_speakers`, `fact_extraction`, `general_counting`, `role_profession`, `topic_summary`, `word_frequency_counting`, `word_order`	`mmau_ko`	`accuracy`

Run Evaluation

The safety (harm_ko) and instruction-following (ifeval_ko) evaluators do not call an LLM judge, so no API key is needed:

keval evaluate \
  --benchmark kvoicebench \
  --subset advbench-test \
  --predictions predictions/kvoicebench-advbench.jsonl \
  --output results/kvoicebench-advbench.json

LLM judge evaluators require an API key:

export OPENAI_API_KEY="YOUR_OPENAI_API_KEY"

keval evaluate \
  --benchmark kmmau \
  --subset age \
  --predictions predictions/kmmau-age.jsonl \
  --output results/kmmau-age.json \
  --judge-provider openai \
  --judge-model gpt-4o-mini

OpenRouter is also supported:

export OPENROUTER_API_KEY="YOUR_OPENROUTER_API_KEY"

keval evaluate \
  --benchmark kopenaudiobench \
  --subset trivia_qa \
  --predictions predictions/kopenaudio-trivia_qa.jsonl \
  --output results/kopenaudio-trivia_qa.json \
  --judge-provider openrouter \
  --judge-model deepseek/deepseek-chat

Short QA evaluators use the Korean direct judge by default, matching the active shortqa_ko path in the source evaluation code. A Korean 2-stage extract/compare fallback is also available:

keval evaluate \
  --benchmark kopenaudiobench \
  --subset web_questions \
  --predictions predictions/kopenaudio-web_questions.jsonl \
  --output results/kopenaudio-web_questions.json \
  --judge-model gpt-4o-mini \
  --shortqa-judge 2stage

Use --limit N for smoke tests and --max-workers N to control parallel judge calls.
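
When evaluating several subsets in a row, the commands above can be driven from Python. This is a rough sketch under stated assumptions: it uses only flags shown in this README, the function names are hypothetical, and the predictions/<benchmark>-<subset>.jsonl naming pattern is an illustrative convention rather than something KEval requires.

```python
import subprocess


def build_eval_command(benchmark, subset, predictions, output,
                       judge_provider=None, judge_model=None, limit=None):
    """Assemble a `keval evaluate` argument list from flags shown in this README."""
    cmd = [
        "keval", "evaluate",
        "--benchmark", benchmark,
        "--subset", subset,
        "--predictions", predictions,
        "--output", output,
    ]
    # Judge flags are optional; omit them for evaluators that need no judge.
    if judge_provider is not None:
        cmd += ["--judge-provider", judge_provider]
    if judge_model is not None:
        cmd += ["--judge-model", judge_model]
    if limit is not None:
        cmd += ["--limit", str(limit)]  # handy for smoke tests
    return cmd


def run_subsets(benchmark, subsets, **judge_options):
    """Run `keval evaluate` once per subset; raises if any run fails."""
    for subset in subsets:
        cmd = build_eval_command(
            benchmark,
            subset,
            # Assumed file naming, chosen only for this example.
            predictions=f"predictions/{benchmark}-{subset}.jsonl",
            output=f"results/{benchmark}-{subset}.json",
            **judge_options,
        )
        subprocess.run(cmd, check=True)
```

For example, run_subsets("kmmau", ["age", "gender"], judge_provider="openai", judge_model="gpt-4o-mini", limit=5) would run a five-sample smoke test on two KMMAU subsets, provided keval is installed and the API key is set.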

Generating Predictions

Example skeleton:

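import json
from pathlib import Path

from datasets import load_dataset


def run_your_model(audio, question):
    """Placeholder: replace with your model's inference call."""
    raise NotImplementedError


ds = load_dataset("KRAFTON/KMMAU", "age", split="test")

# Create the output directory if it does not exist yet.
out_path = Path("predictions/kmmau-age.jsonl")
out_path.parent.mkdir(parents=True, exist_ok=True)

with out_path.open("w", encoding="utf-8") as f:
    for row in ds:
        audio = row["audio"]
        question = row["question"]
        prediction = run_your_model(audio=audio, question=question)
        # ensure_ascii=False keeps Korean text readable in the JSONL file.
        f.write(json.dumps({"id": row["id"], "prediction": prediction}, ensure_ascii=False) + "\n")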

For KVoiceBench and KOpenAudioBench, the user question is spoken in audio. The text transcription is included for evaluation and debugging, but the model should be evaluated on the audio input.

Local Ground Truth JSONL

By default, KEval loads the public Hugging Face dataset. To evaluate against a local JSONL file with the same fields, pass --gt-jsonl:

keval evaluate \
  --benchmark kvoicebench \
  --subset sd-qa \
  --gt-jsonl data/sd-qa.jsonl \
  --predictions predictions/sd-qa.jsonl \
  --output results/sd-qa.json \
  --judge-model gpt-4o-mini

Supported local GT schemas:

Hugging Face flat schema: id, audio, transcription or question, answer, optional capability
Duplex-style conversation JSONL: id, conversations, optional transcription, instruction_id_list, kwargs
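
To illustrate the two schemas, here is a small sketch that guesses which one a local ground-truth row follows, based only on the field names listed above. The function name and the detection order are assumptions made for this example; KEval's own loader may decide differently.

```python
def detect_gt_schema(row):
    """Guess the local GT schema of one row from its field names.

    Returns "conversation", "flat", or None when the row fits neither
    description in this README. Hypothetical helper, not KEval's loader.
    """
    if "id" not in row:
        return None
    # Conversation rows may also carry a transcription, so check them first.
    if "conversations" in row:
        return "conversation"
    has_question = "transcription" in row or "question" in row
    if has_question and "audio" in row and "answer" in row:
        return "flat"
    return None
```

Running this over every row of a local file before passing it to --gt-jsonl is one way to spot rows that match neither schema.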

Output

The output JSON contains:

summary: benchmark, subset, evaluator, input paths, and sample count
aggregate_metrics: mean metric values over matched samples
results: per-sample metrics and payloads
warnings: missing predictions, optional PANDA availability, or judge issues
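
A result file with these four top-level keys can be inspected with a few lines of Python. The sketch below is illustrative only: it assumes summary and aggregate_metrics are JSON objects, that summary has benchmark and subset entries, and that warnings is a list, none of which is spelled out beyond the descriptions above. The function names are hypothetical.

```python
import json


def summarize_result(result):
    """Return a short text summary of one KEval output dict."""
    lines = []
    summary = result.get("summary", {})
    lines.append(f"{summary.get('benchmark')} / {summary.get('subset')}")
    # One line per aggregate metric, sorted by name for stable output.
    for name, value in sorted(result.get("aggregate_metrics", {}).items()):
        lines.append(f"  {name}: {value}")
    warnings = result.get("warnings") or []
    lines.append(f"  warnings: {len(warnings)}")
    return "\n".join(lines)


def load_and_summarize(path):
    """Read an output JSON file written by `keval evaluate` and summarize it."""
    with open(path, encoding="utf-8") as f:
        return summarize_result(json.load(f))
```

For example, print(load_and_summarize("results/kmmau-age.json")) prints the benchmark, subset, aggregate metrics, and warning count for that run.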

Notes

The scoring prompts and parsers are adapted from the evaluation logic used for these three public benchmarks.
Raon-Eval references and adapts evaluation code from VoiceBench, which is released under the Apache License 2.0.
Only the KVoiceBench, KOpenAudioBench, and KMMAU evaluation paths are included.
KEval does not include internal paths, key files, or private credentials. API keys are read only from environment variables.
To reproduce the paper's open-ended LLM judge results, use GPT-5.4 as the judge model and scale the judge's 1-5 point output to a 100-point scale.
The default OpenAI provider model is gpt-4o-mini; pass --judge-model to reproduce a specific judge configuration.
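
The note on reproducing the paper's open-ended results asks for the judge's 1-5 score to be reported on a 100-point scale, but this README does not spell out the mapping. As a sketch only, the snippet below assumes the simplest linear rescaling (multiply by 20, so a score of 5 maps to 100); check the paper before relying on it. The function name is hypothetical.

```python
def scale_judge_score(score, factor=20.0):
    """Rescale a 1-5 judge score to a 100-point scale.

    Assumption: plain multiplication by 20. The README only says the
    1-5 output is scaled to 100 points, not how.
    """
    if not 1 <= score <= 5:
        raise ValueError(f"expected a score between 1 and 5, got {score!r}")
    return score * factor
```

Under this assumption, a mean judge score of 3.5 would be reported as 70.0.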

License

Raon-Eval is released under the Apache License 2.0. See LICENSE. Dataset licenses are separate from this code repository's license; refer to each Hugging Face dataset page for dataset-specific license terms.

Citation

@misc{kim2026kvoicebenchkopenaudiobenchkmmau,
  title={KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs},
  author={Kim, Haechan and Chung, Seungjun and Park, Inkyu and Lee, Jihoo and Lee, Jonghyun},
  year={2026},
  eprint={2605.27984},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/pdf/2605.27984v1}
}
