Raon-Eval

Public evaluation code for the benchmarks introduced in KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs. This repository supports evaluation for all three benchmarks: KVoiceBench, KOpenAudioBench, and KMMAU.

The evaluator expects model responses in JSONL format and computes the benchmark metrics. Model inference is intentionally separate: use each dataset's audio field with your model, write predictions, then run keval evaluate.

KEval evaluates Korean benchmarks only. The LLM judge prompts are written in Korean, and there is no English prompt or language switch.

Install

cd Raon-Eval
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -e .

Optional PANDA scoring for KVoiceBench sd-qa:

pip install -e ".[panda]"

For LLM judge based evaluators, set an API key in your shell. Do not put real keys in the repository.

export OPENAI_API_KEY="YOUR_OPENAI_API_KEY"
# or
export OPENROUTER_API_KEY="YOUR_OPENROUTER_API_KEY"

Prediction Format

Predictions are JSONL, one object per sample:

{"id": "age-0000", "prediction": "청년층 (20~30대)"}
{"id": "advbench-test-00000", "prediction": "죄송하지만 그 요청은 도와드릴 수 없습니다."}

Accepted response keys are prediction, response, pred_answer, or answer. The id must match the Hugging Face dataset sample id.
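
Before running the evaluator, it can help to sanity-check a predictions file against the format described above. The following is a minimal sketch, not part of KEval: the function name check_prediction_lines is hypothetical, and treating duplicate ids as a problem is a choice made in this sketch rather than a documented rule.

```python
import json

# Response keys listed as accepted in the text above.
RESPONSE_KEYS = ("prediction", "response", "pred_answer", "answer")


def check_prediction_lines(lines):
    """Return a list of (line_number, problem) pairs for malformed rows."""
    problems = []
    seen_ids = set()
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        if not isinstance(row, dict):
            problems.append((number, "not a JSON object"))
            continue
        if "id" not in row:
            problems.append((number, "missing id"))
        elif str(row["id"]) in seen_ids:
            # Assumption of this sketch: each sample id should appear once.
            problems.append((number, "duplicate id"))
        else:
            seen_ids.add(str(row["id"]))
        if not any(key in row for key in RESPONSE_KEYS):
            problems.append((number, "missing response key"))
    return problems


sample = [
    '{"id": "age-0000", "prediction": "청년층 (20~30대)"}',
    '{"id": "age-0001", "text": "..."}',
]
print(check_prediction_lines(sample))  # [(2, 'missing response key')]
```

For a real file, pass the open file handle, for example check_prediction_lines(open("predictions/kmmau-age.jsonl", encoding="utf-8")). This check does not confirm that ids match the Hugging Face dataset; that still happens at evaluation time.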

Subsets

Inspect supported subsets and default evaluators:

keval list-subsets --benchmark kvoicebench
keval list-subsets --benchmark kopenaudiobench
keval list-subsets --benchmark kmmau

Default mapping:

| Benchmark | Subset | Evaluator | Primary Metric |
|---|---|---|---|
| KVoiceBench | alpacaeval_full-test, commoneval-test, wildvoice-test | open | judge_open_ended_QA |
| KVoiceBench | sd-qa | qa | judge_reference_based_QA |
| KVoiceBench | mmsu, openbookqa-test | mcq_ko | exact_match |
| KVoiceBench | bbh-test | bbh_ko | accuracy |
| KVoiceBench | ifeval-test | ifeval_ko | prompt_accuracy, instruction_accuracy |
| KVoiceBench | advbench-test | harm_ko | refusal_rate |
| KOpenAudioBench | alpacaeval_full-test | open | judge_open_ended_QA |
| KOpenAudioBench | llamaqa | llama_q | accuracy |
| KOpenAudioBench | trivia_qa | trivia_qa | accuracy |
| KOpenAudioBench | web_questions | web_q | accuracy |
| KMMAU | age, gender, number_of_speakers, fact_extraction, general_counting, role_profession, topic_summary, word_frequency_counting, word_order | mmau_ko | accuracy |

Run Evaluation

Safety and instruction-following evaluators do not call an LLM judge:

keval evaluate \
  --benchmark kvoicebench \
  --subset advbench-test \
  --predictions predictions/kvoicebench-advbench.jsonl \
  --output results/kvoicebench-advbench.json

LLM judge evaluators require an API key:

export OPENAI_API_KEY="YOUR_OPENAI_API_KEY"

keval evaluate \
  --benchmark kmmau \
  --subset age \
  --predictions predictions/kmmau-age.jsonl \
  --output results/kmmau-age.json \
  --judge-provider openai \
  --judge-model gpt-4o-mini

OpenRouter is also supported:

export OPENROUTER_API_KEY="YOUR_OPENROUTER_API_KEY"

keval evaluate \
  --benchmark kopenaudiobench \
  --subset trivia_qa \
  --predictions predictions/kopenaudio-trivia_qa.jsonl \
  --output results/kopenaudio-trivia_qa.json \
  --judge-provider openrouter \
  --judge-model deepseek/deepseek-chat

Short QA evaluators use the Korean direct judge by default, matching the active shortqa_ko path in the source evaluation code. A Korean 2-stage extract/compare fallback is also available:

keval evaluate \
  --benchmark kopenaudiobench \
  --subset web_questions \
  --predictions predictions/kopenaudio-web_questions.jsonl \
  --output results/kopenaudio-web_questions.json \
  --judge-model gpt-4o-mini \
  --shortqa-judge 2stage

Use --limit N for smoke tests and --max-workers N to control parallel judge calls.
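
When smoke-testing several subsets, the command line can be assembled programmatically. The sketch below only builds and prints the commands; it uses the flags shown in this README, while the helper name build_keval_command and the predictions/ and results/ file naming pattern are assumptions of the example.

```python
import shlex


def build_keval_command(benchmark, subset, limit=None, max_workers=None):
    """Build a `keval evaluate` argument list for one subset."""
    # Assumption: prediction and result files follow this naming pattern.
    command = [
        "keval", "evaluate",
        "--benchmark", benchmark,
        "--subset", subset,
        "--predictions", f"predictions/{benchmark}-{subset}.jsonl",
        "--output", f"results/{benchmark}-{subset}.json",
    ]
    if limit is not None:
        command += ["--limit", str(limit)]
    if max_workers is not None:
        command += ["--max-workers", str(max_workers)]
    return command


# Print one shell-ready command per subset instead of executing it.
for subset in ("age", "gender"):
    print(shlex.join(build_keval_command("kmmau", subset, limit=5, max_workers=4)))
```

Judge-based subsets additionally need an API key in the environment and, where the defaults are not wanted, the --judge-provider and --judge-model flags shown earlier.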

Generating Predictions

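Example skeleton (run_your_model is a placeholder for your own inference function):

import json
import os

from datasets import load_dataset

# Load one KMMAU subset from the Hugging Face Hub.
ds = load_dataset("KRAFTON/KMMAU", "age", split="test")

# Create the output directory so open() does not fail on a fresh checkout.
os.makedirs("predictions", exist_ok=True)

with open("predictions/kmmau-age.jsonl", "w", encoding="utf-8") as f:
    for row in ds:
        audio = row["audio"]
        question = row["question"]
        prediction = run_your_model(audio=audio, question=question)
        # ensure_ascii=False keeps Korean text readable in the JSONL file.
        f.write(json.dumps({"id": row["id"], "prediction": prediction}, ensure_ascii=False) + "\n")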

For KVoiceBench and KOpenAudioBench, the user question is spoken in audio. The text transcription is included for evaluation and debugging, but the model should be evaluated on the audio input.

Local Ground Truth JSONL

By default, KEval loads the public Hugging Face dataset. To evaluate against a local JSONL file with the same fields, pass --gt-jsonl:

keval evaluate \
  --benchmark kvoicebench \
  --subset sd-qa \
  --gt-jsonl data/sd-qa.jsonl \
  --predictions predictions/sd-qa.jsonl \
  --output results/sd-qa.json \
  --judge-model gpt-4o-mini

Supported local GT schemas:

  • Hugging Face flat schema: id, audio, transcription or question, answer, optional capability
  • Duplex-style conversation JSONL: id, conversations, optional transcription, instruction_id_list, kwargs
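
A local ground-truth file in the flat schema can be checked for the listed fields before it is passed to --gt-jsonl. This is a minimal sketch based only on the field names above: the helper missing_flat_fields is hypothetical, the sample row is made up, and how KEval expects the audio field to be encoded in a local file is not stated here, so the path-like value is purely illustrative.

```python
import json


def missing_flat_fields(row):
    """Return the required flat-schema fields that a ground-truth row lacks."""
    missing = [name for name in ("id", "audio", "answer") if name not in row]
    # The schema lists "transcription or question", so either one is enough.
    if "transcription" not in row and "question" not in row:
        missing.append("transcription or question")
    return missing


# Made-up row for illustration; it has no "answer" field.
row = json.loads('{"id": "sample-0000", "audio": "clips/0000.wav", "question": "..."}')
print(missing_flat_fields(row))  # ['answer']
```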

Output

The output JSON contains:

  • summary: benchmark, subset, evaluator, input paths, and sample count
  • aggregate_metrics: mean metric values over matched samples
  • results: per-sample metrics and payloads
  • warnings: missing predictions, optional PANDA availability, or judge issues
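
The output file can be inspected with a few lines of Python. The sketch below assumes that aggregate_metrics is a JSON object mapping metric names to numbers and that warnings is a list; only the four top-level keys come from the description above, and the inline values are made up so the example runs without a result file.

```python
import json


def summarize_result(result):
    """Return printable lines for the aggregate metrics and warning count."""
    lines = []
    # Assumption: aggregate_metrics maps metric name to a numeric mean.
    for name, value in sorted(result.get("aggregate_metrics", {}).items()):
        lines.append(f"{name}: {value}")
    # Assumption: warnings is a list of messages.
    lines.append(f"warnings: {len(result.get('warnings', []))}")
    return lines


# Stand-in for a real file such as results/kmmau-age.json; values are invented.
example = json.loads('{"aggregate_metrics": {"accuracy": 0.5}, "warnings": []}')
print("\n".join(summarize_result(example)))
```

For a real run, replace the inline example with json.load on the path given to --output.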

Notes

  • The scoring prompts and parsers are adapted from the evaluation logic used for these three public benchmarks.
  • Raon-Eval references and adapts evaluation code from VoiceBench, which is released under the Apache License 2.0.
  • Only the KVoiceBench, KOpenAudioBench, and KMMAU evaluation paths are included.
  • KEval does not include internal paths, key files, or private credentials. API keys are read only from environment variables.
  • To reproduce the paper's open-ended LLM judge results, use GPT-5.4 as the judge model and scale the judge's 1-5 point output to a 100-point scale.
  • The default OpenAI provider model is gpt-4o-mini; pass --judge-model to reproduce a specific judge configuration.
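
The note on scaling the judge's 1-5 output to a 100-point scale does not say which mapping is used. Two common conventions are sketched below for comparison; both function names are hypothetical, and the paper or the evaluation source should be consulted to confirm which one applies before comparing numbers.

```python
def scale_by_twenty(score):
    """Map the 1..5 range onto 20..100 by multiplying by 20."""
    return score * 20.0


def scale_min_max(score):
    """Map the 1..5 range onto 0..100 with min-max normalization."""
    return (score - 1.0) / 4.0 * 100.0


# Illustrative mean judge score, not a reported result.
mean_score = 3.5
print(scale_by_twenty(mean_score), scale_min_max(mean_score))  # 70.0 62.5
```

The two mappings agree only at a score of 5, so the choice matters when reproducing reported numbers.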

License

Raon-Eval is released under the Apache License 2.0. See LICENSE. Dataset licenses are separate from this code repository's license; refer to each Hugging Face dataset page for dataset-specific license terms.

Citation

@misc{kim2026kvoicebenchkopenaudiobenchkmmau,
  title={KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs},
  author={Kim, Haechan and Chung, Seungjun and Park, Inkyu and Lee, Jihoo and Lee, Jonghyun},
  year={2026},
  eprint={2605.27984},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/pdf/2605.27984v1}
}
