This repository contains the official implementation of the paper "Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets".
Hanna Yukhymenko | Anton Alexandrov | Martin Vechev
The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks. Existing resources often suffer from semantic drift and context loss, which can lead to misleading performance metrics. In this work, we present a fully automated framework designed to address these challenges by enabling scalable, high-quality translation of datasets and benchmarks. We demonstrate that adapting test-time compute scaling strategies, specifically Universal Self-Improvement (USI) and our proposed multi-round ranking method, T-RANK, allows for significantly higher quality outputs compared to traditional pipelines. By effectively applying these methods, our framework ensures that benchmarks preserve their original task structure and linguistic nuances during localization. We apply this approach to translate popular benchmarks and datasets into eight Eastern and Southern European languages. Evaluations using both reference-based metrics and LLM-as-a-judge show that our translations surpass existing resources, resulting in more accurate downstream model assessment. We release both the framework and the improved benchmarks to facilitate robust and reproducible multilingual AI development.
We provide a novel automated translation framework supporting four methods across various model types, including open-weight models. The framework facilitates machine translation of datasets and benchmarks with minimal manual supervision and maximum configurability.
Supported Languages: Bulgarian, Estonian, Greek, Lithuanian, Romanian, Slovak, Turkish, Ukrainian
Supported Benchmarks: MMLU, Hellaswag, ARC, Winogrande
Supported Datasets: FLORES, WMT24++
Supported Model Providers: OpenAI, Google Gemini, TogetherAI, OpenRouter, Local vLLM
Our setup makes it easy to add new languages and benchmarks to the task configuration, while the diversity of translation methods gives you the flexibility to find the right fit for your translation task.
- Python 3.9+
- CUDA (optional, for local vLLM inference)
- Clone the repository:

  ```bash
  git clone https://github.com/insait-institute/ritranslation.git
  cd ritranslation
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure API credentials by creating a `credentials.py` file in the root folder:

  ```python
  open_api_key = '<your_OpenAI_API_key>'
  hf_token = '<your_Hugging_Face_token>'
  google_api_key = '<your_Gemini_API_key>'
  together_api_key = '<your_TogetherAI_API_key>'
  openrouter_api_key = '<your_OpenRouter_API_key>'
  ```

Note: Leave unused API keys as empty strings (`''`).
Our framework incorporates four translation methods with different quality/cost tradeoffs:
**Self-Correction (SC).** A lightweight baseline: classic one-prompt translation with optional self-correction. The model translates the text, then optionally evaluates and corrects the result in a new chat without history.
Source Text → LLM Translation → (Optional) Self-Correction → Final Translation
Best for: Translating large volumes of text into high-resource languages where the model's translation capabilities are already strong.
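A minimal sketch of this flow using the OpenAI Python SDK; the prompts, helper names, and model choice are illustrative, not the repository's actual implementation:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chat(prompt: str, temperature: float = 0.5) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

def self_correct_translate(text: str, lang: str) -> str:
    draft = chat(f"Translate the following text into {lang}:\n\n{text}")
    # The correction call starts a fresh chat: the model sees only the
    # source and the draft, not the earlier conversation history.
    return chat(
        f"Review this {lang} translation of the source text and output a "
        f"corrected version (or the draft unchanged if it is already good).\n\n"
        f"Source: {text}\n\nDraft: {draft}"
    )
```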
**Best-of-N (BoN).** Samples N translation candidates at a higher temperature (0.7) for diversity, then prompts the LLM to score each candidate from 1 to 10 against specified criteria, selecting the highest-scored translation.
Source Text → N Candidate Translations → LLM Scoring → Highest Score Selected
Best for: Cost-effective translation when a language-agnostic approach is needed, or when a judge model trained for quality scoring is available.
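A rough sketch of this loop under the same assumptions (OpenAI SDK, hypothetical prompts); the repository's actual scoring prompt and parsing will differ:

```python
import re
from openai import OpenAI

client = OpenAI()

def chat(prompt: str, temperature: float) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

def best_of_n(text: str, lang: str, n: int = 5) -> str:
    # Higher sampling temperature (0.7) encourages diverse candidates.
    candidates = [
        chat(f"Translate into {lang}:\n\n{text}", temperature=0.7)
        for _ in range(n)
    ]

    def score(candidate: str) -> int:
        reply = chat(
            f"Rate this {lang} translation of the source from 1 to 10 for "
            f"accuracy and fluency. Reply with a single integer.\n\n"
            f"Source: {text}\n\nTranslation: {candidate}",
            temperature=0.1,
        )
        match = re.search(r"\d+", reply)
        return int(match.group()) if match else 0

    return max(candidates, key=score)
```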
**Universal Self-Improvement (USI).** Building on Universal Self-Consistency and Fusion-of-N, this method samples N candidate translations at a higher temperature, then presents them to an evaluator LLM with instructions to combine the candidates into the best version according to specified criteria. Requires only N + 1 model calls per entry.
Best for: Short and simple dataset translation; cost-efficient for lower-resource languages.
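A sketch of the fusion step, again with placeholder prompts and the OpenAI SDK; note the N sampling calls followed by a single fusion call:

```python
from openai import OpenAI

client = OpenAI()

def chat(prompt: str, temperature: float) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

def usi_translate(text: str, lang: str, n: int = 5) -> str:
    # N sampling calls + 1 fusion call = N + 1 calls per entry.
    candidates = [
        chat(f"Translate into {lang}:\n\n{text}", temperature=0.7)
        for _ in range(n)
    ]
    listing = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    return chat(
        f"Here are {n} candidate {lang} translations of the same source. "
        f"Combine their strengths into one best translation and output only "
        f"that translation.\n\nSource: {text}\n\nCandidates:\n{listing}",
        temperature=0.1,
    )
```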
**T-RANK.** Our proposed method employs multi-prompt candidate sampling and multi-round competitive ranking to improve error detection. Candidates are presented in different positional orders across rounds to reduce positional bias. After ranking, the judge model corrects and refines the selected translation candidate. Requires 2N + 1 model calls per entry.
Best for: Benchmark translation with complex question structures and specific domain terminology; highest quality when cost is not a primary concern.
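A schematic sketch of the idea only, not the paper's exact algorithm: for brevity a single model plays both translator and judge (the framework configures a separate `judge_model`), and the prompt templates are placeholders:

```python
import random
from collections import Counter
from openai import OpenAI

client = OpenAI()

def chat(prompt: str, temperature: float) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

# Placeholder prompt variants; the repository keeps per-language templates
# in src/benchmark/prompts/mq_base_translation_prompts/<language>/.
PROMPTS = [
    "Translate into {lang}, preserving formatting:\n\n{text}",
    "You are a professional translator. Render this text in {lang}:\n\n{text}",
]

def t_rank(text: str, lang: str) -> str:
    # 1) Multi-prompt sampling: one candidate per prompt (N calls).
    cands = [chat(p.format(lang=lang, text=text), 0.7) for p in PROMPTS]
    n = len(cands)
    # 2) N ranking rounds; shuffling the presentation order each round
    #    counters positional bias (N calls).
    votes = Counter()
    for _ in range(n):
        order = random.sample(range(n), n)
        listing = "\n".join(f"{i + 1}. {cands[j]}" for i, j in enumerate(order))
        reply = chat(
            f"Pick the best {lang} translation of the source. Reply with its "
            f"number only.\n\nSource: {text}\n\n{listing}",
            0.1,
        )
        digits = "".join(ch for ch in reply if ch.isdigit())
        if digits and 1 <= int(digits) <= n:
            votes[order[int(digits) - 1]] += 1
    winner = cands[votes.most_common(1)[0][0]] if votes else cands[0]
    # 3) One final refinement call on the winner (2N + 1 calls total).
    return chat(
        f"Correct any remaining errors in this {lang} translation and output "
        f"only the final text.\n\nSource: {text}\n\nTranslation: {winner}",
        0.1,
    )
```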
```bash
# Translate a dataset
python run.py --config_path configs/dataset/WMT/dataset_wmt_uk.yaml

# Translate a benchmark
python run.py --config_path configs/benchmark/MMLU/bench_mmlu_bg.yaml
```

All translation jobs are configured via YAML files located in the `configs/` directory.

Example benchmark config:
task: "BENCHMARK"
output_dir: "src/benchmark/data"
translation_model:
name: "gpt-4o-mini-2024-07-18"
provider: "openai"
judge_model:
name: "gpt-4o-mini-2024-07-18"
provider: "openai"
task_config:
benchmark:
name: "cais/mmlu"
subset: ["all"]
split: ["test"]
n_entries: null
target_language: "Ukrainian"
method: "TRANK" # SC, USI, BoN, TRANK
temperature_translator: 0.5
temperature_judge: 0.1
max_workers: true
num_workers: 4
n_samples: 5
question_fields: ["question"]
answer_fields: ["choices"]
agent_check: false
few_shot: false
multi_prompt: falsetask: "DATASET"
output_dir: "src/dataset/data/flores/bg"
translation_model:
name: "gpt-4o-mini-2024-07-18"
provider: "openai"
task_config:
dataset:
name: "gsarti/flores_101"
subset: ["eng"]
split: ["devtest"]
target_language: "Bulgarian"
method: "TRANK"
fields: ["sentence"]
temperature_translator: 0.5
n_samples: 5| Provider | Example Model | Config |
|---|---|---|
| OpenAI | `gpt-4o-mini-2024-07-18` | `provider: "openai"` |
| Google Gemini | `gemini-2.0-flash` | `provider: "google"` |
| TogetherAI | `meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo` | `provider: "together"` |
| OpenRouter | `anthropic/claude-3-sonnet` | `provider: "openrouter"` |
| Local vLLM | Custom model | `provider: "vllm"` |
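Several of these providers expose OpenAI-compatible endpoints, so a model factory can often dispatch on the provider string just by swapping the base URL. A hypothetical sketch, not the repository's actual `model_factory.py` (Gemini would need its own SDK):

```python
from openai import OpenAI

# OpenAI-compatible endpoints; the URLs are the providers' documented defaults.
BASE_URLS = {
    "openai": None,  # the SDK's default, api.openai.com
    "together": "https://api.together.xyz/v1",
    "openrouter": "https://openrouter.ai/api/v1",
    "vllm": "http://localhost:8000/v1",  # a locally served vLLM model
}

def make_client(provider: str, api_key: str) -> OpenAI:
    return OpenAI(api_key=api_key, base_url=BASE_URLS[provider])
```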
TODO: add Cohere API👀
```
├── run.py                      # Main entry point
├── credentials.py              # API credentials (create this)
├── requirements.txt            # Dependencies
│
├── configs/                    # Configuration files
│   ├── benchmark/              # Benchmark configs (ARC, Hellaswag, MMLU, Winogrande)
│   └── dataset/                # Dataset configs (FLORES, WMT)
│
└── src/
    ├── initialization.py       # Config parsing (Pydantic validation)
    ├── translate_benchmark.py  # Benchmark translation pipeline
    ├── translate_dataset.py    # Dataset translation pipeline
    │
    ├── benchmark/
    │   ├── methods.py          # SC, USI, BoN, T-RANK implementations
    │   ├── model_factory.py    # Multi-provider LLM interface
    │   ├── utils.py            # Prompt loading, text processing
    │   ├── save_to_hf.py       # HuggingFace Hub upload
    │   ├── prompts/            # Prompt templates
    │   └── eval_mmlu/          # Evaluation scripts (COMET, LLM-judge)
    │
    ├── dataset/
    │   ├── methods.py          # Dataset translation methods
    │   ├── model_factory.py    # LLM interface
    │   ├── utils.py            # Utilities
    │   └── prompts/            # Prompt templates
    │
    └── common_utils/
        └── serve_local_vllm.sh # Local vLLM server script
```
The framework includes multiple evaluation methods:
```bash
python src/benchmark/eval_mmlu/evaluate_translations_comet.py      # COMET (reference-based)
python src/benchmark/eval_mmlu/evaluate_mmlu_comet_qe.py           # COMET-QE (reference-free)
python src/benchmark/eval_mmlu/evaluate_translations_llm_judge.py  # LLM-as-a-judge
python src/benchmark/eval_mmlu/manual_evaluation.py                # Manual evaluation
```

Our methods demonstrate substantial improvements on WMT24++ and FLORES benchmarks:
| Method | WMT24++ | FLORES |
|---|---|---|
| Baseline | 0.827 | 0.937 |
| SC (with check) | 0.821 | 0.937 |
| Best-of-N (n=5) | 0.843 | 0.943 |
| USI (n=5) | 0.843 | 0.945 |
| T-RANK (p=5) | 0.845 | 0.940 |
COMET reference-based scores for EN→UK translation with GPT-4o-mini. Here n denotes the number of candidates sampled from the same prompt, and p denotes the number of different prompts, each used to sample one candidate.
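The scores above are COMET scores. A minimal sketch of computing such scores with the `unbabel-comet` package (the model choice and data layout follow the library's documented API, not necessarily the repository's evaluation scripts):

```python
from comet import download_model, load_from_checkpoint

# wmt22-comet-da is a standard reference-based COMET model.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

data = [
    {
        "src": "The cat sat on the mat.",  # English source
        "mt": "Кіт сидів на килимку.",     # system translation
        "ref": "Кіт сидів на килимку.",    # human reference
    },
]
output = model.predict(data, batch_size=8, gpus=0)  # gpus=1 if CUDA is available
print(output.system_score)  # corpus-level score, roughly in [0, 1]
```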
LLM-as-a-judge evaluation shows our T-RANK translations significantly outperform Global-MMLU:
| Translation | Wins | Draws | Losses |
|---|---|---|---|
| Global-MMLU-UK | 2016 | 3276 | 8750 |
| T-RANK (ours) | 8750 | 3276 | 2016 |
- Create language-specific prompts in `src/benchmark/prompts/mq_base_translation_prompts/<language>/`
- Create few-shot examples (optional) in `src/benchmark/prompts/few_shot_*.txt`
- Create configuration files in `configs/` by copying an existing config and updating `target_language` (see the sketch below)
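For example, a hypothetical Greek MMLU config (the new filename is made up) would copy an existing one and change only the language field:

```yaml
# configs/benchmark/MMLU/bench_mmlu_el.yaml, copied from bench_mmlu_bg.yaml;
# everything identical except:
target_language: "Greek"
```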
This project is released under the MIT License.
This work was done during a Master's thesis at INSAIT, Sofia University "St. Kliment Ohridski".
If you find this work useful, please cite:
```bibtex
@article{yukhymenko2026recovered,
  title={Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets},
  author={Yukhymenko, Hanna and Alexandrov, Anton and Vechev, Martin},
  journal={arXiv preprint arXiv:2602.22207},
  year={2026}
}
```

For questions or issues, please open an issue on GitHub or contact: hanna.yukhymenko@insait.ai

