This repository contains the official implementation of the paper "Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets".
Hanna Yukhymenko | Anton Alexandrov | Martin Vechev
The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks. Existing resources often suffer from semantic drift and context loss, which can lead to misleading performance metrics. In this work, we present a fully automated framework designed to address these challenges by enabling scalable, high-quality translation of datasets and benchmarks. We demonstrate that adapting test-time compute scaling strategies, specifically Universal Self-Improvement (USI) and our proposed multi-round ranking method, T-RANK, allows for significantly higher quality outputs compared to traditional pipelines. By effectively applying these methods, our framework ensures that benchmarks preserve their original task structure and linguistic nuances during localization. We apply this approach to translate popular benchmarks and datasets into eight Eastern and Southern European languages. Evaluations using both reference-based metrics and LLM-as-a-judge show that our translations surpass existing resources, resulting in more accurate downstream model assessment. We release both the framework and the improved benchmarks to facilitate robust and reproducible multilingual AI development.
We provide a novel automated translation framework supporting four methods across various model types, including open-weight models. The framework facilitates machine translation of datasets and benchmarks with minimal manual supervision and maximum configurability.
Supported Languages: Bulgarian, Estonian, Greek, Lithuanian, Romanian, Slovak, Turkish, Ukrainian
Supported Benchmarks: MMLU, Hellaswag, ARC, Winogrande
Supported Datasets: FLORES, WMT24++
Supported Model Providers: OpenAI, Google Gemini, TogetherAI, OpenRouter, Local vLLM
Our setup makes it easy to add new languages and benchmarks to the task configuration, while the diversity of translation methods gives you the flexibility to find the right fit for your translation task.
- Python 3.9+
- CUDA (optional, for local vLLM inference)
- Clone the repository:

  ```bash
  git clone https://github.com/insait-institute/ritranslation.git
  cd ritranslation
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure API credentials by creating a `credentials.py` file in the root folder:

  ```python
  open_api_key = '<your_OpenAI_API_key>'
  hf_token = '<your_Hugging_Face_token>'
  google_api_key = '<your_Gemini_API_key>'
  together_api_key = '<your_TogetherAI_API_key>'
  openrouter_api_key = '<your_OpenRouter_API_key>'
  ```

Note: Leave unused API keys as empty strings (`''`).
Our framework incorporates four translation methods with different quality/cost tradeoffs:
**Self-Correction (SC).** A lightweight baseline: classic one-prompt translation with optional self-correction. The model translates the text, then optionally evaluates and corrects the result in a new chat without history.
Source Text → LLM Translation → (Optional) Self-Correction → Final Translation
Best for: Translating large volumes of text into high-resource languages where the model's translation capabilities are already strong.
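A minimal sketch of this flow using the OpenAI Python SDK; the prompts, helper names, and model choice are illustrative, not the repository's actual implementation:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chat(prompt: str, temperature: float = 0.5) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

def self_correct_translate(text: str, lang: str) -> str:
    draft = chat(f"Translate the following text into {lang}:\n\n{text}")
    # The correction call starts a fresh chat: the model sees only the
    # source and the draft, not the earlier conversation history.
    return chat(
        f"Review this {lang} translation of the source text and output a "
        f"corrected version (or the draft unchanged if it is already good).\n\n"
        f"Source: {text}\n\nDraft: {draft}"
    )
```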
**Best-of-N (BoN).** Samples N translation candidates at a higher temperature (0.7) for diversity, then prompts the LLM to score each candidate from 1 to 10 against specified criteria, selecting the highest-scored translation.
Source Text → N Candidate Translations → LLM Scoring → Highest Score Selected
Best for: Cost-effective translation when a language-agnostic approach is needed, or when a judge model trained for quality scoring is available.
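A rough sketch of this loop under the same assumptions (OpenAI SDK, hypothetical prompts); the repository's actual scoring prompt and parsing will differ:

```python
import re
from openai import OpenAI

client = OpenAI()

def chat(prompt: str, temperature: float) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

def best_of_n(text: str, lang: str, n: int = 5) -> str:
    # Higher sampling temperature (0.7) encourages diverse candidates.
    candidates = [
        chat(f"Translate into {lang}:\n\n{text}", temperature=0.7)
        for _ in range(n)
    ]

    def score(candidate: str) -> int:
        reply = chat(
            f"Rate this {lang} translation of the source from 1 to 10 for "
            f"accuracy and fluency. Reply with a single integer.\n\n"
            f"Source: {text}\n\nTranslation: {candidate}",
            temperature=0.1,
        )
        match = re.search(r"\d+", reply)
        return int(match.group()) if match else 0

    return max(candidates, key=score)
```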
**Universal Self-Improvement (USI).** Building on Universal Self-Consistency and Fusion-of-N, this method samples N candidate translations at a higher temperature, then presents them to an evaluator LLM with instructions to combine the candidates into the best version according to specified criteria. Requires only N + 1 model calls per entry.
Best for: Short and simple dataset translation; cost-efficient for lower-resource languages.
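A sketch of the fusion step, again with placeholder prompts and the OpenAI SDK; note the N sampling calls followed by a single fusion call:

```python
from openai import OpenAI

client = OpenAI()

def chat(prompt: str, temperature: float) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

def usi_translate(text: str, lang: str, n: int = 5) -> str:
    # N sampling calls + 1 fusion call = N + 1 calls per entry.
    candidates = [
        chat(f"Translate into {lang}:\n\n{text}", temperature=0.7)
        for _ in range(n)
    ]
    listing = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    return chat(
        f"Here are {n} candidate {lang} translations of the same source. "
        f"Combine their strengths into one best translation and output only "
        f"that translation.\n\nSource: {text}\n\nCandidates:\n{listing}",
        temperature=0.1,
    )
```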
**T-RANK.** Our proposed method employs multi-prompt candidate sampling and multi-round competitive ranking to improve error detection. Candidates are presented in different positional orders across rounds to reduce positional bias. After ranking, the judge model corrects and refines the selected translation candidate. Requires 2N + 1 model calls per entry.
Best for: Benchmark translation with complex question structures and specific domain terminology; highest quality when cost is not a primary concern.
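A schematic sketch of the idea only, not the paper's exact algorithm: for brevity a single model plays both translator and judge (the framework configures a separate `judge_model`), and the prompt templates are placeholders:

```python
import random
from collections import Counter
from openai import OpenAI

client = OpenAI()

def chat(prompt: str, temperature: float) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

# Placeholder prompt variants; the repository keeps per-language templates
# in src/benchmark/prompts/mq_base_translation_prompts/<language>/.
PROMPTS = [
    "Translate into {lang}, preserving formatting:\n\n{text}",
    "You are a professional translator. Render this text in {lang}:\n\n{text}",
]

def t_rank(text: str, lang: str) -> str:
    # 1) Multi-prompt sampling: one candidate per prompt (N calls).
    cands = [chat(p.format(lang=lang, text=text), 0.7) for p in PROMPTS]
    n = len(cands)
    # 2) N ranking rounds; shuffling the presentation order each round
    #    counters positional bias (N calls).
    votes = Counter()
    for _ in range(n):
        order = random.sample(range(n), n)
        listing = "\n".join(f"{i + 1}. {cands[j]}" for i, j in enumerate(order))
        reply = chat(
            f"Pick the best {lang} translation of the source. Reply with its "
            f"number only.\n\nSource: {text}\n\n{listing}",
            0.1,
        )
        digits = "".join(ch for ch in reply if ch.isdigit())
        if digits and 1 <= int(digits) <= n:
            votes[order[int(digits) - 1]] += 1
    winner = cands[votes.most_common(1)[0][0]] if votes else cands[0]
    # 3) One final refinement call on the winner (2N + 1 calls total).
    return chat(
        f"Correct any remaining errors in this {lang} translation and output "
        f"only the final text.\n\nSource: {text}\n\nTranslation: {winner}",
        0.1,
    )
```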
```bash
# Translate a dataset
python run.py --config_path configs/dataset/WMT/dataset_wmt_uk.yaml

# Translate a benchmark
python run.py --config_path configs/benchmark/MMLU/bench_mmlu_bg.yaml
```

All translation jobs are configured via YAML files located in the `configs/` directory.

Example benchmark config:
task: "BENCHMARK"
output_dir: "src/benchmark/data"
translation_model:
name: "gpt-4o-mini-2024-07-18"
provider: "openai"
judge_model:
name: "gpt-4o-mini-2024-07-18"
provider: "openai"
task_config:
benchmark:
name: "cais/mmlu"
subset: ["all"]
split: ["test"]
n_entries: null
target_language: "Ukrainian"
method: "TRANK" # SC, USI, BoN, TRANK
temperature_translator: 0.5
temperature_judge: 0.1
max_workers: true
num_workers: 4
n_samples: 5
question_fields: ["question"]
answer_fields: ["choices"]
agent_check: false
few_shot: false
multi_prompt: falsetask: "DATASET"
output_dir: "src/dataset/data/flores/bg"
translation_model:
name: "gpt-4o-mini-2024-07-18"
provider: "openai"
task_config:
dataset:
name: "gsarti/flores_101"
subset: ["eng"]
split: ["devtest"]
target_language: "Bulgarian"
method: "TRANK"
fields: ["sentence"]
temperature_translator: 0.5
n_samples: 5| Provider | Example Model | Config |
|---|---|---|
| OpenAI | `gpt-4o-mini-2024-07-18` | `provider: "openai"` |
| Google Gemini | `gemini-2.0-flash` | `provider: "google"` |
| TogetherAI | `meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo` | `provider: "together"` |
| OpenRouter | `anthropic/claude-3-sonnet` | `provider: "openrouter"` |
| Local vLLM | Custom model | `provider: "vllm"` |
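Several of these providers expose OpenAI-compatible endpoints, so a model factory can often dispatch on the provider string just by swapping the base URL. A hypothetical sketch, not the repository's actual `model_factory.py` (Gemini would need its own SDK):

```python
from openai import OpenAI

# OpenAI-compatible endpoints; the URLs are the providers' documented defaults.
BASE_URLS = {
    "openai": None,  # the SDK's default, api.openai.com
    "together": "https://api.together.xyz/v1",
    "openrouter": "https://openrouter.ai/api/v1",
    "vllm": "http://localhost:8000/v1",  # a locally served vLLM model
}

def make_client(provider: str, api_key: str) -> OpenAI:
    return OpenAI(api_key=api_key, base_url=BASE_URLS[provider])
```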
TODO: add Cohere API👀
```
├── run.py                      # Main entry point
├── credentials.py              # API credentials (create this)
├── requirements.txt            # Dependencies
│
├── configs/                    # Configuration files
│   ├── benchmark/              # Benchmark configs (ARC, Hellaswag, MMLU, Winogrande)
│   └── dataset/                # Dataset configs (FLORES, WMT)
│
└── src/
    ├── initialization.py       # Config parsing (Pydantic validation)
    ├── translate_benchmark.py  # Benchmark translation pipeline
    ├── translate_dataset.py    # Dataset translation pipeline
    │
    ├── benchmark/
    │   ├── methods.py          # SC, USI, BoN, T-RANK implementations
    │   ├── model_factory.py    # Multi-provider LLM interface
    │   ├── utils.py            # Prompt loading, text processing
    │   ├── save_to_hf.py       # HuggingFace Hub upload
    │   ├── prompts/            # Prompt templates
    │   └── eval_mmlu/          # Evaluation scripts (COMET, LLM-judge)
    │
    ├── dataset/
    │   ├── methods.py          # Dataset translation methods
    │   ├── model_factory.py    # LLM interface
    │   ├── utils.py            # Utilities
    │   └── prompts/            # Prompt templates
    │
    └── common_utils/
        └── serve_local_vllm.sh # Local vLLM server script
```
The framework includes multiple evaluation methods:
```bash
python src/benchmark/eval_mmlu/evaluate_translations_comet.py      # COMET (reference-based)
python src/benchmark/eval_mmlu/evaluate_mmlu_comet_qe.py           # COMET-QE (reference-free)
python src/benchmark/eval_mmlu/evaluate_translations_llm_judge.py  # LLM-as-a-judge
python src/benchmark/eval_mmlu/manual_evaluation.py                # Manual evaluation
```

Our methods demonstrate substantial improvements on WMT24++ and FLORES benchmarks:
| Method | WMT24++ | FLORES |
|---|---|---|
| Baseline | 0.827 | 0.937 |
| SC (with check) | 0.821 | 0.937 |
| Best-of-N (n=5) | 0.843 | 0.943 |
| USI (n=5) | 0.843 | 0.945 |
| T-RANK (p=5) | 0.845 | 0.940 |
COMET reference-based scores for EN→UK translation with GPT-4o-mini. Here n denotes the number of candidates sampled from the same prompt, and p denotes the number of different prompts, each used to sample one candidate.
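The scores above are COMET scores. A minimal sketch of computing such scores with the `unbabel-comet` package (the model choice and data layout follow the library's documented API, not necessarily the repository's evaluation scripts):

```python
from comet import download_model, load_from_checkpoint

# wmt22-comet-da is a standard reference-based COMET model.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

data = [
    {
        "src": "The cat sat on the mat.",  # English source
        "mt": "Кіт сидів на килимку.",     # system translation
        "ref": "Кіт сидів на килимку.",    # human reference
    },
]
output = model.predict(data, batch_size=8, gpus=0)  # gpus=1 if CUDA is available
print(output.system_score)  # corpus-level score, roughly in [0, 1]
```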
LLM-as-a-judge evaluation shows our T-RANK translations significantly outperform Global-MMLU:
| Translation | Wins | Draws | Losses |
|---|---|---|---|
| Global-MMLU-UK | 2016 | 3276 | 8750 |
| T-RANK (ours) | 8750 | 3276 | 2016 |
- Create language-specific prompts in `src/benchmark/prompts/mq_base_translation_prompts/<language>/`
- Create few-shot examples (optional) in `src/benchmark/prompts/few_shot_*.txt`
- Create configuration files in `configs/` by copying an existing config and updating `target_language` (see the sketch below)
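For example, a hypothetical Greek MMLU config (the new filename is made up) would copy an existing one and change only the language field:

```yaml
# configs/benchmark/MMLU/bench_mmlu_el.yaml, copied from bench_mmlu_bg.yaml;
# everything identical except:
target_language: "Greek"
```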
This project is released under the MIT License.
This work was done during a Master's thesis at INSAIT, Sofia University "St. Kliment Ohridski".
If you find this work useful, please cite:
```bibtex
@article{yukhymenko2026recovered,
  title={Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets},
  author={Yukhymenko, Hanna and Alexandrov, Anton and Vechev, Martin},
  journal={arXiv preprint arXiv:2602.22207},
  year={2026}
}
```

For questions or issues, please open an issue on GitHub or contact: hanna.yukhymenko@insait.ai

