Datasets, Models and Papers

K.B.Dharun Krishna edited this page Apr 29, 2025 · 2 revisions

This page documents the official and community datasets, models and research papers featuring TLDR-pages.

Datasets

Official Datasets

We generate datasets in CSV, XML, JSON and TMX (Translation Memory eXchange) formats using the https://github.com/tldr-pages/tldr-translation-pairs-gen tool. They can be found under its latest release. These artifacts are also available from the sources below:

  • OPUS TLDR-pages Dataset (TMX format) (2023 - present)

    • OPUS is a public dataset of translated resources on the web. All translations are derived from freely available and openly licensed sources, so the translations themselves are safe to use with minimal restrictions.
    • These datasets are helpful for a variety of applications such as research and machine learning.
    • A notable project that uses the OPUS corpora is LibreTranslate (which is powered by argos-translate).
  • Kaggle Translation Pairs Dataset (CSV format) (2024 - present)

    • Kaggle is a data science competition platform and online community of data scientists and machine learning practitioners under Google LLC.
    • It is popular among students, researchers, and data scientists.
    • This multilingual text dataset contains paired strings mapping various localized TLDR-pages.
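
To work with the TMX artifacts mentioned above, a minimal Python sketch for extracting translation pairs might look like the following. It relies only on the general TMX layout (`tu` units containing `tuv` variants, each with a `seg`). The sample document, its strings, and the `read_pairs` helper are hypothetical illustrations, not content or tooling taken from the actual datasets.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical two-language sample; the real files are much larger.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>List files one per line</seg></tuv>
      <tuv xml:lang="de"><seg>Liste Dateien zeilenweise auf</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Show the current directory</seg></tuv>
      <tuv xml:lang="de"><seg>Zeige das aktuelle Verzeichnis</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_pairs(tmx_text, src, tgt):
    """Return (source, target) string pairs for the two language codes."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        # Collect the segment text of every language variant in this unit.
        segs = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                segs[tuv.get(XML_LANG)] = "".join(seg.itertext())
        # Keep only units that contain both requested languages.
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs


pairs = read_pairs(SAMPLE_TMX, "en", "de")
for source, target in pairs:
    print(f"{source} -> {target}")
```

For a downloaded file, the same loop can be run over `ET.parse(path).getroot()` instead of `ET.fromstring`. The language codes used in the real files should be checked first, since they may differ from the ones assumed here.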

Community Datasets

Warning

The links below contain various datasets from the community for academic and research reference. Use them at your own discretion, as their contents aren't vetted by our maintainers.

Papers

Newer research papers featuring TLDR pages can be found on Google Scholar.