Datasets, Models and Papers

This page documents the official and community datasets, models and research papers featuring TLDR-pages.

Datasets

Official Datasets

We generate datasets in formats such as CSV, XML, JSON and TMX (Translation Memory eXchange) using the https://github.com/tldr-pages/tldr-translation-pairs-gen tool. They can be found under the tool's latest release. These artifacts are also available from the sources below:

OPUS TLDR-pages Dataset (TMX format) (2023 - present)
- OPUS is a public dataset of translated resources on the web. All translations are derived from freely available and openly licensed sources, so the translations themselves are safe to use with minimal restrictions.
- These datasets are helpful for a variety of applications such as research and machine learning.
- A notable project that uses the OPUS corpora is LibreTranslate (which is powered by argos-translate).
Kaggle Translation Pairs Dataset (CSV format) (2024 - present)
- Kaggle is a data science competition platform and online community of data scientists and machine learning practitioners, owned by Google LLC.
- It is popular among students, researchers, and data scientists.
- This multilingual text dataset contains paired strings mapping various localized TLDR-pages.
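As a rough illustration of how translation pairs might be read from a TMX file, here is a minimal sketch using only the Python standard library. It assumes the common TMX 1.4 layout (`tu` units containing `tuv` variants tagged with `xml:lang`, each holding a `seg`). The sample document and the `extract_pairs` helper are hypothetical and written for this example; the files published by the generator tool or OPUS may differ in detail.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample in TMX 1.4 layout; real dataset files may differ.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>List files one per line</seg></tuv>
      <tuv xml:lang="fr"><seg>Lister les fichiers, un par ligne</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Create a directory</seg></tuv>
      <tuv xml:lang="de"><seg>Erstelle ein Verzeichnis</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def extract_pairs(tmx_text, source_lang, target_lang):
    """Return (source, target) string pairs for the two requested languages."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for unit in root.iter("tu"):
        # Map each language code in this translation unit to its segment text.
        segments = {}
        for variant in unit.findall("tuv"):
            seg = variant.find("seg")
            if seg is not None:
                segments[variant.get(XML_LANG)] = "".join(seg.itertext())
        # Keep only units that contain both requested languages.
        if source_lang in segments and target_lang in segments:
            pairs.append((segments[source_lang], segments[target_lang]))
    return pairs


print(extract_pairs(SAMPLE_TMX, "en", "fr"))
```

Units that lack either of the two requested languages are skipped, so the same file can be queried for different language pairs.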

Community Datasets

Warning

The links below point to various community datasets for academic and research reference. Use them at your own discretion, as their contents aren't vetted by our maintainers.

https://www.kaggle.com/datasets/bppuneethpai/tldr-summary-for-man-pages (2020) - This dataset provides paired man pages and their concise tldr summaries, facilitating the development of text summarization models.
https://huggingface.co/datasets/neulab/tldr (2022) (Research paper) - Natural language to Bash generation dataset based on tldr pages in English, used for evaluating code generation.
https://huggingface.co/datasets/tmskss/linux-man-pages-tldr-summarized (2023) - This dataset provides a small CSV of Linux man pages in English paired with their concise tldr summaries for text summarization tasks.
https://huggingface.co/datasets/Edoigtrd/tldr-pages (2024) - This dataset contains Linux Bash commands from tldr along with their descriptions.

Papers

Newer research papers featuring TLDR pages can be found on Google Scholar.
