Datasets, Models and Papers
This page documents the official and community datasets, models and research papers featuring TLDR-pages.
We generate and provide datasets in formats such as CSV, XML, JSON and TMX (Translation Memory eXchange) using the https://github.com/tldr-pages/tldr-translation-pairs-gen tool. They can be found under its latest release. These artifacts are also available from the sources below:
- OPUS TLDR-pages Dataset (TMX format) (2023 - present)
  - OPUS is a public dataset of translated resources on the web. All translations are derived from freely available and openly licensed sources, so the translations themselves are safe to use with minimal restrictions.
  - These datasets are helpful for a variety of applications such as research and machine learning.
  - A notable project that uses the OPUS corpora is LibreTranslate (which is powered by argos-translate).
- Kaggle Translation Pairs Dataset (CSV format) (2024 - present)
  - Kaggle is a data science competition platform and online community of data scientists and machine learning practitioners, owned by Google LLC.
  - It is popular among students, researchers, and data scientists.
  - This multilingual text dataset contains paired strings mapping various localized TLDR-pages.
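For readers who have not worked with TMX before, the following is a minimal sketch of how translation pairs could be read from a TMX document using only the Python standard library. It assumes the generic TMX 1.4 layout (`tu` units containing `tuv` variants tagged with `xml:lang`, each holding a `seg`); the exact layout of the published TLDR-pages files has not been checked here. The sample content and the `read_pairs` helper are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# xml:lang lives in the reserved XML namespace, so ElementTree exposes it
# under this fully qualified attribute name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample: the strings below are made up for illustration and are
# not taken from the real dataset.
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>List files one per line</seg></tuv>
      <tuv xml:lang="fr"><seg>Lister les fichiers, un par ligne</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Show the current directory</seg></tuv>
      <tuv xml:lang="de"><seg>Zeige das aktuelle Verzeichnis</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_pairs(tmx_text, source_lang, target_lang):
    """Return (source, target) pairs for units that contain both languages."""
    root = ET.fromstring(tmx_text.encode("utf-8"))
    pairs = []
    for unit in root.iter("tu"):
        # Map each language code in this translation unit to its segment text.
        segments = {}
        for variant in unit.iter("tuv"):
            seg = variant.find("seg")
            if seg is not None:
                segments[variant.get(XML_LANG)] = "".join(seg.itertext()).strip()
        # Keep the unit only when both requested languages are present.
        if source_lang in segments and target_lang in segments:
            pairs.append((segments[source_lang], segments[target_lang]))
    return pairs


pairs = read_pairs(SAMPLE_TMX, "en", "fr")
print(pairs)
```

For a downloaded file, the same function could be given the file's text after reading it with `open(path, encoding="utf-8")`. Files in the CSV format can be loaded more directly with the standard `csv` module.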
Warning
The links below point to various community datasets for academic and research reference. Use them at your own discretion, as their contents aren't vetted by our maintainers.
- https://www.kaggle.com/datasets/bppuneethpai/tldr-summary-for-man-pages (2020) - This dataset provides paired man pages and their concise tldr summaries, facilitating the development of text summarization models.
- https://huggingface.co/datasets/neulab/tldr (2022) (Research paper) - Natural language to bash generation dataset based on tldr pages in English, used for evaluating code generation.
- https://huggingface.co/datasets/tmskss/linux-man-pages-tldr-summarized (2023) - This dataset provides a small CSV of Linux man pages in English paired with their concise tldr summaries for text summarization tasks.
- https://huggingface.co/datasets/Edoigtrd/tldr-pages (2024) - This dataset contains Linux Bash commands from tldr along with their descriptions.
- Explainable Natural Language to Bash Translation using Abstract Syntax Tree (2021)
- DocPrompting: Generating Code by Retrieving the Docs (2022, 2023)
- ShellFusion: An Answer Generator for Shell Programming Tasks via Knowledge Fusion (2023)
- LLM-Supported Natural Language to Bash Translation (2025)
Newer research papers featuring TLDR-pages can be found on Google Scholar.