Add dataset upload guide for LLMs (experimental guide) #1830


Merged
merged 7 commits into main on Jul 16, 2025

Conversation

davanstrien
Member

Overview

This PR introduces an experimental guide designed to help LLMs assist users in uploading datasets to the Hugging Face Hub in a Hub-compatible format. The goal is something like an llms.txt file, focused less on the datasets library and more on "how do I get this dataset onto the Hub in the most optimal format?"

The primary use case is researchers or practitioners who have existing datasets (research data, domain-specific collections, etc.) that aren't in native Hub formats. They can provide this guide to an LLM (Claude Code, Cursor, or other AI coding assistants) along with a pointer to their local dataset. While advanced agents may handle much of the conversion automatically, even basic LLM + prompting workflows benefit from this structured guidance.

This is a proof of concept to test whether we should consider adding more LLM-focused documentation to the Hub docs for datasets. The guide is structured to work well with LLM context windows and provides a decision-tree approach that LLMs can follow to give users optimal upload recommendations.
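The decision-tree idea the guide is built around could be sketched roughly like this; the function name, branches, and thresholds below are illustrative stand-ins, not the guide's actual tree:

```python
# Illustrative sketch (not from the guide): the kind of decision tree an LLM
# could follow to recommend an upload path. Branches and thresholds are
# hypothetical examples only.
def recommend_upload_path(modality: str, size_gb: float) -> str:
    if modality in {"image", "audio", "video"}:
        # Built-in loaders handle folder-of-files layouts directly
        return f"{modality}folder built-in loader + push_to_hub"
    if size_gb > 100:
        # Very large datasets: shard and use the resumable upload method
        return "convert to Parquet shards + upload_large_folder"
    return "datasets.Dataset + push_to_hub"
```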

Motivation

Without dedicated guidance, LLMs may recommend outdated or suboptimal approaches for dataset uploads based on their training data. This guide ensures LLMs have access to current best practices, including:

  • Latest built-in loaders (imagefolder, audiofolder, videofolder, etc.)
  • Current repository limits and recommendations
  • Modern upload methods like upload_large_folder for large datasets
  • Dataset Viewer compatibility requirements
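As one example of the modern upload methods mentioned above, a minimal sketch of `upload_large_folder` from `huggingface_hub` might look like this; the repo id and folder path are placeholders, and authentication via `HF_TOKEN` is assumed:

```python
# Hedged sketch (not taken from the guide): resumable, parallel upload of a
# large local dataset folder via huggingface_hub's upload_large_folder.
# Repo id and folder path are placeholders; assumes HF_TOKEN is set.
from huggingface_hub import HfApi

def upload_big_dataset(folder_path: str, repo_id: str) -> None:
    api = HfApi()
    api.upload_large_folder(
        repo_id=repo_id,
        repo_type="dataset",  # target a dataset repo rather than a model repo
        folder_path=folder_path,
    )
```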

What's included

New guide: datasets-upload-guide-llm.md covering:

  • Decision workflow for choosing upload methods
  • Built-in loader usage (imagefolder, audiofolder, etc.)
  • Custom format conversion strategies
  • Validation and troubleshooting steps
  • Multi-modal dataset examples

The guide is also added to the TOC under the Datasets section.

Note: This is experimental, and we can iterate on the format based on feedback and more real-world testing with LLMs. We may not want to merge this straight away (or at all), but I think it could be nice to start pointing people to this guide.

I also have a fairly good agent-based approach for end-to-end conversion of existing datasets to Hub-compatible formats, but I think it makes sense to start with this more minimal guide first, so we can get a sense of how well it works for helping people with this.

cc @lhoestq @julien-c for viz.

- Add new guide specifically designed to help LLMs assist users with dataset uploads
- Covers decision workflow, built-in loaders, custom format conversion, and validation
- Includes practical examples for various data types and upload scenarios
- Add entry to table of contents under Datasets section
- Add commands for LLMs to request when they don't have direct file access
- Includes tree/find commands for structure, du/ls for sizes, head for data preview
- Helps LLMs guide users on web interfaces (Claude, ChatGPT browser)
- Placed early in guide for immediate visibility
- Remove unintended changes from other work
- Keep only the addition of datasets-upload-guide-llm entry
- Add structured YAML block with Hub constraints for LLM parsing
- Helps prevent hallucination of limits by providing exact values
- Include comments explaining each limit
- Keep human-readable summary alongside machine format
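The structure/size preview that the guide asks users to produce with tree/du/head can also be approximated in a few lines of Python; this is an illustrative stand-in, not code from the guide:

```python
# Illustrative stand-in for the tree/du/head commands the guide suggests:
# list relative paths and file sizes so an LLM without file access can see
# the dataset layout. Function name and cap are hypothetical.
from pathlib import Path

def preview_structure(root: str, max_entries: int = 50) -> list[str]:
    entries = []
    for path in sorted(Path(root).rglob("*")):
        if len(entries) >= max_entries:
            entries.append("...")  # truncate long listings for context windows
            break
        if path.is_file():
            entries.append(f"{path.relative_to(root)}  ({path.stat().st_size} bytes)")
        else:
            entries.append(f"{path.relative_to(root)}/")
    return entries
```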
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@lhoestq lhoestq left a comment


nice! I wonder if we should also include this in the HF datasets MCP as a list of instructions accessible for agents to read?

edit: as replied on slack, this is more for llm + search than coding agents so it's fine as docs :)

@davanstrien
Member Author

Thanks for the feedback @lhoestq! I've addressed the technical issues:

  1. ✅ Fixed Array1DList() with length parameter
  2. ✅ Added push_to_hub() calls to the multi-modal examples

Regarding the toctree placement: I initially placed it at the same level as "Uploading Datasets" for better discoverability by LLM agents (as you mentioned, SEO for LLM agents), but I'm happy to move it as a subsection under "Uploading Datasets" if that's preferred. What do you think would be best?

Member

@julien-c julien-c left a comment


cool idea. I think it's a "Resource" in MCP speak no? https://modelcontextprotocol.io/docs/concepts/resources

(cc @evalstate too)

Co-authored-by: Julien Chaumond <[email protected]>
@davanstrien davanstrien merged commit f94af89 into main Jul 16, 2025
2 checks passed
@davanstrien davanstrien deleted the datasets-llm-txt branch July 16, 2025 11:10