Add dataset upload guide for LLMs (experimental guide) #1830


Merged
merged 7 commits into main on Jul 16, 2025

Conversation

davanstrien
Member

Overview

This PR introduces an experimental guide designed to help LLMs assist users in uploading datasets to the Hugging Face Hub in a Hub-compatible format. The goal is something like an llms.txt file, focused less on the datasets library and more on "how do I get this dataset onto the Hub in the most optimal format?"

The primary use case is researchers or practitioners who have existing datasets (research data, domain-specific collections, etc.) that aren't in native Hub formats. They can provide this guide to an LLM (Claude Code, Cursor, or other AI coding assistants) along with a pointer to their local dataset. While advanced agents may handle much of the conversion automatically, even basic LLM + prompting workflows benefit from this structured guidance.

This is a proof of concept to test whether we should consider adding more LLM-focused documentation to the Hub docs for datasets. The guide is structured to work well with LLM context windows and provides a decision-tree approach that LLMs can follow to give users optimal upload recommendations.
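The decision-tree idea the guide is built around could be sketched roughly like this; the function name, branches, and thresholds below are illustrative stand-ins, not the guide's actual tree:

```python
# Illustrative sketch (not from the guide): the kind of decision tree an LLM
# could follow to recommend an upload path. Branches and thresholds are
# hypothetical examples only.
def recommend_upload_path(modality: str, size_gb: float) -> str:
    if modality in {"image", "audio", "video"}:
        # Built-in loaders handle folder-of-files layouts directly
        return f"{modality}folder built-in loader + push_to_hub"
    if size_gb > 100:
        # Very large datasets: shard and use the resumable upload method
        return "convert to Parquet shards + upload_large_folder"
    return "datasets.Dataset + push_to_hub"
```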

Motivation

Without dedicated guidance, LLMs may recommend outdated or suboptimal approaches for dataset uploads based on their training data. This guide ensures LLMs have access to current best practices, including:

  • Latest built-in loaders (imagefolder, audiofolder, videofolder, etc.)
  • Current repository limits and recommendations
  • Modern upload methods like upload_large_folder for large datasets
  • Dataset Viewer compatibility requirements
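As one example of the modern upload methods mentioned above, a minimal sketch of `upload_large_folder` from `huggingface_hub` might look like this; the repo id and folder path are placeholders, and authentication via `HF_TOKEN` is assumed:

```python
# Hedged sketch (not taken from the guide): resumable, parallel upload of a
# large local dataset folder via huggingface_hub's upload_large_folder.
# Repo id and folder path are placeholders; assumes HF_TOKEN is set.
from huggingface_hub import HfApi

def upload_big_dataset(folder_path: str, repo_id: str) -> None:
    api = HfApi()
    api.upload_large_folder(
        repo_id=repo_id,
        repo_type="dataset",  # target a dataset repo rather than a model repo
        folder_path=folder_path,
    )
```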

What's included

New guide: datasets-upload-guide-llm.md covering:

  • Decision workflow for choosing upload methods
  • Built-in loader usage (imagefolder, audiofolder, etc.)
  • Custom format conversion strategies
  • Validation and troubleshooting steps
  • Multi-modal dataset examples

The guide is also added to the TOC under the Datasets section.

Note: This is experimental, and we can iterate on the format based on feedback and more real-world testing with LLMs. We may not want to merge this straight away (or at all), but I think it could be nice to start pointing people to this guide.

I also have a fairly good agent-based approach for end-to-end conversion of existing datasets to Hub-compatible formats, but I think it makes sense to start with this more minimal guide first, so we can get a sense of how well it works for helping people with this.

cc @lhoestq @julien-c for viz.

- Add new guide specifically designed to help LLMs assist users with dataset uploads
- Covers decision workflow, built-in loaders, custom format conversion, and validation
- Includes practical examples for various data types and upload scenarios
- Add entry to table of contents under Datasets section
- Add commands for LLMs to request when they don't have direct file access
- Includes tree/find commands for structure, du/ls for sizes, head for data preview
- Helps LLMs guide users on web interfaces (Claude, ChatGPT browser)
- Placed early in guide for immediate visibility
- Remove unintended changes from other work
- Keep only the addition of datasets-upload-guide-llm entry
- Add structured YAML block with Hub constraints for LLM parsing
- Helps prevent hallucination of limits by providing exact values
- Include comments explaining each limit
- Keep human-readable summary alongside machine format
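The structure/size preview that the guide asks users to produce with tree/du/head can also be approximated in a few lines of Python; this is an illustrative stand-in, not code from the guide:

```python
# Illustrative stand-in for the tree/du/head commands the guide suggests:
# list relative paths and file sizes so an LLM without file access can see
# the dataset layout. Function name and cap are hypothetical.
from pathlib import Path

def preview_structure(root: str, max_entries: int = 50) -> list[str]:
    entries = []
    for path in sorted(Path(root).rglob("*")):
        if len(entries) >= max_entries:
            entries.append("...")  # truncate long listings for context windows
            break
        if path.is_file():
            entries.append(f"{path.relative_to(root)}  ({path.stat().st_size} bytes)")
        else:
            entries.append(f"{path.relative_to(root)}/")
    return entries
```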
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@lhoestq lhoestq left a comment


nice! I wonder if we should also include this in the HF datasets MCP as a list of instructions accessible for agents to read?

edit: as replied on slack, this is more for llm + search than coding agents so it's fine as docs :)

@davanstrien
Member Author

Thanks for the feedback @lhoestq! I've addressed the technical issues:

  1. ✅ Fixed Array1DList() with length parameter
  2. ✅ Added push_to_hub() calls to the multi-modal examples

Regarding the toctree placement: I initially placed it at the same level as "Uploading Datasets" for better discoverability by LLM agents (as you mentioned, SEO for LLM agents), but I'm happy to move it as a subsection under "Uploading Datasets" if that's preferred. What do you think would be best?

Member

@julien-c julien-c left a comment


cool idea. I think it's a "Resource" in MCP speak no? https://modelcontextprotocol.io/docs/concepts/resources

(cc @evalstate too)

Co-authored-by: Julien Chaumond <[email protected]>
@davanstrien davanstrien merged commit f94af89 into main Jul 16, 2025
2 checks passed
@davanstrien davanstrien deleted the datasets-llm-txt branch July 16, 2025 11:10