Add dataset upload guide for LLMs (experimental guide) #1830
Conversation
- Add new guide specifically designed to help LLMs assist users with dataset uploads
- Covers decision workflow, built-in loaders, custom format conversion, and validation
- Includes practical examples for various data types and upload scenarios
- Add entry to table of contents under Datasets section
- Add commands for LLMs to request when they don't have direct file access
- Includes tree/find commands for structure, du/ls for sizes, head for data preview (see the Python sketch below)
- Helps LLMs guide users on web interfaces (Claude, ChatGPT browser)
- Placed early in guide for immediate visibility
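Since those commands are meant for users to run and paste back, a single Python script can gather the same information in one pass. This is a minimal sketch of my own (the helper name, file-type filter, and preview length are not from the guide):

```python
from pathlib import Path

def inspect_dataset(root: str, preview_lines: int = 5) -> None:
    """Print structure, sizes, and a small data preview for a local dataset.

    Rough equivalent of the tree/find + du/ls + head commands the guide
    asks users to run when the LLM has no direct file access.
    """
    root_path = Path(root)
    total = 0
    for path in sorted(root_path.rglob("*")):
        rel = path.relative_to(root_path)
        if path.is_file():
            size = path.stat().st_size
            total += size
            print(f"{rel}  ({size / 1_048_576:.2f} MiB)")
        else:
            print(f"{rel}/")
    print(f"total size: {total / 1_048_576:.2f} MiB")

    # head-style preview of the first text-like data file found
    for path in sorted(root_path.rglob("*")):
        if path.suffix in {".csv", ".json", ".jsonl", ".txt"}:
            print(f"\n--- first {preview_lines} lines of {path.name} ---")
            with open(path, encoding="utf-8", errors="replace") as f:
                for _, line in zip(range(preview_lines), f):
                    print(line.rstrip())
            break

inspect_dataset("path/to/my_dataset")  # placeholder path
```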
- Remove unintended changes from other work
- Keep only the addition of the datasets-upload-guide-llm entry
- Add structured YAML block with Hub constraints for LLM parsing (see the sketch below)
- Helps prevent hallucination of limits by providing exact values
- Include comments explaining each limit
- Keep human-readable summary alongside machine format
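For illustration, the machine-readable block might look something like the following. The keys and values here are placeholders of my own, not the actual limits in the guide; parsing only needs PyYAML:

```python
import yaml  # PyYAML

# Hypothetical shape of the guide's constraints block; the real keys and
# values live in datasets-upload-guide-llm.md, these are illustrative only.
HUB_CONSTRAINTS_YAML = """
hub_limits:
  max_file_size_gb: 50          # hard per-file limit (illustrative value)
  recommended_file_size_gb: 20  # split files above this (illustrative value)
  recommended_formats: [parquet, jsonl, csv]
"""

constraints = yaml.safe_load(HUB_CONSTRAINTS_YAML)["hub_limits"]
print(constraints["max_file_size_gb"])
```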
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
nice! I wonder if we should also include this in the HF datasets MCP as a list of instructions accessible for agents to read?
edit: as replied on slack, this is more for llm + search than coding agents so it's fine as docs :)
Thanks for the feedback @lhoestq! I've addressed the technical issues:
Regarding the toctree placement: I initially placed it at the same level as "Uploading Datasets" for better discoverability by LLM agents (as you mentioned, SEO for LLM agents), but I'm happy to move it as a subsection under "Uploading Datasets" if that's preferred. What do you think would be best?
cool idea. I think it's a "Resource" in MCP speak no? https://modelcontextprotocol.io/docs/concepts/resources
(cc @evalstate too)
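For reference, exposing the guide as an MCP resource could look roughly like this with the MCP Python SDK's FastMCP helper; the server name, resource URI, and file path below are made up for the sketch:

```python
from pathlib import Path

from mcp.server.fastmcp import FastMCP  # MCP Python SDK

mcp = FastMCP("hf-datasets-docs")  # hypothetical server name

@mcp.resource("docs://datasets-upload-guide-llm")  # hypothetical URI
def upload_guide() -> str:
    """Serve the guide's markdown so agents can read it on demand."""
    return Path("datasets-upload-guide-llm.md").read_text(encoding="utf-8")

if __name__ == "__main__":
    mcp.run()
```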
Co-authored-by: Julien Chaumond <[email protected]>
Overview
This PR introduces an experimental guide tailored to help LLMs assist users in uploading datasets to the Hugging Face Hub in Hub-compatible formats. The goal is to be an llms.txt-style file, but focused less on the datasets library and more on "how do I get this dataset onto the Hub in the most optimal format".
The primary use case is researchers or practitioners who have existing datasets (research data, domain-specific collections, etc.) that aren't in native Hub formats. They can provide this guide to an LLM (Claude Code, Cursor, or other AI coding assistants) along with a pointer to their local dataset. While advanced agents may handle much of the conversion automatically, even basic LLM + prompting workflows benefit from this structured guidance.
This is a proof of concept to test whether we should consider adding more LLM-focused documentation to the Hub docs for datasets. The guide is structured to work well with LLM context windows and provides a decision-tree approach that LLMs can follow to give users optimal upload recommendations.
Motivation
Without dedicated guidance, LLMs may recommend outdated or suboptimal approaches for dataset uploads based on their training data. This guide ensures LLMs have access to current best practices, including:
- upload_large_folder for large datasets (sketched below)
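As a concrete example of that recommendation, a minimal sketch with huggingface_hub (the repo id and folder path are placeholders):

```python
from huggingface_hub import HfApi

api = HfApi()  # assumes you are already authenticated, e.g. via `huggingface-cli login`

# upload_large_folder is resumable and parallelized, which is why it is
# preferred over plain upload_folder for large datasets.
api.upload_large_folder(
    repo_id="username/my-dataset",      # placeholder repo id
    repo_type="dataset",
    folder_path="path/to/my_dataset",   # placeholder local path
)
```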
What's included

New guide: datasets-upload-guide-llm.md covering the decision workflow, built-in loaders, custom format conversion, and validation.

Note: This is experimental, and we can iterate on the format based on feedback and more real-world testing with LLMs. We may not want to merge this straight away (or at all), but I think it could be nice to start pointing people to this guide.
I also have a relatively good agent-based approach for end-to-end conversion of existing datasets to Hub-compatible formats, but I think it makes sense to start with this more minimal approach first, so we can get a sense of how well it can work for helping people with their uploads.
cc @lhoestq @julien-c for viz.