diff --git a/docs/hub/_toctree.yml b/docs/hub/_toctree.yml
index 5ba9ead00..4d38a15b3 100644
--- a/docs/hub/_toctree.yml
+++ b/docs/hub/_toctree.yml
@@ -164,6 +164,8 @@
title: Gated Datasets
- local: datasets-adding
title: Uploading Datasets
+ - local: datasets-upload-guide-llm
+ title: Uploading Datasets (for LLMs)
- local: datasets-downloading
title: Downloading Datasets
- local: datasets-libraries
diff --git a/docs/hub/datasets-upload-guide-llm.md b/docs/hub/datasets-upload-guide-llm.md
new file mode 100644
index 000000000..b5504d518
--- /dev/null
+++ b/docs/hub/datasets-upload-guide-llm.md
@@ -0,0 +1,529 @@
+# Hugging Face Dataset Upload Decision Guide
+
+
+This guide is primarily designed for LLMs to help users upload datasets to the Hugging Face Hub in the most compatible format. Users can also reference this guide to understand the upload process and best practices.
+
+> Decision guide for uploading datasets to Hugging Face Hub. Optimized for Dataset Viewer compatibility and integration with the Hugging Face ecosystem.
+
+## Overview
+
+Your goal is to help a user upload a dataset to the Hugging Face Hub. Ideally, the dataset should be compatible with the Dataset Viewer (and thus the `load_dataset` function) to ensure easy access and usability. You should aim to meet the following criteria:
+
+| **Criteria** | Description | Priority |
+| ---------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------- |
+| **Respect repository limits** | Ensure the dataset adheres to Hugging Face's storage limits for file sizes, repository sizes, and file counts. See the Critical Constraints section below for specific limits. | Required |
+| **Use hub-compatible formats**     | Use Parquet format when possible (best compression, rich typing, large dataset support). For smaller datasets (<1GB), CSV or JSON files also work well.                        | Recommended                                        |
+
+## Quick Reference by Data Type
+
+| **Data Type**                      | Recommended Approach                               | Method                                                                                            |
+| ---------------------------------- | -------------------------------------------------- | ------------------------------------------------------------------------------------------------- |
+| **Many files (>10k)**              | Use upload_large_folder to avoid Git limitations   | `api.upload_large_folder(folder_path="./data", repo_id="username/dataset", repo_type="dataset")` |
+| **Streaming large media**          | WebDataset format for efficient streaming          | Create .tar shards, then `upload_large_folder()`                                                  |
+| **Scientific data (HDF5, NetCDF)** | Convert to Parquet with Array features             | See [Scientific Data](#scientific-data) section                                                   |
+| **Custom/proprietary formats**     | Document thoroughly if conversion impossible       | `upload_large_folder()` with comprehensive README                                                 |
+
+## Upload Workflow
+
+0. ✓ **Gather dataset information** (if needed):
+
+ - What type of data? (images, text, audio, CSV, etc.)
+ - How is it organized? (folder structure, single file, multiple files)
+ - What's the approximate size?
+ - What format are the files in?
+ - Any special requirements? (e.g., streaming, private access)
+ - Check for existing README or documentation files that describe the dataset
+
+1. ✓ **Authenticate**:
+ - CLI: `huggingface-cli login`
+ - Or use token: `HfApi(token="hf_...")` or set `HF_TOKEN` environment variable
+2. ✓ **Identify your data type**: Check the [Quick Reference](#quick-reference-by-data-type) table above
+3. ✓ **Choose upload method**:
+
+ - **Small files (<1GB) with hub-compatible format**: Can use [Hub UI](https://huggingface.co/new-dataset) for quick uploads
+ - **Built-in loader available**: Use the loader + `push_to_hub()` (see Quick Reference table)
+   - **Large datasets or many files**: Use `upload_large_folder()` for datasets larger than ~100GB or with more than 10k files
+ - **Custom formats**: Convert to hub-compatible format if possible, otherwise document thoroughly
+
+4. ✓ **Test locally** (if using built-in loader):
+
+ ```python
+ # Validate your dataset loads correctly before uploading
+ dataset = load_dataset("loader_name", data_dir="./your_data")
+ print(dataset)
+ ```
+
+5. ✓ **Upload to Hub**:
+
+ ```python
+ # Basic upload
+ dataset.push_to_hub("username/dataset-name")
+
+ # With options for large datasets
+ dataset.push_to_hub(
+ "username/dataset-name",
+ max_shard_size="5GB", # Control memory usage
+ private=True # For private datasets
+ )
+ ```
+
+6. ✓ **Verify your upload**:
+ - Check Dataset Viewer: `https://huggingface.co/datasets/username/dataset-name`
+ - Test loading: `load_dataset("username/dataset-name")`
+ - If viewer shows errors, check the [Troubleshooting](#common-issues--solutions) section
+
+## Common Conversion Patterns
+
+When built-in loaders don't match your data structure, use the datasets library as a compatibility layer. Convert your data to a Dataset object, then use `push_to_hub()` for maximum flexibility and Dataset Viewer compatibility.
+
+### From DataFrames
+
+If you already have your data working in pandas, polars, or other dataframe libraries, you can convert directly:
+
+```python
+# From pandas DataFrame
+import pandas as pd
+from datasets import Dataset
+
+df = pd.read_csv("your_data.csv")
+dataset = Dataset.from_pandas(df)
+dataset.push_to_hub("username/dataset-name")
+
+# From polars DataFrame (direct method)
+import polars as pl
+from datasets import Dataset
+
+df = pl.read_csv("your_data.csv")
+dataset = Dataset.from_polars(df) # Direct conversion
+dataset.push_to_hub("username/dataset-name")
+
+# From PyArrow Table (useful for scientific data)
+import pyarrow as pa
+from datasets import Dataset
+
+# If you have a PyArrow table
+table = pa.table({'data': [1, 2, 3], 'labels': ['a', 'b', 'c']})
+dataset = Dataset(table)
+dataset.push_to_hub("username/dataset-name")
+
+# For Spark/Dask dataframes, see https://huggingface.co/docs/hub/datasets-libraries
+```
+
+## Custom Format Conversion
+
+When built-in loaders don't match your data format, convert to Dataset objects following these principles:
+
+### Design Principles
+
+**1. Prefer wide/flat structures over joins**
+
+- Denormalize relational data into single rows for better usability
+- Include all relevant information in each example
+- Lean towards bigger but more usable data: Hugging Face's infrastructure uses advanced deduplication (Xet) and Parquet optimizations to handle redundancy efficiently (see the sketch below)
+
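+For example, a minimal sketch of flattening two related tables into one denormalized dataset before upload (the file names and columns here are hypothetical):
+
+```python
+import pandas as pd
+from datasets import Dataset
+
+# Hypothetical relational data: reviews reference products by ID
+reviews = pd.read_csv("reviews.csv")    # review_id, product_id, text, rating
+products = pd.read_csv("products.csv")  # product_id, name, category
+
+# Join once so every row carries all the context it needs
+flat = reviews.merge(products, on="product_id", how="left")
+
+Dataset.from_pandas(flat).push_to_hub("username/product-reviews")
+```
+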
+**2. Use configs for logical dataset variations**
+
+- Beyond train/test/val splits, use configs for different subsets or views of your data
+- Each config can have different features or data organization
+- Example: language-specific configs, task-specific views, or data modalities
+
+### Conversion Methods
+
+**Small datasets (fit in memory) - use `Dataset.from_dict()`**:
+
+```python
+# Parse your custom format into a dictionary
+data_dict = {
+ "text": ["example1", "example2"],
+ "label": ["positive", "negative"],
+ "score": [0.9, 0.2]
+}
+
+# Create dataset with appropriate features
+from datasets import Dataset, Features, Value, ClassLabel
+features = Features({
+ 'text': Value('string'),
+ 'label': ClassLabel(names=['negative', 'positive']),
+ 'score': Value('float32')
+})
+
+dataset = Dataset.from_dict(data_dict, features=features)
+dataset.push_to_hub("username/dataset")
+```
+
+**Large datasets (memory-efficient) - use `Dataset.from_generator()`**:
+
+```python
+def data_generator():
+ # Parse your custom format progressively
+ for item in parse_large_file("data.custom"):
+ yield {
+ "text": item["content"],
+ "label": item["category"],
+ "embedding": item["vector"]
+ }
+
+# Specify features for Dataset Viewer compatibility
+from datasets import Dataset, Features, Value, ClassLabel, List
+features = Features({
+ 'text': Value('string'),
+ 'label': ClassLabel(names=['cat1', 'cat2', 'cat3']),
+ 'embedding': List(feature=Value('float32'), length=768)
+})
+
+dataset = Dataset.from_generator(data_generator, features=features)
+dataset.push_to_hub("username/dataset", max_shard_size="1GB")
+```
+
+**Tip**: For large datasets, test with a subset first by adding a limit to your generator or using `.select(range(100))` after creation.
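+
+For instance, a quick sanity check before committing to the full upload (the sample repo name is just a placeholder):
+
+```python
+# Push only a small slice first and inspect it (and its Dataset Viewer) on the Hub
+sample = dataset.select(range(100))
+sample.push_to_hub("username/dataset-sample", private=True)
+
+print(sample.features)
+print(sample[0])
+```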
+
+### Using Configs for Dataset Variations
+
+```python
+# Push different configurations of your dataset
+dataset_en = Dataset.from_dict(english_data, features=features)
+dataset_en.push_to_hub("username/multilingual-dataset", config_name="english")
+
+dataset_fr = Dataset.from_dict(french_data, features=features)
+dataset_fr.push_to_hub("username/multilingual-dataset", config_name="french")
+
+# Users can then load specific configs
+dataset = load_dataset("username/multilingual-dataset", "english")
+```
+
+### Multi-modal Examples
+
+**Text + Audio (speech recognition)**:
+
+```python
+from pathlib import Path
+
+from datasets import Audio, Dataset, Features, Value
+
+def speech_generator():
+ for audio_file in Path("audio/").glob("*.wav"):
+ transcript_file = audio_file.with_suffix(".txt")
+ yield {
+ "audio": str(audio_file),
+ "text": transcript_file.read_text().strip(),
+ "speaker_id": audio_file.stem.split("_")[0]
+ }
+
+features = Features({
+ 'audio': Audio(sampling_rate=16000),
+ 'text': Value('string'),
+ 'speaker_id': Value('string')
+})
+
+dataset = Dataset.from_generator(speech_generator, features=features)
+dataset.push_to_hub("username/speech-dataset")
+```
+
+**Multiple images per example**:
+
+```python
+from datasets import ClassLabel, Dataset, Features, Image
+
+# Before/after images, medical imaging, etc.
+data = {
+ "image_before": ["img1_before.jpg", "img2_before.jpg"],
+ "image_after": ["img1_after.jpg", "img2_after.jpg"],
+ "treatment": ["method_A", "method_B"]
+}
+
+features = Features({
+ 'image_before': Image(),
+ 'image_after': Image(),
+ 'treatment': ClassLabel(names=['method_A', 'method_B'])
+})
+
+dataset = Dataset.from_dict(data, features=features)
+dataset.push_to_hub("username/before-after-images")
+```
+
+**Note**: For text + images, consider using ImageFolder with metadata.csv which handles this automatically.
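+
+As a rough sketch of that layout (apart from the required `file_name` column, the names below are illustrative):
+
+```python
+from datasets import load_dataset
+
+# Expected layout (illustrative):
+#   images/
+#   ├── metadata.csv   <- must contain a 'file_name' column plus any text columns
+#   ├── img1.jpg
+#   └── img2.jpg
+#
+# metadata.csv:
+#   file_name,caption
+#   img1.jpg,A cat sitting on a sofa
+#   img2.jpg,A dog playing in a park
+
+dataset = load_dataset("imagefolder", data_dir="./images")
+dataset.push_to_hub("username/captioned-images")
+```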
+
+## Essential Features
+
+Features define the schema and data types for your dataset columns. Specifying correct features ensures:
+
+- Proper data handling and type conversion
+- Dataset Viewer functionality (e.g., image/audio previews)
+- Efficient storage and loading
+- Clear documentation of your data structure
+
+For complete feature documentation, see: [Dataset Features](https://huggingface.co/docs/datasets/about_dataset_features)
+
+### Feature Types Overview
+
+**Basic Types**:
+
+- `Value`: Scalar values - `string`, `int64`, `float32`, `bool`, `binary`, and other numeric types
+- `ClassLabel`: Categorical data with named classes
+- `Sequence`: Lists of any feature type
+- `LargeList`: For very large lists
+
+**Media Types** (enable Dataset Viewer previews):
+
+- `Image()`: Handles various image formats, returns PIL Image objects
+- `Audio(sampling_rate=16000)`: Audio with array data and optional sampling rate
+- `Video()`: Video files
+- `Pdf()`: PDF documents with text extraction
+
+**Array Types** (for tensors/scientific data):
+
+- `Array2D`, `Array3D`, `Array4D`, `Array5D`: Fixed or variable-length arrays
+- Example: `Array2D(shape=(224, 224), dtype='float32')`
+- First dimension can be `None` for variable length
+
+**Translation Types**:
+
+- `Translation`: For translation pairs with fixed languages
+- `TranslationVariableLanguages`: For translations with varying language pairs
+
+**Note**: New feature types are added regularly. Check the documentation for the latest additions.
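+
+As a small sketch combining several of these types into one schema (the column names are illustrative):
+
+```python
+from datasets import Array2D, ClassLabel, Features, Image, Sequence, Value
+
+features = Features({
+    'image': Image(),                                         # media type: previewed in the Dataset Viewer
+    'label': ClassLabel(names=['cat', 'dog']),                # categorical with named classes
+    'caption': Value('string'),                               # scalar string
+    'tags': Sequence(Value('string')),                        # variable-length list of strings
+    'depth_map': Array2D(shape=(None, 64), dtype='float32'),  # variable-length first dimension
+})
+```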
+
+## Upload Methods
+
+**Dataset objects (use push_to_hub)**:
+Use when you've loaded/converted data using the datasets library
+
+```python
+dataset.push_to_hub("username/dataset", max_shard_size="5GB")
+```
+
+**Pre-existing files (use upload_large_folder)**:
+Use when you have hub-compatible files (e.g., Parquet files) already prepared and organized
+
+```python
+from huggingface_hub import HfApi
+api = HfApi()
+api.upload_large_folder(folder_path="./data", repo_id="username/dataset", repo_type="dataset", num_workers=16)
+```
+
+**Important**: Before using `upload_large_folder`, verify the files meet repository limits:
+
+- Check folder structure if you have file access: ensure no folder contains >10k files
+- Ask the user to confirm: "Are your files in a hub-compatible format (Parquet/CSV/JSON) and organized appropriately?"
+- For non-standard formats, consider converting to Dataset objects first to ensure compatibility
+
+## Validation
+
+**Consider small reformatting**: If data is close to a built-in loader format, suggest minor changes:
+
+- Rename columns (e.g., 'filename' → 'file_name' for ImageFolder)
+- Reorganize folders (e.g., move images into class subfolders)
+- Rename files to match expected patterns (e.g., 'data.csv' → 'train.csv')
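+
+For example, a minimal sketch of the first suggestion, renaming a metadata column so ImageFolder recognizes it (paths are placeholders):
+
+```python
+import pandas as pd
+
+# ImageFolder expects the column that links rows to files to be called 'file_name'
+meta = pd.read_csv("images/metadata.csv")
+meta = meta.rename(columns={"filename": "file_name"})
+meta.to_csv("images/metadata.csv", index=False)
+```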
+
+**Pre-upload**:
+
+- Test locally: `load_dataset("imagefolder", data_dir="./data")`
+- Verify features work correctly:
+
+ ```python
+ # Test first example
+ print(dataset[0])
+
+ # For images: verify they load
+ if 'image' in dataset.features:
+ dataset[0]['image'] # Should return PIL Image
+
+ # Check dataset size before upload
+ print(f"Size: {len(dataset)} examples")
+ ```
+
+- Check metadata.csv has 'file_name' column
+- Verify relative paths, no leading slashes
+- Ensure no folder >10k files
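+
+A rough sketch of the last three checks, assuming an ImageFolder-style layout under `./data`:
+
+```python
+from pathlib import Path
+import pandas as pd
+
+root = Path("./data")
+
+# metadata.csv needs a 'file_name' column with relative paths
+meta = pd.read_csv(root / "metadata.csv")
+assert "file_name" in meta.columns, "metadata.csv is missing a 'file_name' column"
+assert not meta["file_name"].str.startswith("/").any(), "use relative paths, no leading slashes"
+
+# No single folder should contain more than 10k files
+for folder in [root, *(p for p in root.rglob("*") if p.is_dir())]:
+    n_files = sum(1 for p in folder.iterdir() if p.is_file())
+    assert n_files <= 10_000, f"{folder} holds {n_files} files (limit: 10k per folder)"
+```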
+
+**Post-upload**:
+
+- Check viewer: `https://huggingface.co/datasets/username/dataset`
+- Test loading: `load_dataset("username/dataset")`
+- Verify features preserved: `print(dataset.features)`
+
+## Common Issues → Solutions
+
+| Issue | Solution |
+| -------------------------- | ------------------------------------ |
+| "Repository not found" | Run `huggingface-cli login` |
+| Memory errors | Use `max_shard_size="500MB"` |
+| Dataset viewer not working | Wait 5-10min, check README.md config |
+| Timeout errors             | Use `upload_large_folder()` (resumable) |
+| Files >50GB | Split into smaller files |
+| "File not found" | Use relative paths in metadata |
+
+## Dataset Viewer Configuration
+
+**Note**: This section is primarily for datasets uploaded directly to the Hub (via UI or `upload_large_folder`). Datasets uploaded with `push_to_hub()` typically configure the viewer automatically.
+
+### When automatic detection works
+
+The Dataset Viewer automatically detects standard structures:
+
+- Files named: `train.csv`, `test.json`, `validation.parquet`
+- Directories named: `train/`, `test/`, `validation/`
+- Split names with delimiters: `test-data.csv` ✓ (not `testdata.csv` ✗)
+
+### Manual configuration
+
+For custom structures, add YAML to your README.md:
+
+```yaml
+---
+configs:
+ - config_name: default # Required even for single config!
+ data_files:
+ - split: train
+ path: "data/train/*.parquet"
+ - split: test
+ path: "data/test/*.parquet"
+---
+```
+
+Multiple configurations example:
+
+```yaml
+---
+configs:
+ - config_name: english
+ data_files: "en/*.parquet"
+ - config_name: french
+ data_files: "fr/*.parquet"
+---
+```
+
+### Common viewer issues
+
+- **No viewer after upload**: Wait 5-10 minutes for processing
+- **"Config names error"**: Add `config_name` field (required!)
+- **Files not detected**: Check naming patterns (needs delimiters)
+- **Viewer disabled**: Remove `viewer: false` from README YAML
+
+## Quick Templates
+
+```python
+from datasets import load_dataset
+
+# ImageFolder with metadata
+dataset = load_dataset("imagefolder", data_dir="./images")
+dataset.push_to_hub("username/dataset")
+
+# Memory-efficient upload
+dataset.push_to_hub("username/dataset", max_shard_size="500MB")
+
+# Multiple CSV files
+dataset = load_dataset('csv', data_files={'train': 'train.csv', 'test': 'test.csv'})
+dataset.push_to_hub("username/dataset")
+```
+
+## Documentation
+
+**Core docs**: [Adding datasets](https://huggingface.co/docs/hub/datasets-adding) | [Dataset viewer](https://huggingface.co/docs/hub/datasets-viewer) | [Storage limits](https://huggingface.co/docs/hub/storage-limits) | [Upload guide](https://huggingface.co/docs/datasets/upload_dataset)
+
+## Dataset Cards
+
+Remind users to add a dataset card (README.md) with:
+
+- Dataset description and usage
+- License information
+- Citation details
+
+See [Dataset Cards guide](https://huggingface.co/docs/hub/datasets-cards) for details.
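+
+If you prefer to create the card programmatically, `huggingface_hub` provides card utilities; a minimal sketch (the metadata values are illustrative):
+
+```python
+from huggingface_hub import DatasetCard, DatasetCardData
+
+# Structured metadata that ends up in the README.md YAML header
+card_data = DatasetCardData(
+    license="mit",
+    language="en",
+    pretty_name="My Example Dataset",
+    tags=["example"],
+)
+
+# Fill the default dataset card template, then push it as README.md
+card = DatasetCard.from_template(card_data, pretty_name="My Example Dataset")
+card.push_to_hub("username/dataset-name")
+```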
+
+---
+
+## Appendix: Special Cases
+
+### WebDataset Structure
+
+For streaming large media datasets:
+
+- Create 1-5GB tar shards
+- Consistent internal structure
+- Upload with `upload_large_folder`
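+
+A minimal sketch of packing per-sample files into ~1GB shards with the standard library (the file layout and the 1GB target are illustrative; libraries such as `webdataset` offer shard writers that do the same with less code):
+
+```python
+import tarfile
+from pathlib import Path
+
+# Illustrative: pack each .wav plus its matching .json metadata into ~1GB tar shards
+samples = sorted(Path("audio/").glob("*.wav"))
+Path("shards").mkdir(exist_ok=True)
+
+shard_idx, current_size, sink = 0, 0, None
+for wav in samples:
+    if sink is None or current_size >= 1 * 1024**3:  # start a new shard past ~1GB
+        if sink is not None:
+            sink.close()
+        sink = tarfile.open(f"shards/shard-{shard_idx:06d}.tar", "w")
+        shard_idx, current_size = shard_idx + 1, 0
+    key = wav.stem  # consistent structure: one key, several extensions per sample
+    sink.add(wav, arcname=f"{key}.wav")
+    sink.add(wav.with_suffix(".json"), arcname=f"{key}.json")
+    current_size += wav.stat().st_size
+if sink is not None:
+    sink.close()
+```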
+
+### Scientific Data
+
+- HDF5/NetCDF → Convert to Parquet with Array features
+- Time series → Array2D(shape=(None, n))
+- Complex metadata → Store as JSON strings
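+
+A sketch of the HDF5 case using `h5py` (the file, dataset, and column names are illustrative):
+
+```python
+import h5py
+from datasets import Array2D, Dataset, Features, Value
+
+def hdf5_generator():
+    # Stream examples out of the HDF5 file without loading everything into memory
+    with h5py.File("experiment.h5", "r") as f:
+        series = f["measurements"]            # e.g. shape (n_samples, n_timesteps, 16)
+        for i in range(series.shape[0]):
+            yield {
+                "sample_id": f"sample_{i}",
+                "measurements": series[i].astype("float32"),
+            }
+
+features = Features({
+    'sample_id': Value('string'),
+    'measurements': Array2D(shape=(None, 16), dtype='float32'),  # variable-length time axis
+})
+
+dataset = Dataset.from_generator(hdf5_generator, features=features)
+dataset.push_to_hub("username/experiment-data")
+```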
+
+### Community Resources
+
+For very specialized or bespoke formats:
+
+- Search the Hub for similar datasets: `https://huggingface.co/datasets`
+- Ask for advice on the [Hugging Face Forums](https://discuss.huggingface.co/c/datasets/10)
+- Join the [Hugging Face Discord](https://hf.co/join/discord) for real-time help
+- Many domain-specific formats already have examples on the Hub