# 📦 Add a New Dataset

This section explains how to register and use a custom dataset with the InternManip framework.
The process involves two main steps: **[preparing the dataset format](#dataset-structure)** and **[registering it in code](#implementation-steps)**.



## Dataset Structure

All datasets must follow the [LeRobotDataset format](https://github.com/huggingface/lerobot) to ensure compatibility with the data loaders and training pipelines.
The expected structure is:

```
<your_dataset_root>                      # Root directory of your dataset
│
├── data                                 # Structured episode data in .parquet format
│   │
│   ├── chunk-000                        # Episodes 000000 - 000999
│   │   ├── episode_000000.parquet
│   │   ├── episode_000001.parquet
│   │   └── ...
│   │
│   ├── chunk-001                        # Episodes 001000 - 001999
│   │   └── ...
│   │
│   ├── ...
│   │
│   └── chunk-00n                        # Follows the same convention (1,000 episodes per chunk)
│       └── ...
│
├── meta                                 # Metadata and statistical information
│   ├── episodes.jsonl                   # Per-episode metadata (length, subtask, etc.)
│   ├── info.json                        # Dataset-level information
│   ├── tasks.jsonl                      # Task definitions
│   ├── modality.json                    # Key dimensions and mapping information for each modality
│   └── stats.json                       # Global dataset statistics (mean, std, min, max, quantiles)
│
└── videos                               # Multi-view videos for each episode
    │
    ├── chunk-000                        # Videos for episodes 000000 - 000999
    │   ├── observation.images.head     # Head (main front-view) camera
    │   │   ├── episode_000000.mp4
    │   │   └── ...
    │   ├── observation.images.hand_left    # Left hand camera
    │   └── observation.images.hand_right   # Right hand camera
    │
    ├── chunk-001                        # Videos for episodes 001000 - 001999
    │
    ├── ...
    │
    └── chunk-00n                        # Follows the same naming and structure
```

> 💡 Note: For more detailed tutorials, please refer to the [Dataset](../tutorials/dataset.md) section.

This separation of raw data, video files, and metadata makes it easier to standardize transformations and modality handling across different datasets.
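
Before touching any code, it can be worth sanity-checking that a local copy of your dataset matches this layout. The sketch below is purely illustrative and not part of the framework; it only checks for the directories and meta files named in the tree above:

```python
from pathlib import Path

def check_layout(root: str) -> None:
    """Lightweight sanity check against the layout shown above."""
    base = Path(root)
    for sub in ("data", "meta", "videos"):
        assert (base / sub).is_dir(), f"missing directory: {sub}/"
    for name in ("episodes.jsonl", "info.json", "tasks.jsonl",
                 "modality.json", "stats.json"):
        assert (base / "meta" / name).is_file(), f"missing meta/{name}"
    n_episodes = sum(1 for _ in (base / "data").rglob("episode_*.parquet"))
    print(f"OK: {n_episodes} episode files found under data/")

check_layout("/path/to/your_dataset_root")
```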


<!-- > 💡 Note: The `episodes_stats.jsonl` file under `meta/` is optional and can be omitted. -->

## Implementation Steps

### Register a Dataset Class

Create a new dataset class under `internmanip/datasets/`, inheriting from `LeRobotDataset`:

```python
from internmanip.datasets import LeRobotDataset

class CustomDataset(LeRobotDataset):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def load_data(self):
        # Implement custom data loading logic here
        pass
```

This class defines how to read your dataset’s raw files and convert them into a standardized format for training.
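
A quick way to verify the class is usable is to instantiate it directly. The constructor arguments below follow the upstream `LeRobotDataset` convention (a repo id plus an optional local root); treat them as assumptions and match them to how InternManip's `LeRobotDataset` is actually initialized:

```python
# Illustrative only — argument names follow the upstream LeRobotDataset
# convention and may differ in InternManip's implementation.
dataset = CustomDataset(
    repo_id="your-org/custom_dataset",   # assumed HF-style dataset id
    root="/path/to/your_dataset_root",   # local copy with the layout above
)
print(len(dataset))   # number of frames in the dataset
sample = dataset[0]   # one timestep as a dict of modality values
```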

### Define a Data Configuration

Each dataset needs a data configuration class that specifies modalities, keys, and transformations.
Create a new configuration file under `internmanip/configs/data_configs/`. Here’s a minimal example:

```python
# NOTE: the class and transform names below are provided by InternManip;
# the exact import paths here are assumptions — adjust them to match
# where your checkout exposes these symbols.
from internmanip.configs.data_configs.base import BaseDataConfig, ModalityConfig
from internmanip.datasets.transforms import (
    ConcatTransform,
    StateActionToTensor,
    StateActionTransform,
    VideoResize,
    VideoToTensor,
)


class CustomDataConfig(BaseDataConfig):
    """Data configuration for the custom dataset."""
    video_keys = ["video.rgb"]
    state_keys = ["state.pos"]
    action_keys = ["action.delta_pos"]
    language_keys = ["annotation.instruction"]

    # Temporal indices
    observation_indices = [0]          # Current timestep for observations
    action_indices = list(range(16))   # Future timesteps for actions (0-15)

    def modality_config(self) -> dict[str, ModalityConfig]:
        """Define modality configurations."""
        return {
            "video": ModalityConfig(self.observation_indices, self.video_keys),
            "state": ModalityConfig(self.observation_indices, self.state_keys),
            "action": ModalityConfig(self.action_indices, self.action_keys),
            "language": ModalityConfig(self.observation_indices, self.language_keys),
        }

    def transform(self):
        """Define preprocessing pipelines."""
        return [
            # Video preprocessing
            VideoToTensor(apply_to=self.video_keys),
            VideoResize(apply_to=self.video_keys, height=224, width=224),

            # State preprocessing
            StateActionToTensor(apply_to=self.state_keys),
            StateActionTransform(
                apply_to=self.state_keys,
                normalization_modes={"state.pos": "mean_std"},
            ),

            # Action preprocessing
            StateActionToTensor(apply_to=self.action_keys),
            StateActionTransform(
                apply_to=self.action_keys,
                normalization_modes={"action.delta_pos": "mean_std"},
            ),

            # Concatenate modalities
            ConcatTransform(
                video_concat_order=self.video_keys,
                state_concat_order=self.state_keys,
                action_concat_order=self.action_keys,
            ),
        ]
```
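
The two methods split the work: `modality_config()` declares which keys to read and at which timestep offsets (here, the current frame for observations and a 16-step action chunk), while `transform()` returns the preprocessing pipeline applied to each sample. As a rough sketch of how a loader might consume the config (the call pattern is assumed, for illustration only):

```python
config = CustomDataConfig()

modalities = config.modality_config()   # per-modality keys and timestep indices
pipeline = config.transform()           # ordered list of preprocessing steps

# Assumed convention: each transform maps a sample dict to a sample dict,
# so the pipeline can be applied in order.
def preprocess(sample: dict) -> dict:
    for step in pipeline:
        sample = step(sample)
    return sample
```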

### Register Your Config

Finally, register your custom config by adding it to `DATA_CONFIG_MAP`:

```python
DATA_CONFIG_MAP = {
    ...,
    "custom": CustomDataConfig(),
}
```

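Once registered, this string key is what ties the command line to your config; a minimal lookup sketch, assuming `DATA_CONFIG_MAP` is importable from the data configs package:

```python
from internmanip.configs.data_configs import DATA_CONFIG_MAP  # import path assumed

data_config = DATA_CONFIG_MAP["custom"]
print(type(data_config).__name__)  # CustomDataConfig
```
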
> 💡 Tips: Adjust the key names (`video_keys`, `state_keys`, etc.) and `normalization_modes` based on your dataset. For multi-view video or multi-joint actions, simply add more keys and update the transforms accordingly.

This config defines how each modality is loaded and preprocessed, and ensures compatibility with the training framework.

### What's Next?

After registration, you can use your dataset by passing `--dataset_path <path>` and `--data_config custom` to the training script, alongside your training YAML file.