Dataset.map ignores existing caches and remaps when run with different num_proc #7433

Open
ringohoffman opened this issue Mar 3, 2025 · 2 comments · May be fixed by #7434

ringohoffman (Contributor) commented Mar 3, 2025

Describe the bug

If you map a dataset and save it to a specific cache_file_name with a specific num_proc, and then call map again with that same existing cache_file_name but a different num_proc, the dataset is re-mapped instead of being loaded from the existing cache files.

Steps to reproduce the bug

  1. Download a dataset:

import datasets

dataset = datasets.load_dataset("ylecun/mnist")
Generating train split: 100%|██████████| 60000/60000 [00:00<00:00, 116429.85 examples/s]
Generating test split: 100%|██████████| 10000/10000 [00:00<00:00, 103310.27 examples/s]

  2. Map and cache it with a specific num_proc:

cache_file_name = "./cache/train.map"
dataset["train"].map(lambda x: x, cache_file_name=cache_file_name, num_proc=2)
Map (num_proc=2): 100%|██████████| 60000/60000 [00:01<00:00, 53764.03 examples/s]

  3. Map it with a different num_proc and the same cache_file_name as before:

dataset["train"].map(lambda x: x, cache_file_name=cache_file_name, num_proc=3)
Map (num_proc=3): 100%|██████████| 60000/60000 [00:00<00:00, 65377.12 examples/s]
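
The remap happens because each num_proc value produces differently named shard files on disk. Here is a minimal sketch of the naming, assuming the default suffix_template of "_{rank:05d}_of_{num_proc:05d}" that Dataset.map uses to derive per-shard cache file names (exact details may vary by datasets version):

import os

def shard_cache_files(cache_file_name: str, num_proc: int) -> list[str]:
    # Approximates how Dataset.map derives one cache file per worker by
    # inserting the rank/num_proc suffix before the file extension.
    base, ext = os.path.splitext(cache_file_name)
    suffix_template = "_{rank:05d}_of_{num_proc:05d}"
    return [
        base + suffix_template.format(rank=rank, num_proc=num_proc) + ext
        for rank in range(num_proc)
    ]

print(shard_cache_files("./cache/train.map", num_proc=2))
# ['./cache/train_00000_of_00002.map', './cache/train_00001_of_00002.map']
print(shard_cache_files("./cache/train.map", num_proc=3))
# ['./cache/train_00000_of_00003.map', './cache/train_00001_of_00003.map',
#  './cache/train_00002_of_00003.map']

The num_proc=3 run looks for the second set of names, finds none of them, and re-maps everything even though the num_proc=2 shards already hold the same data.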

Expected behavior

If I specify an existing cache_file_name, I don't expect using a different num_proc than the one used to generate it to cause the dataset to be re-mapped.

Environment info

$ datasets-cli env

- `datasets` version: 3.3.2
- Platform: Linux-5.15.0-131-generic-x86_64-with-glibc2.35
- Python version: 3.10.16
- `huggingface_hub` version: 0.29.1
- PyArrow version: 19.0.1
- Pandas version: 2.2.3
- `fsspec` version: 2024.12.0
ringohoffman changed the title from "Dataset.map ignores cache_file_name when run with different num_proc" to "Dataset.map ignores existing caches and remaps when run with different num_proc" on Mar 3, 2025
lhoestq (Member) commented Mar 3, 2025

This feels related: #3044

ringohoffman (Contributor, Author) commented

@lhoestq I agree with this comment specifically:

Almost a year later and I'm in a similar boat. I'm using custom fingerprints, and when using multiprocessing the cached datasets are saved with a template at the end of the filename (something like "000001_of_000008" for every process of num_proc). So if, the next time you run the script, you set num_proc to a different number, the cache cannot be used.

Is there any way to get around this? I am processing a huge dataset so I do the processing on one machine and then transfer the processed data to another in its cache dir but currently that's not possible due to num_proc mismatch.
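
Until this is fixed, one possible workaround is to bypass map and load the existing shard files directly. This is only a sketch: it assumes the shards were written by a num_proc=8 run with the default five-digit suffix naming and that all eight shards are present.

import glob

import datasets

# Hypothetical recovery path: load the Arrow shard files that an earlier
# num_proc=8 map call left behind, then stitch them back together.
shard_files = sorted(glob.glob("./cache/train_*_of_00008.map"))
assert len(shard_files) == 8, "incomplete cache; re-run map instead"
mapped = datasets.concatenate_datasets(
    [datasets.Dataset.from_file(f) for f in shard_files]
)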

ringohoffman added a commit to ringohoffman/datasets that referenced this issue Mar 4, 2025
Fixes huggingface#7433

This refactor unifies the num_proc is None or num_proc == 1 path with the num_proc > 1 path. Instead of handling them completely separately (one using a list of kwargs and shards, the other a single set of kwargs and self), the single-process case is wrapped in a one-element list, so the only remaining difference is whether a pool is used. Either case can then load the other's cache files just by changing num_shards: num_proc == 1 can sequentially load the shards of a dataset mapped with num_shards > 1 and sequentially map any missing shards.

Other than the structural refactor, the main contribution of this PR is get_existing_cache_file_map, which uses a regex built from cache_file_name and suffix_template to find existing cache files, grouped by their num_shards. Using this data structure, we can reset num_shards to match an existing set of cache files and load them accordingly.
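
A minimal sketch of that lookup idea (a hypothetical approximation, not the PR's actual code; it assumes the default "_{rank:05d}_of_{num_proc:05d}" suffix template):

import os
import re
from collections import defaultdict

def existing_cache_file_map(cache_file_name: str) -> dict[int, list[str]]:
    # Build a regex from the cache file name and the default suffix
    # template, then group the shard files found on disk by num_shards.
    base, ext = os.path.splitext(os.path.basename(cache_file_name))
    pattern = re.compile(
        re.escape(base) + r"_(\d{5})_of_(\d{5})" + re.escape(ext)
    )
    cache_dir = os.path.dirname(cache_file_name) or "."
    by_num_shards: dict[int, list[str]] = defaultdict(list)
    for name in os.listdir(cache_dir):
        match = pattern.fullmatch(name)
        if match:
            by_num_shards[int(match.group(2))].append(os.path.join(cache_dir, name))
    # Only a complete group of num_shards files is safe to load, regardless
    # of which num_proc the current map call asked for.
    return {n: sorted(fs) for n, fs in by_num_shards.items() if len(fs) == n}

With the files from the example above, existing_cache_file_map("./cache/train.map") would return {2: [...]} after the num_proc=2 run, letting a later call with num_proc=3 reuse those two shards instead of re-mapping.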