Dataset.map ignores existing caches and remaps when run with different num_proc #7433

Open
ringohoffman opened this issue Mar 3, 2025 · 2 comments · May be fixed by #7434

ringohoffman (Contributor) commented Mar 3, 2025

Describe the bug

If you map a dataset and save it to a specific cache_file_name with a specific num_proc, and then call map again with that same existing cache_file_name but a different num_proc, the dataset is re-mapped instead of being loaded from the existing cache files.

Steps to reproduce the bug

  1. Download a dataset:

import datasets

dataset = datasets.load_dataset("ylecun/mnist")
Generating train split: 100%|██████████| 60000/60000 [00:00<00:00, 116429.85 examples/s]
Generating test split: 100%|██████████| 10000/10000 [00:00<00:00, 103310.27 examples/s]

  2. Map and cache it with a specific num_proc:

cache_file_name = "./cache/train.map"
dataset["train"].map(lambda x: x, cache_file_name=cache_file_name, num_proc=2)
Map (num_proc=2): 100%|██████████| 60000/60000 [00:01<00:00, 53764.03 examples/s]

  3. Map it with a different num_proc and the same cache_file_name as before:

dataset["train"].map(lambda x: x, cache_file_name=cache_file_name, num_proc=3)
Map (num_proc=3): 100%|██████████| 60000/60000 [00:00<00:00, 65377.12 examples/s]
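
The remap happens because each num_proc value produces differently named shard files on disk. Here is a minimal sketch of the naming, assuming the default suffix_template of "_{rank:05d}_of_{num_proc:05d}" that Dataset.map uses to derive per-shard cache file names (exact details may vary by datasets version):

import os

def shard_cache_files(cache_file_name: str, num_proc: int) -> list[str]:
    # Approximates how Dataset.map derives one cache file per worker by
    # inserting the rank/num_proc suffix before the file extension.
    base, ext = os.path.splitext(cache_file_name)
    suffix_template = "_{rank:05d}_of_{num_proc:05d}"
    return [
        base + suffix_template.format(rank=rank, num_proc=num_proc) + ext
        for rank in range(num_proc)
    ]

print(shard_cache_files("./cache/train.map", num_proc=2))
# ['./cache/train_00000_of_00002.map', './cache/train_00001_of_00002.map']
print(shard_cache_files("./cache/train.map", num_proc=3))
# ['./cache/train_00000_of_00003.map', './cache/train_00001_of_00003.map',
#  './cache/train_00002_of_00003.map']

The num_proc=3 run looks for the second set of names, finds none of them, and re-maps everything even though the num_proc=2 shards already hold the same data.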

Expected behavior

If I specify an existing cache_file_name, I don't expect using a different num_proc than the one used to generate it to cause the dataset to be re-mapped.

Environment info

$ datasets-cli env

- `datasets` version: 3.3.2
- Platform: Linux-5.15.0-131-generic-x86_64-with-glibc2.35
- Python version: 3.10.16
- `huggingface_hub` version: 0.29.1
- PyArrow version: 19.0.1
- Pandas version: 2.2.3
- `fsspec` version: 2024.12.0
ringohoffman changed the title from "Dataset.map ignores cache_file_name when run with different num_proc" to "Dataset.map ignores existing caches and remaps when run with different num_proc" on Mar 3, 2025
lhoestq (Member) commented Mar 3, 2025

This feels related: #3044

ringohoffman (Contributor, Author) commented

@lhoestq I agree with this comment specifically:

Almost a year later and I'm in a similar boat. I'm using custom fingerprints, and when using multiprocessing the cached datasets are saved with a template at the end of the filename (something like "000001_of_000008" for every process of num_proc). So if, the next time you run the script, you set num_proc to a different number, the cache cannot be used.

Is there any way to get around this? I am processing a huge dataset so I do the processing on one machine and then transfer the processed data to another in its cache dir but currently that's not possible due to num_proc mismatch.
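
Until this is fixed, one possible workaround is to bypass map and load the existing shard files directly. This is only a sketch: it assumes the shards were written by a num_proc=8 run with the default five-digit suffix naming and that all eight shards are present.

import glob

import datasets

# Hypothetical recovery path: load the Arrow shard files that an earlier
# num_proc=8 map call left behind, then stitch them back together.
shard_files = sorted(glob.glob("./cache/train_*_of_00008.map"))
assert len(shard_files) == 8, "incomplete cache; re-run map instead"
mapped = datasets.concatenate_datasets(
    [datasets.Dataset.from_file(f) for f in shard_files]
)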

ringohoffman added a commit to ringohoffman/datasets that referenced this issue Mar 4, 2025
Fixes huggingface#7433

This refactor unifies the num_proc is None or num_proc == 1 path with the num_proc > 1 path. Instead of handling them completely separately (one using a list of kwargs and shards, the other a single set of kwargs and self), the single-process case is wrapped in a one-element list, so the only remaining difference is whether a pool is used. Either case can then load the other's cache files just by changing num_shards: num_proc == 1 can sequentially load the shards of a dataset mapped with num_shards > 1 and sequentially map any missing shards.

Other than the structural refactor, the main contribution of this PR is get_existing_cache_file_map, which uses a regex built from cache_file_name and suffix_template to find existing cache files, grouped by their num_shards. Using this data structure, we can reset num_shards to match an existing set of cache files and load them accordingly.
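
A minimal sketch of that lookup idea (a hypothetical approximation, not the PR's actual code; it assumes the default "_{rank:05d}_of_{num_proc:05d}" suffix template):

import os
import re
from collections import defaultdict

def existing_cache_file_map(cache_file_name: str) -> dict[int, list[str]]:
    # Build a regex from the cache file name and the default suffix
    # template, then group the shard files found on disk by num_shards.
    base, ext = os.path.splitext(os.path.basename(cache_file_name))
    pattern = re.compile(
        re.escape(base) + r"_(\d{5})_of_(\d{5})" + re.escape(ext)
    )
    cache_dir = os.path.dirname(cache_file_name) or "."
    by_num_shards: dict[int, list[str]] = defaultdict(list)
    for name in os.listdir(cache_dir):
        match = pattern.fullmatch(name)
        if match:
            by_num_shards[int(match.group(2))].append(os.path.join(cache_dir, name))
    # Only a complete group of num_shards files is safe to load, regardless
    # of which num_proc the current map call asked for.
    return {n: sorted(fs) for n, fs in by_num_shards.items() if len(fs) == n}

With the files from the example above, existing_cache_file_map("./cache/train.map") would return {2: [...]} after the num_proc=2 run, letting a later call with num_proc=3 reuse those two shards instead of re-mapping.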