`Dataset.map` ignores existing caches and remaps when run with different `num_proc` #7433
This feels related: #3044
@lhoestq This comment specifically, I agree:
Fixes huggingface#7433

This refactor unifies the `num_proc is None or num_proc == 1` and `num_proc > 1` code paths. Instead of handling them completely separately, where one uses a list of kwargs and shards and the other uses a single set of kwargs and `self`, the `num_proc == 1` case is wrapped in a list, so the only difference is whether or not a pool is used. This sets up either case to load the other's cache files just by changing `num_shards`: `num_proc == 1` can sequentially load the shards of a dataset mapped with `num_shards > 1` and sequentially map any missing shards.

Other than the structural refactor, the main contribution of this PR is `get_existing_cache_file_map`, which uses a regex built from `cache_file_name` and `suffix_template` to find existing cache files, grouped by their `num_shards`. Using this data structure, we can reset `num_shards` to an existing set of cache files and load them accordingly.
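To make the grouping idea concrete, here is a rough standalone sketch. It is not the PR's actual `get_existing_cache_file_map`; the function name `group_cache_files_by_num_shards` is hypothetical, and the regex is hardcoded to the default `suffix_template` of `Dataset.map` (`_{rank:05d}_of_{num_proc:05d}`) rather than derived from an arbitrary template. It also assumes an unsharded cache file counts as a single shard.

```python
import os
import re
from collections import defaultdict


def group_cache_files_by_num_shards(cache_file_name, candidates):
    """Hypothetical helper: map num_shards -> shard file names ordered by rank.

    Assumes the default suffix template "_{rank:05d}_of_{num_proc:05d}",
    inserted between the base name and the extension.
    """
    base, ext = os.path.splitext(cache_file_name)
    pattern = re.compile(
        re.escape(base)
        + r"_(?P<rank>\d{5})_of_(?P<num_shards>\d{5})"
        + re.escape(ext)
    )

    found = defaultdict(dict)  # num_shards -> {rank: file name}
    for name in candidates:
        match = pattern.fullmatch(name)
        if match:
            found[int(match["num_shards"])][int(match["rank"])] = name
        elif name == cache_file_name:
            # Assumption: the unsharded file is treated as one shard.
            found[1][0] = name

    # Keep only complete sets, i.e. ranks 0 .. num_shards - 1 all present.
    complete = {}
    for num_shards, by_rank in found.items():
        if sorted(by_rank) == list(range(num_shards)):
            complete[num_shards] = [by_rank[rank] for rank in range(num_shards)]
    return complete
```

With a mapping like this, a caller could pick any complete shard set and load it regardless of the `num_proc` requested for the current call; incomplete sets are dropped here for simplicity, whereas the PR describes mapping the missing shards.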
Describe the bug

If you `map` a dataset and save it to a specific `cache_file_name` with a specific `num_proc`, and then call `map` again with that same existing `cache_file_name` but a different `num_proc`, the dataset will be re-mapped.

Steps to reproduce the bug

1. `map` a dataset and cache it with a specific `num_proc`
2. `map` it with a different `num_proc` and the same `cache_file_name` as before

Expected behavior

If I specify an existing `cache_file_name`, I don't expect using a different `num_proc` than the one that was used to generate it to cause the dataset to have to be re-mapped.
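A likely reason for the re-map, sketched under assumptions: as far as I understand, when `num_proc > 1` each shard is written to its own file, with the `suffix_template` (default `_{rank:05d}_of_{num_proc:05d}`) inserted before the extension of `cache_file_name`. The helper below, `shard_cache_file_names`, is hypothetical and only mimics that naming; it is not the library's code.

```python
def shard_cache_file_names(
    cache_file_name,
    num_proc,
    suffix_template="_{rank:05d}_of_{num_proc:05d}",
):
    """Hypothetical helper: list the cache files one map call would look for."""
    if num_proc is None or num_proc <= 1:
        # Single-process case: the cache file name is used as given.
        return [cache_file_name]
    sep = cache_file_name.rindex(".")
    base, ext = cache_file_name[:sep], cache_file_name[sep:]
    return [
        base + suffix_template.format(rank=rank, num_proc=num_proc) + ext
        for rank in range(num_proc)
    ]


two = shard_cache_file_names("processed.arrow", 2)
four = shard_cache_file_names("processed.arrow", 4)
# The two runs share no file names, so the second run finds nothing to load.
print(set(two) & set(four))
```

If that naming assumption holds, a run with a different `num_proc` looks for files that were never written, which would explain why the existing cache is ignored.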