Improve speed of NaturalIdPartitioner #3276
Conversation
natural_ids = np.array(self.dataset[self._partition_by])
unique_natural_ids = self.dataset.unique(self._partition_by)
Could we do this without having to load the whole dataset into memory?
To make it work quicker, we need to load the column specified by self._partition_by into memory.
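The memory trade-off discussed above can be illustrated with a small sketch. This is an assumption-laden toy (a plain dict stands in for the real Dataset object): only the single partition_by column is materialized as a NumPy array, while the heavy payload columns are never copied.

```python
import numpy as np

# Toy table standing in for the real dataset; "pixels" represents the
# large payload columns that should NOT be loaded into memory.
table = {
    "writer_id": ["w1", "w2", "w1", "w3"],
    "pixels": [[0], [1], [2], [3]],
}
partition_by = "writer_id"

# Equivalent in spirit to np.array(self.dataset[self._partition_by]):
# only the one column is pulled into a NumPy array.
natural_ids = np.array(table[partition_by])
unique_natural_ids = np.unique(natural_ids)
```

So the whole-dataset concern is avoided: the in-memory footprint is one column of (typically small) id values, not the full dataset.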
Co-authored-by: Javier <[email protected]>
Issue
When adding and testing the femnist dataset on NaturalIdPartitioner, I discovered that partitioning takes a very long time.
Description
The femnist dataset consists of over 800,000 samples corresponding to over 3,500 unique writers. The current implementation filters the dataset (using Dataset.filter) once per unique writer: when a particular partition is requested, the whole dataset is filtered for that single unique value. Filtering the data for a single writer_id took about 1 minute, so doing it for all writers would take roughly 3,600 minutes = 60 hours. In addition, the filtering operates on all columns instead of only the single partition_by column.
Proposal
Iterate a single time over the whole column specified by partition_by to create a mapping from each unique natural id to the row indices that belong to it.
Time
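The proposed single-pass mapping can be sketched as follows. This is an illustrative version, not the PR's actual code: `build_index_map` and the toy column are hypothetical names, and the real implementation works on the Hugging Face Dataset column.

```python
from collections import defaultdict

def build_index_map(natural_ids):
    """Map each unique natural id to the row indices that carry it.

    One pass over the column replaces one Dataset.filter call per
    unique id; a partition is then just a list of row indices.
    """
    index_map = defaultdict(list)
    for row_idx, nid in enumerate(natural_ids):
        index_map[nid].append(row_idx)
    return dict(index_map)

# Toy column standing in for dataset[partition_by] (e.g. writer_id).
column = ["w1", "w2", "w1", "w3", "w2", "w1"]
mapping = build_index_map(column)
# mapping["w1"] -> [0, 2, 5]
```

With such a mapping, loading a partition reduces to selecting the stored indices (e.g. via Dataset.select), so the expensive scan happens only once regardless of how many partitions are requested.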
Time old = the time it takes to load 100 partitions using a slight optimization of the old approach (still using filter). Here time(load 100) ~ 100 * time(load 1), and in general time(load n) = n * time(load 1).
Time new = the same measurement, but the mapping computation happens only once, so time(load 100) ~ time(load 1) ~ time(load all_partitions).
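The scaling argument above can be made concrete with a rough cost model. This is an assumed, illustrative model (counting row scans only, ignoring constants), not a measurement from the PR.

```python
def old_cost(n_rows, k_partitions):
    # Old approach: every partition request filters all rows,
    # so loading k partitions scans k * n rows in total.
    return k_partitions * n_rows

def new_cost(n_rows, k_partitions):
    # New approach: one pass builds the mapping (n rows),
    # then each partition load is a cheap index lookup.
    return n_rows + k_partitions

# With femnist-like numbers (800,000 rows, 3,500 writers), the old
# approach scans on the order of 2.8e9 rows versus ~8e5 for the new one.
old = old_cost(800_000, 3_500)
new = new_cost(800_000, 3_500)
```

This matches the linear-in-n behavior of the old approach, time(load n) = n * time(load 1), versus the near-constant per-partition cost of the new one.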
Table 1: Time in seconds to load 100 partitions using the old implementation (with a slight modification, but still using filter) and the new implementation, across a few datasets that vary in the total number of samples and the number of unique clients.
Changelog entry