feat(datasets) Add DistributionPartitioner to Flower Datasets #3791

Merged
merged 78 commits into from
Jul 24, 2024

@chongshenng chongshenng commented Jul 12, 2024

Issue

Some FL researchers use specific non-IID distributions, such as a pathological power-law split, but Flower Datasets does not currently support such partitioning schemes.

Description

The power-law splitting procedure is used in several notable works, such as FedProx and TAMUNA. To ensure that Flower Datasets can be used with Flower baselines, this partitioner needs to be developed.

Related issues/PRs

Proposal

Add a partitioner that closely reproduces the power-law partitioning scheme.

Explanation

This partitioner will include the functionalities to:

  1. Accept sample counts drawn from a user-specified distribution.
  2. Specify the minimum number of occurrences per label per client.
  3. Specify the number of labels per client.

Point 1 is intentional: it allows distributions other than the log-normal to be used with this partitioning scheme. The partitioner retains the original pathological definition, i.e. each partition contains only num_unique_labels_per_partition unique labels. If the rescale parameter is set to True, all samples from the dataset are exhausted during sampling.
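
To make the three points above concrete, here is a minimal pure-Python sketch of a pathological split that computes per-partition sample counts only. It is not the code added in this PR: the function name `pathological_counts`, its arguments, and the round-robin label assignment are assumptions chosen for illustration, and it assumes `labels_per_partition` does not exceed the number of labels and that `draw` returns positive numbers.

```python
import random
from collections import defaultdict


def pathological_counts(label_totals, num_partitions, labels_per_partition,
                        preassigned, draw, rescale=True):
    """Return one {label: sample_count} dict per partition (illustrative only)."""
    labels = sorted(label_totals)
    # Assumption: partition p holds labels p, p+1, ... (modulo the label count).
    slots = defaultdict(list)  # label -> partitions that hold this label
    for p in range(num_partitions):
        for j in range(labels_per_partition):
            slots[labels[(p + j) % len(labels)]].append(p)

    counts = [dict() for _ in range(num_partitions)]
    for label, parts in slots.items():
        # Every (partition, label) slot first receives the preassigned minimum.
        remaining = label_totals[label] - preassigned * len(parts)
        if remaining < 0:
            raise ValueError(f"not enough samples of label {label!r}")
        weights = [draw() for _ in parts]  # user-specified distribution
        if rescale:
            # Scale the draws so that they sum to the remaining samples.
            scale = remaining / sum(weights)
            extra = [int(w * scale) for w in weights]
            # Flooring drops a few samples; hand them out one at a time.
            for i in range(remaining - sum(extra)):
                extra[i % len(extra)] += 1
        else:
            # Use the raw draws, capped by what is still available.
            extra = []
            for w in weights:
                take = min(int(w), remaining)
                extra.append(take)
                remaining -= take
        for p, e in zip(parts, extra):
            counts[p][label] = preassigned + e
    return counts


# Example: 10 labels with 600 samples each, 20 partitions, 2 labels per
# partition, 5 preassigned samples per label, log-normal draws.
rng = random.Random(0)
label_totals = {label: 600 for label in range(10)}
counts = pathological_counts(
    label_totals,
    num_partitions=20,
    labels_per_partition=2,
    preassigned=5,
    draw=lambda: rng.lognormvariate(0.0, 2.0),
)
```

With `rescale=True`, the counts for each label sum to that label's total, so the whole dataset is used; with `rescale=False`, the raw draws decide the counts and some samples may be left unassigned.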

This implementation is inspired by Li et al. "Federated Optimization in Heterogeneous Networks" (2020) https://arxiv.org/abs/1812.06127.

Visualizations

The histogram shows the pathological partitioning of a log-normal distribution for 20 partitions with the MNIST dataset, with 5 preassigned samples per label and 2 unique labels per partition:
[Image: histogram of the partitioned MNIST dataset]

To compare the original power-law implementation by Li et al. with the Flower Datasets implementation, we plot the distributions from both implementations below:
[Image: distributions from both implementations]
The plot above uses the original configuration of 1_000 partitions, 5 preassigned samples per label, and 2 unique labels per partition. The Flower Datasets implementation closely matches the original distribution and the $y=x^{-2}$ curve, which validates our implementation.

Checklist

  • Implement proposed change
  • Write tests
  • Make CI checks pass
  • Ping maintainers on Slack (channel #contributions)

Any other comments?

adam-narozniak previously approved these changes Jul 22, 2024

@adam-narozniak adam-narozniak left a comment

lgtm!

@jafermarq jafermarq enabled auto-merge (squash) July 24, 2024 09:12
@jafermarq jafermarq disabled auto-merge July 24, 2024 09:14
@jafermarq jafermarq enabled auto-merge (squash) July 24, 2024 10:42
@jafermarq jafermarq disabled auto-merge July 24, 2024 10:43
@jafermarq jafermarq enabled auto-merge (squash) July 24, 2024 11:03
@jafermarq jafermarq merged commit 7dc7c8a into main Jul 24, 2024
41 checks passed
@jafermarq jafermarq deleted the fds-add-distribution-partitioner branch July 24, 2024 11:10