feat(datasets) Add `DistributionPartitioner` to Flower Datasets #3791

chongshenng · 2024-07-12T19:32:05Z

Issue

Some FL researchers use specific non-IID (NIID) distributions, such as a pathological power-law split, but Flower Datasets does not currently support such partitioning schemes.

Description

The power-law splitting procedure is used in several notable works, such as FedProx and TAMUNA. To ensure that Flower Datasets can be used with Flower baselines, this partitioner needs to be developed.

Related issues/PRs

Proposal

Add a partitioner that closely reproduces the power-law partitioning scheme.

Explanation

This partitioner will include the functionality to:

1. Accept sample counts drawn from a user-specified distribution.
2. Specify the minimum number of samples per label per client.
3. Specify the number of labels per client.

Point 1 is intentional, so that distributions other than the log-normal distribution can also be prescribed for this partitioning scheme. The scheme retains its original pathological definition, i.e. each partition contains only `num_unique_labels_per_partition` unique labels. If the `rescale` parameter is set to True, all samples in the dataset are exhausted during sampling.
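To make the three points above concrete, here is a minimal standard-library sketch of a pathological split of this kind. It is not the code merged in this PR: the function name `sketch_partition`, its arguments, the round-robin label assignment, and the log-normal parameters (0.0, 2.0) are all assumptions chosen for illustration, and the sketch always rescales, which corresponds to the behaviour described for `rescale=True`.

```python
import random
from collections import defaultdict


def sketch_partition(labels, num_partitions, labels_per_partition,
                     preassigned, seed=0):
    """Return {partition_id: [sample indices]} (illustrative sketch only)."""
    rng = random.Random(seed)

    # Group sample indices by label and shuffle within each label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    unique = sorted(by_label)
    if labels_per_partition > len(unique):
        raise ValueError("more labels per partition than labels in the data")
    for label in unique:
        rng.shuffle(by_label[label])

    # Assumed assignment rule: partition p holds labels p, p+1, ...
    # (modulo the number of labels), so it sees only a few labels.
    owners = defaultdict(list)  # label -> partitions that hold it
    for pid in range(num_partitions):
        for j in range(labels_per_partition):
            owners[unique[(pid + j) % len(unique)]].append(pid)

    partitions = defaultdict(list)
    for label in unique:
        pool, pids = by_label[label], owners[label]
        if not pids:
            continue  # no partition holds this label; its samples stay unused
        if len(pool) < preassigned * len(pids):
            raise ValueError(f"not enough samples of label {label!r}")
        # Step 1: every owner receives the guaranteed minimum per label.
        for pid in pids:
            partitions[pid].extend(pool.pop() for _ in range(preassigned))
        # Step 2: draw log-normal weights and rescale them so that the
        # integer counts add up to exactly the samples still in the pool.
        weights = [rng.lognormvariate(0.0, 2.0) for _ in pids]
        total, remaining = sum(weights), len(pool)
        counts = [int(remaining * w / total) for w in weights]
        # Rounding leftovers go to the owner with the largest weight.
        counts[weights.index(max(weights))] += remaining - sum(counts)
        for pid, count in zip(pids, counts):
            partitions[pid].extend(pool.pop() for _ in range(count))
    return dict(partitions)


# Toy stand-in for MNIST labels: 10 classes, 100 samples each.
labels = [i % 10 for i in range(1000)]
parts = sketch_partition(labels, num_partitions=20,
                         labels_per_partition=2, preassigned=5)
```

Because the weights are rescaled per label, every sample ends up in exactly one partition, while the partition sizes stay heavily skewed. Swapping `rng.lognormvariate` for another sampler is the kind of flexibility Point 1 refers to.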

This implementation is inspired by Li et al., "Federated Optimization in Heterogeneous Networks" (2020), https://arxiv.org/abs/1812.06127.

Visualizations

The histogram shows the pathological partitioning of a log-normal distribution over 20 partitions of the MNIST dataset, with 5 preassigned samples per label and 2 unique labels per partition:

To compare the original power-law implementation by Li et al. with the Flower Datasets implementation, we plot the distributions from both implementations below:

The plot above uses the original configuration of 1_000 partitions, 5 preassigned samples per label, and 2 unique labels per partition. Our Flower Datasets implementation closely matches both the original distribution and the $y=x^{-2}$ curve, which validates our implementation.

Checklist

Implement proposed change
Write tests
Make CI checks pass
Ping maintainers on Slack (channel #contributions)

Any other comments?

datasets/flwr_datasets/partitioner/distribution_partitioner_test.py

adam-narozniak

lgtm!

datasets/flwr_datasets/partitioner/distribution_partitioner.py

chongshenng added 30 commits July 12, 2024 14:44

Initial commit

1916a5c

Add partition_id_to_samples

89acaf8

Push working commit

5051829

Refactor

350d9b4

Refactor

f7c705f

Add data structure for storing indices

9d2785b

Refactor and rename variables

4e7fa4a

Rename tracker_dict to index_tracker

55ae55b

Update comments

cb99ebe

Refactor checks for input distribution array

69fbec2

Update class docstring

4243830

Disable too many arguments warning

994f319

Update docstring example

7d8d952

Add reshape to docstring

3c29514

Sort imports

e227c93

Merge branch 'main' into fds-add-distribution-partitioner

20931d9

Add warning for num_partitions not divisble by num_unique_labels

9bdf17b

Add check for distribution array sum

a2dd302

Fix missing sum

cd2b3cf

Sort imports

42bb71a

Fix docstrings with docformatter

adf4ebc

Merge branch 'main' into fds-add-distribution-partitioner

e71e69e

Fix docstring

4a86966

Fix docstring

1277f8e

Add newline

046ed33

Refactor ndarray typing

b20c4db

Fix pylint

d5011df

Update docstring

ed90e97

Update top level docstring

b1450bb

Use common typing

226b676

chongshenng added 3 commits July 22, 2024 13:47

Mid change

9744bfe

Change pytest to unittest implementation

2d14200

Merge branch 'main' into fds-add-distribution-partitioner

496e96e

adam-narozniak reviewed Jul 22, 2024

datasets/flwr_datasets/partitioner/distribution_partitioner_test.py (outdated, resolved)

chongshenng added 2 commits July 22, 2024 14:29

Use parameterized_class

0a90076

Merge branch 'main' into fds-add-distribution-partitioner

d9a21d6

adam-narozniak previously approved these changes Jul 22, 2024

Disable mypy attr-defined error code

791aa6a

chongshenng dismissed adam-narozniak’s stale review via 791aa6a July 22, 2024 13:40

jafermarq reviewed Jul 22, 2024

datasets/flwr_datasets/partitioner/distribution_partitioner_test.py (resolved)

Update datasets/flwr_datasets/partitioner/distribution_partitioner_te…

128d818

…st.py

jafermarq reviewed Jul 22, 2024

datasets/flwr_datasets/partitioner/distribution_partitioner.py (resolved)

Merge branch 'main' into fds-add-distribution-partitioner

8cd2aaf

jafermarq reviewed Jul 23, 2024

jafermarq and others added 5 commits July 23, 2024 19:14

Apply suggestions from code review

43b26ca

format

145dfe6

Address comments

5ee5a13

sort

b9b308c

Merge branch 'main' into fds-add-distribution-partitioner

99ce809

jafermarq approved these changes Jul 24, 2024

Merge branch 'main' into fds-add-distribution-partitioner

7f96adf

jafermarq enabled auto-merge (squash) July 24, 2024 09:12

jafermarq disabled auto-merge July 24, 2024 09:14

Merge branch 'main' into fds-add-distribution-partitioner

7afb402

jafermarq enabled auto-merge (squash) July 24, 2024 10:42

jafermarq disabled auto-merge July 24, 2024 10:43

Merge branch 'main' into fds-add-distribution-partitioner

c535ca7

jafermarq enabled auto-merge (squash) July 24, 2024 11:03

jafermarq merged commit 7dc7c8a into main Jul 24, 2024
41 checks passed

jafermarq deleted the fds-add-distribution-partitioner branch July 24, 2024 11:10
