[LFX'24] Add Sedna Federated Learning v2 Proposal. #455

Merged on Nov 22, 2024 (2 commits)


Electronic-Waste
Contributor

What type of PR is this?

/kind design

What this PR does / why we need it:

This PR contains the proposal for Sedna Federated Learning V2 (updated version after the last community meeting).

Related to LFX'24 Fall Project: kubeedge/kubeedge#5762

cc👀 @Shelley-BaoYue @fisherxu @tangming1996 @MooreZheng @hsj576

Which issue(s) this PR fixes:

Fixes #

@kubeedge-bot added the kind/design label (Categorizes issue or PR as related to design.) on Nov 4, 2024
@kubeedge-bot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) on Nov 4, 2024
Contributor

@MooreZheng left a comment


The proposal should then further consider the design of data-centric task scheduling.

In the previous routine meeting, we saw that integrating training-operator with Sedna is challenging from the very start: the question is what to schedule in Sedna federated learning. Since federated learning is also a training task, it is in fact data-driven. Scheduling a training task without also scheduling its training data can lead to significant training bias.

At the meeting, we saw mainly two possible ways to build a practical system.

  1. Assume that there are subnets, and that data can be scheduled within the same subnet.
    As suggested by @tangming1996, KubeEdge itself also has node-group management that can be used to fulfill the subnet assumption.
  2. Develop a method to transfer non-raw data, e.g., embeddings, from which the raw data cannot be recovered.
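
To make option 1 concrete, here is a minimal sketch of data-aware placement under the subnet assumption. It is only an illustration: `Node`, `plan_placement`, and the `group` label are hypothetical names, and nothing here uses the Sedna or KubeEdge APIs. The rule it encodes is the one discussed above: prefer a worker that already holds the dataset, and otherwise allow a data transfer only between nodes of the same group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    group: str            # hypothetical node-group (subnet) label
    datasets: frozenset   # names of datasets stored locally on this node


def plan_placement(dataset, candidates, nodes):
    """Return (worker, data_source, needs_transfer), or None if impossible.

    candidates: nodes that may run the training worker.
    nodes: all known nodes, used to find where the dataset lives.
    """
    # 1) Data-local placement: the worker already holds the data.
    for cand in candidates:
        if dataset in cand.datasets:
            return cand.name, cand.name, False

    # 2) Same-group placement: data may move, but only inside one group.
    holders = [n for n in nodes if dataset in n.datasets]
    for cand in candidates:
        for holder in holders:
            if holder.group == cand.group:
                return cand.name, holder.name, True

    # 3) No candidate can reach the data without crossing a group boundary.
    return None
```

In this sketch a task whose data lives only in another group is rejected rather than scheduled without its data, which is the case that leads to the training bias mentioned above.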

Besides, Kubeflow assumes that all training workers share the same parameters and dataset, which is not practical for edge tasks where workers have different parameters and datasets. That means we need an edge version of the training operator.
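
The difference can be sketched as two ways of expanding a job into worker specs. This is a hypothetical illustration only: `WorkerSpec`, `expand_shared`, and `expand_per_worker` are made-up names, not the Kubeflow training-operator or Sedna API, and the "shared" variant simply encodes the assumption described in this comment.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerSpec:
    node: str
    dataset_url: str
    params: dict = field(default_factory=dict)


def expand_shared(nodes, dataset_url, params):
    """Shared assumption: every worker gets the same dataset and parameters."""
    return [WorkerSpec(node, dataset_url, dict(params)) for node in nodes]


def expand_per_worker(defaults, overrides):
    """Edge-style expansion: each worker names its own dataset and may
    override any default parameter.

    overrides maps node name -> {"dataset_url": str, "params": dict (optional)}.
    """
    specs = []
    for node, override in overrides.items():
        merged = {**defaults, **override.get("params", {})}
        specs.append(WorkerSpec(node, override["dataset_url"], merged))
    return specs
```

For example, a constrained edge node could keep the default learning rate but use a smaller batch size, while reading from its own local dataset path.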

Signed-off-by: Electronic-Waste <[email protected]>
@Electronic-Waste
Contributor Author

I've updated the proposal according to the reviews in today's routine meeting.

PTAL👀 @tangming1996 @MooreZheng @hsj576 @Shelley-BaoYue @fisherxu

@tangming1996
Contributor

/lgtm

@kubeedge-bot added the lgtm label (Indicates that a PR is ready to be merged.) on Nov 22, 2024
@MooreZheng
Contributor

/lgtm

Contributor

@MooreZheng left a comment


This proposal aims to solve a complicated and critical issue of Sedna by considering distributed training across all the different schemes. The current version is fine as a first version after rounds of discussion.

Note that since distributed training for all schemes is undoubtedly challenging, it would introduce many new features to Sedna. We believe the proposal can still be enriched in the future. As mentioned at the routine meeting, the "DataLoader DS" and the edge-wise data transfer could potentially be implemented on top of EdgeMesh (@Poorunga), which should be considered in the implementation and in future versions of the proposal.

@kubeedge-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MooreZheng

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubeedge-bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Nov 22, 2024
@kubeedge-bot merged commit 01351c5 into kubeedge:main on Nov 22, 2024
2 checks passed
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
kind/design: Categorizes issue or PR as related to design.
lgtm: Indicates that a PR is ready to be merged.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.

4 participants