Training: Add documentation for the MultiKueue and spec.managedBy API #3956
Signed-off-by: Garvit-77 <[email protected]>
Hi @Garvit-77. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test.
/ok-to-test
@andreyvelich please review the PR and let me know if any changes are expected.
Example
Actually, under the hood Kueue will add a webhook that defaults the field, so the example will not need to include it. I think we can just rely on the example that will eventually be added to the MultiKueue documentation once kubernetes-sigs/kueue#2552 is done.
WDYT @tenzen-y @andreyvelich ?
Yes, I think we can reference the example in the Kueue docs, so we stay consistent.
OK, I suggest dropping it from this PR so that the details live in only one place. Note that the Kueue documentation has not been added yet. Once it is, we will cross-reference the pages.
FYI, I think the feature will be almost entirely a technical detail hidden from users, as we are going to default the field in MultiKueue via a webhook once kubernetes-sigs/kueue#2552 is done. So I think describing the field is OK, but eventually the user's YAML will contain just a "plain" TFJob, and we will be able to reference the MultiKueue docs for an example.
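To make the defaulting idea concrete, here is a minimal Python sketch of what "a webhook defaults the field" could look like for a plain TFJob manifest. This is only an illustration, not Kueue's actual webhook: the function name `default_managed_by` and the `dispatched_by_multikueue` flag are hypothetical, and the field location `spec.runPolicy.managedBy` follows the review comments in this thread for training-operator 1.9.

```python
import copy

# Value quoted in this review thread for MultiKueue-managed jobs.
MULTIKUEUE_CONTROLLER = "kueue.x-k8s.io/multikueue"


def default_managed_by(job: dict, dispatched_by_multikueue: bool) -> dict:
    """Return a copy of `job` with spec.runPolicy.managedBy filled in if absent.

    Hypothetical helper: the real webhook decides this from cluster state,
    which is reduced here to a single boolean flag.
    """
    result = copy.deepcopy(job)
    if not dispatched_by_multikueue:
        return result
    run_policy = result.setdefault("spec", {}).setdefault("runPolicy", {})
    # setdefault keeps any value the user already set explicitly.
    run_policy.setdefault("managedBy", MULTIKUEUE_CONTROLLER)
    return result


# A "plain" TFJob, as the user would write it (replica specs elided).
plain_tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "example-tfjob"},
    "spec": {"tfReplicaSpecs": {}},
}

defaulted = default_managed_by(plain_tfjob, dispatched_by_multikueue=True)
```

The user-facing manifest stays untouched; only the defaulted copy carries the field, which matches the point that users should not need to set it themselves.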
Co-authored-by: Michał Woźniak <[email protected]> Signed-off-by: Garvit Khandelwal <[email protected]>
## Overview

The `spec.managedBy` field is a new feature introduced in the Kubeflow Training Operator to support a more robust multi-cluster job dispatching by [MultiKueue](https://kueue.sigs.k8s.io/docs/concepts/multikueue/).
Small adjustment: `spec.runPolicy.managedBy`
+1, in Kubeflow 1.9 this is under `spec.runPolicy`: https://github.com/kubeflow/trainer/blob/078ec30ff26649a07d2e28893c332d4cef70e233/pkg/apis/kubeflow.org/v1/common_types.go#L231C5-L239. Also adjust the code snippets to use:

    spec:
      runPolicy:
        managedBy: "kueue.x-k8s.io/multikueue"
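As a rough illustration of the requested snippet change, the following Python sketch rewrites a manifest that sets `managedBy` directly under `spec` so that it sits under `spec.runPolicy` instead. The helper name `move_managed_by_to_run_policy` is hypothetical and the manifest is deliberately minimal; it only demonstrates the field location described above.

```python
import copy


def move_managed_by_to_run_policy(manifest: dict) -> dict:
    """Return a copy of `manifest` with spec.managedBy moved to spec.runPolicy.managedBy.

    Hypothetical helper for illustration; manifests without spec.managedBy
    are returned unchanged.
    """
    result = copy.deepcopy(manifest)
    spec = result.get("spec", {})
    if "managedBy" in spec:
        value = spec.pop("managedBy")
        # Do not overwrite a value that is already in the right place.
        spec.setdefault("runPolicy", {}).setdefault("managedBy", value)
    return result


# Manifest written the way the draft documentation describes the field.
draft_manifest = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "spec": {"managedBy": "kueue.x-k8s.io/multikueue"},
}

fixed_manifest = move_managed_by_to_run_policy(draft_manifest)
```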
@andreyvelich the location of the field is changed between training-operator 1.9.0 and the new trainer. Trainer is still not supported though. Is this documentation page meant for 1.9.0 or the new trainer, or both? If both, should we add a note that the snippet presents the yaml only for training-operator 1.9.0?
FYI, we already have a note about this in Kueue in the MD file: https://github.com/kubernetes-sigs/kueue/blob/main/site/content/en/docs/tasks/run/multikueue/kubeflow.md. The actual user-facing documentation at https://kueue.sigs.k8s.io/docs/ will be available when we release 0.11, which is planned for March 17th.
These docs should be placed under the legacy guide for Kubeflow Training Operator 1.9. @Garvit-77, could you please rebase this PR so that the guide is in the correct location?
This documentation resolves: kubeflow/training-operator/issues/2279