Torchx mcad coscheduler #693
Conversation
Codecov Report
@@ Coverage Diff @@
## main #693 +/- ##
==========================================
+ Coverage 92.46% 92.48% +0.02%
==========================================
Files 86 86
Lines 5666 5685 +19
==========================================
+ Hits 5239 5258 +19
Misses 427 427
opts.add(
    "coscheduler_name",
    type_=str,
    help="Option to run TorchX-MCAD with a co-scheduler. User must provide the co-scheduler name.",
Can we add some links to any external documentation about coschedulers?
Like: https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/coscheduling/README.md ?
Updated the documentation with a brief explanation and references to secondary scheduling, coscheduling and PodGroups, and the PodGroup CRD.
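For context, here is a minimal, self-contained sketch of how a scheduler run option like this is declared and resolved through TorchX's runopts API; the option name comes from this PR, while the resolve call and the scheduler value "scheduler-plugins-scheduler" are illustrative assumptions, not the PR's exact code:

# Illustrative sketch only: declare the coscheduler_name run option and
# resolve a user-supplied cfg against it.
from torchx.specs.api import runopts

opts = runopts()
opts.add(
    "coscheduler_name",
    type_=str,
    help="Option to run TorchX-MCAD with a co-scheduler. "
    "User must provide the co-scheduler name.",
)

# "scheduler-plugins-scheduler" is a hypothetical example name.
cfg = opts.resolve({"coscheduler_name": "scheduler-plugins-scheduler"})
print(cfg["coscheduler_name"])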
def create_pod_group(role: Role, namespace: str, app_id: str) -> "Dict[str, Any]":
    pod_group_name = app_id + "-" + cleanup_str(role.name) + "-pg"

    pod_group: Dict[str, Any] = {
Would be nice to add a link to the schema definition for this type. I think it's https://github.com/kubernetes-sigs/scheduler-plugins/blob/56e2398e5051f796d5f02adb821d57eb8b6c20a7/config/crd/bases/scheduling.sigs.k8s.io_podgroups.yaml#L9 ?
Included the PodGroup CRD (release 1.24) in the updated documentation.
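To make the shape concrete, here is a hedged sketch of a PodGroup manifest built as a Python dict, following the scheduler-plugins PodGroup CRD linked above; the apiVersion, field names, and helper signature are assumptions based on that CRD, not TorchX's exact output:

from typing import Any, Dict

def create_pod_group_sketch(
    pod_group_name: str, namespace: str, min_member: int
) -> Dict[str, Any]:
    # Assumed group/version from the scheduler-plugins PodGroup CRD
    # (scheduling.sigs.k8s.io_podgroups.yaml); verify against your cluster.
    return {
        "apiVersion": "scheduling.sigs.k8s.io/v1alpha1",
        "kind": "PodGroup",
        "metadata": {
            "name": pod_group_name,
            "namespace": namespace,
        },
        "spec": {
            # Gang scheduling: the co-scheduler holds pods until at least
            # min_member replicas can be placed together.
            "minMember": min_member,
        },
    }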
)
pod.metadata.labels.update(
    pod_labels(
        app, role_idx, role, replica_id, coscheduler_name, unique_app_id
There are a lot of fields here. Think it's better to switch these to use app=app, coscheduler_name=coscheduler_name, ... to avoid accidentally passing the wrong field to the wrong param.
Updated the code to include field names.
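The resulting call would read roughly as follows; this is a sketch of the keyword-argument form, and the parameter names are assumptions that mirror the local variable names in the positional call above:

pod.metadata.labels.update(
    pod_labels(
        app=app,
        role_idx=role_idx,
        role=role,
        replica_id=replica_id,
        coscheduler_name=coscheduler_name,
        app_id=unique_app_id,  # assumed parameter name for unique_app_id
    )
)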
resource = app_to_resource(
-    app, namespace, service_account, image_secret, priority
+    app, namespace, service_account, image_secret, coscheduler_name, priority
likewise kwargs for this
Updated
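Applied to app_to_resource, the same keyword style would look like this (again, parameter names are assumed to match the variable names in the diff above):

resource = app_to_resource(
    app=app,
    namespace=namespace,
    service_account=service_account,
    image_secret=image_secret,
    coscheduler_name=coscheduler_name,
    priority=priority,
)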
2/27/2023: Component Integration Tests / Kubernetes Dist Train Integration Tests appear to be failing due to a missing environment variable: torchx/scripts/integ_test_utils.py expects an environment variable that is not set in the CI run.
Hi @Sara-KS, thanks for taking a look. I actually fixed that particular issue yesterday (see: https://github.com/pytorch/torchx/blob/main/scripts/kube_dist_trainer.py#L45), but the failure persists. Since I've left Meta and don't have access to TorchX's AWS account, I'm unable to log in and take a deeper look. Perhaps @kurman or @priyaramani can help here. Long term, @d4l3k and I have discussed moving that integ test to run directly on the CI runner with minikube instead of hitting a deployed EKS cluster; Tristan's already made some progress on that front.
That integration test doesn't use the mcad scheduler anyway, so you can just ignore it for this PR.
LGTM. Failures look like an oversight on my part on how tagging is handled for forks.
This addition enables the option of running TorchX-MCAD with a Kubernetes co-scheduler and PodGroups. In shared Kubernetes clusters where some users do not represent their jobs with AppWrappers, the additional use of a co-scheduler helps ensure correct gang scheduling.
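For intuition, here is a hedged sketch of what opting a pod into a PodGroup looks like; the label key follows the scheduler-plugins coscheduling README, and every name and value below is a placeholder assumption rather than TorchX's exact output:

# Sketch: a pod that targets the co-scheduler and joins a PodGroup.
# The label key and all names/values are assumptions from the coscheduling docs.
pod_manifest: dict = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "trainer-0",
        "labels": {"pod-group.scheduling.sigs.k8s.io": "myjob-trainer-pg"},
    },
    "spec": {
        # schedulerName is set to the coscheduler_name value the user provides.
        "schedulerName": "scheduler-plugins-scheduler",
        "containers": [{"name": "trainer", "image": "example/image:latest"}],
    },
}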
Test plan:
Updated CI test coverage
Testing includes using the new --coscheduler_name flag for the kubernetes_mcad scheduler and ensuring that Kubernetes or OpenShift reports the named scheduler when pods are scheduled.
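As a hedged illustration of that check, one could list the pods with the kubernetes Python client and compare each pod's reported scheduler name; the namespace and expected scheduler name below are hypothetical placeholders:

# Verification sketch: confirm pods were bound by the named co-scheduler.
# "default" and "scheduler-plugins-scheduler" are placeholder values.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
for pod in v1.list_namespaced_pod(namespace="default").items:
    assert pod.spec.scheduler_name == "scheduler-plugins-scheduler", (
        f"{pod.metadata.name} was scheduled by {pod.spec.scheduler_name}"
    )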