Torchx mcad coscheduler#693
Conversation
Codecov Report
@@ Coverage Diff @@
## main #693 +/- ##
==========================================
+ Coverage 92.46% 92.48% +0.02%
==========================================
Files 86 86
Lines 5666 5685 +19
==========================================
+ Hits 5239 5258 +19
Misses 427 427
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
| opts.add( | ||
| "coscheduler_name", | ||
| type_=str, | ||
| help="Option to run TorchX-MCAD with a co-scheduler. User must provide the co-scheduler name.", |
There was a problem hiding this comment.
Can we add some links to any external documentation about coschedulers?
Like: https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/coscheduling/README.md ?
There was a problem hiding this comment.
Updated the documentation with brief explanation and references to secondary scheduling, coscheduling and PodGroups, and the PodGroup CRD.
| def create_pod_group(role: Role, namespace: str, app_id: str) -> "Dict[str, Any]": | ||
| pod_group_name = app_id + "-" + cleanup_str(role.name) + "-pg" | ||
|
|
||
| pod_group: Dict[str, Any] = { |
There was a problem hiding this comment.
Would be nice to add a link to the schema definition for this type. I think it's https://github.com/kubernetes-sigs/scheduler-plugins/blob/56e2398e5051f796d5f02adb821d57eb8b6c20a7/config/crd/bases/scheduling.sigs.k8s.io_podgroups.yaml#L9 ?
There was a problem hiding this comment.
Included the PodGroup CRD (release 1.24) in the updated documentation.
| ) | ||
| pod.metadata.labels.update( | ||
| pod_labels( | ||
| app, role_idx, role, replica_id, coscheduler_name, unique_app_id |
There was a problem hiding this comment.
there's a lot of fields here. Think it's better to switch these to use app=app, coscheduler_name=coscheduler_name, ... to avoid accidentally passing the wrong field to the wrong param
There was a problem hiding this comment.
Updated the code to include field names.
|
|
||
| resource = app_to_resource( | ||
| app, namespace, service_account, image_secret, priority | ||
| app, namespace, service_account, image_secret, coscheduler_name, priority |
|
2/27/2023 Component Integration Tests/ Kubernetes Dist Train Integration Tests appear to be failing due to a missing environment variable: torchx/scritps/integ_test_utils.py expects |
Hi @Sara-KS thanks for taking a look I actually fixed that particular issue yesterday (see: https://github.com/pytorch/torchx/blob/main/scripts/kube_dist_trainer.py#L45) but the Since I've left Meta and don't have access to TorchX's AWS account I'm unable to login and take a deeper look. Perhaps @kurman or @priyaramani can help here. Long term, @d4l3k and I have discussed moving that integ test to run directly on the CI runner with minikube instead of hitting a deployed EKS cluster. Tristan's already made FWIW |
|
That integration test doesn't use the mcad scheduler anyways so you can just ignore it for this PR |
|
LGTM Failures look like an oversight on my part on how tagging is handled for forks |
This addition enables the option of running TorchX-MCAD with a Kubernetes co-scheduler and PodGroups. In shared Kubernetes clusters where some users do not represent their jobs with AppWrappers, the additional use of a co-scheduler helps ensure correct gang scheduling.
Test plan:
Updated CI test coverage
Testing includes using the new
--coscheduler_nameflag for the kubernetes_mcad scheduler and ensuring that Kubernetes or OpenShift reports the named scheduler when pods are scheduled.