Torchx mcad coscheduler #693

Sara-KS · 2023-02-24T17:50:28Z

This addition enables the option of running TorchX-MCAD with a Kubernetes co-scheduler and PodGroups. In shared Kubernetes clusters where some users do not represent their jobs with AppWrappers, the additional use of a co-scheduler helps ensure correct gang scheduling.

Test plan:
Updated CI test coverage
Testing includes using the new --coscheduler_name flag for the kubernetes_mcad scheduler and ensuring that Kubernetes or OpenShift reports the named scheduler when pods are scheduled.
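The second check in the test plan can be pictured as follows. Every Pod manifest carries a `spec.schedulerName` field, so a test can assert that each pod names the requested co-scheduler. This is a minimal sketch over plain dictionaries; the helper `pods_use_scheduler` and the scheduler name `my-coscheduler` are illustrative assumptions, not part of TorchX or this PR.

```python
from typing import Any, Dict, Iterable


def pods_use_scheduler(pods: Iterable[Dict[str, Any]], scheduler_name: str) -> bool:
    """Return True if every pod manifest names `scheduler_name` in spec.schedulerName.

    Hypothetical helper for illustration; pods are plain dict manifests.
    """
    pod_list = list(pods)
    # An empty pod list proves nothing, so treat it as a failure.
    if not pod_list:
        return False
    return all(
        pod.get("spec", {}).get("schedulerName") == scheduler_name
        for pod in pod_list
    )


# Made-up manifests standing in for what the cluster would return.
example_pods = [
    {"metadata": {"name": "trainer-0"}, "spec": {"schedulerName": "my-coscheduler"}},
    {"metadata": {"name": "trainer-1"}, "spec": {"schedulerName": "my-coscheduler"}},
]
print(pods_use_scheduler(example_pods, "my-coscheduler"))
```

In a real cluster the manifests would come from the Kubernetes API rather than literals; the comparison itself stays the same.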

codecov · 2023-02-24T17:57:42Z

Codecov Report

Merging #693 (10d899e) into main (77c67eb) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #693      +/-   ##
==========================================
+ Coverage   92.46%   92.48%   +0.02%     
==========================================
  Files          86       86              
  Lines        5666     5685      +19     
==========================================
+ Hits         5239     5258      +19     
  Misses        427      427

Impacted Files	Coverage Δ
torchx/schedulers/kubernetes_mcad_scheduler.py	`94.29% <100.00%> (+0.26%)`	⬆️
torchx/components/dist.py	`96.42% <0.00%> (+0.06%)`	⬆️


d4l3k · 2023-02-24T20:58:48Z

torchx/schedulers/kubernetes_mcad_scheduler.py

+        opts.add(
+            "coscheduler_name",
+            type_=str,
+            help="Option to run TorchX-MCAD with a co-scheduler. User must provide the co-scheduler name.",


Can we add some links to any external documentation about coschedulers?

Like: https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/coscheduling/README.md ?

Updated the documentation with a brief explanation of secondary scheduling, coscheduling and PodGroups, and the PodGroup CRD, with references for each.

d4l3k · 2023-02-24T21:00:49Z

torchx/schedulers/kubernetes_mcad_scheduler.py

+def create_pod_group(role: Role, namespace: str, app_id: str) -> "Dict[str, Any]":
+    pod_group_name = app_id + "-" + cleanup_str(role.name) + "-pg"
+
+    pod_group: Dict[str, Any] = {


Would be nice to add a link to the schema definition for this type. I think it's https://github.com/kubernetes-sigs/scheduler-plugins/blob/56e2398e5051f796d5f02adb821d57eb8b6c20a7/config/crd/bases/scheduling.sigs.k8s.io_podgroups.yaml#L9 ?

Included the PodGroup CRD (release 1.24) in the updated documentation.
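For readers unfamiliar with the type under discussion, here is a rough sketch of what a PodGroup manifest built this way might look like. The `apiVersion` and `spec.minMember` follow the scheduler-plugins PodGroup CRD linked above. The `Role` dataclass and `cleanup_str` below are simplified stand-ins for the TorchX originals, and setting `minMember` to the replica count is an assumption rather than a statement of what the merged code does.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class Role:
    """Minimal stand-in for torchx.specs.Role (only the fields used here)."""

    name: str
    num_replicas: int


def cleanup_str(data: str) -> str:
    """Simplified stand-in: lowercase and keep only ASCII letters, digits and '-'."""
    lowered = data.lower()
    return "".join(c for c in lowered if (c.isascii() and c.isalnum()) or c == "-")


def create_pod_group(role: Role, namespace: str, app_id: str) -> Dict[str, Any]:
    # Same naming scheme as the snippet under review: <app_id>-<role>-pg
    pod_group_name = app_id + "-" + cleanup_str(role.name) + "-pg"
    return {
        "apiVersion": "scheduling.sigs.k8s.io/v1alpha1",
        "kind": "PodGroup",
        "metadata": {"name": pod_group_name, "namespace": namespace},
        # Assumption: the gang is complete only when all replicas can be placed.
        "spec": {"minMember": role.num_replicas},
    }


pod_group = create_pod_group(Role(name="Trainer_1", num_replicas=4), "default", "app-123")
print(pod_group["metadata"]["name"], pod_group["spec"]["minMember"])
```

Naming the group per role, as the signature suggests, lets each role be gang-scheduled independently of the others.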

d4l3k · 2023-02-24T21:02:40Z

torchx/schedulers/kubernetes_mcad_scheduler.py

+            )
+            pod.metadata.labels.update(
+                pod_labels(
+                    app, role_idx, role, replica_id, coscheduler_name, unique_app_id


There are a lot of fields here. I think it's better to switch these to keyword arguments (app=app, coscheduler_name=coscheduler_name, ...) to avoid accidentally passing the wrong value to the wrong param.

Updated the code to include field names.
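To illustrate the concern, the sketch below shows one way to make such mix-ups impossible: a bare `*` in the signature makes every parameter keyword-only, so a positional call fails immediately instead of silently swapping values. The parameter list and the `example.com/...` label keys are hypothetical and do not reflect the real `pod_labels` signature in TorchX.

```python
from typing import Dict, Optional


def pod_labels(
    *,  # everything after this marker must be passed by keyword
    app_name: str,
    role_idx: int,
    role_name: str,
    replica_id: int,
    coscheduler_name: Optional[str],
    app_id: str,
) -> Dict[str, str]:
    """Build illustrative pod labels; the label keys are made up for this sketch."""
    labels = {
        "example.com/app-name": app_name,
        "example.com/role-index": str(role_idx),
        "example.com/role-name": role_name,
        "example.com/replica-id": str(replica_id),
    }
    # Only tie the pod to a pod group when a co-scheduler was requested.
    if coscheduler_name is not None:
        labels["example.com/pod-group"] = f"{app_id}-{role_name}-pg"
    return labels


labels = pod_labels(
    app_name="train",
    role_idx=0,
    role_name="trainer",
    replica_id=1,
    coscheduler_name="my-coscheduler",
    app_id="app-123",
)
print(labels["example.com/pod-group"])
```

Calling `pod_labels("train", 0, "trainer", 1, "my-coscheduler", "app-123")` raises a `TypeError`, which is the safety net the review comment asks for.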

d4l3k · 2023-02-24T21:03:01Z

torchx/schedulers/kubernetes_mcad_scheduler.py

        resource = app_to_resource(
-            app, namespace, service_account, image_secret, priority
+            app, namespace, service_account, image_secret, coscheduler_name, priority


Likewise, use kwargs for this.

Sara-KS · 2023-02-27T18:28:32Z

2/27/2023: The Component Integration Tests / Kubernetes Dist Train Integration Tests appear to be failing due to a missing environment variable: torchx/scripts/integ_test_utils.py expects CONTAINER_REPO. The current test failures indicate that the container repository is not detected in the test environment, so the test attempts a docker push with a ":tag" reference instead of "repo:tag".
Related issue: #694
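The failure mode described here (an empty repository producing ":tag") is easy to reproduce with naive string formatting. The sketch below shows one defensive way to build the image reference. It is not the code in integ_test_utils.py; the function name `image_ref` and the example repository are assumptions, and only the CONTAINER_REPO variable name comes from the comment above.

```python
import os
from typing import Mapping


def image_ref(tag: str, env: Mapping[str, str] = os.environ) -> str:
    """Return "repo:tag", failing loudly when CONTAINER_REPO is unset or empty."""
    repo = env.get("CONTAINER_REPO", "").strip()
    if not repo:
        # Without this check, f"{repo}:{tag}" would yield the invalid ":tag".
        raise RuntimeError("CONTAINER_REPO is not set; cannot build 'repo:tag'")
    return f"{repo}:{tag}"


# Hypothetical repository value, passed explicitly instead of via the real environment.
print(image_ref("abc123", {"CONTAINER_REPO": "registry.example.com/torchx"}))
```

Failing before the push makes the missing variable the reported error, rather than an opaque docker failure later in the run.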

kiukchung · 2023-02-27T18:47:32Z

2/27/2023: The Component Integration Tests / Kubernetes Dist Train Integration Tests appear to be failing due to a missing environment variable: torchx/scripts/integ_test_utils.py expects CONTAINER_REPO. The current test failures indicate that the container repository is not detected in the test environment, so the test attempts a docker push with a ":tag" reference instead of "repo:tag". Related issue: #694

Hi @Sara-KS, thanks for taking a look. I actually fixed that particular issue yesterday (see: https://github.com/pytorch/torchx/blob/main/scripts/kube_dist_trainer.py#L45), but the Kubernetes Dist Train Integration Test is failing for a different reason (401 Unauthorized when querying for the status of the job): https://github.com/pytorch/torchx/actions/runs/4277421091/jobs/7446219941

Since I've left Meta and don't have access to TorchX's AWS account I'm unable to login and take a deeper look. Perhaps @kurman or @priyaramani can help here.

Long term, @d4l3k and I have discussed moving that integ test to run directly on the CI runner with minikube instead of hitting a deployed EKS cluster. Tristan has already made the components integ test and the kfp integ test do this; we just haven't gotten around to making the same change for the k8s integ test.

FWIW Components Integration Tests should be passing now (if you rebase on top of main)

d4l3k · 2023-02-27T19:32:09Z

That integration test doesn't use the mcad scheduler anyways so you can just ignore it for this PR

d4l3k · 2023-03-06T05:17:26Z

LGTM. Failures look like an oversight on my part in how tagging is handled for forks.

Sara-KS added 6 commits February 22, 2023 09:15

Enable PodGroups

e8005e7

Enable co-scheduler

e95a2f9

Fix lint errors

9e0e34c

Minor documentation fixes

0bb15a2

Remove comment

bb4ceed

Support different PodGroups for different Roles

31b6287

facebook-github-bot added the CLA Signed label Feb 24, 2023

d4l3k approved these changes Feb 24, 2023


Sara-KS and others added 2 commits February 27, 2023 10:29

Improved documentation and code readability

0fc471a

Merge branch 'pytorch:main' into torchx-mcad-coscheduler

532db7f

Merge branch 'pytorch:main' into torchx-mcad-coscheduler

10d899e

d4l3k merged commit 128a8e5 into pytorch:main Mar 6, 2023

Sara-KS deleted the torchx-mcad-coscheduler branch March 6, 2023 16:38
