Torchx mcad coscheduler #693

Merged
merged 9 commits into from
Mar 6, 2023

Contributor

@Sara-KS Sara-KS commented Feb 24, 2023

This addition enables the option of running TorchX-MCAD with a Kubernetes co-scheduler and PodGroups. In shared Kubernetes clusters where some users do not represent their jobs with AppWrappers, the additional use of a co-scheduler helps ensure correct gang scheduling.

Test plan:
Updated CI test coverage
Testing includes using the new --coscheduler_name flag for the kubernetes_mcad scheduler and ensuring that Kubernetes or OpenShift reports the named scheduler when pods are scheduled.
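
For readers unfamiliar with how a pod is routed to a secondary scheduler, the following is a minimal sketch and not the code from this PR. It assumes the co-scheduler name ends up in the standard Kubernetes `spec.schedulerName` field of each pod. The helper `apply_coscheduler`, the plain-dict pod manifest, and the example scheduler name are all hypothetical.

```python
from typing import Any, Dict, Optional


def apply_coscheduler(
    pod: Dict[str, Any], coscheduler_name: Optional[str]
) -> Dict[str, Any]:
    """Return a copy of the pod manifest routed to the named scheduler.

    Hypothetical helper for illustration only. When no name is given,
    the pod is left for the cluster's default scheduler.
    """
    updated = dict(pod)
    spec = dict(updated.get("spec", {}))
    if coscheduler_name is not None:
        # schedulerName is the standard Kubernetes pod spec field that
        # selects which scheduler places the pod.
        spec["schedulerName"] = coscheduler_name
    updated["spec"] = spec
    return updated


# Example values are assumptions, not taken from the PR.
pod = {"metadata": {"name": "trainer-0"}, "spec": {"containers": []}}
routed = apply_coscheduler(pod, "scheduler-plugins-scheduler")
print(routed["spec"]["schedulerName"])
```

Checking the same field on a scheduled pod is one way to confirm that the named scheduler was used, which is what the test plan above describes.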

@facebook-github-bot facebook-github-bot added the CLA Signed label Feb 24, 2023

codecov bot commented Feb 24, 2023

Codecov Report

Merging #693 (10d899e) into main (77c67eb) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #693      +/-   ##
==========================================
+ Coverage   92.46%   92.48%   +0.02%     
==========================================
  Files          86       86              
  Lines        5666     5685      +19     
==========================================
+ Hits         5239     5258      +19     
  Misses        427      427              
| Impacted Files | Coverage Δ |
| --- | --- |
| torchx/schedulers/kubernetes_mcad_scheduler.py | 94.29% <100.00%> (+0.26%) ⬆️ |
| torchx/components/dist.py | 96.42% <0.00%> (+0.06%) ⬆️ |

opts.add(
    "coscheduler_name",
    type_=str,
    help="Option to run TorchX-MCAD with a co-scheduler. User must provide the co-scheduler name.",
Member

Can we add some links to any external documentation about coschedulers?

Like: https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/coscheduling/README.md ?

Contributor Author

Updated the documentation with a brief explanation of, and references to, secondary scheduling, coscheduling and PodGroups, and the PodGroup CRD.

def create_pod_group(role: Role, namespace: str, app_id: str) -> "Dict[str, Any]":
    pod_group_name = app_id + "-" + cleanup_str(role.name) + "-pg"

    pod_group: Dict[str, Any] = {
Member

Contributor Author

Included the PodGroup CRD (release 1.24) in the updated documentation.
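
As a rough illustration of what a PodGroup manifest built this way might look like, here is a self-contained sketch. It is not the body of `create_pod_group` from the PR, which is truncated in the hunk above. The `apiVersion` string, the use of `minMember` for the replica count, and the helper `sanitize_name` (a stand-in for `cleanup_str`) are assumptions; check them against the PodGroup CRD installed in your cluster.

```python
import re
from typing import Any, Dict


def sanitize_name(name: str) -> str:
    """Hypothetical stand-in for the cleanup_str helper seen in the diff."""
    return re.sub(r"[^a-z0-9-]", "", name.lower())


def build_pod_group(
    *, role_name: str, num_replicas: int, namespace: str, app_id: str
) -> Dict[str, Any]:
    """Build a PodGroup manifest as a plain dict (illustrative only)."""
    # Same naming pattern as the hunk above: <app_id>-<role>-pg
    pod_group_name = app_id + "-" + sanitize_name(role_name) + "-pg"
    return {
        # Assumed API group and version; verify against your CRD release.
        "apiVersion": "scheduling.sigs.k8s.io/v1alpha1",
        "kind": "PodGroup",
        "metadata": {"name": pod_group_name, "namespace": namespace},
        # Assumed: gang size equals the number of replicas in the role.
        "spec": {"minMember": num_replicas},
    }


pg = build_pod_group(
    role_name="Trainer", num_replicas=4, namespace="default", app_id="app-123"
)
print(pg["metadata"]["name"], pg["spec"]["minMember"])
```

With a gang size declared this way, a co-scheduler can hold back every pod in the group until all of them can be placed, which is the gang-scheduling behavior the PR description refers to.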

)
pod.metadata.labels.update(
    pod_labels(
        app, role_idx, role, replica_id, coscheduler_name, unique_app_id
Member

There are a lot of fields here. I think it's better to switch these to keyword arguments (app=app, coscheduler_name=coscheduler_name, ...) to avoid accidentally passing the wrong field to the wrong param.

Contributor Author

Updated the code to include field names.
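
The concern in this thread can be illustrated with a small sketch. It is not the real `pod_labels` signature; the function `make_labels`, its parameters, and the label keys are hypothetical. Declaring parameters as keyword-only (with a bare `*`) goes one step further than the review asked for: positional calls are rejected outright, so two same-typed arguments cannot be swapped silently.

```python
from typing import Dict, Optional


def make_labels(
    *,  # everything after this must be passed by name
    app_name: str,
    role_name: str,
    replica_id: int,
    coscheduler_name: Optional[str],
    app_id: str,
) -> Dict[str, str]:
    """Hypothetical label builder; keys are illustrative, not TorchX's."""
    labels = {
        "example.com/app-name": app_name,
        "example.com/role-name": role_name,
        "example.com/replica-id": str(replica_id),
        "example.com/app-id": app_id,
    }
    if coscheduler_name is not None:
        labels["example.com/coscheduler"] = coscheduler_name
    return labels


labels = make_labels(
    app_name="demo",
    role_name="trainer",
    replica_id=0,
    coscheduler_name=None,
    app_id="demo-abc",
)

# A positional call fails loudly instead of mixing up the fields.
try:
    make_labels("demo", "trainer", 0, None, "demo-abc")  # type: ignore[misc]
except TypeError:
    positional_rejected = True
else:
    positional_rejected = False
print(positional_rejected)
```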

  resource = app_to_resource(
-     app, namespace, service_account, image_secret, priority
+     app, namespace, service_account, image_secret, coscheduler_name, priority
Member

Likewise, use kwargs for this.

Contributor Author

Updated

Contributor Author

Sara-KS commented Feb 27, 2023

2/27/2023: The Component Integration Tests / Kubernetes Dist Train Integration Tests appear to be failing due to a missing environment variable: torchx/scripts/integ_test_utils.py expects CONTAINER_REPO. The current test failures indicate that the container repository is not detected in the test environment, so the test attempts a docker push with the ":tag" format instead of "repo:tag".
Related issue: #694
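
The failure mode described above can be reproduced with a short sketch. This is not the code in integ_test_utils.py; the functions `image_ref` and `image_ref_checked` are hypothetical and only show how an unset CONTAINER_REPO can yield a ":tag" reference, and how failing fast would surface the problem earlier.

```python
from typing import Mapping


def image_ref(tag: str, env: Mapping[str, str]) -> str:
    """Naive version: an unset variable silently becomes an empty repo."""
    repo = env.get("CONTAINER_REPO", "")
    return f"{repo}:{tag}"


def image_ref_checked(tag: str, env: Mapping[str, str]) -> str:
    """Stricter version: refuse to build a reference without a repo."""
    repo = env.get("CONTAINER_REPO")
    if not repo:
        raise RuntimeError("CONTAINER_REPO must be set to build an image reference")
    return f"{repo}:{tag}"


# An explicit mapping is passed instead of os.environ to keep the example
# deterministic; the repository name below is made up.
missing = image_ref("abc123", {})
present = image_ref("abc123", {"CONTAINER_REPO": "example.com/torchx"})
print(missing, present)
```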

@kiukchung
Contributor

> 2/27/2023 Component Integration Tests/ Kubernetes Dist Train Integration Tests appear to be failing due to a missing environment variable: torchx/scripts/integ_test_utils.py expects CONTAINER_REPO. Current test failures indicate that the container repository information is not detected in the test environment and attempts a docker push with ":tag" format instead of "repo:tag". Related issue: #694

Hi @Sara-KS, thanks for taking a look. I actually fixed that particular issue yesterday (see: https://github.com/pytorch/torchx/blob/main/scripts/kube_dist_trainer.py#L45), but the Kubernetes Dist Train Integration Test is failing for a different reason (401 Unauthorized when querying for the status of the job): https://github.com/pytorch/torchx/actions/runs/4277421091/jobs/7446219941

Since I've left Meta and don't have access to TorchX's AWS account, I'm unable to log in and take a deeper look. Perhaps @kurman or @priyaramani can help here.

Long term, @d4l3k and I have discussed moving that integ test to run directly on the CI runner with minikube instead of hitting a deployed EKS cluster. Tristan has already made the components integ test and the kfp integ test do this; we just haven't gotten around to making the same change for the k8s integ test.

FWIW, the Components Integration Tests should be passing now (if you rebase on top of main).

Member

d4l3k commented Feb 27, 2023

That integration test doesn't use the mcad scheduler anyway, so you can just ignore it for this PR.

Member

d4l3k commented Mar 6, 2023

LGTM. The failures look like an oversight on my part in how tagging is handled for forks.

@d4l3k d4l3k merged commit 128a8e5 into pytorch:main Mar 6, 2023
@Sara-KS Sara-KS deleted the torchx-mcad-coscheduler branch March 6, 2023 16:38

4 participants