
Contribute torchx changes from our fork back upstream so that we can stop maintaining a custom fork #2

@KPostOffice

Description

  1. Has the Ray solution been tested with other supported ways to run TorchX-Ray? For example, perhaps on minikube or a traditional HPC-like system.
  2. The dist.ddp component has to work with all supported schedulers, and I suspect the Ray on OCP solution does not support all of them. There are two options:
     - Keep it as a standalone custom component: rename it to something like rayocp.ddp, move it to a separate branch rather than upstreaming it to PyTorch, follow the steps for registering custom components (see the sketch after this list), and update any internal docs.
     - Alternatively, test it with every scheduler that currently supports dist.ddp: local, Docker, Kubernetes (Volcano on plain Kubernetes), Kubernetes-MCAD (plain Kubernetes or OCP), Slurm, AWS Batch, LSF, and GCP Batch.
  3. Make sure it passes the Ray scheduler test: torchx/torchx/schedulers/test/ray_scheduler_test.py
  4. After the steps above, make sure it passes torchx/scripts/lint.sh and torchx/scripts/pyre.sh.
  5. In order to contribute to TorchX, you also need to have a signed CLA in place. For IBM Research, I had to have my GitHub ID added at a corporate-agreement level. I am not sure whether there is a process in place on the Red Hat side or whether you can sign it individually.
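For the standalone-component route in item 2, here is a minimal sketch of what a rayocp.ddp component could look like, assuming the standard TorchX component API (a plain Python function returning specs.AppDef). The module path, image name, and parameters below are placeholders for illustration, not the actual code in our fork:

```python
# my_project/components.py -- hypothetical module path for a standalone rayocp.ddp component.
# Minimal sketch only: the real component would carry the Ray-on-OCP specifics from our fork.
import torchx.specs as specs


def ddp(
    *script_args: str,
    script: str,
    image: str = "quay.io/example/rayocp-trainer:latest",  # placeholder image
    name: str = "rayocp_ddp",
    num_replicas: int = 1,
    cpu: int = 2,
    memMB: int = 4096,
) -> specs.AppDef:
    """Single-role distributed training component launched via torch.distributed.run."""
    return specs.AppDef(
        name=name,
        roles=[
            specs.Role(
                name=name,
                image=image,
                entrypoint="python",
                args=["-m", "torch.distributed.run", script, *script_args],
                num_replicas=num_replicas,
                resource=specs.Resource(cpu=cpu, gpu=0, memMB=memMB),
            )
        ],
    )
```

Registration would then go through the entry-point mechanism described in the TorchX custom components docs (a `torchx.components` entry point in the package's setup, e.g. pointing `rayocp` at `my_project.components`), after which `torchx run rayocp.ddp ...` should resolve without touching the upstream dist.ddp.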

Motivation/Background

We don't want to have to maintain a fork of TorchX, and it would also be nice if TorchX worked on OpenShift by default.

Alternatives

If we can't get the changes into a state acceptable for upstream, we will have to continue maintaining them here indefinitely.

Additional context/links

Will update with upstream PR
