fix: ray module not found handling #1049
|
Please check this fix if you can, @d4l3k, @kiukchung, @tonykao8080 |
|
@andywag has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
@clumsy Can you add the type to `quoted_string`: `quoted_values: List[str] = []`? |
Summary:
TorchX has been handling `ModuleNotFoundError` gracefully for a while now, e.g. for SageMaker when running `torchx runopts` we get:
```
...
(remote jobs) the image repository to use when pushing patched images, must have push access. Ex: example.com/your/container
quiet=QUIET (bool, False)
whether to suppress verbose output for image building. Defaults to ``False``.
aws_sagemaker: No module named 'sagemaker'
gcp_batch:
usage:
[project=PROJECT],[location=LOCATION]
...
```
But for `ray` we get an exception, after which the runopts of the remaining schedulers are never printed:
```
gcp_batch:
usage:
[project=PROJECT],[location=LOCATION]
optional arguments:
project=PROJECT (str, None)
Name of the GCP project. Defaults to the configured GCP project in the environment
location=LOCATION (str, us-central1)
Name of the location to schedule the job in. Defaults to us-central1
Traceback (most recent call last):
File "/usr/local/bin/torchx", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torchx/cli/main.py", line 118, in main
run_main(get_sub_cmds(), argv)
File "/usr/local/lib/python3.10/dist-packages/torchx/cli/main.py", line 114, in run_main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/torchx/cli/cmd_runopts.py", line 36, in run
opts = runner.scheduler_run_opts(scheduler)
File "/usr/local/lib/python3.10/dist-packages/torchx/runner/api.py", line 473, in scheduler_run_opts
return self._scheduler(scheduler).run_opts()
File "/usr/local/lib/python3.10/dist-packages/torchx/runner/api.py", line 718, in _scheduler
sched = factory(self._name, **self._scheduler_params)
File "/usr/local/lib/python3.10/dist-packages/torchx/schedulers/__init__.py", line 39, in run
module = importlib.import_module(path)
File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/usr/local/lib/python3.10/dist-packages/torchx/schedulers/ray_scheduler.py", line 448, in <module>
session_name: str, ray_client: Optional[JobSubmissionClient] = None, **kwargs: Any
NameError: name 'JobSubmissionClient' is not defined
```
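That's because `ray_scheduler` has custom `ModuleNotFoundError` handling, perhaps for historical reasons. Judging by the traceback, the failed `ray` import is swallowed there, so the later module-level reference to `JobSubmissionClient` raises a `NameError` instead of the `ModuleNotFoundError` that `torchx runopts` already handles.
Test Plan: [x] existing tests must pass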
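The failure mode in the traceback can be reproduced in isolation. The following is a minimal sketch, not the actual TorchX code: the dependency name `not_installed_dep_for_demo`, the helpers `load` and `describe_schedulers`, and the two module sources are all hypothetical, and a module-level alias stands in for the annotation shown in the traceback. It assumes the caller only catches `ModuleNotFoundError`, as the `aws_sagemaker` output above suggests.

```python
import types
from typing import Dict, List

# Hypothetical dependency name, assumed not to be installed.
MISSING = "not_installed_dep_for_demo"

# Pattern 1: the import error is swallowed at module level, but a later
# module-level reference to the missing name still runs.
SWALLOWING_SRC = f"""
from typing import Optional
try:
    from {MISSING} import JobSubmissionClient
except ModuleNotFoundError:
    pass
ClientOrNone = Optional[JobSubmissionClient]  # eager reference, raises NameError
"""

# Pattern 2: the import is left unguarded, so the original error propagates.
PROPAGATING_SRC = f"""
from typing import Optional
from {MISSING} import JobSubmissionClient
ClientOrNone = Optional[JobSubmissionClient]
"""


def load(name: str, src: str) -> types.ModuleType:
    """Execute module source in a fresh module object (stand-in for an import)."""
    module = types.ModuleType(name)
    exec(src, module.__dict__)
    return module


def describe_schedulers(sources: Dict[str, str]) -> List[str]:
    """Load each module and report missing dependencies without aborting."""
    lines = []
    for name, src in sources.items():
        try:
            load(name, src)
            lines.append(f"{name}: ok")
        except ModuleNotFoundError as e:
            # Only this exception type is treated as "optional dependency missing".
            lines.append(f"{name}: {e}")
    return lines
```

With `PROPAGATING_SRC`, `describe_schedulers` reports the missing module and moves on to the next entry. With `SWALLOWING_SRC`, the `NameError` is not caught by the loop and aborts it, which matches the behavior described for `ray`.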
Differential Revision: D73751531
Pulled By: andywag
|
@clumsy: You had some formatting issues, which I tried to overwrite, but it wouldn't let me, so I created another PR. You can either move the changes to this PR so we can commit, or we can do it from there. No changes other than formatting. |
Force-pushed from a2d8a47 to e673ee9
|
Fixed formatting in this PR, @andywag. Regarding the type changes you asked for: did you mean the other PR for quoting env variable values? |
Differential Revision: D73751531
Pull Request resolved: #1055
|
@andywag sorry, I don't follow what issue you were observing; I don't see errors in |
|
Sorry, wrong diff for the pyre issue. I updated it on the correct one. This diff got merged with some formatting changes in 1055 |
|
I see, so should I still pull-rebase/fix this one on top? @andywag |
|
Just curious what's the latest on this one, @andywag |