-
Notifications
You must be signed in to change notification settings - Fork 146
fix: ray module not found handling #1049
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Please check this fix if you can, @d4l3k, @kiukchung, @tonykao8080 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
|
@andywag has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
@clumsy Can you add the type to quoted_string quoted_values: List[str] = []? |
Summary:
TorchX has been handling `ModuleNotFoundError` gracefully for a while now, e.g. for SageMaker when running `torchx runopts` we get:
```
...
(remote jobs) the image repository to use when pushing patched images, must have push access. Ex: example.com/your/container
quiet=QUIET (bool, False)
whether to suppress verbose output for image building. Defaults to ``False``.
aws_sagemaker: No module named 'sagemaker'
gcp_batch:
usage:
[project=PROJECT],[location=LOCATION]
...
```
But for `ray` we get an exception after which we won't get next runopts:
```
gcp_batch:
usage:
[project=PROJECT],[location=LOCATION]
optional arguments:
project=PROJECT (str, None)
Name of the GCP project. Defaults to the configured GCP project in the environment
location=LOCATION (str, us-central1)
Name of the location to schedule the job in. Defaults to us-central1
Traceback (most recent call last):
File "/usr/local/bin/torchx", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torchx/cli/main.py", line 118, in main
run_main(get_sub_cmds(), argv)
File "/usr/local/lib/python3.10/dist-packages/torchx/cli/main.py", line 114, in run_main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/torchx/cli/cmd_runopts.py", line 36, in run
opts = runner.scheduler_run_opts(scheduler)
File "/usr/local/lib/python3.10/dist-packages/torchx/runner/api.py", line 473, in scheduler_run_opts
return self._scheduler(scheduler).run_opts()
File "/usr/local/lib/python3.10/dist-packages/torchx/runner/api.py", line 718, in _scheduler
sched = factory(self._name, **self._scheduler_params)
File "/usr/local/lib/python3.10/dist-packages/torchx/schedulers/__init__.py", line 39, in run
module = importlib.import_module(path)
File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/usr/local/lib/python3.10/dist-packages/torchx/schedulers/ray_scheduler.py", line 448, in <module>
session_name: str, ray_client: Optional[JobSubmissionClient] = None, **kwargs: Any
NameError: name 'JobSubmissionClient' is not defined
```
That's because `ray_scheduler` has custom `ModuleNotFoundException` handling - perhaps for historic reasons.
Test Plan: [x] existing test must pass
Differential Revision: D73751531
Pulled By: andywag
|
@clumsy : You had some formatting issues which I tried to overwrite but it wouldn't let me so I created another PR. You can either move the changes to this PR so we can commit or we can do it from there. No chnages other than formatting. |
a2d8a47 to
e673ee9
Compare
|
Fixed formatting in this PR, @andywag Regarding type changes you've asked - did you mean the other PR for quoting env variable values? |
Differential Revision: D73751531 Pull Request resolved: #1055
|
@andywag sorry I don't follow what's the issue you were observing, I don't see errors in |
|
Sorry, wrong diff for the pyre issue. I updated it on the correct one. This diff got merged with some formatting changes in 1055 |
|
I see, so should I still pull-rebase/fix this one on top? @andywag |
|
Just curious what's the latest on this one, @andywag |
TorchX has been handling
ModuleNotFoundErrorgracefully for a while now, e.g. for SageMaker when runningtorchx runoptswe get:But for
raywe get an exception after which we won't get next runopts:That's because
ray_schedulerhas customModuleNotFoundExceptionhandling - perhaps for historic reasons.Test plan:
[x] existing test must pass