Skip to content

fix: ray module not found handling #1049

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

clumsy
Copy link
Contributor

@clumsy clumsy commented Apr 23, 2025

TorchX has been handling ModuleNotFoundError gracefully for a while now, e.g. for SageMaker when running torchx runopts we get:

...
            (remote jobs) the image repository to use when pushing patched images, must have push access. Ex: example.com/your/container
        quiet=QUIET (bool, False)
            whether to suppress verbose output for image building. Defaults to ``False``.

aws_sagemaker: No module named 'sagemaker'

gcp_batch:
    usage:
        [project=PROJECT],[location=LOCATION]
...

But for ray we get an exception after which we won't get next runopts:

gcp_batch:
    usage:
        [project=PROJECT],[location=LOCATION]

    optional arguments:
        project=PROJECT (str, None)
            Name of the GCP project. Defaults to the configured GCP project in the environment
        location=LOCATION (str, us-central1)
            Name of the location to schedule the job in. Defaults to us-central1

Traceback (most recent call last):
  File "/usr/local/bin/torchx", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torchx/cli/main.py", line 118, in main
    run_main(get_sub_cmds(), argv)
  File "/usr/local/lib/python3.10/dist-packages/torchx/cli/main.py", line 114, in run_main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/torchx/cli/cmd_runopts.py", line 36, in run
    opts = runner.scheduler_run_opts(scheduler)
  File "/usr/local/lib/python3.10/dist-packages/torchx/runner/api.py", line 473, in scheduler_run_opts
    return self._scheduler(scheduler).run_opts()
  File "/usr/local/lib/python3.10/dist-packages/torchx/runner/api.py", line 718, in _scheduler
    sched = factory(self._name, **self._scheduler_params)
  File "/usr/local/lib/python3.10/dist-packages/torchx/schedulers/__init__.py", line 39, in run
    module = importlib.import_module(path)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torchx/schedulers/ray_scheduler.py", line 448, in <module>
    session_name: str, ray_client: Optional[JobSubmissionClient] = None, **kwargs: Any
NameError: name 'JobSubmissionClient' is not defined

That's because ray_scheduler has custom ModuleNotFoundException handling - perhaps for historic reasons.

Test plan:
[x] existing test must pass

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 23, 2025
@clumsy
Copy link
Contributor Author

clumsy commented Apr 23, 2025

Please check this fix if you can, @d4l3k, @kiukchung, @tonykao8080

Copy link
Contributor

@andywag andywag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@facebook-github-bot
Copy link
Contributor

@andywag has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants