Fixes ray lazy metric reporting and hanging processes #2346
Conversation
Hi @garylvov, I have left the PR as a draft so you can first test these on your multi-GPU setup. Moreover, I have noticed a new issue apart from the ones in #2328: You are searching for a very specific pattern to extract the experiment name here:
However, in skrl, this line is missing a colon:
and, in rl_games, it does not exist at all, hence causing issues for Ray. As a simple workaround, we can make sure all train.py scripts output this line as you expect, but it does not seem like a very robust solution, as one might change these scripts in the future without any consideration for the ray scripts.
Do you have any idea how to make this process (i.e., extracting the experiment name) more robust?
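For reference, the kind of line matching I mean looks roughly like this (the pattern, log line, and function below are only an illustration, not the exact code in ray/tuner.py):

import re

# Illustrative only: the exact pattern and log line used by ray/tuner.py are not reproduced here.
_EXPERIMENT_LINE = re.compile(r"Exact experiment name requested from command line:\s*(\S+)")

def extract_experiment_name(process_output: str) -> str | None:
    """Return the experiment name if the expected log line is present, else None."""
    for line in process_output.splitlines():
        match = _EXPERIMENT_LINE.search(line)
        if match:
            return match.group(1)
    return None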
Thanks for the PR @ozhanozen, I will test this soon. For RL Games, that line should be automatically printed by the underlying wrapper; when I developed the ray functionality I tested everything with RL Games. Yes, this line matching is a weakness; if you could fix the SKRL typo in this PR, that would be much appreciated. Also, if you can leave a comment in all of the training scripts, directly above that line, linking to this PR comment, I think this will help make sure it's not changed moving forward (and we can let @kellyguo11 know, which should help ;) ).

As for extracting the experiment name, and the issues with extracting the logs at each Ray trainer step: the underlying cause is that all of the training scripts are only scripts, with no importable methods or introspection, and with notable differences between runners. Ray would work best if it were possible to extract the experiment information directly from the runner after each step. However, each environment wrapper is quite different (for example, the SKRL wrapper is mainly housed in the SKRL library itself, and Isaac Lab has a very thin wrapper here), and there is no standardized way to extract environment information when launching from the training scripts. I think the most robust solution would be to have a single unified training runner, which would abstract the differences between the individual training runners; logs could then be fetched directly from the imported runner (see the sketch below).

So to summarize, yes, my Ray stuff includes two notable hacks (kicking off the training with a subprocess as opposed to directly with Python, and scraping logs from TensorBoard instead of directly from the environment) that are the consequence of the training functionality existing in standalone scripts without a unified interface. During one of the Isaac Lab dev meetings, I pitched unifying the standalone scripts into a single importable functionality so that runs could be launched and introspected programmatically more easily, but the overall consensus was that it was more trouble than it was worth. That being said, if the community were to contribute/really desire such a unified interface, I'd be happy to review/help. If anyone has any suggestions on how to more elegantly get the desired functionality without large changes to the code base, I am open to ideas @Mayankm96
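To illustrate what I mean by a unified runner (purely hypothetical, none of these names exist in Isaac Lab today), the interface could look roughly like this:

from abc import ABC, abstractmethod


class UnifiedTrainingRunner(ABC):
    """Hypothetical sketch of a runner interface that would let Ray introspect
    training runs directly instead of scraping logs from a subprocess."""

    @abstractmethod
    def step(self) -> None:
        """Advance training by one iteration."""

    @abstractmethod
    def get_metrics(self) -> dict[str, float]:
        """Return the latest scalar metrics (e.g., mean reward, episode length)."""

    @property
    @abstractmethod
    def experiment_name(self) -> str:
        """Name of the experiment / log directory for the current run."""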
Hi @garylvov, thank you for your reply. Your solution makes sense as a pragmatic workaround given the current limitations, at least until a more unified runner or standardized interface is introduced. What do you think about adding a check right before the following line:
something like:
so that, even if Ray fails to extract the experiment information due to a future change, the issue will be more visible and easier to debug. In any case, I have added the changes you asked for: fixing the skrl typo and adding a comment in all of the training scripts.
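Roughly, the check I have in mind is like the following (the function and variable names are placeholders, not the exact code):

def check_experiment_info(experiment_name: str | None, logdir: str | None) -> None:
    """Sketch with assumed names: fail loudly if the experiment information
    could not be parsed from the training process output."""
    if experiment_name is None or logdir is None:
        raise ValueError(
            "Could not extract the experiment name/logdir from the training process output. "
            "Check that the train.py script still prints the expected experiment-name line."
        )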
That extra check looks good; I'd just make sure to propagate the error up to the Ray tuner, as otherwise it can be silently squashed. I plan to test tonight; thank you for updating the PR!
@@ -70,6 +70,8 @@ class IsaacLabTuneTrainable(tune.Trainable):
     def setup(self, config: dict) -> None:
         """Get the invocation command, return quick for easy scheduling."""
         self.data = None
+        self.data_freeze_duration = 0.0
+        self._DATA_FREEZE_DURATION_THRESHOLD = 180.0
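To illustrate how these two attributes are meant to be used, here is a small standalone sketch of the freeze-duration idea (this is not the actual step() implementation; names and structure are assumptions):

import time


class FreezeWatchdog:
    """Standalone sketch: track how long the reported metrics have stayed unchanged
    and flag the training process as hung once a threshold is exceeded."""

    def __init__(self, threshold_s: float = 180.0) -> None:
        self.threshold_s = threshold_s
        self.data_freeze_duration = 0.0
        self._last_data: dict | None = None
        self._last_check = time.monotonic()

    def update(self, latest_data: dict | None) -> bool:
        """Return True if no new data has arrived for longer than the threshold."""
        now = time.monotonic()
        elapsed, self._last_check = now - self._last_check, now
        if latest_data and latest_data != self._last_data:
            self._last_data = latest_data
            self.data_freeze_duration = 0.0
        else:
            self.data_freeze_duration += elapsed
        return self.data_freeze_duration > self.threshold_s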
Nit: it may be good to be able to increase this timeout from the command line with the argparser.
I have done this:
parser.add_argument(
"--data-freeze-threshold",
type=float,
default=DATA_FREEZE_DURATION_THRESHOLD,
help="Seconds to wait with no new tensorboard scalars before terminating the training workflow process",
)
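and then the parsed value can be threaded into the trial config, e.g. (the config key below is only an illustration):

args = parser.parse_args()
# Pass the CLI value through the Ray trial config so setup() can read it (key name is illustrative only).
trial_config = {"data_freeze_threshold": args.data_freeze_threshold}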
Hi @ozhanozen, thank you again for this fix!
I just tuned on a 4 GPU machine with RL Games on an in-hand manipulation task in a fresh docker with your changes, and everything seemed to be working as desired without any interruptions.
@kellyguo11 we should be all set 😎
Hi @ozhanozen, can you please update this PR to main? Alternatively, I'm not sure if you enabled contributor editing in the PR; otherwise, I'd just click the update button I normally see (maybe I need to mark it ready for review?)
I have done it as follows:
This basically flags the trial with an error if the logs cannot be extracted. However, the limitations are: 1) it still needs to wait until the process is over before doing this check, due to the for loop before it, and 2) even if it flags one trial with an error, Ray will still try the other trials, as it does not stop after one failed trial. Do you have any idea if we can optimize this further? One case where this creates a problem is when we cannot extract the logs correctly and the process hangs at the same time. In such a case, the early termination check I have inserted with
For 1, I think we could add a maximum number of lines that the experiment name / log dir can be extracted from, and/or a maximum timeout for extracting the experiment info (it could maybe share the same timeout value already in the argparser).

For 2, I think you can throw a specific error, catch it in the main tuner, and then rethrow a top-level error if you receive this error a certain number of times. I think I configured things so that errors get squashed by Ray (in case a hyperparameter leads to one of the tasks running out of VRAM, for example) so that tuning jobs continue. But if you raise a specific error type in the util file, and then catch and rethrow it in the tuner itself, this could work (see the sketch below).
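Roughly something like this (the error name, counter, and failure threshold are placeholders, not existing code):

class LogExtractionError(RuntimeError):
    """Hypothetical error raised when the experiment name/logdir cannot be parsed."""


MAX_LOG_EXTRACTION_FAILURES = 3
_log_extraction_failures = 0


def handle_trial_error(err: Exception) -> None:
    """Swallow isolated extraction failures, but abort once they become systematic."""
    global _log_extraction_failures
    if isinstance(err, LogExtractionError):
        _log_extraction_failures += 1
        if _log_extraction_failures >= MAX_LOG_EXTRACTION_FAILURES:
            raise RuntimeError(
                "Log extraction failed repeatedly across trials; aborting the tuning job."
            ) from err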
@garylvov I have noticed another minor problem. Some libraries (e.g., skrl) log metrics with characters that create problems for the TensorBoard hparams dashboard (and maybe potentially with Ray?). Hence, I made the following modification to your function to convert such characters to underscores:
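i.e., something along these lines (the helper below is an approximation, not the exact diff):

import re


def sanitize_metric_name(name: str) -> str:
    """Replace characters that the TensorBoard hparams dashboard may reject with '_'.
    Example: 'Reward / Total reward (mean)' -> 'Reward___Total_reward__mean_'."""
    return re.sub(r"[^A-Za-z0-9_\-.]", "_", name)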
Moreover, I added the metric argument to TuneConfig so that the best trial would be reported periodically:
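e.g., roughly like this (the metric name, mode, and sample count are placeholders):

from ray import tune

tune_config = tune.TuneConfig(
    metric="rewards/time",  # placeholder metric key
    mode="max",             # report the best trial as the one maximizing it
    num_samples=20,         # example sample count
)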
Please let me know if this is ok.
Yep, I noticed that; looks good!
I have updated it to main. It is now marked "ready for review", but I couldn't find the option to "Allow edits from maintainers."
Ok, these make sense. I will try them in the next few days. Meanwhile, I will keep the PR in draft until this is done.
Description
The step() function of ray/tuner.py has some issues preventing one from having an uninterrupted ray hyperparameter tuning session. Please refer to #2328 for details.
Fixes #2328.
Type of change
Checklist
- I have run the pre-commit checks with ./isaaclab.sh --format
- I have updated the corresponding version in the extension's config/extension.toml file
- I have added my name to the CONTRIBUTORS.md or my name already exists there