Fixes ray lazy metric reporting and hanging processes #2346


Draft
wants to merge 11 commits into base: main

Conversation

ozhanozen
Contributor

Description

The step() function of ray/tuner.py has several issues that prevent an uninterrupted Ray hyperparameter tuning session. Please refer to #2328 for details.

Fixes #2328.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

@ozhanozen
Contributor Author

ozhanozen commented Apr 21, 2025

Hi @garylvov, I have left the PR as a draft so you can first test these on your multi-GPU setup. Moreover, I have noticed a new issue apart from the ones in #2328:

You are searching for a very specific pattern to extract the experiment name here:

experiment_info_pattern = re.compile("Exact experiment name requested from command line: (.+)")

However, in skrl this line is missing the colon:
print(f"Exact experiment name requested from command line {log_dir}")

and in rl_games it does not exist at all, which causes issues for Ray. As a simple workaround, we can make sure all train.py scripts output this line as you expect, but that does not seem very robust, since someone might change these scripts in the future without considering the Ray scripts.

Do you have any idea how to make this process (i.e., extracting the experiment name) more robust?
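(For illustration, one partial mitigation might be to make the pattern itself tolerant of the missing colon and extra whitespace, though that only papers over the problem; this is just a sketch, not necessarily the right fix:)

import re

# Sketch: accept the line with or without the colon and with variable whitespace.
experiment_info_pattern = re.compile(r"Exact experiment name requested from command line:?\s*(.+)")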

@garylvov
Contributor

garylvov commented Apr 21, 2025

Thanks for the PR @ozhanozen, I will test this soon.

For RL games, that line should be automatically printed by the underlying wrapper; when I developed the ray functionality I tested everything with RL games.

Yes, this line-matching is a weakness; if you could fix the SKRL typo in this PR, that would be much appreciated. Also, if you can leave a comment in all of the training scripts, directly above that line, linking to this PR comment, I think this will help make sure it's not changed moving forward (and we can let @kellyguo11 know, which should help ;) )
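For reference, the comment could look roughly like this (exact wording up to you; the variable name is taken from the skrl script):

# NOTE: the following print statement is parsed by the Ray tuner (ray/tuner.py) to extract
# the experiment name; if you change its format, update the Ray scripts as well (see #2346).
print(f"Exact experiment name requested from command line: {log_dir}")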

As for extracting the experiment name, and the issues with extracting the logs at each Ray trainer step, the underlying cause is that the training workflows exist only as standalone scripts, with no importable methods or introspection, and with notable differences between runners.

Ray would work best if it was possible to extract the experiment information directly from the runner after each step. However, each environment wrapper is quite different (for example, the SKRL wrapper is mainly housed in the SKRL library itself, and Isaac Lab has a very thin wrapper here).

There is no standardized way to extract environment information when launching from the training scripts, as it's always something like runner.run(env), with no interaction after configuration and launch.

I think the most robust solution would be a single unified training runner that abstracts the differences between the individual runners; logs could then be fetched directly from the imported runner. Although the environments' .step() returns metrics that could feed the Ray runner directly, without my tensorboard hack, this does not help much, because the runners are configured and launched from within standalone scripts that can't really be imported. If the configuration and launch could be abstracted into a single importable method or class that provided a way to fetch the logs, this would fix the issue.
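Purely as a hypothetical illustration (none of these names exist in Isaac Lab today), such an importable runner interface could look something like:

from typing import Protocol


class UnifiedRunner(Protocol):
    """Hypothetical unified interface over the per-library runners (rl_games, skrl, rsl_rl, ...)."""

    def configure(self, task: str, hyperparameters: dict) -> None:
        """Build the environment and the underlying library's runner from one config."""

    def step(self) -> dict:
        """Advance training and return the latest metrics directly, without tensorboard scraping."""

    def log_dir(self) -> str:
        """Return the experiment's logging directory without parsing stdout."""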

So to summarize, yes, my Ray stuff includes two notable hacks (kicking off the training with a subprocess as opposed to directly with python, and scraping logs from tensorboard instead of directly from the environment) that are the consequence of the training functionality existing in standalone scripts without a unified interface.

During one of the Isaac Lab dev meetings, I pitched unifying the standalone scripts into a single importable functionality so that runs could be more easily launched and introspected programmatically, but the overall consensus was that it was more trouble than it was worth. That being said, if the community were to contribute/really desire such a unified interface, I'd be happy to review/to help.

If anyone has any suggestions on how to more elegantly get the desired functionality without large changes to the code base, I am open to ideas @Mayankm96

@ozhanozen
Contributor Author

Hi @garylvov, thank you for your reply. Your solution makes sense as a pragmatic workaround given the current limitations, at least until a more unified runner or standardized interface is introduced.

What do you think about adding a check right before the following line:


something like:

if not (experiment_name and logdir):
    # give a warning, terminate the subprocess, and later stop Ray
    ...

so that, even if Ray fails to extract the experiment information due to a future change, the issue will be more visible and easier to debug.

In any case, I have added the changes you asked for: fixing the skrl typo and adding a comment in all train.py scripts on the relevant line (except rl_games, since it handles the logging outside of Isaac Lab). Please let me know once you have tested the PR on your side.

@garylvov
Contributor

garylvov commented Apr 22, 2025

That extra check looks good; I'd just make sure to propagate the error up to the Ray tuner, as errors can otherwise be silently squashed.

I plan to test tonight; thank you for updating the PR!

@@ -70,6 +70,8 @@ class IsaacLabTuneTrainable(tune.Trainable):
     def setup(self, config: dict) -> None:
         """Get the invocation command, return quick for easy scheduling."""
         self.data = None
+        self.data_freeze_duration = 0.0
+        self._DATA_FREEZE_DURATION_THRESHOLD = 180.0
Contributor

Nit: it may be good to be able to adjust this timeout from the command line with the argparser

Contributor Author

I have done this:

parser.add_argument(
    "--data-freeze-threshold",
    type=float,
    default=DATA_FREEZE_DURATION_THRESHOLD,
    help="Seconds to wait with no new tensorboard scalars before terminating the training workflow process",
)
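For clarity, the idea behind the threshold is roughly the following (a standalone sketch with made-up names, not the exact code in the PR): keep a timer of how long the tensorboard scalars have stayed unchanged, and terminate the workflow process once the timer exceeds the threshold.

import time

# Sketch: decide whether the training workflow looks hung based on unchanged tensorboard data.
def update_freeze_timer(
    previous: dict | None,
    latest: dict | None,
    frozen_for: float,
    last_check: float,
    threshold: float,
) -> tuple[float, float, bool]:
    """Return (new_frozen_for, new_last_check, should_terminate)."""
    now = time.monotonic()
    if latest and latest != previous:
        # Fresh scalars arrived, so reset the freeze timer.
        return 0.0, now, False
    # No new scalars since the last step; accumulate the frozen duration.
    frozen_for += now - last_check
    return frozen_for, now, frozen_for > threshold

In step(), a True should_terminate would then trigger terminating the training subprocess and reporting the trial as failed instead of hanging forever.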

Contributor

@garylvov left a comment

Hi @ozhanozen, thank you again for this fix!

I just tuned on a 4 GPU machine with RL Games on an in-hand manipulation task in a fresh docker with your changes, and everything seemed to be working as desired without any interruptions.

@kellyguo11 we should be all set 😎


@garylvov
Contributor

garylvov commented Apr 24, 2025

Hi @ozhanozen, can you please update this PR to main? Alternatively, I'm not sure if you enabled contributor editing on the PR; otherwise I'd just click the update button I normally see (maybe I need to mark it ready for review?)

@glvov-bdai marked this pull request as ready for review on April 24, 2025, 17:09
@ozhanozen
Contributor Author

That extra check looks good; I'd just make sure to propagate the error up to the Ray tuner, as errors can otherwise be silently squashed.

I have done it as follows:

if extract_experiment and not (experiment_name and logdir):
    error_msg = (
        "Could not extract experiment_name/logdir from trainer output "
        f"(experiment_name={experiment_name!r}, logdir={logdir!r}). "
        "Make sure your training script prints the following correctly:\n"
        "\t\tExact experiment name requested from command line: <name>\n"
        "\t\t[INFO] Logging experiment in directory: <logdir>\n\n"
    )
    print(f"[ERROR]: {error_msg}")
    raise RuntimeError("Could not extract experiment_name/logdir from trainer output.")

This basically flags the trial with an error if the logs cannot be extracted. However, there are limitations: 1) it still needs to wait until the process is over before doing this check, because of the preceding for loop; 2) even if it flags one trial with an error, Ray will still try the other trials, as it does not stop after one failed trial.

Do you have any idea how we can optimize this further? One case where this creates a problem is when we cannot extract the logs correctly and the process hangs at the same time. In such a case, the early-termination check I inserted with DATA_FREEZE_DURATION_THRESHOLD won't be reached, and it will hang forever.

@glvov-bdai
Collaborator

glvov-bdai commented Apr 24, 2025

For 1, I think we could add a maximum number of lines that the experiment name / log dir can be extracted from, and/or a maximum timeout for extracting the experiment info (it can maybe share the same timeout value already in the argparser).

For 2, I think you can throw a specific error, catch it in the main tuner, and then rethrow a top-level error if you receive this error a certain number of times. I think I configured Ray so that errors get squashed (in case a hyperparameter leads to one of the tasks running out of VRAM, for example) so that tuning jobs continue. But if you raise a specific error type in the util file, and then catch and rethrow it in the tuner itself, this could work.
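A rough sketch of that catch-and-rethrow idea (names are placeholders, not existing code):

# In the util module: a dedicated error type for extraction failures.
class ExperimentInfoNotFoundError(RuntimeError):
    """Raised when experiment_name/logdir cannot be parsed from the trainer output."""


# In the tuner: tolerate isolated trial errors, but give up after repeated extraction failures.
MAX_EXTRACTION_FAILURES = 3


def handle_trial_error(err: Exception, failure_count: int) -> int:
    """Plain helper (not a Ray callback): count extraction failures and abort when too many occur."""
    if isinstance(err, ExperimentInfoNotFoundError):
        failure_count += 1
        if failure_count >= MAX_EXTRACTION_FAILURES:
            raise RuntimeError("Repeatedly failed to extract experiment info; aborting tuning.") from err
    # Other errors (e.g. an OOM caused by a bad hyperparameter) stay squashed so tuning continues.
    return failure_count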

@ozhanozen
Contributor Author

@garylvov I have noticed another minor problem. Some libraries (e.g., skrl) log metrics with characters that create problems for the tensorboard hparams dashboard (and maybe for Ray as well?), for example myreward(average) (brackets) and Info / myreward (spaces).

Hence I made the following modification to your function to convert such characters to underscores:

def load_tensorboard_logs(directory: str) -> dict:
    """From a tensorboard directory, get the latest scalar values. If the logs can't be
    found, check the summaries sublevel.

    Args:
        directory: The directory of the tensorboard logging.

    Returns:
        The latest available scalar values.
    """

    # replace any char that is not alphanumeric, "_", ".", or "/" with "_", then collapse runs of "_"
    def replace_invalid_chars(t):
        t2 = re.sub(r"[^0-9A-Za-z_./]", "_", t)
        t2 = re.sub(r"_+", "_", t2)
        return t2.strip("_")

    # Initialize the event accumulator with a size guidance for only the latest entry
    def get_latest_scalars(path: str) -> dict:
        event_acc = EventAccumulator(path, size_guidance={"scalars": 1})
        try:
            event_acc.Reload()
            if event_acc.Tags()["scalars"]:
                return {
                    replace_invalid_chars(tag): event_acc.Scalars(tag)[-1].value
                    for tag in event_acc.Tags()["scalars"]
                    if event_acc.Scalars(tag)
                }
        except (KeyError, OSError, RuntimeError, DirectoryDeletedError):
            return {}

    scalars = get_latest_scalars(directory)
    return scalars or get_latest_scalars(os.path.join(directory, "summaries"))
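For instance, with this sanitization the problematic tags mentioned above would be rewritten as follows (values worked out by hand from the regex, so treat them as illustrative):

>>> replace_invalid_chars("myreward(average)")
'myreward_average'
>>> replace_invalid_chars("Info / myreward")
'Info_/_myreward'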

and, moreover, I added the metric argument to TuneConfig so that the best trial is reported periodically:

# Configure the tuning job
tuner = tune.Tuner(
    IsaacLabTuneTrainable,
    param_space=cfg,
    tune_config=tune.TuneConfig(
        metric=args.metric,
        mode=args.mode,
        search_alg=repeat_search,
        num_samples=args.num_samples,
        reuse_actors=True,
    ),
    run_config=run_config,
)

Please let me know if this is ok.

@glvov-bdai
Collaborator

Yep, I noticed that; it looks good!

@ozhanozen
Contributor Author

Hi @ozhanozen, can you please update this PR to main? Alternatively, I'm not sure if you enabled contributor editing on the PR; otherwise I'd just click the update button I normally see (maybe I need to mark it ready for review?)

I have updated it to main. It is now marked "ready for review", but I couldn't find the option to "Allow edits from maintainers."

@ozhanozen
Contributor Author

For 1, I think we could add a maximum number of lines that the experiment name / log dir can be extracted from, and/or a maximum timeout for extracting the experiment info (it can maybe share the same timeout value already in the argparser).

For 2, I think you can throw a specific error, catch it in the main tuner, and then rethrow a top-level error if you receive this error a certain number of times. I think I configured Ray so that errors get squashed (in case a hyperparameter leads to one of the tasks running out of VRAM, for example) so that tuning jobs continue. But if you raise a specific error type in the util file, and then catch and rethrow it in the tuner itself, this could work.

Ok, these make sense. I will try them in the next few days. Meanwhile, I will keep the PR in draft until it is done.

@ozhanozen marked this pull request as draft on April 24, 2025, 17:35
Successfully merging this pull request may close these issues:

[Bug Report] Multiple issues within ray/tuner.py

3 participants