Fixes ray lazy metric reporting and hanging processes #2346


Draft
wants to merge 11 commits into base: main

Conversation

ozhanozen
Contributor

Description

The step() function of ray/tuner.py has several issues that prevent an uninterrupted Ray hyperparameter tuning session. Please refer to #2328 for details.

Fixes #2328.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

@ozhanozen
Contributor Author

ozhanozen commented Apr 21, 2025

Hi @garylvov, I have left the PR as a draft so you can first test these on your multi-GPU setup. Moreover, I have noticed a new issue apart from the ones in #2328:

You are searching for a very specific pattern to extract the experiment name here:

experiment_info_pattern = re.compile("Exact experiment name requested from command line: (.+)")

However, in skrl this line is missing the colon:
print(f"Exact experiment name requested from command line {log_dir}")

and in rl_games it does not exist at all, which causes issues for Ray. As a simple workaround, we can make sure all train.py scripts output this line as you expect, but that does not seem very robust, since someone might change these scripts in the future without considering the Ray scripts.

Do you have any idea how to make this process (i.e., extracting the experiment name) more robust?
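(For illustration, one partial mitigation might be to make the pattern itself tolerant of the missing colon and extra whitespace, though that only papers over the problem; this is just a sketch, not necessarily the right fix:)

import re

# Sketch: accept the line with or without the colon and with variable whitespace.
experiment_info_pattern = re.compile(r"Exact experiment name requested from command line:?\s*(.+)")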

@garylvov
Contributor

garylvov commented Apr 21, 2025

Thanks for the PR @ozhanozen, I will test this soon.

For RL games, that line should be automatically printed by the underlying wrapper; when I developed the ray functionality I tested everything with RL games.

Yes, this line-matching is a weakness; if you could fix the SKRL typo in this PR, that would be much appreciated. Also, if you can leave a comment in all of the training scripts, directly above that line, linking to this PR comment, I think this will help make sure it's not changed moving forward (and we can let @kellyguo11 know, which should help ;) )
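For reference, the comment could look roughly like this (exact wording up to you; the variable name is taken from the skrl script):

# NOTE: the following print statement is parsed by the Ray tuner (ray/tuner.py) to extract
# the experiment name; if you change its format, update the Ray scripts as well (see #2346).
print(f"Exact experiment name requested from command line: {log_dir}")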

As for extracting the experiment name, and the issues with extracting the logs at each Ray trainer step, the underlying cause is that the training workflows exist only as standalone scripts, with no importable methods or introspection, and with notable differences between runners.

Ray would work best if it was possible to extract the experiment information directly from the runner after each step. However, each environment wrapper is quite different (for example, the SKRL wrapper is mainly housed in the SKRL library itself, and Isaac Lab has a very thin wrapper here).

There is no standardized way to extract environment information when launching from the training scripts, as it's always something like runner.run(env), with no interaction after configuration and launch.

I think the most robust solution would be a single unified training runner that abstracts the differences between the individual runners; logs could then be fetched directly from the imported runner. Although the environments' .step() returns metrics that could feed the Ray runner directly, without my tensorboard hack, this does not help much, because the runners are configured and launched from within standalone scripts that can't really be imported. If the configuration and launch could be abstracted into a single importable method or class that provided a way to fetch the logs, this would fix the issue.
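Purely as a hypothetical illustration (none of these names exist in Isaac Lab today), such an importable runner interface could look something like:

from typing import Protocol


class UnifiedRunner(Protocol):
    """Hypothetical unified interface over the per-library runners (rl_games, skrl, rsl_rl, ...)."""

    def configure(self, task: str, hyperparameters: dict) -> None:
        """Build the environment and the underlying library's runner from one config."""

    def step(self) -> dict:
        """Advance training and return the latest metrics directly, without tensorboard scraping."""

    def log_dir(self) -> str:
        """Return the experiment's logging directory without parsing stdout."""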

So to summarize, yes, my Ray stuff includes two notable hacks (kicking off the training with a subprocess as opposed to directly with python, and scraping logs from tensorboard instead of directly from the environment) that are the consequence of the training functionality existing in standalone scripts without a unified interface.

During one of the Isaac Lab dev meetings, I pitched unifying the standalone scripts into a single importable functionality so that runs could be more easily launched and introspected programmatically, but the overall consensus was that it was more trouble than it was worth. That being said, if the community were to contribute/really desire such a unified interface, I'd be happy to review/to help.

If anyone has any suggestions on how to more elegantly get the desired functionality without large changes to the code base, I am open to ideas @Mayankm96

@ozhanozen
Contributor Author

Hi @garylvov, thank you for your reply. Your solution makes sense as a pragmatic workaround given the current limitations, at least until a more unified runner or standardized interface is introduced.

What do you think about adding a check right before the following line:


something like:

if not (experiment_name and logdir):
    # give a warning, terminate the subprocess, and later stop Ray
    ...

so that, even if Ray fails to extract the experiment information due to a future change, the issue will be more visible and easier to debug.

In any case, I have added the changes you asked for: fixing the skrl typo and adding a comment in all train.py scripts on the relevant line (except rl_games, since it handles the logging outside of Isaac Lab). Please let me know once you have tested the PR on your side.

@garylvov
Contributor

garylvov commented Apr 22, 2025

That extra check looks good; I'd just make sure to propagate the error up to the Ray tuner, as errors can otherwise be silently squashed.

I plan to test tonight; thank you for updating the PR!

@@ -70,6 +70,8 @@ class IsaacLabTuneTrainable(tune.Trainable):
     def setup(self, config: dict) -> None:
         """Get the invocation command, return quick for easy scheduling."""
         self.data = None
+        self.data_freeze_duration = 0.0
+        self._DATA_FREEZE_DURATION_THRESHOLD = 180.0
Contributor

Nit: it may be good to be able to adjust this timeout from the command line with the argparser

Contributor Author

I have done this:

parser.add_argument(
    "--data-freeze-threshold",
    type=float,
    default=DATA_FREEZE_DURATION_THRESHOLD,
    help="Seconds to wait with no new tensorboard scalars before terminating the training workflow process",
)
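For clarity, the idea behind the threshold is roughly the following (a standalone sketch with made-up names, not the exact code in the PR): keep a timer of how long the tensorboard scalars have stayed unchanged, and terminate the workflow process once the timer exceeds the threshold.

import time

# Sketch: decide whether the training workflow looks hung based on unchanged tensorboard data.
def update_freeze_timer(
    previous: dict | None,
    latest: dict | None,
    frozen_for: float,
    last_check: float,
    threshold: float,
) -> tuple[float, float, bool]:
    """Return (new_frozen_for, new_last_check, should_terminate)."""
    now = time.monotonic()
    if latest and latest != previous:
        # Fresh scalars arrived, so reset the freeze timer.
        return 0.0, now, False
    # No new scalars since the last step; accumulate the frozen duration.
    frozen_for += now - last_check
    return frozen_for, now, frozen_for > threshold

In step(), a True should_terminate would then trigger terminating the training subprocess and reporting the trial as failed instead of hanging forever.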

Contributor

@garylvov left a comment

Hi @ozhanozen, thank you again for this fix!

I just tuned on a 4 GPU machine with RL Games on an in-hand manipulation task in a fresh docker with your changes, and everything seemed to be working as desired without any interruptions.

@kellyguo11 we should be all set 😎


@garylvov
Contributor

garylvov commented Apr 24, 2025

Hi @ozhanozen, can you please update this PR to main? Alternatively, I'm not sure if you enabled contributor editing on the PR; otherwise I'd just click the update button I normally see (maybe I need to mark it ready for review?)

@glvov-bdai marked this pull request as ready for review on April 24, 2025, 17:09
@ozhanozen
Contributor Author

That extra check looks good; I'd just make sure to propagate the error up to the Ray tuner, as errors can otherwise be silently squashed.

I have done it as follows:

if extract_experiment and not (experiment_name and logdir):
    error_msg = (
        "Could not extract experiment_name/logdir from trainer output "
        f"(experiment_name={experiment_name!r}, logdir={logdir!r}). "
        "Make sure your training script prints the following correctly:\n"
        "\t\tExact experiment name requested from command line: <name>\n"
        "\t\t[INFO] Logging experiment in directory: <logdir>\n\n"
    )
    print(f"[ERROR]: {error_msg}")
    raise RuntimeError("Could not extract experiment_name/logdir from trainer output.")

This basically flags the trial with an error if the logs cannot be extracted. However, there are limitations: 1) it still needs to wait until the process is over before doing this check, because of the preceding for loop; 2) even if it flags one trial with an error, Ray will still try the other trials, as it does not stop after one failed trial.

Do you have any idea how we can optimize this further? One case where this creates a problem is when we cannot extract the logs correctly and the process hangs at the same time. In such a case, the early-termination check I inserted with DATA_FREEZE_DURATION_THRESHOLD won't be reached, and it will hang forever.

@glvov-bdai
Collaborator

glvov-bdai commented Apr 24, 2025

For 1, I think we could add a maximum number of lines that the experiment name / log dir can be extracted from, and/or a maximum timeout for extracting the experiment info (it can maybe share the same timeout value already in the argparser).

For 2, I think you can throw a specific error, catch it in the main tuner, and then rethrow a top-level error if you receive this error a certain number of times. I think I configured Ray so that errors get squashed (in case a hyperparameter leads to one of the tasks running out of VRAM, for example) so that tuning jobs continue. But if you raise a specific error type in the util file, and then catch and rethrow it in the tuner itself, this could work.
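A rough sketch of that catch-and-rethrow idea (names are placeholders, not existing code):

# In the util module: a dedicated error type for extraction failures.
class ExperimentInfoNotFoundError(RuntimeError):
    """Raised when experiment_name/logdir cannot be parsed from the trainer output."""


# In the tuner: tolerate isolated trial errors, but give up after repeated extraction failures.
MAX_EXTRACTION_FAILURES = 3


def handle_trial_error(err: Exception, failure_count: int) -> int:
    """Plain helper (not a Ray callback): count extraction failures and abort when too many occur."""
    if isinstance(err, ExperimentInfoNotFoundError):
        failure_count += 1
        if failure_count >= MAX_EXTRACTION_FAILURES:
            raise RuntimeError("Repeatedly failed to extract experiment info; aborting tuning.") from err
    # Other errors (e.g. an OOM caused by a bad hyperparameter) stay squashed so tuning continues.
    return failure_count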

@ozhanozen
Contributor Author

@garylvov I have noticed another minor problem. Some libraries (e.g., skrl) log metrics with characters that create problems for the tensorboard hparams dashboard (and maybe for Ray as well?), for example myreward(average) (brackets) and Info / myreward (spaces).

Hence I made the following modification to your function to convert such characters to underscores:

def load_tensorboard_logs(directory: str) -> dict:
    """From a tensorboard directory, get the latest scalar values. If the logs can't be
    found, check the summaries sublevel.

    Args:
        directory: The directory of the tensorboard logging.

    Returns:
        The latest available scalar values.
    """

    # replace any char that is not alphanumeric, "_", ".", or "/" with "_", then collapse runs of "_"
    def replace_invalid_chars(t):
        t2 = re.sub(r"[^0-9A-Za-z_./]", "_", t)
        t2 = re.sub(r"_+", "_", t2)
        return t2.strip("_")

    # Initialize the event accumulator with a size guidance for only the latest entry
    def get_latest_scalars(path: str) -> dict:
        event_acc = EventAccumulator(path, size_guidance={"scalars": 1})
        try:
            event_acc.Reload()
            if event_acc.Tags()["scalars"]:
                return {
                    replace_invalid_chars(tag): event_acc.Scalars(tag)[-1].value
                    for tag in event_acc.Tags()["scalars"]
                    if event_acc.Scalars(tag)
                }
        except (KeyError, OSError, RuntimeError, DirectoryDeletedError):
            return {}

    scalars = get_latest_scalars(directory)
    return scalars or get_latest_scalars(os.path.join(directory, "summaries"))
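For instance, with this sanitization the problematic tags mentioned above would be rewritten as follows (values worked out by hand from the regex, so treat them as illustrative):

>>> replace_invalid_chars("myreward(average)")
'myreward_average'
>>> replace_invalid_chars("Info / myreward")
'Info_/_myreward'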

and, moreover, I added the metric argument to TuneConfig so that the best trial is reported periodically:

# Configure the tuning job
tuner = tune.Tuner(
    IsaacLabTuneTrainable,
    param_space=cfg,
    tune_config=tune.TuneConfig(
        metric=args.metric,
        mode=args.mode,
        search_alg=repeat_search,
        num_samples=args.num_samples,
        reuse_actors=True,
    ),
    run_config=run_config,
)

Please let me know if this is ok.

@glvov-bdai
Collaborator

Yep, I noticed that; it looks good!

@ozhanozen
Contributor Author

Hi @ozhanozen, can you please update this PR to main? Alternatively, I'm not sure if you enabled contributor editing on the PR; otherwise I'd just click the update button I normally see (maybe I need to mark it ready for review?)

I have updated it to main. It is now marked "ready for review", but I couldn't find the option to "Allow edits from maintainers."

@ozhanozen
Contributor Author

For 1, I think we could add a maximum number of lines that the experiment name / log dir can be extracted from, and/or a maximum timeout for extracting the experiment info (it can maybe share the same timeout value already in the argparser).

For 2, I think you can throw a specific error, catch it in the main tuner, and then rethrow a top-level error if you receive this error a certain number of times. I think I configured Ray so that errors get squashed (in case a hyperparameter leads to one of the tasks running out of VRAM, for example) so that tuning jobs continue. But if you raise a specific error type in the util file, and then catch and rethrow it in the tuner itself, this could work.

Ok, these make sense. I will try them in the next few days. Meanwhile, I will keep the PR in draft until it is done.

@ozhanozen marked this pull request as draft on April 24, 2025, 17:35
Successfully merging this pull request may close these issues:

[Bug Report] Multiple issues within ray/tuner.py

3 participants