Broaden exception handling in TraceLens installation#382

Merged
gphuang merged 2 commits into feat/12-tracelens-integration from copilot/sub-pr-377-yet-again
Dec 17, 2025

Copilot AI commented Dec 17, 2025

The _ensure_tracelens_installed() function only caught subprocess.CalledProcessError, leaving it vulnerable to system-level failures during pip installation.

Changes

  • Added handlers for PermissionError and OSError to catch file system and permission issues
  • Added catch-all Exception handler to prevent installation failures from crashing training
  • Differentiated error messages by exception type for easier debugging

except subprocess.CalledProcessError as e:
    warning_rank_0(f"[TraceLens] Failed to install TraceLens: {e}")
    return False
except (PermissionError, OSError) as e:
    warning_rank_0(f"[TraceLens] Failed to install TraceLens due to system error: {e}")
    return False
except Exception as e:
    warning_rank_0(f"[TraceLens] Failed to install TraceLens due to unexpected error: {e}")
    return False
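
For context, here is a minimal sketch of how handlers like these might sit inside a complete install-if-missing helper. It is not the PR's actual `_ensure_tracelens_installed()` body: `ensure_installed`, its `module_name` / `pip_spec` parameters, the injectable `run` argument, and the logging-based `warning_rank_0` stand-in are all assumptions made for illustration. The sketch catches `OSError` alone, which also covers `PermissionError` because `PermissionError` is a subclass of `OSError`.

```python
import importlib.util
import logging
import subprocess
import sys


def warning_rank_0(msg):
    # Stand-in for the project's logger (assumption: the real helper
    # only emits the warning on rank 0 of a distributed job).
    logging.getLogger(__name__).warning(msg)


def ensure_installed(module_name, pip_spec, run=subprocess.check_call):
    """Return True if module_name is importable or pip_spec installs cleanly.

    `run` is injectable so the failure paths can be exercised without
    actually invoking pip (hypothetical parameter, not from the PR).
    """
    # Already importable: nothing to do.
    if importlib.util.find_spec(module_name) is not None:
        return True
    try:
        run([sys.executable, "-m", "pip", "install", pip_spec])
        return True
    except subprocess.CalledProcessError as e:
        # pip started but exited with a non-zero status.
        warning_rank_0(f"Failed to install {pip_spec}: {e}")
        return False
    except OSError as e:
        # The child process could not be started (includes PermissionError).
        warning_rank_0(f"Failed to install {pip_spec} due to system error: {e}")
        return False
    except Exception as e:
        # Last-resort safety net so a failed install never crashes training.
        warning_rank_0(f"Failed to install {pip_spec} due to unexpected error: {e}")
        return False
```

In every failure path the helper returns `False` instead of raising, so a caller can skip the optional profiling step and continue training.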

Co-authored-by: gphuang <13152353+gphuang@users.noreply.github.com>
Copilot AI changed the title from "[WIP] WIP Address feedback on tracelens integration PR" to "Broaden exception handling in TraceLens installation" Dec 17, 2025
Copilot AI requested a review from gphuang December 17, 2025 10:24
gphuang marked this pull request as ready for review December 17, 2025 11:17
Copilot AI review requested due to automatic review settings December 17, 2025 11:17
gphuang merged commit 34e1e45 into feat/12-tracelens-integration Dec 17, 2025
4 checks passed
gphuang deleted the copilot/sub-pr-377-yet-again branch December 17, 2025 11:17

Copilot AI left a comment

Pull request overview

This PR enhances error handling in the _ensure_tracelens_installed() function by adding multiple exception handlers to catch a broader range of installation failures. The changes aim to prevent system-level failures during pip installation from crashing the training process.

  • Added PermissionError and OSError handlers for file system and permission issues
  • Added a catch-all Exception handler for unexpected errors
  • Differentiated error messages by exception type for easier debugging

Comments suppressed due to low confidence (2)

primus/backends/megatron/training/mlflow_artifacts.py:388

  • Variable dfs is not used.
            dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)

primus/backends/megatron/training/mlflow_artifacts.py:376

  • This assignment to 'dfs' is unnecessary as it is redefined before this value is used.
            dfs = generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)

Comment on lines +255 to +260
except (PermissionError, OSError) as e:
    warning_rank_0(f"[TraceLens] Failed to install TraceLens due to system error: {e}")
    return False
except Exception as e:
    warning_rank_0(f"[TraceLens] Failed to install TraceLens due to unexpected error: {e}")
    return False

Copilot AI Dec 17, 2025

The handler order is valid: subprocess.CalledProcessError and OSError are both subclasses of Exception, so the specific handlers run before the catch-all. Two points are worth reconsidering, though. First, PermissionError is itself a subclass of OSError, so the tuple (PermissionError, OSError) is equivalent to OSError alone. Second, subprocess.check_call() raises subprocess.CalledProcessError whenever pip exits with a non-zero status, and that includes file system and permission failures that happen inside the pip process. OSError (or PermissionError) is raised in the calling process only when the child cannot be started, for example when the Python interpreter path is missing or not executable, which is rare. Consider whether the additional handler provides enough value to justify the extra code; the catch-all Exception handler at the end already provides the safety net for truly unexpected errors.
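
The distinction drawn in this comment can be checked with a small experiment. The sketch below is only an illustration, with `classify_failure` as a hypothetical helper name: it runs a command through `subprocess.check_call()` and reports which exception type, if any, reaches the caller.

```python
import subprocess
import sys


def classify_failure(cmd):
    """Run cmd via check_call and name the exception the caller sees."""
    try:
        subprocess.check_call(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        return "ok"
    except subprocess.CalledProcessError:
        # The child started and exited with a non-zero status.
        return "CalledProcessError"
    except OSError as e:
        # The child could not be started at all.
        return type(e).__name__


# A child that exits non-zero surfaces as CalledProcessError.
exit_nonzero = classify_failure([sys.executable, "-c", "import sys; sys.exit(3)"])

# A PermissionError raised *inside* the child is an uncaught exception there,
# so the parent still only sees a non-zero exit status.
child_permission_error = classify_failure(
    [sys.executable, "-c", "raise PermissionError('denied')"]
)

# A missing executable fails before any child exists, so the parent
# gets an OSError subclass (typically FileNotFoundError).
missing_executable = classify_failure(["/nonexistent-dir/no-such-python"])

print(exit_nonzero, child_permission_error, missing_executable)
print(issubclass(PermissionError, OSError))
```

Only the last case reaches the `OSError` handler, which is consistent with the comment's view that this handler matters mainly when the interpreter itself cannot be launched.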

