Broaden exception handling in TraceLens installation#382
Broaden exception handling in TraceLens installation#382gphuang merged 2 commits intofeat/12-tracelens-integrationfrom
Conversation
Co-authored-by: gphuang <13152353+gphuang@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR enhances error handling in the _ensure_tracelens_installed() function by adding multiple exception handlers to catch a broader range of installation failures. The changes aim to prevent system-level failures during pip installation from crashing the training process.
- Added
PermissionErrorandOSErrorhandlers for file system and permission issues - Added a catch-all
Exceptionhandler for unexpected errors - Differentiated error messages by exception type for easier debugging
Comments suppressed due to low confidence (2)
primus/backends/megatron/training/mlflow_artifacts.py:388
- Variable dfs is not used.
dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)
primus/backends/megatron/training/mlflow_artifacts.py:376
- This assignment to 'dfs' is unnecessary as it is redefined before this value is used.
dfs = generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| except (PermissionError, OSError) as e: | ||
| warning_rank_0(f"[TraceLens] Failed to install TraceLens due to system error: {e}") | ||
| return False | ||
| except Exception as e: | ||
| warning_rank_0(f"[TraceLens] Failed to install TraceLens due to unexpected error: {e}") | ||
| return False |
There was a problem hiding this comment.
The exception handling order has a logical issue. Since subprocess.CalledProcessError inherits from Exception, and OSError also inherits from Exception, the current order works. However, there's a subtle problem: subprocess.check_call() typically only raises subprocess.CalledProcessError for pip installation failures. File system and permission issues during the subprocess execution are usually wrapped into CalledProcessError. The OSError and PermissionError would only be raised if there are issues with the Python interpreter path itself or the subprocess module invocation, which are extremely rare cases. Consider whether these additional handlers provide value, or if they make the code unnecessarily complex. The catch-all Exception handler at the end provides the safety net for truly unexpected errors.
The
_ensure_tracelens_installed()function only caughtsubprocess.CalledProcessError, leaving it vulnerable to system-level failures during pip installation.Changes
PermissionErrorandOSErrorto catch file system and permission issuesExceptionhandler to prevent installation failures from crashing training💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.