Conversation
- Add mlflow_artifacts.py with TraceLens report generation and upload
- Add TraceLens config options to primus_megatron_module.yaml:
  - mlflow_upload_traces: Upload profiler trace files
  - mlflow_upload_logs: Upload training log files
  - mlflow_upload_tracelens_report: Generate and upload TraceLens reports
  - mlflow_tracelens_ranks: Filter ranks for TraceLens analysis
  - mlflow_tracelens_max_reports: Limit the number of reports generated
- Update global_vars.py with upload_mlflow_artifacts function
- Integrate artifact upload in trainer.py before MLflow run ends
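The `mlflow_tracelens_ranks` filter needs to map trace filenames back to ranks. A minimal sketch of that mapping, assuming the `rank[N]` filename convention shown later in this thread (the helper names mirror `_extract_rank_from_filename` / `_filter_traces_by_rank`, but this is an illustrative reimplementation, not the module's actual code):

```python
import re
from typing import List, Optional


def extract_rank_from_filename(filename: str) -> Optional[int]:
    """Extract the rank from names like
    'primus-megatron-exp[my-exp]-rank[0].1234567890.json'
    (square brackets are literal characters)."""
    match = re.search(r"rank\[(\d+)\]", filename)
    return int(match.group(1)) if match else None


def filter_traces_by_rank(files: List[str], ranks) -> List[str]:
    """Keep only trace files whose rank is in `ranks`; None keeps all."""
    if ranks is None:
        return list(files)
    wanted = set(ranks)
    return [f for f in files if extract_rank_from_filename(f) in wanted]
```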
MLflow Artifact Structure:
artifacts/
├── traces/ # PyTorch profiler trace files
├── logs/ # Training log files
└── trace_analysis/ # TraceLens analysis reports
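The layout above maps naturally onto MLflow's `artifact_path` argument. A hedged sketch of the planning step (the directory names come from the tree above; the helper and glob patterns are illustrative assumptions, not the module's code — each yielded pair would then go to `mlflow.log_artifact(local, artifact_path=dest)`):

```python
import glob
import os
from typing import Iterator, Tuple


def plan_artifact_uploads(
    tensorboard_dir: str, log_dir: str, report_dir: str
) -> Iterator[Tuple[str, str]]:
    """Yield (local_path, artifact_path) pairs for the
    artifacts/{traces,logs,trace_analysis} layout."""
    # PyTorch profiler trace files (plain and gzipped JSON)
    for trace in sorted(glob.glob(os.path.join(tensorboard_dir, "*.json*"))):
        yield trace, "traces"
    # Training log files
    for log_file in sorted(glob.glob(os.path.join(log_dir, "*.log"))):
        yield log_file, "logs"
    # TraceLens analysis reports, uploaded as a whole directory
    # (mlflow.log_artifacts handles directory trees)
    if os.path.isdir(report_dir):
        yield report_dir, "trace_analysis"
```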
- Default output format changed from HTML to CSV
  - CSV is viewable in more tools and lighter weight
  - HTML files require download to view in MLflow; CSV is more practical
- Added fallback CSV generation when the TraceLens CLI is unavailable:
  - Parses PyTorch profiler JSON directly
  - Extracts kernel/operation statistics
  - Outputs: Operation, Count, Total/Avg/Min/Max Time (ms), % of Total
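The fallback path described above (parse profiler JSON, aggregate per-operation statistics, emit CSV) can be sketched as follows. The column names follow the list above; the event fields (`traceEvents`, `name`, `dur` in microseconds, `ph == "X"` for complete events) are the standard Chrome-trace fields PyTorch's profiler emits, and the function name is illustrative:

```python
import csv
import json


def fallback_csv_summary(trace_path: str, csv_path: str) -> None:
    """Aggregate complete ('X') events from a PyTorch profiler trace into
    per-operation timing statistics and write them as CSV."""
    with open(trace_path) as f:
        events = json.load(f).get("traceEvents", [])

    stats = {}  # operation name -> list of durations in microseconds
    for ev in events:
        if ev.get("ph") == "X" and "dur" in ev:
            stats.setdefault(ev.get("name", "unknown"), []).append(ev["dur"])

    grand_total = sum(sum(durs) for durs in stats.values()) or 1
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Operation", "Count", "Total Time (ms)", "Avg Time (ms)",
                         "Min Time (ms)", "Max Time (ms)", "% of Total"])
        # Sort operations by descending total time
        for name, durs in sorted(stats.items(), key=lambda kv: -sum(kv[1])):
            total = sum(durs)
            writer.writerow([name, len(durs), total / 1000,
                             total / len(durs) / 1000,
                             min(durs) / 1000, max(durs) / 1000,
                             round(100 * total / grand_total, 2)])
```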
- Fix TraceLens installation: use the GitHub repo instead of PyPI
- Use the TraceLens Python API directly instead of the CLI
- Add 'all' format option to generate both XLSX and CSV (new default)
  - XLSX: single multi-tab Excel file with detailed analysis
  - CSV: multiple CSV files per rank (kernels, memory, communication, etc.)
- Add mlflow_tracelens_output_format config option
- Fall back to a simple CSV summary when TraceLens is not available
Pull request overview
This PR integrates TraceLens for automatic analysis of PyTorch profiler traces, enabling detailed performance reports to be generated and uploaded to MLflow alongside training artifacts.
Key Changes:
- Added TraceLens integration with automatic installation and report generation capabilities
- Extended MLflow artifact upload to include profiler traces, training logs, and TraceLens analysis reports
- Added configuration options to control TraceLens analysis (ranks, max reports, output formats)
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| primus/backends/megatron/training/mlflow_artifacts.py | New 725-line module implementing trace/log file discovery, TraceLens report generation (with fallback CSV summary), and MLflow artifact upload orchestration |
| primus/backends/megatron/training/global_vars.py | Added upload_mlflow_artifacts() wrapper function to coordinate artifact uploads at training completion |
| primus/modules/trainer/megatron/trainer.py | Integrated artifact upload call before MLflow run termination, passing through all TraceLens configuration parameters |
| primus/configs/modules/megatron/primus_megatron_module.yaml | Added 6 new configuration options for controlling trace/log uploads and TraceLens report generation |
Comments suppressed due to low confidence (2)
primus/backends/megatron/training/mlflow_artifacts.py:382
- Variable `dfs` is not used.
  `dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)`

primus/backends/megatron/training/mlflow_artifacts.py:370
- This assignment to `dfs` is unnecessary, as the value is redefined before it is used.
  `dfs = generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)`
…iew (#388)

Addresses feedback from PR #377 review comment regarding TraceLens report generation potentially blocking training completion before MLflow run closure.

## Analysis

Reviewed the TraceLens integration implementation in commits cb399c0, 48522c2, and 86685b2. The concern about blocking behavior has already been mitigated through:

- **Opt-in design**: `mlflow_upload_tracelens_report` defaults to `False`, preventing unexpected blocking
- **Configurable scope**: `mlflow_tracelens_ranks` and `mlflow_tracelens_max_reports` parameters limit report generation to specific ranks or a maximum count
- **Format options**: `mlflow_tracelens_output_format` allows choosing the lighter CSV format over XLSX

## Current behavior

```python
upload_mlflow_artifacts(
    tensorboard_dir=args.tensorboard_dir,
    exp_root_path=self.exp_root_path,
    upload_traces=getattr(args, "mlflow_upload_traces", True),
    upload_logs=getattr(args, "mlflow_upload_logs", True),
    upload_tracelens_report=getattr(args, "mlflow_upload_tracelens_report", False),  # Opt-in
    tracelens_ranks=getattr(args, "mlflow_tracelens_ranks", None),  # Control scope
    tracelens_max_reports=getattr(args, "mlflow_tracelens_max_reports", None),  # Limit count
    tracelens_output_format=getattr(args, "mlflow_tracelens_output_format", "all"),
)
mlflow_writer.end_run()  # Guaranteed to execute
```

The synchronous design is appropriate for this use case:

- Report generation is explicitly requested by users who need the analysis
- Users control the scope and cost via configuration parameters
- Failure to generate reports is logged but doesn't prevent MLflow run closure (handled in try/except blocks)

No changes required - comment thread marked as resolved.

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…387)

The YAML config documentation for `mlflow_tracelens_output_format` was incomplete - it listed "all", "xlsx", and "csv" as valid options, but the implementation in `mlflow_artifacts.py` also accepts "html" (with fallback to xlsx+csv, since HTML generation is not yet implemented).

## Changes

- Updated `primus_megatron_module.yaml` line 13 to document "html" as a valid format option, with a note that it falls back to xlsx+csv

```yaml
mlflow_tracelens_output_format: all # TraceLens report format: all (xlsx+csv), xlsx, csv, or html (html not yet supported, falls back to xlsx+csv)
```

This aligns the configuration documentation with the actual implementation behavior in `generate_tracelens_report()`.

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: gphuang <13152353+gphuang@users.noreply.github.com>
> @copilot open a new pull request to apply changes based on the comments in this thread
Pull request overview
Copilot reviewed 5 out of 6 changed files in this pull request and generated 5 comments.
Thanks for the feedback on #377. I've created this new PR, which merges into #377, to address your comment. I will work on the changes and keep this PR's description up to date as I make progress.

Original PR: #377
Triggering review: #377 (comment)

> @copilot open a new pull request to apply changes based on [this feedback](#377 (comment))

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…ion (#427)

- [x] Analyze the mlflow_artifacts.py module to understand functionality
- [ ] Create test_mlflow_artifacts.py file with comprehensive test coverage
- [ ] Add tests for _get_all_trace_files function
- [ ] Add tests for _get_all_log_files function
- [ ] Add tests for _extract_rank_from_filename function
- [ ] Add tests for _filter_traces_by_rank function
- [ ] Add tests for generate_tracelens_report function
- [ ] Add tests for upload_trace_files_to_mlflow function
- [ ] Add tests for upload_tracelens_reports_to_mlflow function
- [ ] Add tests for upload_artifacts_to_mlflow main function
- [ ] Run tests to ensure they all pass

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: gphuang <13152353+gphuang@users.noreply.github.com>
- [ ] Remove HTML references from module docstring in mlflow_artifacts.py (line 33)
- [ ] Remove HTML from generate_tracelens_report function docstring (line 474)
- [ ] Remove HTML from valid_formats set (line 483)
- [ ] Remove HTML fallback logic and warning (lines 576-579)
- [ ] Remove HTML from generate_tracelens_reports function docstring (line 730)
- [ ] Remove HTML from upload_tracelens_reports_to_mlflow function docstring (line 791)
- [ ] Remove HTML from upload_artifacts_to_mlflow function docstring (line 900)
- [ ] Remove HTML from upload_mlflow_artifacts function docstring in global_vars.py (lines 125-126)
- [ ] Remove HTML from configuration file primus_megatron_module.yaml (line 13)
- [ ] Test the changes to ensure no breakage
- [ ] Run code review
- [ ] Run security check

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: gphuang <13152353+gphuang@users.noreply.github.com>
User requested opening a new pull request to address review comments, but this environment cannot create new PRs - only commit to the current PR.

## Changes Made

- Replied to the comment, requesting clarification on whether to:
  - Address unresolved review comments in the current PR #377, or
  - Have someone else create a separate PR for those changes

## Unresolved Review Items Available to Address

If addressing in the current PR, we can fix:

- Missing input validation for the `output_format` parameter
- Ambiguous empty ranks list behavior (`[]` vs `None`)
- Missing imports (`subprocess`, `sys`)
- Code duplication in the HTML fallback case
- Uninitialized `generated_files` variable
- Documentation inconsistencies
- Unused variable `dfs`

No code changes made pending user clarification on approach.

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
The TraceLens GitHub URL was hardcoded in multiple locations within `mlflow_artifacts.py`, making updates error-prone and inconsistent.

## Changes

- Added `TRACELENS_GITHUB_URL` module-level constant
- Updated `_ensure_tracelens_installed()` to reference the constant in the subprocess call
- Docstring retains the literal URL for user-facing documentation

```python
# Module level
TRACELENS_GITHUB_URL = "git+https://github.com/AMD-AGI/TraceLens.git"

# Usage in code
subprocess.check_call([sys.executable, "-m", "pip", "install", TRACELENS_GITHUB_URL, "-q"])
```

Provides a single source of truth for the URL, simplifying future repository changes.

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: gphuang <13152353+gphuang@users.noreply.github.com>
The request to address code duplication in TraceLens report generation has already been resolved in commit 3516b65.

## Current State

The `generate_tracelens_report()` function already uses a helper function `_generate_xlsx_and_csv_reports()` that eliminates duplication:

```python
def _generate_xlsx_and_csv_reports(generate_xlsx: bool, generate_csv: bool) -> list:
    """Helper to generate XLSX and/or CSV TraceLens reports."""
    local_generated_files = []
    if generate_xlsx:
        xlsx_path = os.path.join(output_dir, f"{report_name}_analysis.xlsx")
        dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)
        # ... collection logic
    if generate_csv:
        csv_subdir = os.path.join(output_dir, report_name)
        os.makedirs(csv_subdir, exist_ok=True)
        generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)
        # ... collection logic
    return local_generated_files

# All format cases reuse the helper
if output_format in ("all", "xlsx"):
    generated_files.extend(_generate_xlsx_and_csv_reports(generate_xlsx=True, generate_csv=False))
if output_format in ("all", "csv"):
    generated_files.extend(_generate_xlsx_and_csv_reports(generate_xlsx=False, generate_csv=True))
if output_format == "html":
    generated_files.extend(_generate_xlsx_and_csv_reports(generate_xlsx=True, generate_csv=True))
```

No further changes required.

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
> @copilot open a new pull request to apply changes based on the comments in this thread
Thanks for the feedback on #377. I've created this new PR, which merges into #377, to address your comment. I will work on the changes and keep this PR's description up to date as I make progress.

Original PR: #377
Triggering comment: #377 (comment)

> @copilot open a new pull request to apply changes based on the comments in [this thread](#377 (review))

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Pull request overview
Copilot reviewed 6 out of 7 changed files in this pull request and generated 8 comments.
```python
def _get_all_trace_files(tensorboard_dir: str) -> list:
    """
    Find all profiler trace files in the tensorboard directory.

    Trace files are typically named like (note: square brackets are literal characters):
    - primus-megatron-exp[my-exp-name]-rank[0].1234567890.json
    - primus-megatron-exp[my-exp-name]-rank[0].1234567890.json.gz

    Args:
        tensorboard_dir: Path to the tensorboard directory containing trace files

    Returns:
        List of paths to trace files
    """
    if not tensorboard_dir or not os.path.exists(tensorboard_dir):
        return []

    trace_files = []
    # Look for JSON trace files (both compressed and uncompressed)
    patterns = ["*.json", "*.json.gz", "*.pt.trace.json", "*.pt.trace.json.gz"]
    for pattern in patterns:
        trace_files.extend(glob.glob(os.path.join(tensorboard_dir, pattern)))
        trace_files.extend(glob.glob(os.path.join(tensorboard_dir, "**", pattern), recursive=True))

    # Remove duplicates while preserving order
    seen = set()
    unique_files = []
    for f in trace_files:
        if f not in seen:
            seen.add(f)
            unique_files.append(f)

    return unique_files


def _get_all_log_files(exp_root_path: str) -> list:
```
Inconsistent type annotation style. The function uses lowercase `list` as the return type (lines 151, 186, 528), while other functions in the module use `List` from the `typing` module (e.g., line 464 uses `List[str]`). For consistency and better type checking on older Python versions, consider using `List[str]` from `typing` throughout the module.
```python
    Returns:
        Rank number or None if not found
    """
    import re
```
The `import re` statement is placed inside the function body. While this works, it is generally better practice to place imports at the module level for better readability and to avoid repeated import overhead if this function is called multiple times.
```python
    except ImportError:
        log_rank_0("[TraceLens] TraceLens not found, attempting to install from GitHub...")
        try:
            # TraceLens is on GitHub, not PyPI
            subprocess.check_call(
                [
                    sys.executable,
                    "-m",
                    "pip",
                    "install",
                    TRACELENS_GITHUB_URL,
                    "-q",
                ]
            )
            log_rank_0("[TraceLens] Successfully installed TraceLens from GitHub")
            return True
        except subprocess.CalledProcessError as e:
            warning_rank_0(f"[TraceLens] Failed to install TraceLens: {e}")
            return False
        except (PermissionError, OSError) as e:
            warning_rank_0(f"[TraceLens] Failed to install TraceLens due to system error: {e}")
            return False
        except Exception as e:
            warning_rank_0(f"[TraceLens] Failed to install TraceLens due to unexpected error: {e}")
            return False
```
The function automatically attempts to install TraceLens using `subprocess.check_call` when it's not found. This behavior could be problematic in production environments or containerized deployments where:

- Automatic package installation may not be desired or allowed
- Network access might be restricted
- The user might want to control when and how packages are installed

Consider making the auto-installation opt-in via a configuration parameter or environment variable, rather than attempting it automatically on `ImportError`.
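One way to realize this suggestion is to gate installation behind an environment variable. A sketch under that assumption (the flag name `PRIMUS_TRACELENS_AUTO_INSTALL` and the helper are hypothetical; only the `pip install` subprocess call mirrors the snippet under review):

```python
import importlib
import os
import subprocess
import sys


def ensure_package(module_name: str, pip_spec: str,
                   env_flag: str = "PRIMUS_TRACELENS_AUTO_INSTALL") -> bool:
    """Import module_name; install pip_spec only when the env flag opts in."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        pass
    if os.environ.get(env_flag, "0") != "1":
        # Opt-in not granted: report unavailability instead of installing
        return False
    try:
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_spec, "-q"])
        importlib.import_module(module_name)
        return True
    except (subprocess.CalledProcessError, ImportError, OSError):
        return False
```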
```python
    def test_html_format_fallback(self, mock_log, mock_warning):
        """Test that HTML format falls back to xlsx+csv."""
        with tempfile.TemporaryDirectory() as tmpdir:
            trace_file = os.path.join(tmpdir, "trace.json")
            open(trace_file, "w").close()

            mock_generate = Mock(return_value={"kernels": Mock()})

            # Create files that would be generated
            xlsx_path = os.path.join(tmpdir, "trace_analysis.xlsx")
            open(xlsx_path, "w").close()

            csv_dir = os.path.join(tmpdir, "trace")
            os.makedirs(csv_dir)
            csv_file = os.path.join(csv_dir, "kernels.csv")
            open(csv_file, "w").close()

            with patch.dict("sys.modules", {"TraceLens": Mock(), "TraceLens.Reporting": Mock()}):
                with patch("primus.backends.megatron.training.mlflow_artifacts.generate_perf_report_pytorch", mock_generate):
                    with patch("primus.backends.megatron.training.mlflow_artifacts.glob.glob", return_value=[csv_file]):
                        result = generate_tracelens_report(
                            trace_file, tmpdir, "trace", output_format="html"
                        )

            # Should fall back to both xlsx and csv
            assert len(result) >= 1
            mock_warning.assert_called()  # Should warn about HTML not supported
```
The test `test_html_format_fallback` expects the HTML format to be handled gracefully with a warning, but the actual implementation raises a `ValueError` for invalid formats (lines 486-489). The test will fail because it passes `"html"`, which is not in the `valid_formats` set, causing a `ValueError` to be raised before any warning can be logged. Either the test should expect a `ValueError` to be raised, or the implementation should handle unsupported formats more gracefully by issuing a warning and falling back to a default format.
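The graceful-fallback alternative mentioned here could look like the following sketch (the valid set and default mirror the config options documented elsewhere in this thread; the function name is illustrative):

```python
import warnings

VALID_FORMATS = {"all", "xlsx", "csv"}


def normalize_output_format(fmt: str, default: str = "all") -> str:
    """Return fmt if supported; otherwise warn and fall back to the default
    instead of raising ValueError."""
    if fmt in VALID_FORMATS:
        return fmt
    warnings.warn(f"Unsupported TraceLens output format {fmt!r}; "
                  f"falling back to {default!r}")
    return default
```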
```python
    open(trace1, "w").close()
    open(trace2, "w").close()
```
File handles are not deterministically closed. `open(file, "w").close()` creates a file handle but doesn't guarantee it is closed if an exception occurs between `open` and `close`. Consider using a context manager (`with open(file, "w"):`) or `Path(file).touch()` instead.
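The two safer idioms suggested here, side by side (illustrative helper names):

```python
import os
from pathlib import Path


def touch_with_open(path: str) -> None:
    # Context manager guarantees the handle is closed even if an error occurs
    with open(path, "w"):
        pass


def touch_with_pathlib(path: str) -> None:
    # Path.touch creates an empty file without exposing a handle at all
    Path(path).touch()
```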
```python
dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)
if os.path.exists(xlsx_path):
    log_rank_0(
        f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
```

The code accesses `len(dfs)` assuming `generate_perf_report_pytorch` returns a dict-like object of dataframes. However, there is no error handling if this function returns `None`, returns an unexpected type, or raises an exception. If it returns `None` or a type that doesn't support `len()`, this will cause a `TypeError`. Consider checking the return value's type or wrapping this in additional error handling.

Suggested change:

```python
dfs = None
try:
    dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)
except Exception as exc:
    warning_rank_0(
        f"[TraceLens] Failed to generate XLSX report for {trace_file}: {exc}"
    )
if os.path.exists(xlsx_path):
    tab_info = ""
    if dfs is not None:
        try:
            tab_info = f" with {len(dfs)} tabs"
        except TypeError:
            tab_info = ""
    log_rank_0(
        f"[TraceLens] Generated XLSX report{tab_info}: {os.path.basename(xlsx_path)}"
```
```python
        with tempfile.TemporaryDirectory() as tmpdir:
            result = generate_tracelens_reports(tmpdir, tmpdir, ranks=[0, 2])

            # Should only generate reports for rank 0 and 2
```

Variable `result` is not used.

Suggested change:

```python
            # Should only generate reports for rank 0 and 2
            assert len(result) == 2
```
```python
        mock_gen_report.return_value = ["/output/report.xlsx"]

        with tempfile.TemporaryDirectory() as tmpdir:
            result = generate_tracelens_reports(tmpdir, tmpdir, max_reports=3)
```

Variable `result` is not used.
Summary
Integrates TraceLens for automatic analysis of PyTorch profiler traces, with reports uploaded to MLflow.
Add TraceLens as a dependency to Primus, installed directly from GitHub. If profiling is enabled, run trace analysis (generate_perf_report_pytorch.py) on the collected trace files. See the TraceLens documentation.
Changes
- mlflow_artifacts.py: new module for trace/log discovery, TraceLens report generation, and MLflow upload
- global_vars.py: upload_mlflow_artifacts() wrapper
- trainer.py: artifact upload call before the MLflow run ends
- primus_megatron_module.yaml: new configuration options

Configuration
mlflow_upload_tracelens_report: true # Enable TraceLens analysis
mlflow_tracelens_ranks: [0, 8] # Ranks to analyze (null = auto)
mlflow_tracelens_max_reports: 2 # Max reports
mlflow_tracelens_output_format: all # all | xlsx | csv

Output
Testing
Tested on 2-node MI300X cluster (16 GPUs) ✅