Skip to content

Feat/12 tracelens integration#377

Closed
gphuang wants to merge 38 commits intomainfrom
feat/12-tracelens-integration
Closed

Feat/12 tracelens integration#377
gphuang wants to merge 38 commits intomainfrom
feat/12-tracelens-integration

Conversation

@gphuang
Copy link

@gphuang gphuang commented Dec 17, 2025

Summary

Integrates TraceLens for automatic analysis of PyTorch profiler traces, with reports uploaded to MLflow.

Add TraceLens as a dependency to Primus (Install TraceLens directly from GitHub). Run trace analysis (generate_perf_report_pytorch.py) for the trace files if profiling is on. See TraceLens Doc

Changes

File Change
mlflow_artifacts.py TraceLens report generation logic
global_vars.py Extended upload_mlflow_artifacts()
trainer.py Call report generation at end of training
primus_megatron_module.yaml New config options

Configuration

mlflow_upload_tracelens_report: true # Enable TraceLens analysis
mlflow_tracelens_ranks: [0, 8] # Ranks to analyze (null = auto)
mlflow_tracelens_max_reports: 2 # Max reports
mlflow_tracelens_output_format: all # all | xlsx | csv## Output

  • XLSX: Multi-tab workbook with all analyses
  • CSV: Individual files per analysis type

Testing

Tested on 2-node MI300X cluster (16 GPUs) ✅

- Add mlflow_artifacts.py with TraceLens report generation and upload
- Add TraceLens config options to primus_megatron_module.yaml:
  - mlflow_upload_traces: Upload profiler trace files
  - mlflow_upload_logs: Upload training log files
  - mlflow_upload_tracelens_report: Generate and upload TraceLens reports
  - mlflow_tracelens_ranks: Filter ranks for TraceLens analysis
  - mlflow_tracelens_max_reports: Limit number of reports generated
- Update global_vars.py with upload_mlflow_artifacts function
- Integrate artifact upload in trainer.py before MLflow run ends

MLflow Artifact Structure:
    artifacts/
    ├── traces/              # PyTorch profiler trace files
    ├── logs/                # Training log files
    └── trace_analysis/      # TraceLens analysis reports
- Default output format changed from HTML to CSV
- CSV is viewable in more tools and lighter weight
- Added fallback CSV generation when TraceLens CLI unavailable:
  - Parses PyTorch profiler JSON directly
  - Extracts kernel/operation statistics
  - Outputs: Operation, Count, Total/Avg/Min/Max Time (ms), % of Total
- HTML files require download to view in MLflow; CSV is more practical
- Fix TraceLens installation: use GitHub repo instead of PyPI
- Use TraceLens Python API directly instead of CLI
- Add 'all' format option to generate both XLSX and CSV (new default)
- XLSX: Single multi-tab Excel file with detailed analysis
- CSV: Multiple CSV files per rank (kernels, memory, communication, etc.)
- Add mlflow_tracelens_output_format config option
- Fallback to simple CSV summary when TraceLens not available
Copilot AI review requested due to automatic review settings December 17, 2025 10:00
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR integrates TraceLens for automatic analysis of PyTorch profiler traces, enabling detailed performance reports to be generated and uploaded to MLflow alongside training artifacts.

Key Changes:

  • Added TraceLens integration with automatic installation and report generation capabilities
  • Extended MLflow artifact upload to include profiler traces, training logs, and TraceLens analysis reports
  • Added configuration options to control TraceLens analysis (ranks, max reports, output formats)

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 12 comments.

File Description
primus/backends/megatron/training/mlflow_artifacts.py New 725-line module implementing trace/log file discovery, TraceLens report generation (with fallback CSV summary), and MLflow artifact upload orchestration
primus/backends/megatron/training/global_vars.py Added upload_mlflow_artifacts() wrapper function to coordinate artifact uploads at training completion
primus/modules/trainer/megatron/trainer.py Integrated artifact upload call before MLflow run termination, passing through all TraceLens configuration parameters
primus/configs/modules/megatron/primus_megatron_module.yaml Added 6 new configuration options for controlling trace/log uploads and TraceLens report generation
Comments suppressed due to low confidence (2)

primus/backends/megatron/training/mlflow_artifacts.py:382

  • Variable dfs is not used.
            dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)

primus/backends/megatron/training/mlflow_artifacts.py:370

  • This assignment to 'dfs' is unnecessary as it is redefined before this value is used.
            dfs = generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

Copilot AI commented Dec 17, 2025

@gphuang I've opened a new pull request, #379, to work on those changes. Once the pull request is ready, I'll request review from you.

Copy link

Copilot AI commented Dec 17, 2025

@gphuang I've opened a new pull request, #380, to work on those changes. Once the pull request is ready, I'll request review from you.

Copy link

Copilot AI commented Dec 17, 2025

@gphuang I've opened a new pull request, #381, to work on those changes. Once the pull request is ready, I'll request review from you.

Copy link

Copilot AI commented Dec 17, 2025

@gphuang I've opened a new pull request, #382, to work on those changes. Once the pull request is ready, I'll request review from you.

Copy link

Copilot AI commented Dec 17, 2025

@gphuang I've opened a new pull request, #383, to work on those changes. Once the pull request is ready, I'll request review from you.

Copy link

Copilot AI commented Dec 17, 2025

@gphuang I've opened a new pull request, #384, to work on those changes. Once the pull request is ready, I'll request review from you.

Copy link

Copilot AI commented Dec 17, 2025

@gphuang I've opened a new pull request, #385, to work on those changes. Once the pull request is ready, I'll request review from you.

Copy link

Copilot AI commented Dec 17, 2025

@gphuang I've opened a new pull request, #386, to work on those changes. Once the pull request is ready, I'll request review from you.

Copy link

Copilot AI commented Dec 17, 2025

@gphuang I've opened a new pull request, #387, to work on those changes. Once the pull request is ready, I'll request review from you.

Copy link

Copilot AI commented Dec 17, 2025

@gphuang I've opened a new pull request, #388, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI and others added 2 commits December 17, 2025 13:12
…iew (#388)

Addresses feedback from PR #377 review comment regarding TraceLens
report generation potentially blocking training completion before MLflow
run closure.

## Analysis

Reviewed the TraceLens integration implementation in commits cb399c0,
48522c2, and 86685b2. The concern about blocking behavior has already
been mitigated through:

- **Opt-in design**: `mlflow_upload_tracelens_report` defaults to
`False`, preventing unexpected blocking
- **Configurable scope**: `mlflow_tracelens_ranks` and
`mlflow_tracelens_max_reports` parameters limit report generation to
specific ranks or count
- **Format options**: `mlflow_tracelens_output_format` allows choosing
lighter CSV format over XLSX

## Current behavior

```python
upload_mlflow_artifacts(
    tensorboard_dir=args.tensorboard_dir,
    exp_root_path=self.exp_root_path,
    upload_traces=getattr(args, "mlflow_upload_traces", True),
    upload_logs=getattr(args, "mlflow_upload_logs", True),
    upload_tracelens_report=getattr(args, "mlflow_upload_tracelens_report", False),  # Opt-in
    tracelens_ranks=getattr(args, "mlflow_tracelens_ranks", None),  # Control scope
    tracelens_max_reports=getattr(args, "mlflow_tracelens_max_reports", None),  # Limit count
    tracelens_output_format=getattr(args, "mlflow_tracelens_output_format", "all"),
)
mlflow_writer.end_run()  # Guaranteed to execute
```

The synchronous design is appropriate for this use case:
- Report generation is explicitly requested by users who need the
analysis
- Users control the scope and cost via configuration parameters
- Failure to generate reports is logged but doesn't prevent MLflow run
closure (handled in try/except blocks)

No changes required - comment thread marked as resolved.

<!-- START COPILOT CODING AGENT TIPS -->
---

💬 We'd love your input! Share your thoughts on Copilot coding agent in
our [2 minute survey](https://gh.io/copilot-coding-agent-survey).

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…387)

The YAML config documentation for `mlflow_tracelens_output_format` was
incomplete - it listed "all", "xlsx", and "csv" as valid options, but
the implementation in `mlflow_artifacts.py` also accepts "html" (with
fallback to xlsx+csv since HTML generation is not yet implemented).

## Changes
- Updated `primus_megatron_module.yaml` line 13 to document "html" as a
valid format option with note that it falls back to xlsx+csv

```yaml
mlflow_tracelens_output_format: all  # TraceLens report format: all (xlsx+csv), xlsx, csv, or html (html not yet supported, falls back to xlsx+csv)
```

This aligns the configuration documentation with the actual
implementation behavior in `generate_tracelens_report()`.

<!-- START COPILOT CODING AGENT TIPS -->
---

💡 You can make Copilot smarter by setting up custom instructions,
customizing its development environment and configuring Model Context
Protocol (MCP) servers. Learn more [Copilot coding agent
tips](https://gh.io/copilot-coding-agent-tips) in the docs.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: gphuang <13152353+gphuang@users.noreply.github.com>
Copilot AI review requested due to automatic review settings December 17, 2025 11:13
@gphuang gphuang requested a review from Copilot December 17, 2025 14:54
@gphuang
Copy link
Author

gphuang commented Dec 17, 2025

@copilot open a new pull request to apply changes based on the comments in this thread

Copy link

Copilot AI commented Dec 17, 2025

@gphuang I've opened a new pull request, #432, to work on those changes. Once the pull request is ready, I'll request review from you.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI and others added 2 commits December 17, 2025 17:30
Thanks for the feedback on #377. I've created this new PR, which merges
into #377, to address your comment. I will work on the changes and keep
this PR's description up to date as I make progress.

Original PR: #377
Triggering review:
#377 (comment)
> @copilot open a new pull request to apply changes based on [this
feedback](#377 (comment))

<!-- START COPILOT CODING AGENT TIPS -->
---

💬 We'd love your input! Share your thoughts on Copilot coding agent in
our [2 minute survey](https://gh.io/copilot-coding-agent-survey).

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…ion (#427)

- [x] Analyze the mlflow_artifacts.py module to understand functionality
- [ ] Create test_mlflow_artifacts.py file with comprehensive test
coverage
- [ ] Add tests for _get_all_trace_files function
- [ ] Add tests for _get_all_log_files function  
- [ ] Add tests for _extract_rank_from_filename function
- [ ] Add tests for _filter_traces_by_rank function
- [ ] Add tests for generate_tracelens_report function
- [ ] Add tests for upload_trace_files_to_mlflow function
- [ ] Add tests for upload_tracelens_reports_to_mlflow function
- [ ] Add tests for upload_artifacts_to_mlflow main function
- [ ] Run tests to ensure they all pass

<!-- START COPILOT CODING AGENT TIPS -->
---

💡 You can make Copilot smarter by setting up custom instructions,
customizing its development environment and configuring Model Context
Protocol (MCP) servers. Learn more [Copilot coding agent
tips](https://gh.io/copilot-coding-agent-tips) in the docs.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: gphuang <13152353+gphuang@users.noreply.github.com>
Copilot AI review requested due to automatic review settings December 17, 2025 15:31
Copilot AI and others added 3 commits December 17, 2025 17:31
- [ ] Remove HTML references from module docstring in
mlflow_artifacts.py (line 33)
- [ ] Remove HTML from generate_tracelens_report function docstring
(line 474)
- [ ] Remove HTML from valid_formats set (line 483)
- [ ] Remove HTML fallback logic and warning (lines 576-579)
- [ ] Remove HTML from generate_tracelens_reports function docstring
(line 730)
- [ ] Remove HTML from upload_tracelens_reports_to_mlflow function
docstring (line 791)
- [ ] Remove HTML from upload_artifacts_to_mlflow function docstring
(line 900)
- [ ] Remove HTML from upload_mlflow_artifacts function docstring in
global_vars.py (lines 125-126)
- [ ] Remove HTML from configuration file primus_megatron_module.yaml
(line 13)
- [ ] Test the changes to ensure no breakage
- [ ] Run code review
- [ ] Run security check

<!-- START COPILOT CODING AGENT TIPS -->
---

💬 We'd love your input! Share your thoughts on Copilot coding agent in
our [2 minute survey](https://gh.io/copilot-coding-agent-survey).

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: gphuang <13152353+gphuang@users.noreply.github.com>
User requested opening a new pull request to address review comments,
but this environment cannot create new PRs - only commit to the current
PR.

## Changes Made

- Replied to comment requesting clarification on whether to:
  - Address unresolved review comments in current PR #377, or
  - Have someone else create a separate PR for those changes

## Unresolved Review Items Available to Address

If addressing in current PR, can fix:
- Missing input validation for `output_format` parameter
- Ambiguous empty ranks list behavior (`[]` vs `None`)
- Missing imports (`subprocess`, `sys`)
- Code duplication in HTML fallback case
- Uninitialized `generated_files` variable
- Documentation inconsistencies
- Unused variable `dfs`

No code changes made pending user clarification on approach.

<!-- START COPILOT CODING AGENT TIPS -->
---

💡 You can make Copilot smarter by setting up custom instructions,
customizing its development environment and configuring Model Context
Protocol (MCP) servers. Learn more [Copilot coding agent
tips](https://gh.io/copilot-coding-agent-tips) in the docs.

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
The TraceLens GitHub URL was hardcoded in multiple locations within
`mlflow_artifacts.py`, making updates error-prone and inconsistent.

## Changes

- Added `TRACELENS_GITHUB_URL` module-level constant
- Updated `_ensure_tracelens_installed()` to reference the constant in
subprocess call
- Docstring retains literal URL for user-facing documentation

```python
# Module level
TRACELENS_GITHUB_URL = "git+https://github.com/AMD-AGI/TraceLens.git"

# Usage in code
subprocess.check_call([sys.executable, "-m", "pip", "install", TRACELENS_GITHUB_URL, "-q"])
```

Provides single source of truth for the URL, simplifying future
repository changes.

<!-- START COPILOT CODING AGENT TIPS -->
---

💡 You can make Copilot smarter by setting up custom instructions,
customizing its development environment and configuring Model Context
Protocol (MCP) servers. Learn more [Copilot coding agent
tips](https://gh.io/copilot-coding-agent-tips) in the docs.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: gphuang <13152353+gphuang@users.noreply.github.com>
The request to address code duplication in TraceLens report generation
has already been resolved in commit 3516b65.

## Current State

The `generate_tracelens_report()` function already uses a helper
function `_generate_xlsx_and_csv_reports()` that eliminates duplication:

```python
def _generate_xlsx_and_csv_reports(generate_xlsx: bool, generate_csv: bool) -> list:
    """Helper to generate XLSX and/or CSV TraceLens reports."""
    local_generated_files = []
    
    if generate_xlsx:
        xlsx_path = os.path.join(output_dir, f"{report_name}_analysis.xlsx")
        dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)
        # ... collection logic
    
    if generate_csv:
        csv_subdir = os.path.join(output_dir, report_name)
        os.makedirs(csv_subdir, exist_ok=True)
        generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)
        # ... collection logic
    
    return local_generated_files

# All format cases reuse the helper
if output_format in ("all", "xlsx"):
    generated_files.extend(_generate_xlsx_and_csv_reports(generate_xlsx=True, generate_csv=False))

if output_format in ("all", "csv"):
    generated_files.extend(_generate_xlsx_and_csv_reports(generate_xlsx=False, generate_csv=True))

if output_format == "html":
    generated_files.extend(_generate_xlsx_and_csv_reports(generate_xlsx=True, generate_csv=True))
```

No further changes required.

<!-- START COPILOT CODING AGENT TIPS -->
---

💡 You can make Copilot smarter by setting up custom instructions,
customizing its development environment and configuring Model Context
Protocol (MCP) servers. Learn more [Copilot coding agent
tips](https://gh.io/copilot-coding-agent-tips) in the docs.

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
@gphuang
Copy link
Author

gphuang commented Dec 17, 2025

@copilot open a new pull request to apply changes based on the comments in this thread

Copy link

Copilot AI commented Dec 17, 2025

@gphuang I've opened a new pull request, #434, to work on those changes. Once the pull request is ready, I'll request review from you.

@gphuang
Copy link
Author

gphuang commented Dec 17, 2025

@copilot open a new pull request to apply changes based on the comments in this thread

Copy link

Copilot AI commented Dec 17, 2025

@gphuang I've opened a new pull request, #435, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI added 2 commits December 17, 2025 17:38
)

Thanks for the feedback on #377. I've created this new PR, which merges
into #377, to address your comment. I will work on the changes and keep
this PR's description up to date as I make progress.

Original PR: #377
Triggering comment
(#377 (comment)):
> @copilot open a new pull request to apply changes based on the
comments in [this
thread](#377 (review))


<!-- START COPILOT CODING AGENT TIPS -->
---

💡 You can make Copilot smarter by setting up custom instructions,
customizing its development environment and configuring Model Context
Protocol (MCP) servers. Learn more [Copilot coding agent
tips](https://gh.io/copilot-coding-agent-tips) in the docs.

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Thanks for the feedback on #377. I've created this new PR, which merges
into #377, to address your comment. I will work on the changes and keep
this PR's description up to date as I make progress.

Original PR: #377
Triggering comment
(#377 (comment)):
> @copilot open a new pull request to apply changes based on the
comments in [this
thread](#377 (review))


<!-- START COPILOT CODING AGENT TIPS -->
---

✨ Let Copilot coding agent [set things up for
you](https://github.com/AMD-AGI/Primus/issues/new?title=✨+Set+up+Copilot+instructions&body=Configure%20instructions%20for%20this%20repository%20as%20documented%20in%20%5BBest%20practices%20for%20Copilot%20coding%20agent%20in%20your%20repository%5D%28https://gh.io/copilot-coding-agent-tips%29%2E%0A%0A%3COnboard%20this%20repo%3E&assignees=copilot)
— coding agent works faster and does higher quality work when set up for
your repo.

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Copilot AI review requested due to automatic review settings December 17, 2025 15:52
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 8 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +151 to +186
def _get_all_trace_files(tensorboard_dir: str) -> list:
"""
Find all profiler trace files in the tensorboard directory.

Trace files are typically named like (note: square brackets are literal characters):
- primus-megatron-exp[my-exp-name]-rank[0].1234567890.json
- primus-megatron-exp[my-exp-name]-rank[0].1234567890.json.gz

Args:
tensorboard_dir: Path to the tensorboard directory containing trace files

Returns:
List of paths to trace files
"""
if not tensorboard_dir or not os.path.exists(tensorboard_dir):
return []

trace_files = []
# Look for JSON trace files (both compressed and uncompressed)
patterns = ["*.json", "*.json.gz", "*.pt.trace.json", "*.pt.trace.json.gz"]
for pattern in patterns:
trace_files.extend(glob.glob(os.path.join(tensorboard_dir, pattern)))
trace_files.extend(glob.glob(os.path.join(tensorboard_dir, "**", pattern), recursive=True))

# Remove duplicates while preserving order
seen = set()
unique_files = []
for f in trace_files:
if f not in seen:
seen.add(f)
unique_files.append(f)

return unique_files


def _get_all_log_files(exp_root_path: str) -> list:
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent type annotation style. The function uses lowercase list as return type (lines 151, 186, 528), while other functions in the module use List from the typing module (e.g., line 464 uses List[str]). For consistency and better type checking in older Python versions, consider using List[str] from typing throughout the module.

Copilot uses AI. Check for mistakes.
Returns:
Rank number or None if not found
"""
import re
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The import re statement is placed inside the function body. While this works, it's generally better practice to place imports at the module level for better readability and to avoid repeated import overhead if this function is called multiple times.

Copilot uses AI. Check for mistakes.
Comment on lines +377 to +401
except ImportError:
log_rank_0("[TraceLens] TraceLens not found, attempting to install from GitHub...")
try:
# TraceLens is on GitHub, not PyPI
subprocess.check_call(
[
sys.executable,
"-m",
"pip",
"install",
TRACELENS_GITHUB_URL,
"-q",
]
)
log_rank_0("[TraceLens] Successfully installed TraceLens from GitHub")
return True
except subprocess.CalledProcessError as e:
warning_rank_0(f"[TraceLens] Failed to install TraceLens: {e}")
return False
except (PermissionError, OSError) as e:
warning_rank_0(f"[TraceLens] Failed to install TraceLens due to system error: {e}")
return False
except Exception as e:
warning_rank_0(f"[TraceLens] Failed to install TraceLens due to unexpected error: {e}")
return False
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function automatically attempts to install TraceLens using subprocess.check_call when it's not found. This behavior could be problematic in production environments or containerized deployments where:

  1. Automatic package installation may not be desired or allowed
  2. Network access might be restricted
  3. The user might want to control when and how packages are installed

Consider making the auto-installation opt-in via a configuration parameter or environment variable, rather than attempting it automatically on ImportError.

Copilot uses AI. Check for mistakes.
Comment on lines +733 to +759
def test_html_format_fallback(self, mock_log, mock_warning):
"""Test that HTML format falls back to xlsx+csv."""
with tempfile.TemporaryDirectory() as tmpdir:
trace_file = os.path.join(tmpdir, "trace.json")
open(trace_file, "w").close()

mock_generate = Mock(return_value={"kernels": Mock()})

# Create files that would be generated
xlsx_path = os.path.join(tmpdir, "trace_analysis.xlsx")
open(xlsx_path, "w").close()

csv_dir = os.path.join(tmpdir, "trace")
os.makedirs(csv_dir)
csv_file = os.path.join(csv_dir, "kernels.csv")
open(csv_file, "w").close()

with patch.dict("sys.modules", {"TraceLens": Mock(), "TraceLens.Reporting": Mock()}):
with patch("primus.backends.megatron.training.mlflow_artifacts.generate_perf_report_pytorch", mock_generate):
with patch("primus.backends.megatron.training.mlflow_artifacts.glob.glob", return_value=[csv_file]):
result = generate_tracelens_report(
trace_file, tmpdir, "trace", output_format="html"
)

# Should fall back to both xlsx and csv
assert len(result) >= 1
mock_warning.assert_called() # Should warn about HTML not supported
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test test_html_format_fallback expects HTML format to be handled gracefully with a warning, but the actual implementation raises a ValueError for invalid formats (line 486-489). The test will fail because it passes "html" which is not in the valid_formats set, causing a ValueError to be raised before any warning can be logged. Either the test should expect a ValueError to be raised, or the implementation should handle unsupported formats more gracefully by issuing a warning and falling back to a default format.

Copilot uses AI. Check for mistakes.
Comment on lines +71 to +72
open(trace1, "w").close()
open(trace2, "w").close()
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

File handles are not properly closed. Using open(file, "w").close() creates a file handle but doesn't guarantee it's closed if an exception occurs between open and close. Consider using context managers with with open(file, "w"): or Path(file).touch() instead.

Copilot uses AI. Check for mistakes.
Comment on lines +544 to +547
dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)
if os.path.exists(xlsx_path):
log_rank_0(
f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function attempts to access len(dfs) on line 547 assuming generate_perf_report_pytorch returns a dict-like object with dataframes. However, there's no error handling if this function returns None, an unexpected type, or raises an exception. If generate_perf_report_pytorch returns None or a type that doesn't support len(), this will cause a TypeError. Consider adding a check for the return value type or wrapping this in additional error handling.

Suggested change
dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)
if os.path.exists(xlsx_path):
log_rank_0(
f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
dfs = None
try:
dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)
except Exception as exc:
warning_rank_0(
f"[TraceLens] Failed to generate XLSX report for {trace_file}: {exc}"
)
if os.path.exists(xlsx_path):
tab_info = ""
if dfs is not None:
try:
tab_info = f" with {len(dfs)} tabs"
except TypeError:
tab_info = ""
log_rank_0(
f"[TraceLens] Generated XLSX report{tab_info}: {os.path.basename(xlsx_path)}"

Copilot uses AI. Check for mistakes.
with tempfile.TemporaryDirectory() as tmpdir:
result = generate_tracelens_reports(tmpdir, tmpdir, ranks=[0, 2])

# Should only generate reports for rank 0 and 2
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variable result is not used.

Suggested change
# Should only generate reports for rank 0 and 2
# Should only generate reports for rank 0 and 2
assert len(result) == 2

Copilot uses AI. Check for mistakes.
mock_gen_report.return_value = ["/output/report.xlsx"]

with tempfile.TemporaryDirectory() as tmpdir:
result = generate_tracelens_reports(tmpdir, tmpdir, max_reports=3)
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variable result is not used.

Copilot uses AI. Check for mistakes.
@gphuang gphuang closed this Dec 18, 2025
@gphuang gphuang deleted the feat/12-tracelens-integration branch December 18, 2025 11:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants