Skip to content

fix(attribution): discover FR dumps beside app logs#345

Merged
namitdhameja merged 2 commits into
mainfrom
nvbug6233471-fr-discovery
Jun 4, 2026
Merged

fix(attribution): discover FR dumps beside app logs#345
namitdhameja merged 2 commits into
mainfrom
nvbug6233471-fr-discovery

Conversation

@namitdhameja

Copy link
Copy Markdown
Contributor

Summary

  • keep TORCH_FR_DUMP_TEMP_FILE=<prefix> as the explicit first FR discovery rule when the prefix resolves under the attrsvc allowed root
  • add fallback discovery for flat per-cycle logs where FR dumps live under <log_dir>/checkpoints, e.g. /mnt/logs/test_job_cycle0.log -> /mnt/logs/checkpoints
  • retain the existing <run>/checkpoints fallback for logs under <run>/logs/...

Root Cause

NVBUG 6233471 reports that launcher-managed attrsvc can produce app-log attribution while missing already-generated FR dumps. For flat per-cycle logs such as /mnt/logs/test_job_cycle0.log, the previous layout heuristic only tried /mnt/checkpoints, and the TORCH_FR_DUMP_TEMP_FILE fallback only helped when the analyzed app log contained a resolvable prefix. It did not try /mnt/logs/checkpoints.

Validation

  • PYTHONPATH=src python3.11 -m unittest tests.attribution.unit.test_fr_dump_path -v
  • python3 -m ruff check src/nvidia_resiliency_ext/attribution/trace_analyzer/fr_support.py tests/attribution/unit/test_fr_dump_path.py
  • Dev login repro against unpatched and patched worktrees:
    • unpatched: resolved=None, pipeline_fr_analysis=None
    • patched: resolves .../mnt/logs/checkpoints, FR analysis runs
  • Dev Slurm repro job 3068564 on node nvl72150-T07: completed 0:0; unpatched missed FR, patched resolved .../mnt/logs/checkpoints and produced hanging ranks: [0]

@namitdhameja namitdhameja self-assigned this Jun 1, 2026
@namitdhameja namitdhameja marked this pull request as ready for review June 1, 2026 07:46
@greptile-apps

greptile-apps Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes FR dump discovery for flat per-cycle log layouts (e.g. /mnt/logs/test_job_cycle0.log) where dumps reside in <log_dir>/checkpoints rather than the previously assumed <run>/checkpoints. It also promotes TORCH_FR_DUMP_TEMP_FILE scanning to first priority so that explicit log-scanned prefixes win when they resolve under the attrsvc allowed root.

  • Refactors extract_fr_dump_path to check the log-scanned TORCH_FR_DUMP_TEMP_FILE prefix first, then <log_dir>/checkpoints (new), then the existing <run>/checkpoints logs-sibling heuristic; extracts _validated_torch_fr_dump_prefix and _valid_checkpoints_dir helpers to centralise validation logic.
  • Adds two new tests covering the flat per-cycle layout and the explicit-prefix priority ordering.

Confidence Score: 4/5

The change is safe to merge; the targeted bug fix works correctly and all existing tests continue to pass.

The core logic for the new flat per-cycle layout is correct and well-tested. Two minor concerns hold the score below a clean pass: the while-loop no longer exits immediately when a candidate checkpoints directory exists but has no traces (a subtle change from the original early-return), and test_checkpoints_sibling_of_logs now implicitly relies on /container/dump* not existing on the host, which was not a requirement before the ordering flip.

Both changed files warrant a second look — fr_support.py for the loop early-exit change, and test_fr_dump_path.py for the environment-sensitivity of the existing sibling test.

Important Files Changed

Filename Overview
src/nvidia_resiliency_ext/attribution/trace_analyzer/fr_support.py Refactors FR dump discovery into two new helpers and promotes TORCH_FR_DUMP_TEMP_FILE to first priority; introduces a new <log_dir>/checkpoints fallback. The while-loop early-exit semantics changed subtly (no longer stops on an empty-but-existing sibling checkpoints dir).
tests/attribution/unit/test_fr_dump_path.py Adds two new tests for the flat per-cycle log layout and explicit-prefix priority. The existing test_checkpoints_sibling_of_logs is now subtly environment-sensitive because log-scan ordering changed.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["extract_fr_dump_path(log_path, allowed_root)"] --> B["_read_torch_fr_dump_from_log\nScan first 1000 lines for\nTORCH_FR_DUMP_TEMP_FILE="]
    B --> C{Prefix found\nin log?}
    C -- Yes --> D["_validated_torch_fr_dump_prefix\nStrip quotes · check not-dir\ncheck traces exist · check allowed_root"]
    D --> E{Valid?}
    E -- Yes --> RETURN1["return prefix ✓"]
    E -- No --> F
    C -- No --> F["_infer_checkpoints_dir_from_log_path"]
    F --> G["Check log_dir/checkpoints\n_valid_checkpoints_dir\n(NEW fallback)"]
    G --> H{Exists + traces + allowed_root?}
    H -- Yes --> RETURN2["return log_dir/checkpoints ✓"]
    H -- No --> I["Walk up directory tree\nlooking for logs ancestor"]
    I --> J{basename == logs?}
    J -- Yes --> K["Check run_root/checkpoints\n_valid_checkpoints_dir\n(existing fallback)"]
    K --> L{Exists + traces + allowed_root?}
    L -- Yes --> RETURN3["return run/checkpoints ✓"]
    L -- No --> M{Reached root?}
    J -- No --> M
    M -- No --> I
    M -- Yes --> RETURN4["return None"]
Loading

Comments Outside Diff (2)

  1. src/nvidia_resiliency_ext/attribution/trace_analyzer/fr_support.py, line 154-169 (link)

    P2 Early-exit semantics removed from the logs-sibling search

    The original while True loop returned None immediately when <run>/checkpoints existed as a directory but contained no _dump_* traces. The refactored version delegates to _valid_checkpoints_dir, which returns None and lets the loop continue up the tree. In a layout like /shared/logs/run/logs/slurm/job.log where run/checkpoints is present but empty, the new code will climb to the outer logs ancestor and consider /shared/checkpoints — a directory the original code would never have reached for that log path.

  2. tests/attribution/unit/test_fr_dump_path.py, line 21-37 (link)

    P2 test_checkpoints_sibling_of_logs is now environment-sensitive

    The test writes TORCH_FR_DUMP_TEMP_FILE=/container/_dump_ in the log file and expects extract_fr_dump_path to return the inferred ckpt. Before this PR the inferred path was tried first, so the result was stable regardless of host filesystem state. With the new ordering (log scan first), the test silently relies on /container/_dump_* not existing on the host. In a container-based CI environment where /container/ is a real mount point, glob could match those files and the function would return the container prefix instead of ckpt, causing a spurious failure. Adding allowed_root=run (or a dedicated tempfile-scoped prefix) would make the test deterministic.

Reviews (1): Last reviewed commit: "Merge branch 'main' into nvbug6233471-fr..." | Re-trigger Greptile

@namitdhameja namitdhameja added the ci-approved Approved to run CI label Jun 1, 2026
@hexinw-nvidia

Copy link
Copy Markdown
Contributor

Thanks for this — the refactor is clean and the helpers de-duplicate the validation nicely. A couple of points before merge:

1. Resolution order changed — please call it out. This flips the precedence: the log-scanned TORCH_FR_DUMP_TEMP_FILE prefix now wins before the inferred <run>/checkpoints dir (previously the inferred shared dir was preferred). Since TORCH_FR_DUMP_TEMP_FILE is often a container-local path, the only thing preventing a false match against an unrelated path that happens to exist on the attrsvc host (e.g. its own /tmp/checkpoints/_dump_*) is allowed_root. Worth an explicit note in the PR body and the docstring, since allowed_root is now the sole guard.

2. Let's fix the root cause rather than keep growing the layout heuristics. This PR brings the layout-guessing rules in extract_fr_dump_path to three (TORCH_FR_DUMP_TEMP_FILE scan, <log_dir>/checkpoints, <run>/checkpoints). Each new cluster convention tends to force another one — let's fix the root cause so we're not adding a fourth.

The underlying problem isn't the discovery algorithm; it's a missing data channel. attrsvc only receives log_path (http_api.py:29-44 — no checkpoint/FR field), so it has to reverse-engineer the dump location from log scanning + filesystem-layout guessing.

The launcher already knows the answer: it submits via _submit_log() (launcher.py:992-993), and its os.environ contains TORCH_FR_DUMP_TEMP_FILE at submit time. Proposal:

  • Add an optional fr_dump_path field to the submit payload (log_submit_payload / POST /logs).
  • Have ft_launcher populate it from os.environ["TORCH_FR_DUMP_TEMP_FILE"] when present.
  • In extract_fr_dump_path, prefer an explicitly-supplied fr_dump_path (still subject to the allowed_root + traces-exist validation), and fall back to the existing log-scan / layout heuristics only when it's absent (e.g. non-launcher submitters).

That makes resolution deterministic for launcher-managed jobs and demotes the heuristics to a genuine fallback.

Suggestion: I'm fine merging this as a stopgap if we open a tracking issue for the fr_dump_path change so we don't keep accreting layout rules. Happy to take that follow-up.

@namitdhameja

Copy link
Copy Markdown
Contributor Author

Thanks for this — the refactor is clean and the helpers de-duplicate the validation nicely. A couple of points before merge:

1. Resolution order changed — please call it out. This flips the precedence: the log-scanned TORCH_FR_DUMP_TEMP_FILE prefix now wins before the inferred <run>/checkpoints dir (previously the inferred shared dir was preferred). Since TORCH_FR_DUMP_TEMP_FILE is often a container-local path, the only thing preventing a false match against an unrelated path that happens to exist on the attrsvc host (e.g. its own /tmp/checkpoints/_dump_*) is allowed_root. Worth an explicit note in the PR body and the docstring, since allowed_root is now the sole guard.

2. Let's fix the root cause rather than keep growing the layout heuristics. This PR brings the layout-guessing rules in extract_fr_dump_path to three (TORCH_FR_DUMP_TEMP_FILE scan, <log_dir>/checkpoints, <run>/checkpoints). Each new cluster convention tends to force another one — let's fix the root cause so we're not adding a fourth.

The underlying problem isn't the discovery algorithm; it's a missing data channel. attrsvc only receives log_path (http_api.py:29-44 — no checkpoint/FR field), so it has to reverse-engineer the dump location from log scanning + filesystem-layout guessing.

The launcher already knows the answer: it submits via _submit_log() (launcher.py:992-993), and its os.environ contains TORCH_FR_DUMP_TEMP_FILE at submit time. Proposal:

  • Add an optional fr_dump_path field to the submit payload (log_submit_payload / POST /logs).
  • Have ft_launcher populate it from os.environ["TORCH_FR_DUMP_TEMP_FILE"] when present.
  • In extract_fr_dump_path, prefer an explicitly-supplied fr_dump_path (still subject to the allowed_root + traces-exist validation), and fall back to the existing log-scan / layout heuristics only when it's absent (e.g. non-launcher submitters).

That makes resolution deterministic for launcher-managed jobs and demotes the heuristics to a genuine fallback.

Suggestion: I'm fine merging this as a stopgap if we open a tracking issue for the fr_dump_path change so we don't keep accreting layout rules. Happy to take that follow-up.

@hexinw-nvidia The attrsvc gets the log_path but not the FR path. FR dump is completely controlled at the workload level, with no notification to attrsvc or any other outside component. https://docs.pytorch.org/tutorials/unstable/flight_recorder_tutorial.html#enabling-flight-recorder ; attrsvc gets to know the path via TORCH_FR_DUMP_TEMP_FILE being in the cycle log file. And if its not there, then its heuristics based. We can probably just do what we do for Megatron, and ignore other workloads. Lets check with bug submitter.

@namitdhameja namitdhameja merged commit f008f6e into main Jun 4, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-approved Approved to run CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants