fix(attribution): discover FR dumps beside app logs #345
Greptile Summary
This PR fixes FR dump discovery for flat per-cycle log layouts (e.g.
Confidence Score: 4/5
The change is safe to merge; the targeted bug fix works correctly and all existing tests continue to pass. The core logic for the new flat per-cycle layout is correct and well-tested. Two minor concerns hold the score below a clean pass: the while-loop no longer exits immediately when a candidate checkpoints directory exists but has no traces (a subtle change from the original early-return), and test_checkpoints_sibling_of_logs now implicitly relies on /container/dump* not existing on the host, which was not a requirement before the ordering flip. Both changed files warrant a second look: fr_support.py for the loop early-exit change, and test_fr_dump_path.py for the environment-sensitivity of the existing sibling test.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["extract_fr_dump_path(log_path, allowed_root)"] --> B["_read_torch_fr_dump_from_log\nScan first 1000 lines for\nTORCH_FR_DUMP_TEMP_FILE="]
B --> C{Prefix found\nin log?}
C -- Yes --> D["_validated_torch_fr_dump_prefix\nStrip quotes · check not-dir\ncheck traces exist · check allowed_root"]
D --> E{Valid?}
E -- Yes --> RETURN1["return prefix ✓"]
E -- No --> F
C -- No --> F["_infer_checkpoints_dir_from_log_path"]
F --> G["Check log_dir/checkpoints\n_valid_checkpoints_dir\n(NEW fallback)"]
G --> H{Exists + traces + allowed_root?}
H -- Yes --> RETURN2["return log_dir/checkpoints ✓"]
H -- No --> I["Walk up directory tree\nlooking for logs ancestor"]
I --> J{basename == logs?}
J -- Yes --> K["Check run_root/checkpoints\n_valid_checkpoints_dir\n(existing fallback)"]
K --> L{Exists + traces + allowed_root?}
L -- Yes --> RETURN3["return run/checkpoints ✓"]
L -- No --> M{Reached root?}
J -- No --> M
M -- No --> I
M -- Yes --> RETURN4["return None"]
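
The resolution order in the flowchart can be sketched in Python roughly as follows. This is a minimal illustration, not the code in fr_support.py: the function names, the `TRACE_GLOB` pattern used to decide whether "traces exist", and the token parsing of the log line are all assumptions made for the sketch. Only the ordering (log-scanned prefix, then `<log_dir>/checkpoints`, then `<run>/checkpoints` for a `logs` ancestor) and the 1000-line scan limit are taken from the flowchart.

```python
import glob
import os
from typing import Optional

ENV_KEY = "TORCH_FR_DUMP_TEMP_FILE="
MAX_SCAN_LINES = 1000   # scan limit taken from the flowchart
TRACE_GLOB = "*dump*"   # hypothetical pattern for "traces exist"


def _under_root(path: str, allowed_root: str) -> bool:
    """True if path resolves to a location inside allowed_root."""
    real, root = os.path.realpath(path), os.path.realpath(allowed_root)
    return os.path.commonpath([real, root]) == root


def read_prefix_from_log(log_path: str) -> Optional[str]:
    """Return the first TORCH_FR_DUMP_TEMP_FILE value found near the top of the log."""
    with open(log_path, errors="replace") as fh:
        for i, line in enumerate(fh):
            if i >= MAX_SCAN_LINES:
                break
            idx = line.find(ENV_KEY)
            if idx != -1:
                tokens = line[idx + len(ENV_KEY):].split()
                if tokens:
                    return tokens[0].strip("'\"")  # strip surrounding quotes
    return None


def valid_prefix(prefix: str, allowed_root: str) -> bool:
    """Prefix must not be a directory, must match dump files, and must sit under the root."""
    if os.path.isdir(prefix):
        return False
    if not glob.glob(glob.escape(prefix) + "*"):
        return False
    return _under_root(prefix, allowed_root)


def valid_checkpoints_dir(path: str, allowed_root: str) -> bool:
    """Directory must exist, contain trace-like files, and sit under the root."""
    if not os.path.isdir(path):
        return False
    if not glob.glob(os.path.join(glob.escape(path), TRACE_GLOB)):
        return False
    return _under_root(path, allowed_root)


def infer_checkpoints_dir(log_path: str, allowed_root: str) -> Optional[str]:
    log_dir = os.path.dirname(os.path.abspath(log_path))
    beside = os.path.join(log_dir, "checkpoints")  # flat per-cycle layout
    if valid_checkpoints_dir(beside, allowed_root):
        return beside
    current = log_dir
    while True:
        if os.path.basename(current) == "logs":
            candidate = os.path.join(os.path.dirname(current), "checkpoints")
            if valid_checkpoints_dir(candidate, allowed_root):
                return candidate
            # An invalid candidate does not stop the walk; keep going up.
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            return None
        current = parent


def resolve_fr_dump_path(log_path: str, allowed_root: str) -> Optional[str]:
    prefix = read_prefix_from_log(log_path)
    if prefix and valid_prefix(prefix, allowed_root):
        return prefix
    return infer_checkpoints_dir(log_path, allowed_root)
```

Note that in this sketch a `logs` ancestor whose sibling checkpoints directory fails validation does not end the search, which mirrors the "L -- No --> M" edge in the flowchart and the loop behavior flagged in the confidence note.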
Thanks for this; the refactor is clean and the helpers de-duplicate the validation nicely. A couple of points before merge:

1. Resolution order changed; please call it out. This flips the precedence: the log-scanned

2. Let's fix the root cause rather than keep growing the layout heuristics. This PR brings the layout-guessing rules in

The underlying problem isn't the discovery algorithm; it's a missing data channel. attrsvc only receives

The launcher already knows the answer: it submits via

That makes resolution deterministic for launcher-managed jobs and demotes the heuristics to a genuine fallback.

Suggestion: I'm fine merging this as a stopgap if we open a tracking issue for the
@hexinw-nvidia attrsvc gets the log_path but not the FR path. The FR dump is controlled entirely at the workload level, with no notification to attrsvc or any other outside component: https://docs.pytorch.org/tutorials/unstable/flight_recorder_tutorial.html#enabling-flight-recorder. attrsvc learns the path only when TORCH_FR_DUMP_TEMP_FILE appears in the cycle log file. If it's not there, discovery is heuristics-based. We can probably just do what we do for Megatron and ignore other workloads. Let's check with the bug submitter.
Summary
- TORCH_FR_DUMP_TEMP_FILE=<prefix> as the explicit first FR discovery rule when the prefix resolves under the attrsvc allowed root
- <log_dir>/checkpoints, e.g. /mnt/logs/test_job_cycle0.log -> /mnt/logs/checkpoints
- <run>/checkpoints fallback for logs under <run>/logs/...

Root Cause
NVBUG 6233471 reports that launcher-managed attrsvc can produce app-log attribution while missing already-generated FR dumps. For flat per-cycle logs such as /mnt/logs/test_job_cycle0.log, the previous layout heuristic only tried /mnt/checkpoints, and the TORCH_FR_DUMP_TEMP_FILE fallback only helped when the analyzed app log contained a resolvable prefix. It did not try /mnt/logs/checkpoints.

Validation
- PYTHONPATH=src python3.11 -m unittest tests.attribution.unit.test_fr_dump_path -v
- python3 -m ruff check src/nvidia_resiliency_ext/attribution/trace_analyzer/fr_support.py tests/attribution/unit/test_fr_dump_path.py
- resolved=None, pipeline_fr_analysis=None
- .../mnt/logs/checkpoints, FR analysis runs
- 3068564 on node nvl72150-T07: completed 0:0; unpatched missed FR, patched resolved .../mnt/logs/checkpoints and produced hanging ranks: [0]
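
To make the root cause concrete, the layout-heuristic candidates for a given log path can be enumerated with pure path arithmetic. The helper below is a hypothetical illustration (the name `candidate_checkpoint_dirs` is not from the PR) and does no filesystem or allowed-root checks; it only lists which directories the heuristics would consider, in order, assuming the beside-the-log candidate is tried before the `logs`-ancestor candidates.

```python
import posixpath
from typing import List


def candidate_checkpoint_dirs(log_path: str) -> List[str]:
    """List layout-heuristic candidates in the order they would be tried."""
    log_dir = posixpath.dirname(log_path)
    # Candidate added by this PR: a checkpoints directory beside the log.
    candidates = [posixpath.join(log_dir, "checkpoints")]
    current = log_dir
    while True:
        # Pre-existing rule: for an ancestor named "logs", try <run>/checkpoints.
        if posixpath.basename(current) == "logs":
            run_root = posixpath.dirname(current)
            sibling = posixpath.join(run_root, "checkpoints")
            if sibling not in candidates:
                candidates.append(sibling)
        parent = posixpath.dirname(current)
        if parent == current:  # reached the root (or an empty relative path)
            break
        current = parent
    return candidates


print(candidate_checkpoint_dirs("/mnt/logs/test_job_cycle0.log"))
# ['/mnt/logs/checkpoints', '/mnt/checkpoints']
```

For the flat layout in the bug report, only the second entry was reachable before the patch, which is why dumps under /mnt/logs/checkpoints were missed.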