Add cache hit/miss pattern divergence detection for multi-rank reports on TLParse landing page #126

skarjala · 2025-07-22T00:39:49Z

Summary:
Detect diverging cache hit/miss patterns across ranks in multi-rank HTML reports and show a warning for them. This helps identify distributed execution issues caused by inconsistent caching behavior.

Cache sequence tracking: Extract cache hit/miss patterns from compile_directory.json artifacts and build per-rank cache sequences
Divergence detection: Group ranks by their cache sequences and identify when patterns differ across ranks
Warning display: Show cache divergence warnings in the multi-rank landing page alongside existing compile ID divergence warnings
Rank grouping: Display which ranks share the same cache patterns to help isolate problematic ranks
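
As a rough illustration of the grouping and divergence steps listed above, here is a minimal sketch. It is not the code from this PR: `RankCache`, `group_by_cache_sequence`, and `divergence_groups` are hypothetical names, the cache sequence is modeled as a plain `String`, and `std::collections::BTreeMap` stands in for whatever map type the real implementation uses.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-rank record (assumption: the real struct may differ).
#[derive(Debug, Clone)]
pub struct RankCache {
    pub rank: u32,
    /// Hit/miss markers in artifact order, e.g. "HM" for hit then miss.
    pub cache_sequence: String,
}

/// Groups rank numbers by identical cache sequence.
/// Ranks inside each group are sorted in ascending order.
pub fn group_by_cache_sequence(ranks: Vec<RankCache>) -> BTreeMap<String, Vec<u32>> {
    let mut groups: BTreeMap<String, Vec<u32>> = BTreeMap::new();
    for r in ranks {
        groups.entry(r.cache_sequence).or_default().push(r.rank);
    }
    for members in groups.values_mut() {
        members.sort_unstable();
    }
    groups
}

/// Returns one display string per group (e.g. "0, 2"), or `None` when
/// every rank shares the same sequence, so no warning would be shown.
pub fn divergence_groups(ranks: Vec<RankCache>) -> Option<Vec<String>> {
    let groups = group_by_cache_sequence(ranks);
    if groups.len() <= 1 {
        return None;
    }
    Some(
        groups
            .values()
            .map(|members| {
                members
                    .iter()
                    .map(|r| r.to_string())
                    .collect::<Vec<_>>()
                    .join(", ")
            })
            .collect(),
    )
}
```

In this sketch the groups come out ordered by sequence string because of the `BTreeMap`; a hash map would give the same groups in unspecified order.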

Testing:
Added comprehensive test coverage:

Basic divergence detection between two ranks with different patterns
No false positives when all ranks have identical patterns
Multiple divergent groups with complex rank distributions

Minor update to existing tests to account for the extra log added to the multi_rank_logs directory.

Ex:

xmfan · 2025-07-22T18:36:54Z

src/cli.rs

+        rank_metadata
+            .iter()
+            .fold(FxHashMap::default(), |mut acc, md| {
+                acc.entry(md.cache_sequence.clone())


Curious, why the clone here? rank_metadata seems unused after this. Is it due to .fold?

Actually by changing iter() to into_iter() we consume the rank_metadata vector and get owned RankMetaData items, so we can move md.cache_sequence into the HashMap without cloning. You're right, since rank_metadata isn't needed after this point, consuming it makes more sense. Fixed.
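
For readers following along, the ownership point can be sketched as below. This is an assumption-laden stand-in rather than the actual change: the `RankMetaData` fields shown are hypothetical, `group_consuming` is an invented name, and `std::collections::HashMap` replaces the `FxHashMap` used in the diff so the snippet has no external dependencies.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the PR's RankMetaData (fields are assumptions).
pub struct RankMetaData {
    pub rank: u32,
    pub cache_sequence: String,
}

/// Consumes the vector, so each `cache_sequence` is moved into the map key
/// instead of being cloned. With `.iter()` the closure would only see
/// `&RankMetaData` and would need `md.cache_sequence.clone()`.
pub fn group_consuming(rank_metadata: Vec<RankMetaData>) -> HashMap<String, Vec<u32>> {
    rank_metadata
        .into_iter()
        .fold(HashMap::<String, Vec<u32>>::new(), |mut acc, md| {
            // `md` is owned here: the String is moved, `rank` is Copy.
            acc.entry(md.cache_sequence).or_default().push(md.rank);
            acc
        })
}
```

The trade-off is that `rank_metadata` is no longer usable after the call, which is fine only when nothing downstream needs it.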

xmfan · 2025-07-22T18:49:47Z

src/cli.rs

                if key != "unknown" && !key.starts_with("unknown_") {
                    compile_ids.insert(key.clone());
                }
+                if let Some(arr) = val.get("artifacts").and_then(|v| v.as_array()) {
+                    for art in arr {
+                        let suffix = art.get("suffix").and_then(|s| s.as_str()).unwrap_or("");


Could you explain what the value of "suffix" is here?

suffix wasn’t introduced in this PR, it’s an existing field on OutputFile (in types.rs), already used by the HTML template:

<li><a href="{path_idx.url}">{path_idx.name}</a> {path_idx.suffix} …</li>

All we’re doing here is populating that field with a tiny marker (✅, ❌).

xmfan · 2025-07-22T18:53:02Z

src/cli.rs

+                        if suffix.is_empty() {
+                            continue;
+                        }
+                        if let Some(num) = art.get("number").and_then(|n| n.as_u64()) {


same here for "number"

number is just the position of the artifact (0, 1, 2…) in the compile. We read it only so we can sort the artifacts chronologically before we build the hit/miss sequence.
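
A minimal sketch of how `suffix` and `number` could combine into a hit/miss sequence, under stated assumptions: `Artifact` is a hypothetical flattened view of one entry in the `artifacts` array (the PR reads these fields from JSON instead), and `cache_sequence` is an invented helper name.

```rust
/// Hypothetical flattened view of one artifact entry (assumption).
pub struct Artifact {
    /// Position of the artifact within the compile (0, 1, 2, ...).
    pub number: u64,
    /// Cache marker such as "✅" or "❌"; empty when there is no marker.
    pub suffix: String,
}

/// Builds the hit/miss sequence: artifacts without a marker are skipped,
/// the rest are ordered by `number`, and their markers are concatenated.
pub fn cache_sequence(mut artifacts: Vec<Artifact>) -> String {
    artifacts.retain(|a| !a.suffix.is_empty());
    artifacts.sort_by_key(|a| a.number);
    artifacts.iter().map(|a| a.suffix.as_str()).collect()
}
```

Sorting before concatenating means two ranks that logged the same artifacts in a different file order still produce the same sequence.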

yushangdi · 2025-07-22T21:41:43Z

src/templates.rs

+    <p><strong>Warning</strong>: Diverging Cache hit/miss patterns detected across ranks. Cache hit/miss pattern groups:</p>
+    <ul>
+        {{ for group in divergence_groups }}
+            <li>Ranks: {group.ranks}</li>


Should we also show the sequence here?

Will implement as a stretch goal in a future PR

remove comment

3c4d52b

facebook-github-bot added the cla signed label Jul 22, 2025

skarjala and others added 2 commits July 21, 2025 17:45

Fix assert

62d08f0

Update cli.rs

02cae05

skarjala changed the title ~~remove comment~~ Add cache hit/miss pattern divergence detection for multi-rank reports on TLParse landing page Jul 22, 2025

skarjala requested review from xmfan, StrongerXi and yushangdi July 22, 2025 05:44

skarjala marked this pull request as ready for review July 22, 2025 05:44

xmfan reviewed Jul 22, 2025

skarjala requested a review from xmfan July 22, 2025 20:27

skarjala added 2 commits July 22, 2025 13:41

take out unneeded clone

1b22904

fix lint

f70e968

yushangdi reviewed Jul 22, 2025

skarjala requested a review from yushangdi July 22, 2025 22:59

yushangdi approved these changes Jul 23, 2025

skarjala merged commit ff32379 into pytorch:main Jul 23, 2025
15 checks passed
