Skip to content

Add cache hit/miss pattern divergence detection for multi-rank reports on TLParse landing page #126

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jul 23, 2025

Conversation

skarjala
Copy link
Contributor

@skarjala skarjala commented Jul 22, 2025

Summary:
Implemented detection and warning message of diverging cache hit/miss patterns across ranks in multi-rank HTML reports, helping identify distributed execution issues caused by inconsistent caching behavior.

  • Cache sequence tracking: Extract cache hit/miss patterns from compile_directory.json artifacts and build per-rank cache sequences
  • Divergence detection: Group ranks by their cache sequences and identify when patterns differ across ranks
  • Warning display: Show cache divergence warnings in the multi-rank landing page alongside existing compile ID divergence warnings
  • Rank grouping: Display which ranks share the same cache patterns to help isolate problematic ranks

Testing:
Added comprehensive test coverage:

  • Basic divergence detection between two ranks with different patterns
  • No false positives when all ranks have identical patterns
  • Multiple divergent groups with complex rank distributions

Minor Update to existing tests to account for extra log added to multi_rank_logs dir.

Ex:
image

@skarjala skarjala changed the title remove comment Add cache hit/miss pattern divergence detection for multi-rank reports on TLParse landing page Jul 22, 2025
@skarjala skarjala marked this pull request as ready for review July 22, 2025 05:44
src/cli.rs Outdated
rank_metadata
.iter()
.fold(FxHashMap::default(), |mut acc, md| {
acc.entry(md.cache_sequence.clone())
Copy link
Member

@xmfan xmfan Jul 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious, why the clone here. rank_metadata seems unused after this. is it due to .fold?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually by changing iter() to into_iter() we consume the rank_metadata vector and get owned RankMetaData items, so we can move md.cache_sequence into the HashMap without cloning. You're right, since rank_metadata isn't needed after this point, consuming it makes more sense. Fixed.

if key != "unknown" && !key.starts_with("unknown_") {
compile_ids.insert(key.clone());
}
if let Some(arr) = val.get("artifacts").and_then(|v| v.as_array()) {
for art in arr {
let suffix = art.get("suffix").and_then(|s| s.as_str()).unwrap_or("");
Copy link
Member

@xmfan xmfan Jul 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you explain what is the value of "suffix" here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suffix wasn’t introduced in this PR, it’s an existing field on OutputFile (in types.rs), already used by the HTML template:

<li><a href="{path_idx.url}">{path_idx.name}</a> {path_idx.suffix} …</li>

All we’re doing here is populating that field with a tiny marker (, ).

if suffix.is_empty() {
continue;
}
if let Some(num) = art.get("number").and_then(|n| n.as_u64()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here for "number"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

number is just the position of the artifact (0, 1, 2…) in the compile. We read it only so we can sort the artifacts chronologically before we build the hit/miss sequence.

@skarjala skarjala requested a review from xmfan July 22, 2025 20:27
<p><strong>Warning</strong>: Diverging Cache hit/miss patterns detected across ranks. Cache hit/miss pattern groups:</p>
<ul>
{{ for group in divergence_groups }}
<li>Ranks: {group.ranks}</li>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also shows the sequence here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will implement as a stretch goal in a future PR

@skarjala skarjala requested a review from yushangdi July 22, 2025 22:59
@skarjala skarjala merged commit ff32379 into pytorch:main Jul 23, 2025
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants