-
Notifications
You must be signed in to change notification settings - Fork 17
Add cache hit/miss pattern divergence detection for multi-rank reports on TLParse landing page #126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
src/cli.rs
Outdated
rank_metadata | ||
.iter() | ||
.fold(FxHashMap::default(), |mut acc, md| { | ||
acc.entry(md.cache_sequence.clone()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
curious, why the clone here. rank_metadata seems unused after this. is it due to .fold?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually by changing iter() to into_iter() we consume the rank_metadata vector and get owned RankMetaData items, so we can move md.cache_sequence into the HashMap without cloning. You're right, since rank_metadata isn't needed after this point, consuming it makes more sense. Fixed.
if key != "unknown" && !key.starts_with("unknown_") { | ||
compile_ids.insert(key.clone()); | ||
} | ||
if let Some(arr) = val.get("artifacts").and_then(|v| v.as_array()) { | ||
for art in arr { | ||
let suffix = art.get("suffix").and_then(|s| s.as_str()).unwrap_or(""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you explain what is the value of "suffix" here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suffix
wasn’t introduced in this PR, it’s an existing field on OutputFile
(in types.rs), already used by the HTML template:
<li><a href="{path_idx.url}">{path_idx.name}</a> {path_idx.suffix} …</li>
All we’re doing here is populating that field with a tiny marker (✅
, ❌
).
if suffix.is_empty() { | ||
continue; | ||
} | ||
if let Some(num) = art.get("number").and_then(|n| n.as_u64()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here for "number"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
number is just the position of the artifact (0, 1, 2…) in the compile. We read it only so we can sort the artifacts chronologically before we build the hit/miss sequence.
<p><strong>Warning</strong>: Diverging Cache hit/miss patterns detected across ranks. Cache hit/miss pattern groups:</p> | ||
<ul> | ||
{{ for group in divergence_groups }} | ||
<li>Ranks: {group.ranks}</li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we also shows the sequence here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will implement as a stretch goal in a future PR
Summary:
Implemented detection and warning message of diverging cache hit/miss patterns across ranks in multi-rank HTML reports, helping identify distributed execution issues caused by inconsistent caching behavior.
compile_directory.json
artifacts and build per-rank cache sequencesTesting:
Added comprehensive test coverage:
Minor Update to existing tests to account for extra log added to multi_rank_logs dir.
Ex:
