
Conversation

@AlexanderPortland (Contributor) commented Jul 31, 2025

Motivation

When compiling with more than a single thread (as in #4236), the order in which we codegen harnesses can have a non-negligible impact on performance. Specifically, if we handle a harness that will generate a lot of code near the end of compilation, the main compiler thread can get stuck waiting for worker threads to export that code, slowing down overall compilation. If we instead ensure that harder-to-export harnesses are handled first, the thread pool has more time to write them to disk off the critical path of compilation.

Approach

This PR switches reachability analysis for all harnesses to happen before codegen. It can then use the information we gather from reachability to reorder codegen based on the total number of items reachable from each harness (MostReachableItems).

This heuristic was chosen because it is:

  • simple -- only requiring a single field access for each harness, and taking only ~10µs to fully sort a typical CodegenUnit of 66 harnesses (acceptable given it took ~20ms to codegen the fastest harnesses in that same CodegenUnit)
  • seemingly effective -- on the s2n-codec benchmark it only ever places a slower-to-export harness after a faster one if their actual export times were within 10ms (or 7%). Since the goal is just to roughly order harnesses based on order of magnitude, this is a very good result 🎉

Results

This change made a small but not insignificant 2.9% reduction in the end-to-end compile time of the standard library (@ commit 177d0fd).

Below is a recreation of the compiler's per-thread CPU usage graph from #4248, now that codegen is ordered.

(Screenshot: per-thread CPU usage graph, 2025-07-31)

While functionally complete on its own, this change will not really make a performance impact until #4236 is merged.

Resolves #4248

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 and MIT licenses.

@AlexanderPortland AlexanderPortland requested a review from a team as a code owner July 31, 2025 19:08
@github-actions github-actions bot added Z-EndToEndBenchCI Tag a PR to run benchmark CI Z-CompilerBenchCI Tag a PR to run benchmark CI labels Jul 31, 2025
@tautschnig (Member) commented:
Let's get #4236 merged first so that this PR can get concrete data points in its PR description.

@tautschnig (Member) commented:
#4236 has now been merged, so this one should be ready to be acted upon.

```rust
self.map(|mut harnesses| {
    reorder_harnesses::<H>(&mut harnesses);
    harnesses
})
```
@AlexanderPortland (Contributor, Author) commented:

Suggested change:

```diff
-})
+}).collect::<Vec<_>>().into_iter()
```

I could not tell you why tbh, but testing shows that forcing all of the ordering to happen up front with a collect, rather than lazily with a map, takes this change from slightly worse to significantly better.

@AlexanderPortland (Contributor, Author) commented:

I need to look into why this is, as collecting seems to sneakily guzzle memory (peak goes from a normal ~2 GB to ~6 GB). I think it's because it requires one transformer per codegen unit to be kept in memory? If this change is doable without that drawback it may be worthwhile, but if not, it should totally be dropped.

@AlexanderPortland (Contributor, Author) commented:

The cause of the memory overhead is that, now that we are doing reachability in a single pass first, we're inadvertently storing the reachable call tree for every harness while we figure out which order to codegen them in. Definitely not ideal, as this takes up a ton of space :(. I'll look into how to fix this and whether other heuristics could help.

@zhassan-aws (Contributor) left a comment:

Thanks! Is it possible to add a test that ensures the order is what we expect it to be?

github-merge-queue bot pushed a commit that referenced this pull request Aug 6, 2025
Kani currently has a special `handle_quantifiers()` pass after codegen that recursively inlines function calls inside quantifier expressions, as needed by our hooks. This has to be done for every value in the symbol table, but our current implementation implicitly clones **all `Stmt` type values**, even if they don't contain any quantifiers and wouldn't need to change. This makes the pass currently take up **~7% of codegen on crates that don't even use quantifiers**.

Now, the `handle_quantifiers_in_stmt()` function just returns an option with a new `Stmt` if inlining was needed, or `None` otherwise, allowing us to avoid any allocations -- and even the need to insert back into the table -- in the common case where a `Stmt` doesn't contain any quantifiers.

### Results
Flamegraphing shows that this change reduces the `handle_quantifiers()`
pass from 7.1% to just 0.8% of codegen for the `s2n-codec` crate.

There was a much less significant change on larger codebases like the
standard library, which only saw **a <1% improvement in end to end
compile time** (at std commit
[177d0fd](model-checking/verify-rust-std@177d0fd),
assuming #4257 & #4259 are merged into Kani). That being said, there was
a significant **1-7% decrease in codegen time** for individual crates of
the standard library, so it seems to still make a difference, even if
not reflected in e2e time.

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 and MIT licenses.
github-merge-queue bot pushed a commit that referenced this pull request Aug 6, 2025
Internally, Kani uses the `call_with_panic_debug_info()` function to
call closures with a string that explains what they're trying to do,
allowing better error messages if they internally panic. However, this
function is currently called for every harness we codegen and has to
pretty-format the harness's function instance--something that calls into
rustc internals and takes a non-negligible amount of time.

We should instead accept a closure that can generate debug info, but
only when (and if) it's needed.

### Results
Based on initial flamegraphing, the formatting passed to
`call_with_panic_debug_info()` went from taking 2.3% of codegen on
`s2n-codec` to not being called a single time (since there are no
compiler errors in a typical execution).

This change made a solid **4.6% reduction in the end to end compile time
of the standard library** (at std commit
[177d0fd](model-checking/verify-rust-std@177d0fd);
assuming #4257, #4259 & #4268 are already merged into Kani).

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 and MIT licenses.
github-merge-queue bot pushed a commit that referenced this pull request Aug 6, 2025
Currently, we have to create a new `BodyTransformer` for each harness
individually. However, this takes a lot of time (as we have to
constantly re-validate all our Kani functions and reinitialize
everything based on the `QueryDb`) and the transformer's options are all
shared between harnesses of the same codegen unit anyway.

This PR makes the `BodyTransformer` struct `Clone`-able, allowing us to
initialize a single 'template' transformer for each codegen unit and
then cheaply clone the template for each harness within the unit. Based
on testing, the clone takes just ~3µs when using the default number of
transformation passes, which is much faster than initialization.

### Results
This change made a noticeable **4.8% reduction in the end to end compile
time of the standard library** (at std commit
[177d0fd](model-checking/verify-rust-std@177d0fd)
& assuming #4257 is merged into Kani).

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 and MIT licenses.
Successfully merging this pull request may close these issues.

Add heuristic for ordering harness codegen