Add heuristic to order harness codegen #4257
Conversation
Let's get #4236 merged first so that this PR can get concrete data points in its PR description.

#4236 has now been merged, so this one should be ready to be acted upon.
```rust
self.map(|mut harnesses| {
    reorder_harnesses::<H>(&mut harnesses);
    harnesses
})
```
Suggested change:

```diff
-})
+}).collect::<Vec<_>>().into_iter()
```
I honestly could not tell you why, but testing shows that forcing all of the ordering to happen up front with a `collect`, rather than lazily with a `map`, takes this change from slightly worse to significantly better.
I need to look into why this is, as collecting seems to sneakily guzzle memory (peak goes from a normal ~2 GB to ~6 GB). I think it's because it requires one transformer per codegen unit to be kept in memory. If this change is doable without that drawback it may be worthwhile; if not, it should definitely be dropped.
The cause of the memory overhead is that, now that we do reachability in a single pass up front, we're inadvertently storing the reachable call tree for every harness while we figure out which order to codegen them in. Definitely not ideal, as this takes up a ton of space :(. I'll look into how to fix this and whether other heuristics could help.
Thanks! Is it possible to add a test that ensures the order is what we expect it to be?
Kani currently has a special `handle_quantifiers()` pass after codegen, in which we recursively inline function calls inside quantifier expressions as needed by our hooks. This has to be done for every value in the symbol table, but our current implementation implicitly clones **all `Stmt` type values**, even if they don't contain any quantifiers and wouldn't need to change. This makes the pass take up **~7% of codegen on crates that don't even use quantifiers**. Now, the `handle_quantifiers_in_stmt()` function returns an `Option` with a new `Stmt` if inlining was needed, or `None` otherwise, allowing us to avoid any allocations (and even the need to insert back into the table) in the common case where a `Stmt` doesn't contain any quantifiers.

### Results

Flamegraphing shows that this change reduces the `handle_quantifiers()` pass from 7.1% to just 0.8% of codegen for the `s2n-codec` crate. There was a much less significant change on larger codebases like the standard library, which only saw **a <1% improvement in end-to-end compile time** (at std commit [177d0fd](model-checking/verify-rust-std@177d0fd), assuming #4257 & #4259 are merged into Kani). That said, there was still a significant **1-7% decrease in codegen time** for individual crates of the standard library, so it seems to make a difference even when not reflected in e2e time.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 and MIT licenses.
Internally, Kani uses the `call_with_panic_debug_info()` function to call closures alongside a string that explains what they're trying to do, allowing better error messages if they internally panic. However, this function is currently called for every harness we codegen and has to pretty-format the harness's function instance--something that calls into rustc internals and takes a non-negligible amount of time. We should instead accept a closure that can generate the debug info, but only when (and if) it's needed.

### Results

Based on initial flamegraphing, the formatting passed to `call_with_panic_debug_info()` went from taking 2.3% of codegen on `s2n-codec` to not being called a single time (since there are no compiler errors in a typical execution). This change made a solid **4.6% reduction in the end-to-end compile time of the standard library** (at std commit [177d0fd](model-checking/verify-rust-std@177d0fd); assuming #4257, #4259 & #4268 are already merged into Kani).

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 and MIT licenses.
Currently, we have to create a new `BodyTransformer` for each harness individually. However, this takes a lot of time (as we have to constantly re-validate all our Kani functions and reinitialize everything based on the `QueryDb`), and the transformer's options are all shared between harnesses of the same codegen unit anyway. This PR makes the `BodyTransformer` struct `Clone`-able, allowing us to initialize a single 'template' transformer for each codegen unit and then cheaply clone the template for each harness within the unit. Based on testing, the clone takes just ~3µs when using the default number of transformation passes, which is much faster than initialization.

### Results

This change made a noticeable **4.8% reduction in the end-to-end compile time of the standard library** (at std commit [177d0fd](model-checking/verify-rust-std@177d0fd) & assuming #4257 is merged into Kani).

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 and MIT licenses.
Motivation
When compiling with more than a single thread (as in #4236), the order in which we codegen harnesses can have a non-negligible impact on performance. Specifically, if we handle a harness that will generate a lot of code near the end of compilation, the main compiler thread can get stuck waiting for worker threads to export that code, slowing down overall compilation. If we could ensure that more difficult-to-export harnesses are handled first, the thread pool would have more time to write them to disk off the critical path of compilation.
Approach
This PR switches reachability analysis for all harnesses to happen before codegen. It can then use the information gathered from reachability to reorder codegen based on the total number of items reachable from each harness (`MostReachableItems`). This heuristic was chosen because it is:

- cheap: computing the ordering for a `CodegenUnit` of 66 harnesses had acceptable overhead, given it took ~20ms to codegen the fastest harnesses in that same `CodegenUnit`
- accurate: on the `s2n-codec` benchmark it only ever places a slower-to-export harness after a faster one if their actual export times were within 10ms (or 7%). Since the goal is just to roughly order harnesses by order of magnitude, this is a very good result 🎉

Results
This change made a small, but not insignificant, 2.9% reduction in the end-to-end compile time of the standard library (at std commit 177d0fd).

Below is a recreation of the graph of the compiler's per-thread CPU usage from #4248, now that codegen is ordered.

While functionally complete on its own, this change will not really make a performance impact until #4236 is merged.
Resolves #4248
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 and MIT licenses.