-
Notifications
You must be signed in to change notification settings - Fork 6.1k
8346989: C2: deoptimization and re-execution cycle with Math.*Exact in case of frequent overflow #23916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
👋 Welcome back marc-chevalier! A progress list of the required criteria for merging this PR into |
@marc-chevalier This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 86 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@eme64, @iwanowww, @TobiHartmann) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type |
@marc-chevalier The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The benchmark generally looks good to me, I only have some minor suggestions ;)
You ask this in the PR description. I think I was not thinking about |
Ah. And is this only about And yet another idea: you could probably write an IR test that checks that we at first have the compilation with the trap, and another test where we trap too much and then get a different compilation (without the intrinsic?). Plus: the issue title is very generic. I think it should mention something about |
My fault. I used "inline" instead of "intrinsic" because the functions implementing the intrinsic are called I'll edit the text to fix that. |
Webrevs
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice benchmark, Marc!
@@ -1961,7 +1961,7 @@ void LibraryCallKit::inline_math_mathExact(Node* math, Node *test) { | |||
set_i_o(i_o()); | |||
|
|||
uncommon_trap(Deoptimization::Reason_intrinsic, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about using builtin_throw
here? (Requires some tuning on builtin_throw
side.) How much does it affect performance? Also, passing must_throw = true
into uncommon_trap
may help a bit here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using builtin_throw
sounds nice! But indeed, it won't work so directly. I want to prevent intrinsic in case of too_many_traps
. But that's only when builtin_throw
will do something. But if I only rely on builtin_throw
, then, when the built-in throwing is not possible (that is when treat_throw_as_hot && method()->can_omit_stack_trace()
is false), we will have the repeated deopt again.
There is also throwing the right exception, which is right now determined only by the reason (which adapts poorly to this case).
I guess that's what you meant by tuning: be able to know if we would built-in throw, and if so, do it, otherwise, prevent infinitely repeated deopt.
The way I see doing that is by (maybe optionally) providing the preallocated exception to throw as a parameter so that we don't have to rely on the "reason to exception" decision (or we can override it), and factor out the decision whether we can take the nice branch of builtin_throw
so that we can bail out of intrinsic if we can't fast throw before we start setting up the intrinsic (that we would then need to undo). Does that match what you had in mind or you have another suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adapting and re-using builtin_throw
like you described is reasonable but I let @iwanowww confirm 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's basically what I had in mind.
Currently, the focus of the intrinsic is on well-behaved case (overflows are very rare). builtin_throw()
covers more ground and optimize for scenarios when exceptions are thrown. But it depends on ciMethod::can_omit_stack_trace()
where -XX:-OmitStackTraceInFastThrow
mode will suffer from the original problem (continuous deoptimizations), plus a round of recompilations before giving up.
I suggest to improve and reuse builtin_throw
here and add additional checks in the intrinsic to guard against problematic scenario with continuous deoptimizations. IMO it improves performance model for a wide range of use cases while addressing pathological scenarios.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I have done something like that (getting the exception object to throw from parameter, and factor out the logic whether builtin_throw is possible, so we can bailout of intrinsics instead of cycling again). Test seem to pass in the various cases I wrote. As for benchmark, it's quite a change. I post only the new part, the rest is pretty much the same. C2_no_builtin_throw does what the original C2 was (no builtin throw, just bailing out of intrinsics to cut our losses), and new C2 is with builtin_throw. tldr: builtin_throw makes the overflow case of the same order as the in-bound cases (1-4ms) instead of being about 100 times bigger (600-700ms with C1, C2 without intrinsics, C2 with bailing out).
MathExact.C2.loopAddIInBounds 1000000 avgt 3 1.657 ± 11.994 ms/op
MathExact.C2.loopAddIOverflow 1000000 avgt 3 1.313 ± 4.188 ms/op
MathExact.C2.loopAddLInBounds 1000000 avgt 3 0.980 ± 0.396 ms/op
MathExact.C2.loopAddLOverflow 1000000 avgt 3 2.474 ± 3.473 ms/op
MathExact.C2.loopDecrementIInBounds 1000000 avgt 3 3.733 ± 13.709 ms/op
MathExact.C2.loopDecrementIOverflow 1000000 avgt 3 2.792 ± 23.724 ms/op
MathExact.C2.loopDecrementLInBounds 1000000 avgt 3 2.761 ± 24.744 ms/op
MathExact.C2.loopDecrementLOverflow 1000000 avgt 3 2.730 ± 23.065 ms/op
MathExact.C2.loopIncrementIInBounds 1000000 avgt 3 3.134 ± 20.980 ms/op
MathExact.C2.loopIncrementIOverflow 1000000 avgt 3 3.271 ± 8.876 ms/op
MathExact.C2.loopIncrementLInBounds 1000000 avgt 3 2.756 ± 22.912 ms/op
MathExact.C2.loopIncrementLOverflow 1000000 avgt 3 4.549 ± 9.543 ms/op
MathExact.C2.loopMultiplyIInBounds 1000000 avgt 3 1.268 ± 0.574 ms/op
MathExact.C2.loopMultiplyIOverflow 1000000 avgt 3 1.572 ± 11.171 ms/op
MathExact.C2.loopMultiplyLInBounds 1000000 avgt 3 1.021 ± 1.054 ms/op
MathExact.C2.loopMultiplyLOverflow 1000000 avgt 3 3.167 ± 20.666 ms/op
MathExact.C2.loopNegateIInBounds 1000000 avgt 3 3.575 ± 29.997 ms/op
MathExact.C2.loopNegateIOverflow 1000000 avgt 3 4.222 ± 9.041 ms/op
MathExact.C2.loopNegateLInBounds 1000000 avgt 3 4.452 ± 6.680 ms/op
MathExact.C2.loopNegateLOverflow 1000000 avgt 3 4.739 ± 34.662 ms/op
MathExact.C2.loopSubtractIInBounds 1000000 avgt 3 1.087 ± 0.539 ms/op
MathExact.C2.loopSubtractIOverflow 1000000 avgt 3 3.027 ± 9.709 ms/op
MathExact.C2.loopSubtractLInBounds 1000000 avgt 3 1.197 ± 5.763 ms/op
MathExact.C2.loopSubtractLOverflow 1000000 avgt 3 1.765 ± 10.037 ms/op
MathExact.C2_no_builtin_throw.loopAddIInBounds 1000000 avgt 3 2.310 ± 2.990 ms/op
MathExact.C2_no_builtin_throw.loopAddIOverflow 1000000 avgt 3 594.036 ± 500.000 ms/op
MathExact.C2_no_builtin_throw.loopAddLInBounds 1000000 avgt 3 1.577 ± 14.053 ms/op
MathExact.C2_no_builtin_throw.loopAddLOverflow 1000000 avgt 3 631.345 ± 75.836 ms/op
MathExact.C2_no_builtin_throw.loopDecrementIInBounds 1000000 avgt 3 2.090 ± 0.937 ms/op
MathExact.C2_no_builtin_throw.loopDecrementIOverflow 1000000 avgt 3 618.080 ± 38.047 ms/op
MathExact.C2_no_builtin_throw.loopDecrementLInBounds 1000000 avgt 3 4.164 ± 6.184 ms/op
MathExact.C2_no_builtin_throw.loopDecrementLOverflow 1000000 avgt 3 596.031 ± 584.159 ms/op
MathExact.C2_no_builtin_throw.loopIncrementIInBounds 1000000 avgt 3 2.383 ± 11.729 ms/op
MathExact.C2_no_builtin_throw.loopIncrementIOverflow 1000000 avgt 3 626.425 ± 134.612 ms/op
MathExact.C2_no_builtin_throw.loopIncrementLInBounds 1000000 avgt 3 2.345 ± 13.927 ms/op
MathExact.C2_no_builtin_throw.loopIncrementLOverflow 1000000 avgt 3 630.535 ± 99.348 ms/op
MathExact.C2_no_builtin_throw.loopMultiplyIInBounds 1000000 avgt 3 1.419 ± 4.289 ms/op
MathExact.C2_no_builtin_throw.loopMultiplyIOverflow 1000000 avgt 3 587.796 ± 52.215 ms/op
MathExact.C2_no_builtin_throw.loopMultiplyLInBounds 1000000 avgt 3 0.934 ± 0.272 ms/op
MathExact.C2_no_builtin_throw.loopMultiplyLOverflow 1000000 avgt 3 589.736 ± 347.848 ms/op
MathExact.C2_no_builtin_throw.loopNegateIInBounds 1000000 avgt 3 2.236 ± 5.749 ms/op
MathExact.C2_no_builtin_throw.loopNegateIOverflow 1000000 avgt 3 618.711 ± 725.158 ms/op
MathExact.C2_no_builtin_throw.loopNegateLInBounds 1000000 avgt 3 2.605 ± 17.373 ms/op
MathExact.C2_no_builtin_throw.loopNegateLOverflow 1000000 avgt 3 627.055 ± 184.767 ms/op
MathExact.C2_no_builtin_throw.loopSubtractIInBounds 1000000 avgt 3 1.006 ± 0.584 ms/op
MathExact.C2_no_builtin_throw.loopSubtractIOverflow 1000000 avgt 3 588.062 ± 403.116 ms/op
MathExact.C2_no_builtin_throw.loopSubtractLInBounds 1000000 avgt 3 0.978 ± 0.193 ms/op
MathExact.C2_no_builtin_throw.loopSubtractLOverflow 1000000 avgt 3 611.004 ± 456.779 ms/op
@iwanowww Are you still reviewing or should I have a look? |
…e-with-C2-compiled-code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Marc.
It looks a bit too convoluted to me. IMO an unconditional call to builtin_throw
, plus too_many_traps
check should do the job. Do I miss something important here?
src/hotspot/share/opto/graphKit.hpp
Outdated
@@ -275,7 +275,10 @@ class GraphKit : public Phase { | |||
|
|||
// Helper to throw a built-in exception. | |||
// The JVMS must allow the bytecode to be re-executed via an uncommon trap. | |||
void builtin_throw(Deoptimization::DeoptReason reason); | |||
// If `exception_object` is nullptr, the exception to throw will be guessed based on `reason` | |||
void builtin_throw(Deoptimization::DeoptReason reason, ciInstance* exception_object = nullptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, introduce a new overload instead.
I suggest to extract Deoptimization::DeoptReason -> ciInstance mapping into a helper method and turn void builtin_throw(Deoptimization::DeoptReason reason)
into a wrapper:
void GraphKit::builtin_throw(Deoptimization::DeoptReason reason) {
builtin_throw(reason, exception_on_deopt(reason));
}
uncommon_trap(Deoptimization::Reason_intrinsic, | ||
Deoptimization::Action_none); | ||
if (use_builtin_throw) { | ||
builtin_throw(Deoptimization::Reason_intrinsic, env()->ArithmeticException_instance()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to unconditionally call builtin_throw()
. It should handle uncommon_trap
case as well.
What makes sense is to ensure that builtin_throw()
doesn't change deoptimization reason. It can be implemented with an extra argument to new GraphKit::builtin_throw
overload (e.g., bool allow_deopt_reason_none
).
// If builtin_throw would work (notably, the throw is hot and we don't care about backtraces), | ||
// instead of bailing out on intrinsic or potentially deopting, let's do that! | ||
use_builtin_throw = true; | ||
} else if (too_many_traps(Deoptimization::Reason_intrinsic)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why too_many_traps(Deoptimization::Reason_intrinsic)
check is not enough here?
Actually, yes, there is a reason I've made it so weird (and I agree it's pretty convoluted). jdk/src/hotspot/share/opto/graphKit.cpp Lines 540 to 555 in 59629f8
If treat_throw_as_hot is false (so before too many traps) it just ends up as a uncommon_trap with Action_maybe_recompile action. That is fine at first. But later, we would like builtin_throw to do its job, but it can only do if ifjdk/src/hotspot/share/opto/graphKit.cpp Line 563 in 59629f8
which is not too_many_traps(reason) . Which means that:
In other words, we need Some of your suggestions are still relevant tho! I'll apply them. |
@marc-chevalier this pull request can not be integrated into git checkout fix/Deoptimization-and-re-compilation-cycle-with-C2-compiled-code
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push |
…e-with-C2-compiled-code
I've applied the suggested refactoring. It looks fine to me, tests seems happy, microbench shows similar profile. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took me a while to parse the code but the refactoring definitely improves the situation 🙂 Looks good!
I've made the test flags tighter as discussed offline. I'll need a fresh approval. And for completeness, there are the bench result on this last state. We can see that things behave as we expect: builtin_throw is taken and making the situation a lot better. When intrinsics or builtin_throw are disabled, we see C1-like perfs.
|
Great, thank you! |
/integrate Thanks @iwanowww and @TobiHartmann! |
@marc-chevalier |
/sponsor |
Going to push as commit 97ed536.
Your commit was automatically rebased without conflicts. |
@TobiHartmann @marc-chevalier Pushed as commit 97ed536. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
Math.*Exact
intrinsics can cause many deopt when used repeatedly with problematic arguments.This fix proposes not to rely on intrinsics after
too_many_traps()
has been reached.Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all.
tl;dr:
Before the fix:
After:
Is it worth having intrinsics at all? @eme64 wondered, so I tried with this code:
and with much more runs (50 instead of 3), and in a more stable load for the rest of the system.
No intrinsic (inlined Java implem):
Always intrinsic (current behavior, and new behavior in absence of overflow, like in this example):
So it's... not very conclusive, but likely to be a bit useful. The gap between the means is about 0.4s, which is less than half the standard deviation.
Still, it seems good to have.
From a more theoretical point of view, we can see that the code generated for the instrinsics is mostly a
mul
and ajo
, while it is much more complicated for inlined java (with manymov
,movsx
,cmp
and conditional jumps, looking a lot like the Java code).Thanks,
Marc
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/23916/head:pull/23916
$ git checkout pull/23916
Update a local copy of the PR:
$ git checkout pull/23916
$ git pull https://git.openjdk.org/jdk.git pull/23916/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 23916
View PR using the GUI difftool:
$ git pr show -t 23916
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/23916.diff
Using Webrev
Link to Webrev Comment