8346989: Deoptimization and re-compilation cycle with C2 compiled code #23916

Draft

@marc-chevalier marc-chevalier commented Mar 5, 2025

Math.*Exact intrinsics can cause many deoptimizations when called repeatedly with arguments that overflow.
This fix proposes to stop relying on the intrinsic once too_many_traps() has been reached.

I've reproduced the issue with slightly simpler code than the one in the bug report:

public class Test {
    final static int N = 5_000_000;

    public static int square(int a) {
        return Math.multiplyExact(a, a);
    }

    public static int test(int i) {
        try {
            return square(i);
        } catch (Throwable e) {
            return 0;
        }
    }

    public static void loop() {
        for (int i = 0; i < N; i++) {
            test(i);
        }
    }

    public static void main(String[] args) {
        loop();
    }
}
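
To see why this loop spends almost all of its time on the overflow path, here is a small sketch that finds the first input whose square no longer fits in an int and derives how many of the N iterations overflow. The class and method names are hypothetical and not part of the PR.

```java
class OverflowCount {
    // Returns the smallest non-negative i for which Math.multiplyExact(i, i) throws.
    static int firstOverflowingInput() {
        for (int i = 0; ; i++) {
            try {
                Math.multiplyExact(i, i);
            } catch (ArithmeticException e) {
                return i;
            }
        }
    }

    // Number of i in [0, n) whose square overflows an int.
    static int overflowingIterations(int n) {
        int first = firstOverflowingInput();
        return Math.max(0, n - first);
    }

    public static void main(String[] args) {
        System.out.println(firstOverflowingInput());          // prints 46341
        System.out.println(overflowingIterations(5_000_000)); // prints 4953659
    }
}
```

With N = 5_000_000, roughly 99% of the calls to test(i) therefore end in an ArithmeticException, which is the situation where the trap in the compiled code is hit again and again.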

And we can indeed reproduce the issue:

# With C1:
$ time ~/jdk/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,"Test*::test*" -XX:-UseOnStackReplacement -XX:TieredStopAtLevel=3 Test.java
~/jdk/build/linux-x64/jdk/bin/java  -XX:-UseOnStackReplacement  Test.java  4,31s user 0,25s system 103% cpu 4,409 total

# With C2:
$ time ~/jdk/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,"Test*::test*" -XX:-UseOnStackReplacement -XX:TieredStopAtLevel=4 Test.java
~/jdk/build/linux-x64/jdk/bin/java  -XX:-UseOnStackReplacement  Test.java  23,27s user 0,33s system 100% cpu 23,409 total

And with this fix, with C1:

Benchmark 1: ~/jdk/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,"Test*::test*" -XX:-UseOnStackReplacement -XX:TieredStopAtLevel=3 Test.java
  Time (mean ± σ):      4.280 s ±  0.074 s    [User: 4.183 s, System: 0.272 s]
  Range (min … max):    4.172 s …  4.398 s    50 runs

and with C2:

Benchmark 1: ~/jdk/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,"Test*::test*" -XX:-UseOnStackReplacement -XX:TieredStopAtLevel=4 Test.java
  Time (mean ± σ):      4.151 s ±  0.069 s    [User: 4.070 s, System: 0.274 s]
  Range (min … max):    4.005 s …  4.313 s    50 runs

As for the benchmark test, before the fix:

Benchmark                 (SIZE)  Mode  Cnt  Score   Error  Units
MultiplyExact.C1.loop    1000000  avgt    3  0.592 ± 0.052   s/op
MultiplyExact.C1_1.loop  1000000  avgt    3  0.593 ± 0.136   s/op
MultiplyExact.C1_2.loop  1000000  avgt    3  0.598 ± 0.108   s/op
MultiplyExact.C1_3.loop  1000000  avgt    3  0.602 ± 0.071   s/op
MultiplyExact.C2.loop    1000000  avgt    3  4.133 ± 0.649   s/op

and with the fix:

Benchmark                 (SIZE)  Mode  Cnt  Score   Error  Units
MultiplyExact.C1.loop    1000000  avgt    3  0.590 ± 0.133   s/op
MultiplyExact.C1_1.loop  1000000  avgt    3  0.589 ± 0.114   s/op
MultiplyExact.C1_2.loop  1000000  avgt    3  0.590 ± 0.147   s/op
MultiplyExact.C1_3.loop  1000000  avgt    3  0.617 ± 0.010   s/op
MultiplyExact.C2.loop    1000000  avgt    3  0.543 ± 0.086   s/op

Is it worth having intrinsics at all? @eme64 wondered, so I tried with this code:

public class Test {
    final static int N = 500_000_000;

    public static int test(int i) {
        try {
            return Math.multiplyExact(i, i);
        } catch (Throwable e) {
            return 0;
        }
    }

    public static void loop() {
        for (int i = 0; i < N; i++) {
            test(i % 32_768);
        }
    }

    public static void main(String[] args) {
        loop();
    }
}

No intrinsic (inlined Java implementation):

Benchmark 1: ~/jdk/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,"Test*::test*" -XX:-UseOnStackReplacement Test.java
  Time (mean ± σ):      8.651 s ±  0.902 s    [User: 8.517 s, System: 0.155 s]
  Range (min … max):    6.853 s … 10.439 s    50 runs

Always intrinsic (current behavior, and new behavior in the absence of overflow, as in this example):

Benchmark 1: ~/jdk/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,"Test*::test*" -XX:-UseOnStackReplacement Test.java
  Time (mean ± σ):      8.222 s ±  1.024 s    [User: 8.090 s, System: 0.155 s]
  Range (min … max):    6.667 s … 10.406 s    50 runs

So the result is not very conclusive, but the intrinsic is likely to be a bit useful. The gap between the means is about 0.4 s, which is less than half of either standard deviation.
Still, it seems good to have.
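
For reference, the "pure Java" side of this comparison can be pictured as a widen-to-long overflow check. The sketch below assumes that shape; multiplyExactPlain is a hypothetical name, and the code is only meant to illustrate the kind of work the intrinsic replaces, not to reproduce the library source.

```java
class ExactSketch {
    // Multiply in 64 bits, then check that the result survives narrowing to int.
    static int multiplyExactPlain(int x, int y) {
        long r = (long) x * (long) y;
        if ((int) r != r) {
            throw new ArithmeticException("integer overflow");
        }
        return (int) r;
    }

    public static void main(String[] args) {
        // The benchmark only passes i % 32_768, so the largest argument is 32_767
        // and its square still fits in an int: no call in that loop overflows.
        System.out.println(multiplyExactPlain(32_767, 32_767)); // prints 1073676289

        // The sketch agrees with Math.multiplyExact on every input used by the loop.
        for (int i = 0; i < 32_768; i++) {
            if (multiplyExactPlain(i, i) != Math.multiplyExact(i, i)) {
                throw new AssertionError("mismatch at " + i);
            }
        }
    }
}
```

Since no iteration overflows here, both runs measure only the fast path, which is consistent with the small gap between the two timings.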

Thanks,
Marc


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8346989: Deoptimization and re-compilation cycle with C2 compiled code (Bug - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/23916/head:pull/23916
$ git checkout pull/23916

Update a local copy of the PR:
$ git checkout pull/23916
$ git pull https://git.openjdk.org/jdk.git pull/23916/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 23916

View PR using the GUI difftool:
$ git pr show -t 23916

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/23916.diff

@marc-chevalier marc-chevalier changed the title from "Limit inlining of math Exact operations in case of too many deopts" to "8346989: Deoptimization and re-compilation cycle with C2 compiled code" on Mar 5, 2025

bridgekeeper bot commented Mar 5, 2025

👋 Welcome back marc-chevalier! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk bot commented Mar 5, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

openjdk bot commented Mar 5, 2025

@marc-chevalier The following labels will be automatically applied to this pull request:

  • graal
  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@eme64 eme64 left a comment

The benchmark generally looks good to me, I only have some minor suggestions ;)

Comment on lines +61 to +62
@Fork(value = 1)
public static class C2 extends MultiplyExact {}

What about a C2 version where you just disable the intrinsic?

    public int test(int i) {
        try {
            return square(i);
        } catch (Throwable e) {

Can you catch a more specific exception? Catching very general exceptions can often mask other bugs. I suppose this is only a benchmark, but it would still be good practice ;)
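
A minimal sketch of what the narrower catch could look like, assuming the only exception of interest is the ArithmeticException that Math.multiplyExact throws on overflow. The class name is hypothetical; the method names mirror the reproducer.

```java
class SpecificCatch {
    static int square(int a) {
        return Math.multiplyExact(a, a);
    }

    // Same shape as the benchmark's test method, but only overflow is swallowed;
    // any other Throwable would still propagate and stay visible.
    static int test(int i) {
        try {
            return square(i);
        } catch (ArithmeticException e) {
            return 0;
        }
    }

    public static void main(String[] args) {
        System.out.println(test(3));      // prints 9
        System.out.println(test(50_000)); // prints 0, because 50_000 * 50_000 overflows int
    }
}
```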

eme64 commented Mar 6, 2025

Is it worth inlining at all? @eme64 wondered, so I tried with this code:

You ask this in the PR description. I think I was not thinking about inlining but rather using the intrinsic. How much speedup does the intrinsic really deliver? Is it really better than pure Java?

eme64 commented Mar 6, 2025

Ah. And is this only about multiplyExact, or are there other methods affected? Would be nice to extend the benchmark to those as well.

And yet another idea: you could probably write an IR test that checks that we at first have the compilation with the trap, and another test where we trap too much and then get a different compilation (without the intrinsic?).

Plus: the issue title is very generic. I think it should mention something about Math.*Exact as well ;)
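
As a starting point for extending the benchmark to other methods, the sketch below lists several int-valued Math.*Exact methods together with an argument that makes each one overflow. It only shows that they all throw ArithmeticException; whether each one shows the same deoptimization cycle is not established here. The class and helper names are hypothetical.

```java
import java.util.function.IntSupplier;

class ExactFamily {
    // Returns true if evaluating op throws ArithmeticException.
    static boolean overflows(IntSupplier op) {
        try {
            op.getAsInt();
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(overflows(() -> Math.addExact(Integer.MAX_VALUE, 1)));
        System.out.println(overflows(() -> Math.subtractExact(Integer.MIN_VALUE, 1)));
        System.out.println(overflows(() -> Math.multiplyExact(Integer.MAX_VALUE, 2)));
        System.out.println(overflows(() -> Math.incrementExact(Integer.MAX_VALUE)));
        System.out.println(overflows(() -> Math.decrementExact(Integer.MIN_VALUE)));
        System.out.println(overflows(() -> Math.negateExact(Integer.MIN_VALUE)));
    }
}
```

Each line prints true. The same methods also exist with long arguments, so a fuller benchmark would likely want both widths.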

marc-chevalier commented Mar 6, 2025

You ask this in the PR description. I think I was not thinking about inlining but rather using the intrinsic. How much speedup does the intrinsic really deliver? Is it really better than pure Java?

My fault. I used "inline" instead of "intrinsic" because the functions implementing the intrinsic are called inline_math_mathExact and the like. So, I compared the intrinsic with the pure Java implementation, which happens to be inlined. And the intrinsic is a bit better.

I'll edit the text to fix that.
