## How to run a specific JMH benchmark with async-profiler attached?

Install async-profiler from https://github.com/async-profiler/async-profiler/releases/tag/v3.0
and configure it as described in https://github.com/async-profiler/async-profiler?tab=readme-ov-file#basic-usage.
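
A minimal install sketch, assuming a Linux x64 machine and that `${ASYNC_PROFILER_HOME}` (used by the commands below) points at the extracted directory; the asset name is taken from the release page and the sysctl settings from the async-profiler basic usage docs:
```shell
# Download and extract the v3.0 release (asset name assumed from the release page)
$ wget https://github.com/async-profiler/async-profiler/releases/download/v3.0/async-profiler-3.0-linux-x64.tar.gz
$ tar -xzf async-profiler-3.0-linux-x64.tar.gz -C ~/
$ export ASYNC_PROFILER_HOME=~/async-profiler-3.0-linux-x64

# Allow perf_events usage by unprivileged processes, per the async-profiler docs
$ sudo sysctl kernel.perf_event_paranoid=1
$ sudo sysctl kernel.kptr_restrict=0
```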

Build the full JMH benchmark jar, which gives more options to control the benchmark run:
```shell
$ ./gradlew jmhJar
```
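
Optionally, you can first list the benchmarks matching a pattern via JMH's standard `-l` flag (the jar path comes from the build step above):
```shell
$ java -jar basic/target/libs/hibernate-orm-benchmark-basic-1.0-SNAPSHOT-jmh.jar -l QueryEntityLazyInitCollectionLoop
```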

On x86 Linux, run `QueryEntityLazyInitCollectionLoop` with wall-clock/cpu and allocation profiling attached, plus the gc and perfnorm JMH profilers,
using 2 forks and 100 loop iterations (a parameter specific to this benchmark, passed via `-pcount=100`):
```shell
$ java -jar basic/target/libs/hibernate-orm-benchmark-basic-1.0-SNAPSHOT-jmh.jar QueryEntityLazyInitCollectionLoop -f 2 -prof gc -prof perfnorm -prof "async:rawCommand=alloc,wall;event=cpu;output=jfr;dir=/tmp;libPath=${ASYNC_PROFILER_HOME}/lib/libasyncProfiler.so" -pcount=100
```
The `rawCommand` parameter is used to additionally enable secondary events which are not yet exposed by the JMH async-profiler plugin,
i.e. wall-clock profiling and allocation profiling (the latter using the default `alloc` event threshold, see `JVMTI_EVENT_SAMPLED_OBJECT_ALLOC` at https://docs.oracle.com/en/java/javase/17/docs/specs/jvmti.html, i.e. 512 KB).

To change the allocation threshold, replace `alloc` with e.g. `alloc=128` for a 128-byte threshold.
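
For example, the async-profiler option would become (a sketch, assuming the raw agent arguments are forwarded verbatim to async-profiler):
```
-prof "async:rawCommand=alloc=128,wall;event=cpu;output=jfr;dir=/tmp;libPath=${ASYNC_PROFILER_HOME}/lib/libasyncProfiler.so"
```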

Running the previous command should print something like this (assuming Linux `perf` is installed):
```
Secondary result "org.hibernate.benchmark.queryl1hit.QueryEntityLazyInitCollectionLoop.test:·async":
Async profiler results:
  /tmp/org.hibernate.benchmark.queryl1hit.QueryEntityLazyInitCollectionLoop.test-Throughput-count-100/jfr-cpu.jfr
Async profiler results:
  /tmp/org.hibernate.benchmark.queryl1hit.QueryEntityLazyInitCollectionLoop.test-Throughput-count-100/jfr-cpu.jfr
```
and finally the results:
```
Benchmark                                                       (count)   Mode  Cnt          Score       Error      Units
QueryEntityLazyInitCollectionLoop.test                              100  thrpt    6         40.391 ±     1.919      ops/s
QueryEntityLazyInitCollectionLoop.test:CPI                          100  thrpt    2          0.303              clks/insn
QueryEntityLazyInitCollectionLoop.test:IPC                          100  thrpt    2          3.300              insns/clk
QueryEntityLazyInitCollectionLoop.test:L1-dcache-load-misses        100  thrpt    2    5329572.054                   #/op
QueryEntityLazyInitCollectionLoop.test:L1-dcache-loads              100  thrpt    2  173216246.787                   #/op
QueryEntityLazyInitCollectionLoop.test:L1-icache-load-misses        100  thrpt    2     192632.126                   #/op
QueryEntityLazyInitCollectionLoop.test:L1-icache-loads              100  thrpt    2   14780212.318                   #/op
QueryEntityLazyInitCollectionLoop.test:branch-misses                100  thrpt    2     187771.566                   #/op
QueryEntityLazyInitCollectionLoop.test:branches                     100  thrpt    2   92994927.715                   #/op
QueryEntityLazyInitCollectionLoop.test:cycles                       100  thrpt    2  140706458.155                   #/op
QueryEntityLazyInitCollectionLoop.test:dTLB-load-misses             100  thrpt    2      38567.276                   #/op
QueryEntityLazyInitCollectionLoop.test:dTLB-loads                   100  thrpt    2    3969019.541                   #/op
QueryEntityLazyInitCollectionLoop.test:iTLB-load-misses             100  thrpt    2      16704.402                   #/op
QueryEntityLazyInitCollectionLoop.test:iTLB-loads                   100  thrpt    2     414076.369                   #/op
QueryEntityLazyInitCollectionLoop.test:instructions                 100  thrpt    2  464215980.492                   #/op
QueryEntityLazyInitCollectionLoop.test:stalled-cycles-frontend      100  thrpt    2   13056272.972                   #/op
QueryEntityLazyInitCollectionLoop.test:·async                       100  thrpt  NaN            ---
QueryEntityLazyInitCollectionLoop.test:·gc.alloc.rate               100  thrpt    6       1155.831 ±    53.937     MB/sec
QueryEntityLazyInitCollectionLoop.test:·gc.alloc.rate.norm          100  thrpt    6   30431129.790 ± 11700.499       B/op
QueryEntityLazyInitCollectionLoop.test:·gc.count                    100  thrpt    6         22.000                 counts
QueryEntityLazyInitCollectionLoop.test:·gc.time                     100  thrpt    6         25.000                     ms
```

The paths of the `jfr` files can then be passed to the async-profiler converter to produce the three types of flamegraphs we're interested in.

For alloc flamegraphs:
```
$ java -cp ${ASYNC_PROFILER_HOME}/lib/converter.jar jfr2flame --alloc --total /tmp/org.hibernate.benchmark.queryl1hit.QueryEntityLazyInitCollectionLoop.test-Throughput-count-100/jfr-cpu.jfr alloc.html
```
The `--total` flag makes the flamegraph weight each sample by the configured allocation threshold, aiming to show the total size of allocations.
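
Omitting `--total` should instead weight each allocation sample equally, which can help spot hot allocation sites regardless of object size; a sketch of that variant (the output file name is just an example):
```
$ java -cp ${ASYNC_PROFILER_HOME}/lib/converter.jar jfr2flame --alloc /tmp/org.hibernate.benchmark.queryl1hit.QueryEntityLazyInitCollectionLoop.test-Throughput-count-100/jfr-cpu.jfr alloc-samples.html
```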

For cpu and wall-clock flamegraphs, instead:
```
$ java -cp ${ASYNC_PROFILER_HOME}/lib/converter.jar jfr2flame --state default /tmp/org.hibernate.benchmark.queryl1hit.QueryEntityLazyInitCollectionLoop.test-Throughput-count-100/jfr-cpu.jfr cpu.html

$ java -cp ${ASYNC_PROFILER_HOME}/lib/converter.jar jfr2flame --state runnable,sleeping /tmp/org.hibernate.benchmark.queryl1hit.QueryEntityLazyInitCollectionLoop.test-Throughput-count-100/jfr-cpu.jfr wall.html
```
Why are we using `--state` here?

See https://github.com/async-profiler/async-profiler/issues/740#issue-1650739133 for more info. TL;DR: both wall and cpu events are collected
as `jdk.ExecutionSample` events, hence async-profiler uses the `state` field to distinguish between them.

To better understand this, we can type:
```shell
$ jfr print --events jdk.ExecutionSample /tmp/org.hibernate.benchmark.queryl1hit.QueryEntityLazyInitCollectionLoop.test-Throughput-count-100/jfr-cpu.jfr | grep -c STATE_DEFAULT
302
$ jfr summary /tmp/org.hibernate.benchmark.queryl1hit.QueryEntityLazyInitCollectionLoop.test-Throughput-count-100/jfr-cpu.jfr

 Version: 2.0
  Chunks: 1
   Start: 2024-05-17 12:32:19 (UTC)
Duration: 3 s

 Event Type                          Count  Size (bytes)
=========================================================
 jdk.ObjectAllocationInNewTLAB        7087        141954
 jdk.ExecutionSample                  3509         55622
 jdk.NativeLibrary                      28          2157
 jdk.ActiveSetting                      23           705
 jdk.InitialSystemProperty              17          1066
 jdk.CPULoad                             3            63
 jdk.GCHeapSummary                       3           123
 jdk.Metadata                            1          6952
 jdk.Checkpoint                          1        294850
 jdk.ActiveRecording                     1            75
 jdk.OSInformation                       1           108
 jdk.CPUInformation                      1           365
 jdk.JVMInformation                      1           348
 jdk.ObjectAllocationOutsideTLAB         0             0
 jdk.JavaMonitorEnter                    0             0
 jdk.ThreadPark                          0             0
 profiler.Log                            0             0
 profiler.Window                         0             0
 profiler.LiveObject                     0             0
```
This shows two interesting things:
- as expected, the cpu-related `jdk.ExecutionSample` events are just a fraction of the overall ones, e.g. 302 out of 3509
- flight recorder collects other events while the `async-profiler` agent is attached
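
As a cross-check, the wall-clock samples should account for the remaining `jdk.ExecutionSample` events. Assuming the `state` values in the recording match the `--state runnable,sleeping` filter used above (an assumption worth verifying on your own recording), something like:
```shell
# Count the wall-clock samples, i.e. everything that is not a cpu (STATE_DEFAULT) sample
$ jfr print --events jdk.ExecutionSample /tmp/org.hibernate.benchmark.queryl1hit.QueryEntityLazyInitCollectionLoop.test-Throughput-count-100/jfr-cpu.jfr | grep -cE 'STATE_(RUNNABLE|SLEEPING)'
```
should report roughly 3509 - 302 = 3207 samples.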