Optimize execution of `ZPure` #1306

kyri-petrou · 2024-05-06T00:33:48Z

This PR is an ambitious attempt at improving the already great performance of `ZPure` execution. The new benchmarks cover environment access, modification, logging and error handling.

Main changes

1. Create a `Runner` class, separate heap and stack variables and reuse `Stack` and `ChunkBuilder`:

First, I want to apologise for doing this as I'm certain it'll make reviewing the changes much harder, but I think it's probably for the best in the longer run.

This optimization allows us to reuse the `Runner` class via a `ThreadLocal` and avoid allocations whenever we execute a new `ZPure`. While this might not provide much benefit when running long-lived ZPures, it makes a big difference for short-lived ones. To keep this reentrant-safe, if the current thread is already running another `ZPure`, we create a new `Runner` instance instead of reusing the cached one.
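To illustrate the idea, here is a minimal sketch of caching a runner per thread with a reentrancy guard. The `Runner` below is a simplified, hypothetical stand-in (the `inUse` flag, the `execute` method and the `ArrayDeque` scratch state are assumptions for illustration), not the actual class from this PR.

```scala
// Hypothetical, simplified stand-in for the PR's Runner; not the zio-prelude code.
final class Runner {
  // Scratch state that we would rather not reallocate on every run.
  private val stack = new java.util.ArrayDeque[AnyRef]()
  // True while this instance is executing a program on the current thread.
  private var inUse: Boolean = false

  private def run[A](program: Runner => A): A = {
    stack.clear() // start from a clean slate when the instance is reused
    program(this)
  }
}

object Runner {
  // One cached Runner per thread.
  private val cached: ThreadLocal[Runner] = new ThreadLocal[Runner] {
    override def initialValue(): Runner = new Runner
  }

  def execute[A](program: Runner => A): A = {
    val local = cached.get()
    if (local.inUse) {
      // Reentrant call: the cached instance is busy, so fall back to a fresh one.
      new Runner().run(program)
    } else {
      local.inUse = true
      try local.run(program)
      finally local.inUse = false
    }
  }
}
```

Two sequential calls to `Runner.execute` on the same thread receive the same instance, while a call made from inside a running program gets a separate one.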

I've also removed the `failed` variable in favour of throwing a stackless error and catching it when we don't have any error handlers in the `Stack`. This relies on the assumption that in most cases a `ZPure` will complete successfully.
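As a rough sketch of the stackless-error approach, the standard library's `scala.util.control.NoStackTrace` can be used to skip stack trace capture. The names `StacklessFailure`, `fail` and `runEither` are hypothetical and only illustrate the technique; the PR's real error type may look different.

```scala
import scala.util.control.NoStackTrace

// Hypothetical carrier for a failure value. NoStackTrace skips the (expensive)
// stack trace capture, so throwing is cheap on the rare failing path.
final class StacklessFailure(val error: Any) extends RuntimeException with NoStackTrace

object StacklessDemo {
  // Signal a failure when no error handler is available.
  def fail(error: Any): Nothing = throw new StacklessFailure(error)

  // Catch the failure once, at the outer boundary, instead of checking a
  // `failed` flag on every iteration of the loop.
  def runEither[E, A](body: => A): Either[E, A] =
    try Right(body)
    catch { case f: StacklessFailure => Left(f.error.asInstanceOf[E]) }
}
```

The successful path pays nothing beyond the `try` block, which matches the assumption that most programs complete successfully.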

2. Rely on vals / vars instead of Stack for environment / logs

Since `ZPure` is purely synchronous, we don't need a `Stack` to store modifications to the environment / logs. Instead, we can store the old value in a local val and restore it once that branch has finished execution. This saves allocations and also improves performance, since we no longer need to call `peek()` to access the environment or write to logs.
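A minimal sketch of the save-and-restore pattern, assuming a hypothetical `EnvRunner` with a single mutable `env` field (the real runloop is structured differently, this only shows the idea):

```scala
// Hypothetical runner holding the current environment in a plain var.
final class EnvRunner[R](initial: R) {
  private var env: R = initial

  // Reading the environment is a field read, no Stack.peek() required.
  def access: R = env

  // Run `branch` with a different environment, then restore the previous one.
  def provide[A](newEnv: R)(branch: => A): A = {
    val old = env // the previous value lives in a local val
    env = newEnv
    try branch
    finally env = old // revert once the branch has finished
  }
}
```

Because execution is synchronous, the local val is always still in scope when the branch finishes, so nothing needs to be pushed onto a separate stack.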

3. Don't start a fresh log segment when `keepLogOnError = true`

We only need to start a fresh log segment when the logs of a branch must be kept separate (i.e., when `keepLogOnError = false`). Otherwise, we can just continue writing to the existing one.
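The difference between the two modes can be sketched as follows. `LogRunner` and `attempt` are hypothetical names, and an `ArrayBuffer` stands in for the `ChunkBuilder` used by the real implementation.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical runner that collects log entries of type W.
final class LogRunner[W] {
  private var logs = ArrayBuffer.empty[W]

  def log(entry: W): Unit = logs += entry
  def result: List[W] = logs.toList

  // Run a branch that may fail (modelled here as a RuntimeException).
  def attempt[A](keepLogOnError: Boolean)(branch: => A): Option[A] =
    if (keepLogOnError) {
      // Logs survive a failure anyway, so keep writing to the current buffer.
      try Some(branch)
      catch { case _: RuntimeException => None }
    } else {
      // Logs must be dropped on failure, so write to a separate segment.
      val parent  = logs
      val segment = ArrayBuffer.empty[W]
      logs = segment
      try {
        val a = branch
        parent ++= segment // success: merge the segment into the parent
        Some(a)
      } catch {
        case _: RuntimeException => None // failure: the segment is discarded
      } finally logs = parent
    }
}
```

Only the `keepLogOnError = false` path allocates a new buffer, which is the allocation this change avoids in the other case.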

4. Reduce reliance on `Stack.push` / `Stack.pop` by adding manual handling of `Log` and `Environment` after a `FlatMap`

Writing to logs and accessing the environment are very common operations when composing ZPures. We optimize for this by adding special handling after a `FlatMap` that doesn't require pushing to / popping from the `Stack`.
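To show what such special handling could look like, here is a toy interpreter over a three-node program type. `Prog`, `Succeed`, `Log`, `FlatMap` and `MiniRunner` are hypothetical, heavily simplified stand-ins for the `ZPure` internals; the only point is the first `case`, which handles a log write directly and never touches the stack.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy program type, loosely modelled on the kind of nodes a runloop interprets.
sealed trait Prog[+A]
final case class Succeed[A](value: A) extends Prog[A]
final case class Log(entry: String) extends Prog[Unit]
final case class FlatMap[A, B](first: Prog[A], k: A => Prog[B]) extends Prog[B]

object MiniRunner {
  // Returns (logs, result, number of continuations pushed onto the stack).
  def run[A](program: Prog[A]): (List[String], A, Int) = {
    val logs  = ArrayBuffer.empty[String]
    val stack = new java.util.ArrayDeque[Any => Prog[Any]]()
    var pushes = 0
    var current: Prog[Any] = program
    var result: Any = null
    var done = false
    while (!done) current match {
      case FlatMap(Log(entry), k) =>
        // Fast path: write the log and call the continuation directly.
        logs += entry
        current = k.asInstanceOf[Any => Prog[Any]](())
      case FlatMap(first, k) =>
        // General path: remember the continuation, then evaluate `first`.
        stack.push(k.asInstanceOf[Any => Prog[Any]])
        pushes += 1
        current = first
      case Log(entry) =>
        logs += entry
        current = Succeed(())
      case Succeed(value) =>
        if (stack.isEmpty) { result = value; done = true }
        else current = stack.pop()(value)
    }
    (logs.toList, result.asInstanceOf[A], pushes)
  }
}
```

A chain of log writes followed by a `Succeed` runs with zero pushes in this sketch, whereas a `FlatMap` over any other node still goes through the stack.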

5. Custom implementation of `Stack`:

This was one of the first optimizations I did as part of this work, but in hindsight it might be a bit unnecessary given that we've reduced reliance on `Stack` due to the points above. I decided to keep it though, because it still provides some benefit:

The `Stack` from ZIO is already fast enough, but it makes values eligible for GC each time they're "popped". This makes sense in ZIO because the `Stack` might be long-lived, but for `ZPure` the stack entries are going to be GC'd automatically when the runloop finishes. Also, due to the internal structure of the `Stack`, at most we'll have 13 non-GC'd objects at any time, which is a small price to pay for not having to do this work on every pop. In addition, ZIO's `Stack` doesn't provide a `clear` method, which we need to make (1) work.
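The trade-off can be sketched with a plain array-backed stack. `ReusableStack` is a hypothetical illustration and does not reproduce the internal structure of the `Stack` in this PR; in particular, clearing references inside `clear()` is an assumption of this sketch.

```scala
// Hypothetical array-backed stack: pop() only moves the index and leaves the
// old reference in place, while clear() resets the stack for reuse.
final class ReusableStack[A <: AnyRef](initialCapacity: Int = 16) {
  private var items = new Array[AnyRef](math.max(1, initialCapacity))
  private var size  = 0

  def push(a: A): Unit = {
    if (size == items.length) items = java.util.Arrays.copyOf(items, size * 2)
    items(size) = a
    size += 1
  }

  // Returns null when empty; popped slots are not cleared here.
  def pop(): A =
    if (size == 0) null.asInstanceOf[A]
    else {
      size -= 1
      items(size).asInstanceOf[A]
    }

  def isEmpty: Boolean = size == 0

  // Drop all references once, at the end of a run, so the instance can be
  // cached and reused without holding on to stale entries.
  def clear(): Unit = {
    var i = 0
    while (i < items.length) { items(i) = null; i += 1 }
    size = 0
  }
}
```

Doing the cleanup once per run instead of once per `pop` is what makes this cheaper for short-lived programs.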

Benchmarks

I've only included benchmarking results for the newly added benchmarks, since the existing ones which were only benchmarking ZPure.succeed haven't changed:

TLDR

The new changes provide a much bigger benefit whenever error handling is involved (i.e., `Fold`). Smaller ZPures also see a bigger improvement, since for longer-running ZPures the cost of creating a new `Runner` was already amortized over the run.

10-65% increase in throughput
8-45% reduced memory allocations

Full results

series/2.x:

[info] Benchmark                                                  (size)   Mode  Cnt        Score      Error   Units
[info] ZPureFullBenchmark.fallibleBenchmark                           10  thrpt    5   477838.967 ± 9094.529   ops/s
[info] ZPureFullBenchmark.fallibleBenchmark:gc.alloc.rate             10  thrpt    5     5620.851 ±  106.981  MB/sec
[info] ZPureFullBenchmark.fallibleBenchmark:gc.alloc.rate.norm        10  thrpt    5    12336.012 ±    0.001    B/op
[info] ZPureFullBenchmark.fallibleBenchmark:gc.count                  10  thrpt    5       78.000             counts
[info] ZPureFullBenchmark.fallibleBenchmark:gc.time                   10  thrpt    5       35.000                 ms
[info] ZPureFullBenchmark.fallibleBenchmark                         1000  thrpt    5     6222.224 ±   49.631   ops/s
[info] ZPureFullBenchmark.fallibleBenchmark:gc.alloc.rate           1000  thrpt    5     6609.630 ±   52.643  MB/sec
[info] ZPureFullBenchmark.fallibleBenchmark:gc.alloc.rate.norm      1000  thrpt    5  1113984.954 ±    0.007    B/op
[info] ZPureFullBenchmark.fallibleBenchmark:gc.count                1000  thrpt    5       91.000             counts
[info] ZPureFullBenchmark.fallibleBenchmark:gc.time                 1000  thrpt    5       42.000                 ms
[info] ZPureFullBenchmark.infallibleBenchmark                         10  thrpt    5   619155.376 ± 5948.625   ops/s
[info] ZPureFullBenchmark.infallibleBenchmark:gc.alloc.rate           10  thrpt    5     4793.988 ±   46.308  MB/sec
[info] ZPureFullBenchmark.infallibleBenchmark:gc.alloc.rate.norm      10  thrpt    5     8120.010 ±    0.001    B/op
[info] ZPureFullBenchmark.infallibleBenchmark:gc.count                10  thrpt    5       80.000             counts
[info] ZPureFullBenchmark.infallibleBenchmark:gc.time                 10  thrpt    5       34.000                 ms
[info] ZPureFullBenchmark.infallibleBenchmark                       1000  thrpt    5    10337.921 ±   74.983   ops/s
[info] ZPureFullBenchmark.infallibleBenchmark:gc.alloc.rate         1000  thrpt    5     6880.840 ±   50.454  MB/sec
[info] ZPureFullBenchmark.infallibleBenchmark:gc.alloc.rate.norm    1000  thrpt    5   698008.647 ±    0.615    B/op
[info] ZPureFullBenchmark.infallibleBenchmark:gc.count              1000  thrpt    5       95.000             counts
[info] ZPureFullBenchmark.infallibleBenchmark:gc.time               1000  thrpt    5       44.000                 ms

PR:

[info] Benchmark                                                  (size)   Mode  Cnt       Score       Error   Units
[info] ZPureFullBenchmark.fallibleBenchmark                           10  thrpt    5  790700.059 ± 10575.941   ops/s
[info] ZPureFullBenchmark.fallibleBenchmark:gc.alloc.rate             10  thrpt    5    5844.791 ±    77.647  MB/sec
[info] ZPureFullBenchmark.fallibleBenchmark:gc.alloc.rate.norm        10  thrpt    5    7752.007 ±     0.001    B/op
[info] ZPureFullBenchmark.fallibleBenchmark:gc.count                  10  thrpt    5      81.000              counts
[info] ZPureFullBenchmark.fallibleBenchmark:gc.time                   10  thrpt    5      37.000                  ms
[info] ZPureFullBenchmark.fallibleBenchmark                         1000  thrpt    5   10412.493 ±    96.469   ops/s
[info] ZPureFullBenchmark.fallibleBenchmark:gc.alloc.rate           1000  thrpt    5    7196.676 ±    66.646  MB/sec
[info] ZPureFullBenchmark.fallibleBenchmark:gc.alloc.rate.norm      1000  thrpt    5  724833.839 ±    10.915    B/op
[info] ZPureFullBenchmark.fallibleBenchmark:gc.count                1000  thrpt    5     100.000              counts
[info] ZPureFullBenchmark.fallibleBenchmark:gc.time                 1000  thrpt    5      47.000                  ms
[info] ZPureFullBenchmark.infallibleBenchmark                         10  thrpt    5  895840.885 ±  6466.530   ops/s
[info] ZPureFullBenchmark.infallibleBenchmark:gc.alloc.rate           10  thrpt    5    6006.956 ±    43.715  MB/sec
[info] ZPureFullBenchmark.infallibleBenchmark:gc.alloc.rate.norm      10  thrpt    5    7032.007 ±     0.001    B/op
[info] ZPureFullBenchmark.infallibleBenchmark:gc.count                10  thrpt    5      83.000              counts
[info] ZPureFullBenchmark.infallibleBenchmark:gc.time                 10  thrpt    5      37.000                  ms
[info] ZPureFullBenchmark.infallibleBenchmark                       1000  thrpt    5   11815.139 ±   104.042   ops/s
[info] ZPureFullBenchmark.infallibleBenchmark:gc.alloc.rate         1000  thrpt    5    7264.904 ±    63.565  MB/sec
[info] ZPureFullBenchmark.infallibleBenchmark:gc.alloc.rate.norm    1000  thrpt    5  644832.697 ±     1.659    B/op
[info] ZPureFullBenchmark.infallibleBenchmark:gc.count              1000  thrpt    5     101.000              counts
[info] ZPureFullBenchmark.infallibleBenchmark:gc.time               1000  thrpt    5      48.000                  ms

Special thanks to @ghostdogpr for writing the benchmarks and providing insights on realistic production use-cases of ZPure!

ghostdogpr · 2024-05-06T01:16:46Z

Looks great! I'll build locally and run my test suite at work tomorrow to see if I catch anything off.

ghostdogpr

Passed my tests at work + did a load test that showed nearly 10% improvement in response time, CPU and allocations.

kyri-petrou and others added 9 commits April 29, 2024 18:24

Optimizations for ZPure runloop

d39c8cd

Improve Fold

fb1ae01

Use packed0 != 1

0024977

One more test for sanity's safe

d98662c

Final cleanups

6b63fb7

Reimplement runloop via a Runner class and use thread locals to cache runners

33e60b7

Merge branch 'zio:series/2.x' into optimize-runloop

9b7abd2

Cleanup Fail loop

2e861c9

Merge remote-tracking branch 'kyri-petrou/optimize-runloop' into optimize-runloop

a0bb158

kyri-petrou requested a review from a team as a code owner May 6, 2024 00:33

ghostdogpr approved these changes May 7, 2024

ghostdogpr merged commit 4e52744 into zio:series/2.x May 7, 2024
20 checks passed

kyri-petrou deleted the optimize-runloop branch May 7, 2024 04:55

kyri-petrou mentioned this pull request Aug 6, 2024

Improve ZSTM's performance zio/zio#9081

Merged
