Optimize execution of ZPure #1306

Merged
merged 9 commits into from
May 7, 2024

Conversation

kyri-petrou
Contributor

This PR is an ambitious attempt at improving the already great performance of ZPure execution, measured with benchmarks that cover environment access, modification, logging and error handling.

Main changes

1. Create a Runner class, separate heap and stack variables and reuse Stack and ChunkBuilder:

First, I want to apologise for doing this as I'm certain it'll make reviewing the changes much harder, but I think it's probably for the best in the longer run.

This optimization allows us to reuse the Runner instance via a ThreadLocal and avoid allocations whenever we execute a new ZPure. While this might not provide much benefit when running long-lived ZPures, it makes a big difference for short-lived ones. To keep this reentrant-safe, we create a new Runner instance if the current thread is already running another ZPure.
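The reuse-with-fallback idea can be sketched in plain Scala. This is a minimal illustration, not the Runner from this PR: the `inUse` flag, the `scratch` buffer and the `acquire` method are hypothetical names standing in for whatever the real class does.

```scala
// Hypothetical sketch; not the actual Runner added in this PR.
final class Runner {
  var inUse: Boolean = false
  // Stands in for reusable buffers such as the Stack and ChunkBuilder.
  val scratch = new scala.collection.mutable.ArrayBuffer[Any]()

  def run[A](program: () => A): A = {
    inUse = true
    try program()
    finally {
      scratch.clear() // reset reusable state for the next run
      inUse = false
    }
  }
}

object Runner {
  private val cached: ThreadLocal[Runner] =
    new ThreadLocal[Runner] {
      override def initialValue(): Runner = new Runner
    }

  // Reuse the per-thread instance unless it is already busy,
  // which happens when a run is started from inside another run.
  def acquire(): Runner = {
    val runner = cached.get()
    if (runner.inUse) new Runner else runner
  }
}
```

A nested call on the same thread gets a fresh instance, while sequential calls share the cached one.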

I've also removed the failed variable in favour of throwing a stackless error and catching it when we don't have any error handlers in the Stack. This relies on the assumption that in most cases a ZPure will complete successfully.
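For readers unfamiliar with stackless errors, here is a minimal sketch using `scala.util.control.NoStackTrace` from the standard library. The `Failure` class and `runEither` helper are hypothetical and only illustrate the pattern of throwing a cheap exception and converting it back to a value at the boundary; the exception type used in this PR may differ.

```scala
import scala.util.control.NoStackTrace

object StacklessDemo {
  // Hypothetical carrier for a typed failure; no stack trace is captured,
  // which keeps construction cheap.
  final class Failure(val error: Any) extends Throwable with NoStackTrace

  // Runs `body`, converting a thrown Failure back into a Left at the boundary.
  def runEither[E, A](body: => A): Either[E, A] =
    try Right(body)
    catch { case f: Failure => Left(f.error.asInstanceOf[E]) }

  // Example computation that fails by throwing instead of setting a flag.
  def parsePositive(n: Int): Int =
    if (n > 0) n else throw new Failure(s"not positive: $n")
}
```

The successful path pays nothing beyond the `try`, which matches the assumption that failures are the uncommon case.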

2. Rely on vals / vars instead of Stack for environment / logs

Since ZPure is purely synchronous, we don't need a Stack to store modifications to the environment / logs. Instead, we can store the old value in a local val and restore it once that branch has finished execution. This saves allocations and also improves performance, since we no longer need to call peek() to access the environment or write to logs.
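The save-and-restore pattern can be shown with a toy interpreter. This is a sketch under the assumption that the environment is a single mutable field; the `Interpreter` class and `provide` method are hypothetical and much simpler than the real run loop.

```scala
object EnvDemo {
  // Hypothetical mini-interpreter: the current environment is a plain var
  // rather than the top of a Stack.
  final class Interpreter(initial: String) {
    private var env: String = initial

    def currentEnv: String = env

    // Runs `body` with a temporary environment, then restores the previous one.
    def provide[A](newEnv: String)(body: => A): A = {
      val oldEnv = env // the "old one" kept in a local val
      env = newEnv
      try body
      finally env = oldEnv
    }
  }
}
```

Because execution is synchronous, the local `oldEnv` is always the right value to restore when the branch ends.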

3. Don't start a fresh log segment when keepLogOnError = true

We only need to start a fresh log segment when the segments have to be kept separate (i.e., when keepLogOnError = false). Otherwise, we can just continue writing to the existing one.
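A rough illustration of the difference, assuming logs are collected in a mutable buffer. The `attempt` helper is hypothetical; it only shows why a separate segment is needed when logs from a failed branch must be dropped, and why it can be skipped otherwise.

```scala
import scala.collection.mutable.ArrayBuffer

object LogDemo {
  // With keepLogOnError = false the branch writes to a fresh segment that is
  // merged only on success; with true it appends to the existing buffer.
  def attempt[A](log: ArrayBuffer[String], keepLogOnError: Boolean)(
    branch: ArrayBuffer[String] => Either[String, A]
  ): Either[String, A] =
    if (keepLogOnError) branch(log)
    else {
      val segment = ArrayBuffer.empty[String] // extra allocation
      val result  = branch(segment)
      if (result.isRight) log ++= segment     // discard the segment on failure
      result
    }
}
```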

4. Reduce reliance on Stack.push / Stack.pop by adding manual handling of Log and Environment after a FlatMap

Writing to logs and accessing the environment are very common operations when composing ZPures. We optimize for this by adding special handling after a FlatMap that doesn't require pushing to / popping from the Stack.
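To make the fast path concrete, here is a toy run loop over a three-case instruction set. Everything in it (`Op`, `Succeed`, `Log`, `FlatMap`, the `pushes` counter) is a hypothetical simplification and not the ZPure encoding; it only shows how a FlatMap whose first step is a log write can be handled without touching the continuation stack.

```scala
import scala.collection.mutable.ArrayBuffer

object FlatMapDemo {
  sealed trait Op[+A]
  final case class Succeed[A](value: A) extends Op[A]
  final case class Log(message: String) extends Op[Unit]
  final case class FlatMap[A, B](first: Op[A], k: A => Op[B]) extends Op[B]

  // Returns the result, the log, and how many continuations were pushed.
  def run[A](program: Op[A]): (A, List[String], Int) = {
    val stack = new java.util.ArrayDeque[Any => Op[Any]]()
    val log = ArrayBuffer.empty[String]
    var pushes = 0
    var current: Op[Any] = program
    var result: Any = null
    var done = false
    while (!done) {
      current match {
        case FlatMap(Log(message), k) =>
          // Fast path: write the log and apply the continuation directly.
          log += message
          val next = k.asInstanceOf[Any => Op[Any]]
          current = next(())
        case FlatMap(first, k) =>
          // General path: remember the continuation, evaluate `first`.
          stack.push(k.asInstanceOf[Any => Op[Any]])
          pushes += 1
          current = first
        case Log(message) =>
          log += message
          current = Succeed(())
        case Succeed(value) =>
          if (stack.isEmpty) { result = value; done = true }
          else {
            val next = stack.pop()
            current = next(value)
          }
      }
    }
    (result.asInstanceOf[A], log.toList, pushes)
  }
}
```

A chain of log writes followed by a result runs with zero pushes in this sketch, whereas a FlatMap over any other first step still goes through the stack.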

5. Custom implementation of Stack:

This was one of the first optimizations I did as part of this work, but in hindsight it might be a bit unnecessary given that we've reduced reliance on Stack due to the points above. I decided to keep it, though, because it still provides some benefit:

The Stack from ZIO is already fast enough, but forces GC of values each time they're "popped". This makes sense in ZIO because the Stack might be long-lived, but for ZPure the stack entries are going to be GC'd automatically when the runloop finishes. Also, due to the internal structure of the Stack, at most we'll have 13 non-GC'd objects at any time - which is a small price to pay for not having to GC on every pop. In addition, ZIO's Stack doesn't provide a clear method, which we need to make (1) work.
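The trade-off can be illustrated with a simple array-backed stack. This is a sketch only: the Stack in this PR has a different internal layout, and `FastStack` is a hypothetical name. The two points it shows are that `pop` leaves the slot populated, and that `clear` releases every reference in one pass so the instance can be reused.

```scala
object StackDemo {
  // Hypothetical array-backed stack; not the structure used in this PR.
  final class FastStack[A <: AnyRef](initialCapacity: Int = 16) {
    private var items: Array[AnyRef] = new Array[AnyRef](math.max(initialCapacity, 1))
    private var size: Int = 0

    def isEmpty: Boolean = size == 0

    def push(a: A): Unit = {
      if (size == items.length) {
        val grown = new Array[AnyRef](items.length * 2)
        System.arraycopy(items, 0, grown, 0, size)
        items = grown
      }
      items(size) = a
      size += 1
    }

    // Returns null when empty. The slot is NOT cleared, so the popped value
    // stays reachable until it is overwritten or clear() is called.
    def pop(): A =
      if (size == 0) null.asInstanceOf[A]
      else {
        size -= 1
        items(size).asInstanceOf[A]
      }

    // Drops all references at once so the instance can be reused.
    def clear(): Unit = {
      var i = 0
      while (i < items.length) { items(i) = null; i += 1 }
      size = 0
    }
  }
}
```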

Benchmarks

I've only included benchmarking results for the newly added benchmarks, since the existing ones, which only benchmarked ZPure.succeed, haven't changed:

TLDR

The new changes provide a much bigger benefit whenever error handling is involved (i.e., Fold). Also, smaller ZPures see a bigger improvement, since the cost of creating a new Runner is amortized over longer-running ZPures.

  • 10-65% increase in throughput
  • 8-45% reduced memory allocations

Full results

series/2.x:

[info] Benchmark                                                  (size)   Mode  Cnt        Score      Error   Units
[info] ZPureFullBenchmark.fallibleBenchmark                           10  thrpt    5   477838.967 ± 9094.529   ops/s
[info] ZPureFullBenchmark.fallibleBenchmark:gc.alloc.rate             10  thrpt    5     5620.851 ±  106.981  MB/sec
[info] ZPureFullBenchmark.fallibleBenchmark:gc.alloc.rate.norm        10  thrpt    5    12336.012 ±    0.001    B/op
[info] ZPureFullBenchmark.fallibleBenchmark:gc.count                  10  thrpt    5       78.000             counts
[info] ZPureFullBenchmark.fallibleBenchmark:gc.time                   10  thrpt    5       35.000                 ms
[info] ZPureFullBenchmark.fallibleBenchmark                         1000  thrpt    5     6222.224 ±   49.631   ops/s
[info] ZPureFullBenchmark.fallibleBenchmark:gc.alloc.rate           1000  thrpt    5     6609.630 ±   52.643  MB/sec
[info] ZPureFullBenchmark.fallibleBenchmark:gc.alloc.rate.norm      1000  thrpt    5  1113984.954 ±    0.007    B/op
[info] ZPureFullBenchmark.fallibleBenchmark:gc.count                1000  thrpt    5       91.000             counts
[info] ZPureFullBenchmark.fallibleBenchmark:gc.time                 1000  thrpt    5       42.000                 ms
[info] ZPureFullBenchmark.infallibleBenchmark                         10  thrpt    5   619155.376 ± 5948.625   ops/s
[info] ZPureFullBenchmark.infallibleBenchmark:gc.alloc.rate           10  thrpt    5     4793.988 ±   46.308  MB/sec
[info] ZPureFullBenchmark.infallibleBenchmark:gc.alloc.rate.norm      10  thrpt    5     8120.010 ±    0.001    B/op
[info] ZPureFullBenchmark.infallibleBenchmark:gc.count                10  thrpt    5       80.000             counts
[info] ZPureFullBenchmark.infallibleBenchmark:gc.time                 10  thrpt    5       34.000                 ms
[info] ZPureFullBenchmark.infallibleBenchmark                       1000  thrpt    5    10337.921 ±   74.983   ops/s
[info] ZPureFullBenchmark.infallibleBenchmark:gc.alloc.rate         1000  thrpt    5     6880.840 ±   50.454  MB/sec
[info] ZPureFullBenchmark.infallibleBenchmark:gc.alloc.rate.norm    1000  thrpt    5   698008.647 ±    0.615    B/op
[info] ZPureFullBenchmark.infallibleBenchmark:gc.count              1000  thrpt    5       95.000             counts
[info] ZPureFullBenchmark.infallibleBenchmark:gc.time               1000  thrpt    5       44.000                 ms

PR:

[info] Benchmark                                                  (size)   Mode  Cnt       Score       Error   Units
[info] ZPureFullBenchmark.fallibleBenchmark                           10  thrpt    5  790700.059 ± 10575.941   ops/s
[info] ZPureFullBenchmark.fallibleBenchmark:gc.alloc.rate             10  thrpt    5    5844.791 ±    77.647  MB/sec
[info] ZPureFullBenchmark.fallibleBenchmark:gc.alloc.rate.norm        10  thrpt    5    7752.007 ±     0.001    B/op
[info] ZPureFullBenchmark.fallibleBenchmark:gc.count                  10  thrpt    5      81.000              counts
[info] ZPureFullBenchmark.fallibleBenchmark:gc.time                   10  thrpt    5      37.000                  ms
[info] ZPureFullBenchmark.fallibleBenchmark                         1000  thrpt    5   10412.493 ±    96.469   ops/s
[info] ZPureFullBenchmark.fallibleBenchmark:gc.alloc.rate           1000  thrpt    5    7196.676 ±    66.646  MB/sec
[info] ZPureFullBenchmark.fallibleBenchmark:gc.alloc.rate.norm      1000  thrpt    5  724833.839 ±    10.915    B/op
[info] ZPureFullBenchmark.fallibleBenchmark:gc.count                1000  thrpt    5     100.000              counts
[info] ZPureFullBenchmark.fallibleBenchmark:gc.time                 1000  thrpt    5      47.000                  ms
[info] ZPureFullBenchmark.infallibleBenchmark                         10  thrpt    5  895840.885 ±  6466.530   ops/s
[info] ZPureFullBenchmark.infallibleBenchmark:gc.alloc.rate           10  thrpt    5    6006.956 ±    43.715  MB/sec
[info] ZPureFullBenchmark.infallibleBenchmark:gc.alloc.rate.norm      10  thrpt    5    7032.007 ±     0.001    B/op
[info] ZPureFullBenchmark.infallibleBenchmark:gc.count                10  thrpt    5      83.000              counts
[info] ZPureFullBenchmark.infallibleBenchmark:gc.time                 10  thrpt    5      37.000                  ms
[info] ZPureFullBenchmark.infallibleBenchmark                       1000  thrpt    5   11815.139 ±   104.042   ops/s
[info] ZPureFullBenchmark.infallibleBenchmark:gc.alloc.rate         1000  thrpt    5    7264.904 ±    63.565  MB/sec
[info] ZPureFullBenchmark.infallibleBenchmark:gc.alloc.rate.norm    1000  thrpt    5  644832.697 ±     1.659    B/op
[info] ZPureFullBenchmark.infallibleBenchmark:gc.count              1000  thrpt    5     101.000              counts
[info] ZPureFullBenchmark.infallibleBenchmark:gc.time               1000  thrpt    5      48.000                  ms
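To compare a row across the two tables, the relative change is (after - before) / before. A small sketch, with a hypothetical helper name, applied to the fallibleBenchmark rows for size = 10:

```scala
object BenchmarkMath {
  // Relative change from `before` to `after`, in percent.
  def percentChange(before: Double, after: Double): Double =
    (after - before) / before * 100.0

  // fallibleBenchmark, size = 10, values copied from the two tables above.
  val throughputChange: Double = percentChange(477838.967, 790700.059) // ops/s
  val allocationChange: Double = percentChange(12336.012, 7752.007)    // B/op
}
```

For these two rows this works out to roughly a 65% increase in throughput and a 37% reduction in bytes allocated per operation.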

Special thanks to @ghostdogpr for writing the benchmarks and providing insights on realistic production use-cases of ZPure!

@kyri-petrou kyri-petrou requested a review from a team as a code owner May 6, 2024 00:33
@ghostdogpr
Member

Looks great! I'll build locally and run my test suite at work tomorrow to see if I catch anything off.

@ghostdogpr ghostdogpr left a comment

Passed my tests at work + did a load test that showed nearly 10% improvement in response time, CPU and allocations.

@ghostdogpr ghostdogpr merged commit 4e52744 into zio:series/2.x May 7, 2024
20 checks passed
@kyri-petrou kyri-petrou deleted the optimize-runloop branch May 7, 2024 04:55