
performance improvements for any and all #57091

Open · wants to merge 8 commits into base: master
Conversation

@nsajko nsajko commented Jan 18, 2025

  • Help ensure vectorization of `any` and `all` for homogeneous tuples of `Bool` by avoiding bounds-checking and short-circuiting. Inspired by the fast methods for `any` and `all` for `Bool` tuples in #55673, but more general: it relies on a loop instead of recursion, so it is performant for any input length.
  • Delete a number of now-redundant methods, since dispatch no longer happens on `Bool` or `Missing`:
    • eight methods of all are deleted
    • four methods of any are deleted

Closes #55673
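To illustrate the approach described above, here is a minimal sketch (hypothetical names, not the PR's actual code) of a loop-based, non-short-circuiting reduction over a homogeneous `Bool` tuple. Accumulating with `&` instead of `&&` keeps the loop body branchless, which is what lets LLVM vectorize it regardless of tuple length:

```julia
# Hypothetical sketch of the technique: a branch-free loop reduction.
# Iterating the tuple directly avoids explicit indexing (and thus
# bounds checks), and `&=` avoids the short-circuit branch of `&&`.
function _tuple_all(itr::Tuple{Vararg{Bool}})
    r = true
    for x in itr
        r &= x  # non-short-circuiting: no branch in the loop body
    end
    return r
end

_tuple_all((true, true, false))        # → false
_tuple_all(ntuple(Returns(true), 64))  # → true
```

The same shape with `r |= x` and an initial value of `false` would serve for `any`.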

@nsajko nsajko changed the title performance and effects improvements for any and all performance improvements for any and all Jan 18, 2025
nsajko commented Jan 18, 2025

Here's the `code_llvm` output:

julia> code_llvm(all, Tuple{Tuple{Vararg{Bool, 32}}}; debuginfo=:none)
; Function Signature: all(NTuple{32, Bool})
define i8 @julia_all_2401(ptr nocapture noundef nonnull readonly align 1 dereferenceable(32) %"itr::Tuple") #0 {
top:
  %0 = load <32 x i8>, ptr %"itr::Tuple", align 1
  %1 = icmp eq <32 x i8> %0, zeroinitializer
  %2 = bitcast <32 x i1> %1 to i32
  %3 = icmp eq i32 %2, 0
  %4 = zext i1 %3 to i8
  ret i8 %4
}

julia> code_llvm(any, Tuple{Tuple{Vararg{Bool, 32}}}; debuginfo=:none)
; Function Signature: any(NTuple{32, Bool})
define i8 @julia_any_2403(ptr nocapture noundef nonnull readonly align 1 dereferenceable(32) %"itr::Tuple") #0 {
top:
  %0 = load <32 x i8>, ptr %"itr::Tuple", align 1
  %1 = call i8 @llvm.vector.reduce.or.v32i8(<32 x i8> %0)
  %2 = icmp ne i8 %1, 0
  %3 = zext i1 %2 to i8
  ret i8 %3
}

julia> code_llvm(all, Tuple{Tuple{Vararg{Bool, 64}}}; debuginfo=:none)
; Function Signature: all(NTuple{64, Bool})
define i8 @julia_all_2405(ptr nocapture noundef nonnull readonly align 1 dereferenceable(64) %"itr::Tuple") #0 {
top:
  %wide.load = load <32 x i8>, ptr %"itr::Tuple", align 1
  %0 = icmp ne <32 x i8> %wide.load, zeroinitializer
  %1 = getelementptr inbounds i8, ptr %"itr::Tuple", i64 32
  %wide.load.1 = load <32 x i8>, ptr %1, align 1
  %2 = icmp ne <32 x i8> %wide.load.1, zeroinitializer
  %3 = and <32 x i1> %0, %2
  %4 = bitcast <32 x i1> %3 to i32
  %5 = icmp eq i32 %4, -1
  %6 = zext i1 %5 to i8
  ret i8 %6
}

julia> code_llvm(any, Tuple{Tuple{Vararg{Bool, 64}}}; debuginfo=:none)
; Function Signature: any(NTuple{64, Bool})
define i8 @julia_any_2407(ptr nocapture noundef nonnull readonly align 1 dereferenceable(64) %"itr::Tuple") #0 {
top:
  %wide.load = load <32 x i8>, ptr %"itr::Tuple", align 1
  %0 = getelementptr inbounds i8, ptr %"itr::Tuple", i64 32
  %wide.load.1 = load <32 x i8>, ptr %0, align 1
  %1 = or <32 x i8> %wide.load, %wide.load.1
  %2 = icmp ne <32 x i8> %1, zeroinitializer
  %3 = bitcast <32 x i1> %2 to i32
  %4 = icmp ne i32 %3, 0
  %5 = zext i1 %4 to i8
  ret i8 %5
}

julia> versioninfo()
Julia Version 1.12.0-DEV.unknown
Commit 319082c (2025-01-18 07:54 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 8 × AMD Ryzen 3 5300U with Radeon Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver2)
  GC: Built with stock GC
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)

So it vectorizes, like in #55673, but it also vectorizes for lengths above 32.

@nsajko nsajko added performance Must go faster collections Data structures holding multiple items, e.g. sets labels Jan 18, 2025
Review comment on base/tuple.jl (outdated, resolved)
@nsajko nsajko force-pushed the any_all_vectorized_tuple_bool branch from 68013f3 to a6a6f6b Compare January 18, 2025 08:50
@Seelengrab commented:
Do you have some benchmarks showing the improved performance?

nsajko commented Jan 18, 2025

Benchmarking script

using BenchmarkTools

for f ∈ (all, any)
    println("f: $f")
    for b ∈ (false, true)
        println("  b: $b")
        for l ∈ 32:32:96
            println("    l: $l")
            print("    ")
            @btime ($f)(t) setup=(t = ntuple(Returns($b), $l);)
        end
    end
end

Results

master branch

Commit 1740287 (0 days old master):

f: all
  b: false
    l: 32
      13.392 ns (0 allocations: 0 bytes)
    l: 64
      13.143 ns (0 allocations: 0 bytes)
    l: 96
      18.287 ns (0 allocations: 0 bytes)
  b: true
    l: 32
      24.725 ns (0 allocations: 0 bytes)
    l: 64
      46.582 ns (0 allocations: 0 bytes)
    l: 96
      64.334 ns (0 allocations: 0 bytes)
f: any
  b: false
    l: 32
      29.442 ns (0 allocations: 0 bytes)
    l: 64
      54.097 ns (0 allocations: 0 bytes)
    l: 96
      74.817 ns (0 allocations: 0 bytes)
  b: true
    l: 32
      13.652 ns (0 allocations: 0 bytes)
    l: 64
      13.160 ns (0 allocations: 0 bytes)
    l: 96
      22.520 ns (0 allocations: 0 bytes)

PR branch

any_all_vectorized_tuple_bool/e022f12cce1 (fork: 7 commits, 0 days):

f: all
  b: false
    l: 32
      13.472 ns (0 allocations: 0 bytes)
    l: 64
      18.027 ns (0 allocations: 0 bytes)
    l: 96
      13.913 ns (0 allocations: 0 bytes)
  b: true
    l: 32
      16.995 ns (0 allocations: 0 bytes)
    l: 64
      21.898 ns (0 allocations: 0 bytes)
    l: 96
      14.746 ns (0 allocations: 0 bytes)
f: any
  b: false
    l: 32
      13.662 ns (0 allocations: 0 bytes)
    l: 64
      19.315 ns (0 allocations: 0 bytes)
    l: 96
      13.662 ns (0 allocations: 0 bytes)
  b: true
    l: 32
      18.279 ns (0 allocations: 0 bytes)
    l: 64
      13.512 ns (0 allocations: 0 bytes)
    l: 96
      24.966 ns (0 allocations: 0 bytes)

Interpretation

Speedups for the (all, true) and (any, false) cases, i.e. exactly the cases where short-circuiting can never exit early and the whole tuple must be scanned.


@nsajko nsajko force-pushed the any_all_vectorized_tuple_bool branch 2 times, most recently from 56bdef8 to cd02788 Compare January 18, 2025 12:05
@nsajko nsajko marked this pull request as draft January 19, 2025 08:43
@nsajko nsajko force-pushed the any_all_vectorized_tuple_bool branch 3 times, most recently from 0626951 to e022f12 Compare January 19, 2025 11:37
@nsajko nsajko marked this pull request as ready for review January 19, 2025 14:23
@nsajko nsajko left a comment

Be more conservative by restricting the new methods to homogeneous tuples.

Review comments on base/anyall.jl (outdated, resolved)
nsajko and others added 8 commits January 21, 2025 07:08
In particular:
* Help ensure vectorization for homogeneous tuples of `Bool`. Inspired
  by JuliaLang#55673, but more general by using a loop, thus being performant
  for any input length.
* Delete single-argument methods; instead, define methods dispatching on
  `typeof(identity)`. This makes the methods more generally useful.
* Make some optimizations previously defined only for `all` also apply
  to `any`, symmetrically.
* Delete the methods specific to the empty tuple, as they're not
  required for such calls to be foldable.

Closes JuliaLang#55673
Because the short-circuiting is promised by the docs.
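The dispatch pattern mentioned in the commit message can be sketched as follows (the function name is hypothetical and illustrative, not the PR's actual code): rather than defining a single-argument method, the fast path takes a predicate argument and dispatches on `typeof(identity)`, so it is reached both by explicit `f(identity, itr)` calls and by single-argument calls that Base forwards as `f(identity, itr)`:

```julia
# Hypothetical sketch: dispatch on typeof(identity) rather than
# defining a single-argument method. Only calls whose predicate is
# literally `identity` over a homogeneous Bool tuple take this path.
function _fast_all(::typeof(identity), itr::Tuple{Vararg{Bool}})
    r = true
    for x in itr
        r &= x  # branch-free accumulation, as in the PR's approach
    end
    return r
end

_fast_all(identity, (true, true))   # → true
_fast_all(identity, (true, false))  # → false
```

Dispatching on the predicate type keeps the optimization from interfering with calls that pass any other function.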
@nsajko nsajko force-pushed the any_all_vectorized_tuple_bool branch from 83c59d5 to 481adeb Compare January 21, 2025 06:09