
performance improvements for any and all #57091

Open · wants to merge 8 commits into base: master
Conversation

@nsajko nsajko commented Jan 18, 2025

  • Help ensure vectorization of `any` and `all` for homogeneous tuples of `Bool` by avoiding bounds-checking and short-circuiting. Inspired by the fast methods for `any` and `all` for `Bool` tuples in #55673, but more general: it relies on a loop instead of recursion, so it is performant for any input length.
  • Delete a number of now-redundant methods, since dispatch no longer happens on `Bool` or `Missing`:
    • eight methods of all are deleted
    • four methods of any are deleted

Closes #55673
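To illustrate the approach described above, here is a minimal sketch (hypothetical names, not the PR's actual code) of a loop-based, non-short-circuiting reduction over a homogeneous `Bool` tuple. Accumulating with `&` instead of `&&` keeps the loop body branchless, which is what lets LLVM vectorize it regardless of tuple length:

```julia
# Hypothetical sketch of the technique: a branch-free loop reduction.
# Iterating the tuple directly avoids explicit indexing (and thus
# bounds checks), and `&=` avoids the short-circuit branch of `&&`.
function _tuple_all(itr::Tuple{Vararg{Bool}})
    r = true
    for x in itr
        r &= x  # non-short-circuiting: no branch in the loop body
    end
    return r
end

_tuple_all((true, true, false))        # → false
_tuple_all(ntuple(Returns(true), 64))  # → true
```

The same shape with `r |= x` and an initial value of `false` would serve for `any`.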

@nsajko nsajko changed the title performance and effects improvements for any and all performance improvements for any and all Jan 18, 2025
nsajko commented Jan 18, 2025

Here's the `code_llvm` output:

julia> code_llvm(all, Tuple{Tuple{Vararg{Bool, 32}}}; debuginfo=:none)
; Function Signature: all(NTuple{32, Bool})
define i8 @julia_all_2401(ptr nocapture noundef nonnull readonly align 1 dereferenceable(32) %"itr::Tuple") #0 {
top:
  %0 = load <32 x i8>, ptr %"itr::Tuple", align 1
  %1 = icmp eq <32 x i8> %0, zeroinitializer
  %2 = bitcast <32 x i1> %1 to i32
  %3 = icmp eq i32 %2, 0
  %4 = zext i1 %3 to i8
  ret i8 %4
}

julia> code_llvm(any, Tuple{Tuple{Vararg{Bool, 32}}}; debuginfo=:none)
; Function Signature: any(NTuple{32, Bool})
define i8 @julia_any_2403(ptr nocapture noundef nonnull readonly align 1 dereferenceable(32) %"itr::Tuple") #0 {
top:
  %0 = load <32 x i8>, ptr %"itr::Tuple", align 1
  %1 = call i8 @llvm.vector.reduce.or.v32i8(<32 x i8> %0)
  %2 = icmp ne i8 %1, 0
  %3 = zext i1 %2 to i8
  ret i8 %3
}

julia> code_llvm(all, Tuple{Tuple{Vararg{Bool, 64}}}; debuginfo=:none)
; Function Signature: all(NTuple{64, Bool})
define i8 @julia_all_2405(ptr nocapture noundef nonnull readonly align 1 dereferenceable(64) %"itr::Tuple") #0 {
top:
  %wide.load = load <32 x i8>, ptr %"itr::Tuple", align 1
  %0 = icmp ne <32 x i8> %wide.load, zeroinitializer
  %1 = getelementptr inbounds i8, ptr %"itr::Tuple", i64 32
  %wide.load.1 = load <32 x i8>, ptr %1, align 1
  %2 = icmp ne <32 x i8> %wide.load.1, zeroinitializer
  %3 = and <32 x i1> %0, %2
  %4 = bitcast <32 x i1> %3 to i32
  %5 = icmp eq i32 %4, -1
  %6 = zext i1 %5 to i8
  ret i8 %6
}

julia> code_llvm(any, Tuple{Tuple{Vararg{Bool, 64}}}; debuginfo=:none)
; Function Signature: any(NTuple{64, Bool})
define i8 @julia_any_2407(ptr nocapture noundef nonnull readonly align 1 dereferenceable(64) %"itr::Tuple") #0 {
top:
  %wide.load = load <32 x i8>, ptr %"itr::Tuple", align 1
  %0 = getelementptr inbounds i8, ptr %"itr::Tuple", i64 32
  %wide.load.1 = load <32 x i8>, ptr %0, align 1
  %1 = or <32 x i8> %wide.load, %wide.load.1
  %2 = icmp ne <32 x i8> %1, zeroinitializer
  %3 = bitcast <32 x i1> %2 to i32
  %4 = icmp ne i32 %3, 0
  %5 = zext i1 %4 to i8
  ret i8 %5
}

julia> versioninfo()
Julia Version 1.12.0-DEV.unknown
Commit 319082c (2025-01-18 07:54 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 8 × AMD Ryzen 3 5300U with Radeon Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver2)
  GC: Built with stock GC
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)

So it vectorizes, like in #55673, but it also vectorizes for lengths above 32.

@nsajko nsajko added performance Must go faster collections Data structures holding multiple items, e.g. sets labels Jan 18, 2025
Review comment on base/tuple.jl (outdated, resolved)
@nsajko nsajko force-pushed the any_all_vectorized_tuple_bool branch from 68013f3 to a6a6f6b Compare January 18, 2025 08:50
@Seelengrab commented:
Do you have some benchmarks showing the improved performance?

nsajko commented Jan 18, 2025

Benchmarking script

using BenchmarkTools

for f ∈ (all, any)
    println("f: $f")
    for b ∈ (false, true)
        println("  b: $b")
        for l ∈ 32:32:96
            println("    l: $l")
            print("    ")
            @btime ($f)(t) setup=(t = ntuple(Returns($b), $l);)
        end
    end
end

Results

master branch

Commit 1740287 (0 days old master):

f: all
  b: false
    l: 32
      13.392 ns (0 allocations: 0 bytes)
    l: 64
      13.143 ns (0 allocations: 0 bytes)
    l: 96
      18.287 ns (0 allocations: 0 bytes)
  b: true
    l: 32
      24.725 ns (0 allocations: 0 bytes)
    l: 64
      46.582 ns (0 allocations: 0 bytes)
    l: 96
      64.334 ns (0 allocations: 0 bytes)
f: any
  b: false
    l: 32
      29.442 ns (0 allocations: 0 bytes)
    l: 64
      54.097 ns (0 allocations: 0 bytes)
    l: 96
      74.817 ns (0 allocations: 0 bytes)
  b: true
    l: 32
      13.652 ns (0 allocations: 0 bytes)
    l: 64
      13.160 ns (0 allocations: 0 bytes)
    l: 96
      22.520 ns (0 allocations: 0 bytes)

PR branch

any_all_vectorized_tuple_bool/e022f12cce1 (fork: 7 commits, 0 days):

f: all
  b: false
    l: 32
      13.472 ns (0 allocations: 0 bytes)
    l: 64
      18.027 ns (0 allocations: 0 bytes)
    l: 96
      13.913 ns (0 allocations: 0 bytes)
  b: true
    l: 32
      16.995 ns (0 allocations: 0 bytes)
    l: 64
      21.898 ns (0 allocations: 0 bytes)
    l: 96
      14.746 ns (0 allocations: 0 bytes)
f: any
  b: false
    l: 32
      13.662 ns (0 allocations: 0 bytes)
    l: 64
      19.315 ns (0 allocations: 0 bytes)
    l: 96
      13.662 ns (0 allocations: 0 bytes)
  b: true
    l: 32
      18.279 ns (0 allocations: 0 bytes)
    l: 64
      13.512 ns (0 allocations: 0 bytes)
    l: 96
      24.966 ns (0 allocations: 0 bytes)

Interpretation

Speedups for the (all, true) and (any, false) cases, i.e. exactly the cases where short-circuiting can never exit early and the whole tuple must be scanned.


@nsajko nsajko force-pushed the any_all_vectorized_tuple_bool branch 2 times, most recently from 56bdef8 to cd02788 Compare January 18, 2025 12:05
@nsajko nsajko marked this pull request as draft January 19, 2025 08:43
@nsajko nsajko force-pushed the any_all_vectorized_tuple_bool branch 3 times, most recently from 0626951 to e022f12 Compare January 19, 2025 11:37
@nsajko nsajko marked this pull request as ready for review January 19, 2025 14:23
@nsajko nsajko left a comment

Be more conservative by restricting the new methods to homogeneous tuples.

Review comments on base/anyall.jl (outdated, resolved)
nsajko and others added 8 commits January 21, 2025 07:08
In particular:
* Help ensure vectorization for homogeneous tuples of `Bool`. Inspired
  by JuliaLang#55673, but more general by using a loop, thus being performant
  for any input length.
* Delete single-argument methods; instead, define methods dispatching on
  `typeof(identity)`. This makes the methods more generally useful.
* Make some optimizations previously defined only for `all` also apply
  to `any`, symmetrically.
* Delete the methods specific to the empty tuple, as they're not
  required for such calls to be foldable.

Closes JuliaLang#55673
Because the short-circuiting is promised by the docs.
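The dispatch pattern mentioned in the commit message can be sketched as follows (the function name is hypothetical and illustrative, not the PR's actual code): rather than defining a single-argument method, the fast path takes a predicate argument and dispatches on `typeof(identity)`, so it is reached both by explicit `f(identity, itr)` calls and by single-argument calls that Base forwards as `f(identity, itr)`:

```julia
# Hypothetical sketch: dispatch on typeof(identity) rather than
# defining a single-argument method. Only calls whose predicate is
# literally `identity` over a homogeneous Bool tuple take this path.
function _fast_all(::typeof(identity), itr::Tuple{Vararg{Bool}})
    r = true
    for x in itr
        r &= x  # branch-free accumulation, as in the PR's approach
    end
    return r
end

_fast_all(identity, (true, true))   # → true
_fast_all(identity, (true, false))  # → false
```

Dispatching on the predicate type keeps the optimization from interfering with calls that pass any other function.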
@nsajko nsajko force-pushed the any_all_vectorized_tuple_bool branch from 83c59d5 to 481adeb Compare January 21, 2025 06:09