This is just to track the performance of the `Dense` layer on CPU. I use the following script:

```julia
using BenchmarkTools, Flux
using Zygote: pullback
using LinearAlgebra

BLAS.set_num_threads(1)

function perf_test(n)
    r = rand(Float32, n, n)
    d = Dense(n, n, relu)
    println(" FORW")
    @btime sum($d($r))
    println(" GRADIENT")
    @btime gradient(() -> sum($d($r)), $(Flux.params(d)))
    @btime gradient((d) -> sum(d($r)), $d)
    println(" PULLBACK")
    y, back = pullback((d) -> sum(d(r)), d)
    @btime pullback((d) -> sum(d($r)), $d)
    @btime $back(1f0)
end

println("SMALL NET n=2")
perf_test(2)
println("MEDIUM NET n=20")
perf_test(20)
println("LARGE NET n=200")
perf_test(200)
println("VERY LARGE NET n=2000")
perf_test(2000)
```
and on my system:
```
julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1* (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-10.0.1 (ORCJIT, skylake)
```
```
(Flux) pkg> st
Project Flux v0.12.0-dev
Status `~/.julia/dev/Flux/Project.toml`
  [1520ce14] AbstractTrees v0.3.3
  [79e6a3ab] Adapt v2.3.0
  [052768ef] CUDA v2.3.0
  [944b1d66] CodecZlib v0.7.0
  [5ae59095] Colors v0.12.4
  [d9f16b24] Functors v0.1.0
  [e5e0dc1b] Juno v0.8.4
  [1914dd2f] MacroTools v0.5.6
  [872c559c] NNlib v0.7.7
  [189a3867] Reexport v0.2.0
  [2913bbd2] StatsBase v0.33.2
  [a5390f91] ZipFile v0.9.3
  [e88e6eb3] Zygote v0.5.15
  [8bb1440f] DelimitedFiles
  [37e2e46d] LinearAlgebra
  [44cfe95a] Pkg
  [de0858da] Printf
  [9a3f8284] Random
  [ea8e919c] SHA
  [10745b16] Statistics
  [8dfed614] Test
```
I obtain the following output:

```
SMALL NET n=2
 FORW
  99.930 ns (2 allocations: 192 bytes)
 GRADIENT
  2.096 μs (40 allocations: 2.92 KiB)
  1.045 μs (31 allocations: 1.77 KiB)
 PULLBACK
  167.077 ns (5 allocations: 512 bytes)
  814.164 ns (24 allocations: 928 bytes)
MEDIUM NET n=20
 FORW
  1.049 μs (2 allocations: 3.53 KiB)
 GRADIENT
  5.334 μs (38 allocations: 12.95 KiB)
  4.222 μs (31 allocations: 11.86 KiB)
 PULLBACK
  1.383 μs (5 allocations: 5.52 KiB)
  2.747 μs (24 allocations: 5.98 KiB)
LARGE NET n=200
 FORW
  205.632 μs (4 allocations: 312.66 KiB)
 GRADIENT
  643.443 μs (44 allocations: 941.05 KiB)
  626.491 μs (37 allocations: 939.95 KiB)
 PULLBACK
  219.883 μs (8 allocations: 469.20 KiB)
  405.049 μs (27 allocations: 470.39 KiB)
VERY LARGE NET n=2000
 FORW
  214.841 ms (4 allocations: 30.52 MiB)
 GRADIENT
  637.410 ms (44 allocations: 91.56 MiB)
  637.142 ms (37 allocations: 91.56 MiB)
 PULLBACK
  217.240 ms (8 allocations: 45.78 MiB)
  418.468 ms (27 allocations: 45.78 MiB)
```
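Since the `Dense` forward pass is essentially one matmul plus a broadcast, a bare BLAS matmul makes a useful reference point for reading these numbers (a minimal sketch; `matmul_baseline` is a hypothetical helper, not part of the script above):

```julia
using BenchmarkTools, LinearAlgebra

BLAS.set_num_threads(1)

# Hypothetical helper: time the bare n×n Float32 matmul that dominates
# Dense's forward pass, to separate BLAS cost from framework overhead.
function matmul_baseline(n)
    W = rand(Float32, n, n)
    r = rand(Float32, n, n)
    @btime $W * $r
end

matmul_baseline(200)
```

Comparing this against the FORW line at the same `n` shows how much of the forward time is pure BLAS.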
Some observations:
- the expected O(n^3) asymptotic scaling only kicks in at the largest sizes
- the pullback is ~2x slower than the forward pass (and even slower, relatively, at very small sizes)
- for the smallest networks, there is a significant speed difference between the two `gradient` calling styles
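For reference, the two `gradient` calling styles timed in the script are the implicit-parameters style and the explicit style (a minimal sketch of both):

```julia
using Flux

d = Dense(2, 2, relu)
r = rand(Float32, 2, 2)

# Implicit style: gradients come back in a dict-like Grads object,
# keyed by the parameter arrays collected in Flux.params(d).
ps = Flux.params(d)
gs_implicit = gradient(() -> sum(d(r)), ps)

# Explicit style: differentiate with respect to the layer itself;
# Zygote returns a NamedTuple mirroring the layer's fields.
gs_explicit, = gradient(d -> sum(d(r)), d)
```

The implicit style routes gradients through Zygote's global implicit-parameter machinery, which is a plausible source of the extra overhead visible at small sizes; this is an interpretation, not something the benchmark itself isolates.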