performance Dense layer on CPU

This is just to track the performance of the Dense layer on CPU. I use the following script:

```julia
using BenchmarkTools, Flux
using Zygote: pullback

using LinearAlgebra
BLAS.set_num_threads(1)

function perf_test(n)
    r = rand(Float32, n, n)   # random n×n input batch
    d = Dense(n, n, relu)     # the activation belongs to the layer, not to rand
    println("  FORW")
    @btime sum($d($r))
    println("  GRADIENT")
    @btime gradient(() -> sum($d($r)), $(Flux.params(d)))
    @btime gradient((d) -> sum(d($r)), $d)
    println("  PULLBACK")
    y, back = pullback((d) -> sum(d(r)), d)
    @btime pullback((d) -> sum(d($r)), $d)
    @btime $back(1f0)
end

println("SMALL NET n=2")
perf_test(2)
println("MEDIUM NET n=20")
perf_test(20)
println("LARGE NET n=200")
perf_test(200)
println("VERY LARGE NET n=2000")
perf_test(2000)
```
and on my system:
```julia
julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1* (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-10.0.1 (ORCJIT, skylake)

(Flux) pkg> st
Project Flux v0.12.0-dev
Status `~/.julia/dev/Flux/Project.toml`
  [1520ce14] AbstractTrees v0.3.3
  [79e6a3ab] Adapt v2.3.0
  [052768ef] CUDA v2.3.0
  [944b1d66] CodecZlib v0.7.0
  [5ae59095] Colors v0.12.4
  [d9f16b24] Functors v0.1.0
  [e5e0dc1b] Juno v0.8.4
  [1914dd2f] MacroTools v0.5.6
  [872c559c] NNlib v0.7.7
  [189a3867] Reexport v0.2.0
  [2913bbd2] StatsBase v0.33.2
  [a5390f91] ZipFile v0.9.3
  [e88e6eb3] Zygote v0.5.15
  [8bb1440f] DelimitedFiles
  [37e2e46d] LinearAlgebra
  [44cfe95a] Pkg
  [de0858da] Printf
  [9a3f8284] Random
  [ea8e919c] SHA
  [10745b16] Statistics
  [8dfed614] Test
```
I obtain the following output:
```julia
SMALL NET n=2
  FORW
  99.930 ns (2 allocations: 192 bytes)
  GRADIENT
  2.096 μs (40 allocations: 2.92 KiB)
  1.045 μs (31 allocations: 1.77 KiB)
  PULLBACK
  167.077 ns (5 allocations: 512 bytes)
  814.164 ns (24 allocations: 928 bytes)
MEDIUM NET n=20
  FORW
  1.049 μs (2 allocations: 3.53 KiB)
  GRADIENT
  5.334 μs (38 allocations: 12.95 KiB)
  4.222 μs (31 allocations: 11.86 KiB)
  PULLBACK
  1.383 μs (5 allocations: 5.52 KiB)
  2.747 μs (24 allocations: 5.98 KiB)
LARGE NET n=200
  FORW
  205.632 μs (4 allocations: 312.66 KiB)
  GRADIENT
  643.443 μs (44 allocations: 941.05 KiB)
  626.491 μs (37 allocations: 939.95 KiB)
  PULLBACK
  219.883 μs (8 allocations: 469.20 KiB)
  405.049 μs (27 allocations: 470.39 KiB)
VERY LARGE NET n=2000
  FORW
  214.841 ms (4 allocations: 30.52 MiB)
  GRADIENT
  637.410 ms (44 allocations: 91.56 MiB)
  637.142 ms (37 allocations: 91.56 MiB)
  PULLBACK
  217.240 ms (8 allocations: 45.78 MiB)
  418.468 ms (27 allocations: 45.78 MiB)
```

Some observations:
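- the expected O(n^3) asymptotic scaling only kicks in at the largest sizes: going from n=200 to n=2000 the forward time grows by ~1000x, while from n=20 to n=200 it grows by only ~200x
- calling `back` is ~2x slower than the forward pass at n=200 and n=2000, and even slower at small sizes (~2.6x at n=20, ~8x at n=2)
- for the smallest networks, there is a significant speed difference between the two `gradient` calling styles: at n=2 the implicit `Flux.params` form takes 2.096 μs versus 1.045 μs for the explicit-argument form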
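
The first two observations can be checked directly from the numbers above. The snippet below is only a rough sketch: the timings are copied by hand from the output in this issue (forward from the `FORW` lines, backward from the `$back(1f0)` lines), and the variable names are mine. It estimates a local scaling exponent between consecutive sizes, assuming the time behaves like n^p, and the backward/forward ratio at each size.

```julia
# Timings copied from the benchmark output above, converted to seconds.
sizes    = [2, 20, 200, 2000]
forward  = [99.930e-9, 1.049e-6, 205.632e-6, 214.841e-3]   # FORW lines
backward = [814.164e-9, 2.747e-6, 405.049e-6, 418.468e-3]  # $back(1f0) lines

# Local scaling exponent between consecutive sizes:
# if t ~ n^p, then p = log(t2 / t1) / log(n2 / n1).
exponents = [log(forward[i+1] / forward[i]) / log(sizes[i+1] / sizes[i])
             for i in 1:length(sizes)-1]

# Cost of the backward call relative to the forward pass at each size.
ratios = backward ./ forward

for i in eachindex(exponents)
    println("n = ", sizes[i], " -> ", sizes[i+1],
            ": exponent ≈ ", round(exponents[i], digits=2))
end
for i in eachindex(sizes)
    println("n = ", sizes[i], ": backward/forward ≈ ", round(ratios[i], digits=2))
end
```

With these inputs the exponent comes out close to 1 for the smallest step, a bit above 2 for the middle step, and close to 3 only for the last step, which is what I would expect if fixed overhead dominates at small n and the matrix multiplication dominates at large n.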
