
Use LAPACK for polynomials ≤ degree ~450 (real), ~250 (complex)? #3

dlfivefifty opened this issue Jan 8, 2015 · 11 comments

@dlfivefifty

In my tests, LAPACK's zhseqr_/dhseqr_ is faster unless the polynomial degree is high, using my wrapper

https://github.com/ApproxFun/ApproxFun.jl/blob/master/src/LinearAlgebra/hesseneigs.jl

Maybe it's worth wrapping these in AMVW for such cases? Unless I'm missing something.

@andreasnoack
Owner

I think it depends on the scope of this package. For a roots function in a general polynomial package, I completely agree with you, but this package was mainly meant as a wrapper for the code in the paper. I think the best thing would be to let the Polynomials package depend on this package and use the AMVW solver when the degree is very high.
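A rough sketch of what such a degree-based dispatch could look like downstream (the cutoff comes from the issue title and AMVW.rootsAMVW is the solver in this package, but the wrapper function itself and its companion-matrix branch are only an illustration, not an existing API):

const AMVW_CUTOFF = 450   # ~450 for real coefficients, per the issue title

function roots_switch(c::Vector{Float64})
    n = length(c) - 1                  # polynomial degree
    if n <= AMVW_CUTOFF
        # small/medium degree: eigenvalues of the companion matrix via LAPACK
        A = zeros(n, n)
        A[:, end] = -c[1:n]/c[end]
        for k = 2:n
            A[k, k-1] = 1.0
        end
        eigvals(A)
    else
        AMVW.rootsAMVW(c)              # high degree: the AMVW solver
    end
end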

@thomasmach

I am confused by this observation. In my tests comparing the Fortran code of AMVW with Fortran LAPACK, the LAPACK code is only faster for degrees smaller than 20.
Why do the tests in Fortran and in Julia differ so much? And do your LAPACK timings include the time for assembling the matrix?

@dlfivefifty
Author

Hmm, I reran the tests and am getting very different numbers; now the critical point seems to be about 75. I suspect my laptop was in low-battery mode yesterday, which can be unpredictable. The LAPACK routine also has a strange jump in time at n ~ 75, where it goes from being about 30% faster to 30% slower. I suspect this is an OpenBLAS quirk where it switches modes too early, but setting blas_set_num_threads(1) didn't seem to have an effect.

gc_disable()
p = [1.0:80]          # coefficients 1.0, 2.0, ..., 80.0, i.e. a degree-79 polynomial

@time for k = 1:30
    rootsjul(p)
end
@time for k = 1:30
    rootshess(p)
end
@time for k = 1:30
    AMVW.rootsAMVW(p)
end

import Base.BLAS: blas_int, BlasChar, BlasInt

# Wrapper around LAPACK's zhseqr_: eigenvalues of a complex Hessenberg matrix
for (hseqr, elty) in ((:zhseqr_, :Complex128),)
    @eval function hesseneigvals(M::Matrix{$elty})
        A = vec(M)

        N = size(M, 1)
        info = blas_int(0)

        ilo = blas_int(1); ihi = blas_int(N); ldh = blas_int(N); ldz = blas_int(N); lwork = blas_int(N)

        z = zero($elty)
        work = Array($elty, N*N)
        w = Array($elty, N)

        Ec = 'E'   # eigenvalues only
        Nc = 'N'   # no Schur vectors
        ccall(($(BLAS.blasfunc(hseqr)), LAPACK.liblapack),
              Void,
              (Ptr{BlasChar}, Ptr{BlasChar},
               Ptr{BlasInt}, Ptr{BlasInt}, Ptr{BlasInt}, Ptr{$elty},  # A
               Ptr{BlasInt}, Ptr{$elty}, Ptr{$elty},                  # z
               Ptr{BlasInt}, Ptr{$elty}, Ptr{BlasInt}, Ptr{BlasInt}),
              &Ec, &Nc, &N, &ilo, &ihi, A, &ldh, w, &z, &ldz, work, &lwork, &info)
        w
    end
end

# Wrapper around LAPACK's dhseqr_: eigenvalues of a real Hessenberg matrix,
# with a workspace query (lwork = -1) on the first pass of the loop
for (hseqr, elty) in ((:dhseqr_, :Float64),)
    @eval function hesseneigvals(M::Matrix{$elty})
        A = vec(M)

        N = size(M, 1)
        info = blas_int(0)

        ilo = blas_int(1); ihi = blas_int(N); ldh = blas_int(N); ldz = blas_int(N)

        lwork = blas_int(-1)

        z = zero($elty)
        work = Array($elty, 1)
        wr = Array($elty, N)   # real parts of the eigenvalues
        wi = Array($elty, N)   # imaginary parts of the eigenvalues

        Ec = 'E'   # eigenvalues only
        Nc = 'N'   # no Schur vectors
        for i = 1:2
            ccall(($(BLAS.blasfunc(hseqr)), LAPACK.liblapack),
                  Void,
                  (Ptr{BlasChar}, Ptr{BlasChar},
                   Ptr{BlasInt}, Ptr{BlasInt}, Ptr{BlasInt}, Ptr{$elty},  # A
                   Ptr{BlasInt}, Ptr{$elty}, Ptr{$elty}, Ptr{$elty},      # z
                   Ptr{BlasInt}, Ptr{$elty}, Ptr{BlasInt}, Ptr{BlasInt}),
                  &Ec, &Nc, &N, &ilo, &ihi, A, &ldh, wr, wi, &z, &ldz, work, &lwork, &info)

            if lwork < 0
                # first pass was the workspace query; allocate work and run for real
                lwork = blas_int(real(work[1]))
                work = Array($elty, lwork)
            end
        end

        wr + im*wi
    end
end

# Companion matrix of the polynomial with coefficients c (constant term first):
# ones on the subdiagonal and -c[1:n]/c[end] in the last column
function companion_matrix{T}(c::Vector{T})
    n = length(c) - 1

    if n == 0
        T[]
    else
        A = zeros(T, n, n)
        @simd for k = 1:n
            @inbounds A[k, end] = -c[k]/c[end]
        end
        @simd for k = 2:n
            @inbounds A[k, k-1] = one(T)
        end
        A
    end
end

rootshess(c) = hesseneigvals(companion_matrix(c))   # LAPACK ?hseqr on the companion matrix
rootsjul(c)  = eigvals(companion_matrix(c))         # Julia's generic eigvals


@andreasnoack
Owner

I just tried this on my machine and I also get a crossover at 75. This was the same for OpenBLAS and MKL (which is not surprising because the QR algorithm is not that BLAS heavy).
[plot: roottimings]

The dhseqr routine is pretty wild, and it appears to change algorithm at exactly n=75; see the "further details" section of its documentation. From the graph I conjecture that the crossover point could be moved quite a bit (log scale) to the right if the point for the algorithm change is modified.
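A minimal timing sketch, reusing the rootshess wrapper posted above, to look for the jump around n ~ 75 (the degree range and repetition count are arbitrary choices, written in the same old-Julia style as the snippets above):

for d in 65:5:85
    p = [1.0:(d + 1)]                 # d+1 coefficients => degree-d polynomial
    t = @elapsed for k = 1:30
        rootshess(p)
    end
    println("degree ", d, ": ", t, " s")
end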

@andreasnoack
Owner

I've tried changing the default in LAPACK's dhseqr so that it uses the small-problem algorithm up to size 500. This gives the following plot

[plot: roottimings2]

and the crossover is somewhere around 120-130.

@thomasmach The timings include the construction of the companion matrix.

@alanedelman
Collaborator

Encouragement: glad you guys are looking into this.
While you're at it, maybe check whether the answers agree as well as the numerics would lead one to expect?

@thomasmach

For some reason LAPACK is relatively faster here, by about a factor of two, than in my Fortran tests of AMVW. I don't know why and will have a look into this next week.

@andreasnoack
Owner

The scaling coefficients of the last plots (the fitted slopes on the log-log axes) are

  • AMVW: 1.85
  • Dense: 2.34
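A sketch of how such a scaling coefficient can be read off a timing experiment: it is the slope of a least-squares line through (log n, log t). The timings below are made up purely to illustrate the fit, not measured data.

n = [100, 200, 400, 800, 1600]
t = [0.012, 0.044, 0.16, 0.59, 2.2]      # hypothetical seconds per solve
A = [ones(length(n)) map(log, n)]        # design matrix [1  log n]
coef = A \ map(log, t)                   # least-squares fit of log t ~ a + b*log n
slope = coef[2]                          # b is the scaling coefficient
println("fitted scaling coefficient: ", slope)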

@thomasmach

@andreasnoack Does your last comment mean that dhseqr is not in O(n^3)?

For the complex case the crossover point is much smaller, since AMVW's complex single-shift version is only slightly slower, while zhseqr is almost 3 times slower than dhseqr.

@andreasnoack
Owner

@thomasmach No, I don't think so. The asymptotic behavior just takes a little longer to kick in. There is a significant second-order effect in the fit of the dhseqr line, and if I extend the degree of the polynomial, the slope approaches 3.

You are also completely right about the complex case. The plot for the complex case looks like this

[plot: roottimings3]

and the crossover is at 40. Why is your real solver relatively slower than LAPACK's?

@thomasmach

Why is the complex code of AMVW almost as fast as the real code?

  1. LAPACK's complex arithmetic is about twice as expensive as the real arithmetic. We, however, use complex rotations represented by 3 reals and real rotations represented by 2 reals. Thus, in our case the complex arithmetic is, very roughly speaking, only about 1.5 times as expensive as the real arithmetic.
  2. The real double-shift code requires 7 turnovers per step, compared to 3 per step of the complex single-shift code; since one double-shift step does the work of two single-shift steps, this amounts to roughly 1/7 more arithmetic operations.

Additionally, the real double-shift code needs slightly more iterations to converge.
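An illustrative sketch of the storage difference described in point 1 (these are not the package's actual data structures, just the idea, in old-Julia syntax):

immutable RealRotation          # 2 reals: G = [c s; -s c]
    c::Float64
    s::Float64
end

immutable ComplexRotation       # 3 reals: G = [cr+im*ci  s; -s  cr-im*ci]
    cr::Float64                 # real part of the cosine
    ci::Float64                 # imaginary part of the cosine
    s::Float64                  # the sine can be kept real
end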
