
Use LAPACK for polynomials ≤ degree ~450 (real), ~250 (complex)? #3

dlfivefifty opened this issue Jan 8, 2015 · 11 comments

@dlfivefifty

In my tests, LAPACK's zhseqr_/dhseqr_ is faster unless the polynomial degree is high, using my wrapper

https://github.com/ApproxFun/ApproxFun.jl/blob/master/src/LinearAlgebra/hesseneigs.jl

Maybe it's worth wrapping these in AMVW for such cases? Unless I'm missing something.

@andreasnoack
Owner

I think it depends on the scope of this package. For a roots function in a general polynomial package, I completely agree with you, but this package was mainly meant as a wrapper for the code in the paper. I think the best thing would be to let the Polynomials package depend on this package and use the AMVW solver when the degree is very high.
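A rough sketch of what such a degree-based dispatch could look like downstream (the cutoff comes from the issue title and AMVW.rootsAMVW is the solver in this package, but the wrapper function itself and its companion-matrix branch are only an illustration, not an existing API):

const AMVW_CUTOFF = 450   # ~450 for real coefficients, per the issue title

function roots_switch(c::Vector{Float64})
    n = length(c) - 1                  # polynomial degree
    if n <= AMVW_CUTOFF
        # small/medium degree: eigenvalues of the companion matrix via LAPACK
        A = zeros(n, n)
        A[:, end] = -c[1:n]/c[end]
        for k = 2:n
            A[k, k-1] = 1.0
        end
        eigvals(A)
    else
        AMVW.rootsAMVW(c)              # high degree: the AMVW solver
    end
end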

@thomasmach

I am confused by this observation. In my tests comparing the Fortran code of AMVW with Fortran LAPACK, the LAPACK code is only faster for degrees smaller than 20.
Why do the tests in Fortran and in Julia differ so much? And do your LAPACK timings include the time for assembling the matrix?

@dlfivefifty
Author

Hmm, I reran the tests and am getting very different numbers; now the critical point seems to be about 75. I suspect my laptop was in low-battery mode yesterday, which can be unpredictable. The LAPACK routine also has a strange jump in time at n ~ 75, where it goes from being about 30% faster to 30% slower. I suspect this is an OpenBLAS quirk where it switches modes too early, but setting blas_set_num_threads(1) didn't seem to have an effect.

gc_disable()
p = [1.0:80]          # coefficients 1.0, 2.0, ..., 80.0, i.e. a degree-79 polynomial

@time for k = 1:30
    rootsjul(p)
end
@time for k = 1:30
    rootshess(p)
end
@time for k = 1:30
    AMVW.rootsAMVW(p)
end

import Base.BLAS: blas_int, BlasChar, BlasInt

# Wrapper around LAPACK's zhseqr_: eigenvalues of a complex Hessenberg matrix
for (hseqr, elty) in ((:zhseqr_, :Complex128),)
    @eval function hesseneigvals(M::Matrix{$elty})
        A = vec(M)

        N = size(M, 1)
        info = blas_int(0)

        ilo = blas_int(1); ihi = blas_int(N); ldh = blas_int(N); ldz = blas_int(N); lwork = blas_int(N)

        z = zero($elty)
        work = Array($elty, N*N)
        w = Array($elty, N)

        Ec = 'E'   # eigenvalues only
        Nc = 'N'   # no Schur vectors
        ccall(($(BLAS.blasfunc(hseqr)), LAPACK.liblapack),
              Void,
              (Ptr{BlasChar}, Ptr{BlasChar},
               Ptr{BlasInt}, Ptr{BlasInt}, Ptr{BlasInt}, Ptr{$elty},  # A
               Ptr{BlasInt}, Ptr{$elty}, Ptr{$elty},                  # z
               Ptr{BlasInt}, Ptr{$elty}, Ptr{BlasInt}, Ptr{BlasInt}),
              &Ec, &Nc, &N, &ilo, &ihi, A, &ldh, w, &z, &ldz, work, &lwork, &info)
        w
    end
end

# Wrapper around LAPACK's dhseqr_: eigenvalues of a real Hessenberg matrix,
# with a workspace query (lwork = -1) on the first pass of the loop
for (hseqr, elty) in ((:dhseqr_, :Float64),)
    @eval function hesseneigvals(M::Matrix{$elty})
        A = vec(M)

        N = size(M, 1)
        info = blas_int(0)

        ilo = blas_int(1); ihi = blas_int(N); ldh = blas_int(N); ldz = blas_int(N)

        lwork = blas_int(-1)

        z = zero($elty)
        work = Array($elty, 1)
        wr = Array($elty, N)   # real parts of the eigenvalues
        wi = Array($elty, N)   # imaginary parts of the eigenvalues

        Ec = 'E'   # eigenvalues only
        Nc = 'N'   # no Schur vectors
        for i = 1:2
            ccall(($(BLAS.blasfunc(hseqr)), LAPACK.liblapack),
                  Void,
                  (Ptr{BlasChar}, Ptr{BlasChar},
                   Ptr{BlasInt}, Ptr{BlasInt}, Ptr{BlasInt}, Ptr{$elty},  # A
                   Ptr{BlasInt}, Ptr{$elty}, Ptr{$elty}, Ptr{$elty},      # z
                   Ptr{BlasInt}, Ptr{$elty}, Ptr{BlasInt}, Ptr{BlasInt}),
                  &Ec, &Nc, &N, &ilo, &ihi, A, &ldh, wr, wi, &z, &ldz, work, &lwork, &info)

            if lwork < 0
                # first pass was the workspace query; allocate work and run for real
                lwork = blas_int(real(work[1]))
                work = Array($elty, lwork)
            end
        end

        wr + im*wi
    end
end

# Companion matrix of the polynomial with coefficients c (constant term first):
# ones on the subdiagonal and -c[1:n]/c[end] in the last column
function companion_matrix{T}(c::Vector{T})
    n = length(c) - 1

    if n == 0
        T[]
    else
        A = zeros(T, n, n)
        @simd for k = 1:n
            @inbounds A[k, end] = -c[k]/c[end]
        end
        @simd for k = 2:n
            @inbounds A[k, k-1] = one(T)
        end
        A
    end
end

rootshess(c) = hesseneigvals(companion_matrix(c))   # LAPACK ?hseqr on the companion matrix
rootsjul(c)  = eigvals(companion_matrix(c))         # Julia's generic eigvals


@andreasnoack
Owner

I just tried this on my machine and I also get a crossover at 75. This was the same for OpenBLAS and MKL (which is not surprising because the QR algorithm is not that BLAS heavy).
[plot: roottimings]

The dhseqr routine is pretty wild, and it appears to change algorithm at exactly n=75; see the "further details" section of its documentation. From the graph I conjecture that the crossover point could be moved quite a bit (log scale) to the right if the point for the algorithm change is modified.
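A minimal timing sketch, reusing the rootshess wrapper posted above, to look for the jump around n ~ 75 (the degree range and repetition count are arbitrary choices, written in the same old-Julia style as the snippets above):

for d in 65:5:85
    p = [1.0:(d + 1)]                 # d+1 coefficients => degree-d polynomial
    t = @elapsed for k = 1:30
        rootshess(p)
    end
    println("degree ", d, ": ", t, " s")
end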

@andreasnoack
Owner

I've tried changing the default in LAPACK's dhseqr so that it uses the small-problem algorithm up to size 500. This gives the following plot

[plot: roottimings2]

and the crossover is somewhere around 120-130.

@thomasmach The timings include the construction of the companion matrix.

@alanedelman
Collaborator

Encouragement: glad you guys are looking into this.
While you're at it, maybe check whether the answers agree as well as the numerics would lead one to expect?

@thomasmach

For some reason LAPACK is relatively faster here, by about a factor of two, than in my Fortran tests of AMVW. I don't know why and will have a look into this next week.

@andreasnoack
Owner

The scaling coefficients of the last plots (the fitted slopes on the log-log axes) are

  • AMVW: 1.85
  • Dense: 2.34
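A sketch of how such a scaling coefficient can be read off a timing experiment: it is the slope of a least-squares line through (log n, log t). The timings below are made up purely to illustrate the fit, not measured data.

n = [100, 200, 400, 800, 1600]
t = [0.012, 0.044, 0.16, 0.59, 2.2]      # hypothetical seconds per solve
A = [ones(length(n)) map(log, n)]        # design matrix [1  log n]
coef = A \ map(log, t)                   # least-squares fit of log t ~ a + b*log n
slope = coef[2]                          # b is the scaling coefficient
println("fitted scaling coefficient: ", slope)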

@thomasmach

@andreasnoack Does your last comment mean that dhseqr is not in O(n^3)?

For the complex case the crossover point is much smaller, since AMVW's complex single-shift version is only slightly slower, while zhseqr is almost 3 times slower than dhseqr.

@andreasnoack
Owner

@thomasmach No, I don't think so. The asymptotic behavior just takes a little longer to kick in. There is a significant second-order effect in the fit of the dhseqr line, and if I extend the degree of the polynomial, the slope approaches 3.

You are also completely right about the complex case. The plot for the complex case looks like this

[plot: roottimings3]

and the crossover is at 40. Why is your real solver relatively slower than LAPACK's?

@thomasmach

Why is the complex code of AMVW almost as fast as the real code?

  1. LAPACK's complex arithmetic is about twice as expensive as the real arithmetic. We, however, use complex rotations represented by 3 reals and real rotations represented by 2 reals. Thus, in our case the complex arithmetic is, very roughly speaking, only about 1.5 times as expensive as the real arithmetic.
  2. The real double-shift code requires 7 turnovers per step, compared to 3 per step of the complex single-shift code; since one double-shift step does the work of two single-shift steps, this amounts to roughly 1/7 more arithmetic operations.

Additionally, the real double-shift code needs slightly more iterations to converge.
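An illustrative sketch of the storage difference described in point 1 (these are not the package's actual data structures, just the idea, in old-Julia syntax):

immutable RealRotation          # 2 reals: G = [c s; -s c]
    c::Float64
    s::Float64
end

immutable ComplexRotation       # 3 reals: G = [cr+im*ci  s; -s  cr-im*ci]
    cr::Float64                 # real part of the cosine
    ci::Float64                 # imaginary part of the cosine
    s::Float64                  # the sine can be kept real
end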
