Releases: MikaelSlevinsky/FastTransforms
Spherical isometries, triangular block-banded eigensolvers, generalized Zernike and Dunkl-Xu OPs on the disk
This release:
- adds support for spherical isometries (rotations and reflections and general orthogonal coordinate transformations of spherical harmonic expansions), thanks in part to @ikaroruan.
- includes an implementation of the factored alternating direction implicit method for diagonal matrix equations with low-rank right-hand sides, thanks in part to @BrockKlippenstein.
- adds methods for the complete elliptic integrals K(k) and E(k) and Jacobian elliptic functions sn(x,k), cn(x,k), and dn(x,k).
- converts associated classical orthogonal polynomial connection problems from two parameter triangular banded eigenvalue problems to quadratic triangular banded generalized eigenvalue problems.
- adds fast divide and conquer methods for block 2x2 triangular-banded generalized eigenvalue problems (for linearizations of the QEPs above).
- generalizes Zernike polynomials to a real orthonormal basis for L^2(D^2, r^(2α+1) (1-r^2)^β dr dθ), a breaking change to the API.
- adds Dunkl-Xu polynomials on the disk, a real orthonormal basis for L^2(D^2, (1-x^2-y^2)^β dx dy), updating the holomorphic.c example to show the duality of the approaches.
Support AArch64 linux
The SIMD required here is predominantly double precision, so this invokes NEON on armv8-a, which is always available. This isn't applicable for armv7, though that distinction could be made later.
Also adds missing fallbacks for AVX and AVX_FMA spin-weighted spherical harmonic computational kernels.
Cleanup Code
Allow code to be compiled by clang, make quadmath optional, sequester x86-specific code, fix miscellaneous warnings generated by clang.
Fix C -- Julia BigFloat interoperability
Julia BigFloat does not have the same size as mpfr_t.
So we give all Julia-owned data its own address, and dereference to retrieve the number.
AVX (FMA) support for spin-weighted spherical harmonics
- Fix tetrahedral drivers to use AVX (not AVX512F)
- Make docs more precise
- Docs use an independent job on Travis
Spin-weighted spherical harmonics
New features in this release:
- Support for spin-weighted spherical harmonic transforms. They are orthonormalized in L^2, with the complex Fourier series as longitudinal basis and with complex coefficients.
- Three new functions (x2 for float/double) for Horner's rule and Clenshaw's algorithm for Chebyshev series and for orthogonal polynomial series as well.
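For readers unfamiliar with the two evaluation schemes named above, here is a minimal sketch of Horner's rule for a monomial series and Clenshaw's algorithm for a Chebyshev series of the first kind. The function names `horner_eval` and `clenshaw_chebyshev_eval` and their signatures are hypothetical, chosen for illustration only; they are not the functions added in this release.

```c
#include <assert.h>

/* Horner's rule: p(x) = c[0] + c[1] x + ... + c[n-1] x^(n-1). */
static double horner_eval(int n, const double *c, double x) {
    assert(n >= 0);
    double p = 0.0;
    for (int k = n - 1; k >= 0; k--)
        p = p*x + c[k];
    return p;
}

/* Clenshaw's algorithm: f(x) = sum_{k=0}^{n-1} c[k] T_k(x), using the
   recurrence T_{k+1}(x) = 2 x T_k(x) - T_{k-1}(x) run backwards. */
static double clenshaw_chebyshev_eval(int n, const double *c, double x) {
    assert(n >= 0);
    if (n == 0) return 0.0;
    double bk1 = 0.0, bk2 = 0.0;   /* b_{k+1} and b_{k+2} */
    for (int k = n - 1; k >= 1; k--) {
        double bk = c[k] + 2.0*x*bk1 - bk2;
        bk2 = bk1;
        bk1 = bk;
    }
    return c[0] + x*bk1 - bk2;
}
```

With coefficients {1, 2, 3} and x = 0.5, the monomial series is 1 + 2x + 3x^2 and the Chebyshev series is 1 + 2 T_1(x) + 3 T_2(x), so the two functions return different values for the same coefficient array.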
Improvements in this release:
- The code is now designed with a cross-compiler in mind. For performance-critical tasks, SIMD is hidden from the user interface and instead is dispatched based on CPU ID. This allows a cross-compiler to include functions with more advanced SIMD than legal for the host computer, but a runtime check ensures that only the best SIMD level is dispatched (closes #12 and closes #41).
- The computational kernels for the spherical/triangular/disk harmonics are refactored to not only use the correct types of registers, but also help the compiler maximize throughput. This relies on a property of Givens rotations that two adjacent rotations commute if they do not act on the same rows. This property allows one to re-order the Givens rotations to increase the ratio of computation to memory loads/stores. The computational kernels and execute drivers are largely generated by a macro, which means the code may already be prepared for AVX-1024 when the instruction sets are available in GCC. Part of this is the introduction of the ft_simd struct to store a bit-field of a variety of SIMD extensions.
- The API for the computational kernels now includes transformation from orders m1 to m2 (rather than 0/1 to m), and includes a stride parameter in the data.
- The real-to-real FFTW routines now use fftw_execute_dft_r2c and fftw_execute_dft_c2r instead of FFTW_R2HC and FFTW_HC2R-type real-to-real transforms to avoid a global transpose of the data.
- The performance benchmark timings were not scaling as O(n^3) because one needs to call a function a few times, typically at least twice, before peak performance is realized. These are now updated and the macro FT_TIME helps to bring this support system-wide.
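The commuting property of Givens rotations used in the kernel refactoring above can be checked on a small example. The sketch below is illustrative only: `apply_givens` and `orders_agree` are hypothetical helpers, not the library's kernels, and the rotation angles are fixed arbitrarily. It applies two rotations to a 4-vector in both orders and compares the results.

```c
#include <assert.h>

/* Apply the plane rotation [c s; -s c] to entries i and j of x, in place. */
static void apply_givens(double c, double s, double *x, int i, int j) {
    assert(i != j);
    double xi = x[i], xj = x[j];
    x[i] = c*xi + s*xj;
    x[j] = c*xj - s*xi;
}

/* Return 1 if rotating rows (i1,j1) then (i2,j2) gives exactly the same
   vector as rotating (i2,j2) then (i1,j1), and 0 otherwise. */
static int orders_agree(const double x[4], int i1, int j1, int i2, int j2) {
    const double c1 = 0.6, s1 = 0.8;   /* first rotation  */
    const double c2 = 0.8, s2 = 0.6;   /* second rotation */
    double u[4], v[4];
    for (int k = 0; k < 4; k++) u[k] = v[k] = x[k];
    apply_givens(c1, s1, u, i1, j1);
    apply_givens(c2, s2, u, i2, j2);
    apply_givens(c2, s2, v, i2, j2);
    apply_givens(c1, s1, v, i1, j1);
    for (int k = 0; k < 4; k++)
        if (u[k] != v[k]) return 0;
    return 1;
}
```

Rotations on rows (0,1) and (2,3) touch disjoint entries, so either order gives the same vector and the loads and stores can be scheduled freely. Rotations on rows (0,1) and (1,2) share row 1 and do not commute in general.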
New examples in this release:
- spinweighted.c is a basic tutorial on how to use spin-weighted spherical harmonic transforms.
Releases no longer trigger the attachment of binaries, as compilation with -march=native may fail on a host computer.
Ofast to O3
-Ofast enables -ffast-math, which ignores strict IEEE compliance. This affects the global state for denormal numbers through DAZ (denormals-are-zero) and FTZ (flush-to-zero).
With -fno-protect-parens as well, -Ofast overly aggressively optimizes flops in definitions such as
static FLT X(diff)(FLT x, FLT y) {return x - y;}
resulting in the failure of test_tdc for the past while. It should be possible to reactivate this test now.
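To see why value-unsafe optimizations matter for code of this kind, consider an error-free addition (Knuth's two-sum). This is an illustrative sketch, not the code in test_tdc; the names `two_sum_result`, `two_sum`, and `two_sum_selfcheck` are hypothetical. Under strict IEEE arithmetic the error term recovers the rounding error of a + b exactly. With -ffast-math, the compiler is permitted to reassociate the expression, and algebraically it simplifies to zero.

```c
#include <assert.h>

typedef struct { double sum, err; } two_sum_result;

/* Knuth's two-sum: sum = fl(a + b) and err = (a + b) - sum exactly,
   provided the parenthesized subtractions are evaluated as written. */
static two_sum_result two_sum(double a, double b) {
    two_sum_result r;
    r.sum = a + b;
    double bb = r.sum - a;                       /* the part of b that made it into sum */
    r.err = (a - (r.sum - bb)) + (b - bb);       /* algebraically zero, numerically not */
    return r;
}

/* Self-check, valid when compiled without -ffast-math: 1e-17 is below half
   an ulp of 1.0, so it is lost from the sum but recovered in err. */
static void two_sum_selfcheck(void) {
    two_sum_result r = two_sum(1.0, 1e-17);
    assert(r.sum == 1.0);
    assert(r.err == 1e-17);
}
```

Compiling with -O3 rather than -Ofast keeps these subtractions intact, which is the motivation for the flag change described above.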
Update CFLAGS
Only add -march=native and friends by default. If CFLAGS is already defined, then these flags are not added.
Expand platform support
If target is not defined, gcc -dumpversion is used. This helps in a cross-compiler environment where OS may not be defined.
-march=native, -mtune=native, and -mvzeroupper are only available on x86.
More versatile make options
Make improvements (#40):
- introduce FT_LIBBLAS with a default backup ==blas; check if linking to gmp or mpir is necessary
- position of libmath
- FT_PREFIX seems simpler than FT_USE_PREDEFINED_LIBRARIES; similarly for FT_LIBBLAS => FT_BLAS