
implement multidimensional matrix multiply which supports blas and parallelization #929

Closed
SparrowLii wants to merge 2 commits

Conversation

@SparrowLii (Contributor) commented Feb 24, 2021

Fixes #16
Fixes #886
Updates #678
This follows the implementation of numpy.dot. The general rules are as follows:
If both a and b are 1-D arrays, it is inner product of vectors (without complex conjugation).
If both a and b are 2-D arrays, it is matrix multiplication, but using matmul or a @ b is preferred.
If either a or b is 0-D (scalar), it is equivalent to multiply and using numpy.multiply(a, b) or a * b is preferred.
If a is an N-D array and b is a 1-D array, it is a sum product over the last axis of a and b.
If a is an N-D array and b is an M-D array (where M>=2), it is a sum product over the last axis of a and the second-to-last axis of b:
dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])
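For concreteness, here is a minimal sketch (not part of this PR; it uses only ndarray's existing 1-D dot and small, arbitrary shapes) that spells out the last rule with naive loops:

    use ndarray::{Array, Array3, Array4, Axis};

    fn main() {
        // arbitrary small shapes chosen only for illustration
        let a: Array3<f64> = Array::range(0., 24., 1.).into_shape((2, 3, 4)).unwrap();
        let b: Array3<f64> = Array::range(0., 24., 1.).into_shape((2, 4, 3)).unwrap();
        // dot(a, b) has shape (2, 3, 2, 3): a's shape without its last axis,
        // followed by b's shape without its second-to-last axis
        let mut c: Array4<f64> = Array::zeros((2, 3, 2, 3));
        for i in 0..2 {
            for j in 0..3 {
                for k in 0..2 {
                    for m in 0..3 {
                        // dot(a, b)[i, j, k, m] = sum(a[i, j, :] * b[k, :, m])
                        c[[i, j, k, m]] = a
                            .index_axis(Axis(0), i)
                            .index_axis(Axis(0), j)
                            .dot(&b.index_axis(Axis(0), k).index_axis(Axis(1), m));
                    }
                }
            }
        }
        assert_eq!(c.shape(), &[2, 3, 2, 3]);
    }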

I ran a three-dimensional multiplication test on my linux-x86_64 machine:

    use ndarray::{Array, linalg::{Dot, ParDot}};
    extern crate openblas_src;

    fn main() {
        let a = Array::range(0., 96000., 1.).into_shape((40, 40, 60)).unwrap();
        let b = Array::range(0., 96000., 1.).into_shape((40, 60, 40)).unwrap();

        // without rayon
        let start = std::time::SystemTime::now();
        let c = a.dot(&b);
        let end = std::time::SystemTime::now();
        println!("{:?}", end.duration_since(start));

        // with rayon
        let start = std::time::SystemTime::now();
        let par_c = a.par_dot(&b);
        let end = std::time::SystemTime::now();
        println!("{:?}", end.duration_since(start));

        assert_eq!(c, par_c);
    }

and the test results are as follows:

    generic:                     20.288079241s
    with rayon:                  2.689783602s
    with openblas:               2.733138535s
    with rayon+openblas:         434.623734ms

The main modification is to implement the Dot trait in impl_linalg.rs for arrays with more than two dimensions, combined with the other dimensionalities. It panics when the shapes do not match or when the number of elements in the result overflows.
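As a hedged illustration (the helper name and signature below are hypothetical, not the PR's actual code), the shape and overflow checks amount to roughly this:

    /// Hypothetical helper sketching the checks described above: the last axis
    /// of `a` must match the second-to-last axis of `b`, and the element count
    /// of the result must not overflow `usize`.
    fn output_shape(a_shape: &[usize], b_shape: &[usize]) -> Vec<usize> {
        let (la, lb) = (a_shape.len(), b_shape.len());
        assert!(la >= 1 && lb >= 2);
        // shape mismatch -> panic
        assert_eq!(a_shape[la - 1], b_shape[lb - 2], "shapes do not match");
        // result shape: a's shape without its last axis, then b's shape
        // without its second-to-last axis
        let mut out = a_shape[..la - 1].to_vec();
        out.extend_from_slice(&b_shape[..lb - 2]);
        out.push(b_shape[lb - 1]);
        // element-count overflow -> panic
        out.iter()
            .try_fold(1usize, |acc, &d| acc.checked_mul(d))
            .expect("result element count overflows usize");
        out
    }

For the arrays in the benchmark above, output_shape(&[40, 40, 60], &[40, 60, 40]) gives [40, 40, 40, 40].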

Because multi-dimensional matrix multiplication is converted into many vector inner-product calculations (following numpy.dot), I used the lanes().into_iter() method to get an iterator that produces the required vectors, and the dot_impl() method to compute the vector inner products, so we can get the acceleration effect of BLAS.

    // lanes of `self` along its last axis
    let v = self.lanes(Axis(l1 - 1))
        .into_iter()
        .map(|l1| {
            // inner product of this lane with every lane of `rhs`
            // along the matching axis
            rhs.lanes(Axis(match_dim))
                .into_iter()
                .map(|l2| l2.dot_impl(&l1))
                .collect::<Vec<_>>()
        })
        .flatten()
        .collect::<Vec<_>>();

On the other hand, the process of calculating the inner product of vectors can be parallelized using rayon.

    // same as above, but LanesIter is converted into a rayon parallel
    // iterator with into_par_iter()
    let v = self.lanes(Axis(l1 - 1))
        .into_iter()
        .into_par_iter()
        .map(|l1| {
            rhs.lanes(Axis(match_dim))
                .into_iter()
                .into_par_iter()
                .map(|l2| l2.dot_impl(&l1))
                .collect::<Vec<_>>()
        })
        .flatten()
        .collect::<Vec<_>>();

But since the internal iterator of LanesIter is Baseiter, it cannot be turned into a parallel iterator simply by using par_iter_wrapper!(LanesIter, [Sync]).

    pub struct Baseiter<A, D> {
        ptr: *mut A,
        dim: D,
        strides: D,
        index: Option<D>,
    }
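For context, what rayon ultimately needs from an indexed parallel iterator is a Producer; a simplified excerpt of that trait (the real rayon::iter::plumbing::Producer, with its provided methods omitted) shows why a known length, double-ended iteration and an arbitrary split_at are required, which Baseiter's index: Option<D> state does not provide directly:

    // Simplified excerpt of rayon::iter::plumbing::Producer
    // (provided methods such as min_len/max_len/fold_with omitted)
    pub trait Producer: Send + Sized {
        type Item;
        type IntoIter: Iterator<Item = Self::Item> + DoubleEndedIterator + ExactSizeIterator;
        fn into_iter(self) -> Self::IntoIter;
        fn split_at(self, index: usize) -> (Self, Self);
    }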

So I created a new iterator named LanesIterCore to use instead of Baseiter.

    pub struct LanesIterCore<A, D> {
        ptr: *mut A,
        dim: D,
        strides: D,
        dis: usize,
        end: usize,
    }

LanesIterCore uses the dis and end fields to replace the index in Baseiter, so it can easily implement the ExactSizeIterator and DoubleEndedIterator traits and the split_at method required by a parallel iterator.
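As a sketch (relying on the LanesIterCore definition above, with ndarray's Dimension trait in scope for D; not necessarily the PR's exact code), splitting the iterator is just splitting the half-open range [dis, end), and the length is end - dis:

    impl<A, D: Dimension> LanesIterCore<A, D> {
        // ExactSizeIterator::len would simply be `self.end - self.dis`.
        fn split_at(self, index: usize) -> (Self, Self) {
            let mid = self.dis + index;
            debug_assert!(mid <= self.end);
            // both halves point at the same data; only the [dis, end)
            // range differs
            let left = LanesIterCore {
                ptr: self.ptr,
                dim: self.dim.clone(),
                strides: self.strides.clone(),
                dis: self.dis,
                end: mid,
            };
            let right = LanesIterCore {
                ptr: self.ptr,
                dim: self.dim,
                strides: self.strides,
                dis: mid,
                end: self.end,
            };
            (left, right)
        }
    }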

However, its current way of obtaining a dimension index is the newly added index_from_distance() method, which needs several divisions and remainders to compute a dimension index:

    /// Return the dimension index at a specific distance from the first index.
    ///
    /// Panics if `dis` is greater than or equal to the number of elements of `dim`.
    pub(crate) fn index_from_distance<D: Dimension>(dim: &D, dis: usize) -> D {
        let mut dis = dis;
        let mut index = D::zeros(dim.ndim());
        let len = dim.ndim();
        for i in 0..len {
            let m = len - i - 1;
            index[m] = dis % dim[m];
            dis /= dim[m];
            if dis == 0 {
                break;
            }
        }
        assert!(dis == 0);
        index
    }

so it will be much slower than next_for(). I have not thought of a better solution yet, but when the amount of data is large this overhead is negligible.
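As a concrete example of the mapping (a standalone copy of the same arithmetic on a plain slice, since the PR's version is pub(crate)): with dim = [2, 3, 4] in row-major order, a distance of 17 corresponds to the index [1, 1, 1], because 17 = 1*(3*4) + 1*4 + 1.

    // Standalone illustration of index_from_distance's arithmetic;
    // not the PR's code, just the same loop over a plain slice.
    fn index_from_distance(dim: &[usize], mut dis: usize) -> Vec<usize> {
        let mut index = vec![0; dim.len()];
        for m in (0..dim.len()).rev() {
            index[m] = dis % dim[m];
            dis /= dim[m];
            if dis == 0 {
                break;
            }
        }
        assert!(dis == 0);
        index
    }

    fn main() {
        assert_eq!(index_from_distance(&[2, 3, 4], 17), vec![1, 1, 1]);
    }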

@SparrowLii (Contributor, Author) commented:

@bluss @jturner314 Could you provide some advice, if you have time?

@bluss (Member) commented Mar 1, 2021

Hi, as you can see I unfortunately haven't had time for GitHub in the last weeks, but I will be back to get the ndarray 0.15 release done - some of your PRs will be part of that.

For this PR I can offer some feedback, but only the big picture. 🙂
Matrix multiplication is a well-studied problem, and the memory access pattern in the algorithm used here does not scale very well to very big matrices in terms of performance (to not miss out on using BLAS to its full potential, its matrix multiplication should be used - i.e. the *gemm functions). Look into how BLIS, OpenBLAS or matrixmultiply are implemented. OpenBLAS and matrixmultiply both already implement parallelization as well, and maybe that's the right layer to implement it. Running the benchmark (examples/ directory) in matrixmultiply lets you compare performance - consider big matrices like 1000 x 1000 or larger.
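For reference (not part of this PR), the *gemm path is already reachable from ndarray for 2-D operands through ndarray::linalg::general_mat_mul, which uses matrixmultiply or BLAS when the blas feature is enabled; an N-D dot could in principle loop over 2-D blocks and call it:

    use ndarray::{Array2, linalg::general_mat_mul};

    fn main() {
        let a = Array2::<f64>::ones((1000, 1000));
        let b = Array2::<f64>::ones((1000, 1000));
        let mut c = Array2::<f64>::zeros((1000, 1000));
        // c = 1.0 * a.dot(&b) + 0.0 * c, computed by gemm
        general_mat_mul(1.0, &a, &b, 0.0, &mut c);
    }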

You could make your own crate for experimenting with this problem. I won't be able to review it further right now.

@bluss closed this on Mar 12, 2021

@bluss (Member) commented Mar 12, 2021

Closing for now
