
implement multidimensional matrix multiply which supports blas and parallelization #929

Closed
SparrowLii wants to merge 2 commits

Conversation

@SparrowLii (Contributor) commented Feb 24, 2021

Fixes #16
Fixes #886
Updates #678
This follows the implementation of numpy.dot. The general rules are as follows:
If both a and b are 1-D arrays, it is inner product of vectors (without complex conjugation).
If both a and b are 2-D arrays, it is matrix multiplication, but using matmul or a @ b is preferred.
If either a or b is 0-D (scalar), it is equivalent to multiply and using numpy.multiply(a, b) or a * b is preferred.
If a is an N-D array and b is a 1-D array, it is a sum product over the last axis of a and b.
If a is an N-D array and b is an M-D array (where M>=2), it is a sum product over the last axis of a and the second-to-last axis of b:
dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])
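For concreteness, here is a minimal sketch (not part of this PR; it uses only ndarray's existing 1-D dot and small, arbitrary shapes) that spells out the last rule with naive loops:

    use ndarray::{Array, Array3, Array4, Axis};

    fn main() {
        // arbitrary small shapes chosen only for illustration
        let a: Array3<f64> = Array::range(0., 24., 1.).into_shape((2, 3, 4)).unwrap();
        let b: Array3<f64> = Array::range(0., 24., 1.).into_shape((2, 4, 3)).unwrap();
        // dot(a, b) has shape (2, 3, 2, 3): a's shape without its last axis,
        // followed by b's shape without its second-to-last axis
        let mut c: Array4<f64> = Array::zeros((2, 3, 2, 3));
        for i in 0..2 {
            for j in 0..3 {
                for k in 0..2 {
                    for m in 0..3 {
                        // dot(a, b)[i, j, k, m] = sum(a[i, j, :] * b[k, :, m])
                        c[[i, j, k, m]] = a
                            .index_axis(Axis(0), i)
                            .index_axis(Axis(0), j)
                            .dot(&b.index_axis(Axis(0), k).index_axis(Axis(1), m));
                    }
                }
            }
        }
        assert_eq!(c.shape(), &[2, 3, 2, 3]);
    }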

I ran a three-dimensional multiplication test on my linux-x86_64 machine:

    use ndarray::{Array, linalg::{Dot, ParDot}};
    extern crate openblas_src;

    fn main() {
        let a = Array::range(0., 96000., 1.).into_shape((40, 40, 60)).unwrap();
        let b = Array::range(0., 96000., 1.).into_shape((40, 60, 40)).unwrap();

        // without rayon
        let start = std::time::SystemTime::now();
        let c = a.dot(&b);
        let end = std::time::SystemTime::now();
        println!("{:?}", end.duration_since(start));

        // with rayon
        let start = std::time::SystemTime::now();
        let par_c = a.par_dot(&b);
        let end = std::time::SystemTime::now();
        println!("{:?}", end.duration_since(start));

        assert_eq!(c, par_c);
    }

and the test results are as follows:

    generic:                     20.288079241s
    with rayon:                  2.689783602s
    with openblas:               2.733138535s
    with rayon+openblas:         434.623734ms

The main modification is to implement the Dot trait in impl_linalg.rs for arrays with more than two dimensions, combined with the other dimensionalities. It panics when the shapes do not match or when the number of elements in the result overflows.
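As a hedged illustration (the helper name and signature below are hypothetical, not the PR's actual code), the shape and overflow checks amount to roughly this:

    /// Hypothetical helper sketching the checks described above: the last axis
    /// of `a` must match the second-to-last axis of `b`, and the element count
    /// of the result must not overflow `usize`.
    fn output_shape(a_shape: &[usize], b_shape: &[usize]) -> Vec<usize> {
        let (la, lb) = (a_shape.len(), b_shape.len());
        assert!(la >= 1 && lb >= 2);
        // shape mismatch -> panic
        assert_eq!(a_shape[la - 1], b_shape[lb - 2], "shapes do not match");
        // result shape: a's shape without its last axis, then b's shape
        // without its second-to-last axis
        let mut out = a_shape[..la - 1].to_vec();
        out.extend_from_slice(&b_shape[..lb - 2]);
        out.push(b_shape[lb - 1]);
        // element-count overflow -> panic
        out.iter()
            .try_fold(1usize, |acc, &d| acc.checked_mul(d))
            .expect("result element count overflows usize");
        out
    }

For the arrays in the benchmark above, output_shape(&[40, 40, 60], &[40, 60, 40]) gives [40, 40, 40, 40].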

Because multi-dimensional matrix multiplication is converted into many vector inner-product calculations (following numpy.dot), I used the lanes().into_iter() method to get an iterator that produces the required vectors, and the dot_impl() method to compute the vector inner products, so we can get the acceleration effect of BLAS.

    // lanes of `self` along its last axis
    let v = self.lanes(Axis(l1 - 1))
        .into_iter()
        .map(|l1| {
            // inner product of this lane with every lane of `rhs`
            // along the matching axis
            rhs.lanes(Axis(match_dim))
                .into_iter()
                .map(|l2| l2.dot_impl(&l1))
                .collect::<Vec<_>>()
        })
        .flatten()
        .collect::<Vec<_>>();

On the other hand, the process of calculating the inner product of vectors can be parallelized using rayon.

    // same as above, but LanesIter is converted into a rayon parallel
    // iterator with into_par_iter()
    let v = self.lanes(Axis(l1 - 1))
        .into_iter()
        .into_par_iter()
        .map(|l1| {
            rhs.lanes(Axis(match_dim))
                .into_iter()
                .into_par_iter()
                .map(|l2| l2.dot_impl(&l1))
                .collect::<Vec<_>>()
        })
        .flatten()
        .collect::<Vec<_>>();

But since the internal iterator of LanesIter is Baseiter, it cannot be turned into a parallel iterator simply by using par_iter_wrapper!(LanesIter, [Sync]).

    pub struct Baseiter<A, D> {
        ptr: *mut A,
        dim: D,
        strides: D,
        index: Option<D>,
    }
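For context, what rayon ultimately needs from an indexed parallel iterator is a Producer; a simplified excerpt of that trait (the real rayon::iter::plumbing::Producer, with its provided methods omitted) shows why a known length, double-ended iteration and an arbitrary split_at are required, which Baseiter's index: Option<D> state does not provide directly:

    // Simplified excerpt of rayon::iter::plumbing::Producer
    // (provided methods such as min_len/max_len/fold_with omitted)
    pub trait Producer: Send + Sized {
        type Item;
        type IntoIter: Iterator<Item = Self::Item> + DoubleEndedIterator + ExactSizeIterator;
        fn into_iter(self) -> Self::IntoIter;
        fn split_at(self, index: usize) -> (Self, Self);
    }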

So I created a new iterator named LanesIterCore to use instead of Baseiter.

    pub struct LanesIterCore<A, D> {
        ptr: *mut A,
        dim: D,
        strides: D,
        dis: usize,
        end: usize,
    }

LanesIterCore uses the dis and end fields to replace the index in Baseiter, so it can easily implement the ExactSizeIterator and DoubleEndedIterator traits and the split_at method required by a parallel iterator.
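As a sketch (relying on the LanesIterCore definition above, with ndarray's Dimension trait in scope for D; not necessarily the PR's exact code), splitting the iterator is just splitting the half-open range [dis, end), and the length is end - dis:

    impl<A, D: Dimension> LanesIterCore<A, D> {
        // ExactSizeIterator::len would simply be `self.end - self.dis`.
        fn split_at(self, index: usize) -> (Self, Self) {
            let mid = self.dis + index;
            debug_assert!(mid <= self.end);
            // both halves point at the same data; only the [dis, end)
            // range differs
            let left = LanesIterCore {
                ptr: self.ptr,
                dim: self.dim.clone(),
                strides: self.strides.clone(),
                dis: self.dis,
                end: mid,
            };
            let right = LanesIterCore {
                ptr: self.ptr,
                dim: self.dim,
                strides: self.strides,
                dis: mid,
                end: self.end,
            };
            (left, right)
        }
    }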

However, its current way of obtaining a dimension index is the newly added index_from_distance() method, which needs several divisions and remainders to compute a dimension index:

    /// Return the dimension index at a specific distance from the first index.
    ///
    /// Panics if `dis` is greater than or equal to the number of elements of `dim`.
    pub(crate) fn index_from_distance<D: Dimension>(dim: &D, dis: usize) -> D {
        let mut dis = dis;
        let mut index = D::zeros(dim.ndim());
        let len = dim.ndim();
        for i in 0..len {
            let m = len - i - 1;
            index[m] = dis % dim[m];
            dis /= dim[m];
            if dis == 0 {
                break;
            }
        }
        assert!(dis == 0);
        index
    }

so it will be much slower than next_for(). I have not thought of a better solution yet, but when the amount of data is large this overhead is negligible.
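As a concrete example of the mapping (a standalone copy of the same arithmetic on a plain slice, since the PR's version is pub(crate)): with dim = [2, 3, 4] in row-major order, a distance of 17 corresponds to the index [1, 1, 1], because 17 = 1*(3*4) + 1*4 + 1.

    // Standalone illustration of index_from_distance's arithmetic;
    // not the PR's code, just the same loop over a plain slice.
    fn index_from_distance(dim: &[usize], mut dis: usize) -> Vec<usize> {
        let mut index = vec![0; dim.len()];
        for m in (0..dim.len()).rev() {
            index[m] = dis % dim[m];
            dis /= dim[m];
            if dis == 0 {
                break;
            }
        }
        assert!(dis == 0);
        index
    }

    fn main() {
        assert_eq!(index_from_distance(&[2, 3, 4], 17), vec![1, 1, 1]);
    }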

@SparrowLii (Contributor, Author) commented:

@bluss @jturner314 Could you provide some advice, if you have time?

@bluss (Member) commented Mar 1, 2021

Hi, as you can see I unfortunately haven't had time for GitHub in the last weeks, but I will be back to get the ndarray 0.15 release done - some of your PRs will be part of that.

For this PR I can offer some feedback, but only the big picture. 🙂
Matrix multiplication is a well-studied problem, and the memory access pattern in the algorithm used here does not scale very well to very big matrices in terms of performance (to not miss out on using BLAS to its full potential, its matrix multiplication should be used - i.e. the *gemm functions). Look into how BLIS, OpenBLAS or matrixmultiply are implemented. OpenBLAS and matrixmultiply both already implement parallelization as well, and maybe that's the right layer to implement it. Running the benchmark (examples/ directory) in matrixmultiply lets you compare performance - consider big matrices like 1000 x 1000 or larger.
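For reference (not part of this PR), the *gemm path is already reachable from ndarray for 2-D operands through ndarray::linalg::general_mat_mul, which uses matrixmultiply or BLAS when the blas feature is enabled; an N-D dot could in principle loop over 2-D blocks and call it:

    use ndarray::{Array2, linalg::general_mat_mul};

    fn main() {
        let a = Array2::<f64>::ones((1000, 1000));
        let b = Array2::<f64>::ones((1000, 1000));
        let mut c = Array2::<f64>::zeros((1000, 1000));
        // c = 1.0 * a.dot(&b) + 0.0 * c, computed by gemm
        general_mat_mul(1.0, &a, &b, 0.0, &mut c);
    }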

You could make your own crate for experimenting with this problem. I won't be able to review it further right now.

@bluss closed this on Mar 12, 2021

@bluss (Member) commented Mar 12, 2021

Closing for now
