implement multidimensional matrix multiply which supports blas and parallelization #929
Fixes #16
Fixes #886
Updates #678
This follows the implementation of numpy.dot. The general rules are as follows:
- If both `a` and `b` are 1-D arrays, it is the inner product of vectors (without complex conjugation).
- If both `a` and `b` are 2-D arrays, it is matrix multiplication, but using `matmul` or `a @ b` is preferred.
- If either `a` or `b` is 0-D (scalar), it is equivalent to multiply and using `numpy.multiply(a, b)` or `a * b` is preferred.
- If `a` is an N-D array and `b` is a 1-D array, it is a sum product over the last axis of `a` and `b`.
- If `a` is an N-D array and `b` is an M-D array (where M >= 2), it is a sum product over the last axis of `a` and the second-to-last axis of `b`:
  `dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])`
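To make the last rule concrete, here is a minimal reference sketch of the contraction for the 3-D × 3-D case (`dot3` is a hypothetical helper written for illustration, not code from this PR):

```rust
use ndarray::prelude::*;

// Reference implementation of the quoted rule for two 3-D inputs:
// dot(a, b)[i, j, k, m] = sum over t of a[i, j, t] * b[k, t, m].
fn dot3(a: &Array3<f64>, b: &Array3<f64>) -> Array4<f64> {
    let (ai, aj, s) = a.dim();
    let (bk, bs, bm) = b.dim();
    assert_eq!(s, bs, "the contracted axes must have equal length");
    let mut out = Array4::<f64>::zeros((ai, aj, bk, bm));
    for i in 0..ai {
        for j in 0..aj {
            for k in 0..bk {
                for m in 0..bm {
                    out[[i, j, k, m]] =
                        (0..s).map(|t| a[[i, j, t]] * b[[k, t, m]]).sum();
                }
            }
        }
    }
    out
}
```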
I ran a three-dimensional matrix multiplication benchmark on my linux-x86_64 machine, and the test results are as follows:
The main modification implements the Dot trait in impl_linalg.rs for arrays above two dimensions, paired with arrays of the other dimensionalities. It panics when the shapes do not match or when the number of elements in the result overflows.
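As a sketch of those panics and of the resulting shape (hypothetical helper names, not the PR's actual code):

```rust
// Hypothetical sketch of the shape rule and the two panics described
// above: the contracted axes must agree, and the element count of the
// result must fit in usize.
fn output_shape(a_shape: &[usize], b_shape: &[usize]) -> Vec<usize> {
    let ka = a_shape[a_shape.len() - 1]; // last axis of `a`
    let kb = b_shape[b_shape.len() - 2]; // second-to-last axis of `b`
    assert_eq!(ka, kb, "shape mismatch for dot: {} != {}", ka, kb);
    // Result shape: `a` minus its last axis, then `b` minus its
    // second-to-last axis.
    let mut out = a_shape[..a_shape.len() - 1].to_vec();
    out.extend_from_slice(&b_shape[..b_shape.len() - 2]);
    out.push(b_shape[b_shape.len() - 1]);
    out.iter()
        .try_fold(1usize, |acc, &d| acc.checked_mul(d))
        .expect("number of elements in the result overflows usize");
    out
}
```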
Because multi-dimensional matrix multiplication is converted into many vector inner-product computations (following numpy.dot), I used the `lanes().into_iter()` method to get an iterator that produces the required vectors, and the `dot_impl()` method to compute each inner product, so we get the acceleration of BLAS.
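A simplified sketch of that reduction for the 3-D × 3-D case (it calls the public `dot` where the PR routes through the internal `dot_impl()`; illustration only):

```rust
use ndarray::prelude::*;

// Reduce the N-D product to lane inner products: each output element
// pairs a lane of `a` (along its last axis) with a lane of `b`
// (along its second-to-last axis).
fn dot_via_lanes(a: &Array3<f64>, b: &Array3<f64>) -> Array4<f64> {
    let (ai, aj, s) = a.dim();
    let (bk, bs, bm) = b.dim();
    assert_eq!(s, bs);
    let mut out = Array4::<f64>::zeros((ai, aj, bk, bm));
    for (ij, a_lane) in a.lanes(Axis(2)).into_iter().enumerate() {
        let (i, j) = (ij / aj, ij % aj);
        for (km, b_lane) in b.lanes(Axis(1)).into_iter().enumerate() {
            let (k, m) = (km / bm, km % bm);
            out[[i, j, k, m]] = a_lane.dot(&b_lane);
        }
    }
    out
}
```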
On the other hand, the vector inner-product computations can be parallelized using rayon.
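A small illustration of that idea on the 2-D case, using `ndarray`'s `rayon` feature and its axis iterators (the PR instead makes the lanes iterator itself splittable, as described below):

```rust
use ndarray::prelude::*;
use ndarray::parallel::prelude::*;

// Illustration only: distribute the per-row inner products across
// threads with rayon (requires ndarray's "rayon" feature).
fn par_matmul(a: &Array2<f64>, b: &Array2<f64>) -> Array2<f64> {
    let (n, k) = a.dim();
    let (k2, m) = b.dim();
    assert_eq!(k, k2);
    let mut out = Array2::<f64>::zeros((n, m));
    out.axis_iter_mut(Axis(0))
        .into_par_iter()
        .zip(a.axis_iter(Axis(0)).into_par_iter())
        .for_each(|(mut out_row, a_row)| {
            for (j, b_col) in b.lanes(Axis(0)).into_iter().enumerate() {
                out_row[j] = a_row.dot(&b_col);
            }
        });
    out
}
```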
But since the internal iterator of `LanesIter` is `Baseiter`, it cannot be converted into a parallel iterator simply by using `par_iter_wrapper!(LanesIter, [Sync])`. So I created a new iterator named `LanesIterCore` to replace `Baseiter`. `LanesIterCore` uses `dis` and `end` parameters in place of the `index` in `Baseiter`, so it can easily implement `ExactSizeIterator`, `DoubleEndedIterator`, and the `split_at` method required by a parallel iterator. However, its current way of obtaining a dimension index is the newly added `index_from_distance()` method, which needs several divisions and remainders to compute each dimension index, so it is much slower than `next_for()`. I have not thought of a better solution yet, but when the amount of data is large this overhead is negligible.
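For illustration, here is a hypothetical sketch of the `dis`/`end` bookkeeping and of what `index_from_distance()` has to compute (the names mirror this description; the PR's actual code will differ):

```rust
// Hypothetical sketch: a plain (dis, end) range makes the length and
// the split trivial, which is exactly what ExactSizeIterator,
// DoubleEndedIterator, and rayon's split_at need.
struct LanesIterCoreSketch {
    dis: usize, // linear distance of the next front lane
    end: usize, // one past the linear distance of the last lane
}

impl LanesIterCoreSketch {
    fn len(&self) -> usize {
        self.end - self.dis
    }

    fn split_at(self, index: usize) -> (Self, Self) {
        let mid = self.dis + index;
        (
            LanesIterCoreSketch { dis: self.dis, end: mid },
            LanesIterCoreSketch { dis: mid, end: self.end },
        )
    }
}

// The cost discussed above: recovering the multi-dimensional index at
// linear position `dis` (standard row-major order) takes one division
// and one remainder per axis.
fn index_from_distance(shape: &[usize], mut dis: usize) -> Vec<usize> {
    let mut index = vec![0; shape.len()];
    for (axis, &len) in shape.iter().enumerate().rev() {
        index[axis] = dis % len;
        dis /= len;
    }
    index
}
```

For example, `index_from_distance(&[2, 3, 4], 17)` yields `[1, 1, 1]`, since 17 = 1·12 + 1·4 + 1.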