Comparing (on x86) the MLIR code generated by two approaches.
Approach 1: Multi-dimensional vectors using `vector.contract`.
This approach uses multi-dimensional vectors and affine maps to define reductions and parallel operations. It currently works only on entire vectors. Subsets can be extracted, but the indices at which to extract must be literals. This is high level, with the good (generic, abstract, uses the full infrastructure) and the potentially problematic (what code is generated, how to apply specific tricks).
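For reference, a minimal sketch of this style (a hypothetical 4x4 matrix multiply expressed with `vector.contract`; the exact code from the experiment is not shown in this issue):

```mlir
// Indexing maps for C(i,j) += A(i,k) * B(k,j).
#mapA = affine_map<(i, j, k) -> (i, k)>
#mapB = affine_map<(i, j, k) -> (k, j)>
#mapC = affine_map<(i, j, k) -> (i, j)>

func @matmul(%a: vector<4x4xf32>, %b: vector<4x4xf32>,
             %c: vector<4x4xf32>) -> vector<4x4xf32> {
  // i and j are parallel dimensions; k is the reduction dimension.
  %0 = vector.contract {indexing_maps = [#mapA, #mapB, #mapC],
                        iterator_types = ["parallel", "parallel", "reduction"]}
       %a, %b, %c : vector<4x4xf32>, vector<4x4xf32> into vector<4x4xf32>
  return %0 : vector<4x4xf32>
}
```

The affine maps and iterator types carry the full semantics of the loop nest, so the lowering passes decide what code is ultimately generated.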
Approach 2: Arrays of simple vectors.
This approach uses arbitrary-dimensional `memref`s of small vectors (e.g. `vector<8xf32>`). Vectorization is done by hand: loading, splatting, and performing SIMD operations directly. This is low level, with the good (full control, can apply arbitrary patterns) and the potentially problematic (must code all of the details, specific to an architecture).
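A minimal sketch of this second style (a hypothetical `axpy` over a memref of `vector<8xf32>`; names and shapes are illustrative, not the code from the experiment):

```mlir
func @axpy(%alpha: f32, %x: memref<4xvector<8xf32>>,
           %y: memref<4xvector<8xf32>>) {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %c4 = constant 4 : index
  // Broadcast the scalar into a SIMD register once, outside the loop.
  %s = splat %alpha : vector<8xf32>
  scf.for %i = %c0 to %c4 step %c1 {
    // Each load/store moves one full vector<8xf32> lane group.
    %xv = load %x[%i] : memref<4xvector<8xf32>>
    %yv = load %y[%i] : memref<4xvector<8xf32>>
    %ax = mulf %s, %xv : vector<8xf32>
    %r  = addf %ax, %yv : vector<8xf32>
    store %r, %y[%i] : memref<4xvector<8xf32>>
  }
  return
}
```

Every SIMD operation is spelled out explicitly, which is what gives this approach its control, and its architecture specificity.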
Comparison.
Running both examples through

```
mlir-opt --convert-vector-to-scf --lower-affine --convert-scf-to-std --convert-vector-to-llvm test.mlir | mlir-translate -mlir-to-llvmir | opt -O3 -S | llc -O3
```

and looking at the instruction counts, I got the following.
| Ops | Multi-dim vectors (#1) | Array of simple vectors (#2) |
|---|---|---|
| Multiplications (mulps) | 16 | 16 |
| Add | 64 (adds) | 16 (adds) |
| Load/store/move | 69 (movaps) | 44: 16 (movaps), 12 (movq), 16 (movss) |
| Unpack (unpcklps) | 9 | 0 |
| Shuffle (shufps) | 53 | 16 |
| Total | 256 | 94 |
Tracking the add operations, the first approach failed to SIMDize the additions, whereas the second approach succeeded. Memory operations could be as low as 12 loads and 4 stores; if load/splat is used, the 12 loads would expand to 24 load/splat pairs. Some of the memory traffic may be due to the calling convention.
I am sure that both can eventually be fixed, but the second approach generates better code at this time. It was also validated in a comparison against BLAS routines on x86, yielding performance nearly as good as the most optimized routines. That experiment used heavily tiled loop nests, with buffers and transposed buffers for cache locality.
Other issues with multi-dim vectors (currently being investigated).
To generate vectors, the vector dialect recommends `vector.transfer_read` and its write equivalent. Initial investigation shows that the simple pattern below generates very long code. Ideally, for this simple code, a plain memcpy should be used.
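The original snippet is not preserved in this copy of the issue; a minimal pattern of the kind described (a hypothetical full 4x4 in-bounds read, no padding or masking needed) would look like:

```mlir
func @read(%m: memref<4x4xf32>) -> vector<4x4xf32> {
  %c0 = constant 0 : index
  // Padding value required by the op's signature; unused here,
  // since the read covers the whole memref and stays in bounds.
  %pad = constant 0.0 : f32
  %v = vector.transfer_read %m[%c0, %c0], %pad
       : memref<4x4xf32>, vector<4x4xf32>
  return %v : vector<4x4xf32>
}
```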
This pattern results in the asm below. Note that the transfer goes from a 4x4 memref to a 4x4 vector, so no padding or masking of any kind should happen. I do not understand the code below at this time. I asked on the MLIR forum to see whether there is something very wrong that I am doing.
Interestingly, the new approach is nearly as good as simple vectors, but it uses unaligned loads; that is probably something that can be fixed. Also, the number of memory ops is well above the minimum, but again we can probably handle that.