
SIMD code generation #206

Open
AlexandreEichenberger opened this issue Jul 6, 2020 · 2 comments

Labels: KRNL IR and lowering (Support for lowering of KRNL IR to lower MLIR dialects.)

@AlexandreEichenberger (Collaborator)

Comparing (on x86) the code that MLIR generates for two approaches.

Approach 1: Multi-dimensional vectors using vector.contract.

This approach uses multi-dimensional vectors and affine maps to define reductions and parallel operations. It currently works only on entire vectors. Subsets can be extracted, but the indices of where to extract must be literals (see the vector.extract example after the code below). This is high level, with the good (generic, abstract, uses the full infrastructure) and the potentially problematic (what code is generated, how to apply specific tricks).

#contraction_accesses = [
  affine_map<(i, j, k) -> (i, k)>,
  affine_map<(i, j, k) -> (k, j)>,
  affine_map<(i, j, k) -> (i, j)>
]
#contraction_trait = {
  indexing_maps = #contraction_accesses,
  iterator_types = ["parallel", "parallel", "reduction"]
}
func @matMul(%A : vector<4x4xf32>, %B : vector<4x4xf32>, %C : vector<4x4xf32>) -> vector<4x4xf32> {
  %res = vector.contract #contraction_trait %A, %B, %C
    : vector<4x4xf32>, vector<4x4xf32> into vector<4x4xf32>
  return %res : vector<4x4xf32>
}
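
As an aside on the literal-index restriction mentioned above: extracting a row of a multi-dimensional vector uses vector.extract, whose position is an attribute and therefore must be a compile-time literal (a minimal sketch):

%row = vector.extract %A[1] : vector<4x4xf32>  // takes row 1; %row has type vector<4xf32>; the index cannot be a runtime value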

Approach 2: Arrays of simple vectors

This approach uses arbitrary-dimension memrefs of small vectors (e.g., vector<8xf32>). Vectorization is done directly by loading, splatting, and performing SIMD operations. This is low level, with the good (full control, apply arbitrary patterns) and the potentially problematic (all of the details must be coded, specific to an architecture); a variant with the accumulator hoisted out of the reduction loop is sketched after the code below.

func @matMul(%A : memref<4x4xf32>, %B : memref<4xvector<4xf32>>, %C : memref<4xvector<4xf32>>) -> memref<4xvector<4xf32>> {
  affine.for %i = 0 to 4 {
    affine.for %k = 0 to 4 {
      %a = affine.load %A[%i, %k] : memref<4x4xf32>
      %va = splat %a : vector<4xf32>
      %vb = affine.load %B[%k] : memref<4xvector<4xf32>>
      %vm = mulf %va, %vb : vector<4xf32>
      %vc = affine.load %C[%i] : memref<4xvector<4xf32>>
      %vres = addf %vm, %vc : vector<4xf32>
      affine.store %vres, %C[%i] : memref<4xvector<4xf32>>
    }
  }
  return %C : memref<4xvector<4xf32>>
}
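
As mentioned above, here is a variant that hoists the row of C out of the reduction loop, cutting the loads and stores of C from 16 each down to 4 each. This is a minimal sketch, assuming scf.for's loop-carried values (iter_args); the function name matMulHoisted is hypothetical:

func @matMulHoisted(%A : memref<4x4xf32>, %B : memref<4xvector<4xf32>>, %C : memref<4xvector<4xf32>>) {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %c4 = constant 4 : index
  affine.for %i = 0 to 4 {
    // Load the C row once per i iteration instead of once per (i, k) pair.
    %vc = affine.load %C[%i] : memref<4xvector<4xf32>>
    // Carry the accumulator through the reduction as a loop-carried value.
    %vsum = scf.for %k = %c0 to %c4 step %c1 iter_args(%acc = %vc) -> (vector<4xf32>) {
      %a = load %A[%i, %k] : memref<4x4xf32>
      %va = splat %a : vector<4xf32>
      %vb = load %B[%k] : memref<4xvector<4xf32>>
      %vm = mulf %va, %vb : vector<4xf32>
      %vres = addf %vm, %acc : vector<4xf32>
      scf.yield %vres : vector<4xf32>
    }
    // Store the accumulated row once per i iteration.
    affine.store %vsum, %C[%i] : memref<4xvector<4xf32>>
  }
  return
}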

Comparison.

Running both examples through mlir-opt --convert-vector-to-scf --lower-affine --convert-scf-to-std --convert-vector-to-llvm test.mlir | mlir-translate -mlir-to-llvmir | opt -O3 -S | llc -O3 and counting instructions, I got the following.

Ops | Multi-dim vectors (#1) | Array of simple vectors (#2)
--- | --- | ---
Multiplications (mulps) | 16 | 16
Add | 64 (adds) | 16 (adds)
Load/store/move | 69 (movaps) | 44 ops with 16 (movaps), 12 (movq), 16 (movss)
Unpack (unpcklps) | 9 | 0
Shuffle (shufps) | 53 | 16
Total | 256 | 94

Tracking the add operations, the first approach failed to simdize the additions, whereas the second approach succeeded. Memory operations could be as low as 12 loads and 4 stores (presumably 4 vector loads each for the rows of A, B, and C, plus 4 vector stores of C); if load/splat is used instead, the 12 loads expand to 24 loads/load-splats (16 scalar load/splats for the elements of A, plus the 8 vector loads for B and C). Some of the memory traffic may be due to the calling convention.

I am sure that both can eventually be fixed, but the second approach generates better code at this time. It was also validated in a comparison against BLAS routines on x86, achieving performance nearly as good as the most optimized routines. That version used heavily tiled loop nests, with buffers and a transposed buffer for cache locality.

Other issues with multi-dim vectors (currently being investigated)

To generate vectors, the vector dialect recommends vector.transfer_read and its write equivalent. Initial investigation shows that the simple pattern below generates very long code. Ideally, for this simple case, a simple memcpy should be used.

func @transfer_read_2d(%A : memref<4x4xf32>, %base1: index, %base2: index) -> vector<4x4xf32>{
  %fm42 = constant -42.0: f32
  %f = vector.transfer_read %A[%base1, %base2], %fm42
      //{permutation_map = affine_map<(d0, d1) -> (d0, d1)>} 
    : memref<4x4xf32>, vector<4x4xf32>
  return %f : vector<4x4xf32>
}

results in the asm below. Note that the transfer goes from a 4x4 memref to a 4x4 vector, so no padding or masking of any kind should be needed. I do not understand the code below at this time; for contrast, a hand-written sketch of the expected lowering follows the listing.

_transfer_read_2d:                      ## @transfer_read_2d
Lfunc_begin0:
	.file	1 "/Users/alexe/MLIR/atests/simd/<stdin>"
	.loc	1 5 0                   ## <stdin>:5:0
	.cfi_sections .debug_frame
	.cfi_startproc
## %bb.0:
	movq	24(%rsp), %rax
	movq	16(%rsp), %rcx
Ltmp0:
	.loc	1 0 0 prologue_end      ## <stdin>:0:0
	movq	%rax, %xmm0
	pshufd	$68, %xmm0, %xmm9       ## xmm9 = xmm0[0,1,0,1]
	movdqa	LCPI0_0(%rip), %xmm8    ## xmm8 = [2,3]
	paddq	%xmm9, %xmm8
	paddq	LCPI0_1(%rip), %xmm9
	movaps	LCPI0_2(%rip), %xmm3    ## xmm3 = [-4.2E+1,-4.2E+1,-4.2E+1,-4.2E+1]
	.loc	1 40 11                 ## <stdin>:40:11
	cmpq	$3, %rcx
	.loc	1 0 0 is_stmt 0         ## <stdin>:0:0
	movaps	%xmm3, %xmm0
	.loc	1 41 5 is_stmt 1        ## <stdin>:41:5
	jg	LBB0_10
## %bb.1:
	.loc	1 50 11                 ## <stdin>:50:11
	leaq	(%rax,%rcx,4), %rdx
	.loc	1 51 11                 ## <stdin>:51:11
	leaq	(%rsi,%rdx,4), %rdx
	.loc	1 67 11                 ## <stdin>:67:11
	movdqa	LCPI0_3(%rip), %xmm0    ## xmm0 = [2147483648,2147483648]
	movdqa	%xmm9, %xmm1
	pxor	%xmm0, %xmm1
	movdqa	LCPI0_4(%rip), %xmm2    ## xmm2 = [2147483652,2147483652]
	movdqa	%xmm2, %xmm6
	pcmpeqd	%xmm1, %xmm6
	movdqa	%xmm2, %xmm7
	pcmpgtd	%xmm1, %xmm7
	pshufd	$160, %xmm7, %xmm1      ## xmm1 = xmm7[0,0,2,2]
	pand	%xmm6, %xmm1
	por	%xmm7, %xmm1
	pxor	%xmm8, %xmm0
	movdqa	%xmm2, %xmm6
	pcmpeqd	%xmm0, %xmm6
	pcmpgtd	%xmm0, %xmm2
	pshufd	$160, %xmm2, %xmm0      ## xmm0 = xmm2[0,0,2,2]
	pand	%xmm6, %xmm0
	por	%xmm2, %xmm0
	packssdw	%xmm0, %xmm1
	movmskps	%xmm1, %edi
	testb	$1, %dil
	je	LBB0_2
## %bb.3:                               ## %cond.load
	movss	(%rdx), %xmm1           ## xmm1 = mem[0],zero,zero,zero
	movaps	LCPI0_5(%rip), %xmm0    ## xmm0 = <u,-4.2E+1,-4.2E+1,-4.2E+1>
	movss	%xmm1, %xmm0            ## xmm0 = xmm1[0],xmm0[1,2,3]
	testb	$2, %dil
	jne	LBB0_5
	jmp	LBB0_6
LBB0_2:
	.loc	1 0 11 is_stmt 0        ## <stdin>:0:11
	movaps	LCPI0_2(%rip), %xmm0    ## xmm0 = [-4.2E+1,-4.2E+1,-4.2E+1,-4.2E+1]
	.loc	1 67 11                 ## <stdin>:67:11
	testb	$2, %dil
	je	LBB0_6
LBB0_5:                                 ## %cond.load1
	movss	4(%rdx), %xmm1          ## xmm1 = mem[0],zero,zero,zero
	shufps	$0, %xmm0, %xmm1        ## xmm1 = xmm1[0,0],xmm0[0,0]
	shufps	$226, %xmm0, %xmm1      ## xmm1 = xmm1[2,0],xmm0[2,3]
	movaps	%xmm1, %xmm0
LBB0_6:                                 ## %else2
	testb	$4, %dil
	jne	LBB0_7
## %bb.8:                               ## %else5
	testb	$8, %dil
	je	LBB0_10
LBB0_9:                                 ## %cond.load7
	movss	12(%rdx), %xmm1         ## xmm1 = mem[0],zero,zero,zero
	shufps	$32, %xmm0, %xmm1       ## xmm1 = xmm1[0,0],xmm0[2,0]
	shufps	$36, %xmm1, %xmm0       ## xmm0 = xmm0[0,1],xmm1[2,0]
LBB0_10:
	.loc	1 39 11 is_stmt 1       ## <stdin>:39:11
	leaq	1(%rcx), %rdx
	.loc	1 40 11                 ## <stdin>:40:11
	cmpq	$3, %rdx
	.loc	1 0 0 is_stmt 0         ## <stdin>:0:0
	movaps	%xmm3, %xmm1
	.loc	1 41 5 is_stmt 1        ## <stdin>:41:5
	jg	LBB0_20
## %bb.11:
	.loc	1 50 11                 ## <stdin>:50:11
	leaq	(%rax,%rdx,4), %rdx
	.loc	1 51 11                 ## <stdin>:51:11
	leaq	(%rsi,%rdx,4), %rdx
	.loc	1 67 11                 ## <stdin>:67:11
	movdqa	LCPI0_3(%rip), %xmm1    ## xmm1 = [2147483648,2147483648]
	movdqa	%xmm9, %xmm2
	pxor	%xmm1, %xmm2
	movdqa	LCPI0_4(%rip), %xmm6    ## xmm6 = [2147483652,2147483652]
	movdqa	%xmm6, %xmm7
	pcmpeqd	%xmm2, %xmm7
	movdqa	%xmm6, %xmm4
	pcmpgtd	%xmm2, %xmm4
	pshufd	$160, %xmm4, %xmm2      ## xmm2 = xmm4[0,0,2,2]
	pand	%xmm7, %xmm2
	por	%xmm4, %xmm2
	pxor	%xmm8, %xmm1
	movdqa	%xmm6, %xmm4
	pcmpeqd	%xmm1, %xmm4
	pcmpgtd	%xmm1, %xmm6
	pshufd	$160, %xmm6, %xmm1      ## xmm1 = xmm6[0,0,2,2]
	pand	%xmm4, %xmm1
	por	%xmm6, %xmm1
	packssdw	%xmm1, %xmm2
	movmskps	%xmm2, %edi
	testb	$1, %dil
	je	LBB0_12
## %bb.13:                              ## %cond.load11
	movss	(%rdx), %xmm2           ## xmm2 = mem[0],zero,zero,zero
	movaps	LCPI0_5(%rip), %xmm1    ## xmm1 = <u,-4.2E+1,-4.2E+1,-4.2E+1>
	movss	%xmm2, %xmm1            ## xmm1 = xmm2[0],xmm1[1,2,3]
	testb	$2, %dil
	jne	LBB0_15
	jmp	LBB0_16
LBB0_7:                                 ## %cond.load4
	movss	8(%rdx), %xmm1          ## xmm1 = mem[0],zero,zero,zero
	shufps	$48, %xmm0, %xmm1       ## xmm1 = xmm1[0,0],xmm0[3,0]
	shufps	$132, %xmm1, %xmm0      ## xmm0 = xmm0[0,1],xmm1[0,2]
	testb	$8, %dil
	jne	LBB0_9
	jmp	LBB0_10
LBB0_12:
	.loc	1 0 11 is_stmt 0        ## <stdin>:0:11
	movaps	LCPI0_2(%rip), %xmm1    ## xmm1 = [-4.2E+1,-4.2E+1,-4.2E+1,-4.2E+1]
	.loc	1 67 11                 ## <stdin>:67:11
	testb	$2, %dil
	je	LBB0_16
LBB0_15:                                ## %cond.load14
	movss	4(%rdx), %xmm2          ## xmm2 = mem[0],zero,zero,zero
	shufps	$0, %xmm1, %xmm2        ## xmm2 = xmm2[0,0],xmm1[0,0]
	shufps	$226, %xmm1, %xmm2      ## xmm2 = xmm2[2,0],xmm1[2,3]
	movaps	%xmm2, %xmm1
LBB0_16:                                ## %else15
	testb	$4, %dil
	jne	LBB0_17
## %bb.18:                              ## %else18
	testb	$8, %dil
	je	LBB0_20
LBB0_19:                                ## %cond.load20
	movss	12(%rdx), %xmm2         ## xmm2 = mem[0],zero,zero,zero
	shufps	$32, %xmm1, %xmm2       ## xmm2 = xmm2[0,0],xmm1[2,0]
	shufps	$36, %xmm2, %xmm1       ## xmm1 = xmm1[0,1],xmm2[2,0]
LBB0_20:
	.loc	1 39 11 is_stmt 1       ## <stdin>:39:11
	leaq	2(%rcx), %rdx
	.loc	1 40 11                 ## <stdin>:40:11
	cmpq	$3, %rdx
	.loc	1 0 0 is_stmt 0         ## <stdin>:0:0
	movaps	%xmm3, %xmm2
	.loc	1 41 5 is_stmt 1        ## <stdin>:41:5
	jg	LBB0_30
## %bb.21:
	.loc	1 50 11                 ## <stdin>:50:11
	leaq	(%rax,%rdx,4), %rdx
	.loc	1 51 11                 ## <stdin>:51:11
	leaq	(%rsi,%rdx,4), %rdx
	.loc	1 67 11                 ## <stdin>:67:11
	movdqa	LCPI0_3(%rip), %xmm2    ## xmm2 = [2147483648,2147483648]
	movdqa	%xmm9, %xmm4
	pxor	%xmm2, %xmm4
	movdqa	LCPI0_4(%rip), %xmm6    ## xmm6 = [2147483652,2147483652]
	movdqa	%xmm6, %xmm7
	pcmpeqd	%xmm4, %xmm7
	movdqa	%xmm6, %xmm5
	pcmpgtd	%xmm4, %xmm5
	pshufd	$160, %xmm5, %xmm4      ## xmm4 = xmm5[0,0,2,2]
	pand	%xmm7, %xmm4
	por	%xmm5, %xmm4
	pxor	%xmm8, %xmm2
	movdqa	%xmm6, %xmm5
	pcmpeqd	%xmm2, %xmm5
	pcmpgtd	%xmm2, %xmm6
	pshufd	$160, %xmm6, %xmm2      ## xmm2 = xmm6[0,0,2,2]
	pand	%xmm5, %xmm2
	por	%xmm6, %xmm2
	packssdw	%xmm2, %xmm4
	movmskps	%xmm4, %edi
	testb	$1, %dil
	je	LBB0_22
## %bb.23:                              ## %cond.load24
	movss	(%rdx), %xmm4           ## xmm4 = mem[0],zero,zero,zero
	movaps	LCPI0_5(%rip), %xmm2    ## xmm2 = <u,-4.2E+1,-4.2E+1,-4.2E+1>
	movss	%xmm4, %xmm2            ## xmm2 = xmm4[0],xmm2[1,2,3]
	testb	$2, %dil
	jne	LBB0_25
	jmp	LBB0_26
LBB0_17:                                ## %cond.load17
	movss	8(%rdx), %xmm2          ## xmm2 = mem[0],zero,zero,zero
	shufps	$48, %xmm1, %xmm2       ## xmm2 = xmm2[0,0],xmm1[3,0]
	shufps	$132, %xmm2, %xmm1      ## xmm1 = xmm1[0,1],xmm2[0,2]
	testb	$8, %dil
	jne	LBB0_19
	jmp	LBB0_20
LBB0_22:
	.loc	1 0 11 is_stmt 0        ## <stdin>:0:11
	movaps	LCPI0_2(%rip), %xmm2    ## xmm2 = [-4.2E+1,-4.2E+1,-4.2E+1,-4.2E+1]
	.loc	1 67 11                 ## <stdin>:67:11
	testb	$2, %dil
	je	LBB0_26
LBB0_25:                                ## %cond.load27
	movss	4(%rdx), %xmm4          ## xmm4 = mem[0],zero,zero,zero
	shufps	$0, %xmm2, %xmm4        ## xmm4 = xmm4[0,0],xmm2[0,0]
	shufps	$226, %xmm2, %xmm4      ## xmm4 = xmm4[2,0],xmm2[2,3]
	movaps	%xmm4, %xmm2
LBB0_26:                                ## %else28
	testb	$4, %dil
	jne	LBB0_27
## %bb.28:                              ## %else31
	testb	$8, %dil
	je	LBB0_30
LBB0_29:                                ## %cond.load33
	movss	12(%rdx), %xmm4         ## xmm4 = mem[0],zero,zero,zero
	shufps	$32, %xmm2, %xmm4       ## xmm4 = xmm4[0,0],xmm2[2,0]
	shufps	$36, %xmm4, %xmm2       ## xmm2 = xmm2[0,1],xmm4[2,0]
LBB0_30:
	.loc	1 39 11 is_stmt 1       ## <stdin>:39:11
	addq	$3, %rcx
	.loc	1 40 11                 ## <stdin>:40:11
	cmpq	$3, %rcx
	.loc	1 41 5                  ## <stdin>:41:5
	jg	LBB0_40
## %bb.31:
	.loc	1 50 11                 ## <stdin>:50:11
	leaq	(%rax,%rcx,4), %rax
	.loc	1 51 11                 ## <stdin>:51:11
	leaq	(%rsi,%rax,4), %rax
	.loc	1 67 11                 ## <stdin>:67:11
	movdqa	LCPI0_3(%rip), %xmm3    ## xmm3 = [2147483648,2147483648]
	pxor	%xmm3, %xmm9
	movdqa	LCPI0_4(%rip), %xmm4    ## xmm4 = [2147483652,2147483652]
	movdqa	%xmm4, %xmm5
	pcmpeqd	%xmm9, %xmm5
	movdqa	%xmm4, %xmm6
	pcmpgtd	%xmm9, %xmm6
	pshufd	$160, %xmm6, %xmm7      ## xmm7 = xmm6[0,0,2,2]
	pand	%xmm5, %xmm7
	por	%xmm6, %xmm7
	pxor	%xmm3, %xmm8
	movdqa	%xmm4, %xmm3
	pcmpeqd	%xmm8, %xmm3
	pcmpgtd	%xmm8, %xmm4
	pshufd	$160, %xmm4, %xmm5      ## xmm5 = xmm4[0,0,2,2]
	pand	%xmm3, %xmm5
	por	%xmm4, %xmm5
	packssdw	%xmm5, %xmm7
	movmskps	%xmm7, %ecx
	testb	$1, %cl
	je	LBB0_32
## %bb.33:                              ## %cond.load37
	movss	(%rax), %xmm4           ## xmm4 = mem[0],zero,zero,zero
	movaps	LCPI0_5(%rip), %xmm3    ## xmm3 = <u,-4.2E+1,-4.2E+1,-4.2E+1>
	movss	%xmm4, %xmm3            ## xmm3 = xmm4[0],xmm3[1,2,3]
	testb	$2, %cl
	jne	LBB0_35
	jmp	LBB0_36
LBB0_27:                                ## %cond.load30
	movss	8(%rdx), %xmm4          ## xmm4 = mem[0],zero,zero,zero
	shufps	$48, %xmm2, %xmm4       ## xmm4 = xmm4[0,0],xmm2[3,0]
	shufps	$132, %xmm4, %xmm2      ## xmm2 = xmm2[0,1],xmm4[0,2]
	testb	$8, %dil
	jne	LBB0_29
	jmp	LBB0_30
LBB0_32:
	.loc	1 0 11 is_stmt 0        ## <stdin>:0:11
	movaps	LCPI0_2(%rip), %xmm3    ## xmm3 = [-4.2E+1,-4.2E+1,-4.2E+1,-4.2E+1]
	.loc	1 67 11                 ## <stdin>:67:11
	testb	$2, %cl
	je	LBB0_36
LBB0_35:                                ## %cond.load40
	movss	4(%rax), %xmm4          ## xmm4 = mem[0],zero,zero,zero
	shufps	$0, %xmm3, %xmm4        ## xmm4 = xmm4[0,0],xmm3[0,0]
	shufps	$226, %xmm3, %xmm4      ## xmm4 = xmm4[2,0],xmm3[2,3]
	movaps	%xmm4, %xmm3
LBB0_36:                                ## %else41
	testb	$4, %cl
	jne	LBB0_37
## %bb.38:                              ## %else44
	testb	$8, %cl
	je	LBB0_40
LBB0_39:                                ## %cond.load46
	movss	12(%rax), %xmm4         ## xmm4 = mem[0],zero,zero,zero
	shufps	$32, %xmm3, %xmm4       ## xmm4 = xmm4[0,0],xmm3[2,0]
	shufps	$36, %xmm4, %xmm3       ## xmm3 = xmm3[0,1],xmm4[2,0]
LBB0_40:
	.loc	1 102 5 is_stmt 1       ## <stdin>:102:5
	retq
LBB0_37:                                ## %cond.load43
	.loc	1 67 11                 ## <stdin>:67:11
	movss	8(%rax), %xmm4          ## xmm4 = mem[0],zero,zero,zero
	shufps	$48, %xmm3, %xmm4       ## xmm4 = xmm4[0,0],xmm3[3,0]
	shufps	$132, %xmm4, %xmm3      ## xmm3 = xmm3[0,1],xmm4[0,2]
	testb	$8, %cl
	jne	LBB0_39
	jmp	LBB0_40
Ltmp1:
Lfunc_end0:
	.cfi_endproc
                                        ## -- End function
	.globl	_matMul                 ## -- Begin function matMul
	.p2align	4, 0x90
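
For contrast, the whole-memref read I would expect amounts to a single vector-typed load. A minimal hand-written sketch (hypothetical function name; assumes both bases are zero so that the access is fully in bounds):

func @transfer_read_2d_whole(%A : memref<4x4xf32>) -> vector<4x4xf32> {
  // Reinterpret the 4x4 memref as a memref holding a single 4x4 vector.
  %vA = vector.type_cast %A : memref<4x4xf32> to memref<vector<4x4xf32>>
  // One load then moves the entire block, essentially a memcpy.
  %f = load %vA[] : memref<vector<4x4xf32>>
  return %f : vector<4x4xf32>
}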

I have asked on the MLIR forum to see whether there is something very wrong in what I am doing.

@AlexandreEichenberger (Collaborator, Author)

@tjingrant here is the comparison you asked for.

AlexandreEichenberger self-assigned this and added the KRNL IR and lowering label on Jul 6, 2020.

@AlexandreEichenberger (Collaborator, Author) commented on Jul 16, 2020

Found that MLIR has added affine vector load and store operations (affine.vector_load and affine.vector_store). This is even easier, so no changes to MLIR are needed. We may still need support for partial reads and writes.

func @matMul(%A : memref<4x4xf32>, %B : memref<4x4xf32>, %C : memref<4x4xf32>) -> memref<4x4xf32> {
  %i0 = constant 0 : index
  affine.for %i = 0 to 4 {
    affine.for %k = 0 to 4 {
      %a = affine.load %A[%i, %k] : memref<4x4xf32>
      %va = splat %a : vector<4xf32>
      %vb = affine.vector_load %B[%k, %i0] : memref<4x4xf32>, vector<4xf32>
      %vm = mulf %va, %vb : vector<4xf32>
      %vc = affine.vector_load %C[%i, %i0] : memref<4x4xf32>, vector<4xf32>
      %vres = addf %vm, %vc : vector<4xf32>
      affine.vector_store %vres, %C[%i, %i0] : memref<4x4xf32>, vector<4xf32>
    }
  }
  return %C : memref<4x4xf32>
}

This yields the results listed in the third column of the table below.

Ops | Multi-dim vectors (#1) | Array of simple vectors (#2) | Affine vector memory (#3, here)
--- | --- | --- | ---
Multiplications (mulps) | 16 | 16 | 16
Add | 64 (adds) | 16 (adds) | 16
Load/store/move | 69 (movaps) | 44 ops with 16 (movaps), 12 (movq), 16 (movss) | 72 ops with 36 (movaps), 16 (movq), 16 (movss)
Unpack (unpcklps) | 9 | 0 | 0
Shuffle (shufps) | 53 | 16 | 16
Total | 256 | 94 | 119

Interestingly, the new approach is nearly as good as the simple-vector approach, but it uses unaligned loads, which can probably be fixed (see the sketch below). Also, the number of memory ops is well above the minimum, but again we can probably handle that.
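
One possible remedy for the alignment issue (a sketch, an assumption on my part rather than a verified fix): request aligned allocations for buffers we control, so the backend can prove alignment and emit aligned moves. Arguments arriving from callers would additionally need an ABI-level alignment guarantee.

// std.alloc takes an optional alignment attribute, in bytes.
%buf = alloc() {alignment = 16 : i64} : memref<4x4xf32>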
