
SIMD code generation #206

Open
AlexandreEichenberger opened this issue Jul 6, 2020 · 2 comments

Labels: KRNL IR and lowering (Support for lowering of KRNL IR to lower MLIR dialects.)

@AlexandreEichenberger (Collaborator)

Comparing (on x86) the code that MLIR generates for two approaches.

Approach 1: Multi-dimensional vectors using vector.contract.

This approach uses multi-dimensional vectors and affine maps to define reductions and parallel operations. It currently works only on entire vectors. Subsets can be extracted, but the indices of where to extract must be literals (see the vector.extract example after the code below). This is high level, with the good (generic, abstract, uses the full infrastructure) and the potentially problematic (what code is generated, how to apply specific tricks).

#contraction_accesses = [
  affine_map<(i, j, k) -> (i, k)>,
  affine_map<(i, j, k) -> (k, j)>,
  affine_map<(i, j, k) -> (i, j)>
]
#contraction_trait = {
  indexing_maps = #contraction_accesses,
  iterator_types = ["parallel", "parallel", "reduction"]
}
func @matMul(%A : vector<4x4xf32>, %B : vector<4x4xf32>, %C : vector<4x4xf32>) -> vector<4x4xf32> {
  %res = vector.contract #contraction_trait %A, %B, %C
    : vector<4x4xf32>, vector<4x4xf32> into vector<4x4xf32>
  return %res : vector<4x4xf32>
}
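
As an aside on the literal-index restriction mentioned above: extracting a row of a multi-dimensional vector uses vector.extract, whose position is an attribute and therefore must be a compile-time literal (a minimal sketch):

%row = vector.extract %A[1] : vector<4x4xf32>  // takes row 1; %row has type vector<4xf32>; the index cannot be a runtime value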

Approach 2: Arrays of simple vectors

This approach uses arbitrary-dimension memrefs of small vectors (e.g., vector<8xf32>). Vectorization is done directly by loading, splatting, and performing SIMD operations. This is low level, with the good (full control, apply arbitrary patterns) and the potentially problematic (all of the details must be coded, specific to an architecture); a variant with the accumulator hoisted out of the reduction loop is sketched after the code below.

func @matMul(%A : memref<4x4xf32>, %B : memref<4xvector<4xf32>>, %C : memref<4xvector<4xf32>>) -> memref<4xvector<4xf32>> {
  affine.for %i = 0 to 4 {
    affine.for %k = 0 to 4 {
      %a = affine.load %A[%i, %k] : memref<4x4xf32>
      %va = splat %a : vector<4xf32>
      %vb = affine.load %B[%k] : memref<4xvector<4xf32>>
      %vm = mulf %va, %vb : vector<4xf32>
      %vc = affine.load %C[%i] : memref<4xvector<4xf32>>
      %vres = addf %vm, %vc : vector<4xf32>
      affine.store %vres, %C[%i] : memref<4xvector<4xf32>>
    }
  }
  return %C : memref<4xvector<4xf32>>
}
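
As mentioned above, here is a variant that hoists the row of C out of the reduction loop, cutting the loads and stores of C from 16 each down to 4 each. This is a minimal sketch, assuming scf.for's loop-carried values (iter_args); the function name matMulHoisted is hypothetical:

func @matMulHoisted(%A : memref<4x4xf32>, %B : memref<4xvector<4xf32>>, %C : memref<4xvector<4xf32>>) {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %c4 = constant 4 : index
  affine.for %i = 0 to 4 {
    // Load the C row once per i iteration instead of once per (i, k) pair.
    %vc = affine.load %C[%i] : memref<4xvector<4xf32>>
    // Carry the accumulator through the reduction as a loop-carried value.
    %vsum = scf.for %k = %c0 to %c4 step %c1 iter_args(%acc = %vc) -> (vector<4xf32>) {
      %a = load %A[%i, %k] : memref<4x4xf32>
      %va = splat %a : vector<4xf32>
      %vb = load %B[%k] : memref<4xvector<4xf32>>
      %vm = mulf %va, %vb : vector<4xf32>
      %vres = addf %vm, %acc : vector<4xf32>
      scf.yield %vres : vector<4xf32>
    }
    // Store the accumulated row once per i iteration.
    affine.store %vsum, %C[%i] : memref<4xvector<4xf32>>
  }
  return
}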

Comparison.

Running both examples through mlir-opt --convert-vector-to-scf --lower-affine --convert-scf-to-std --convert-vector-to-llvm test.mlir | mlir-translate -mlir-to-llvmir | opt -O3 -S | llc -O3 and counting instructions, I got the following.

Ops | Multi-dim vectors (#1) | Array of simple vectors (#2)
--- | --- | ---
Multiplications (mulps) | 16 | 16
Add | 64 (adds) | 16 (adds)
Load/store/move | 69 (movaps) | 44 ops with 16 (movaps), 12 (movq), 16 (movss)
Unpack (unpcklps) | 9 | 0
Shuffle (shufps) | 53 | 16
Total | 256 | 94

Tracking the add operations, the first approach failed to simdize the additions, whereas the second approach succeeded. Memory operations could be as low as 12 loads and 4 stores (presumably 4 vector loads each for the rows of A, B, and C, plus 4 vector stores of C); if load/splat is used instead, the 12 loads expand to 24 loads/load-splats (16 scalar load/splats for the elements of A, plus the 8 vector loads for B and C). Some of the memory traffic may be due to the calling convention.

I am sure that both can eventually be fixed, but the second approach generates better code at this time. It was also validated in a comparison against BLAS routines on x86, achieving performance nearly as good as the most optimized routines. That version used heavily tiled loop nests, with buffers and a transposed buffer for cache locality.

Other issues with multi-dim vectors (currently being investigated)

To generate vectors, the vector dialect recommends vector.transfer_read and its write equivalent. Initial investigation shows that the simple pattern below generates very long code. Ideally, for this simple case, a simple memcpy should be used.

func @transfer_read_2d(%A : memref<4x4xf32>, %base1: index, %base2: index) -> vector<4x4xf32>{
  %fm42 = constant -42.0: f32
  %f = vector.transfer_read %A[%base1, %base2], %fm42
      //{permutation_map = affine_map<(d0, d1) -> (d0, d1)>} 
    : memref<4x4xf32>, vector<4x4xf32>
  return %f : vector<4x4xf32>
}

results in the asm below. Note that the transfer goes from a 4x4 memref to a 4x4 vector, so no padding or masking of any kind should be needed. I do not understand the code below at this time; for contrast, a hand-written sketch of the expected lowering follows the listing.

_transfer_read_2d:                      ## @transfer_read_2d
Lfunc_begin0:
	.file	1 "/Users/alexe/MLIR/atests/simd/<stdin>"
	.loc	1 5 0                   ## <stdin>:5:0
	.cfi_sections .debug_frame
	.cfi_startproc
## %bb.0:
	movq	24(%rsp), %rax
	movq	16(%rsp), %rcx
Ltmp0:
	.loc	1 0 0 prologue_end      ## <stdin>:0:0
	movq	%rax, %xmm0
	pshufd	$68, %xmm0, %xmm9       ## xmm9 = xmm0[0,1,0,1]
	movdqa	LCPI0_0(%rip), %xmm8    ## xmm8 = [2,3]
	paddq	%xmm9, %xmm8
	paddq	LCPI0_1(%rip), %xmm9
	movaps	LCPI0_2(%rip), %xmm3    ## xmm3 = [-4.2E+1,-4.2E+1,-4.2E+1,-4.2E+1]
	.loc	1 40 11                 ## <stdin>:40:11
	cmpq	$3, %rcx
	.loc	1 0 0 is_stmt 0         ## <stdin>:0:0
	movaps	%xmm3, %xmm0
	.loc	1 41 5 is_stmt 1        ## <stdin>:41:5
	jg	LBB0_10
## %bb.1:
	.loc	1 50 11                 ## <stdin>:50:11
	leaq	(%rax,%rcx,4), %rdx
	.loc	1 51 11                 ## <stdin>:51:11
	leaq	(%rsi,%rdx,4), %rdx
	.loc	1 67 11                 ## <stdin>:67:11
	movdqa	LCPI0_3(%rip), %xmm0    ## xmm0 = [2147483648,2147483648]
	movdqa	%xmm9, %xmm1
	pxor	%xmm0, %xmm1
	movdqa	LCPI0_4(%rip), %xmm2    ## xmm2 = [2147483652,2147483652]
	movdqa	%xmm2, %xmm6
	pcmpeqd	%xmm1, %xmm6
	movdqa	%xmm2, %xmm7
	pcmpgtd	%xmm1, %xmm7
	pshufd	$160, %xmm7, %xmm1      ## xmm1 = xmm7[0,0,2,2]
	pand	%xmm6, %xmm1
	por	%xmm7, %xmm1
	pxor	%xmm8, %xmm0
	movdqa	%xmm2, %xmm6
	pcmpeqd	%xmm0, %xmm6
	pcmpgtd	%xmm0, %xmm2
	pshufd	$160, %xmm2, %xmm0      ## xmm0 = xmm2[0,0,2,2]
	pand	%xmm6, %xmm0
	por	%xmm2, %xmm0
	packssdw	%xmm0, %xmm1
	movmskps	%xmm1, %edi
	testb	$1, %dil
	je	LBB0_2
## %bb.3:                               ## %cond.load
	movss	(%rdx), %xmm1           ## xmm1 = mem[0],zero,zero,zero
	movaps	LCPI0_5(%rip), %xmm0    ## xmm0 = <u,-4.2E+1,-4.2E+1,-4.2E+1>
	movss	%xmm1, %xmm0            ## xmm0 = xmm1[0],xmm0[1,2,3]
	testb	$2, %dil
	jne	LBB0_5
	jmp	LBB0_6
LBB0_2:
	.loc	1 0 11 is_stmt 0        ## <stdin>:0:11
	movaps	LCPI0_2(%rip), %xmm0    ## xmm0 = [-4.2E+1,-4.2E+1,-4.2E+1,-4.2E+1]
	.loc	1 67 11                 ## <stdin>:67:11
	testb	$2, %dil
	je	LBB0_6
LBB0_5:                                 ## %cond.load1
	movss	4(%rdx), %xmm1          ## xmm1 = mem[0],zero,zero,zero
	shufps	$0, %xmm0, %xmm1        ## xmm1 = xmm1[0,0],xmm0[0,0]
	shufps	$226, %xmm0, %xmm1      ## xmm1 = xmm1[2,0],xmm0[2,3]
	movaps	%xmm1, %xmm0
LBB0_6:                                 ## %else2
	testb	$4, %dil
	jne	LBB0_7
## %bb.8:                               ## %else5
	testb	$8, %dil
	je	LBB0_10
LBB0_9:                                 ## %cond.load7
	movss	12(%rdx), %xmm1         ## xmm1 = mem[0],zero,zero,zero
	shufps	$32, %xmm0, %xmm1       ## xmm1 = xmm1[0,0],xmm0[2,0]
	shufps	$36, %xmm1, %xmm0       ## xmm0 = xmm0[0,1],xmm1[2,0]
LBB0_10:
	.loc	1 39 11 is_stmt 1       ## <stdin>:39:11
	leaq	1(%rcx), %rdx
	.loc	1 40 11                 ## <stdin>:40:11
	cmpq	$3, %rdx
	.loc	1 0 0 is_stmt 0         ## <stdin>:0:0
	movaps	%xmm3, %xmm1
	.loc	1 41 5 is_stmt 1        ## <stdin>:41:5
	jg	LBB0_20
## %bb.11:
	.loc	1 50 11                 ## <stdin>:50:11
	leaq	(%rax,%rdx,4), %rdx
	.loc	1 51 11                 ## <stdin>:51:11
	leaq	(%rsi,%rdx,4), %rdx
	.loc	1 67 11                 ## <stdin>:67:11
	movdqa	LCPI0_3(%rip), %xmm1    ## xmm1 = [2147483648,2147483648]
	movdqa	%xmm9, %xmm2
	pxor	%xmm1, %xmm2
	movdqa	LCPI0_4(%rip), %xmm6    ## xmm6 = [2147483652,2147483652]
	movdqa	%xmm6, %xmm7
	pcmpeqd	%xmm2, %xmm7
	movdqa	%xmm6, %xmm4
	pcmpgtd	%xmm2, %xmm4
	pshufd	$160, %xmm4, %xmm2      ## xmm2 = xmm4[0,0,2,2]
	pand	%xmm7, %xmm2
	por	%xmm4, %xmm2
	pxor	%xmm8, %xmm1
	movdqa	%xmm6, %xmm4
	pcmpeqd	%xmm1, %xmm4
	pcmpgtd	%xmm1, %xmm6
	pshufd	$160, %xmm6, %xmm1      ## xmm1 = xmm6[0,0,2,2]
	pand	%xmm4, %xmm1
	por	%xmm6, %xmm1
	packssdw	%xmm1, %xmm2
	movmskps	%xmm2, %edi
	testb	$1, %dil
	je	LBB0_12
## %bb.13:                              ## %cond.load11
	movss	(%rdx), %xmm2           ## xmm2 = mem[0],zero,zero,zero
	movaps	LCPI0_5(%rip), %xmm1    ## xmm1 = <u,-4.2E+1,-4.2E+1,-4.2E+1>
	movss	%xmm2, %xmm1            ## xmm1 = xmm2[0],xmm1[1,2,3]
	testb	$2, %dil
	jne	LBB0_15
	jmp	LBB0_16
LBB0_7:                                 ## %cond.load4
	movss	8(%rdx), %xmm1          ## xmm1 = mem[0],zero,zero,zero
	shufps	$48, %xmm0, %xmm1       ## xmm1 = xmm1[0,0],xmm0[3,0]
	shufps	$132, %xmm1, %xmm0      ## xmm0 = xmm0[0,1],xmm1[0,2]
	testb	$8, %dil
	jne	LBB0_9
	jmp	LBB0_10
LBB0_12:
	.loc	1 0 11 is_stmt 0        ## <stdin>:0:11
	movaps	LCPI0_2(%rip), %xmm1    ## xmm1 = [-4.2E+1,-4.2E+1,-4.2E+1,-4.2E+1]
	.loc	1 67 11                 ## <stdin>:67:11
	testb	$2, %dil
	je	LBB0_16
LBB0_15:                                ## %cond.load14
	movss	4(%rdx), %xmm2          ## xmm2 = mem[0],zero,zero,zero
	shufps	$0, %xmm1, %xmm2        ## xmm2 = xmm2[0,0],xmm1[0,0]
	shufps	$226, %xmm1, %xmm2      ## xmm2 = xmm2[2,0],xmm1[2,3]
	movaps	%xmm2, %xmm1
LBB0_16:                                ## %else15
	testb	$4, %dil
	jne	LBB0_17
## %bb.18:                              ## %else18
	testb	$8, %dil
	je	LBB0_20
LBB0_19:                                ## %cond.load20
	movss	12(%rdx), %xmm2         ## xmm2 = mem[0],zero,zero,zero
	shufps	$32, %xmm1, %xmm2       ## xmm2 = xmm2[0,0],xmm1[2,0]
	shufps	$36, %xmm2, %xmm1       ## xmm1 = xmm1[0,1],xmm2[2,0]
LBB0_20:
	.loc	1 39 11 is_stmt 1       ## <stdin>:39:11
	leaq	2(%rcx), %rdx
	.loc	1 40 11                 ## <stdin>:40:11
	cmpq	$3, %rdx
	.loc	1 0 0 is_stmt 0         ## <stdin>:0:0
	movaps	%xmm3, %xmm2
	.loc	1 41 5 is_stmt 1        ## <stdin>:41:5
	jg	LBB0_30
## %bb.21:
	.loc	1 50 11                 ## <stdin>:50:11
	leaq	(%rax,%rdx,4), %rdx
	.loc	1 51 11                 ## <stdin>:51:11
	leaq	(%rsi,%rdx,4), %rdx
	.loc	1 67 11                 ## <stdin>:67:11
	movdqa	LCPI0_3(%rip), %xmm2    ## xmm2 = [2147483648,2147483648]
	movdqa	%xmm9, %xmm4
	pxor	%xmm2, %xmm4
	movdqa	LCPI0_4(%rip), %xmm6    ## xmm6 = [2147483652,2147483652]
	movdqa	%xmm6, %xmm7
	pcmpeqd	%xmm4, %xmm7
	movdqa	%xmm6, %xmm5
	pcmpgtd	%xmm4, %xmm5
	pshufd	$160, %xmm5, %xmm4      ## xmm4 = xmm5[0,0,2,2]
	pand	%xmm7, %xmm4
	por	%xmm5, %xmm4
	pxor	%xmm8, %xmm2
	movdqa	%xmm6, %xmm5
	pcmpeqd	%xmm2, %xmm5
	pcmpgtd	%xmm2, %xmm6
	pshufd	$160, %xmm6, %xmm2      ## xmm2 = xmm6[0,0,2,2]
	pand	%xmm5, %xmm2
	por	%xmm6, %xmm2
	packssdw	%xmm2, %xmm4
	movmskps	%xmm4, %edi
	testb	$1, %dil
	je	LBB0_22
## %bb.23:                              ## %cond.load24
	movss	(%rdx), %xmm4           ## xmm4 = mem[0],zero,zero,zero
	movaps	LCPI0_5(%rip), %xmm2    ## xmm2 = <u,-4.2E+1,-4.2E+1,-4.2E+1>
	movss	%xmm4, %xmm2            ## xmm2 = xmm4[0],xmm2[1,2,3]
	testb	$2, %dil
	jne	LBB0_25
	jmp	LBB0_26
LBB0_17:                                ## %cond.load17
	movss	8(%rdx), %xmm2          ## xmm2 = mem[0],zero,zero,zero
	shufps	$48, %xmm1, %xmm2       ## xmm2 = xmm2[0,0],xmm1[3,0]
	shufps	$132, %xmm2, %xmm1      ## xmm1 = xmm1[0,1],xmm2[0,2]
	testb	$8, %dil
	jne	LBB0_19
	jmp	LBB0_20
LBB0_22:
	.loc	1 0 11 is_stmt 0        ## <stdin>:0:11
	movaps	LCPI0_2(%rip), %xmm2    ## xmm2 = [-4.2E+1,-4.2E+1,-4.2E+1,-4.2E+1]
	.loc	1 67 11                 ## <stdin>:67:11
	testb	$2, %dil
	je	LBB0_26
LBB0_25:                                ## %cond.load27
	movss	4(%rdx), %xmm4          ## xmm4 = mem[0],zero,zero,zero
	shufps	$0, %xmm2, %xmm4        ## xmm4 = xmm4[0,0],xmm2[0,0]
	shufps	$226, %xmm2, %xmm4      ## xmm4 = xmm4[2,0],xmm2[2,3]
	movaps	%xmm4, %xmm2
LBB0_26:                                ## %else28
	testb	$4, %dil
	jne	LBB0_27
## %bb.28:                              ## %else31
	testb	$8, %dil
	je	LBB0_30
LBB0_29:                                ## %cond.load33
	movss	12(%rdx), %xmm4         ## xmm4 = mem[0],zero,zero,zero
	shufps	$32, %xmm2, %xmm4       ## xmm4 = xmm4[0,0],xmm2[2,0]
	shufps	$36, %xmm4, %xmm2       ## xmm2 = xmm2[0,1],xmm4[2,0]
LBB0_30:
	.loc	1 39 11 is_stmt 1       ## <stdin>:39:11
	addq	$3, %rcx
	.loc	1 40 11                 ## <stdin>:40:11
	cmpq	$3, %rcx
	.loc	1 41 5                  ## <stdin>:41:5
	jg	LBB0_40
## %bb.31:
	.loc	1 50 11                 ## <stdin>:50:11
	leaq	(%rax,%rcx,4), %rax
	.loc	1 51 11                 ## <stdin>:51:11
	leaq	(%rsi,%rax,4), %rax
	.loc	1 67 11                 ## <stdin>:67:11
	movdqa	LCPI0_3(%rip), %xmm3    ## xmm3 = [2147483648,2147483648]
	pxor	%xmm3, %xmm9
	movdqa	LCPI0_4(%rip), %xmm4    ## xmm4 = [2147483652,2147483652]
	movdqa	%xmm4, %xmm5
	pcmpeqd	%xmm9, %xmm5
	movdqa	%xmm4, %xmm6
	pcmpgtd	%xmm9, %xmm6
	pshufd	$160, %xmm6, %xmm7      ## xmm7 = xmm6[0,0,2,2]
	pand	%xmm5, %xmm7
	por	%xmm6, %xmm7
	pxor	%xmm3, %xmm8
	movdqa	%xmm4, %xmm3
	pcmpeqd	%xmm8, %xmm3
	pcmpgtd	%xmm8, %xmm4
	pshufd	$160, %xmm4, %xmm5      ## xmm5 = xmm4[0,0,2,2]
	pand	%xmm3, %xmm5
	por	%xmm4, %xmm5
	packssdw	%xmm5, %xmm7
	movmskps	%xmm7, %ecx
	testb	$1, %cl
	je	LBB0_32
## %bb.33:                              ## %cond.load37
	movss	(%rax), %xmm4           ## xmm4 = mem[0],zero,zero,zero
	movaps	LCPI0_5(%rip), %xmm3    ## xmm3 = <u,-4.2E+1,-4.2E+1,-4.2E+1>
	movss	%xmm4, %xmm3            ## xmm3 = xmm4[0],xmm3[1,2,3]
	testb	$2, %cl
	jne	LBB0_35
	jmp	LBB0_36
LBB0_27:                                ## %cond.load30
	movss	8(%rdx), %xmm4          ## xmm4 = mem[0],zero,zero,zero
	shufps	$48, %xmm2, %xmm4       ## xmm4 = xmm4[0,0],xmm2[3,0]
	shufps	$132, %xmm4, %xmm2      ## xmm2 = xmm2[0,1],xmm4[0,2]
	testb	$8, %dil
	jne	LBB0_29
	jmp	LBB0_30
LBB0_32:
	.loc	1 0 11 is_stmt 0        ## <stdin>:0:11
	movaps	LCPI0_2(%rip), %xmm3    ## xmm3 = [-4.2E+1,-4.2E+1,-4.2E+1,-4.2E+1]
	.loc	1 67 11                 ## <stdin>:67:11
	testb	$2, %cl
	je	LBB0_36
LBB0_35:                                ## %cond.load40
	movss	4(%rax), %xmm4          ## xmm4 = mem[0],zero,zero,zero
	shufps	$0, %xmm3, %xmm4        ## xmm4 = xmm4[0,0],xmm3[0,0]
	shufps	$226, %xmm3, %xmm4      ## xmm4 = xmm4[2,0],xmm3[2,3]
	movaps	%xmm4, %xmm3
LBB0_36:                                ## %else41
	testb	$4, %cl
	jne	LBB0_37
## %bb.38:                              ## %else44
	testb	$8, %cl
	je	LBB0_40
LBB0_39:                                ## %cond.load46
	movss	12(%rax), %xmm4         ## xmm4 = mem[0],zero,zero,zero
	shufps	$32, %xmm3, %xmm4       ## xmm4 = xmm4[0,0],xmm3[2,0]
	shufps	$36, %xmm4, %xmm3       ## xmm3 = xmm3[0,1],xmm4[2,0]
LBB0_40:
	.loc	1 102 5 is_stmt 1       ## <stdin>:102:5
	retq
LBB0_37:                                ## %cond.load43
	.loc	1 67 11                 ## <stdin>:67:11
	movss	8(%rax), %xmm4          ## xmm4 = mem[0],zero,zero,zero
	shufps	$48, %xmm3, %xmm4       ## xmm4 = xmm4[0,0],xmm3[3,0]
	shufps	$132, %xmm4, %xmm3      ## xmm3 = xmm3[0,1],xmm4[0,2]
	testb	$8, %cl
	jne	LBB0_39
	jmp	LBB0_40
Ltmp1:
Lfunc_end0:
	.cfi_endproc
                                        ## -- End function
	.globl	_matMul                 ## -- Begin function matMul
	.p2align	4, 0x90
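
For contrast, the whole-memref read I would expect amounts to a single vector-typed load. A minimal hand-written sketch (hypothetical function name; assumes both bases are zero so that the access is fully in bounds):

func @transfer_read_2d_whole(%A : memref<4x4xf32>) -> vector<4x4xf32> {
  // Reinterpret the 4x4 memref as a memref holding a single 4x4 vector.
  %vA = vector.type_cast %A : memref<4x4xf32> to memref<vector<4x4xf32>>
  // One load then moves the entire block, essentially a memcpy.
  %f = load %vA[] : memref<vector<4x4xf32>>
  return %f : vector<4x4xf32>
}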

I have asked on the MLIR forum to see whether there is something very wrong in what I am doing.

@AlexandreEichenberger (Collaborator, Author)

@tjingrant here is the comparison you asked for.

AlexandreEichenberger self-assigned this and added the KRNL IR and lowering label on Jul 6, 2020.

@AlexandreEichenberger (Collaborator, Author) commented on Jul 16, 2020

Found that MLIR has added affine vector load and store operations (affine.vector_load and affine.vector_store). This is even easier, so no changes to MLIR are needed. We may still need support for partial reads and writes.

func @matMul(%A : memref<4x4xf32>, %B : memref<4x4xf32>, %C : memref<4x4xf32>) -> memref<4x4xf32> {
  %i0 = constant 0 : index
  affine.for %i = 0 to 4 {
    affine.for %k = 0 to 4 {
      %a = affine.load %A[%i, %k] : memref<4x4xf32>
      %va = splat %a : vector<4xf32>
      %vb = affine.vector_load %B[%k, %i0] : memref<4x4xf32>, vector<4xf32>
      %vm = mulf %va, %vb : vector<4xf32>
      %vc = affine.vector_load %C[%i, %i0] : memref<4x4xf32>, vector<4xf32>
      %vres = addf %vm, %vc : vector<4xf32>
      affine.vector_store %vres, %C[%i, %i0] : memref<4x4xf32>, vector<4xf32>
    }
  }
  return %C : memref<4x4xf32>
}

This yields the results listed in the third column of the table below.

Ops | Multi-dim vectors (#1) | Array of simple vectors (#2) | Affine vector memory (#3, here)
--- | --- | --- | ---
Multiplications (mulps) | 16 | 16 | 16
Add | 64 (adds) | 16 (adds) | 16
Load/store/move | 69 (movaps) | 44 ops with 16 (movaps), 12 (movq), 16 (movss) | 72 ops with 36 (movaps), 16 (movq), 16 (movss)
Unpack (unpcklps) | 9 | 0 | 0
Shuffle (shufps) | 53 | 16 | 16
Total | 256 | 94 | 119

Interestingly, the new approach is nearly as good as the simple-vector approach, but it uses unaligned loads, which can probably be fixed (see the sketch below). Also, the number of memory ops is well above the minimum, but again we can probably handle that.
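
One possible remedy for the alignment issue (a sketch, an assumption on my part rather than a verified fix): request aligned allocations for buffers we control, so the backend can prove alignment and emit aligned moves. Arguments arriving from callers would additionally need an ABI-level alignment guarantee.

// std.alloc takes an optional alignment attribute, in bytes.
%buf = alloc() {alignment = 16 : i64} : memref<4x4xf32>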
