Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Heuristic mapper produce worst result when size of CGRA increase #16

Closed
KelvinChung2000 opened this issue Feb 27, 2024 · 1 comment
Closed

Comments

@KelvinChung2000
Copy link

I am trying to map the following kernel.

#include <string.h>
#include <stdio.h>


#define SIZE  20
int A[SIZE], B[SIZE], C[SIZE];

void external_fun() {C[0] = 7;}

__attribute__((noinline))
void array_cond(){

   int x;
   for (int i=0;i<SIZE; i++){
	if (A[i] > B[i]){
		C[i] = A[i] - B[i];
		if (C[i] < 0)
			C[i] = 0;
		else
			C[i] = A[i] * B[i];
	}
	else{
		C[i] = A[i] + B[i];
		if (C[i] > 10)
			C[i] = 10;
		else
			C[i] = A[i] - B[i];
	}
   }
}

int main(){

int i,j;

for (i=0;i<SIZE; i++){
      A[i] = i * 2 + 5;
      B[i] = i * 3 + 8;
      C[i] = 0;
    }
    
array_cond();

for (i=0;i<SIZE; i++) printf("%d\n", C[i]);
    
return 0;

}

When the CGRA size is 4x4 the mapper produces a mapping with II=13. When the size is 8x8, the mapped II=19.

This is the configuration used

{
  "kernel"                : "array_cond",
  "targetFunction"        : false,
  "targetNested"          : false,
  "targetLoopsID"         : [0],
  "doCGRAMapping"         : true,
  "row"                   : 8,
  "column"                : 8,
  "precisionAware"        : false,
  "heterogeneity"         : false,
  "isTrimmedDemo"         : true,
  "heuristicMapping"      : true,
  "parameterizableCGRA"   : false,
  "diagonalVectorization" : false,
  "bypassConstraint"      : 4,
  "isStaticElasticCGRA"   : false,
  "ctrlMemConstraint"     : 10,
  "regConstraint"         : 8,
  "optLatency"            : {
                              "load" : 2,
                              "store": 2
                            },
  "optPipelined"          : ["load", "store"],
  "additionalFunc"        : {
                              "load" : [],
                              "store": []
                            }
}

@tancheng
Copy link
Owner

Hi Kelvin, you can remove optLatency, optPipelined, and additionalFunc from the param.json, they are just used for evaluating specialized FUs.

Unfortunately, we still get II=12 on 4x4, and we may get worse mapping on 8x8 (due to the large exploration space for mapping). The reason is that we support control flows but didn't well optimize it. See the generated DFG as follows, the DFG looks fine/reasonable to me. Due to the if/else the kernel contains, the recurrence dependence cycle length is 12, which makes the mapping cannot achieve II less than 12. If you look into the generated kernel.ll (by modifying the dot.sh), you can figure out why there are so many blue arrows connecting br nodes with other nodes (let me know if you need some explanations).

So in conclusion, the control flows are not well optimized, which leads to "bad" performance. This can be potentially improved by manipulating the DFG with appropriate HW support.
kernel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants