[AIE2P] Legalize and select VMUL.f from G_FMUL #360
base: aie-public
02cc673 to 872e379
@@ -40,6 +40,8 @@ class VecConf {
  int BMODE_16x16_b = 1;
  int BMODE_32x16 = 0;
Funny to have aliases here.
@@ -40,6 +40,8 @@ class VecConf {
  int BMODE_16x16_b = 1;
  int BMODE_32x16 = 0;

  int VARIANT_BF16xBF16_1_elem_1 = 1;
Sounds as if there are more variants. List them all in one go?
I could, but I'm not sure we will ever be able to use all of them in any patterns.
It looks like the translation of a hardware enumeration into tablegen speak. I'm hoping that one day we'll have a single point of definition for these, and the full list would make them more recognisable.
@@ -59,6 +61,7 @@ class VecConf {
}

def accfp32_vecconf : VecConf { let amode = AMODE_FP32; let bmode = BMODE_16x16; }
def mulbf16_vecconf : VecConf { let amode = AMODE_FP32; let bmode = BMODE_16x16; let cmode = VARIANT_BF16xBF16_1_elem_1; }
Since this is a local definition, I wouldn't mind using CMODE as prefix.
    sub_1024_acc_hi)),
    sub_512_hi))>;

def : Pat<(v32bf16 (fmul v32bf16:$vec1, v32bf16:$vec2)),
This isn't a standard legalization?
For this case, I don't know of any, but for the wider v64bf16 case above we could possibly use .fewerElements to keep only one pattern. I will try it.
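The idea behind reducing the element count can be sketched outside of LLVM. The following is only a standalone model, not the legalizer code from this PR: it assumes float as a stand-in for bf16, and the names mulNarrow and mulWideBySplitting are hypothetical. It shows a 64-lane multiply being expressed as two 32-lane multiplies, so only the narrow operation needs a direct implementation.

```cpp
#include <array>
#include <cstddef>

// Hypothetical model: float stands in for bf16.
constexpr std::size_t kNarrowLanes = 32; // models v32bf16
constexpr std::size_t kWideLanes = 64;   // models v64bf16
using NarrowVec = std::array<float, kNarrowLanes>;
using WideVec = std::array<float, kWideLanes>;

// The only multiply that is implemented directly in this model.
NarrowVec mulNarrow(const NarrowVec &a, const NarrowVec &b) {
  NarrowVec out{};
  for (std::size_t i = 0; i < kNarrowLanes; ++i)
    out[i] = a[i] * b[i];
  return out;
}

// The wide multiply is split into narrow pieces and the results are
// concatenated again, which is the effect a fewer-elements rule aims for.
WideVec mulWideBySplitting(const WideVec &a, const WideVec &b) {
  WideVec out{};
  for (std::size_t part = 0; part < kWideLanes / kNarrowLanes; ++part) {
    const std::size_t base = part * kNarrowLanes;
    NarrowVec pa{};
    NarrowVec pb{};
    for (std::size_t i = 0; i < kNarrowLanes; ++i) {
      pa[i] = a[base + i];
      pb[i] = b[base + i];
    }
    const NarrowVec r = mulNarrow(pa, pb);
    for (std::size_t i = 0; i < kNarrowLanes; ++i)
      out[base + i] = r[i];
  }
  return out;
}
```

In this model each half is multiplied exactly once, so a single narrow pattern would suffice; whether the real legalizer rule produces equally good code for v64bf16 is something the PR would still have to verify.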
@@ -225,12 +225,17 @@ AIE2PLegalizerInfo::AIE2PLegalizerInfo(const AIE2PSubtarget &ST)

  getActionDefinitionsBuilder(G_FABS).customFor({S16, S32, S64}).scalarize(0);

  getActionDefinitionsBuilder(G_FMUL)
      .legalFor({V64S16, V32S16})
      .customFor({S16})
Do we need to retain .clampScalar?
We have custom legalization for S16 now, no need to clamp it to S32/S64. Any other scalar should be illegal
I mean .clampScalar(0, S16, S64)
But why? The only float type of 16 bits or fewer that we have is bfloat (aka S16).
True. Just pointing out that we deviate from the old behavior, which clamped s128 to s64 or s8 to s32. But you are right, it does not make sense for these types.
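For readers unfamiliar with the clamping being discussed, here is a minimal standalone model of what clamping a scalar width to a range does. It is not the GlobalISel implementation; the function name clampScalarBits is hypothetical, and the bounds in the usage note are taken from the s128-to-s64 and s8-to-s32 examples in the comment above.

```cpp
// Hypothetical model: clamp a scalar bit width into [minBits, maxBits].
// Widths below the minimum are widened, widths above the maximum are narrowed.
unsigned clampScalarBits(unsigned bits, unsigned minBits, unsigned maxBits) {
  if (bits < minBits)
    return minBits; // widen, e.g. s8 -> s32
  if (bits > maxBits)
    return maxBits; // narrow, e.g. s128 -> s64
  return bits;      // already in range, left unchanged
}
```

With bounds of 32 and 64 this reproduces the old behavior mentioned above. With a lower bound of 16, an s16 value would pass through unchanged, which matches the point that bfloat is the only small float type here.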
@@ -5,7 +5,6 @@
# (c) Copyright 2024 Advanced Micro Devices, Inc. or its affiliates

# RUN: llc -mtriple aie2 -run-pass=legalizer %s -verify-machineinstrs -o - | FileCheck -DVER=2 --check-prefix=COMMON --check-prefix=AIE2 %s
# RUN: llc -mtriple aie2p -run-pass=legalizer %s -verify-machineinstrs -o - | FileCheck -DVER=2p --check-prefix=COMMON --check-prefix=AIE2P %s
We still have an AIE2P check line in the test. You could also remove -DVER=2 --check-prefix=COMMON.
#
# (c) Copyright 2024 Advanced Micro Devices, Inc. or its affiliates

# RUN: llc -mtriple aie2p -run-pass=legalizer %s -verify-machineinstrs -o - | FileCheck %s
Would be nice to include the libcall tests as well.
We didn't have them before either, but I will add them while I'm at it.
@@ -225,12 +225,17 @@ AIE2PLegalizerInfo::AIE2PLegalizerInfo(const AIE2PSubtarget &ST)

  getActionDefinitionsBuilder(G_FABS).customFor({S16, S32, S64}).scalarize(0);

  getActionDefinitionsBuilder(G_FMUL)
      .legalFor({V64S16, V32S16})
      .customFor({S16})
It would be nice to have a comment explaining why we customize this for s16. I don't really get the context here.
We don't have an instruction to multiply bf16 scalars. Instead of using an inefficient and potentially unsafe libcall (e.g. in the case of hardware loops), we need custom legalization: insert the bf16 scalar into a vector, perform the element-wise multiplication with VMUL.f, and extract the bf16 scalar again. I can add this explanation as a comment.
It's the same as for FADD / FSUB. We implement a scalar multiplication by a full element by element vector mul.
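The insert, multiply, extract sequence described in this thread can be illustrated with a small standalone model. This is not the legalizer code from the PR: float stands in for bf16, 32 lanes stand in for v32bf16, and the names vectorMul and scalarMulViaVector are hypothetical.

```cpp
#include <array>
#include <cstddef>

// Hypothetical model: float stands in for bf16, 32 lanes for v32bf16.
constexpr std::size_t kLanes = 32;
using Vec = std::array<float, kLanes>;

// Models the element-wise vector multiply (the role VMUL.f plays).
Vec vectorMul(const Vec &a, const Vec &b) {
  Vec out{};
  for (std::size_t i = 0; i < kLanes; ++i)
    out[i] = a[i] * b[i];
  return out;
}

// Scalar multiply expressed as: insert -> vector multiply -> extract.
float scalarMulViaVector(float x, float y) {
  // The unused lanes are zero-initialized in this model. In a real
  // lowering they would simply be undefined, since only lane 0 is read.
  Vec a{};
  Vec b{};
  a[0] = x;                  // insert the first operand at index 0
  b[0] = y;                  // insert the second operand at index 0
  return vectorMul(a, b)[0]; // multiply all lanes, extract lane 0
}
```

Only lane 0 carries meaningful data, so the remaining lanes are wasted work; the trade-off accepted in the thread is that this still avoids a libcall.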
  const unsigned InsertEltOpc =
      ST.getInstrInfo()->getGenericInsertVectorEltOpcode();

  const Register IdxReg = MIRBuilder.buildConstant(S32, 0).getReg(0);
Wouldn't it be cheaper to broadcast? Or is this picked up by a push.lo?
@@ -222,6 +225,26 @@ def : Pat<(fadd ACC2048:$acc1, ACC2048:$acc2),
def : Pat<(fsub ACC2048:$acc1, ACC2048:$acc2),
          (VSUB_f_vmac_cm2_add_reg ACC2048:$acc1, ACC2048:$acc2, (i32 accfp32_vecconf.ConfBits))>;

// MUL
def : Pat<(v64bf16 (fmul v64bf16:$vec1, v64bf16:$vec2)),
Check: we are performing the same multiplication twice, once to extract the lo half and once to extract the hi half. I guess we cannot express an optimized reuse of the same VMUL here, right?
No description provided.