Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

!cir.bool lowering differs from original clang CodeGen. #480

Open
bcardosolopes opened this issue Feb 22, 2024 · 18 comments
Open

!cir.bool lowering differs from original clang CodeGen. #480

bcardosolopes opened this issue Feb 22, 2024 · 18 comments
Assignees
Labels
IR difference A difference in ClangIR-generated LLVM IR that could complicate reusing original CodeGen tests

Comments

@bcardosolopes
Copy link
Member

I believe the root problem here is that CIR lowers !cir.bool in a different way than the original clang CodeGen.

In the original CodeGen, bool glvalues are lowered to LLVM i8 values, and bool prvalues are lowered to LLVM i1 values, as illustrated in the following example:

bool test(int x) {
  bool ret = x > 42;
  return ret;
}
; @test returns a bool prvalue so its return type is `i1`
define dso_local noundef zeroext i1 @test()() {
entry:
  %ret = alloca i8, align 1  ; %ret is a bool glvalue so its type is `i8`
  tail call void @llvm.dbg.declare(metadata ptr %ret, metadata !16, metadata !DIExpression())
  store i8 1, ptr %ret, align 1
  %0 = load i8, ptr %ret, align 1
  %tobool = trunc i8 %0 to i1
  ret i1 %tobool
}

However, in CIRGen, all !cir.bool values are lowered to LLVM i8 values. The example above would be lowered to LLVMIR through CIR as:

; Note that the return value of @test is `i8` rather than `i1`
define i8 @test()() #0 !dbg !3 {
  %1 = alloca i8, i64 1, align 1, !dbg !6
  %2 = alloca i8, i64 1, align 1, !dbg !7
  store i8 1, ptr %2, align 1, !dbg !7
  %3 = load i8, ptr %2, align 1, !dbg !8
  store i8 %3, ptr %1, align 1, !dbg !9
  %4 = load i8, ptr %1, align 1, !dbg !9
  ret i8 %4, !dbg !9
}

This divergence leads to the redundancy illustrated in the PR description. The result of a cir.cmp operation is currently expected to be lowered to an i8 value although it's a prvalue. After emitting an llvm.icmp operation, you have to insert an llvm.zext operation to extend the i1 to i8. Thus the redundancy.

Originally posted by @Lancern in #478 (comment)

@wolfcomos
Copy link

Hi, I'm new to open source and llvm, and I want start contributing with this issue. To my understanding, the codegen for CIR produces the return value with extra bits, which is redundant since it requires extra casting for other operations. One thing I'm confused is the comment mentioned that the problem might be generated from LowerExpectIntrinsicPass, which is the infrastructure from llvm. Do you have any suggestions where to look at to get started? Thanks!

@bcardosolopes
Copy link
Member Author

@wolfcomos nice, welcome! You can ignore the LowerExpectIntrinsicPass for now, since this is related to other things. To get started I suggest:

  • Use the simple C example from above comment.
  • Use a debugger in clang to see how traditionally it goes from Clang AST to LLVM IR
  • Look at CIRGen and try to implement similar logic where things are currently working in a different way

@bcardosolopes
Copy link
Member Author

After discussions with @sitio-couto this will be solved by his ABI work that is work in progress, so assigning to him.

@smeenai
Copy link
Collaborator

smeenai commented Oct 11, 2024

#32 had some prior discussion of this, and we're running into it again in #962

bcardosolopes pushed a commit that referenced this issue Oct 21, 2024
)

The llvm's intrinsic `llvm.is.fpclass` is used to support multiple float
point builtins:
https://clang.llvm.org/docs/LanguageExtensions.html#builtin-isfpclass

> The `__builtin_isfpclass()` builtin is a generalization of functions
> isnan, isinf, isfinite and some others defined by the C standard. It
tests
> if the floating-point value, specified by the first argument, falls
into
> any of data classes, specified by the second argument.

I meant to support this by creating IntrinsicCallOp directly. But I
can't make it due to #480 since
the return type of the intrinsic will mismatch. So I have to create a
new Op for it. But I feel it might not be too bad. At least it is more
explicit and more expressive.
@smeenai
Copy link
Collaborator

smeenai commented Oct 25, 2024

This also causes missed optimizations, e.g. https://godbolt.org/z/Kfcn8P7M7. That's a pretty contrived example, but I expect something similar to show up in the real world as well. What was the intended ABI-based solution for this?

@sitio-couto
Copy link
Collaborator

Hey @smeenai, let me add a bit more info on this:

The original CodeGen has both a scalar and a memory representation for bool, which generate i1 and i8 for x86, respectively. AFAIK, these are used contextually: the memory type is used for allocations, while the scalar type is used in expressions/registers, and so on.

What was the intended ABI-based solution for this?

I'm not sure if the ABI alone can address this issue. When I had this discussion with @bcardosolopes I was looking only into calling conventions, which does not cover all cases, and the approach I had in mind did not come to fruition.

While the ABI-stuff might help us determine what the scalar/memory representation for a bool should be, I'm not sure it can determine when to use each one. Since cir.bool does not distinguish between its scalar and in-memory representation, we might have to add an attribute to do so, or handle this when lowering to LLVM Dialact by looking at the context (e.g. is a cir.bool in an cir.alloca, then lower to i8).

lanza pushed a commit that referenced this issue Nov 5, 2024
)

The llvm's intrinsic `llvm.is.fpclass` is used to support multiple float
point builtins:
https://clang.llvm.org/docs/LanguageExtensions.html#builtin-isfpclass

> The `__builtin_isfpclass()` builtin is a generalization of functions
> isnan, isinf, isfinite and some others defined by the C standard. It
tests
> if the floating-point value, specified by the first argument, falls
into
> any of data classes, specified by the second argument.

I meant to support this by creating IntrinsicCallOp directly. But I
can't make it due to #480 since
the return type of the intrinsic will mismatch. So I have to create a
new Op for it. But I feel it might not be too bad. At least it is more
explicit and more expressive.
@bcardosolopes bcardosolopes added the IR difference A difference in ClangIR-generated LLVM IR that could complicate reusing original CodeGen tests label Nov 12, 2024
@orbiri
Copy link
Collaborator

orbiri commented Nov 12, 2024

I am not super sure about the example in the top so can't tell what problem you are seeking here.

These two codes are clearly not computing the same thing 🙏

bool test(int x) {
  bool ret = x > 42;
  return ret;
}

; @test returns a bool prvalue so its return type is `i1`
define dso_local noundef zeroext i1 @test()() {
entry:
  %ret = alloca i8, align 1  ; %ret is a bool glvalue so its type is `i8`
  tail call void @llvm.dbg.declare(metadata ptr %ret, metadata !16, metadata !DIExpression())
  store i8 1, ptr %ret, align 1
  %0 = load i8, ptr %ret, align 1
  %tobool = trunc i8 %0 to i1
  ret i1 %tobool
}

@smeenai
Copy link
Collaborator

smeenai commented Nov 12, 2024

https://godbolt.org/z/cKjxzf5nE should be up-to-date; note the different return types for regular Clang vs. ClangIR.

@orbiri
Copy link
Collaborator

orbiri commented Nov 12, 2024

I see now, thanks!

In my local setup I get this output, Indicating that cir allocates both the local variable and the return value. Perhaps this is the difference?

cir.func @_Z3fooi(%arg0: !s32i loc(fused[#loc3, #loc4])) -> !cir.bool extra(#fn_attr) {
    %0 = cir.alloca !s32i, !cir.ptr<!s32i>, ["x", init] {alignment = 4 : i64} loc(#loc12)
    %1 = cir.alloca !cir.bool, !cir.ptr<!cir.bool>, ["__retval"] {alignment = 1 : i64} loc(#loc2)
    %2 = cir.alloca !cir.bool, !cir.ptr<!cir.bool>, ["b", init] {alignment = 1 : i64} loc(#loc13)
    cir.store %arg0, %0 : !s32i, !cir.ptr<!s32i> loc(#loc7)
    %3 = cir.load %0 : !cir.ptr<!s32i>, !s32i loc(#loc8)
    %4 = cir.const #cir.int<42> : !s32i loc(#loc6)
    %5 = cir.cmp(gt, %3, %4) : !s32i, !cir.bool loc(#loc14)
    cir.store %5, %2 : !cir.bool, !cir.ptr<!cir.bool> loc(#loc13)
    %6 = cir.load %2 : !cir.ptr<!cir.bool>, !cir.bool loc(#loc9)
    cir.store %6, %1 : !cir.bool, !cir.ptr<!cir.bool> loc(#loc15)
    %7 = cir.load %1 : !cir.ptr<!cir.bool>, !cir.bool loc(#loc15)
    cir.return %7 : !cir.bool loc(#loc15)
  } loc(#loc11)

I can try to have a look at it, but can't promise when will it be :)

@orbiri
Copy link
Collaborator

orbiri commented Nov 16, 2024

After trying to play with changing cir.bool width to i1 in the data layout, I circled back to the same conclusion that was reached in #32 by @lanza

This is actually a bit uglier than I thought. Clang shoehorns the bool to an i8 when it needs it and an i1 otherwise. So our representation has to keep track of what clang decided to do since this is effectively clang-codegen-ABI and we have to agree exactly with the ABI at the LLVMIR level if we want to be able to module link with CodeGen emitted IR files.

I will try to track the "shoehorning" process in clang and see if we can find a non-ugly method to use it in CIR :)

PikachuHyA added a commit to PikachuHyA/clangir that referenced this issue Nov 19, 2024
This patch adds support for `__builtin_isinf_sign`. The implementation
has several limitations that result in discrepancies between the
generated LLVM IR and the expected output.

Firstly, it uses `IsFPClass` to determine if a value is infinite,
whereas the original CGBuiltin implementation used direct comparisons
with infinity constants.
Secondly, due to llvm#480, there are
numerous unnecessary type conversions occurring, such as converting from
i1 to i8 and then back to i1.
Additionally, `SignBitOp` cannot set the return type to `CIR_BoolType`,
as doing so would lead to failures during the lowering to LLVM IR.
@bcardosolopes
Copy link
Member Author

@orbiri thanks a bunch!

@orbiri
Copy link
Collaborator

orbiri commented Nov 23, 2024

I have a solution which I implemented in the ThroughMLIR path and will upload as RFC today/over the coming week. My thought process surrounded the fact that CIR "postpones" the lowering of bool from the frontend to the moment it lowers to LLVM/MLIR. Therefore, the methods that clang's codegen uses in order to emit boolean types should remain. I found two key insights:

  1. bool type IS i1 in clang's codegen. That is what ConvertType returns!
  2. Only when concerned with memory do these types change. The three primitives in this area are convertTypeForLoadStore, EmitToMemory for stores and EmitFromMemory for loads. Together, they make sure that all scalars remain in their "sane" and natural type, and only when memory is involved the scalars are temporarily converted.

My main concern about this solution is that it will need to be duplicated to both ThroughMLIR and DirectLLVM. I wonder if someone here can suggest a method where these utilities can be shared. I'm open to suggestions!


On second thought, I would be surprised if we are the first who require this lowering pattern. I would like to suggest trying to ask in discourse how people suggest to address this issue of having separate data layout than the actual data.

@orbiri
Copy link
Collaborator

orbiri commented Nov 23, 2024

Published draft at #1158 :)

This still doesn't solve the ellision that is required to solve this issue, but it lays the foundations for solving it.

@bcardosolopes
Copy link
Member Author

Thanks for exploring solutions here, it's about time we fix this! My suggestion is to check whether we should implement this as part of target lowering, since it's also tied to ABI (memory usage in and out of functions), wrote more content in the PR.

PikachuHyA added a commit to PikachuHyA/clangir that referenced this issue Nov 27, 2024
This patch adds support for `__builtin_isinf_sign`. The implementation
has several limitations that result in discrepancies between the
generated LLVM IR and the expected output.

Firstly, it uses `IsFPClass` to determine if a value is infinite,
whereas the original CGBuiltin implementation used direct comparisons
with infinity constants.
Secondly, due to llvm#480, there are
numerous unnecessary type conversions occurring, such as converting from
i1 to i8 and then back to i1.
Additionally, `SignBitOp` cannot set the return type to `CIR_BoolType`,
as doing so would lead to failures during the lowering to LLVM IR.
@orbiri
Copy link
Collaborator

orbiri commented Jan 11, 2025

While thinking about #1267, I came to think about a crucial reason for saving the retval as stack variable - missing the ability to carry and update values across loops. Indeed, when trying a more sophisticated example (https://godbolt.org/z/M7hWczcn9), one can see that vanilla clang also emits the same temp variable. It does so probably in order to avoid complex phi emission and defer it to mem2reg passes.

I would claim it is nice to have to elide the __retval alloca in very simple cases, but perhaps it is more trouble than worth? I'll have a look at it now that cir.bool lowering behave more nicely.

P.S. the example above shows many more opportunities for minor optimizations as part of the codegen - many redundant pass through blocks, having two return sites instead of one, being able to emit an alloca of i1 and probably other things I missed :)

@orbiri
Copy link
Collaborator

orbiri commented Jan 11, 2025

The logic in codegen is: (1) Always emit temp variable for return value (2) If return is of type direct/extend:

  • [1] Find "dominating" store - either the only store in the function or the last store to the return value address in the return block.
  • [2] If found, take stored value as the return value, zap the store instruction on the spot.
    This is the source code -

Later, [3] if the return value address is not used (which can be the case when the only use was zapped by ☝) - delete the return value argument.

I have a few questions on this to the CIR designers:

  1. I don't think that the above process is something we should re-implement in CIRGen. I believe that if we want to implement this optimizations - perhaps it would be preferable to defer it to the final lowering steps. My thought process is that the canonical form of having the return value saved in the stack is something that static analyzers could benefit from with simplicity. Does anyone think otherwise?
  2. If we want to implement this elision logic in the final lowering steps, do you believe we should infer this "on the spot" by following the return value, or should we help the pass by marking the return value variable somehow: a special alloca operation or adding a special attribute to cir.alloca? I believe it is more robust to mark this for the pass. It is possible that in the future certain mem2reg passes will work on CIR and we would like to make easy assumptions on the return value address in the lowering pass if it is found.
  3. Given all of the above - would you recommend moving forward and implementing this optimization or should we leave it to O1+ passes? :)

[1]

static llvm::StoreInst *findDominatingStoreToReturnValue(CodeGenFunction &CGF) {

[2]
if (llvm::StoreInst *SI =
findDominatingStoreToReturnValue(*this)) {

[3]
if (ReturnValue.isValid()) {
auto *RetAlloca =
dyn_cast<llvm::AllocaInst>(ReturnValue.emitRawPointer(*this));
if (RetAlloca && RetAlloca->use_empty()) {
RetAlloca->eraseFromParent();
ReturnValue = Address::invalid();
}
}

@bcardosolopes
Copy link
Member Author

I would claim it is nice to have to elide the __retval alloca in very simple cases, but perhaps it is more trouble than worth? I'll have a look at it now that cir.bool lowering behave more nicely.

The general approach in CIRGen (and OG) is to be as simple as possible - if we decide to elide it should come as an extra pass (either out of generic dce from mem2reg, etc or some custom pass)

@bcardosolopes
Copy link
Member Author

bcardosolopes commented Jan 14, 2025

I don't think that the above process is something we should re-implement in CIRGen. ... Does anyone think otherwise?

+1

I believe it is more robust to mark this for the pass

Agreed, we already have a special "name" for it "__retval", so we usually now exactly what alloca is used for returning values. It's also possible we could leverage that info for NRVO, etc. I'd say we should tag it.

would you recommend moving forward and implementing this optimization

Hell yea! My take is to match OG (so it's easy to compare the IR outputs). Does it take -O1 to do this optimization currently or is it always on?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IR difference A difference in ClangIR-generated LLVM IR that could complicate reusing original CodeGen tests
Projects
None yet
Development

No branches or pull requests

5 participants