[AutoBump] Merge with 56eb559b (Nov 26) (15) #479

jorickert · 2025-02-20T05:15:13Z

No description provided.

…ls/LoopUtils (llvm#116324) Drop assumptions of a surrounding builtin.func op in affine LoopUtils and Utils. There are use cases of affine fusion or other affine transformations in other func-like ops. In the context of llvm#116042

…opy generate (llvm#116763) Fix unchecked use of the memref memory space attribute in affine data copy generate. Memory accesses without a memory space attribute, or with a memory space attribute that is not an integer attribute, are treated by the pass as being in the slow memory space. Fixes llvm#116536
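A minimal Python sketch of the fallback described above. It assumes that the pass's slow memory space defaults to 0 and that copies are generated only for accesses in the slow memory space. The function names are hypothetical: this models the behavior, not the MLIR C++ code.

```python
from typing import Optional


def effective_memory_space(attr: Optional[object], slow_memory_space: int = 0) -> int:
    """Memory space used for an access (hypothetical helper).

    A missing attribute, or one that is not an integer, falls back to the
    slow memory space instead of being assumed to be an integer.
    """
    if isinstance(attr, int) and not isinstance(attr, bool):
        return attr
    return slow_memory_space


def should_generate_copy(attr: Optional[object], slow_memory_space: int = 0) -> bool:
    # Assumption for this sketch: only slow-memory accesses are copied.
    return effective_memory_space(attr, slow_memory_space) == slow_memory_space
```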

llvm#117260) This was a bit annoying because these introduce a new special-case encoding usage. op_sel is repurposed as a subset of dpp controls, and is eligible for VOP3->VOP1 shrinking. For some reason, fi also uses an enum value, so we need to convert the raw boolean to 1 instead of -1. The 2 registers are swapped, so this has 2 defs. Ideally the builtin would return a pair, but that's difficult, so it returns a vector instead. This would make a hypothetical builtin that supports v2f16 directly uglier.

…Data (llvm#117412) This patch removes two functions that verify the consistency between:

- IndexedAllocationInfo::CallStack
- IndexedAllocationInfo::CSId

Now that MemProf format Version 1 has been removed, IndexedAllocationInfo::CallStack doesn't participate in either serialization or deserialization, so we don't care about the consistency between the two fields in IndexedAllocationInfo. Subsequent patches will remove uses of the old field and eventually remove the field.

…rsion (llvm#117413) This commit adds support for 1:N result type conversions for `func.call` ops. In that case, argument materializations to the original result type should be inserted (via `replaceOpWithMultiple`). This commit is in preparation for merging the 1:1 and 1:N conversion drivers.

This PR updates the minimal required version of pybind11 from 2.9.0 to 2.10.0. The new version is almost 2.5 years old, which is half a year newer than the previous minimum. This change is necessary to support the changes introduced in llvm#115307, which does not compile with pybind11 v2.9. Signed-off-by: Ingo Müller <[email protected]>

This location type represents a contiguous range inside a file. It is effectively a pair of FileLineCols. Add the new type and make FileLineCol a view of it for the case where it matches the existing one. The location includes a filename plus optional start line & col and end line & col. The common cases considered are file:line, file:line:col, file:line:start_col to file:line:end_col, and a general range within the same file. In memory it's encoded as trailing objects. This keeps the memory requirement the same as FileLineColLoc today (and makes the rather common File:Line cheaper) at the expense of extra work at decoding time. Kept the unsigned type. There was the option to always have a file range be castable to FileLineColLoc; this cast would just drop the other fields. That may result in some simpler staging. TBD. This is a rather minimal change: it does not yet add bindings (C or Python), lowering to LLVM debug locations, etc. that support end line:cols. --------- Co-authored-by: River Riddle <[email protected]>
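To illustrate the trailing-object idea, here is a toy Python model that stores only the line/column fields a location actually has and decodes the rest on access. The class name `FileRange` and the field ordering are assumptions for illustration, not MLIR's actual `FileLineColRange` layout. A tuple stands in for the trailing storage, so memory layout is not modeled.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FileRange:
    """Toy range location; only the fields that are present are stored."""

    filename: str
    # Hypothetical layout of the trailing fields:
    #   (line,)                                      file:line
    #   (line, col)                                  file:line:col
    #   (line, start_col, end_col)                   file:line:start_col to end_col
    #   (start_line, start_col, end_line, end_col)   general range in one file
    trailing: Tuple[int, ...] = ()

    def __post_init__(self) -> None:
        if len(self.trailing) > 4:
            raise ValueError("at most four trailing fields")
        if any(v < 0 for v in self.trailing):
            raise ValueError("line/column values are unsigned")

    @property
    def start_line(self) -> Optional[int]:
        return self.trailing[0] if self.trailing else None

    @property
    def start_col(self) -> Optional[int]:
        return self.trailing[1] if len(self.trailing) >= 2 else None

    @property
    def end_line(self) -> Optional[int]:
        # Every form except the general range ends on its start line.
        return self.trailing[2] if len(self.trailing) == 4 else self.start_line

    @property
    def end_col(self) -> Optional[int]:
        # The last stored field is the end column for the 3- and 4-field forms.
        if len(self.trailing) >= 3:
            return self.trailing[-1]
        return self.start_col

    def is_file_line_col(self) -> bool:
        """True when the range can be viewed as a plain file:line:col location."""
        return len(self.trailing) == 2
```

Decoding does more work than reading fixed fields, which mirrors the trade-off the commit describes: smaller storage for common cases in exchange for extra work at decoding time.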

…on the RHS This helps ensure we lower to VPERMI2/T2 instructions that we can commute the index arg to VPERMT2/I2. Similar to 1e31a45, this handles cases where the one-use load appears after further folding (keep the lowerShuffleWithPERMV version, as it can handle the non-VLX widening case as well).

…7325) Two PRs were merged at the same time: one that modified `maybeApplyToV` function, and shortly afterwards, this (the reverted) one that had the old definition. During the merge both definitions were retained leading to compilation errors. Reapply the reverted PR (1a08b15) with the duplicate removed.

This converts all ptr element shuffle vectors to s64, so that the existing vector legalization handling can lower them as needed. This prevents a lot of fallbacks that currently try to generate things like `<2 x ptr> G_EXT`. I'm not sure if bitcast/inttoptr/ptrtoint is intended to be necessary for vectors of pointers, but it uses buildCast for the casts, which now generates a ptrtoint/inttoptr.

It does not make sense to assemble for the default target. Add one that shows the behavior. It is treated as a tahiti alias without instructions which were later removed, and needs to be treated as wave64. We should probably turn this into a hard error though.

This is a refinement for the existing hack. With this, the default target will have neither wavefrontsize feature present, unless it was explicitly specified. That is, getWavefrontSize() == 64 no longer implies +wavefrontsize64. getWavefrontSize() == 32 does imply +wavefrontsize32. Continue to assume the value is 64 with no wavesize feature. This maintains the codegenable property without any code that directly cares about the wavesize needing to worry about it. Introduce an isWaveSizeKnown helper to check if we know the wavesize is accurate based on having one of the features explicitly set, or a known target-cpu. I'm not sure what's going on in wave_any.s. It's testing what happens when both wavesizes are enabled, but this is treated as an error in codegen. We now treat wave32 as the winning case, so some cases that were previously printed as vcc are now vcc_lo.
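The feature logic described above can be sketched in Python as follows. The feature strings mirror the `+wavefrontsize32`/`+wavefrontsize64` spelling. The helper names are hypothetical, and treating any explicit, non-`generic` CPU as having a known wave size is an assumption of this sketch rather than the backend's actual CPU table.

```python
from typing import AbstractSet, Optional

WAVE32 = "+wavefrontsize32"
WAVE64 = "+wavefrontsize64"


def get_wavefront_size(features: AbstractSet[str]) -> int:
    # wave32 wins when both features are present; with neither, 64 is assumed,
    # so a result of 64 no longer implies that +wavefrontsize64 was set.
    return 32 if WAVE32 in features else 64


def is_wave_size_known(features: AbstractSet[str], target_cpu: Optional[str]) -> bool:
    # Known if either wavesize feature was set explicitly...
    if WAVE32 in features or WAVE64 in features:
        return True
    # ...or (assumption for this sketch) a specific target CPU was given.
    return target_cpu not in (None, "", "generic")
```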

…108889) This looks like a rather weird change, so let me explain why this isn't as unreasonable as it looks. Let's start with the problem it's solving.

```
define signext i32 @overlap_live_ranges(ptr %arg, i32 signext %arg1) {
bb:
  %i = icmp eq i32 %arg1, 1
  br i1 %i, label %bb2, label %bb5

bb2:                                              ; preds = %bb
  %i3 = getelementptr inbounds nuw i8, ptr %arg, i64 4
  %i4 = load i32, ptr %i3, align 4
  br label %bb5

bb5:                                              ; preds = %bb2, %bb
  %i6 = phi i32 [ %i4, %bb2 ], [ 13, %bb ]
  ret i32 %i6
}
```

Right now, we codegen this as:

```
  li a3, 1
  li a2, 13
  bne a1, a3, .LBB0_2
  lw a2, 4(a0)
.LBB0_2:
  mv a0, a2
  ret
```

In this example, we have two values which must be assigned to a0 per the ABI (%arg and the return value). SelectionDAG ensures that all values used in a successor phi are defined before exiting the predecessor block. This creates an ADDI to materialize the immediate in the entry block. Currently, this ADDI is not sunk into the tail block because we'd have to split a critical edge to do so. Note that if our immediate were large enough to require two instructions, we *would* split this critical edge. Looking at other targets, we notice that they don't seem to have this problem. They perform the sinking and tail duplication that we don't. Why? Well, it turns out for AArch64 that this is entirely an accident of the existence of the gpr32all register class. The immediate is materialized into the gpr32 class and then copied into the gpr32all register class. The existence of that copy puts us right back into the two-instruction case noted above. This change essentially just bypasses this emergent aspect of the AArch64 behavior and implements the same "always sink immediates" behavior for RISCV as well.

…m#117594) These instructions have non-standard use of OPSEL bits to select the dest write byte. The src2_modifiers operand is used without its corresponding src2 operand by introducing a dummy src2. Co-authored-by: Pravin Jagtap <[email protected]>

…117595) Scale packed 16-component single-precision float vectors from two source inputs using the exponent provided by the third single-precision float input, then convert the values to a packed 32-component FP6 float value. Co-authored-by: Pravin Jagtap <[email protected]>

…117597) v_dot2_f32_bf16 was added in gfx11 along with v_dot2_f16_f16 and v_dot2_bf16_bf16. All three instructions were part of the Dot9 instructions in the compiler. This patch splits the existing dot9 (v_dot2_f16_f16, v_dot2_bf16_bf16, v_dot2_f32_bf16) into a new dot9 (v_dot2_f16_f16 and v_dot2_bf16_bf16) and dot12 (v_dot2_f32_bf16). gfx11 and gfx12 are updated accordingly to reflect this change. Co-authored-by: Sirish Pande <[email protected]>
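As a small illustration, the regrouping can be written as a partition of the old Dot9 set. The group names and instruction names come from the commit message. The Python representation and helper names are hypothetical.

```python
from typing import Dict, FrozenSet, Set

OLD_DOT9: FrozenSet[str] = frozenset(
    {"v_dot2_f16_f16", "v_dot2_bf16_bf16", "v_dot2_f32_bf16"}
)

NEW_GROUPS: Dict[str, FrozenSet[str]] = {
    "dot9": frozenset({"v_dot2_f16_f16", "v_dot2_bf16_bf16"}),
    "dot12": frozenset({"v_dot2_f32_bf16"}),
}


def group_of(instr: str) -> str:
    """Return the feature group an instruction belongs to after the split."""
    for name, members in NEW_GROUPS.items():
        if instr in members:
            return name
    raise KeyError(instr)


def is_partition(old: FrozenSet[str], groups: Dict[str, FrozenSet[str]]) -> bool:
    """True if the groups cover `old` exactly, with no instruction in two groups."""
    seen: Set[str] = set()
    for members in groups.values():
        if seen & members:
            return False
        seen |= members
    return seen == old
```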

…vm#117631)

- Update the runtime entry points to accept stream information.
- Update the conversion of `cuf.allocate` to correctly pass the stream information when present.

Note that the stream is not currently used in the runtime. This will be done in a separate patch, as a design/solution needs to be done together with the allocators.

In case of an error, GetBlock would return a reference to a Block without adding it to a parent. This doesn't seem like a good idea, and none of the other plugins do that. This patch fixes that by propagating errors (well, null pointers...) up the stack. I don't know of any specific problem that this solves, but given that this occurs only when something goes very wrong (e.g. a corrupted PDB file), it's quite possible no one has run into this situation, so we can't say the code is correct either. It also gets in the way of a refactor I'm contemplating.
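A minimal sketch of the error-propagation pattern described above, with Python's `None` standing in for null pointers. `Block`, `BlockParser`, and the parent map are hypothetical stand-ins for the PDB plugin's structures. An id missing from the map models a corrupted debug record.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Block:
    uid: int
    parent: Optional["Block"] = field(default=None, repr=False)
    children: List["Block"] = field(default_factory=list, repr=False)


class BlockParser:
    def __init__(self, parent_of: Dict[int, Optional[int]]) -> None:
        # Maps a block id to its parent id (None for a top-level block).
        self.parent_of = parent_of
        self.blocks: Dict[int, Block] = {}

    def get_block(self, uid: int) -> Optional[Block]:
        if uid in self.blocks:
            return self.blocks[uid]
        if uid not in self.parent_of:
            return None  # error: unknown or corrupted record
        parent_uid = self.parent_of[uid]
        parent: Optional[Block] = None
        if parent_uid is not None:
            parent = self.get_block(parent_uid)
            if parent is None:
                return None  # propagate the failure instead of returning an orphan
        block = Block(uid, parent)
        if parent is not None:
            parent.children.append(block)
        self.blocks[uid] = block  # only fully attached blocks are cached
        return block
```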

bondhugula and others added 30 commits November 23, 2024 07:03

[MLIR] Drop assumption of a surrounding builtin.func in promoteIfSing…

b3909f4

…leIteration (llvm#116323) Drop assumption of a surrounding builtin.func in promoteIfSingleIteration. Fixes llvm#116042

[MLIR] Fix arbitrary checks in affine LICM (llvm#116469)

132de3a

Fix arbitrary checks and hardcoding/specialcasing in affine LICM. Drop unnecessary (too much) debug logging. This pass is still unsound due to not handling aliases. This will have to be handled later.

AMDGPU: Define new sched model for gfx950 (llvm#117261)

33c2b20

A few instructions changed rate.

AMDGPU: Handle gfx950 change in mfma_f64_16x16x4 + valu hazard (llvm#…

b078b88

…117262) Increase from 11 wait states to 19

AMDGPU: Handle gfx950 XDL-write-overlapped-smfma-src-c wait state cha…

8cb6c99

…nge (llvm#117263) These have an additional wait state compared to gfx940.

AMDGPU: Handle v_mfma_f64_16x16x4_f64 srcc write VGPR hazard change f…

db08d78

…or gfx950 (llvm#117283) Read by sgemm/dgemm in srcc after v_mfma_f64_16x16x4_f64 increases from 9 to 17 wait states.

AMDGPU: Handle v_mfma_f64_16x16x4_f64 write VGPR read srca/srcb hazar…

85601fd

…d change for gfx950 (llvm#117284) Increase in wait states from 11 to 19. The index for smfmac counts as like srcA/srcB.
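The wait-state numbers quoted in the hazard commits above can be collected in one table. This is a hypothetical Python summary, not LLVM's hazard recognizer code. The hazard keys are descriptive labels, and the "previous" column is the value each commit says was increased. The overlapped srcC commit gives only a relative change, so it is modeled as one extra wait state over gfx940.

```python
from typing import Dict, Tuple

# (previous wait states, gfx950 wait states), transcribed from the commit messages.
MFMA_F64_16X16X4_WAIT_STATES: Dict[str, Tuple[int, int]] = {
    "valu_after_mfma_f64_16x16x4": (11, 19),
    "srcc_read_by_sgemm_dgemm": (9, 17),
    "srca_srcb_read": (11, 19),
}


def required_wait_states(hazard: str, is_gfx950: bool) -> int:
    previous, gfx950 = MFMA_F64_16X16X4_WAIT_STATES[hazard]
    return gfx950 if is_gfx950 else previous


def xdl_overlapped_srcc_wait_states(gfx940_value: int, is_gfx950: bool) -> int:
    # gfx950 needs one additional wait state compared to gfx940 for this hazard.
    return gfx940_value + 1 if is_gfx950 else gfx940_value
```

Each of the three tabulated hazards grows by the same 8 wait states on gfx950.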

[BasicBlockSections] Allow mixing of -basic-block-sections with MFS. (l…

68f7b07

…lvm#117076) This PR allows mixing `-basic-block-sections` with `-enable-machine-function-splitter`. The strategy is to let `-basic-block-sections` take precedence over functions with profiles.

[llvm] Fix ObjectSizeOffsetVisitor behavior in exact mode upon negati… (

19ddafa

llvm#116955) …ve offset In Exact mode, the approximation of returning (0,0) is invalid. It only holds in min/max mode.

[Clang] Add C++26 approved in the Poland WG21 meeting

3a31427

[X86] vector-shuffle-avx512.ll - regenerate TERNLOG comments

dbb21df

[X86] lowerShuffleWithPERMV - commute VPERMV3 shuffles so any load is…

1e31a45

… on the RHS This helps ensure we lower to VPERMI2/T2 instructions that we can commute the index arg to VPERMT2/I2. Prep work for llvm#79799
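For context on why VPERMV3 shuffles can be commuted: a two-source permute indexes into the concatenation of both sources. Swapping the sources and flipping the index bit that selects between them therefore yields the same result. The Python model below assumes power-of-two vector lengths and models indices as masked to the table size. It illustrates the identity only and is not the X86 lowering code.

```python
from typing import List, Sequence, Tuple


def permv3(a: Sequence[int], idx: Sequence[int], b: Sequence[int]) -> List[int]:
    """Two-source permute model: each index selects from a followed by b."""
    if len(a) != len(b):
        raise ValueError("sources must have the same length")
    table = list(a) + list(b)
    # Only the low bits of each index matter, modeled here with a modulo.
    return [table[i % len(table)] for i in idx]


def commute_permv3(
    a: Sequence[int], idx: Sequence[int], b: Sequence[int]
) -> Tuple[List[int], List[int], List[int]]:
    """Swap the sources and flip the source-select bit of every index."""
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("vector length must be a power of two")
    return list(b), [(i % (2 * n)) ^ n for i in idx], list(a)
```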

[VPlan] Simplify and unify code in verifyEVLRecipe using all_of. (NFCI)

5909139

Use all_of instead of explicit loop to reduce indentation, also properly check VPScalarCastRecipe operand.

[libc++] Granularize <mutex> includes (llvm#117068)

4a8329c

[libc++][NFC] Remove a bunch of unused environment variables from the…

aaa0dd2

… CI configs

[clang-format][NFC] Reformat testcases added in 0ff8b79

aa2d084

AMDGPU: Move default wavesize hack for disassembler (llvm#117422)

8b087d6

You cannot adjust the disassembler's subtarget. llvm-mc passes the originally constructed MCSubtargetInfo around, rather than querying the pointer in the disassembler instance.

AMDGPU: Use isWave[32|64] instead of comparing size value (llvm#117411)

1944d19

wangpc-pp and others added 24 commits November 26, 2024 10:55

[RISCV] Remove getPostRAMutations (llvm#117527)

6633916

We are using `PostMachineScheduler` instead of `PostRAScheduler` since llvm#68696. The hook `getPostRAMutations` is only used in `PostRAScheduler` so it is actually dead code for RISC-V now.

AMDGPU: MC support for v_cvt_scalef32_pk32_f32_[fp|bf]6 of gfx950 (ll…

5dd48c4

…vm#117590) Co-authored-by: Pravin Jagtap <[email protected]>

AMDGPU: MC support for v_cvt_scalef32_pk32_{bf|f}16_{bf|fp}6 of gfx95…

658db91

…0. (llvm#117591) Co-authored-by: Pravin Jagtap <[email protected]>

AMDGPU: Support v_cvt_scalef32_pk32_{bf|f}6_{bf|fp}16 for gfx950 (llv…

22503a9

…m#117592) Co-authored-by: Pravin Jagtap <[email protected]>

AMDGPU: MC support for v_cvt_scalef32_pk_{bf|f}16_{bf|fp}8 of gfx950. (…

c767570

…llvm#117593) OPSEL[0] selects src_word to read. Co-authored-by: Pravin Jagtap <[email protected]>

[RISCV][CostModel] add cost for cttz/ctlz under the non-zvbb (llvm#11…

c3377af

…7515)

AMDGPU: Add support for v_ashr_pk_i8/u8_i32 instructions for gfx950 (l…

5d650a6

…lvm#117596) This patch adds assembly and builtin support for v_ashr_pk_i8/u8_i32 instructions. Co-authored-by: Sirish Pande <[email protected]>

AMDGPU: Add support for v_dot2c_f32_bf16 instruction for gfx950 (llvm…

716364e

…#117598) The encoding of the v_dot2c_f32_bf16 opcode is the same as that of v_mac_f32 in gfx90a, both from the gfx9 series. This required a new decoderNameSpace GFX950_DOT. Co-authored-by: Sirish Pande <[email protected]>

AMDGPU: Support buffer_atomic_pk_add_bf16 for gfx950 (llvm#117599)

7fc71f7

Co-authored-by: Sirish Pande <[email protected]>

AMDGPU: Add encodings for minimum3/maximum3 f32 for gfx950 (llvm#117600)

a5174de

AMDGPU: Add minimum3/maximum3 pkf16 for gfx950 encodings (llvm#117601)

ae719f0

[clangd] Support outgoing calls in call hierarchy (llvm#77556)

ca184cf

Co-authored-by: Quentin Chateau <[email protected]>

Revert "[clangd] Support outgoing calls in call hierarchy (llvm#77556)…

d77cab8

…" (llvm#117668) This reverts commit ca184cf.

[clang-format][NFC] Clean up RemoveBraces, RemoveSemi, etc.

6e57186

[SelectionDAG] Require last operand of (STRICT_)FP_ROUND to be a Targ…

bc28260

…etConstant. (llvm#117639) Fix all the places I could find that didn't do this. We were already mostly correct for FP_ROUND after 9a976f3, but not STRICT_FP_ROUND.

[LV][NFC] Auto-generate the test cases related to FindLastIV idioms. (l…

90f5c8b

…lvm#117560) Pre-commit for llvm#67812

[clang][FMV] Fix crash with cpu_specific attribute. (llvm#115762)

56eb559

When dealing with cpu_specific GlobalDecl, GetOrCreateMultiVersionResolver should immediately return the already created llvm function if it exists. Fixes llvm#115299.
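The fix amounts to an early return in a get-or-create lookup. Here is a minimal Python sketch. `ResolverTable`, `Function`, and the `foo.resolver` name are hypothetical stand-ins for Clang's module-level function table.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(eq=False)
class Function:
    name: str


class ResolverTable:
    def __init__(self) -> None:
        self.functions: Dict[str, Function] = {}
        self.created = 0  # how many resolvers were actually emitted

    def get_or_create_resolver(self, name: str) -> Function:
        existing = self.functions.get(name)
        if existing is not None:
            # Early return: a resolver already exists for this multiversioned
            # function, so reuse it instead of creating it again.
            return existing
        resolver = Function(name)
        self.functions[name] = resolver
        self.created += 1
        return resolver
```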

[AutoBump] Merge with 56eb559 (Nov 26)

8e924de

Base automatically changed from bump_to_beff2bac to bump_to_776476c2 March 13, 2025 10:24

jorickert requested a review from mgehre-amd March 14, 2025 06:59

mgehre-amd approved these changes Mar 14, 2025

jorickert merged commit 33ab764 into bump_to_776476c2 Mar 14, 2025
11 checks passed

jorickert deleted the bump_to_56eb559b branch March 14, 2025 17:28
