Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AutoBump] Merge with 258a5d49 (Nov 20) (11) #473

Open
wants to merge 367 commits into
base: bump_to_68a39081
Choose a base branch
from

Conversation

jorickert
Copy link

No description provided.

Lukacma and others added 30 commits November 19, 2024 10:29
These nodes should appear between CALLSEQ_START / CALLSEQ_END.
Previously, they could be scheduled after CALLSEQ_END because the nodes
didn't update the chain.

The change in a test is due to X86 call frame optimizer pass bailing out
for a particular call when CALLSEQ_START / CALLSEQ_END are not in the
same basic block. This happens because SEG_ALLOCA is expanded into a
sequence of basic blocks early. It didn't bail out before because the
closing CALLSEQ_END was scheduled before SEG_ALLOCA, in the same basic
block as CALLSEQ_START.

While here, simplify creation of these nodes: allocating a virtual
register and copying `Size` into it were unnecessary.
…lts to improve further folding (llvm#116419)

Currently when creating a SHUFPD immediate mask, any undef shuffle elements are set to 0, which can limit options for further shuffle combining.

This patch attempts to canonicalize the mask to improve folding: first by detecting a per-lane broadcast style mask (which can allow us to fold to UNPCK instead), and second ensure any undef elements are set to an 'inplace' value to improve chances of the SHUFPD later folding to a BLENDPD (or be bypassed in a SimplifyMultipleUseDemandedVectorElts call).

This is very similar to canonicalization we already attempt in getV4X86ShuffleImm for vXi32/vXf32 SHUFPS/SHUFD shuffles.
A follow up to PR llvm#116402 to add a regression test. The original change
fixed the reproducer but that was not suitable to use as a regression
test.

This test case will fail with a LLD prior to llvm#116402.

The disassembly for the thunk that starts as a short thunk but is later
a long thunk isn't quite right. It is missing a $d mapping symbol. I
think this can be fixed, but I've not done that in this patch to keep it
test only. It is not a regression introduced in llvm#116402.

I've also removed a spurious --threads=1 I noticed in the original test
aarch64-thunk-bti.s
FPCLASS is a unary instruction with an immediate operand - update the naming to match similar instructions (e.g. VPSHUFD) by only using the source reg/mem and immediate in the instruction name
…6651)

Summary:
For Linux systems, we currently use the HSA library to determine the
installed GPUs. However, this isn't really necessary and adds a
dependency on the HSA runtime as well as a lot of overhead. Instead,
this patch uses the `sysfs` interface exposed by `amdkfd` to do this
directly.
…lvm#116195)

The entries in the dependency matrix can contain a lot of duplicates,
which is unnecessary and results in more checks that we can avoid, and
this patch adds that.
…n function signatures. (llvm#116146)

In loongarch64 LP64D ABI, `unsigned 32-bit` types, such as unsigned int,
are stored in general-purpose registers as proper sign extensions of
their 32-bit values. Therefore, Flang also follows it if a function
needs to be interoperable with C.

Reference:

https://github.com/loongson/la-abi-specs/blob/release/lapcs.adoc#Fundamental-types
…#116590)

This patch generalizes llvm#81027
to handle pattern `and/or (fcmp ord/uno X, 0), (fcmp pred fabs(X), Y)`.
Alive2: https://alive2.llvm.org/ce/z/tsgUrz
The correctness is straightforward because `fcmp ord/uno X, 0.0` is
equivalent to `fcmp ord/uno fabs(X), 0.0`. We may generalize it to
handle fneg as well.

Address comment
llvm#116065 (review)
…vm#116247)

There are lots of places where we try to estimate the runtime
vectorisation factor based on the getVScaleForTuning TTI hook.
I've added a new getEstimatedRuntimeVF function and taught
several places in the vectoriser to use this new function.
This was a typo in llvm#112983 that didn't cause build
failures but is still wrong.
Part of llvm#51787.
Follow up of llvm#116243.

This patch adds constexpr support for the built-in reduce mul function.
In ModuleToObject flow, users may want to add some callback functions
invoked with LLVM IR/ISA for debugging or other purposes.
I'll be modifying this test in a future PR.
)

Sink shuffle operands of FMul instructions if these are splats, as we
can generate lane-indexed variants for these.
…vm#114508)

The `rd` operand of AMCAS instructions is both read and written, because
of the nature of compare-and-swap operations, but currently it is not
declared as such. Fix it for upcoming codegen enablement changes. In
order to do that, a piece of LoongArchAsmParser logic that relied on
TableGen-erated enum variants being ordered in a specific way needs
updating; this will be addressed in a following refactor. No functional
change intended.

While at it, restore vertical alignment for the definition lines.

Suggested-by: tangaac <[email protected]>
Link:
llvm#114398 (comment)
Add variant with different metadata on all loads, for
llvm#115868
…hen there are predicate calls (llvm#116075)

On loongarch64 with lsx extension, we select `VBITREV_W` for `v4i32 (xor
X, (shl splat(1), Y))`:

https://github.com/llvm/llvm-project/blob/8e6630391699116641cf390a10476295b7d4b95c/llvm/lib/Target/LoongArch/LoongArchLSXInstrInfo.td#L1583-L1584

And `vsplat_imm_eq_1` is defined as:

https://github.com/llvm/llvm-project/blob/8e6630391699116641cf390a10476295b7d4b95c/llvm/lib/Target/LoongArch/LoongArchLSXInstrInfo.td#L77-L87

For the `(bitconvert (v4i32 (build_vector)))` case, the pattern is
expected to be:
```
PATTERN: (xor:{ *:[v4i32] } v4i32:{ *:[v4i32] }:$vj, (shl:{ *:[v4i32] } (bitconvert:{ *:[v4i32] } (build_vector:{ *:[v4i32] }))<<P:Predicate_vsplat_imm_eq_1>>, v4i32:{ *:[v4i32] }:$vk))
RESULT:  (VBITREV_W:{ *:[v4i32] } v4i32:{ *:[v4i32] }:$vj, v4i32:{ *:[v4i32] }:$vk)
```

However, `simplifyTree` drops the `bitconvert` node and its predicates:

https://github.com/llvm/llvm-project/blob/8e6630391699116641cf390a10476295b7d4b95c/llvm/utils/TableGen/Common/CodeGenDAGPatterns.cpp#L3036-L3062

Then llvm will match `vsplat_imm_eq_1` for any v4i32 splats and cause a
miscompilation:
```
PATTERN: (xor:{ *:[v4i32] } v4i32:{ *:[v4i32] }:$vj, (shl:{ *:[v4i32] } (build_vector:{ *:[v4i32] }), v4i32:{ *:[v4i32] }:$vk))
RESULT:  (VBITREV_W:{ *:[v4i32] } v4i32:{ *:[v4i32] }:$vj, v4i32:{ *:[v4i32] }:$vk)
```

This patch adds additional checks for predicates associated with the
trivial bitconvert node. Unused patterns in the LoongArch target are
also removed.

Fixes llvm#116008.
Currently, null chunks always follow other aligned chunks, so this patch
is NFC. However, it will become observable once support for ARM64X
imports is added. The import tables are shared between the native and EC
views. They are usually very similar, but in cases where they differ,
ARM64X relocations handle the discrepancies. If a DLL is only imported
by EC code, the native view will see it as importing zero functions from
this DLL (with ARM64X relocations replacing those null chunks with
actual imports). In this scenario, the null chunks may appear as the
very first chunks, meaning there is nothing else forcing their
alignment.
Looks like a few different phrasings got mashed up together.
…vm#116610)

This patch improves the code reuse of the actions system and adds
several improvements for easier debugging via clang-repl
--debug-only=clang-repl.

The change inimproves the consistency of the TUKind when actions are
handled within a WrapperFrontendAction. In this case instead of falling
back to default TU_Complete, we forward to the TUKind of the ASTContext
which presumably was created by the intended action. This enables the
incremental infrastructure to reuse code.

This patch also clones the first llvm::Module because the first PTU now
can come from -include A.h and the presumption of llvm::Module being
empty does not hold. The changes are a first step to fix the issues with
`clang-repl --cuda`.
…16780)

A lot of interchange tests unnecessary relied on a build with ASSERTS
enabled. Instead, simply check the IR output for both negative and
positive tests so that we don't rely on debug messages. This increases
test coverage as these tests will now also run with non-assert builds.
For a couple of files keeping some of the debug tests was useful, so
separated out them out and moved them to a similarly named *-remarks.ll
file.
…vm#116545)

see llvm#73359

Declarative assemblyFormat ODS is more concise and requires less
boilerplate than filling out cpp interfaces.

Changes:
- updates the AccessChainOp defined in SPIRVMemoryOps.td to use
assemblyFormat.
- Removes part print/parse from MemoryOps.cpp which is now generated by
assemblyFormat
- Updates tests to updated format
The (extended) bit width might not fit into the (non-extended)
type, resulting in an incorrect truncation of the compared value.

Fix this by using m_SpecificInt(), which is both simpler and
handles this correctly.

Fixes the assertion failure reported in:
llvm#114539 (comment)
kazutakahirata and others added 29 commits November 20, 2024 12:58
We've switched to LineLocation from FieldsAre in MemProfUseTest.cpp.
This patch does the same thing in InstrProfTest.cpp.

llvm/unittests/Transforms/Instrumentation/MemProfUseTest.cpp
llvm#115627)

This fixes a missed optimization caused by the `foldBitcastExtElt`
pattern interfering with other combine patterns. In the case I was
hitting, we have IR that combines two vectors into a new larger vector
by extracting elements and inserting them into the new vector.

```llvm
define <4 x half> @bitcast_extract_insert_to_shuffle(i32 %a, i32 %b) {
  %avec = bitcast i32 %a to <2 x half>
  %a0 = extractelement <2 x half> %avec, i32 0
  %a1 = extractelement <2 x half> %avec, i32 1
  %bvec = bitcast i32 %b to <2 x half>
  %b0 = extractelement <2 x half> %bvec, i32 0
  %b1 = extractelement <2 x half> %bvec, i32 1
  %ins0 = insertelement <4 x half> undef, half %a0, i32 0
  %ins1 = insertelement <4 x half> %ins0, half %a1, i32 1
  %ins2 = insertelement <4 x half> %ins1, half %b0, i32 2
  %ins3 = insertelement <4 x half> %ins2, half %b1, i32 3
  ret <4 x half> %ins3
}
```

With the current behavior, `InstCombine` converts each vector extract
sequence to

```llvm
  %tmp = trunc i32 %a to i16
  %a0 = bitcast i16 %tmp to half
  %a1 = extractelement <2 x half> %avec, i32 1
```

where the extraction of `%a0` is now done by truncating the original
integer. While on it's own this is fairly reasonable, in this case it
also blocks the pattern which converts `extractelement` -
`insertelement` into shuffles which gives the overall simpler result:

```llvm
define <4 x half> @bitcast_extract_insert_to_shuffle(i32 %a, i32 %b) {
  %avec = bitcast i32 %a to <2 x half>
  %bvec = bitcast i32 %b to <2 x half>
  %ins3 = shufflevector <2 x half> %avec, <2 x half> %bvec, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  ret <4 x half> %ins3
}
```

In this PR I fix the conflict by obeying the `hasOneUse` check even if
there is no shift instruction required. In these cases we can't remove
the vector completely, so the pattern has less benefit anyway.

Also fwiw, I think dropping the `hasOneUse` check for the 0th element
might have been a mistake in the first place. Looking at
llvm@535c5d5
the commit message only mentions loosening the `isDesirableIntType`
requirement and doesn't mention changing the `hasOneUse` check at all.
Instead of custom selecting a bunch of instructions, we can expand to
generic MIR during legalization.
This reworks the free store implementation in libc's malloc to use a
dlmalloc-style binary trie of circularly linked FIFO free lists. This
data structure can be maintained in logarithmic time, but it still
permits a relatively small implementation compared to other
logarithmic-time ordered maps.

The implementation doesn't do the various bitwise tricks or
optimizations used in actual dlmalloc; it instead optimizes for
(relative) readability and minimum code size. Specific optimization can
be added as necessary given future profiling.
This patch upgrades a unit test to MemProf Version 3 while removing
those bits that cannot be upgraded to Version 3.

The bits being removed expect instrprof_error::hash_mismatch from a
broken MemProf profile that references a frame that doesn't actually
exist.  Now, Version 3 no longer issues
instrprof_error::hash_mismatch.  Even if it still issued
instrprof_error::hash_mismatch, we would have a couple of hurdles:

- InstrProfWriter::addMemProfData will soon require all (or none) of
  the fields (frames, call stacks, and records) be populated.  That
  is, it won't accept an instance of IndexedMemProfData with frames
  missing.

- writeMemProfV3 asserts that every frame occurs at least once:

  assert(MemProfData.Frames.size() == FrameHistogram.size());

This patch gives up on instrprof_error::hash_mismatch and tries to
trigger instrprof_error::unknown_function with the empty profile.
The Android clang-r536225 compiler identifies as Clang 19, but it is
based on commit fc57f88, which predates
the official LLVM 19.0.0 release.

Some tests need fixes:

* The sized delete tests fail because clang-r536225 leaves sized
deallocation off by default.

* std::array<T[0]> is true when this Android Clang version is used with
a trunk libc++, but we expect it to be false in the test. In practice,
Clang and libc++ usually come from the same commit on Android.
…vm#116151)

Android clang-r536225 identifies as Clang 19 but it predates LLVM
19.0.0. It is based off of fc57f88.
…ames with DWARFTypePrinter (llvm#117071)

This is a reland of llvm#112811.
Fixed the bot breakage by running ld.lld explicitly.
…3626)

Now that `-fbasic-block-sections=list` is enabled for Arm, functions may
be split aross multiple sections, and CFI information must be handled
independently for each section.

On x86, this is handled in `llvm/lib/CodeGen/CFIInstrInserter.cpp`.
However, this pass does not run on Arm, so we must add logic for it
to `llvm/lib/CodeGen/CFIFixup.cpp`.
…ible with minimal runtime (llvm#114865)

We are currently getting:

`clang: error: invalid argument '-fsanitize-minimal-runtime' not allowed
with '-fsanitize=implicit-conversion'`

when running

`-fsanitize=implicit-conversion -fsanitize-minimal-runtime`

because `implicit-conversion` now includes
`implicit-bitfield-conversion` which is not included in the `integer`
check. The `integer` check includes the `implicit-integer-conversion`
checks and is supported by the trapping option and because of that
compatible with the minimal runtime. It is thus reasonable to make
`implicit-bitfield-conversion` compatible with the minimal runtime.
…vm#114537)

Summary:
Consolidate the logic in a single function. We do an extra pass over
Instructions but this is necessary to untangle things and extract
metadata cloning in a future diff.

Test Plan:
```
$ ninja check-llvm-unit check-llvm
[211/213] Running the LLVM regression tests

Testing Time: 106.06s

Total Discovered Tests: 62601
  Skipped          :    17 (0.03%)
  Unsupported      :  2518 (4.02%)
  Passed           : 59911 (95.70%)
  Expectedly Failed:   155 (0.25%)
[212/213] Running lit suite 

Testing Time: 12.47s

Total Discovered Tests: 8474
  Skipped:   17 (0.20%)
  Passed : 8457 (99.80%)
```

Extracted from llvm#109032 (commit 3) (there are more refactors and cleanups
in subsequent commits)
When embedding, if `compiler.used` exists, we should re-use it's element
type instead of blindly assuming it's an unqualified pointer.
In MSVC, when `/d1initall` is enabled, `__declspec(no_init_all)` can be
applied to a type to suppress auto-initialization for all instances of
that type or to a function to suppress auto-initialization for all
locals within that function.

This change does the same for Clang, except that it applies to the
`-ftrivial-auto-var-init` flag instead.

NOTE: I did not add a Clang-specific spelling for this but would be
happy to make a followup PR if folks are interested in that.
)

ld64.lld would previously allow you to link against dylibs linked with
`-allowable_client`, even if the client's name does not match any
allowed client.

This change fixes that. See llvm#114146 for related discussion. 

The test binary `liballowable_client.dylib` was created on macOS with:

echo | clang -xc - -dynamiclib -mmacosx-version-min=10.11 -arch x86_64
-Wl,-allowable_client,allowed -o lib/liballowable_client.dylib
…lvm#116934)

The dialect conversion driver has three phases:
- **Create** `IRRewrite` objects as the IR is traversed.
- **Finalize** `IRRewrite` objects. During this phase, source
materializations for mismatching value types are created. (E.g., when
`Value` is replaced with a `Value` of different type, but there is a
user of the original value that was not modified because it is already
legal.)
- **Commit** `IRRewrite` objects. During this phase, all remaining IR
modifications are materialized. In particular, SSA values are actually
being replaced during this phase.

This commit removes the "finalize" phase. This simplifies the code base
a bit and avoids one traversal over the `IRRewrite` stack. Source
materializations are now built during the "commit" phase, right before
an SSA value is being replaced.

This commit also removes the "inverse mapping" of the conversion value
mapping, which was used to predict if an SSA value will be dead at the
end of the conversion. This check is replaced with an approximate check
that does not require an inverse mapping. (A false positive for `v` can
occur if another value `v2` is mapped to `v` and `v2` turns out to be
dead at the end of the conversion. This case is not expected to happen
very often.) This reduces the complexity of the driver a bit and removes
one potential source of bugs. (There have been bugs in the usage of the
inverse mapping in the past.)

`BlockTypeConversionRewrite` no longer stores a pointer to the type
converter. This pointer is now stored in `ReplaceBlockArgRewrite`.

This commit is in preparation of merging the 1:1 and 1:N dialect
conversion driver. It simplifies the upcoming changes around the
conversion value mapping. (API surface of the conversion value mapping
is reduced.)
…/fromPtr.

On arm64e, uses the "wrap" and "unwrap" operations introduced in f14cb49 to
sign and strip pointers by default. Signing / striping can be overriden at the
toPtr / fromPtr callside by passing an explicit wrap / unwrap operation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment