
[AutoBump] Merge with ceeb08b9 (Nov 18) (9) #469

Open

wants to merge 1,770 commits into base: feature/fused-ops
Conversation

jorickert

No description provided.

jayfoad and others added 30 commits November 15, 2024 11:54
)

Introduces the SYCL based toolchain and initial toolchain construction
when using the '-fsycl' option. This option will enable SYCL based
offloading, creating a SPIR-V based IR file packaged into the compiled
host object.

This includes early support for creating the host/device object using
the new offloading model. The device object is created using the
spir64-unknown-unknown target triple.

New/Updated Options:
 -fsycl  Enables SYCL offloading for host and device
 -fsycl-device-only
         Enables device only compilation for SYCL
 -fsycl-host-only
         Enables host only compilation for SYCL

RFC Reference:
https://discourse.llvm.org/t/rfc-sycl-driver-enhancements/74092
…lvm#114874)

Emit .ptr, .address-space, and .align attributes for kernel
args in CUDA (previously handled only for OpenCL).

This allows for more vectorization opportunities if the PTX consumer
is able to know about the pointer alignments.

If no alignment is explicitly specified, .align 1 will be emitted
to match the LLVM IR semantics in this case.

PTX ISA doc -
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#kernel-parameter-attribute-ptr

This is a rework of the original patch proposed in llvm#79646

---------

Co-authored-by: Vandana <[email protected]>
The implementation is mostly based on the one existing for the exact
flag.

disjoint means that for each bit, that bit is zero in at least one of
the inputs. This allows the Or to be treated as an Add since no carry
can occur from any bit. If the disjoint keyword is present, the result
value of the or is a [poison
value](https://llvm.org/docs/LangRef.html#poisonvalues) if both inputs
have a one in the same bit position. For vectors, only the element
containing the bit is poison.
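As a standalone illustration of why disjointness makes an Or behave like an Add (plain C++, not part of the patch):

```
#include <cassert>
#include <cstdint>

int main() {
  // "disjoint": no bit is set in both inputs, so an Or can never carry.
  uint8_t a = 0x50, b = 0x0a;
  assert((a & b) == 0);     // the disjoint precondition holds
  assert((a | b) == a + b); // hence the Or computes exactly an Add
}
```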
llvm#115777)

Summary:
Address spaces are used in several embedded and GPU targets to describe
accesses to different types of memory. Currently we use the address
space enumerations to control which address spaces are considered
supersets of each other; however, this is also a target-level property, as
the C standard mentions in passing. This patch allows the
address space checks to use the target information to decide if a
pointer conversion is legal. For AMDGPU and NVPTX, all supported address
spaces can be converted to the default address space.

More semantic checks can be added on top of this; for now I'm mainly
looking to get more standard semantics working for C/C++. Right now,
address space conversions must all be done explicitly in C/C++, unlike in
the offloading languages, which define their own custom address spaces
that just map to the same target-specific ones anyway. The main question
is whether this behavior is a function of the target or the language.
…115920)

Aaron has been helping out with TSA for several years and is highly
knowledgeable about the implementation.
Reverts llvm#116109

This change is breaking Triton tests (it results in huge numeric
disparities, e.g. in
https://github.com/triton-lang/triton/blob/main/python/test/unit/language/test_core.py);
we'll need to revert until a fix-forward can be merged.
…lvm#116196)

This is a follow-up for the conversation here
llvm#115722.

This test is designed for Windows target/PDB format, so it shouldn't be
built and run for DWARF/etc.
From the discussion at the round-table at the RISC-V Summit, it was clear
that people see cases where global merging would help. So the direction of
enabling it by default and iteratively working to enable it in more
cases, or to improve the heuristics, seems sensible. This patch tries to
make a minimal step in that direction.
Certain DFP instructions have GPR arguments, which are currently
incorrectly treated as FPR registers.  Since we do not use DFP
in codegen, this only affects the assembler and disassembler.
…7583)

Supersedes the draft PR llvm#94992, taking a different approach following
feedback:
* Lower in PreISelIntrinsicLowering
* Don't require that the number of bytes to set is a compile-time
constant
* Define llvm.memset_pattern rather than llvm.memset_pattern.inline

As discussed in the [RFC
thread](https://discourse.llvm.org/t/rfc-introducing-an-llvm-memset-pattern-inline-intrinsic/79496),
the intent is that the intrinsic will be lowered to loops, a sequence of
stores, or libcalls depending on the expected cost and availability of
libcalls on the target. Right now, there's just a single lowering path
that aims to handle all cases. My intent would be to follow up with
additional PRs that add further optimisations when possible (e.g.
when libcalls are available, when arguments are known to be constant,
etc.).
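As a rough sketch of what the loop lowering computes (plain C++ for illustration; the pass itself emits IR, and the intrinsic's exact signature is not reproduced here):

```
#include <cstddef>

// Illustrative shape only: store `count` copies of a pattern value, which is
// what the loop lowering of the intrinsic amounts to for an int-sized pattern.
void memset_pattern_loop(int *dst, int pattern, std::size_t count) {
  for (std::size_t i = 0; i != count; ++i)
    dst[i] = pattern; // one pattern-sized store per iteration
}
```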
…6161)

This PR removes tests with `br i1 undef` under
`llvm/tests/Transforms/HotColdSplit` and `llvm/tests/Transforms/I*`.
On a 32-bit target, if pointer arithmetic with `addrspace` is used in i64
computation, the missed folding in InstCombine results in suboptimal
performance, unlike the same code compiled for a 64-bit target.
…vm#114262)

I noticed that the two C functions emitted different IR:

```
int switch_duplicate_arms(int switch_val, int v, int w) {
  switch (switch_val) {
  default:
    break;
  case 0:
    w = v;
    break;
  case 1:
    w = v;
    break;
  }
  return w;
}

int if_duplicate_arms(int switch_val, int v, int w) {
  if (switch_val == 0)
    w = v;
  else if (switch_val == 1)
    w = v;
  return w;
}
```

We generate IR that looks like this:

```
define i32 @switch_duplicate_arms(i32 %0, i32 %1, i32 %2, i32 %3) {
  switch i32 %1, label %7 [
    i32 0, label %5
    i32 1, label %6
  ]

5:
  br label %7

6:
  br label %7

7:
  %8 = phi i32 [ %3, %4 ], [ %2, %6 ], [ %2, %5 ]
  ret i32 %8
}

define i32 @if_duplicate_arms(i32 %0, i32 %1, i32 %2, i32 %3) {
  %5 = icmp ult i32 %1, 2
  %6 = select i1 %5, i32 %2, i32 %3
  ret i32 %6
}
```

For `switch_duplicate_arms`, taking case 0 and case 1 is the same, since
%5 and %6 branch to the same location and the incoming values for %8 from
those blocks are the same. We could remove one of the duplicate switch
targets and update the switch with the single remaining target.

On RISC-V, prior to this patch, we generate the following code:
```
switch_duplicate_arms:
        li      a4, 1
        beq     a1, a4, .LBB0_2
        mv      a0, a3
        bnez    a1, .LBB0_3
.LBB0_2:
        mv      a0, a2
.LBB0_3:
        ret

if_duplicate_arms:
        li      a4, 2
        mv      a0, a2
        bltu    a1, a4, .LBB1_2
        mv      a0, a3
.LBB1_2:
        ret
```

After this patch, the O3 code is optimized to the icmp + select pair,
which gives us the same codegen as `if_duplicate_arms`, as desired. This
results in one less branch instruction in the final assembly.

This may help with both code size and further switch simplification. I
found that this patch causes no significant impact to spec2006/int/ref and
spec2017/intrate/ref.

---------

Co-authored-by: Min Hsu <[email protected]>
)

@alexander-shaposhnikov is already listed as the nsan maintainer on the
compiler-rt side
(https://github.com/llvm/llvm-project/blob/a55248789ed3f653740e0723d016203b9d585f26/compiler-rt/CODE_OWNERS.TXT#L75-L77),
so I'd like to mirror this to the LLVM part of the sanitizer as well.
)

The logic for marking runtime libcalls unavailable currently duplicates
essentially the same logic for some random subset of targets, where
someone reported an issue and then someone went and fixed the issue for
that specific target only. However, the availability for most of these
is completely target independent. In particular:

 * MULO_I128 is never available in libgcc
 * Various I128 libcalls are not available for 32-bit targets in libgcc
 * powi is never available in MSVCRT

Unify the logic for these, so we don't miss any targets. This fixes
llvm#16778 on AArch64, which is
one of the targets that was previously missed in this logic.
…llvm#97583)"

This reverts commit 7ff3a9a.

Recent scheduling changes mean the tests need to be regenerated. Reverting
to green while I do that.
…es (llvm#116250)

This is a follow-up to llvm#115567. Emit an error for invalid function
names, similar to MSVC's `lib.exe` behavior.

Returning an error from `writeImportLibrary` exposed bugs in error
handling by its callers, which have been addressed in this patch.
…llvm#97583)

Relands 7ff3a9a after regenerating the
test case.

Supersedes the draft PR llvm#94992, taking a different approach following
feedback:
* Lower in PreISelIntrinsicLowering
* Don't require that the number of bytes to set is a compile-time
constant
* Define llvm.memset_pattern rather than llvm.memset_pattern.inline

As discussed in the [RFC
thread](https://discourse.llvm.org/t/rfc-introducing-an-llvm-memset-pattern-inline-intrinsic/79496),
the intent is that the intrinsic will be lowered to loops, a sequence of
stores, or libcalls depending on the expected cost and availability of
libcalls on the target. Right now, there's just a single lowering path
that aims to handle all cases. My intent would be to follow up with
additional PRs that add further optimisations when possible (e.g.
when libcalls are available, when arguments are known to be constant,
etc.).
Identified with misc-include-cleaner.
Identified with misc-include-cleaner.
Identified with misc-include-cleaner.
Enables splat support for loads with lanes > 2 or number of operands > 2.

Allows better detection of splats of loads and reduces the number of
shuffles in some cases.

X86, AVX512, -O3+LTO

Metric: size..text
                                                                          results     results0    diff
               test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test   154867.00   156723.00  1.2%
 test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12467735.00 12468023.00  0.0%

Better vectorization quality

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: llvm#115173
4vtomat and others added 27 commits November 18, 2024 18:12
Right shift has higher precedence than bitwise AND, so parentheses are
needed around the `&` expression. This case happens to work as expected
because IsVRegClassShift is 0; other cases would fail.
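A standalone C++ check of the precedence pitfall (illustrative names, not the patched code):

```
#include <cassert>

int main() {
  unsigned Encoding = 0xC, Mask = 0x4, Shift = 2;
  // '>>' binds tighter than '&', so the unparenthesized form shifts first:
  assert((Encoding & Mask >> Shift) == (Encoding & (Mask >> Shift)));
  // Masking before shifting requires the explicit parentheses:
  assert(((Encoding & Mask) >> Shift) == 0x1);
}
```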
…translation (llvm#116416)

I think this is a bit easier to read.
…sert/extract (llvm#116425)

This is a NFC-ish change that moves
vector.extractelement/vector.insertelement vector distribution patterns
to vector.insert/vector.extract.

Before:

0-d/1-d vector.extract -> vector.extractelement -> distributed vector.extractelement
2-d+ vector.extract -> distributed vector.extract

After:

scalar input vector.extract -> distributed vector.extract
vector.extractelement -> distributed vector.extract
2-d+ vector.extract -> distributed vector.extract

The same changes are done for insertelement/insert. The change allows us
to remove reliance on vector.extractelement/vector.insertelement, which
are soon to be deprecated:
https://discourse.llvm.org/t/rfc-psa-remove-vector-extractelement-and-vector-insertelement-ops-in-favor-of-vector-extract-and-vector-insert-ops/71116/8

No extra tests are included because this patch doesn't introduce or
remove any functionality; it only changes the chain of lowerings. This
change could be completely NFC if we made the distributed operation
vector.extractelement/vector.insertelement, but that is slightly weird,
because you would be going from extractelement -> extract -> extractelement.
…115738)

The [`SPV_INTEL_split_barrier` extension](https://htmlpreview.github.io/?https://github.com/KhronosGroup/SPIRV-Registry/blob/main/extensions/INTEL/SPV_INTEL_split_barrier.html)
defines operations that split control barrier semantics into two operations.
Add support for these operations (arrive and wait) to the dialect.

Signed-off-by: Victor Perez <[email protected]>
This PR adds the handling of `ClassType`. It is treated as a pointer to
the underlying type. Note that a `ClassType` passed to a function has
double indirection, so it is represented as a pointer to the type
(compared to other types, which may have a single indirection).

If `ClassType` wraps a pointer or allocatable, then we take care to
generate it as PTR -> type (and not PTR -> PTR -> type).

This is how it looks in the debugger:

```
subroutine test_proc (this)
    class(test_type), intent (inout) :: this
    allocate (this%b (3, 2))
    call fill_array_2d (this%b)
    print *, this%a
end
```

```
(gdb) p this
$6 = (PTR TO -> ( Type test_type )) 0x2052a0
(gdb) p this%a
$7 = 0
(gdb) p this%b
$8 = ((1, 2, 3) (4, 5, 6))

```
Change the signal handler to use a pipe to notify about incoming
signals. This has two benefits:
- the signal no longer has to happen on the MainLoop thread. With the
previous implementation, this had to be the case as that was the only
way to ensure that ppoll gets interrupted correctly. In a multithreaded
process, this would mean that all other threads have to have the signal
blocked at all times.
- we don't need the android-specific workaround, which was necessary due
to the syscall being implemented in a non-atomic way

When the MainLoop class was first implemented, we did not have the
interrupt self-pipe, so syscall interruption was the most
straightforward implementation. Over time, the class gained new
abilities (the pipe being one of them), so we can now piggy-back on
those.

This patch also changes the kevent-based implementation to use the pipe
for signal notification as well. The motivation there is slightly
different:
- it makes the implementations more uniform
- it makes sure we handle all kinds of signals, like we do with the
linux version (EVFILT_SIGNAL only catches process-directed signals)
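A minimal self-pipe sketch of the technique (hypothetical standalone code, not the actual MainLoopPosix implementation):

```
#include <csignal>
#include <cstdio>
#include <poll.h>
#include <unistd.h>

static int g_pipe[2]; // [0] is polled by the loop, [1] is written by the handler

static void handler(int signo) {
  unsigned char byte = static_cast<unsigned char>(signo);
  (void)write(g_pipe[1], &byte, 1); // write() is async-signal-safe
}

int main() {
  pipe(g_pipe);
  std::signal(SIGUSR1, handler);
  std::raise(SIGUSR1); // stand-in for a signal arriving from any thread

  pollfd pfd = {g_pipe[0], POLLIN, 0};
  if (poll(&pfd, 1, -1) > 0) { // woken via the pipe, not via syscall interruption
    unsigned char byte;
    read(g_pipe[0], &byte, 1); // drain and dispatch on the loop thread
    std::printf("signal %d handled\n", byte);
  }
}
```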
Part of llvm#51787.

This patch adds constexpr support for the built-in reduce add function.
If this is the right way to go, I will add support for other reduce
functions in later patches.
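A hedged example of the intended use (Clang's vector_size extension; assumes a Clang with this patch applied, and is a hypothetical usage sketch, not a test from the patch):

```
typedef int v4si __attribute__((vector_size(16)));

constexpr v4si v = {1, 2, 3, 4};
static_assert(__builtin_reduce_add(v) == 10, "folds at compile time");
```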

---------

Co-authored-by: Mariya Podchishchaeva <[email protected]>
…16506)

Add ALL variable category, implement semantic checks to verify the
validity of the clause, improve error messages, add testcases.

The variable category modifier has been optional since OpenMP 5.0; make
sure we allow it to be missing. If it is missing, assume "all" in clause
conversion.
This introduces the MinGW counterpart of `lld-link`'s `-functionpadmin`.
Summary:
The end goal is to make `rpc.h` a standalone header so that other
projects can include it without leaking `libc` internals. I'm trying to
replace stuff slowly before pulling it out all at once to reduce the
size of the changes.

This patch removes the atomic and a few sparse dependencies. Now we
mostly rely on the GPU utils, the sleep function, optional, and the
type traits. I'll clean these up in future patches. This removed the old
stuff I had around the memcpy, but I think that it's not quite as bad as
it once was, as it removes a branch and only uses a few extra VGPRs
since I believe the builtin memcpy was improved for AMD.
This commit adds an intrinsic that will write to an image buffer. We
chose to match the name of the DXIL intrinsic for simplicity in clang.

We cannot reuse the existing OpenCL write_image function because that is
not a reserved name in HLSL. There is not much common code to factor
out.
…ed (llvm#116480)

As mentioned in the title, the missing `consumeError` triggers assertions.
In MCA, the load/store unit is modeled through an `LSUnitBase` class.
Judging from the name `LSUnitBase`, I believe there is an intent to
allow for different specialized load/store unit implementations.
(However, currently there is only one implementation used in-tree,
`LSUnit`.)

PR llvm#101534 fixed one instance where the specialized `LSUnit` was
hard-coded, opening the door for other subclasses to be used, but what
subclasses can do is, in my opinion, still overly limited due to a
reliance on the `MemoryGroup` class, e.g.
[here](https://github.com/llvm/llvm-project/blob/8b55162e195783dd27e1c69fb4d97971ef76725b/llvm/lib/MCA/HardwareUnits/Scheduler.cpp#L88).

The `MemoryGroup` class is currently used in the default `LSUnit`
implementation to model data dependencies/hazards in the pipeline.
`MemoryGroups` form a graph of memory dependencies that inform the
scheduler when load/store instructions can be executed relative to each
other.

In my eyes, this is an implementation detail. Other `LSUnit`s may want
to keep track of data dependencies in different ways. As a concrete
example, a downstream use I am working on<sup>[1]</sup> uses a custom
load/store unit that makes use of available aliasing information. I
haven't been able to shoehorn our additional aliasing information into
the existing `MemoryGroup` abstraction. I think there is no need to
force subclasses to use `MemoryGroup`s; users of `LSUnitBase` are only
concerned with when, and for how long, a load/store instruction
executes.

This PR makes changes to instead leave it up to the subclasses how to
model such dependencies, and only prescribes an abstract interface in
`LSUnitBase`. It also moves data members and methods that are not
necessary to provide an abstract interface from `LSUnitBase` to the
`LSUnit` subclass. I decided to make the `MemoryGroup` a protected
subclass of `LSUnit`; that way, specializations may inherit from
`LSUnit` and still make use of `MemoryGroup`s if they wish to do so
(e.g. if they want to only override the `dispatch` method).

**Drawbacks / Considerations**

My reason for suggesting this PR is an out-of-tree use. As such, these
changes don't introduce any new functionality for in-tree LLVM uses.
However, in my opinion, these changes improve code clarity and prescribe
a clear interface, which would be the main benefit for the LLVM
community.

A drawback of the more abstract interface is that virtual dispatching is
used in more places. However, note that virtual dispatch is already
currently used in some critical parts of the `LSUnitBase`, e.g. the
`isAvailable` and `dispatch` methods. As a quick check to ensure these
changes don't significantly negatively impact performance, I also ran
`time llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2
-iterations=3000 llvm/test/tools/llvm-mca/X86/BtVer2/dot-product.s`
before and after the changes; there was no observable difference in
runtimes (`0.292 s` total before, `0.286 s` total after changes).

<sup>[1]: MCAD started by @mshockwave and @chinmaydd.</sup>
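A condensed sketch of the resulting shape (abridged, assumed names; not the verbatim llvm/MCA headers):

```
struct InstRef {}; // stand-in for llvm::mca::InstRef

// LSUnitBase prescribes only when, and for how long, a load/store executes.
class LSUnitBase {
public:
  virtual ~LSUnitBase() = default;
  virtual bool isAvailable(const InstRef &IR) const = 0; // hazard check
  virtual unsigned dispatch(const InstRef &IR) = 0;      // start tracking IR
};

// LSUnit keeps MemoryGroup as a protected detail, so subclasses can reuse the
// default dependency graph (e.g. when overriding only dispatch()).
class LSUnit : public LSUnitBase {
protected:
  class MemoryGroup { /* default dependency-graph node */ };

public:
  bool isAvailable(const InstRef &) const override { return true; }
  unsigned dispatch(const InstRef &) override { return 0; }
};
```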
This patch fixes:

  lldb/source/Host/posix/MainLoopPosix.cpp:64:11: error: unused
  variable 'bytes_written' [-Werror,-Wunused-variable]
This patch removes clang/Parse/ParseDiagnostic.h because it just
forwards to clang/Basic/DiagnosticParse.h.
Identified with misc-include-cleaner.
Add patterns to lower `fmaxnum(fma(a, b, c), 0)` to `fma.rn{.ftz}.relu`
for `f16`, `f16x2`, `bf16`, `bf16x2` types, when `nnan` is used.

`fma_relu` honours `NaN`, so the substitution is only made if the `fma`
is `nnan`, since `fmaxnum` returns the non-NaN argument when passed a
NaN value.

This patch also removes some `bf16` ftz instructions since `FTZ` is not
supported with the `bf16` type, according to the PTX ISA docs.
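A scalar reference for the matched pattern (plain C++ illustration; the patch itself adds f16/bf16 ISel patterns in the NVPTX backend):

```
#include <cmath>

// fmaxnum(fma(a, b, c), 0) clamps negative results to zero, i.e. a ReLU.
// The rewrite to fma.rn.relu is only sound under nnan: fmaxnum(NaN, 0)
// returns 0, whereas fma.relu keeps the NaN.
float fma_relu_reference(float a, float b, float c) {
  return std::fmax(std::fma(a, b, c), 0.0f);
}
```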
llvm#115544)

The input-generating functions for benchmark tests in the GenerateInput.h
file can be slightly improved by invoking vector::reserve before calling
vector::push_back. This slight performance improvement could potentially
speed up all benchmark tests for containers and algorithms that use these
functions as inputs.
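A minimal sketch of the shape of the change (illustrative only, not the actual GenerateInput.h code):

```
#include <cstddef>
#include <vector>

std::vector<int> generateInput(std::size_t n) {
  std::vector<int> v;
  v.reserve(n); // one allocation up front instead of repeated regrowth
  for (std::size_t i = 0; i != n; ++i)
    v.push_back(static_cast<int>(i)); // push_back never reallocates now
  return v;
}
```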
…conditions (llvm#116627)

This patch bails out non-dedicated exits to avoid adding exiting
conditions to invalid context.
Closes llvm#116553.
…m#116621)

It also modifies the error message to specify it is the dependence-type
that is not supported.

Resolves the crash in
llvm#115647. A fix can come in
later as part of future OpenMP version support.
This reverts commit 4f48a81.

The newly added test was failing on the public macOS Arm64 bots:
```
======================================================================
FAIL: test_column_breakpoints (TestDAP_breakpointLocations.TestDAP_setBreakpoints)
   Test retrieving the available breakpoint locations.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/ec2-user/jenkins/workspace/llvm.org/as-lldb-cmake/llvm-project/lldb/test/API/tools/lldb-dap/breakpoint/TestDAP_breakpointLocations.py", line 77, in test_column_breakpoints
    self.assertEqual(
AssertionError: Lists differ: [{'co[70 chars]e': 41}, {'column': 3, 'line': 42}, {'column': 18, 'line': 42}] != [{'co[70 chars]e': 42}, {'column': 18, 'line': 42}]

First differing element 2:
{'column': 3, 'line': 41}
{'column': 3, 'line': 42}

First list contains 1 additional elements.
First extra element 4:
{'column': 18, 'line': 42}

  [{'column': 39, 'line': 40},
   {'column': 51, 'line': 40},
-  {'column': 3, 'line': 41},
   {'column': 3, 'line': 42},
   {'column': 18, 'line': 42}]
Config=arm64-/Users/ec2-user/jenkins/workspace/llvm.org/as-lldb-cmake/lldb-build/bin/clang
----------------------------------------------------------------------
Ran 1 test in 1.554s

FAILED (failures=1)
```
@jorickert jorickert requested a review from mgehre-amd February 14, 2025 15:38
Base automatically changed from bump_to_3494ee95 to feature/fused-ops February 18, 2025 16:32