forked from llvm/llvm-project
[AutoBump] Merge with ceeb08b9 (Nov 18) (9) #469
Open
jorickert wants to merge 1,770 commits into feature/fused-ops from bump_to_ceeb08b9
Conversation
) Introduces the SYCL based toolchain and initial toolchain construction when using the '-fsycl' option. This option will enable SYCL based offloading, creating a SPIR-V based IR file packaged into the compiled host object. This includes early support for creating the host/device object using the new offloading model. The device object is created using the spir64-unknown-unknown target triple.

New/Updated Options:
* -fsycl: Enables SYCL offloading for host and device
* -fsycl-device-only: Enables device-only compilation for SYCL
* -fsycl-host-only: Enables host-only compilation for SYCL

RFC Reference: https://discourse.llvm.org/t/rfc-sycl-driver-enhancements/74092
…lvm#114874) Emit .ptr, .address-space, and .align attributes for kernel args in CUDA (previously handled only for OpenCL). This allows for more vectorization opportunities if the PTX consumer is able to know about the pointer alignments. If no alignment is explicitly specified, .align 1 will be emitted to match the LLVM IR semantics in this case. PTX ISA doc - https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#kernel-parameter-attribute-ptr This is a rework of the original patch proposed in llvm#79646 --------- Co-authored-by: Vandana <[email protected]>
The implementation is mostly based on the one existing for the exact flag. disjoint means that for each bit, that bit is zero in at least one of the inputs. This allows the Or to be treated as an Add since no carry can occur from any bit. If the disjoint keyword is present, the result value of the or is a [poison value](https://llvm.org/docs/LangRef.html#poisonvalues) if both inputs have a one in the same bit position. For vectors, only the element containing the bit is poison.
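The carry argument above can be illustrated outside of IR. A minimal Python sketch (with arbitrary example values, not taken from the patch) showing that `or` coincides with `add` exactly when no bit is set in both inputs:

```python
# Sketch of the "or disjoint" guarantee: when no bit position is one in
# both inputs, no carry can occur, so OR and ADD produce the same result.
def or_is_add_when_disjoint(a: int, b: int) -> bool:
    if a & b != 0:
        # Overlapping bit: under the disjoint flag the result would be poison.
        return False
    return (a | b) == (a + b)

assert or_is_add_when_disjoint(0b1010, 0b0101)      # disjoint: OR == ADD
assert not or_is_add_when_disjoint(0b1010, 0b0010)  # bit 1 set in both
```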
llvm#115777) Summary: Address spaces are used in several embedded and GPU targets to describe accesses to different types of memory. Currently we use the address space enumerations to control which address spaces are considered supersets of each other; however, this is also a target level property as described by the C standard's passing mentions. This patch allows the address space checks to use the target information to decide if a pointer conversion is legal. For AMDGPU and NVPTX, all supported address spaces can be converted to the default address space. More semantic checks can be added on top of this; for now I'm mainly looking to get more standard semantics working for C/C++. Right now the address space conversions must all be done explicitly in C/C++, unlike the offloading languages, which define their own custom address spaces that just map to the same target specific ones anyway. The main question is if this behavior is a function of the target or the language.
…115920) Aaron has been helping out with TSA for several years and is highly knowledgeable about the implementation.
Reverts llvm#116109 This change is breaking triton tests (results in huge numeric disparities, e.g. https://github.com/triton-lang/triton/blob/main/python/test/unit/language/test_core.py), we'll need to revert until a fix forward can be merged.
…lvm#116196) This is a follow-up for the conversation here llvm#115722. This test is designed for Windows target/PDB format, so it shouldn't be built and run for DWARF/etc.
From the discussion at the round-table at the RISC-V Summit it was clear people see cases where global merging would help. So the direction of enabling it by default and iteratively working to enable it in more cases or to improve the heuristics seems sensible. This patch tries to make a minimal step in that direction.
Certain DFP instructions have GPR arguments, which are currently incorrectly treated as FPR registers. Since we do not use DFP in codegen, this only affects the assembler and disassembler.
…7583) Supersedes the draft PR llvm#94992, taking a different approach following feedback: * Lower in PreISelIntrinsicLowering * Don't require that the number of bytes to set is a compile-time constant * Define llvm.memset_pattern rather than llvm.memset_pattern.inline As discussed in the [RFC thread](https://discourse.llvm.org/t/rfc-introducing-an-llvm-memset-pattern-inline-intrinsic/79496), the intent is that the intrinsic will be lowered to loops, a sequence of stores, or libcalls depending on the expected cost and availability of libcalls on the target. Right now, there's just a single lowering path that aims to handle all cases. My intent would be to follow up with additional PRs that add additional optimisations when possible (e.g. when libcalls are available, when arguments are known to be constant etc).
…6161) This PR removes tests with `br i1 undef` under `llvm/tests/Transforms/HotColdSplit` and `llvm/tests/Transforms/I*`.
On a 32-bit target, if pointer arithmetic with `addrspace` is used in i64 computation, the missed folding in InstCombine results in suboptimal performance, unlike the same code compiled for a 64-bit target.
…vm#114262) I noticed that the two C functions emitted different IR:
```
int switch_duplicate_arms(int switch_val, int v, int w) {
  switch (switch_val) {
  default:
    break;
  case 0:
    w = v;
    break;
  case 1:
    w = v;
    break;
  }
  return w;
}

int if_duplicate_arms(int switch_val, int v, int w) {
  if (switch_val == 0)
    w = v;
  else if (switch_val == 1)
    w = v;
  return w;
}
```
We generate IR that looks like this:
```
define i32 @switch_duplicate_arms(i32 %0, i32 %1, i32 %2, i32 %3) {
  switch i32 %1, label %7 [
    i32 0, label %5
    i32 1, label %6
  ]
5:
  br label %7
6:
  br label %7
7:
  %8 = phi i32 [ %3, %4 ], [ %2, %6 ], [ %2, %5 ]
  ret i32 %8
}

define i32 @if_duplicate_arms(i32 %0, i32 %1, i32 %2, i32 %3) {
  %5 = icmp ult i32 %1, 2
  %6 = select i1 %5, i32 %2, i32 %3
  ret i32 %6
}
```
For `switch_duplicate_arms`, taking case 0 and case 1 are the same since %5 and %6 branch to the same location and the incoming values for %8 are the same from those blocks. We could remove one of the duplicate switch targets and update the switch with the single target.

On RISC-V, prior to this patch, we generate the following code:
```
switch_duplicate_arms:
  li a4, 1
  beq a1, a4, .LBB0_2
  mv a0, a3
  bnez a1, .LBB0_3
.LBB0_2:
  mv a0, a2
.LBB0_3:
  ret

if_duplicate_arms:
  li a4, 2
  mv a0, a2
  bltu a1, a4, .LBB1_2
  mv a0, a3
.LBB1_2:
  ret
```
After this patch, the O3 code is optimized to the icmp + select pair, which gives us the same codegen as `if_duplicate_arms`, as desired. This results in one less branch instruction in the final assembly. This may help with both code size and further switch simplification. I found that this patch causes no significant impact to spec2006/int/ref and spec2017/intrate/ref.
---------
Co-authored-by: Min Hsu <[email protected]>
) @alexander-shaposhnikov is already listed as the nsan maintainer on the compiler-rt side (https://github.com/llvm/llvm-project/blob/a55248789ed3f653740e0723d016203b9d585f26/compiler-rt/CODE_OWNERS.TXT#L75-L77), so I'd like to mirror this to the LLVM part of the sanitizer as well.
) The logic for marking runtime libcalls unavailable currently duplicates essentially the same logic for some random subset of targets, where someone reported an issue and then someone went and fixed the issue for that specific target only. However, the availability for most of these is completely target independent. In particular: * MULO_I128 is never available in libgcc * Various I128 libcalls are not available for 32-bit targets in libgcc * powi is never available in MSVCRT Unify the logic for these, so we don't miss any targets. This fixes llvm#16778 on AArch64, which is one of the targets that was previously missed in this logic.
…llvm#97583)" This reverts commit 7ff3a9a. Recent scheduling changes mean tests need to be re-generated. Reverting to green while I do that.
…es (llvm#116250) This is a follow-up to llvm#115567. Emit an error for invalid function names, similar to MSVC's `lib.exe` behavior. Returning an error from `writeImportLibrary` exposed bugs in error handling by its callers, which have been addressed in this patch.
…llvm#97583) Relands 7ff3a9a after regenerating the test case. Supersedes the draft PR llvm#94992, taking a different approach following feedback: * Lower in PreISelIntrinsicLowering * Don't require that the number of bytes to set is a compile-time constant * Define llvm.memset_pattern rather than llvm.memset_pattern.inline As discussed in the [RFC thread](https://discourse.llvm.org/t/rfc-introducing-an-llvm-memset-pattern-inline-intrinsic/79496), the intent is that the intrinsic will be lowered to loops, a sequence of stores, or libcalls depending on the expected cost and availability of libcalls on the target. Right now, there's just a single lowering path that aims to handle all cases. My intent would be to follow up with additional PRs that add additional optimisations when possible (e.g. when libcalls are available, when arguments are known to be constant etc).
Identified with misc-include-cleaner.
Identified with misc-include-cleaner.
Identified with misc-include-cleaner.
Enables splat support for loads with lanes > 2 or number of operands > 2. Allows better detection of splats of loads and reduces the number of shuffles in some cases.

X86, AVX512, -O3+LTO
Metric: size..text
                                                                           results      results0     diff
test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test                 154867.00    156723.00    1.2%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test   12467735.00  12468023.00  0.0%
Better vectorization quality.

Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: llvm#115173
Right shift has higher precedence than bitwise AND, so there should be parentheses around the & operator. This case works as expected because IsVRegClassShift is 0; other cases will fail.
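The precedence pitfall described above can be reproduced in any C-like grammar. A short Python sketch (the names are made up to mirror the pattern, not the actual code from the patch):

```python
# In both C++ and Python, >> binds tighter than &, so
#   x & MASK >> SHIFT   parses as   x & (MASK >> SHIFT)
# rather than the intended (x & MASK) >> SHIFT.
IS_VREG_CLASS_MASK = 0b100   # hypothetical one-bit field mask
IS_VREG_CLASS_SHIFT = 2      # the bug is hidden when the shift is 0

x = 0b100
buggy = x & IS_VREG_CLASS_MASK >> IS_VREG_CLASS_SHIFT    # x & (mask >> shift) == 0
fixed = (x & IS_VREG_CLASS_MASK) >> IS_VREG_CLASS_SHIFT  # extracts the field == 1

assert buggy != fixed  # with a nonzero shift the two parses disagree
```

With a shift of 0 the two expressions happen to agree, which is exactly why the original code appeared to work.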
…translation (llvm#116416) I think this is a bit easier to read.
…sert/extract (llvm#116425) This is an NFC-ish change that moves vector.extractelement/vector.insertelement vector distribution patterns to vector.insert/vector.extract.

Before:
* 0-d/1-d vector.extract -> vector.extractelement -> distributed vector.extractelement
* 2-d+ vector.extract -> distributed vector.extract

After:
* scalar input vector.extract -> distributed vector.extract
* vector.extractelement -> distributed vector.extract
* 2-d+ vector.extract -> distributed vector.extract

The same changes are done for insertelement/insert. The change allows us to remove reliance on vector.extractelement/vector.insertelement, which are soon to be deprecated: https://discourse.llvm.org/t/rfc-psa-remove-vector-extractelement-and-vector-insertelement-ops-in-favor-of-vector-extract-and-vector-insert-ops/71116/8 No extra tests are included because this patch doesn't introduce / remove any functionality. It only changes the chain of lowerings. This change could be completely NFC if we made the distributed operation vector.extractelement/vector.insertelement, but that is slightly weird, because you would be going from extractelement -> extract -> extractelement.
…115738) The [`SPV_INTEL_split_barrier` extension](https://htmlpreview.github.io/?https://github.com/KhronosGroup/SPIRV-Registry/blob/main/extensions/INTEL/SPV_INTEL_split_barrier.html) defines operations to split control barrier semantics in two operations. Add support for these operations (arrive and wait) to the dialect. Signed-off-by: Victor Perez <[email protected]>
This PR adds the handling of `ClassType`. It is treated as a pointer to the underlying type. Note that `ClassType`, when passed to a function, has double indirection, so it is represented as a pointer to the type (compared to other types, which may have a single indirection). If `ClassType` wraps a pointer or allocatable, then we take care to generate it as PTR -> type (and not PTR -> PTR -> type). This is how it looks in the debugger.
```
subroutine test_proc (this)
  class(test_type), intent (inout) :: this
  allocate (this%b (3, 2))
  call fill_array_2d (this%b)
  print *, this%a
end
```
```
(gdb) p this
$6 = (PTR TO -> ( Type test_type )) 0x2052a0
(gdb) p this%a
$7 = 0
(gdb) p this%b
$8 = ((1, 2, 3) (4, 5, 6))
```
Change the signal handler to use a pipe to notify about incoming signals. This has two benefits:
* the signal no longer has to happen on the MainLoop thread. With the previous implementation, this had to be the case, as that was the only way to ensure that ppoll gets interrupted correctly. In a multithreaded process, this would mean that all other threads have to have the signal blocked at all times.
* we don't need the android-specific workaround, which was necessary due to the syscall being implemented in a non-atomic way.

When the MainLoop class was first implemented, we did not have the interrupt self-pipe, so syscall interruption was the most straightforward implementation. Over time, the class gained new abilities (the pipe being one of them), so we can now piggy-back on those.

This patch also changes the kevent-based implementation to use the pipe for signal notification as well. The motivation there is slightly different:
* it makes the implementations more uniform
* it makes sure we handle all kinds of signals, like we do with the linux version (EVFILT_SIGNAL only catches process-directed signals)
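The self-pipe technique the commit relies on can be sketched in a few lines of Python; this is an illustrative POSIX-only sketch of the general pattern, not lldb's actual implementation:

```python
import os
import select
import signal

# Self-pipe trick: the signal handler only writes one byte to a pipe;
# the event loop learns about the signal from an ordinary poll/select,
# so the signal may be delivered on any thread.
read_fd, write_fd = os.pipe()
os.set_blocking(write_fd, False)  # a handler must never block

def handler(signum, frame):
    os.write(write_fd, b"\x00")  # os.write is async-signal-safe

signal.signal(signal.SIGUSR1, handler)
os.kill(os.getpid(), signal.SIGUSR1)  # deliver a signal to ourselves

# The "main loop": select wakes up because the pipe became readable.
ready, _, _ = select.select([read_fd], [], [], 1.0)
assert read_fd in ready
assert os.read(read_fd, 1) == b"\x00"
```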
Part of llvm#51787. This patch adds constexpr support for the built-in reduce add function. If this is the right way to go, I will add support for other reduce functions in later patches. --------- Co-authored-by: Mariya Podchishchaeva <[email protected]>
…16506) Add ALL variable category, implement semantic checks to verify the validity of the clause, improve error messages, add testcases. The variable category modifier is optional since 5.0, make sure we allow it to be missing. If it is missing, assume "all" in clause conversion.
This introduces the MinGW counterpart of `lld-link`'s `-functionpadmin`.
Summary: The end goal is to make `rpc.h` a standalone header so that other projects can include it without leaking `libc` internals. I'm trying to replace stuff slowly before pulling it out all at once to reduce the size of the changes. This patch removes the atomic and a few sparse dependencies. Now we mostly rely on the GPU utils, the sleep function, optional, and the type traits. I'll clean these up in future patches. This removed the old stuff I had around the memcpy, but I think that it's not quite as bad as it once was, as it removes a branch and only uses a few extra VGPRs since I believe the builtin memcpy was improved for AMD.
…ons (llvm#113716) See specification ARM-software/abi-aa#295
This commit adds an intrinsic that will write to an image buffer. We chose to match the name of the DXIL intrinsic for simplicity in clang. We cannot reuse the existing OpenCL write_image function because that is not a reserved name in HLSL. There is not much common code to factor out.
…ed (llvm#116480) As mentioned in the title, the missing `consumeError` triggers assertions.
In MCA, the load/store unit is modeled through a `LSUnitBase` class. Judging from the name `LSUnitBase`, I believe there is an intent to allow for different specialized load/store unit implementations. (However, currently there is only one implementation used in-tree, `LSUnit`.)

PR llvm#101534 fixed one instance where the specialized `LSUnit` was hard-coded, opening the door for other subclasses to be used, but what subclasses can do is, in my opinion, still overly limited due to a reliance on the `MemoryGroup` class, e.g. [here](https://github.com/llvm/llvm-project/blob/8b55162e195783dd27e1c69fb4d97971ef76725b/llvm/lib/MCA/HardwareUnits/Scheduler.cpp#L88).

The `MemoryGroup` class is currently used in the default `LSUnit` implementation to model data dependencies/hazards in the pipeline. `MemoryGroups` form a graph of memory dependencies that inform the scheduler when load/store instructions can be executed relative to each other. In my eyes, this is an implementation detail. Other `LSUnit`s may want to keep track of data dependencies in different ways. As a concrete example, a downstream use I am working on<sup>[1]</sup> uses a custom load/store unit that makes use of available aliasing information. I haven't been able to shoehorn our additional aliasing information into the existing `MemoryGroup` abstraction.

I think there is no need to force subclasses to use `MemoryGroup`s; users of `LSUnitBase` are only concerned with when, and for how long, a load/store instruction executes. This PR makes changes to instead leave it up to the subclasses how to model such dependencies, and only prescribes an abstract interface in `LSUnitBase`. It also moves data members and methods that are not necessary to provide an abstract interface from `LSUnitBase` to the `LSUnit` subclass. I decided to make the `MemoryGroup` a protected subclass of `LSUnit`; that way, specializations may inherit from `LSUnit` and still make use of `MemoryGroup`s if they wish to do so (e.g. if they want to only overwrite the `dispatch` method).

**Drawbacks / Considerations**

My reason for suggesting this PR is an out-of-tree use. As such, these changes don't introduce any new functionality for in-tree LLVM uses. However, in my opinion, these changes improve code clarity and prescribe a clear interface, which would be the main benefit for the LLVM community. A drawback of the more abstract interface is that virtual dispatching is used in more places. However, note that virtual dispatch is already currently used in some critical parts of the `LSUnitBase`, e.g. the `isAvailable` and `dispatch` methods. As a quick check to ensure these changes don't significantly negatively impact performance, I also ran `time llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3000 llvm/test/tools/llvm-mca/X86/BtVer2/dot-product.s` before and after the changes; there was no observable difference in runtimes (`0.292 s` total before, `0.286 s` total after changes).

<sup>[1]: MCAD started by @mshockwave and @chinmaydd.</sup>
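The interface split described above can be summarized with a rough Python analogue; the names mirror the commit's `isAvailable`/`dispatch` methods, but the bookkeeping shown is invented for illustration, not MCA's actual implementation:

```python
from abc import ABC, abstractmethod

# The base class only prescribes *when* a memory operation may execute;
# *how* dependencies are tracked is entirely up to the subclass, which
# is the point of moving MemoryGroup out of the abstract interface.
class LSUnitBase(ABC):
    @abstractmethod
    def is_available(self, inst) -> bool:
        """May this load/store be dispatched right now?"""
    @abstractmethod
    def dispatch(self, inst) -> None:
        """Record that the load/store has entered the unit."""

class SimpleLSUnit(LSUnitBase):
    """A subclass with its own dependency bookkeeping (here: a plain list,
    standing in for MemoryGroup or an alias-aware structure)."""
    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.pending = []
    def is_available(self, inst) -> bool:
        return len(self.pending) < self.capacity
    def dispatch(self, inst) -> None:
        self.pending.append(inst)

lsu = SimpleLSUnit()
assert lsu.is_available("load r1")
lsu.dispatch("load r1")
assert lsu.pending == ["load r1"]
```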
This patch fixes: lldb/source/Host/posix/MainLoopPosix.cpp:64:11: error: unused variable 'bytes_written' [-Werror,-Wunused-variable]
This patch removes clang/Parse/ParseDiagnostic.h because it just forwards to clang/Basic/DiagnosticParse.h.
Identified with misc-include-cleaner.
Add patterns to lower `fmaxnum(fma(a, b, c), 0)` to `fma.rn{.ftz}.relu` for `f16`, `f16x2`, `bf16`, `bf16x2` types when `nnan` is used. `fma_relu` honours `NaN`, so the substitution is only made if the `fma` is `nnan`, since `fmaxnum` returns the non-NaN argument when passed a NaN value. This patch also removes some `bf16` ftz instructions, since `FTZ` is not supported with the `bf16` type, according to the PTX ISA docs.
llvm#115544) The input generating functions for benchmark tests in the GenerateInput.h file can be slightly improved by invoking vector::reserve before calling vector::push_back. This slight performance improvement could potentially speed up all benchmark tests for containers and algorithms that use these functions as inputs.
…conditions (llvm#116627) This patch bails out non-dedicated exits to avoid adding exiting conditions to invalid context. Closes llvm#116553.
…m#116621) It also modifies the error message to specify it is the dependence-type that is not supported. Resolves the crash in llvm#115647. A fix can come in later as part of future OpenMP version support.
This reverts commit 4f48a81. The newly added test was failing on the public macOS Arm64 bots: ``` ====================================================================== FAIL: test_column_breakpoints (TestDAP_breakpointLocations.TestDAP_setBreakpoints) Test retrieving the available breakpoint locations. ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/ec2-user/jenkins/workspace/llvm.org/as-lldb-cmake/llvm-project/lldb/test/API/tools/lldb-dap/breakpoint/TestDAP_breakpointLocations.py", line 77, in test_column_breakpoints self.assertEqual( AssertionError: Lists differ: [{'co[70 chars]e': 41}, {'column': 3, 'line': 42}, {'column': 18, 'line': 42}] != [{'co[70 chars]e': 42}, {'column': 18, 'line': 42}] First differing element 2: {'column': 3, 'line': 41} {'column': 3, 'line': 42} First list contains 1 additional elements. First extra element 4: {'column': 18, 'line': 42} [{'column': 39, 'line': 40}, {'column': 51, 'line': 40}, - {'column': 3, 'line': 41}, {'column': 3, 'line': 42}, {'column': 18, 'line': 42}] Config=arm64-/Users/ec2-user/jenkins/workspace/llvm.org/as-lldb-cmake/lldb-build/bin/clang ---------------------------------------------------------------------- Ran 1 test in 1.554s FAILED (failures=1) ```
mgehre-amd approved these changes on Feb 14, 2025.
No description provided.