[AArch64] NEON, SVE2 and SME2 instruction support with tests #439

FinnWilkinson · 2024-11-04T18:11:15Z

This PR adds a wide range of NEON, SVE2 and SME2 instructions with regression tests. These support a subset of internal SME-based GEMM and GEMV codes.

There is some prototype BF16 instruction support, which is disabled by default (using a new build option and an if statement in each appropriate switch statement case). This is because it uses __bf16, which is not compiler-agnostic, relies on some hacky usage of memcpy to reinterpret uint16_t values, and lacks regression tests for the BF16 instructions in question.
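For reference, a BF16 value is the upper 16 bits of an IEEE-754 binary32 float, so widening raw uint16_t bits to float needs only a shift and a bit copy, without relying on __bf16. The sketch below illustrates that idea and is not the code in this PR; the name bf16BitsToFloat is hypothetical. Copying the bits with std::memcpy is the standard-conforming way to reinterpret them before C++20 (std::bit_cast does the same in C++20).

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: widen raw BF16 bits to a 32-bit float.
// BF16 keeps the sign, the 8-bit exponent and the top 7 mantissa bits of a
// binary32 value, so placing the 16 bits in the upper half is exact.
inline float bf16BitsToFloat(uint16_t bits) {
  const uint32_t widened = static_cast<uint32_t>(bits) << 16;
  float out;
  static_assert(sizeof(out) == sizeof(widened), "float must be 32 bits");
  // memcpy is well-defined type punning, unlike a reinterpret_cast.
  std::memcpy(&out, &widened, sizeof(out));
  return out;
}
```

Because the conversion is exact, no rounding mode is involved in this direction; narrowing float to BF16 is where rounding choices would matter.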

These BF16 instructions can be enabled with a new CMake option, -DSIMENG_ENABLE_BF16=ON. I have deliberately not included this in the documentation, given the possible instability of the BF16 implementation, to keep it (mainly) for internal use only.
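A minimal sketch of the gating pattern described in the PR text, assuming the CMake option ends up as a SIMENG_ENABLE_BF16 preprocessor definition (how the option is forwarded to the compiler is an assumption here). The Opcode and ExecStatus enums and the execute function are hypothetical stand-ins for the simulator's real switch statement.

```cpp
// Assumption: the build defines SIMENG_ENABLE_BF16 (to 1) when configured
// with -DSIMENG_ENABLE_BF16=ON; otherwise BF16 support defaults to off.
#ifndef SIMENG_ENABLE_BF16
#define SIMENG_ENABLE_BF16 0
#endif

// Hypothetical opcode and status types, for illustration only.
enum class Opcode { ADD, BFDOT, BFMOPA };
enum class ExecStatus { Executed, NotImplemented };

ExecStatus execute(Opcode op) {
  switch (op) {
    case Opcode::ADD:
      return ExecStatus::Executed;
    case Opcode::BFDOT:
    case Opcode::BFMOPA:
      // Each BF16 case is guarded by an ordinary if statement, so the
      // prototype logic only runs when the build option is enabled.
      if (!SIMENG_ENABLE_BF16) {
        return ExecStatus::NotImplemented;
      }
      // ... prototype BF16 execution logic would go here ...
      return ExecStatus::Executed;
  }
  return ExecStatus::NotImplemented;
}
```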

This branch is based on sme2-support (PR #429), so it should be merged after that branch has been merged into dev.

Some SME2 instructions that use multi-vector operands can be non-trivial to read or understand. Please ask for clarification and suggest any additional comments that may help future readers.

ABenC377

Only a few comments

CMakeLists.txt

src/include/simeng/arch/aarch64/Instruction.hh

src/include/simeng/arch/aarch64/helpers/sve.hh

src/include/simeng/arch/aarch64/helpers/neon.hh

src/include/simeng/arch/aarch64/operandContainer.hh

jj16791 · 2024-12-14T11:24:47Z

src/lib/arch/aarch64/Instruction_decode.cc

@@ -548,7 +549,7 @@ void Instruction::decode() {
      } else if (metadata_.operands[0].is_vreg) {
        setInstructionType(InsnType::isVectorData);
      } else if ((metadata_.operands[0].reg >= AARCH64_REG_ZAB0 &&
-                  metadata_.operands[0].reg <= AARCH64_REG_ZT0) ||
+                  metadata_.operands[0].reg < AARCH64_REG_ZT0) ||


Can ZT0 be used in an SVE context?

ZT0 is enabled and disabled in the same way as ZA but has a fixed width of 512 bits. As with all other SME instructions, the logic for detecting whether a ZT0-related instruction can or can't be executed lives in instruction_execute.

Regarding where ZT0-based instructions are executed in a core implementation, there is no fixed rule in the spec as far as I can tell. Given its fixed width, it seems more SVE-like than SME to me, hence the grouping here. And since we don't have co-processor SME support, there's no offload or separate-chip logic to come into play yet.

I am not sure I follow the answer given here. I was just wondering whether ZT0 is used in SVE instructions, as it seems to be used solely in SME instructions when looking at the spec. If this is the case, would we not want to identify it as an SME instruction? I may have missed something, though.

Having gone through the spec in more detail and looked at how ZT0 is used, you're right: it should be SME, not SVE. I think I overthought this previously.
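With ZT0 treated as SME, the original inclusive bound (<=) in the decode check presumably needs to stay, since the < in the diff drops ZT0 out of the SME branch. A minimal sketch of the difference, using a hypothetical register numbering (not Capstone's real enum) and assuming, as the original check implies, that the ZA-related registers form one contiguous block ending with ZT0:

```cpp
#include <cstdint>

// Hypothetical register numbering, for illustration only.
enum Reg : uint16_t {
  REG_Z31 = 41,            // an SVE vector register
  REG_ZAB0 = 100,          // first ZA tile register
  REG_ZA_TILE_LAST = 115,  // last ZA tile register
  REG_ZT0 = 116,           // SME2 lookup-table register
};

// Original bound (<=): ZT0 is grouped with the ZA registers, i.e. SME.
constexpr bool isSmeOperandInclusive(uint16_t reg) {
  return reg >= REG_ZAB0 && reg <= REG_ZT0;
}

// Bound in the diff (<): ZT0 falls outside the SME group.
constexpr bool isSmeOperandExclusive(uint16_t reg) {
  return reg >= REG_ZAB0 && reg < REG_ZT0;
}
```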

src/lib/arch/aarch64/Instruction_execute.cc

test/regression/aarch64/AArch64RegressionTest.hh

test/regression/aarch64/Exception.cc

src/lib/arch/aarch64/Instruction_execute.cc

src/include/simeng/arch/aarch64/Instruction.hh

CMakeLists.txt

The base branch was changed.


jj16791 · 2025-01-23T18:58:02Z

src/include/simeng/arch/aarch64/helpers/sve.hh

+  // Predicate as counter is 16-bits and has the following encoding:
+  //    - Up to first 4 bits encode the element size (0b1, 0b10, 0b100, 0b1000
+  //    for b h s d respectively)
+  //            - bits 0->LSZ


This is terminology from the spec on how predicate-as-counter works. It is LSZ because the number of bits used is dynamic. I'll try to find a webpage with this information to explain it better in the comment.
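The fields can be illustrated with a small decoder, based on our reading of the predicate-as-counter description in the Arm architecture reference (the CounterToPredicate pseudocode): the lowest set bit within bits [3:0] gives LSZ and therefore the element size, the count sits in the bits above it, and bit 15 is the invert bit. decodePredCounter is a hypothetical helper, not SimEng's code, and it ignores the VL-dependent upper limit on which count bits are meaningful.

```cpp
#include <cstdint>

// Fields of a 16-bit predicate-as-counter value.
struct PredCounter {
  unsigned elemBytes = 0;  // 1, 2, 4 or 8; 0 if bits [3:0] are all zero
  unsigned count = 0;      // value of the count field, bits [14:LSZ+1]
  bool invert = false;     // bit 15
};

PredCounter decodePredCounter(uint16_t pn) {
  PredCounter out;
  out.invert = ((pn >> 15) & 1u) != 0;
  // LSZ is the index of the lowest set bit within bits [3:0]; it selects
  // the element size (bit 0 = b, bit 1 = h, bit 2 = s, bit 3 = d).
  for (unsigned lsz = 0; lsz < 4; ++lsz) {
    if ((pn & (1u << lsz)) != 0) {
      out.elemBytes = 1u << lsz;
      // Everything above LSZ, excluding bit 15, is the count field.
      out.count = (pn & 0x7FFFu) >> (lsz + 1);
      break;
    }
  }
  return out;
}
```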

jj16791 · 2025-01-23T19:00:50Z

src/include/simeng/arch/aarch64/helpers/sve.hh

+  //    - Bit 15 represents the invert bit
+  std::array<uint64_t, 4> out = {0, 0, 0, 0};
+
+  // Set invert bit to 1 and count to 0


Add some context for this choice, i.e. relate it back to ptrue. I assume it's because you want to denote all-true as 0 inactive elements?

Again, this is how the spec defines it; I will try to find a source that explains it better.
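To relate this to ptrue: in our reading of the spec's EncodePredCount pseudocode, an all-true predicate (count equal to the number of elements) is re-encoded as count 0 with the invert bit set, and whenever the invert bit is set the count field holds the number of inactive elements. That matches the idea of denoting true as zero inactive elements. A hypothetical sketch (encodePredCounter and activeElements are illustrative names, not SimEng functions; only non-inverted inputs are handled):

```cpp
#include <cstdint>

// Returns LSZ (0..3) for an element size of 1, 2, 4 or 8 bytes.
unsigned lszForElemBytes(unsigned elemBytes) {
  unsigned lsz = 0;
  while (lsz < 3 && (1u << lsz) != elemBytes) ++lsz;
  return lsz;
}

// Encode `active` leading active elements out of `elements` elements.
uint16_t encodePredCounter(unsigned elemBytes, unsigned elements,
                           unsigned active) {
  if (active == 0) return 0;  // all-false is the all-zero encoding
  const unsigned lsz = lszForElemBytes(elemBytes);
  unsigned count = active;
  unsigned invert = 0;
  if (count == elements) {
    // All elements active (e.g. PTRUE): store count 0 and set invert bit.
    count = 0;
    invert = 1;
  }
  return static_cast<uint16_t>((invert << 15) | (count << (lsz + 1)) |
                               (1u << lsz));
}

// Number of active elements described by a predicate-as-counter value.
unsigned activeElements(uint16_t pn, unsigned elements) {
  unsigned lsz = 0;
  while (lsz < 4 && (pn & (1u << lsz)) == 0) ++lsz;
  if (lsz == 4) return 0;  // size field all zero: no active elements
  const unsigned count = (pn & 0x7FFFu) >> (lsz + 1);
  const bool invert = ((pn >> 15) & 1u) != 0;
  // With invert set, the count field holds the number of inactive elements.
  return invert ? elements - count : count;
}
```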

jj16791 · 2025-01-23T19:26:00Z

src/include/simeng/arch/aarch64/helpers/sve.hh

+ * W represents how many source elements are multiplied to form an output
+ * element (i.e. for 4-way, W = 4).
+ * Returns correctly formatted RegisterValue. */
+template <typename D, typename N, int W>


Isn't N always uint8_t and W always 4? I may have missed something in the docs.

Will double-check and change it if so.

SVE UDOT has 4 possible encodings. See:

https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/UDOT--4-way--vectors---Unsigned-integer-dot-product-

https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/UDOT--2-way--vectors---Unsigned-integer-dot-product-
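Those encodings are why both N and W need to be template parameters: per the linked pages, the 4-way form covers 8-bit to 32-bit and 16-bit to 64-bit, while the 2-way form covers 16-bit to 32-bit. A minimal sketch of the generic W-way accumulation, using std::vector in place of SimEng's RegisterValue and a hypothetical dotProductWWay name (this is not the helper in sve.hh):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// out[i] = acc[i] + sum over j < W of zn[i*W + j] * zm[i*W + j].
// D is the (wider) destination element type, N the source element type.
// Assumes zn and zm each hold at least acc.size() * W elements.
template <typename D, typename N, int W>
std::vector<D> dotProductWWay(const std::vector<D>& acc,
                              const std::vector<N>& zn,
                              const std::vector<N>& zm) {
  std::vector<D> out(acc);
  for (std::size_t i = 0; i < out.size(); ++i) {
    for (int j = 0; j < W; ++j) {
      const std::size_t k = i * W + static_cast<std::size_t>(j);
      // Widen before multiplying; unsigned arithmetic wraps modulo 2^bits(D).
      out[i] += static_cast<D>(zn[k]) * static_cast<D>(zm[k]);
    }
  }
  return out;
}
```

For example, dotProductWWay<uint32_t, uint8_t, 4> models the 8-bit to 32-bit 4-way form, and dotProductWWay<uint32_t, uint16_t, 2> the 2-way form.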

jj16791 · 2025-01-24T12:50:15Z

src/lib/arch/aarch64/Instruction_decode.cc

@@ -548,7 +549,7 @@ void Instruction::decode() {
      } else if (metadata_.operands[0].is_vreg) {
        setInstructionType(InsnType::isVectorData);
      } else if ((metadata_.operands[0].reg >= AARCH64_REG_ZAB0 &&
-                  metadata_.operands[0].reg <= AARCH64_REG_ZT0) ||
+                  metadata_.operands[0].reg < AARCH64_REG_ZT0) ||


I am not sure I follow the answer given here. I was just wondering whether ZT0 is used in SVE instructions, as it seems to be used solely in SME instructions when looking at the spec. If this is the case, would we not want to identify it as an SME instruction? I may have missed something, though.

jj16791 · 2025-01-24T12:55:39Z

test/regression/aarch64/instructions/sme.cc

@@ -7,8 +7,52 @@ namespace {

 using InstSme = AArch64RegressionTest;

-#if SIMENG_LLVM_VERSION >= 14


Are we not able to keep this check in? I assumed it distinguished between no SME support and SME support, rather than between SME and SME2 support. I may have misremembered the LLVM versioning, though.

Yes, you're correct. I will re-add it.

FinnWilkinson added the enhancement New feature or request label Nov 4, 2024

FinnWilkinson requested review from dANW34V3R, jj16791, JosephMoore25 and ABenC377 November 4, 2024 18:11

FinnWilkinson self-assigned this Nov 4, 2024

FinnWilkinson changed the base branch from dev to sme2-support November 4, 2024 18:12

FinnWilkinson force-pushed the sme2-support branch from ec02455 to e7d34e1 Compare November 6, 2024 16:45

FinnWilkinson force-pushed the sme-loops-support branch 2 times, most recently from f9a759f to f2b86fa Compare November 7, 2024 19:58

FinnWilkinson force-pushed the sme-loops-support branch from f2b86fa to 796b99e Compare November 14, 2024 10:23

ABenC377 reviewed Dec 6, 2024


CMakeLists.txt (resolved)

src/include/simeng/arch/aarch64/Instruction.hh (outdated, resolved)

src/include/simeng/arch/aarch64/helpers/sve.hh (outdated, resolved)

FinnWilkinson force-pushed the sme-loops-support branch from 796b99e to 5ff6446 Compare December 13, 2024 16:00

jj16791 requested changes Dec 14, 2024


ABenC377 approved these changes Dec 17, 2024



FinnWilkinson force-pushed the sme2-support branch from bc91dcd to fc308db Compare December 17, 2024 17:47

FinnWilkinson force-pushed the sme-loops-support branch 2 times, most recently from 393dd26 to b027f73 Compare December 18, 2024 15:07

FinnWilkinson changed the base branch from sme2-support to dev December 20, 2024 10:01

FinnWilkinson added 9 commits December 20, 2024 10:05

Fixed execution logic for UMINP and UMAXP neon instructions.

51ade58

Implemented ldrsb (32-bit, Post) instruction with test.

6a11d7d

Fixed implementation of NEON CMHS instruction.

520324c

Implemented UCVTF (fixed-point to float) instruction with test.

2b4a886

Implemented UCVTF (fixed-point to float) helper function.

e43ada7

Implemented UDOT (by element) NEON instructions with tests.

4773af8

Implemented LD1 (NEON 8h x2, post index) instruction with tests.

50a8a20

Implemented NEON UMLAL (32 to 64 bit) instruction with tests.

6696d5f

Implemented NEON UMLAL2 (32 to 64 bit) instruction with tests.

bb5096a

FinnWilkinson added 22 commits December 20, 2024 10:05

Implemented FADD (float, vgx2) SME instruction with tests.

b988e01

Implemented LD1D (4 vec, scalar offset) SVE2 instruction with tests.

4f75ffe

Implemented FMLA (double, VGx4) SME instruction with tests.

f35472b

Implemented FADD (double, vgx2) SME instruction with tests.

1bf3306

Implemented LD1H (Single vec, imm offset) SVE instruction with tests.

4effde4

Added SVE bf16 DOT (indexed) instruction execution logic.

40bba12

Implemented LD1H (two vec, imm and scalar offset) SVE instruction with tests.

3932360

Implemented BFMOPA (widening) SME instruction.

5aad523

Minor UMAXP fix.

430c775

Fixed function comment.

a01c2fc

Updated BF16 comment.

9790c6e

Implemented NEON UDOT (by vector) instruction with tests.

5bc9330

Implemented SVE UDOT (by vector, 4-way) instruction with tests.

1fd130c

Implemented SVE ST4W (scalar offset) instruction with tests, and changed address generation logic for ST2W and ST4W.

81ddba7

Implemented LD1B (4 vec, scalar offset) SVE2 instruction with tests.

4c99a0f

Implemented UDOT (4-way, VGx4 8-bit to 32-bit widening) SME instruction with tests.

0d74234

Implemented ADD (uint32, vgx2, vectors and ZA), SME instruction with tests.

40a0fa4

Implemented ZIP (4 vectors) SVE2 instruction with tests.

950de41

Attended PR comments.

03a95e7

Minor bug fixes.

6729363

Attended PR comments.

850b741

Updated multi-vector load logic.

1d04096

FinnWilkinson force-pushed the sme-loops-support branch from b027f73 to 1d04096 Compare December 20, 2024 10:08

FinnWilkinson added 2 commits December 20, 2024 11:06

CI CD fixes.

246d39a

CI CD fixes pt2.

0ec0b8d

ABenC377 previously approved these changes Dec 20, 2024


jj16791 reviewed Jan 24, 2025


FinnWilkinson mentioned this pull request Jan 24, 2025

Full SME(1) instruction support and STREAMING Groups #415

Merged


Attended PR comments.

6110bce

FinnWilkinson dismissed ABenC377’s stale review via 6110bce February 6, 2025 11:20


FinnWilkinson commented Nov 4, 2024

ABenC377 left a comment

jj16791 Dec 14, 2024

FinnWilkinson Dec 16, 2024

jj16791 Jan 24, 2025

FinnWilkinson Feb 6, 2025

jj16791 Jan 23, 2025

FinnWilkinson Feb 5, 2025

jj16791 Jan 23, 2025

FinnWilkinson Feb 5, 2025

jj16791 Jan 23, 2025

FinnWilkinson Feb 6, 2025

FinnWilkinson Feb 6, 2025

jj16791 Jan 24, 2025

jj16791 Jan 24, 2025

FinnWilkinson Feb 6, 2025
