
Commit b740eb6

mkannwischer authored and hanno-becker committed
SLOTHY: Superoptimize ntt.S for the Neoverse N1
This commit adds the optimized backend dev/aarch64_opt. For now, this backend differs from the clean backend only in the NTT, which is superoptimized using SLOTHY for the Neoverse N1; all other files are simple copies of the clean backend. A Makefile is added that performs the optimization, and CI is adjusted to test both the clean and the opt backend.

The first loop of the NTT can be optimized in one go. The second loop is too large, and we hence use the split heuristic.

I have experimented with the Cortex-A55 model as well. That results in significantly faster code on the A55, but a noticeable slowdown elsewhere, especially on the A72 (see performance results in the pull request). A72 performance seems more important than A55 performance.

I have also experimented with applying some other optimizations from the SLOTHY paper:

- Using st4 instead of the manual transposition
- Using scalar loads instead of vector loads

While those result in much better performance on the Cortex-A55, they slow down the code on other platforms (see the pull request for details).

The autogen script is extended to allow running the optimization through the --slothy flag.

Signed-off-by: Matthias J. Kannwischer <[email protected]>
1 parent dbebfc6 · commit b740eb6
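The commit message mentions that the autogen script now accepts a `--slothy` flag to run the optimization. A minimal sketch of that invocation, assuming the script lives at `scripts/autogen` (the exact path is an assumption, not confirmed by the commit):

```sh
# Sketch: run the SLOTHY optimization via the extended autogen script.
# The --slothy flag is named in the commit message; the script path
# scripts/autogen is an assumption.
./scripts/autogen --slothy
```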


47 files changed (+6453, -271 lines)

.github/workflows/base.yml

Lines changed: 2 additions & 3 deletions
```diff
@@ -221,9 +221,8 @@ jobs:
         backend:
           - arg: '--aarch64-clean'
             name: Clean
-          # TODO: add backend option after we have optimized/clean seperation
-          # - arg: ''
-          #   name: Optimized
+          - arg: ''
+            name: Optimized
         simplify:
           - arg: ''
             name: Simplified
```

BIBLIOGRAPHY.md

Lines changed: 6 additions & 0 deletions
```diff
@@ -158,6 +158,9 @@ source code and documentation.
 * Referenced from:
   - [dev/aarch64_clean/src/intt.S](dev/aarch64_clean/src/intt.S)
   - [dev/aarch64_clean/src/ntt.S](dev/aarch64_clean/src/ntt.S)
+  - [dev/aarch64_opt/README.md](dev/aarch64_opt/README.md)
+  - [dev/aarch64_opt/src/intt.S](dev/aarch64_opt/src/intt.S)
+  - [dev/aarch64_opt/src/ntt.S](dev/aarch64_opt/src/ntt.S)
   - [mldsa/src/native/aarch64/src/intt.S](mldsa/src/native/aarch64/src/intt.S)
   - [mldsa/src/native/aarch64/src/ntt.S](mldsa/src/native/aarch64/src/ntt.S)

@@ -266,6 +269,9 @@ source code and documentation.
 * Referenced from:
   - [dev/aarch64_clean/src/intt.S](dev/aarch64_clean/src/intt.S)
   - [dev/aarch64_clean/src/ntt.S](dev/aarch64_clean/src/ntt.S)
+  - [dev/aarch64_opt/README.md](dev/aarch64_opt/README.md)
+  - [dev/aarch64_opt/src/intt.S](dev/aarch64_opt/src/intt.S)
+  - [dev/aarch64_opt/src/ntt.S](dev/aarch64_opt/src/ntt.S)
   - [mldsa/src/native/aarch64/src/intt.S](mldsa/src/native/aarch64/src/intt.S)
   - [mldsa/src/native/aarch64/src/ntt.S](mldsa/src/native/aarch64/src/ntt.S)
```

dev/aarch64_opt/README.md

Lines changed: 15 additions & 0 deletions
New file:

```markdown
[//]: # (SPDX-License-Identifier: CC-BY-4.0)

# AArch64 backend (little endian)

This directory contains a native backend for little endian AArch64 systems. It is derived from [^NeonNTT] [^SLOTHY_Paper].

## Variants

This backend comes in two versions: "clean" and optimized. The "clean" backend is handwritten and meant to be easy to read and modify; for example, it heavily leverages register aliases and assembly macros. This directory contains the optimized version, which is automatically generated from the clean one via [SLOTHY](https://github.com/slothy-optimizer/slothy). Currently, the target architecture is Neoverse N1, but you can easily re-optimize the code for a different microarchitecture supported by SLOTHY by adjusting the parameters in the [Makefile](src/Makefile).

Performance on in-order CPUs such as the Arm Cortex-A55 can be significantly improved by re-optimizing for the specific CPU, which may, however, degrade performance on other CPUs.

<!--- bibliography --->
[^NeonNTT]: Becker, Hwang, Kannwischer, Yang, Yang: Neon NTT: Faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple M1, [https://eprint.iacr.org/2021/986](https://eprint.iacr.org/2021/986)
[^SLOTHY_Paper]: Abdulrahman, Becker, Kannwischer, Klein: Fast and Clean: Auditable high-performance assembly via constraint solving, [https://eprint.iacr.org/2022/1303](https://eprint.iacr.org/2022/1303)
```
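As a concrete illustration of the re-optimization mentioned above, here is a hedged sketch of targeting the Cortex-A55 instead of the Neoverse N1. The variable name `TARGET_MICROARCH` and the value `Arm_Cortex_A55` are taken from the Makefile shown further below; the invocation itself relies only on standard Make behavior:

```sh
# Rebuild the optimized sources for the Cortex-A55 (sketch).
# Overriding TARGET_MICROARCH on the command line takes precedence
# over the assignment in the Makefile.
make -C dev/aarch64_opt/src clean
make -C dev/aarch64_opt/src TARGET_MICROARCH=Arm_Cortex_A55
```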

dev/aarch64_opt/meta.h

Lines changed: 188 additions & 0 deletions
New file:

```c
/*
 * Copyright (c) The mlkem-native project authors
 * Copyright (c) The mldsa-native project authors
 * SPDX-License-Identifier: Apache-2.0 OR ISC OR MIT
 */

#ifndef MLD_NATIVE_AARCH64_META_H
#define MLD_NATIVE_AARCH64_META_H

/* Set of primitives that this backend replaces */
#define MLD_USE_NATIVE_NTT
#define MLD_USE_NATIVE_INTT
#define MLD_USE_NATIVE_REJ_UNIFORM
#define MLD_USE_NATIVE_REJ_UNIFORM_ETA2
#define MLD_USE_NATIVE_REJ_UNIFORM_ETA4
#define MLD_USE_NATIVE_POLY_DECOMPOSE_32
#define MLD_USE_NATIVE_POLY_DECOMPOSE_88
#define MLD_USE_NATIVE_POLY_CADDQ
#define MLD_USE_NATIVE_POLY_USE_HINT_32
#define MLD_USE_NATIVE_POLY_USE_HINT_88
#define MLD_USE_NATIVE_POLY_CHKNORM
#define MLD_USE_NATIVE_POLYZ_UNPACK_17
#define MLD_USE_NATIVE_POLYZ_UNPACK_19
#define MLD_USE_NATIVE_POINTWISE_MONTGOMERY
#define MLD_USE_NATIVE_POLYVECL_POINTWISE_ACC_MONTGOMERY_L4
#define MLD_USE_NATIVE_POLYVECL_POINTWISE_ACC_MONTGOMERY_L5
#define MLD_USE_NATIVE_POLYVECL_POINTWISE_ACC_MONTGOMERY_L7

/* Identifier for this backend so that source and assembly files
 * in the build can be appropriately guarded. */
#define MLD_ARITH_BACKEND_AARCH64


#if !defined(__ASSEMBLER__)
#include "src/arith_native_aarch64.h"

static MLD_INLINE void mld_ntt_native(int32_t data[MLDSA_N])
{
  mld_ntt_asm(data, mld_aarch64_ntt_zetas_layer123456,
              mld_aarch64_ntt_zetas_layer78);
}

static MLD_INLINE void mld_intt_native(int32_t data[MLDSA_N])
{
  mld_intt_asm(data, mld_aarch64_intt_zetas_layer78,
               mld_aarch64_intt_zetas_layer123456);
}

static MLD_INLINE int mld_rej_uniform_native(int32_t *r, unsigned len,
                                             const uint8_t *buf,
                                             unsigned buflen)
{
  if (len != MLDSA_N || buflen % 24 != 0)
  {
    return -1;
  }

  /* Safety: outlen is at most MLDSA_N, hence, this cast is safe. */
  return (int)mld_rej_uniform_asm(r, buf, buflen, mld_rej_uniform_table);
}

static MLD_INLINE int mld_rej_uniform_eta2_native(int32_t *r, unsigned len,
                                                  const uint8_t *buf,
                                                  unsigned buflen)
{
  unsigned int outlen;
  /* AArch64 implementation assumes specific buffer lengths */
  if (len != MLDSA_N || buflen != MLD_AARCH64_REJ_UNIFORM_ETA2_BUFLEN)
  {
    return -1;
  }
  /* Constant time: Inputs and outputs to this function are secret.
   * It is safe to leak which coefficients are accepted/rejected.
   * The assembly implementation must not leak any other information about the
   * accepted coefficients. Constant-time testing cannot cover this, and we
   * hence have to manually verify the assembly.
   * We declassify the input data prior to the call and mark the outputs as
   * secret. */
  MLD_CT_TESTING_DECLASSIFY(buf, buflen);
  outlen = mld_rej_uniform_eta2_asm(r, buf, buflen, mld_rej_uniform_eta_table);
  MLD_CT_TESTING_SECRET(r, sizeof(int32_t) * outlen);
  /* Safety: outlen is at most MLDSA_N and, hence, this cast is safe. */
  return (int)outlen;
}

static MLD_INLINE int mld_rej_uniform_eta4_native(int32_t *r, unsigned len,
                                                  const uint8_t *buf,
                                                  unsigned buflen)
{
  unsigned int outlen;
  /* AArch64 implementation assumes specific buffer lengths */
  if (len != MLDSA_N || buflen != MLD_AARCH64_REJ_UNIFORM_ETA4_BUFLEN)
  {
    return -1;
  }
  /* Constant time: Inputs and outputs to this function are secret.
   * It is safe to leak which coefficients are accepted/rejected.
   * The assembly implementation must not leak any other information about the
   * accepted coefficients. Constant-time testing cannot cover this, and we
   * hence have to manually verify the assembly.
   * We declassify the input data prior to the call and mark the outputs as
   * secret. */
  MLD_CT_TESTING_DECLASSIFY(buf, buflen);
  outlen = mld_rej_uniform_eta4_asm(r, buf, buflen, mld_rej_uniform_eta_table);
  MLD_CT_TESTING_SECRET(r, sizeof(int32_t) * outlen);
  /* Safety: outlen is at most MLDSA_N and, hence, this cast is safe. */
  return (int)outlen;
}

static MLD_INLINE void mld_poly_decompose_32_native(int32_t *a1, int32_t *a0,
                                                    const int32_t *a)
{
  mld_poly_decompose_32_asm(a1, a0, a);
}

static MLD_INLINE void mld_poly_decompose_88_native(int32_t *a1, int32_t *a0,
                                                    const int32_t *a)
{
  mld_poly_decompose_88_asm(a1, a0, a);
}

static MLD_INLINE void mld_poly_caddq_native(int32_t a[MLDSA_N])
{
  mld_poly_caddq_asm(a);
}

static MLD_INLINE void mld_poly_use_hint_32_native(int32_t *b, const int32_t *a,
                                                   const int32_t *h)
{
  mld_poly_use_hint_32_asm(b, a, h);
}

static MLD_INLINE void mld_poly_use_hint_88_native(int32_t *b, const int32_t *a,
                                                   const int32_t *h)
{
  mld_poly_use_hint_88_asm(b, a, h);
}

static MLD_INLINE int mld_poly_chknorm_native(const int32_t *a, int32_t B)
{
  return mld_poly_chknorm_asm(a, B);
}

static MLD_INLINE void mld_polyz_unpack_17_native(int32_t *r,
                                                  const uint8_t *buf)
{
  mld_polyz_unpack_17_asm(r, buf, mld_polyz_unpack_17_indices);
}

static MLD_INLINE void mld_polyz_unpack_19_native(int32_t *r,
                                                  const uint8_t *buf)
{
  mld_polyz_unpack_19_asm(r, buf, mld_polyz_unpack_19_indices);
}

static MLD_INLINE void mld_poly_pointwise_montgomery_native(
    int32_t out[MLDSA_N], const int32_t in0[MLDSA_N],
    const int32_t in1[MLDSA_N])
{
  mld_poly_pointwise_montgomery_asm(out, in0, in1);
}

static MLD_INLINE void mld_polyvecl_pointwise_acc_montgomery_l4_native(
    int32_t w[MLDSA_N], const int32_t u[4][MLDSA_N],
    const int32_t v[4][MLDSA_N])
{
  mld_polyvecl_pointwise_acc_montgomery_l4_asm(w, (const int32_t *)u,
                                               (const int32_t *)v);
}

static MLD_INLINE void mld_polyvecl_pointwise_acc_montgomery_l5_native(
    int32_t w[MLDSA_N], const int32_t u[5][MLDSA_N],
    const int32_t v[5][MLDSA_N])
{
  mld_polyvecl_pointwise_acc_montgomery_l5_asm(w, (const int32_t *)u,
                                               (const int32_t *)v);
}

static MLD_INLINE void mld_polyvecl_pointwise_acc_montgomery_l7_native(
    int32_t w[MLDSA_N], const int32_t u[7][MLDSA_N],
    const int32_t v[7][MLDSA_N])
{
  mld_polyvecl_pointwise_acc_montgomery_l7_asm(w, (const int32_t *)u,
                                               (const int32_t *)v);
}

#endif /* !__ASSEMBLER__ */
#endif /* !MLD_NATIVE_AARCH64_META_H */
```

dev/aarch64_opt/src/Makefile

Lines changed: 122 additions & 0 deletions
New file:

```make
# Copyright (c) The mldsa-native project authors
# SPDX-License-Identifier: Apache-2.0 OR ISC OR MIT

######
# To run, see the README.md file
######
.PHONY: all clean

# ISA to optimize for
TARGET_ISA=Arm_AArch64

# MicroArch target to optimize for
# Changing this to Arm_Cortex_A55 results in significantly better performance
# on the Cortex-A55, but may result in worse performance on other CPUs.
TARGET_MICROARCH=Arm_Neoverse_N1_experimental

SLOTHY_EXTRA_FLAGS ?=

SLOTHY_FLAGS=-c sw_pipelining.enabled=true \
	-c inputs_are_outputs \
	-c sw_pipelining.minimize_overlapping=False \
	-c sw_pipelining.allow_post \
	-c variable_size \
	-c constraints.stalls_first_attempt=64 \
	$(SLOTHY_EXTRA_FLAGS)

SLOTHY_FLAGS_SPLIT=-c inputs_are_outputs \
	-c variable_size \
	-c constraints.stalls_first_attempt=64 \
	-c split_heuristic=true \
	-c split_heuristic_factor=1.5 \
	-c split_heuristic_repeat=2 \
	-c sw_pipelining.enabled=true \
	-c sw_pipelining.halving_heuristic=True \
	$(SLOTHY_EXTRA_FLAGS)

# For kernels which stash callee-saved v8-v15 but don't stash callee-saved GPRs x19-x30.
# Allow SLOTHY to use all V-registers, but only caller-saved GPRs.
RESERVE_X_ONLY_FLAG=-c reserved_regs="[x18--x30,sp]"

# Used for kernels which don't stash callee-saved registers.
# Restrict SLOTHY to caller-saved registers.
RESERVE_ALL_FLAG=-c reserved_regs="[x18--x30,sp,v8--v15]"

all: ntt.S \
	intt.S \
	mld_polyvecl_pointwise_acc_montgomery_l4.S \
	mld_polyvecl_pointwise_acc_montgomery_l5.S \
	mld_polyvecl_pointwise_acc_montgomery_l7.S \
	pointwise_montgomery.S \
	poly_caddq_asm.S \
	poly_chknorm_asm.S \
	poly_decompose_32_asm.S \
	poly_decompose_88_asm.S \
	poly_use_hint_32_asm.S \
	poly_use_hint_88_asm.S \
	polyz_unpack_17_asm.S \
	polyz_unpack_19_asm.S \
	rej_uniform_asm.S \
	rej_uniform_eta2_asm.S \
	rej_uniform_eta4_asm.S

# These units explicitly save and restore registers v8-v15, so SLOTHY can freely use
# those registers.
ntt.S: ../../aarch64_clean/src/ntt.S
	# optimize first loop in one go and write to temp file
	$(eval TMPFILE := $(shell mktemp))
	slothy-cli $(TARGET_ISA) $(TARGET_MICROARCH) $< -o $(TMPFILE) -l ntt_layer123_start $(SLOTHY_FLAGS) $(RESERVE_X_ONLY_FLAG)
	# optimize second loop using split heuristic
	slothy-cli $(TARGET_ISA) $(TARGET_MICROARCH) $(TMPFILE) -o $@ -l ntt_layer45678_start $(SLOTHY_FLAGS_SPLIT) $(RESERVE_X_ONLY_FLAG)

# Copy remaining files without optimization for now
intt.S: ../../aarch64_clean/src/intt.S
	cp $< $@

mld_polyvecl_pointwise_acc_montgomery_l4.S: ../../aarch64_clean/src/mld_polyvecl_pointwise_acc_montgomery_l4.S
	cp $< $@

mld_polyvecl_pointwise_acc_montgomery_l5.S: ../../aarch64_clean/src/mld_polyvecl_pointwise_acc_montgomery_l5.S
	cp $< $@

mld_polyvecl_pointwise_acc_montgomery_l7.S: ../../aarch64_clean/src/mld_polyvecl_pointwise_acc_montgomery_l7.S
	cp $< $@

pointwise_montgomery.S: ../../aarch64_clean/src/pointwise_montgomery.S
	cp $< $@

poly_caddq_asm.S: ../../aarch64_clean/src/poly_caddq_asm.S
	cp $< $@

poly_chknorm_asm.S: ../../aarch64_clean/src/poly_chknorm_asm.S
	cp $< $@

poly_decompose_32_asm.S: ../../aarch64_clean/src/poly_decompose_32_asm.S
	cp $< $@

poly_decompose_88_asm.S: ../../aarch64_clean/src/poly_decompose_88_asm.S
	cp $< $@

poly_use_hint_32_asm.S: ../../aarch64_clean/src/poly_use_hint_32_asm.S
	cp $< $@

poly_use_hint_88_asm.S: ../../aarch64_clean/src/poly_use_hint_88_asm.S
	cp $< $@

polyz_unpack_17_asm.S: ../../aarch64_clean/src/polyz_unpack_17_asm.S
	cp $< $@

polyz_unpack_19_asm.S: ../../aarch64_clean/src/polyz_unpack_19_asm.S
	cp $< $@

rej_uniform_asm.S: ../../aarch64_clean/src/rej_uniform_asm.S
	cp $< $@

rej_uniform_eta2_asm.S: ../../aarch64_clean/src/rej_uniform_eta2_asm.S
	cp $< $@

rej_uniform_eta4_asm.S: ../../aarch64_clean/src/rej_uniform_eta4_asm.S
	cp $< $@

clean:
	-$(RM) -rf *.S
```
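Since `$(SLOTHY_EXTRA_FLAGS)` is appended to both flag sets, extra SLOTHY options can be injected without editing the Makefile. A hedged sketch follows; the option shown already appears in the Makefile with value 64, and it is an assumption that a repeated `-c` setting passed later overrides the earlier one:

```sh
# Force re-optimization of the NTT only, with a different stall budget.
# make -B forces the rebuild; the override semantics of a repeated -c
# option are an assumption about slothy-cli, not documented here.
make -B ntt.S SLOTHY_EXTRA_FLAGS="-c constraints.stalls_first_attempt=128"
```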

dev/aarch64_opt/src/README.md

Lines changed: 15 additions & 0 deletions
New file:

```markdown
[//]: # (SPDX-License-Identifier: CC-BY-4.0)

# mldsa-native AArch64 backend SLOTHY-optimized code

This directory contains the AArch64 backend after it has been optimized by [SLOTHY](https://github.com/slothy-optimizer/slothy/).

## Re-running SLOTHY

If the "clean" sources [`../../aarch64_clean/src/*.S`](../../aarch64_clean/src/) change, take the following steps to re-optimize and install them into the main source tree:

1. Run `make` to re-generate the optimized sources using SLOTHY. This assumes a working SLOTHY setup, as established e.g. by the default nix shell for mldsa-native. See also the [SLOTHY README](https://github.com/slothy-optimizer/slothy/).

2. Run `autogen` to transfer the newly optimized files into the main source tree [mldsa/src/native](../../../mldsa/src/native).

3. Run `./scripts/tests all --opt=OPT` to check that the new assembly is still functional.
```
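Putting the three steps together, here is a hedged end-to-end sketch, run from the repository root. The `scripts/autogen` path and the `OPT` placeholder follow the steps above; both are assumptions beyond what this README states:

```sh
# 1. Re-generate the optimized sources with SLOTHY (requires a working
#    slothy-cli, e.g. from the default nix shell for mldsa-native).
make -C dev/aarch64_opt/src

# 2. Transfer the newly optimized files into mldsa/src/native
#    (script location assumed).
./scripts/autogen

# 3. Check that the new assembly is still functional; substitute OPT
#    with the desired optimization setting.
./scripts/tests all --opt=OPT
```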
