SLOTHY: Superoptimize AArch64 NTT #715
3db1e37 to 7d2fff7
Mac Mini (M1, 2020) benchmarks (opt)
| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 46404 cycles | 46493 cycles | 1.00 |
| ML-DSA-44 sign | 132024 cycles | 132762 cycles | 0.99 |
| ML-DSA-44 verify | 47650 cycles | 47845 cycles | 1.00 |
| ML-DSA-65 keypair | 81342 cycles | 81453 cycles | 1.00 |
| ML-DSA-65 sign | 218211 cycles | 219166 cycles | 1.00 |
| ML-DSA-65 verify | 79868 cycles | 80110 cycles | 1.00 |
| ML-DSA-87 keypair | 132455 cycles | 132613 cycles | 1.00 |
| ML-DSA-87 sign | 279824 cycles | 281096 cycles | 1.00 |
| ML-DSA-87 verify | 130061 cycles | 130374 cycles | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
Mac Mini (M1, 2020) benchmarks (no-opt)
| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 115253 cycles | 115241 cycles | 1.00 |
| ML-DSA-44 sign | 430478 cycles | 430437 cycles | 1.00 |
| ML-DSA-44 verify | 122166 cycles | 122150 cycles | 1.00 |
| ML-DSA-65 keypair | 197170 cycles | 197159 cycles | 1.00 |
| ML-DSA-65 sign | 700211 cycles | 700291 cycles | 1.00 |
| ML-DSA-65 verify | 197615 cycles | 197609 cycles | 1.00 |
| ML-DSA-87 keypair | 325635 cycles | 325599 cycles | 1.00 |
| ML-DSA-87 sign | 884161 cycles | 884117 cycles | 1.00 |
| ML-DSA-87 verify | 328981 cycles | 328935 cycles | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
Intel Xeon 4th gen (c7i)
| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 34977 cycles | 35033 cycles | 1.00 |
| ML-DSA-44 sign | 119597 cycles | 120639 cycles | 0.99 |
| ML-DSA-44 verify | 38066 cycles | 38096 cycles | 1.00 |
| ML-DSA-65 keypair | 62933 cycles | 62103 cycles | 1.01 |
| ML-DSA-65 sign | 201882 cycles | 199840 cycles | 1.01 |
| ML-DSA-65 verify | 62796 cycles | 62386 cycles | 1.01 |
| ML-DSA-87 keypair | 95163 cycles | 94045 cycles | 1.01 |
| ML-DSA-87 sign | 235243 cycles | 230366 cycles | 1.02 |
| ML-DSA-87 verify | 94086 cycles | 93695 cycles | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
Intel Xeon 4th gen (c7i) (no-opt)
| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 95730 cycles | 95923 cycles | 1.00 |
| ML-DSA-44 sign | 348383 cycles | 349606 cycles | 1.00 |
| ML-DSA-44 verify | 101264 cycles | 101539 cycles | 1.00 |
| ML-DSA-65 keypair | 163494 cycles | 164092 cycles | 1.00 |
| ML-DSA-65 sign | 564771 cycles | 565519 cycles | 1.00 |
| ML-DSA-65 verify | 164927 cycles | 165145 cycles | 1.00 |
| ML-DSA-87 keypair | 267446 cycles | 267412 cycles | 1.00 |
| ML-DSA-87 sign | 722795 cycles | 723281 cycles | 1.00 |
| ML-DSA-87 verify | 270813 cycles | 271148 cycles | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
AMD EPYC 3rd gen (c6a)
| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 69443 cycles | 69283 cycles | 1.00 |
| ML-DSA-44 sign | 185039 cycles | 184736 cycles | 1.00 |
| ML-DSA-44 verify | 69118 cycles | 68943 cycles | 1.00 |
| ML-DSA-65 keypair | 119271 cycles | 119333 cycles | 1.00 |
| ML-DSA-65 sign | 295027 cycles | 294861 cycles | 1.00 |
| ML-DSA-65 verify | 114865 cycles | 115240 cycles | 1.00 |
| ML-DSA-87 keypair | 202095 cycles | 201809 cycles | 1.00 |
| ML-DSA-87 sign | 385443 cycles | 386059 cycles | 1.00 |
| ML-DSA-87 verify | 193677 cycles | 193415 cycles | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
Graviton4
| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 69106 cycles | 69757 cycles | 0.99 |
| ML-DSA-44 sign | 208510 cycles | 213820 cycles | 0.98 |
| ML-DSA-44 verify | 70953 cycles | 72626 cycles | 0.98 |
| ML-DSA-65 keypair | 122181 cycles | 122920 cycles | 0.99 |
| ML-DSA-65 sign | 342337 cycles | 350128 cycles | 0.98 |
| ML-DSA-65 verify | 118280 cycles | 120392 cycles | 0.98 |
| ML-DSA-87 keypair | 200083 cycles | 201066 cycles | 1.00 |
| ML-DSA-87 sign | 440000 cycles | 449443 cycles | 0.98 |
| ML-DSA-87 verify | 195108 cycles | 198563 cycles | 0.98 |
This comment was automatically generated by workflow using github-action-benchmark.
Intel Xeon 3rd gen (c6i)
| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 56922 cycles | 58073 cycles | 0.98 |
| ML-DSA-44 sign | 179895 cycles | 179585 cycles | 1.00 |
| ML-DSA-44 verify | 60993 cycles | 60950 cycles | 1.00 |
| ML-DSA-65 keypair | 99702 cycles | 99876 cycles | 1.00 |
| ML-DSA-65 sign | 296395 cycles | 296275 cycles | 1.00 |
| ML-DSA-65 verify | 99953 cycles | 100357 cycles | 1.00 |
| ML-DSA-87 keypair | 154280 cycles | 154306 cycles | 1.00 |
| ML-DSA-87 sign | 352801 cycles | 352518 cycles | 1.00 |
| ML-DSA-87 verify | 152426 cycles | 151736 cycles | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
AMD EPYC 3rd gen (c6a) (no-opt)
| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 135705 cycles | 135922 cycles | 1.00 |
| ML-DSA-44 sign | 540454 cycles | 541395 cycles | 1.00 |
| ML-DSA-44 verify | 148955 cycles | 148646 cycles | 1.00 |
| ML-DSA-65 keypair | 228221 cycles | 228378 cycles | 1.00 |
| ML-DSA-65 sign | 890666 cycles | 888828 cycles | 1.00 |
| ML-DSA-65 verify | 237625 cycles | 237994 cycles | 1.00 |
| ML-DSA-87 keypair | 374149 cycles | 372874 cycles | 1.00 |
| ML-DSA-87 sign | 1107455 cycles | 1106360 cycles | 1.00 |
| ML-DSA-87 verify | 387864 cycles | 387292 cycles | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
AMD EPYC 4th gen (c7a)
| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 42040 cycles | 42636 cycles | 0.99 |
| ML-DSA-44 sign | 130535 cycles | 131511 cycles | 0.99 |
| ML-DSA-44 verify | 44019 cycles | 44987 cycles | 0.98 |
| ML-DSA-65 keypair | 71749 cycles | 72910 cycles | 0.98 |
| ML-DSA-65 sign | 211719 cycles | 213828 cycles | 0.99 |
| ML-DSA-65 verify | 71689 cycles | 73802 cycles | 0.97 |
| ML-DSA-87 keypair | 110650 cycles | 109892 cycles | 1.01 |
| ML-DSA-87 sign | 251980 cycles | 249297 cycles | 1.01 |
| ML-DSA-87 verify | 111284 cycles | 110122 cycles | 1.01 |
This comment was automatically generated by workflow using github-action-benchmark.
Graviton4 (no-opt)
| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 128263 cycles | 128322 cycles | 1.00 |
| ML-DSA-44 sign | 456597 cycles | 456713 cycles | 1.00 |
| ML-DSA-44 verify | 136325 cycles | 136331 cycles | 1.00 |
| ML-DSA-65 keypair | 220618 cycles | 220718 cycles | 1.00 |
| ML-DSA-65 sign | 745989 cycles | 746458 cycles | 1.00 |
| ML-DSA-65 verify | 220650 cycles | 220327 cycles | 1.00 |
| ML-DSA-87 keypair | 364973 cycles | 365321 cycles | 1.00 |
| ML-DSA-87 sign | 944314 cycles | 943476 cycles | 1.00 |
| ML-DSA-87 verify | 368962 cycles | 369250 cycles | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
Graviton2
| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 116074 cycles | 115806 cycles | 1.00 |
| ML-DSA-44 sign | 373987 cycles | 377649 cycles | 0.99 |
| ML-DSA-44 verify | 119596 cycles | 120580 cycles | 0.99 |
| ML-DSA-65 keypair | 199521 cycles | 200343 cycles | 1.00 |
| ML-DSA-65 sign | 615748 cycles | 623612 cycles | 0.99 |
| ML-DSA-65 verify | 196049 cycles | 198405 cycles | 0.99 |
| ML-DSA-87 keypair | 326639 cycles | 327909 cycles | 1.00 |
| ML-DSA-87 sign | 780618 cycles | 792403 cycles | 0.99 |
| ML-DSA-87 verify | 322185 cycles | 325206 cycles | 0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
Graviton3
| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 73273 cycles | 74035 cycles | 0.99 |
| ML-DSA-44 sign | 221496 cycles | 228487 cycles | 0.97 |
| ML-DSA-44 verify | 76006 cycles | 78067 cycles | 0.97 |
| ML-DSA-65 keypair | 129453 cycles | 130734 cycles | 0.99 |
| ML-DSA-65 sign | 368531 cycles | 378739 cycles | 0.97 |
| ML-DSA-65 verify | 126698 cycles | 129237 cycles | 0.98 |
| ML-DSA-87 keypair | 210952 cycles | 212581 cycles | 0.99 |
| ML-DSA-87 sign | 467907 cycles | 479894 cycles | 0.98 |
| ML-DSA-87 verify | 206847 cycles | 209118 cycles | 0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
Intel Xeon 3rd gen (c6i) (no-opt)
| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 158743 cycles | 158555 cycles | 1.00 |
| ML-DSA-44 sign | 564831 cycles | 565027 cycles | 1.00 |
| ML-DSA-44 verify | 170620 cycles | 170312 cycles | 1.00 |
| ML-DSA-65 keypair | 269035 cycles | 271317 cycles | 0.99 |
| ML-DSA-65 sign | 924993 cycles | 931590 cycles | 0.99 |
| ML-DSA-65 verify | 275075 cycles | 276884 cycles | 0.99 |
| ML-DSA-87 keypair | 451854 cycles | 451637 cycles | 1.00 |
| ML-DSA-87 sign | 1183125 cycles | 1183472 cycles | 1.00 |
| ML-DSA-87 verify | 460875 cycles | 460346 cycles | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
AMD EPYC 4th gen (c7a) (no-opt)
| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 120037 cycles | 121562 cycles | 0.99 |
| ML-DSA-44 sign | 454020 cycles | 458684 cycles | 0.99 |
| ML-DSA-44 verify | 130567 cycles | 131322 cycles | 0.99 |
| ML-DSA-65 keypair | 205595 cycles | 205167 cycles | 1.00 |
| ML-DSA-65 sign | 736345 cycles | 738228 cycles | 1.00 |
| ML-DSA-65 verify | 209678 cycles | 211576 cycles | 0.99 |
| ML-DSA-87 keypair | 337811 cycles | 338466 cycles | 1.00 |
| ML-DSA-87 sign | 926359 cycles | 926408 cycles | 1.00 |
| ML-DSA-87 verify | 344492 cycles | 346331 cycles | 0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
Graviton3 (no-opt)
| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 138808 cycles | 138856 cycles | 1.00 |
| ML-DSA-44 sign | 493785 cycles | 493869 cycles | 1.00 |
| ML-DSA-44 verify | 148422 cycles | 148467 cycles | 1.00 |
| ML-DSA-65 keypair | 241822 cycles | 242331 cycles | 1.00 |
| ML-DSA-65 sign | 808313 cycles | 809068 cycles | 1.00 |
| ML-DSA-65 verify | 240751 cycles | 240460 cycles | 1.00 |
| ML-DSA-87 keypair | 396480 cycles | 396817 cycles | 1.00 |
| ML-DSA-87 sign | 1027114 cycles | 1026758 cycles | 1.00 |
| ML-DSA-87 verify | 401934 cycles | 402055 cycles | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
Graviton2 (no-opt)
| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 214257 cycles | 214559 cycles | 1.00 |
| ML-DSA-44 sign | 783606 cycles | 782675 cycles | 1.00 |
| ML-DSA-44 verify | 230521 cycles | 230081 cycles | 1.00 |
| ML-DSA-65 keypair | 382833 cycles | 385317 cycles | 0.99 |
| ML-DSA-65 sign | 1288735 cycles | 1310339 cycles | 0.98 |
| ML-DSA-65 verify | 372307 cycles | 376384 cycles | 0.99 |
| ML-DSA-87 keypair | 605982 cycles | 607198 cycles | 1.00 |
| ML-DSA-87 sign | 1625311 cycles | 1625770 cycles | 1.00 |
| ML-DSA-87 verify | 617432 cycles | 617102 cycles | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
Arm Cortex-A55 (Snapdragon 888) benchmarks (opt)
| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 282343 cycles | 292843 cycles | 0.96 |
| ML-DSA-44 sign | 884062 cycles | 937296 cycles | 0.94 |
| ML-DSA-44 verify | 279440 cycles | 292376 cycles | 0.96 |
| ML-DSA-65 keypair | 479793 cycles | 493195 cycles | 0.97 |
| ML-DSA-65 sign | 1449277 cycles | 1528649 cycles | 0.95 |
| ML-DSA-65 verify | 457015 cycles | 477135 cycles | 0.96 |
| ML-DSA-87 keypair | 820376 cycles | 843007 cycles | 0.97 |
| ML-DSA-87 sign | 1974277 cycles | 2059907 cycles | 0.96 |
| ML-DSA-87 verify | 789444 cycles | 818544 cycles | 0.96 |
This comment was automatically generated by workflow using github-action-benchmark.
Arm Cortex-A76 (Raspberry Pi 5) benchmarks (opt)
| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 114851 cycles | 115551 cycles | 0.99 |
| ML-DSA-44 sign | 371125 cycles | 377243 cycles | 0.98 |
| ML-DSA-44 verify | 118439 cycles | 120533 cycles | 0.98 |
| ML-DSA-65 keypair | 199168 cycles | 200181 cycles | 0.99 |
| ML-DSA-65 sign | 615068 cycles | 623060 cycles | 0.99 |
| ML-DSA-65 verify | 195862 cycles | 198360 cycles | 0.99 |
| ML-DSA-87 keypair | 325573 cycles | 327214 cycles | 0.99 |
| ML-DSA-87 sign | 779596 cycles | 791357 cycles | 0.99 |
| ML-DSA-87 verify | 321385 cycles | 324866 cycles | 0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
Arm Cortex-A76 (Raspberry Pi 5) benchmarks (no-opt)
| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 213659 cycles | 213758 cycles | 1.00 |
| ML-DSA-44 sign | 783503 cycles | 783969 cycles | 1.00 |
| ML-DSA-44 verify | 229912 cycles | 229501 cycles | 1.00 |
| ML-DSA-65 keypair | 385217 cycles | 384816 cycles | 1.00 |
| ML-DSA-65 sign | 1306979 cycles | 1314407 cycles | 0.99 |
| ML-DSA-65 verify | 375190 cycles | 375914 cycles | 1.00 |
| ML-DSA-87 keypair | 605699 cycles | 606891 cycles | 1.00 |
| ML-DSA-87 sign | 1622568 cycles | 1623316 cycles | 1.00 |
| ML-DSA-87 verify | 617517 cycles | 617094 cycles | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
Arm Cortex-A55 (Snapdragon 888) benchmarks (no-opt)
| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 464939 cycles | 469580 cycles | 0.99 |
| ML-DSA-44 sign | 2207072 cycles | 2223398 cycles | 0.99 |
| ML-DSA-44 verify | 545501 cycles | 546853 cycles | 1.00 |
| ML-DSA-65 keypair | 779139 cycles | 782408 cycles | 1.00 |
| ML-DSA-65 sign | 3616666 cycles | 3632236 cycles | 1.00 |
| ML-DSA-65 verify | 847160 cycles | 852498 cycles | 0.99 |
| ML-DSA-87 keypair | 1257483 cycles | 1266251 cycles | 0.99 |
| ML-DSA-87 sign | 4440506 cycles | 4476468 cycles | 0.99 |
| ML-DSA-87 verify | 1364504 cycles | 1370939 cycles | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)
| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 222177 cycles | 232557 cycles | 0.96 |
| ML-DSA-44 sign | 654716 cycles | 682235 cycles | 0.96 |
| ML-DSA-44 verify | 218067 cycles | 236197 cycles | 0.92 |
| ML-DSA-65 keypair | 404793 cycles | 402452 cycles | 1.01 |
| ML-DSA-65 sign | 1093903 cycles | 1089031 cycles | 1.00 |
| ML-DSA-65 verify | 384104 cycles | 385202 cycles | 1.00 |
| ML-DSA-87 keypair | 651823 cycles | 659770 cycles | 0.99 |
| ML-DSA-87 sign | 1413048 cycles | 1479498 cycles | 0.96 |
| ML-DSA-87 verify | 639532 cycles | 650241 cycles | 0.98 |
This comment was automatically generated by workflow using github-action-benchmark.
Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)
| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 335258 cycles | 302190 cycles | 1.11 |
| ML-DSA-44 sign | 1220132 cycles | 1168558 cycles | 1.04 |
| ML-DSA-44 verify | 338235 cycles | 325443 cycles | 1.04 |
| ML-DSA-65 keypair | 587268 cycles | 555575 cycles | 1.06 |
| ML-DSA-65 sign | 1989637 cycles | 1948293 cycles | 1.02 |
| ML-DSA-65 verify | 544073 cycles | 529304 cycles | 1.03 |
| ML-DSA-87 keypair | 880186 cycles | 869559 cycles | 1.01 |
| ML-DSA-87 sign | 2507767 cycles | 2440395 cycles | 1.03 |
| ML-DSA-87 verify | 916613 cycles | 880002 cycles | 1.04 |
This comment was automatically generated by workflow using github-action-benchmark.
SpacemiT K1 8 (Banana Pi F3) benchmarks (no-opt)
| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 827126 cycles | 827763 cycles | 1.00 |
| ML-DSA-44 sign | 3343137 cycles | 3332871 cycles | 1.00 |
| ML-DSA-44 verify | 922091 cycles | 920517 cycles | 1.00 |
| ML-DSA-65 keypair | 1401530 cycles | 1401774 cycles | 1.00 |
| ML-DSA-65 sign | 5435183 cycles | 5440049 cycles | 1.00 |
| ML-DSA-65 verify | 1470070 cycles | 1468550 cycles | 1.00 |
| ML-DSA-87 keypair | 2313496 cycles | 2302968 cycles | 1.00 |
| ML-DSA-87 sign | 6840430 cycles | 6810359 cycles | 1.00 |
| ML-DSA-87 verify | 2407702 cycles | 2405483 cycles | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
7d2fff7 to 62db7f8
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.
| Benchmark suite | Current: 6e9e43f | Previous: dcc95d6 | Ratio |
|---|---|---|---|
| ML-DSA-87 sign | 1473360 cycles | 1416559 cycles | 1.04 |
This comment was automatically generated by workflow using github-action-benchmark.
oqs-bot left a comment
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Intel Xeon 3rd gen (c6i)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.
| Benchmark suite | Current: 6e9e43f | Previous: dcc95d6 | Ratio |
|---|---|---|---|
| ML-DSA-87 keypair | 174494 cycles | 153880 cycles | 1.13 |
This comment was automatically generated by workflow using github-action-benchmark.
a3fa9f0 to 540908f
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Mac Mini (M1, 2020) benchmarks (opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.
| Benchmark suite | Current: 540908f | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 sign | 137880 cycles | 132762 cycles | 1.04 |
| ML-DSA-65 sign | 226830 cycles | 219166 cycles | 1.03 |
| ML-DSA-87 sign | 289790 cycles | 281096 cycles | 1.03 |
This comment was automatically generated by workflow using github-action-benchmark.
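For readers checking the alert arithmetic, here is a minimal sketch of how the ratio column and the regression flag could be reproduced from the cycle counts above. It assumes the ratio is current divided by previous and that an alert fires when the unrounded ratio exceeds the threshold of 1.03. The function names are hypothetical and are not taken from github-action-benchmark.

```python
# Hypothetical helper names; this is not github-action-benchmark's code.
THRESHOLD = 1.03  # alert threshold quoted in the comment above


def ratio(current_cycles: int, previous_cycles: int) -> float:
    """Slowdown factor: values above 1.0 mean the current commit is slower."""
    return current_cycles / previous_cycles


def is_regression(current_cycles: int, previous_cycles: int,
                  threshold: float = THRESHOLD) -> bool:
    """Assumption: the unrounded ratio is compared against the threshold."""
    return ratio(current_cycles, previous_cycles) > threshold


# (current, previous) cycle counts from the Mac Mini (M1, 2020) alert above.
results = {
    "ML-DSA-44 sign": (137880, 132762),
    "ML-DSA-65 sign": (226830, 219166),
    "ML-DSA-87 sign": (289790, 281096),
}

for name, (cur, prev) in results.items():
    flagged = is_regression(cur, prev)
    print(f"{name}: ratio {ratio(cur, prev):.2f}, regression: {flagged}")
```

Under these assumptions, a row can display a ratio of 1.03 and still be flagged, because the comparison uses the unrounded value (for example 226830 / 219166 is slightly above 1.03).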
oqs-bot left a comment
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'AMD EPYC 4th gen (c7a)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.
| Benchmark suite | Current: 540908f | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-65 keypair | 75530 cycles | 72910 cycles | 1.04 |
| ML-DSA-65 verify | 76179 cycles | 73802 cycles | 1.03 |
This comment was automatically generated by workflow using github-action-benchmark.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.
| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 335258 cycles | 302190 cycles | 1.11 |
| ML-DSA-44 sign | 1220132 cycles | 1168558 cycles | 1.04 |
| ML-DSA-44 verify | 338235 cycles | 325443 cycles | 1.04 |
| ML-DSA-65 keypair | 587268 cycles | 555575 cycles | 1.06 |
| ML-DSA-87 verify | 916613 cycles | 880002 cycles | 1.04 |
This comment was automatically generated by workflow using github-action-benchmark.
3bed6ee to 4b8edae
1d13c36 to 9ea0b14
hanno-becker left a comment
LGTM. I also tested this locally on my M1 and it worked like a charm.
9ea0b14 to 5e7e103
Took the liberty to resolve the rebase conflict in |
5e7e103 to f4a1257
This commit adds the optimized backend dev/aarch64_opt. For now, this backend differs from the clean backend only in the NTT, which is superoptimized using SLOTHY for the Neoverse N1. All other files are simple copies of the clean backend. A Makefile is added that performs the optimization. CI is adjusted to test both the clean and the opt backend.

The first loop of the NTT can be optimized in one go. The second loop is too large, so we use the split heuristic.

I have experimented with the Cortex-A55 model as well. That results in significantly faster code on A55, but in a noticeable slowdown elsewhere, especially on the A72 (see performance results in the pull request). A72 performance seems more important than A55 performance.

I have experimented with applying some other optimizations (from the SLOTHY paper):
- Using st4 instead of the manual transposition
- Using scalar loads instead of vector loads

While those result in much better performance on Cortex-A55, they slow down code on other platforms (see the pull request for details).

The autogen script is extended to allow running the optimization through the --slothy flag.

Signed-off-by: Matthias J. Kannwischer <[email protected]>
f4a1257 to f5079fb
This commit adds the optimized backend dev/aarch64_opt. For now this backend
only differs from the clean backend in the NTT which is superoptimized using
SLOTHY for the Neoverse N1. For all other files it's a simple copy of the clean
backend. A Makefile is added that performs the optimization.
CI is adjusted to test both the clean and the opt backend.
The first loop of the NTT can be optimized in one go. The second loop is
too large, so we use the split heuristic.
I have experimented with the Cortex-A55 model as well. That results in
significantly faster code on A55, but in a noticeable slowdown elsewhere,
especially on the A72 (see performance results in the pull request).
A72 performance seems more important than A55 performance.
I have experimented with applying some other optimizations (from the SLOTHY
paper):
- Using st4 instead of the manual transposition
- Using scalar loads instead of vector loads
While those result in much better performance on Cortex-A55, they slow down
code on other platforms (see the pull request for details).
The autogen script is extended to allow running the optimization through the
--slothy flag.
CI is added to test optimization.
Speed-ups for the NTT when using the Neoverse N1 model:
I also tried optimizing using the Cortex-A55 model (but results on A72 are a bit worse)
I tried applying additional tricks from the SLOTHY paper (st4, scalar loads) -- see https://github.com/pq-code-package/mldsa-native/tree/slothy-ntt-st4-scalar. When optimizing for the A55 that gives me: