@mkannwischer mkannwischer commented Nov 19, 2025

This commit adds the optimized backend dev/aarch64_opt. For now this backend
only differs from the clean backend in the NTT which is superoptimized using
SLOTHY for the Neoverse N1. For all other files it's a simple copy of the clean
backend. A Makefile is added that performs the optimization.
CI is adjusted to test both the clean and the opt backend.

The first loop of the NTT can be optimized in one go. The second loop is
too large for that, so we use the split heuristic.

I have experimented with the Cortex-A55 model as well. That results in
significantly faster code on A55, but causes a noticeable slowdown,
especially on A72 (see performance results in the pull request).
A72 performance seems more important than A55 performance.

I have experimented with applying some other optimizations (from the SLOTHY
paper):

  • Using st4 instead of the manual transposition
  • Using scalar loads instead of vector loads

While those result in much better performance on Cortex-A55, they slow down
code on other platforms (see the pull request for details).

The autogen script is extended to allow running the optimization through the
--slothy flag.

CI is added to test optimization.

Speed-ups for the NTT when using the Neoverse N1 model:

| Platform | Before (main) | After | Speedup |
|---|---|---|---|
| Mac Mini (M1) | 440 | 419 | 1.050x |
| Graviton2 | 1755 | 1582 | 1.109x |
| Graviton3 | 936 | 720 | 1.300x |
| Graviton4 | 826 | 673 | 1.227x |
| Cortex-A76 (RPi 5) | 1755 | 1582 | 1.109x |
| Cortex-A72 (RPi 4) | 2102 | 1695 | 1.240x |
| Cortex-A55 (Snapdragon) | 3625 | 2352 | 1.541x |
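
For reference, the Speedup column is the "Before (main)" figure divided by the "After" figure. A minimal sketch that recomputes it from the numbers above; the names `ntt_n1_model` and `speedup` are made up for illustration and are not part of the repository:

```python
# NTT figures copied from the table above: platform -> (before on main, after).
ntt_n1_model = {
    "Mac Mini (M1)": (440, 419),
    "Graviton2": (1755, 1582),
    "Graviton3": (936, 720),
    "Graviton4": (826, 673),
    "Cortex-A76 (RPi 5)": (1755, 1582),
    "Cortex-A72 (RPi 4)": (2102, 1695),
    "Cortex-A55 (Snapdragon)": (3625, 2352),
}


def speedup(before, after):
    """Return how many times faster the new code is (before / after)."""
    return before / after


for platform, (before, after) in ntt_n1_model.items():
    print(f"{platform:<26} {speedup(before, after):.3f}x")
```

A value above 1 means the optimized NTT is faster than main on that platform.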

I also tried optimizing using the Cortex-A55 model (but results on A72 are worse: 1.089x instead of 1.240x):

| Platform | Before (main) | After | Speedup |
|---|---|---|---|
| Mac Mini (M1) | 440 | 418 | 1.053x |
| Graviton2 | 1755 | 1599 | 1.098x |
| Graviton3 | 936 | 713 | 1.313x |
| Graviton4 | 826 | 680 | 1.215x |
| Cortex-A76 (RPi 5) | 1755 | 1599 | 1.098x |
| Cortex-A72 (RPi 4) | 2102 | 1931 | 1.089x |
| Cortex-A55 (Snapdragon) | 3625 | 2236 | 1.621x |
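
To make the trade-off between the two models easier to see, here is a small sketch that compares the "After" figures of the Neoverse N1 model and the Cortex-A55 model per platform. The numbers are copied from the two tables in this description; `after` and `relative_change` are illustrative names only:

```python
# "After" NTT figures per platform: (Neoverse N1 model, Cortex-A55 model).
after = {
    "Mac Mini (M1)": (419, 418),
    "Graviton2": (1582, 1599),
    "Graviton3": (720, 713),
    "Graviton4": (673, 680),
    "Cortex-A76 (RPi 5)": (1582, 1599),
    "Cortex-A72 (RPi 4)": (1695, 1931),
    "Cortex-A55 (Snapdragon)": (2352, 2236),
}


def relative_change(n1, a55):
    """Percent change when moving from the N1 model to the A55 model.

    Positive means the A55-model code is slower on that platform.
    """
    return (a55 - n1) / n1 * 100.0


changes = {p: relative_change(n1, a55) for p, (n1, a55) in after.items()}
for platform, pct in sorted(changes.items(), key=lambda kv: kv[1]):
    print(f"{platform:<26} {pct:+.1f}%")
```

With these figures the largest slowdown from the A55 model is on the Cortex-A72 (about 14%), while the Cortex-A55 itself gains about 5%.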

I tried applying additional tricks from the SLOTHY paper (st4, scalar loads) -- see https://github.com/pq-code-package/mldsa-native/tree/slothy-ntt-st4-scalar. When optimizing for the A55 that gives me:

| Platform | After |
|---|---|
| Mac Mini (M1) | 619 |
| Graviton2 | 1758 |
| Graviton3 | 927 |
| Graviton4 | 894 |
| Cortex-A76 (RPi 5) | 1758 |
| Cortex-A72 (RPi 4) | 2162 |
| Cortex-A55 (Snapdragon) | 1952 |
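
Setting these figures against the "Before (main)" column of the first table shows where the st4/scalar-load variant falls behind main. A small sketch with the numbers copied from this description; the dictionary names are hypothetical:

```python
# NTT figures on main, from the "Before (main)" column of the first table.
main = {
    "Mac Mini (M1)": 440,
    "Graviton2": 1755,
    "Graviton3": 936,
    "Graviton4": 826,
    "Cortex-A76 (RPi 5)": 1755,
    "Cortex-A72 (RPi 4)": 2102,
    "Cortex-A55 (Snapdragon)": 3625,
}

# Figures for the st4 + scalar-load variant optimized for the A55 model.
st4_scalar = {
    "Mac Mini (M1)": 619,
    "Graviton2": 1758,
    "Graviton3": 927,
    "Graviton4": 894,
    "Cortex-A76 (RPi 5)": 1758,
    "Cortex-A72 (RPi 4)": 2162,
    "Cortex-A55 (Snapdragon)": 1952,
}

# Platforms on which the variant needs more time than the code on main.
slower_than_main = [p for p in main if st4_scalar[p] > main[p]]
print("Slower than main:", ", ".join(slower_than_main))
print(f"A55 speedup: {main['Cortex-A55 (Snapdragon)'] / st4_scalar['Cortex-A55 (Snapdragon)']:.3f}x")
```

Only Graviton3 and the Cortex-A55 improve over main with this variant; the other five platforms regress.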

@mkannwischer mkannwischer force-pushed the slothy-ntt2 branch 2 times, most recently from 3db1e37 to 7d2fff7 November 19, 2025 02:55
@github-actions github-actions bot left a comment

Mac Mini (M1, 2020) benchmarks (opt)

| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 46404 cycles | 46493 cycles | 1.00 |
| ML-DSA-44 sign | 132024 cycles | 132762 cycles | 0.99 |
| ML-DSA-44 verify | 47650 cycles | 47845 cycles | 1.00 |
| ML-DSA-65 keypair | 81342 cycles | 81453 cycles | 1.00 |
| ML-DSA-65 sign | 218211 cycles | 219166 cycles | 1.00 |
| ML-DSA-65 verify | 79868 cycles | 80110 cycles | 1.00 |
| ML-DSA-87 keypair | 132455 cycles | 132613 cycles | 1.00 |
| ML-DSA-87 sign | 279824 cycles | 281096 cycles | 1.00 |
| ML-DSA-87 verify | 130061 cycles | 130374 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Mac Mini (M1, 2020) benchmarks (no-opt)

| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 115253 cycles | 115241 cycles | 1.00 |
| ML-DSA-44 sign | 430478 cycles | 430437 cycles | 1.00 |
| ML-DSA-44 verify | 122166 cycles | 122150 cycles | 1.00 |
| ML-DSA-65 keypair | 197170 cycles | 197159 cycles | 1.00 |
| ML-DSA-65 sign | 700211 cycles | 700291 cycles | 1.00 |
| ML-DSA-65 verify | 197615 cycles | 197609 cycles | 1.00 |
| ML-DSA-87 keypair | 325635 cycles | 325599 cycles | 1.00 |
| ML-DSA-87 sign | 884161 cycles | 884117 cycles | 1.00 |
| ML-DSA-87 verify | 328981 cycles | 328935 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Intel Xeon 4th gen (c7i)

| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 34977 cycles | 35033 cycles | 1.00 |
| ML-DSA-44 sign | 119597 cycles | 120639 cycles | 0.99 |
| ML-DSA-44 verify | 38066 cycles | 38096 cycles | 1.00 |
| ML-DSA-65 keypair | 62933 cycles | 62103 cycles | 1.01 |
| ML-DSA-65 sign | 201882 cycles | 199840 cycles | 1.01 |
| ML-DSA-65 verify | 62796 cycles | 62386 cycles | 1.01 |
| ML-DSA-87 keypair | 95163 cycles | 94045 cycles | 1.01 |
| ML-DSA-87 sign | 235243 cycles | 230366 cycles | 1.02 |
| ML-DSA-87 verify | 94086 cycles | 93695 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Intel Xeon 4th gen (c7i) (no-opt)

| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 95730 cycles | 95923 cycles | 1.00 |
| ML-DSA-44 sign | 348383 cycles | 349606 cycles | 1.00 |
| ML-DSA-44 verify | 101264 cycles | 101539 cycles | 1.00 |
| ML-DSA-65 keypair | 163494 cycles | 164092 cycles | 1.00 |
| ML-DSA-65 sign | 564771 cycles | 565519 cycles | 1.00 |
| ML-DSA-65 verify | 164927 cycles | 165145 cycles | 1.00 |
| ML-DSA-87 keypair | 267446 cycles | 267412 cycles | 1.00 |
| ML-DSA-87 sign | 722795 cycles | 723281 cycles | 1.00 |
| ML-DSA-87 verify | 270813 cycles | 271148 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a)

| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 69443 cycles | 69283 cycles | 1.00 |
| ML-DSA-44 sign | 185039 cycles | 184736 cycles | 1.00 |
| ML-DSA-44 verify | 69118 cycles | 68943 cycles | 1.00 |
| ML-DSA-65 keypair | 119271 cycles | 119333 cycles | 1.00 |
| ML-DSA-65 sign | 295027 cycles | 294861 cycles | 1.00 |
| ML-DSA-65 verify | 114865 cycles | 115240 cycles | 1.00 |
| ML-DSA-87 keypair | 202095 cycles | 201809 cycles | 1.00 |
| ML-DSA-87 sign | 385443 cycles | 386059 cycles | 1.00 |
| ML-DSA-87 verify | 193677 cycles | 193415 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton4

| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 69106 cycles | 69757 cycles | 0.99 |
| ML-DSA-44 sign | 208510 cycles | 213820 cycles | 0.98 |
| ML-DSA-44 verify | 70953 cycles | 72626 cycles | 0.98 |
| ML-DSA-65 keypair | 122181 cycles | 122920 cycles | 0.99 |
| ML-DSA-65 sign | 342337 cycles | 350128 cycles | 0.98 |
| ML-DSA-65 verify | 118280 cycles | 120392 cycles | 0.98 |
| ML-DSA-87 keypair | 200083 cycles | 201066 cycles | 1.00 |
| ML-DSA-87 sign | 440000 cycles | 449443 cycles | 0.98 |
| ML-DSA-87 verify | 195108 cycles | 198563 cycles | 0.98 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i)

| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 56922 cycles | 58073 cycles | 0.98 |
| ML-DSA-44 sign | 179895 cycles | 179585 cycles | 1.00 |
| ML-DSA-44 verify | 60993 cycles | 60950 cycles | 1.00 |
| ML-DSA-65 keypair | 99702 cycles | 99876 cycles | 1.00 |
| ML-DSA-65 sign | 296395 cycles | 296275 cycles | 1.00 |
| ML-DSA-65 verify | 99953 cycles | 100357 cycles | 1.00 |
| ML-DSA-87 keypair | 154280 cycles | 154306 cycles | 1.00 |
| ML-DSA-87 sign | 352801 cycles | 352518 cycles | 1.00 |
| ML-DSA-87 verify | 152426 cycles | 151736 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a) (no-opt)

| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 135705 cycles | 135922 cycles | 1.00 |
| ML-DSA-44 sign | 540454 cycles | 541395 cycles | 1.00 |
| ML-DSA-44 verify | 148955 cycles | 148646 cycles | 1.00 |
| ML-DSA-65 keypair | 228221 cycles | 228378 cycles | 1.00 |
| ML-DSA-65 sign | 890666 cycles | 888828 cycles | 1.00 |
| ML-DSA-65 verify | 237625 cycles | 237994 cycles | 1.00 |
| ML-DSA-87 keypair | 374149 cycles | 372874 cycles | 1.00 |
| ML-DSA-87 sign | 1107455 cycles | 1106360 cycles | 1.00 |
| ML-DSA-87 verify | 387864 cycles | 387292 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a)

| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 42040 cycles | 42636 cycles | 0.99 |
| ML-DSA-44 sign | 130535 cycles | 131511 cycles | 0.99 |
| ML-DSA-44 verify | 44019 cycles | 44987 cycles | 0.98 |
| ML-DSA-65 keypair | 71749 cycles | 72910 cycles | 0.98 |
| ML-DSA-65 sign | 211719 cycles | 213828 cycles | 0.99 |
| ML-DSA-65 verify | 71689 cycles | 73802 cycles | 0.97 |
| ML-DSA-87 keypair | 110650 cycles | 109892 cycles | 1.01 |
| ML-DSA-87 sign | 251980 cycles | 249297 cycles | 1.01 |
| ML-DSA-87 verify | 111284 cycles | 110122 cycles | 1.01 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton4 (no-opt)

| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 128263 cycles | 128322 cycles | 1.00 |
| ML-DSA-44 sign | 456597 cycles | 456713 cycles | 1.00 |
| ML-DSA-44 verify | 136325 cycles | 136331 cycles | 1.00 |
| ML-DSA-65 keypair | 220618 cycles | 220718 cycles | 1.00 |
| ML-DSA-65 sign | 745989 cycles | 746458 cycles | 1.00 |
| ML-DSA-65 verify | 220650 cycles | 220327 cycles | 1.00 |
| ML-DSA-87 keypair | 364973 cycles | 365321 cycles | 1.00 |
| ML-DSA-87 sign | 944314 cycles | 943476 cycles | 1.00 |
| ML-DSA-87 verify | 368962 cycles | 369250 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton2

| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 116074 cycles | 115806 cycles | 1.00 |
| ML-DSA-44 sign | 373987 cycles | 377649 cycles | 0.99 |
| ML-DSA-44 verify | 119596 cycles | 120580 cycles | 0.99 |
| ML-DSA-65 keypair | 199521 cycles | 200343 cycles | 1.00 |
| ML-DSA-65 sign | 615748 cycles | 623612 cycles | 0.99 |
| ML-DSA-65 verify | 196049 cycles | 198405 cycles | 0.99 |
| ML-DSA-87 keypair | 326639 cycles | 327909 cycles | 1.00 |
| ML-DSA-87 sign | 780618 cycles | 792403 cycles | 0.99 |
| ML-DSA-87 verify | 322185 cycles | 325206 cycles | 0.99 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton3

| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 73273 cycles | 74035 cycles | 0.99 |
| ML-DSA-44 sign | 221496 cycles | 228487 cycles | 0.97 |
| ML-DSA-44 verify | 76006 cycles | 78067 cycles | 0.97 |
| ML-DSA-65 keypair | 129453 cycles | 130734 cycles | 0.99 |
| ML-DSA-65 sign | 368531 cycles | 378739 cycles | 0.97 |
| ML-DSA-65 verify | 126698 cycles | 129237 cycles | 0.98 |
| ML-DSA-87 keypair | 210952 cycles | 212581 cycles | 0.99 |
| ML-DSA-87 sign | 467907 cycles | 479894 cycles | 0.98 |
| ML-DSA-87 verify | 206847 cycles | 209118 cycles | 0.99 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i) (no-opt)

| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 158743 cycles | 158555 cycles | 1.00 |
| ML-DSA-44 sign | 564831 cycles | 565027 cycles | 1.00 |
| ML-DSA-44 verify | 170620 cycles | 170312 cycles | 1.00 |
| ML-DSA-65 keypair | 269035 cycles | 271317 cycles | 0.99 |
| ML-DSA-65 sign | 924993 cycles | 931590 cycles | 0.99 |
| ML-DSA-65 verify | 275075 cycles | 276884 cycles | 0.99 |
| ML-DSA-87 keypair | 451854 cycles | 451637 cycles | 1.00 |
| ML-DSA-87 sign | 1183125 cycles | 1183472 cycles | 1.00 |
| ML-DSA-87 verify | 460875 cycles | 460346 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a) (no-opt)

| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 120037 cycles | 121562 cycles | 0.99 |
| ML-DSA-44 sign | 454020 cycles | 458684 cycles | 0.99 |
| ML-DSA-44 verify | 130567 cycles | 131322 cycles | 0.99 |
| ML-DSA-65 keypair | 205595 cycles | 205167 cycles | 1.00 |
| ML-DSA-65 sign | 736345 cycles | 738228 cycles | 1.00 |
| ML-DSA-65 verify | 209678 cycles | 211576 cycles | 0.99 |
| ML-DSA-87 keypair | 337811 cycles | 338466 cycles | 1.00 |
| ML-DSA-87 sign | 926359 cycles | 926408 cycles | 1.00 |
| ML-DSA-87 verify | 344492 cycles | 346331 cycles | 0.99 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton3 (no-opt)

| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 138808 cycles | 138856 cycles | 1.00 |
| ML-DSA-44 sign | 493785 cycles | 493869 cycles | 1.00 |
| ML-DSA-44 verify | 148422 cycles | 148467 cycles | 1.00 |
| ML-DSA-65 keypair | 241822 cycles | 242331 cycles | 1.00 |
| ML-DSA-65 sign | 808313 cycles | 809068 cycles | 1.00 |
| ML-DSA-65 verify | 240751 cycles | 240460 cycles | 1.00 |
| ML-DSA-87 keypair | 396480 cycles | 396817 cycles | 1.00 |
| ML-DSA-87 sign | 1027114 cycles | 1026758 cycles | 1.00 |
| ML-DSA-87 verify | 401934 cycles | 402055 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton2 (no-opt)

| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 214257 cycles | 214559 cycles | 1.00 |
| ML-DSA-44 sign | 783606 cycles | 782675 cycles | 1.00 |
| ML-DSA-44 verify | 230521 cycles | 230081 cycles | 1.00 |
| ML-DSA-65 keypair | 382833 cycles | 385317 cycles | 0.99 |
| ML-DSA-65 sign | 1288735 cycles | 1310339 cycles | 0.98 |
| ML-DSA-65 verify | 372307 cycles | 376384 cycles | 0.99 |
| ML-DSA-87 keypair | 605982 cycles | 607198 cycles | 1.00 |
| ML-DSA-87 sign | 1625311 cycles | 1625770 cycles | 1.00 |
| ML-DSA-87 verify | 617432 cycles | 617102 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A55 (Snapdragon 888) benchmarks (opt)

| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 282343 cycles | 292843 cycles | 0.96 |
| ML-DSA-44 sign | 884062 cycles | 937296 cycles | 0.94 |
| ML-DSA-44 verify | 279440 cycles | 292376 cycles | 0.96 |
| ML-DSA-65 keypair | 479793 cycles | 493195 cycles | 0.97 |
| ML-DSA-65 sign | 1449277 cycles | 1528649 cycles | 0.95 |
| ML-DSA-65 verify | 457015 cycles | 477135 cycles | 0.96 |
| ML-DSA-87 keypair | 820376 cycles | 843007 cycles | 0.97 |
| ML-DSA-87 sign | 1974277 cycles | 2059907 cycles | 0.96 |
| ML-DSA-87 verify | 789444 cycles | 818544 cycles | 0.96 |

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (opt)

| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 114851 cycles | 115551 cycles | 0.99 |
| ML-DSA-44 sign | 371125 cycles | 377243 cycles | 0.98 |
| ML-DSA-44 verify | 118439 cycles | 120533 cycles | 0.98 |
| ML-DSA-65 keypair | 199168 cycles | 200181 cycles | 0.99 |
| ML-DSA-65 sign | 615068 cycles | 623060 cycles | 0.99 |
| ML-DSA-65 verify | 195862 cycles | 198360 cycles | 0.99 |
| ML-DSA-87 keypair | 325573 cycles | 327214 cycles | 0.99 |
| ML-DSA-87 sign | 779596 cycles | 791357 cycles | 0.99 |
| ML-DSA-87 verify | 321385 cycles | 324866 cycles | 0.99 |

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (no-opt)

| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 213659 cycles | 213758 cycles | 1.00 |
| ML-DSA-44 sign | 783503 cycles | 783969 cycles | 1.00 |
| ML-DSA-44 verify | 229912 cycles | 229501 cycles | 1.00 |
| ML-DSA-65 keypair | 385217 cycles | 384816 cycles | 1.00 |
| ML-DSA-65 sign | 1306979 cycles | 1314407 cycles | 0.99 |
| ML-DSA-65 verify | 375190 cycles | 375914 cycles | 1.00 |
| ML-DSA-87 keypair | 605699 cycles | 606891 cycles | 1.00 |
| ML-DSA-87 sign | 1622568 cycles | 1623316 cycles | 1.00 |
| ML-DSA-87 verify | 617517 cycles | 617094 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A55 (Snapdragon 888) benchmarks (no-opt)

| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 464939 cycles | 469580 cycles | 0.99 |
| ML-DSA-44 sign | 2207072 cycles | 2223398 cycles | 0.99 |
| ML-DSA-44 verify | 545501 cycles | 546853 cycles | 1.00 |
| ML-DSA-65 keypair | 779139 cycles | 782408 cycles | 1.00 |
| ML-DSA-65 sign | 3616666 cycles | 3632236 cycles | 1.00 |
| ML-DSA-65 verify | 847160 cycles | 852498 cycles | 0.99 |
| ML-DSA-87 keypair | 1257483 cycles | 1266251 cycles | 0.99 |
| ML-DSA-87 sign | 4440506 cycles | 4476468 cycles | 0.99 |
| ML-DSA-87 verify | 1364504 cycles | 1370939 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)

| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 222177 cycles | 232557 cycles | 0.96 |
| ML-DSA-44 sign | 654716 cycles | 682235 cycles | 0.96 |
| ML-DSA-44 verify | 218067 cycles | 236197 cycles | 0.92 |
| ML-DSA-65 keypair | 404793 cycles | 402452 cycles | 1.01 |
| ML-DSA-65 sign | 1093903 cycles | 1089031 cycles | 1.00 |
| ML-DSA-65 verify | 384104 cycles | 385202 cycles | 1.00 |
| ML-DSA-87 keypair | 651823 cycles | 659770 cycles | 0.99 |
| ML-DSA-87 sign | 1413048 cycles | 1479498 cycles | 0.96 |
| ML-DSA-87 verify | 639532 cycles | 650241 cycles | 0.98 |

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)

| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 335258 cycles | 302190 cycles | 1.11 |
| ML-DSA-44 sign | 1220132 cycles | 1168558 cycles | 1.04 |
| ML-DSA-44 verify | 338235 cycles | 325443 cycles | 1.04 |
| ML-DSA-65 keypair | 587268 cycles | 555575 cycles | 1.06 |
| ML-DSA-65 sign | 1989637 cycles | 1948293 cycles | 1.02 |
| ML-DSA-65 verify | 544073 cycles | 529304 cycles | 1.03 |
| ML-DSA-87 keypair | 880186 cycles | 869559 cycles | 1.01 |
| ML-DSA-87 sign | 2507767 cycles | 2440395 cycles | 1.03 |
| ML-DSA-87 verify | 916613 cycles | 880002 cycles | 1.04 |

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

SpacemiT K1 8 (Banana Pi F3) benchmarks (no-opt)

| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 827126 cycles | 827763 cycles | 1.00 |
| ML-DSA-44 sign | 3343137 cycles | 3332871 cycles | 1.00 |
| ML-DSA-44 verify | 922091 cycles | 920517 cycles | 1.00 |
| ML-DSA-65 keypair | 1401530 cycles | 1401774 cycles | 1.00 |
| ML-DSA-65 sign | 5435183 cycles | 5440049 cycles | 1.00 |
| ML-DSA-65 verify | 1470070 cycles | 1468550 cycles | 1.00 |
| ML-DSA-87 keypair | 2313496 cycles | 2302968 cycles | 1.00 |
| ML-DSA-87 sign | 6840430 cycles | 6810359 cycles | 1.00 |
| ML-DSA-87 verify | 2407702 cycles | 2405483 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

| Benchmark suite | Current: 6e9e43f | Previous: dcc95d6 | Ratio |
|---|---|---|---|
| ML-DSA-87 sign | 1473360 cycles | 1416559 cycles | 1.04 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Intel Xeon 3rd gen (c6i)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

| Benchmark suite | Current: 6e9e43f | Previous: dcc95d6 | Ratio |
|---|---|---|---|
| ML-DSA-87 keypair | 174494 cycles | 153880 cycles | 1.13 |

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Mac Mini (M1, 2020) benchmarks (opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

| Benchmark suite | Current: 540908f | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 sign | 137880 cycles | 132762 cycles | 1.04 |
| ML-DSA-65 sign | 226830 cycles | 219166 cycles | 1.03 |
| ML-DSA-87 sign | 289790 cycles | 281096 cycles | 1.03 |

This comment was automatically generated by workflow using github-action-benchmark.
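
The alert above lists rows whose displayed ratio equals the 1.03 threshold. A minimal sketch of why that can happen, assuming the ratio is current cycles divided by previous cycles and is compared against the threshold before being rounded to two decimals for display. This is an assumption about the benchmark workflow, not taken from its source, and the names below are illustrative:

```python
THRESHOLD = 1.03  # alert threshold quoted in the alert above

# Rows from the Mac Mini (M1, 2020) alert: name -> (current, previous) cycles.
rows = {
    "ML-DSA-44 sign": (137880, 132762),
    "ML-DSA-65 sign": (226830, 219166),
    "ML-DSA-87 sign": (289790, 281096),
}


def regression_ratio(current_cycles, previous_cycles):
    """Ratio above 1 means the current commit needs more cycles."""
    return current_cycles / previous_cycles


for name, (current, previous) in rows.items():
    ratio = regression_ratio(current, previous)
    # The unrounded ratio decides the alert; the rounded one is what is shown.
    print(f"{name}: {ratio:.4f} (shown as {ratio:.2f}) alert={ratio > THRESHOLD}")
```

Under that assumption, all three rows exceed the threshold even though two of them are displayed as 1.03.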

@oqs-bot oqs-bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'AMD EPYC 4th gen (c7a)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

| Benchmark suite | Current: 540908f | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-65 keypair | 75530 cycles | 72910 cycles | 1.04 |
| ML-DSA-65 verify | 76179 cycles | 73802 cycles | 1.03 |

This comment was automatically generated by workflow using github-action-benchmark.

@mkannwischer mkannwischer changed the title from "SLOTHY: Superoptimize ntt.S for the Neoverse N1" to "SLOTHY: Superoptimize AArch64 NTT" Nov 20, 2025
@github-actions github-actions bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

| Benchmark suite | Current: 4b8edae | Previous: c6d7c93 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 335258 cycles | 302190 cycles | 1.11 |
| ML-DSA-44 sign | 1220132 cycles | 1168558 cycles | 1.04 |
| ML-DSA-44 verify | 338235 cycles | 325443 cycles | 1.04 |
| ML-DSA-65 keypair | 587268 cycles | 555575 cycles | 1.06 |
| ML-DSA-87 verify | 916613 cycles | 880002 cycles | 1.04 |

This comment was automatically generated by workflow using github-action-benchmark.

@mkannwischer mkannwischer force-pushed the slothy-ntt2 branch 2 times, most recently from 3bed6ee to 4b8edae November 20, 2025 04:15
@mkannwischer mkannwischer marked this pull request as ready for review November 20, 2025 04:17
@mkannwischer mkannwischer requested a review from a team as a code owner November 20, 2025 04:17
@mkannwischer mkannwischer force-pushed the slothy-ntt2 branch 2 times, most recently from 1d13c36 to 9ea0b14 November 20, 2025 10:19

@hanno-becker hanno-becker left a comment

LGTM. I also tested this locally on my M1 and it worked like a charm.

@hanno-becker

Took the liberty to resolve the rebase conflict in autogen (trivial).

This commit adds the optimized backend dev/aarch64_opt. For now this backend
only differs from the clean backend in the NTT which is superoptimized using
SLOTHY for the Neoverse N1. For all other files it's a simple copy of the clean
backend. A Makefile is added that performs the optimization.
CI is adjusted to test both the clean and the opt backend.

The first loop of the NTT can be optimized in one go. The second loop is
too large for that, so we use the split heuristic.

I have experimented with the Cortex-A55 model as well. That results in
significantly faster code on A55, but causes a noticeable slowdown,
especially on A72 (see performance results in the pull request).
A72 performance seems more important than A55 performance.

I have experimented with applying some other optimizations (from the SLOTHY
paper):
 - Using st4 instead of the manual transposition
 - Using scalar loads instead of vector loads
While those result in much better performance on Cortex-A55, they slow down
code on other platforms (see the pull request for details).

The autogen script is extended to allow running the optimization through the
--slothy flag.

Signed-off-by: Matthias J. Kannwischer <[email protected]>
@mkannwischer mkannwischer merged commit c1d9522 into main Nov 21, 2025
278 checks passed
@mkannwischer mkannwischer deleted the slothy-ntt2 branch November 21, 2025 08:08