Skip to content

Conversation

@byeongkeunahn
Copy link

@byeongkeunahn byeongkeunahn commented Aug 28, 2023

This commit implements number theoretic transform (NTT) for large integer multiplication (issue #169).

  • To simplify implementation the Schönhage–Strassen algorithm was not used. Instead, three distinct 64-bit primes were carefully chosen to enable NTT up to ~10^18 64bit integers, which allows multiplication up to ~5 x 10^17 64bit integers. Depending on the input length either two or three primes are used, with the latter only used when the inputs consist of at least 2^40 64bit integers. The convolution results modulo primes are merged using the Chinese Remainder Theorem.
  • To reduce padding and the number of cycles for NTT, multiple radices are used (radix-2, radix-3, radix-4, radix-5, radix-6). Radix-8 is desirable but actually slower, presumably due to register spill. Although manual SIMD coding may alleviate this issue, it was not used (i) for maximum portability and (ii) since 64bit SIMD multiply is not widely available even on x86/x64 platforms (AVX512). The prime moduli are carefully chosen to support these radices.
    • The NTT length is selected by exhaustively evaluating cost estimates for all allowed lengths within a factor of two.
  • Single-word (u64) Montgomery reduction is used for fast modular multiplication.
  • 32bit digits are supported by repacking the digits into u64, running the u64 algorithm, and converting back to u32. This results in 32bit builds being about 3-5x slower compared to 64bit builds, which, however, still is an improvement upon the existing algorithms.
  • Unbalanced multiplication is enabled when the cost estimates are favorable.
  • Based on experimentation, the following thresholds are chosen:
    • For u64 digits, switch to NTT if the shorter integer has at least 512 digits.
    • For u32 digits, switch to NTT if the shorter integer has at least 2,048 digits.
  • The three primes are as follows:
    • P1 = 10_237_243_632_176_332_801, Max NTT length = 2^24 * 3^20 * 5^2 = 1_462_463_376_025_190_400
    • P2 = 13_649_658_176_235_110_401, Max NTT length = 2^26 * 3^19 * 5^2 = 1_949_951_168_033_587_200
    • P3 = 14_259_017_916_245_606_401, Max NTT length = 2^22 * 3^21 * 5^2 = 1_096_847_532_018_892_800
    • P1 and P2 are used for the two-prime NTT, whereas the three-prime NTT uses all three.
  • The following MIT-licensed projects are used as reference:

On Ryzen 7 2700X, 64bit, it takes about 15ms for 2.7Mbits x 2.7Mbits and 170ms for 27Mbits x 27Mbits multiplication. This seems comparable to GMP 6.2.1.

benchmark-20230829

Previously 3 primes were used, which was suboptimal in terms of speed. Currently, the threshold for switching from 2 to 3 primes is 2^38.
Despite the simple implementation with obvious inefficiencies (e.g., not reusing the NTT of the shorter array), this leads to speed gains in multiple benchmarks, although there is a small regression in others.
@byeongkeunahn byeongkeunahn changed the title Implement number theroetic transform for large integer multiplication Implement number theoretic transform for large integer multiplication Aug 28, 2023
@HKalbasi
Copy link

Does the chart shows that current algorithm is faster than GMP? That's impressive.

@HKalbasi
Copy link

I ran benchmark fib_hex 100m from https://github.com/tczajka/bigint-benchmark-rs on this PR and it made num-bigint twice faster than malachite, slightly faster than gmp and 12x faster than itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants