Add std.math.egcd (Extended GCD). Micro-optimize GCD. #25533
base: master
Conversation
Force-pushed b8e054b to fffe0c7
For cryptographic purposes, the Bernstein–Yang algorithm is a better choice than EGCD. The Zig standard library uses it for field inversion in the NIST curves.
The egcd is not fully fleshed out yet; for example, there are still possibilities for overflow errors. Thanks for your comment, @jedisct1. By the way, regarding finding inverses with a cryptographic algorithm, there is also (imo) some improvement to be done in the standard library. A lot of code looks like it's been blindly ported from some other language, but Zig has capabilities that let you express it more elegantly, for example inline for loops. I am not sure how welcome it is to let a draft develop into a bigger PR, but I wouldn't mind adding cryptographic counterparts of the gcd algorithms to the standard library too, which could be used uniformly by all crypto functions.
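To illustrate what I mean by that, a toy sketch (hypothetical, not stdlib code) of how an inline for unrolls a comptime-known square-and-multiply ladder, the kind of thing Fermat-style field inversion does:

const std = @import("std");

// Toy sketch: `inline for` unrolls the loop at compile time when the
// sequence is comptime-known, e.g. a fixed exponentiation ladder.
fn powModUnrolled(comptime exp_bits: []const u1, base: u64, m: u64) u64 {
    var acc: u64 = 1 % m;
    inline for (exp_bits) |bit| {
        acc = @intCast((@as(u128, acc) * acc) % m); // square
        if (bit == 1) acc = @intCast((@as(u128, acc) * base) % m); // multiply
    }
    return acc;
}

test "powModUnrolled: 3^5 mod 7 == 5" {
    // 5 = 0b101, bits given MSB-first.
    try std.testing.expectEqual(@as(u64, 5), powModUnrolled(&.{ 1, 0, 1 }, 3, 7));
}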
A bit insulting, but generally, that's not the case. In fact, both libsodium and Go benefited from some of the tricks that were originally implemented in the Zig standard library. Regarding inversions: when constant-time behavior is required, they currently use either addition chains (for "nice" primes) or the Bernstein–Yang inversion implemented in native Zig code generated by fiat-crypto, which ensures correctness. So please don't change that. That said, a generic EEA implementation is still useful for other scenarios; thanks for adding it!
I don't think I was, and I am sorry if I offended you in any way, but taking a quick look at the code reveals this:

// Autogenerated: 'src/ExtractionOCaml/word_by_word_montgomery' --lang Zig --internal-static --public-function-case camelCase --private-function-case camelCase --public-type-case UpperCamelCase --private-type-case UpperCamelCase --no-prefix-fiat --package-name p384 '' 64 '2^384 - 2^128 - 2^96 + 2^32 - 1' mul square add sub opp from_montgomery to_montgomery nonzero selectznz to_bytes from_bytes one msat divstep divstep_precomp

That being said, I cannot claim that I am able to write something better than what already exists, so maybe this is the only way to get the compiler to do what it is supposed to do?
As mentioned above, this code is generated by fiat-crypto, and it is done so in order to ensure correctness. I'm not really sure what other language this could have been copied from, when it's generated.
@Rexicon226, very kind of you to join this conversation and confirm the very thing I am saying: there has been code that's been "blindly copied over from some other language".
Auto-generation has its perks: you can quickly produce many variants of basically the same thing, and you remove the human element of errors. What you ultimately sacrifice, however, is readability, including clarity of intent, which helps users of the library understand what is implemented and why. It took me a while, but I found the original formulation, or template, for this auto-generated code in the standard library: https://github.com/mit-plv/fiat-crypto/blob/master/src/Arithmetic/BYInv.v

Now here are some of the key issues I have as a simple user of the standard library. These are my opinions and might be shared by others or not; nonetheless I want to voice them. If healthy communication is not welcome, I can respect that decision of yours.
Having written out my opinions and complaints more elaborately, the only logical next step for me is to provide some suggestions regarding each point:
In conclusion, I can only reiterate: do not take my critique as an insult. I am trying to be constructive, or at least to voice a (perceived) issue in a neutral and professional manner. I never expected to have to write such a long explanation for a very simple (admittedly spicy) remark, but in my opinion it was justified, as you might now agree. I am happy to hear your perspectives on this matter and hope we can stay respectful to each other.
This is true. The use case we need for this code is to be correct, not readable.
The problem here is the lack of constant-time-ness verification on the Zig AST, rather than any potential input sources for the Zig code being generated. This will be addressed in the future with #1776. It is not smarter to inline the assembler code because this
We require it to be correct. Not fast (fiat-crypto generates some pretty bad code all things considered), nor readable (that already exists in the form of
The same way you trust other cryptographic code in the standard library written and/or accepted by @jedisct1. If the level of trust you require is greater (for me personally, at work, we have some algorithms which we've written ourselves, with further testing and verification of correctness), then the Zig standard library isn't for you. And that's totally ok. I would also like to add that the underlying implementation of a finite field isn't really something that can be a source of vulnerability. Perhaps it could be written in a way that causes a program to panic, but if it were implemented wrong in a malicious way, it would just cause most algorithms written on top to not work. But that is a subjective argument, so take it as you will. These generated chunks of code are also pretty much never updated, so a supply-chain-based attack is unlikely to happen.
This defeats the entire purpose of using a formally verified generator for the implementation. And as mentioned before, we already have something like it in
I personally wouldn't be against adding this link in the doc comment above (zig/lib/std/crypto/pcurves/common.zig, line 199 in 958faa7),
but keep in mind that I am not a core member, so my opinions are my own :).
I guess this could make sense if we were actually re-generating these files at some point in time; however, its purpose is defeated by just the fact that the code is looked at and reviewed by Frank. If you are this concerned about correctness, I think the finite field implementations are the last thing you should be worrying about. Understand that the Zig stdlib cryptography code is not audited. If your use cases require the utmost correctness, I do not believe it is the correct library for you to be using.
OpenSSL has experienced multiple severe carry propagation bugs in its finite field implementations (e.g. CVE-2017-3732, CVE-2017-3736 and CVE-2021-4160). Bugs can happen everywhere, including in hardware, but nowadays tools that guarantee correctness of the original source code exist, and Zig has the privilege of being a supported target of one of these tools. Not taking advantage of that would be a step back into the past. Using them gives us well-trusted implementations, one common representation, and one less thing to worry about. fiat-crypto can also generate platform-specific assembly code that retains the same public API but helps a lot to ensure that the code runs in constant time. This is something we can eventually take advantage of as well. But maybe we should get back to the PR. EGCD in
Ran a quick GCD benchmark: https://gist.github.com/jedisct1/0ddfe5484c6ea273c32efd73b40924c0
The improvement seems to be the usage of the exact shift builtins:

diff --git a/lib/std/math/gcd.zig b/lib/std/math/gcd.zig
index 16ca7846f19a..minimal_exact_shifts 100644
--- a/lib/std/math/gcd.zig
+++ b/lib/std/math/gcd.zig
@@ -26,16 +26,16 @@ pub fn gcd(a: anytype, b: anytype) @TypeOf(a, b) {
     const xz = @ctz(x);
     const yz = @ctz(y);
     const shift = @min(xz, yz);
-    x >>= @intCast(xz);
-    y >>= @intCast(yz);
+    x = @shrExact(x, @intCast(xz));
+    y = @shrExact(y, @intCast(yz));
     var diff = y -% x;
     while (diff != 0) : (diff = y -% x) {
         const zeros = @ctz(diff);
         if (x > y) diff = -%diff;
         y = @min(x, y);
-        x = diff >> @intCast(zeros);
+        x = @shrExact(diff, @intCast(zeros));
     }
-    return y << @intCast(shift);
+    return @shlExact(y, @intCast(shift));
 }
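For context on what the exact variants buy: @shrExact asserts that no set bits are shifted out, turning knowledge we already have (we always shift by the trailing-zero count) into something the optimizer may rely on. A minimal check of that property (my own sketch, not PR code):

const std = @import("std");

test "exact shifts encode the no-lost-bits guarantee" {
    const x: u64 = 0b1011_0000;
    const tz = @ctz(x); // 4 trailing zero bits
    // Shifting right by @ctz(x) can never discard a set bit, so the exact
    // variant is valid; in safe builds it would trap if that ever broke.
    try std.testing.expectEqual(@as(u64, 0b1011), @shrExact(x, @intCast(tz)));
}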
@jedisct1, I don't think it is: Compiler Explorer shows that changing the shifts into builtin calls for the version currently on master does not affect the generated machine code. I am in favour of changing the shifts into builtin calls regardless of whether that affects the output: they simply better encode the knowledge we have.
Yep. Doesn't hurt. |
I am not happy with it right now, because it might overflow, hence it being only a draft. Still, thank you all for taking the time.
I have spent a lot of time hand-optimizing the code for 64-bit words on x86 (on my Zen 2 architecture). It is crucial to calculate every step in the exact order I have written down to achieve the optimal assembler output, which I have written by hand:

.text
.globl gcd_hand_optimized
.type gcd_hand_optimized, @function
gcd_hand_optimized:
test %rdi, %rdi
je .EARLY_RET_Y # if zero, early return
test %rsi, %rsi
je .EARLY_RET_X # if zero, early return
tzcntq %rsi, %rcx # remove tz from second input (ctz)
tzcntq %rdi, %rdx # remove tz from first input (ctz)
shrxq %rdx, %rdi, %rdi # remove tz from first input (shift)
shrxq %rcx, %rsi, %rsi # remove tz from second input (shift)
cmpq %rcx, %rdx # save minimum of shifts (compare)
cmovbq %rdx, %rcx # save minimum of shifts (move)
movq %rsi, %r8 # create copy of y
subq %rdi, %r8 # calculate y - x
.LOOP:
movq %rdi, %r9 # create copy of x
tzcntq %r8, %rdx # saving zero count in dx
subq %rsi, %rdi # subtract y from x
cmovbq %r8, %rdi # move y - x to x if carry was set
cmovbq %r9, %rsi # replace y with x if carry was set
shrxq %rdx, %rdi, %rdi # remove tz from first input (shift)
movq %rsi, %r8 # create copy of y
subq %rdi, %r8 # calculate y - x
jne .LOOP
shlxq %rcx, %rsi, %rax # return y with appropriate shift
ret
.EARLY_RET_X:
movq %rdi, %rax
ret
.EARLY_RET_Y:
movq %rsi, %rax
ret
.size gcd_hand_optimized, .-gcd_hand_optimized
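Independent of the exact instruction sequence, a cheap way to build confidence in the binary version is a differential test against the naive Euclidean algorithm. A sketch (assumes a recent Zig where the PRNG lives at std.Random, and nonzero inputs):

const std = @import("std");

// Naive Euclidean reference, for differential testing only.
fn gcdRef(a0: u64, b0: u64) u64 {
    var a = a0;
    var b = b0;
    while (b != 0) {
        const t = a % b;
        a = b;
        b = t;
    }
    return a;
}

test "std.math.gcd agrees with the Euclidean reference" {
    var prng = std.Random.DefaultPrng.init(42);
    const rand = prng.random();
    for (0..10_000) |_| {
        const a = rand.intRangeAtMost(u64, 1, std.math.maxInt(u64));
        const b = rand.intRangeAtMost(u64, 1, std.math.maxInt(u64));
        try std.testing.expectEqual(gcdRef(a, b), std.math.gcd(a, b));
    }
}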
Binary appears to be much faster. I use this naive implementation:

const std = @import("std");
/// (n / 2) mod m
fn halveMod(comptime T: type, n: T, m: T) T {
if (n & 1 == 0) {
return n >> 1;
} else {
const WideT = std.meta.Int(.unsigned, @bitSizeOf(T) + 1);
const n_wide: WideT = n;
const m_wide: WideT = m;
const result = (n_wide + m_wide) >> 1;
return @intCast(result);
}
}
/// (a - b) mod m
fn subMod(comptime T: type, a: T, b: T, m: T) T {
if (a >= b) {
return a - b;
} else {
return m - (b - a);
}
}
/// Returns the modular inverse of y modulo m, or 0 if it does not exist.
/// Requires m to be an odd integer >= 3, and 0 <= y < m
pub fn modInverse(comptime T: type, y: T, m: T) T {
std.debug.assert(m >= 3);
std.debug.assert(m & 1 == 1); // m must be odd
std.debug.assert(y < m);
var a: T = y;
var u: T = 1;
var b: T = m;
var v: T = 0;
while (a != 0) {
if (a & 1 == 0) {
a = a >> 1;
u = halveMod(T, u, m);
} else {
if (a < b) {
// Swap (a, u, b, v) ← (b, v, a, u)
const temp_a = a;
const temp_u = u;
a = b;
u = v;
b = temp_a;
v = temp_u;
}
// Now a >= b and both are odd
// a ← (a − b)/2
a = (a - b) >> 1;
// u ← (u − v)/2 mod m
u = halveMod(T, subMod(T, u, v, m), m);
}
}
if (b != 1) return 0; // Inverse does not exist
return v;
}

Needs to use signed integers if we really want both Bézout coefficients. Not sure that these micro-optimizations will make any practical difference, though.
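A quick sanity check for the snippet above, assuming modInverse is in scope:

test "modInverse sanity check" {
    // 3 * 5 == 15 ≡ 1 (mod 7), so 5 is the inverse of 3 mod 7.
    try std.testing.expectEqual(@as(u64, 5), modInverse(u64, 3, 7));
    // gcd(3, 9) != 1, so no inverse exists and 0 is returned.
    try std.testing.expectEqual(@as(u64, 0), modInverse(u64, 3, 9));
}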
Force-pushed 6da3f5c to 7a3d838
y = @shlExact(y, @intCast(shift));
s = @shlExact(s, @intCast(shift));
// Using integer widening is only a temporary solution.
Is this still a work in progress?
One can avoid integer widening by implementing a custom function for terms of the form (a - b * c) / d. Something like this:
const std = @import("std");
const sign = std.math.sign; // assuming std.math.sign here

inline fn mul_div_thing(a: anytype, b: anytype, c: anytype, d: anytype) @TypeOf(b, c, d) {
@setEvalBranchQuota(10_000_000);
const S = @TypeOf(b, c, d);
const sign_ = -1 * sign(b) * sign(c) * sign(d);
var x = @abs(b);
const y = @abs(c);
const z = @abs(d);
var acc = a;
var counter: S = 0;
if (x == 0) return sign_;
while (x > 0) : (x -= 1) {
acc += y;
while (acc >= z) {
acc -= z;
counter += 1;
}
}
return counter * sign_;
}

As you can see, this solution is highly inefficient because I am not doing long multiplication / division. If I implemented long multiplication / division, this algorithm would become much more complicated, so I didn't even bother, really. Still, I want to encourage looking for something that doesn't require widening the integer type; that's why I left the comment.
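For contrast, the integer-widening approach that the temporary-solution comment refers to could look roughly like this (a sketch; the mulDivWidening name and the fixed i64 types are mine, and it assumes the final quotient fits back into i64):

const std = @import("std");

// Compute (a - b * c) / d with a double-width intermediate so the
// product cannot overflow.
fn mulDivWidening(a: i64, b: i64, c: i64, d: i64) i64 {
    const prod: i128 = std.math.mulWide(i64, b, c);
    const num = @as(i128, a) - prod;
    return @intCast(@divTrunc(num, d));
}

test "mulDivWidening: (2 - 3*4) / 2 == -5" {
    try std.testing.expectEqual(@as(i64, -5), mulDivWidening(2, 3, 4, 2));
}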
const a: i8 = 127;
const b: i8 = 126;
const c = egcd(a, b);

=> likely to crash with an integer overflow because
Even the stdlib tests don't pass:
Force-pushed 7000c52 to 3c55282
My mistake for not running the tests locally first. I fixed the issue and added another test.
The GCD changes are good and should be merged. However, the EGCD code doesn't seem to be quite ready.

The returned Bézout coefficients are off by a factor of

Example:

The code also crashes when the minimum value of a type is used for both operands. Also, while the GCD properly works with all the regular scalar types supported by the language, the EGCD doesn't currently compile with types > i32767. Consistency between both would be good.

I'd recommend closing this PR and opening a new one focused solely on the GCD changes, which are solid and can be merged. EGCD can be implemented and tested separately, then proposed in a separate PR. The first implementation can be simple and conventional; optimizations can always be added later.
Thanks for reviewing again. I didn't plan to see this PR merged right away, so feel free to review whenever you like. The algorithm for the egcd is correct; I was simply too stupid to fix the little artifacts I left when I translated the code and played around a little bit. If you want, I can convert this PR into a draft again. The power-of-2 bug was caused by literally one line, which I just fixed.
Force-pushed 97b4cd6 to fc36653
Fixed the issue related to the min value. I've removed the helper function because interacting with its result types was a bit bothersome.
With

And all unsigned types return an error.

Benchmarks also show that, depending on the size, the classic iterative method can be up to 2x faster. Modern CPUs have fast 64-bit division (which works well for sizes such as 64 and 512 bits), and the classic algorithm has fewer iterations, is simpler, and has more predictable memory access. So there's no one-size-fits-all method.

It is useful to have EGCD in std.math. Optimizations are great, and we definitely want them. However, especially if this is something that eventually gets used by crypto functions, correctness is critical. And a classic iterative implementation is already extremely fast.

So let's start with a classic implementation that works for all inputs: #25642

And then take the time to test optimizations before incrementally adding them for the cases where they are the most effective.
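For reference, the classic iterative implementation is only a few lines. A sketch (fixed i64 type for brevity; overflow on extreme inputs not addressed, and not the code from either PR):

const std = @import("std");

// Classic iterative extended Euclidean algorithm.
// Returns g, x, y with a*x + b*y == g.
fn egcdClassic(a: i64, b: i64) struct { g: i64, x: i64, y: i64 } {
    var old_r: i64 = a;
    var r: i64 = b;
    var old_s: i64 = 1;
    var s: i64 = 0;
    var old_t: i64 = 0;
    var t: i64 = 1;
    while (r != 0) {
        const q = @divTrunc(old_r, r);
        const tmp_r = old_r - q * r;
        old_r = r;
        r = tmp_r;
        const tmp_s = old_s - q * s;
        old_s = s;
        s = tmp_s;
        const tmp_t = old_t - q * t;
        old_t = t;
        t = tmp_t;
    }
    return .{ .g = old_r, .x = old_s, .y = old_t };
}

test "egcdClassic: 240 and 46" {
    const res = egcdClassic(240, 46);
    try std.testing.expectEqual(@as(i64, 2), res.g);
    try std.testing.expectEqual(res.g, 240 * res.x + 46 * res.y);
}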
@jedisct1 The recent commit fixed that issue regarding the divFloor operator; I simply switched to an explicit shift. All checks pass now, and I have double-checked this time that all the temporary results indeed do not overflow. Also, I don't know why you suddenly tried to re-introduce my scrapped version of the EGCD, because as I told you before: I think integer widening is bad, which was also the motivation behind my implementation of this binary version.
There is a reason for that: the egcd is defined for signed integers. If you want to calculate the egcd of your unsigned inputs, you can simply upcast them. The result type is always unsigned, the coefficients always signed. This is also a very handy feature, because now you don't have to add a single bit to the result type's width, and the coefficients are of the same width as the inputs (also nice when you calculate the inverse of a number in a field: no need to cast).
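Concretely, here is the bit this saves (my own illustration, not the PR code):

const std = @import("std");

test "unsigned gcd result avoids widening" {
    // gcd(minInt(i8), minInt(i8)) == 128, which fits u8 but not i8, so an
    // unsigned result type keeps the gcd at the input width.
    const m: i8 = std.math.minInt(i8);
    const g: u8 = @abs(m); // @abs of an i8 yields a u8
    try std.testing.expectEqual(@as(u8, 128), g);
}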
I am not sure about this claim, but if you can back it up, I can and will accept this result of yours. Again, I was never here to merge my code ASAP. I opened this PR to be transparent in my (slow) progress. I am busy with different things too, and this PR doesn't enjoy much priority right now. So it feels very odd if someone suddenly decides to create his own PR. I think, if adding an egcd function to the std library was really that important to you, you would've done it much earlier, so I am not sure what message you are trying to convey. Funnily enough, even with all the proving, you still ran into the same issue as me: it doesn't even pass the standard library tests. So let's not pretend that overlooking a couple of overflows in my draft is a reason to scrap my attempt completely.
The extended Euclidean algorithm isn't defined for signed integers in the sense of restricting the domain. It's defined over the integers ℤ, and in practice we almost always call it with non-negative (unsigned) inputs. Plenty of implementations (in GMP, Python, Rust, etc.) accept unsigned inputs and simply return signed coefficients. We might reasonably want egcd to work without forcing conversions to signed types.
I am not sure what point you're trying to make, but ℤ is the signed integers, or simply the integers. The egcd that I have replaced in this PR (used by ml_kem) was also written for signed integers. There is no problem in making the egcd work with unsigned integers too; I can take care of that. This would basically mean doing the up-casting inside the egcd function.

For anyone interested, here are some points that I still wonder about in my current implementation: ... it is still not quite where I want it yet. To compare, the fastest but not overflow-safe version of the EEA takes only 450 ms. So these are the two issues I am looking to solve next.

PS: Regarding performance, the version Frank Denis linked in his own PR needs
Fix range obscurity and compile error.
Revert change for converting comptime int.
Fix some comments in GCD.
Make ml_kem use lcm and egcd from std/math.
Fix name.
Add egcd function.
Don't destructure.
Use binary gcd and make overflow safe.
Force inlining, use ctz to reduce dependency in loop.
Avoid integer overflow for temporary value.
Add test against previous overflow capability.
More optimization friendly expression.
Fix egcd for even numbers.
Minvalue causes crash.
Remove helper function.
Fix casting issues.
Use shift instead division (to support i2) and avoid overflow of temp results.
Force-pushed 6c9ab98 to ba5f824
The EGCD is used for cryptographic purposes, for example to find the inverse of a number mod a prime.

I have implemented an iterative version of this algorithm.

The existing GCD has been micro-optimized to gain some extra speed (3-5% faster).

I purposefully implemented the EGCD with the Euclidean method instead of the binary GCD, because whenever you'd shift off n zeros inside the binary GCD, the coefficients would have needed to be multiplied by (2^-n) mod g (g being the gcd of the two inputs), or you'd have to do some other funky stuff that's costly for performance.
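To make that coefficient fix-up concrete, here is roughly what the halving step in a binary EGCD would have to do over the plain integers, maintaining a == u*x0 + v*y0 for the odd reduced inputs x0 and y0 (a sketch, not this PR's code):

// Binary-EGCD coefficient fix-up ("the funky stuff"): when a is halved,
// (u, v) must follow. Assumes x0 and y0 odd and a even, so the invariant
// a == u*x0 + v*y0 forces u and v to share parity.
fn halveStep(a: *u64, u: *i64, v: *i64, x0: i64, y0: i64) void {
    a.* >>= 1;
    if (u.* & 1 == 0 and v.* & 1 == 0) {
        u.* = @divExact(u.*, 2); // both even: halve directly
        v.* = @divExact(v.*, 2);
    } else {
        // Both odd: adding y0*x0 - x0*y0 == 0 to u*x0 + v*y0 changes
        // nothing, but makes both coefficients even before halving.
        u.* = @divExact(u.* + y0, 2);
        v.* = @divExact(v.* - x0, 2);
    }
}

Each of those halvings is the integer-domain analogue of the (2^-1 mod m) halveMod trick in the modular-inverse snippet earlier in this thread.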