Alternative or addition: instructions with overflow/carry flags #6

Closed
alexcrichton opened this issue Aug 14, 2024 · 42 comments · Fixed by #36

@alexcrichton
Collaborator

Perhaps the primary alternative to this proposal that would try to solve the same original problem would be to add instructions that manipulate/expose the overflow flag directly from native hardware. This is lower level than 128-bit operations themselves and the theory is that 128-bit operations could be built from these lower level instructions. The original proposal explicitly did not propose this due to complexity of optimizing these instructions and the ease of achieving performance on desired benchmarks with 128-bit operations. In the CG meeting however it was brought up that these lower-level primitives might be a better choice.

This issue is intended to be a discussion location for continuing to flesh out the flags-vs-128-bit-ops story. More thorough rationale will be needed in later phases to either explicitly use flag-based ops or not. This will likely be informed through experience in implementations in other engines in addition to performance numbers of relevant benchmarks.

@alexcrichton
Collaborator Author

I've opened #10 with some more discussion behind the background of this proposal. It notably outlines the possible shape of these instructions plus why they had bad performance in Wasmtime with the initial implementation.

@alexcrichton
Collaborator Author

One concern I might have with going exclusively with overflow/carry flags is that platforms like RISC-V don't have the same instructions as x86/aarch64 and they might be slower. For example implementing i64.add_with_carry_u on RISC-V to achieve 128-bit addition might end up being slower than what RISC-V already generates today. I'm not aware of other platforms missing these instructions, though, so RISC-V may be the only one affected and I additionally do not have a machine to benchmark on to confirm this hypothesis one way or another.
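
To make the concern concrete, here is a hedged sketch, in plain Rust rather than assembly and assuming carry_in is already 0 or 1, of the shape a flags-free lowering (as on RISC-V) ends up with: each carry gets recomputed with a comparison (sltu-style) instead of flowing through a flags register.

fn add_with_carry_no_flags(a: u64, b: u64, carry_in: u64) -> (u64, u64) {
    let s1 = a.wrapping_add(b);
    let c1 = (s1 < a) as u64;   // sltu-style carry detection
    let s2 = s1.wrapping_add(carry_in);
    let c2 = (s2 < s1) as u64;  // second carry, again via a compare
    (s2, c1 | c2)               // extra compares replace what adc does for free
}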

@waterlens

One concern I might have with going exclusively with overflow/carry flags is that platforms like RISC-V don't have the same instructions as x86/aarch64 and they might be slower. For example implementing i64.add_with_carry_u on RISC-V to achieve 128-bit addition might end up being slower than what RISC-V already generates today. I'm not aware of other platforms missing these instructions, though, so RISC-V may be the only one affected and I additionally do not have a machine to benchmark on to confirm this hypothesis one way or another.

It is more likely to be a problem with the design of RISC-V ISA itself, as suggested by this discussion. It could be a big sacrifice for WASM if we give up such common operations on other hardware platforms, just because RISC-V people currently don't care about this.

@alexcrichton
Collaborator Author

That's a good point! I should clarify that it's something I think is worth measuring, but I don't think it's a showstopper for an alternative such as this.


Orthogonally, I wanted to flesh out some more numbers on what I'm measuring locally. Using the benchmarks in https://github.com/alexcrichton/wasm-benchmark-i128 I measured the slowdown relative to native of wasm today, of this proposal as-is with 128-bit ops, and of this alternative of overflow/carry flags (plus i64.mul_wide_*) instead. Each column in the table is a different wasm binary benchmarked and each row is a benchmark. Lower is better as it means it's closer to native. This was collected with Wasmtime 25-ish on Linux x86_64.

Benchmark       Wasm Today   Wasm + 128-bit-arithmetic   Wasm + overflow/carry/mul_wide
blind-sig       617%         74%                         79%
lehmer          362%         12%                         -7%
mul-bignum      391%         30%                         111%
fib_10000       122%         15%                         100%
shl             93%          92%                         93%
shr             98%          98%                         95%
div-bignum      345%         160%                        234%
sort signed     63%          64%                         64%
sort unsigned   72%          72%                         69%

Summary:

  • blind-sig - both approaches are roughly equivalent as this is a multiplication-heavy benchmark
  • lehmer - this shows mostly that mul_wide can sometimes optimize better than mul128. I'll note this in #11 (Is mul128 going to be as fast as widening multiplication?).
  • mul-bignum / div-bignum - these benchmarks have more addition than previous ones and show that the overflow/carry approach is significantly slower than add128
  • fib_10000 - this shows add128 performing much better than overflow/carry
  • others - not affected

The most notable result is that add128 performs much better for addition than the overflow/carry operations do. This may be counterintuitive. I've outlined the reason in the explainer a bit, but to summarize here: the most general lowerings of overflow/carry ops require moving the carry flag out of the EFLAGS register and back into it. That in turn causes a significant slowdown if it's not optimized away, and Wasmtime/Cranelift do not have a means of optimizing it away. In the 128-bit op case the lowering more closely matches the original program semantics.

alexcrichton added a commit that referenced this issue Sep 5, 2024
Some recent [benchmarking] had a surprising result I wasn't trying to
dig for. Notably as summarized in #11 some more use cases of widening
multiplication being more optimal than 128-by-128 bit multiplication
have started to arise. This is coupled with local benchmarking confirming
that, on both x64 and aarch64, widening multiplication has more support in
LLVM for optimal lowerings and was easier to implement in Wasmtime than
128-by-128 bit multiplication once various optimizations were implemented.

In the end `i64.mul128`, which was primarily motivated by "feels
cleaner" and "should have the same performance" as widening
multiplication, does not appear to have the expected
performance/implementation tradeoff. Getting an as-performant
`i64.mul128` instruction relative to `i64.mul_wide_{s,u}` has required
more work than expected and so the balance of concerns has me now
tipping away from `i64.mul128`, despite it being "less clean" compared
to the add/sub opcodes proposed in this PR.

Closes #11

[benchmarking]: #6 (comment)
@alexcrichton
Collaborator Author

I've posted some more thoughts to record in the overview in #18

@waterlens

waterlens commented Sep 22, 2024

Basically, I agree with your opinion. Correct modeling of how the carry flag works is rare in compilers. There is a trade-off between less effort and optimal performance, and the instructions to be added cannot be semantically perfect, especially considering there aren't many people dedicated to the work. The current speedup is already considerable, so please go ahead and push this proposal upstream!

@sparker-arm

Hi @alexcrichton

most general lowerings of overflow/carry ops require moving the carry flag out of the EFLAGS register and back into it

I'm not an x86 person, so could you clarify how you are lowering these operations? Can't a 'wide' add be lowered to add and adc, without having to explicitly move data from flags? Or do you just mean this is happening at a micro-arch level?

I mainly ask because the mul_wide operations sound good, but I expected to see add/sub with the same signature.

@alexcrichton
Collaborator Author

Hello! And sure! I might also point towards the overview docs, which summarize the phase 2 presentation I last gave to the CG.

You're correct that i64.add128 gets lowered to add + adc (and adds + adc on aarch64). The problem I was describing was i64.add_overflow_s, a separate instruction, which produces two values: the addition result plus the overflow bit. That overflow bit, in the most general case, needs to be put in a general-purpose register, which is the "move out of EFLAGS". Later, if that's combined with i64.add_with_carry_s, which takes three inputs (e.g. lhs/rhs/overflow bit), then to use adc (adcs on aarch64) the overflow bit needs to be "put back in EFLAGS" somehow, since adc and adcs don't actually take three register input operands, only two, with the third implicitly in the flags register.
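
For reference, here is a minimal Rust model of the unsigned pair being described, with the carry as an ordinary 0-or-1 value, which is exactly what forces the move out of and back into EFLAGS in a general lowering (the signed variants and the precise interpretation of the carry operand are intentionally left out of this sketch):

fn i64_add_overflow_u(a: u64, b: u64) -> (u64, u64) {
    let (sum, carry) = a.overflowing_add(b);
    (sum, carry as u64) // the carry leaves the flags register as a value here
}

fn i64_add_with_carry_u(a: u64, b: u64, carry_in: u64) -> (u64, u64) {
    // carry_in is assumed to be 0 or 1; because it arrives as a value rather
    // than a flag, a straightforward lowering cannot hand it to adc directly.
    let (s1, c1) = a.overflowing_add(b);
    let (s2, c2) = s1.overflowing_add(carry_in);
    (s2, (c1 | c2) as u64)
}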

@sparker-arm

Ah yes, I understand now. Flags are a pain... :)

Thanks.

@jakobkummerow

While I'm strongly in favor of making wide arithmetic faster in Wasm, I continue to be unconvinced by the claim that "add128 is faster than add_with_overflow", and I believe that the experimental numbers that "prove" it only reflect prototype implementation deficiencies that could be ironed out (and wouldn't even be hard to iron out). Let me explain:

The "bignum" use case is, in general and regardless of specific library or language, an addition loop: given two vectors A, B of integers, add them element-by-element, propagating the carry, storing each result element in a result vector S (much like we all did by hand in elementary school). The integers will typically have register width, i.e. be uint64 on 64-bit hardware, because that's most efficient. The heavy lifting will be done by some loop, in pseudocode roughly:

carry = 0;
for each pair of items a=A[i], b=B[i] {
  [sum, next_carry] = add_snowman(a, b, carry)
  S[i] = sum
  carry = next_carry
}
S[A.length] = carry
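
For concreteness, the same loop as runnable Rust (a sketch only: add_snowman is modeled with u128 so it compiles on stable Rust, and the two inputs are assumed to have equal length):

fn add_snowman(a: u64, b: u64, carry: u64) -> (u64, u64) {
    let wide = a as u128 + b as u128 + carry as u128;
    (wide as u64, (wide >> 64) as u64) // (sum, next_carry)
}

fn bignum_add(a: &[u64], b: &[u64]) -> Vec<u64> {
    let mut s = Vec::with_capacity(a.len() + 1);
    let mut carry = 0u64;
    for (&x, &y) in a.iter().zip(b) {
        let (sum, next_carry) = add_snowman(x, y, carry);
        s.push(sum);
        carry = next_carry;
    }
    s.push(carry); // S[A.length] = carry
    s
}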

So, how can we represent add_snowman in Wasm? It must take three inputs (at a minimum: two in 64-bit range, one in 1-bit range), and produce two outputs: the low 64 bits of the addition, and the overflow (=carry) bit (which is guaranteed to be either 0 or 1, and could have any type large enough to represent these). Let's consider a few hypothetical possibilities:

A. add128 as currently proposed. We actually need two of them:

[temp_low, temp_high] = add128(a, 0, b, 0)  // The two '0' are the "high parts" of a and b.
[sum, next_carry] = add128(temp_low, temp_high, carry, 0)

For easier comparison with the following options, let's write this in one line and with abbreviated names:

[s, nc] = add128(add128(a, 0, b, 0), c, 0)

That works fine, but the nesting is unsightly, leading to an idea to streamline that:

B. add128_three taking three (lo, hi) pairs of i64 inputs and producing a (lo, hi) pair of i64 outputs:

[s, nc] = add128_three(a, 0, b, 0, c, 0)

That's more elegant, but why should we have to emit so many zero-constants? Let's streamline those too, leading to the next idea:

C. add64_three taking three i64 "low halves" and returning a (lo, hi) pair:

[s, nc] = add64_three(a, b, c)

D. add_with_carry: That's actually the same as (C), just with a different name, and a degree of freedom whether the type of the third summand should be i64 like the others, or something smaller -- doesn't really matter.

Notably, all four options should get compiled to the same machine instruction sequence. In fact, engines have the hardest job for (A): they need to special-case the fact that two add128 instructions are immediately following each other, and they need to special-case the fact that half of the inputs are constant zeros; only then can they emit the most efficient machine code sequence. In the fully generic case of arbitrary inputs, they'll have to emit a lot more code. Engines have it easiest with (C)/(D), where no pattern recognition is necessary and even simple baseline/single-pass compilers can do a good job (i.e. emit an add/adc pair internally).
Also notably, an engine that has implemented (A) can trivially map (C) onto that implementation by filling in the 0 constants for the high parts. That's why the claim that (D) is necessarily slower than (A) is absolutely unconvincing.

I think all options are implementable. I do think that (A) is significantly more complex, because it has to be prepared to handle the case where none of the inputs are constant 0, while also needing a bunch of pattern-matching to make the common case as fast as possible (i.e. as fast as (C)). And I also think that for the bignum use case, there is no way (A) would have a fundamental performance benefit over either (B) or (C).

Of course, if you pick addition of two 128-bit integers with ignored/truncated overflow into the 129th bit as the example you study, then (A) will be the best fit, because it does exactly that. But for the "bignum" use case (as exemplified in the "fib_10000" benchmark quoted above), there should be no performance difference between all four options -- if there is one, then something's wrong with the implementation(s) of toolchains or engines or both.

In summary, I see a pleasing symmetry here:
For multiplication, it's uncontentious that we want an instruction that does [a:i64, b:i64] -> [low:i64, high:i64]. And for addition, the simplest equivalent would be an instruction with the same signature. A more efficient (for the primary use case) alternative would be a 3-summand addition with wide output [a:i64, b:i64, c:i64] -> [low:i64, high:i64] because grouping two 2-summand additions like that makes it easier for simple engines to use the most-efficient machine instruction (adc) more often.

My pseudocode example above assumed for simplicity that both inputs have the same vector length. If they have unequal lengths, then a two-operand addition with wide output is useful too. So, my suggestion for maximum performance and simplicity would be to add the following six instructions:

i64.add_wide:   [i64, i64] ->      [i64, i64]  ;; a + b -> (low, high=carry)
i64.add3_wide:  [i64, i64, i64] -> [i64, i64]  ;; a + b + c -> (low, high=carry)
i64.sub_wide:   [i64, i64] ->      [i64, i64]  ;; a - b -> (difference, borrow)
i64.sub2_wide:  [i64, i64, i64] -> [i64, i64]  ;; a - b - c -> (difference, borrow)
i64.mul_wide_u: [i64, i64] ->      [i64, i64]  ;; a * b -> (low, high)
i64.mul_wide_s: [i64, i64] ->      [i64, i64]  ;; a * b -> (low, high)

(I didn't see any detailed discussion of sub128 in this repo. Having "wide input" semantics there is probably workable but weird, because you'd normally have to pass constant 1 as the high part, and then the result's high part is the inverse of the "borrow bit", i.e. it's 0 if the borrow is 1 and vice versa, and you'd have to flip it before being able to use it in the vector processing loop's next iteration. I think it's easier for everyone to just return the borrow bit as the second result.)
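
As a hedged reference for the discussion that follows, here is one reading of the intended semantics of four of the six instructions listed above, written as Rust over u64/i64 bit patterns (the sub variants are omitted to avoid guessing at the exact borrow convention):

fn add_wide(a: u64, b: u64) -> (u64, u64) {
    let w = a as u128 + b as u128;
    (w as u64, (w >> 64) as u64)
}

fn add3_wide(a: u64, b: u64, c: u64) -> (u64, u64) {
    let w = a as u128 + b as u128 + c as u128;
    (w as u64, (w >> 64) as u64) // high half can be 0, 1, or 2 for unconstrained inputs
}

fn mul_wide_u(a: u64, b: u64) -> (u64, u64) {
    let p = a as u128 * b as u128;
    (p as u64, (p >> 64) as u64)
}

fn mul_wide_s(a: i64, b: i64) -> (i64, i64) {
    let p = a as i128 * b as i128;
    (p as i64, (p >> 64) as i64)
}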

@alexcrichton
Collaborator Author

To clarify I'm not claiming that this proposal as-is is the fastest of all possible designs/use-cases. I've primarily been comparing against the idea of representing the overflow flag as a value and that's what's been causing such a slowdown. Your proposed instructions don't do that, however, so I think it's very much worth considering!

For reference the code that I've been benchmarking (with i64.add128) looks like this:

Native x64 compilation

full output on godbolt but the main loop is:

.LBB0_14:
		add     r8, qword ptr [rdi + 8*r14]
		adc     rbx, 0
		add     r8, qword ptr [rdx + 8*r14]
		adc     rbx, 0
		mov     qword ptr [rdi + 8*r14], r8
		lea     r11, [r14 + 2]
		xor     r8d, r8d
		add     rbx, qword ptr [rdi + 8*r14 + 8]
		setb    r8b
		add     rbx, qword ptr [rdx + 8*r14 + 8]
		adc     r8, 0
		mov     qword ptr [rdi + 8*r14 + 8], rbx
		mov     ebx, 0
		mov     r14, r11
		cmp     r10, r11
		jne     .LBB0_14
Wasm code generation

full output online but the main loop is:

        loop
        local.get       9
        i64.const       0
        local.get       0
        i64.load        0
        i64.const       0
        i64.add128
        local.set       5
        local.set       9
        local.get       9
        local.get       5
        local.get       1
        i64.load        0
        i64.const       0
        i64.add128
        local.set       5
        local.set       9
        local.get       0
        local.get       9
        i64.store       0
        local.get       5
        i64.const       0
        local.get       0
        i32.const       8
        i32.add
        local.tee       13
        i64.load        0
        i64.const       0
        i64.add128
        local.set       5
        local.set       9
        local.get       9
        local.get       5
        local.get       1
        i32.const       8
        i32.add
        i64.load        0
        i64.const       0
        i64.add128
        local.set       9
        local.set       5
        local.get       13
        local.get       5
        i64.store       0
        local.get       1
        i32.const       16
        i32.add
        local.set       1
        local.get       0
        i32.const       16
        i32.add
        local.set       0
        local.get       12
        local.get       10
        i32.const       2
        i32.add
        local.tee       10
        i32.ne
        br_if           0
        br              2
        end_loop
Compiled code on x64 with Wasmtime

Here I've just snipped the main loop, not the whole binary.

      b8:	44 8b eb             	mov    %ebx,%r13d
      bb:	4d 31 e4             	xor    %r12,%r12
      be:	4f 03 34 28          	add    (%r8,%r13,1),%r14
      c2:	49 83 d4 00          	adc    $0x0,%r12
      c6:	8b d0                	mov    %eax,%edx
      c8:	4d 03 34 10          	add    (%r8,%rdx,1),%r14
      cc:	49 83 d4 00          	adc    $0x0,%r12
      d0:	4f 89 34 28          	mov    %r14,(%r8,%r13,1)
      d4:	44 8d 6b 08          	lea    0x8(%rbx),%r13d
      d8:	4d 31 f6             	xor    %r14,%r14
      db:	4f 03 24 28          	add    (%r8,%r13,1),%r12
      df:	49 83 d6 00          	adc    $0x0,%r14
      e3:	8d 50 08             	lea    0x8(%rax),%edx
      e6:	4d 03 24 10          	add    (%r8,%rdx,1),%r12
      ea:	49 83 d6 00          	adc    $0x0,%r14
      ee:	4f 89 24 28          	mov    %r12,(%r8,%r13,1)
      f2:	83 c6 02             	add    $0x2,%esi
      f5:	4c 89 ca             	mov    %r9,%rdx
      f8:	81 e2 fe ff ff ff    	and    $0xfffffffe,%edx
      fe:	83 c3 10             	add    $0x10,%ebx
     101:	83 c0 10             	add    $0x10,%eax
     104:	39 f2                	cmp    %esi,%edx
     106:	0f 85 ac ff ff ff    	jne    b8 <wasm[0]::function[4]::add2+0xb8>

You're correct that the inner loop here is add128(add128(a, 0, b, 0), c, 0) and this is reflected in the instructions too. The loop here is also unrolled once by LLVM but overall the main structure is the same between native and wasm-compiled-to-native. The main difference is that native has a setb r8b for one of the combos where Wasmtime uses xor %r14,%r14 ... adc $0x0, %r14 for the same thing.

Basically I wanted to point out that at least from a purely technical point of view add128 as-is suffices for generating mostly optimal instructions here. There's still cleanups that can be done in Wasmtime naturally, and you're right that Wasmtime has special handling of the case where one side is zero-extended (e.g. upper bits are i64.const 0). For Wasmtime/Cranelift though this sort of instruction matching is the bread-and-butter of backends and isn't something I considered costly for Cranelift's implementation. I've been (perhaps naively) assuming other compilers also have well-established means of combining data-dependent instructions together.

For what you're proposing I want to confirm, are you thinking that the translation for i64.add3_wide would look something like this?

i64.add3_wide(a, b, c):
      xor  %ret_hi,%ret_hi
      mov  %b,%ret_lo
      add  %a,%ret_lo
      adc  %c,%ret_lo
      adc  $0,%ret_hi

If so that seems reasonable to me and it avoids the problem of dealing with flags-as-a-value where other drafts of this proposal have tried to interpret the c operand here as "one more if nonzero" or "zero if zero". Treating it as a full 64-bit value definitely simplifies the codegen and lowering. The problem with it though is that I don't know how to integrate this into language toolchains (aka LLVM).

One problem is that the i64.add3_wide operation, as far as I know, has no equivalent in any source language. This means that compilers would have to pattern-match some sequence of operations to generate such an instruction. LLVM may have the tools to do this, where it'd probably use llvm.uadd.with.overflow.* intrinsics and such to model this in IR. Later, when LLVM lowers to target-independent IR opcodes, there's still no match for what this instruction does, so there'd then have to be wasm-specific logic to pattern-match such an instruction and generate it. For me, personally, this is enough of a leap of faith that I'm not confident it would actually be generated. For example, the Rust source code linked above only uses u128; it's not using any intrinsics, and I'd want that to still optimize to i64.add3_wide.

In contrast, i64.add128 was surprisingly easy to integrate, even for someone like me who's unfamiliar with LLVM. Languages naturally all represent addition already, and while not every language has 128-bit integers, those that do typically support native addition on them. That means it's quite easy to plumb intent all the way through from the language frontend to the emitted wasm. It's true that compilers will need to pattern-match common cases where the upper bits are zero, but I'd argue that's already the case for many other instructions. For example, if a wasm compiler doesn't pattern-match constants in instructions at all, it's probably going to perform a good amount worse than compilers today.

So overall, again assuming I've understood the lowering of i64.add3_wide correctly, I'd personally at least still need convincing on the lower-this-to-wasm side of things. I won't claim that i64.add128 is the end-all-be-all by any means but from what I've seen it represents a nice balance between high performance and ease of implementation of producers/consumers alike.

As a final note, I ideally don't want to over-rotate on bignum arithmetic either. Lowering basic 128-bit addition to an optimal sequence is naturally easy with i64.add128, but doing so with i64.add_wide and i64.add3_wide is not obvious to me. I'm not sure how, using those two opcodes (or some combination), it could compile to a literal add/adc pair.


As a side note:

I didn't see any detailed discussion of sub128 in this repo

I've kept sub around on the theory that it's probably best to keep subtraction on par with addition, so whatever's there for addition might as well be added for subtraction too. Beyond that I have not thought too deeply about subtraction specifically.

@jakobkummerow

from a purely technical point of view add128 as-is suffices for generating mostly optimal instructions

Yes, with enough effort all four options I described lead to the same generated code.

are you thinking that the translation for i64.add3_wide would look something like this?

Yes, modulo perhaps replacing xor %ret_hi,ret_hi; adc $0,%ret_hi with setb %ret_hi, as you point out yourself.
And the 2-operand i64.add_wide would basically just skip the adc %c,%ret_lo part.

compilers would have to pattern-match some sequence of operations to generate such an instruction. LLVM may have the tools to do this

Yes. The pattern they'd match is add128(add128(a, 0, b, 0), c, 0), or phrased differently "a + b + c where a, b, c are all u128s created by zero-extending u64s". I'm not an LLVM engineer and don't know how to express it there (or at which point in the pipeline to apply the matching), but since we know that LLVM can go from that u128+u128+u128 addition to the final machine code we want, it must already have this matching implemented somewhere.

In other words, the adc function in the "Native x64 compilation" godbolt link in your last message is almost exactly the i64.add3_wide instruction, except it overloads its acc parameters for 3 (!) purposes: u64 input, u64 output, and u128 temp value. If you rephrased it as:

#[inline]
fn adc(a: BigDigit, b: BigDigit, c: BigDigit, acc: &mut BigDigit) -> BigDigit {
    let mut tmp = c as DoubleBigDigit;
    tmp += a as DoubleBigDigit;
    tmp += b as DoubleBigDigit;
    let lo = tmp as BigDigit;
    *acc = (tmp >> BITS) as BigDigit;
    lo
}

then it would more obviously equal the i64.add3_wide instruction. (Please forgive any syntactic mistakes, I don't speak Rust.)
This also illustrates why i64.add3_wide is useful: when you don't have it (e.g. in Rust), you're going to implement a helper function that does what it does! (We have an equivalent helper function in V8, using Clang's nonstandard-C++ __uint128_t instead of Rust's u128.)

a nice balance between high performance and ease of implementation of producers/consumers alike.

Generally, Wasm tries to do as much work as possible as early as possible, i.e. it prefers doing things in toolchains rather than in engines (when possible). There's an explicit goal that it should be possible to implement simple engines that still perform well. As an alternative phrasing, it should be possible to implement engines with high peak performance and low latency -- because most optimizations are done ahead of time.

I also ideally don't want to over-rotate on bignum arithmetic

I do think that bignum is the primary use case. In fact, while I'm sure it exists somewhere, I don't recall ever encountering an "actual i128" use case myself. But that said:

Lowering an optimal sequence of basic 128-bit addition [...] with i64.add_wide and i64.add3_wide

is very straightforward on the producer side:

function add128(a_lo:i64, a_hi:i64, b_lo:i64, b_hi:i64) -> [lo:i64, hi:i64] {
  [res_lo, tmp_hi] = i64.add_wide(a_lo, b_lo)
  [res_hi, discard] = i64.add3_wide(a_hi, b_hi, tmp_hi)
  return [res_lo, res_hi]
}

That's even pretty concise in actual wat syntax:

(func $add128 (param i64 i64 i64 i64) (result i64 i64)
  local.get 0
  local.get 2
  i64.add_wide   ;; value stack: res_lo, tmp_hi
  local.get 1
  local.get 3    ;; value stack: res_lo, tmp_hi, a_hi, b_hi
  i64.add3_wide  ;; value stack: res_lo, res_hi, discard
  drop
)

A baseline/single-pass compiler clearly won't emit the best possible machine code for this, but that's okay (firstly because baseline compilers are almost never optimal, and secondly because this use case is likely rare).
Teaching an optimizing compiler to emit only an add/adc pair requires a rule "if one of the inputs to an add3_wide IR node is the "high" output of an add_wide IR node, and the latter is scheduled right before the former, then the former can emit an adc instruction and nothing else". I haven't tried to implement that yet, but it sounds feasible.

Come to think of it, this could even be done peephole-style very early on, possibly even in the decoder or its interface to the baseline compiler (in V8 we already do similar tricks there to optimize compare+if sequences): when encountering an i64.add_wide, don't "commit" it right away, but stash it away in a 1-element cache of "tentative instructions", and when then seeing an i64.add3_wide as the next instruction (after the local.gets, which don't really count because they only affect compiler state, not actual emitted code), replace both of them with an internal add128 (with custom-tailored codegen).
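
A rough sketch of that peephole, using a toy instruction enum and assumed names (this is not V8's actual decoder interface): an add_wide is stashed rather than committed, and if the next code-emitting instruction is an add3_wide the pair becomes one fused internal add128; local.get is treated as non-emitting, as described above.

#[derive(Clone, Copy)]
enum WasmOp { AddWide, Add3Wide, LocalGet(u32), Other }

enum Internal { AddWide, Add3Wide, FusedAdd128, LocalGet(u32), Other }

fn fuse(ops: &[WasmOp]) -> Vec<Internal> {
    let mut out = Vec::new();
    let mut pending_add_wide = false; // the 1-element cache of "tentative instructions"
    for &op in ops {
        match op {
            // local.get only affects compiler state, so it does not flush the cache.
            WasmOp::LocalGet(i) => out.push(Internal::LocalGet(i)),
            WasmOp::AddWide => {
                if pending_add_wide {
                    out.push(Internal::AddWide); // commit the previous one unfused
                }
                pending_add_wide = true;
            }
            WasmOp::Add3Wide if pending_add_wide => {
                out.push(Internal::FusedAdd128); // custom-tailored codegen target
                pending_add_wide = false;
            }
            WasmOp::Add3Wide => out.push(Internal::Add3Wide),
            WasmOp::Other => {
                if pending_add_wide {
                    out.push(Internal::AddWide);
                    pending_add_wide = false;
                }
                out.push(Internal::Other);
            }
        }
    }
    if pending_add_wide {
        out.push(Internal::AddWide); // flush anything left in the cache
    }
    out
}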

That said, I also wouldn't mind having both (add_wide and add128) as explicit instructions in Wasm, if full-width i128 computations are assumed to be sufficiently common to motivate this.

@alexcrichton
Collaborator Author

We have an equivalent helper function in V8

Do you have a source link on-hand for that? I'd like to add it to the repertoire of "things to study". Or do you mean V8 has the equivalent of fn adc(a: BigDigit, b: BigDigit, c: BigDigit, acc: &mut BigDigit) -> BigDigit? (in which case no link is necessary)

a nice balance between high performance and ease of implementation of producers/consumers alike.

Generally, Wasm tries to do as much work as possible as early as possible

Oh, I'm no stranger to this. I'm talking from a practical "this is all working today" point of view. The work required of Cranelift to pattern-match here is objectively very small. I am (again, possibly naively) assuming that the work required to have a few extra patterns in other compilers is also going to be quite small. I'm not proposing we require LICM for fast wasm compilers, just a pattern match or two.

Teaching an optimizing compiler to emit only an add/adc pair requires a rule "if one of the inputs to an add3_wide IR node is the "high" output of an add_wide IR node, and the latter is scheduled right before the former, then the former can emit an adc instruction and nothing else".

...

Come to think of it, this could even be done peephole-style very early on ...

This section of the overview, the slides from the February CG meeting, and these slides from the CG meeting last October outline why I believe the optimization you describe here is infeasible. Empirically, in Cranelift this optimization is not possible without significant refactoring/restructuring, and we've concluded that such refactoring is not worth the value of optimizing 128-bit addition.

You're definitely correct though in that other compilers might find such a transformation easy to implement. I mostly think that if performance relied on such an optimization happening we would find such a proposal not-useful in Wasmtime/Cranelift.

That said, I also wouldn't mind having both (add_wide and add128) as explicit instructions in Wasm, if full-width i128 computations are assumed to be sufficiently common to motivate this.

Personally I would also have no issue with this. It's something I've wondered about as well: if we admit that the lowest-level primitive of adc is too difficult to expose in wasm, then i64.add128 is not sufficient for easily capturing all possible use cases of adc, so other instructions might be necessary. In that sense I think i64.add_wide{,_3}, while definitely related to i64.add128, could still help serve other use cases, as you've shown, where 128-bit arithmetic is not necessarily the best primitive.

@jakobkummerow

I'd call it an exact equivalent of that fn adc, yes: https://source.chromium.org/chromium/chromium/src/+/main:v8/src/bigint/digit-arithmetic.h;l=37?q=digit_add3
And it's used in a vector-iterating loop like so: https://source.chromium.org/chromium/chromium/src/+/main:v8/src/bigint/vector-arithmetic.cc;l=48?q=digit_add3 (that example also shows how the 2-operand version is useful once the iteration proceeds beyond the length of the shorter summand, propagating the carry through the longer summand)
I haven't checked just now, but Clang should compile it to pretty much exactly the same machine code as your Rust example. So it probably isn't interesting as an additional thing to study, rather as a datapoint confirming relevance of the Rust example.

To save you the trouble of looking up the typedefs: digit_t is what's BigDigit in Rust, twodigit_t is DoubleBigDigit, Digits is "read-only vector of digit_t", RWDigits is when it may be written into.
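
For anyone cross-referencing the two snippets, a hedged mapping of those typedefs onto the Rust names used earlier in this thread (digits assumed to be 64-bit, as on the targets discussed here):

type BigDigit = u64;        // V8: digit_t
type DoubleBigDigit = u128; // V8: twodigit_t
const BITS: u32 = 64;       // the shift used to split a DoubleBigDigit into digits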

@alexcrichton
Collaborator Author

Thanks! I've created a dedicated issue for evaluating these instructions. It'll take me some time to get around to doing this in LLVM.

@jakobkummerow

It turns out there is actually value to an i1 type, regardless of the specific instructions this proposal standardizes.

As discussed above, the loop body of a bignum addition needs to sum up three values in each iteration. If these three have unconstrained i64 ranges (as is the case both with the add128(add128(a, 0, b, 0), c, 0) phrasing and the add3_wide(a, b, c) alternative), then there can be 2 carry bits. (Illustration with one byte instead of eight: 0xFF + 0xFF + 0xFF = 0x2FD, note 0x2....) Only if (at least) one of them is guaranteed to be constrained to i1 range, is the result then again guaranteed to have at most 1 as its combined carry. (Illustration: 0xFF + 0xFF + 1 = 0x1FF.)
This is relevant for code generation: only if we know that one of A, B, C is an i1 can we emit the sequence add A, B; adc A, C; setb HI. Otherwise it needs to be something like:

add A, B
setb HI
add A, C
adc HI, 0

So we could type the add3_wide instruction as [i1, i64, i64] -> [i64, i1], and doing so would make engines' lives easier. The add128 instruction would need two versions to support the same annotation. (Back in ancient times, Wasm used to claim that it valued the ability for simple engines to generate optimal code; I'm not sure whether that's still the case.)
The alternative is that we don't introduce the notion of i1 on the spec level, which would mean that for optimal code generation, engines internally need to run an analysis to determine which values are actually constrained to i1 range despite being typed as i64 in the wire bytes. This analysis needs to consider at least constants, loop phis, and whichever wide arithmetic instructions we'll end up standardizing here; so it's not a simple linear pass, nor simple inspection of the instruction used as input; it needs to take loops into account.
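
The one-byte illustration from earlier in this comment, written as executable Rust (u8-sized summands widened into a u16 stand in for the u64/u128 case):

fn carry_width_demo() {
    // Three unconstrained summands: the carry out needs 2 bits.
    assert_eq!(0xFFu16 + 0xFF + 0xFF, 0x2FD);
    // If one summand is known to be 0 or 1, the carry fits in a single bit.
    assert_eq!(0xFFu16 + 0xFF + 1, 0x1FF);
}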

@alexcrichton
Collaborator Author

Oh, I don't disagree that i1 could be useful, but my personal thinking from the overview section is that (a) it's a pretty huge feature to add a type to wasm and this is a relatively niche use case, and (b) it still has the problem that much of the time you don't actually want the i1 value in a register but instead want to leave it in flags. Combined with the fact that i64.add128, at the time, closed 90%+ of the gap between wasm/native, it felt like it wasn't quite the right time to add i1.

I don't mean to say it's not useful though, just that the calculus of complexity/performance tradeoff is something where I don't think it's quite worth it at this time.

@jakobkummerow

You do want the i1 value I'm talking about in a register (or even spill slot) precisely when it's the thing that's passed from one loop iteration to the next. It's orthogonal to what add128 can do. When you have add128(add128(a, 0, b, 0), c, 0) in your bigint addition loop body, then you have that i1 value that's the high result of the outer add128 and needs to be carried to the next loop iteration, where it becomes one of the low inputs (a or b or c depending on how exactly you order your inputs there).
I tried to spell out the hypothetical second, i1-annotated variant of add128 here but realized that it's actually not straightforward how to express the relevant constraints in the input/output types of the two nested add128 instructions, because the i1 output depends on any of a, b, c being an i1 too, but since those are split across two instructions it's hard to formally phrase that. add3_wide has a much easier time there. (It's kind of a mirrored version of the observation that for "real" int128-without-carry additions, a single add128 is much easier to work with than an add_wide+add3_wide pair.)

I'm not saying we need to add i1 to the spec necessarily. I'm just pointing out that doing so would be beneficial for engines that want to emit optimal code without first having to run costly analysis passes.

@alexcrichton
Collaborator Author

Right yeah I definitely agree that engines could make use of i1, but I'm emphasizing that I've always considered it a non-starter due to the complexity of modifying the wasm type system for this proposal.

Although you've helped me crystallize another constraint I don't think I've necessarily written down:

.. would be beneficial for engines that want to emit optimal code without first having to run costly analysis passes.

I've been additionally thinking about this proposal from the perspective that engines should not need any sort of new analysis pass to make these instructions fast. My goal has been to integrate into existing engines easily with carefully selected instructions and still close the native/wasm performance gap significantly. For example, that's what ended up panning out in Wasmtime: just add128 and a handful of specialized lowering rules (which is quite normal for instructions), and the performance gap was closed.

Put another way, if ONLY add128 requires heroics in terms of analysis in V8 to similarly close the native/wasm perf gap then I would consider that a failure of this proposal. Additionally if ONLY add_wide + add_wide3 required heroics in Wasmtime to close the gap I'd also consider that a failure. (to be clear these are just examples, I'm not saying you're pushing one solution or another)

@jakobkummerow

In order to emit optimal code (which is a more ambitious goal than to "close the perf gap significantly"), i1 helps regardless of the specific Wasm instructions. In case of my prototype, first the pattern add128(add128(a, 0, b, 0), c, 0) is replaced with add_wide3(a, b, c), and then an i1 analysis is run on the IR graph. So if the Wasm module contained add_wide3 instead of add128, I expect the benefit of the i1 analysis to be identical. The initial translation wouldn't be needed then, but that's fairly easy to do. So, in fact, from my prototyping experience, supporting the entire set add128, add_wide, add_wide3 isn't significantly more work for an engine than supporting just a subset of them (those that aren't in the wire bytes you'll still want in your compiler IR for optimization purposes). Converting between them when the module producer didn't make the optimal choice can then be seen as optional; one direction is easy, the other is hard (but probably also less relevant).

My prototype of an i1 analysis is ~250 lines and took a day to implement. Does that count as "heroic"? It's certainly an analysis pass we didn't have or need before.

Now I can quantify its impact. Explanation of the columns:

  • add128: always compile add128 to the obvious machine code
  • add_wide3: replace add128(add128(a, 0, b, 0), c, 0) with add_wide3(a, b, c), i.e. optimize out the i64.const 0 high inputs
  • i1 analysis: same as add_wide3, and additionally run the i1 analysis, to allow better codegen for add_wide3 when applicable
Benchmark    Status quo   add128   add_wide3   i1 analysis
blind-sig    +339%        +57%     +55%        +54%
lehmer       +263%        +11%     +10%        +8%
mul-bignum   +171%        +12%     +11%        +10%
fib_10000    +105%        +20%     +17%        +10%
div-bignum   +79%         +7%      +5%         +5%

(Engine: V8 random daily snapshot shortly before 13.5 branch + early prototype implementation of this proposal,
Wasm modules: https://github.com/alexcrichton/wasm-benchmark-i128/releases/tag/2025-03-12,
Hardware: Ryzen 7950X,
Benchmarking methodology: compared to rustc 1.85.0 native, best of 3 runs, rounded to nearest integer, less is better)

Compared to the status quo, anything's a huge improvement. But if we want to aim for optimal code, the fib_10000 test shows quite clearly how impactful it can be to avoid one extra add machine instruction :-)

I think this table definitively puts to rest last year's claim that "add_with_carry: [i64, i64, i1] -> [i64, i1] can't be as fast as add128", because that's just a different name for what's in the last column, which is clearly the fastest. It happens to be the result of internal transformation in the compiler, but it could just as well be in the wire bytes directly, and would be just as fast, with a little less engine complexity. That said, since it can be achieved with internal transformations, I don't feel strongly about having it in the spec: V8 will achieve this performance with or without heroics; other engines can speak for themselves if they care.

@alexcrichton
Collaborator Author

Thanks for prototyping and gathering numbers! To clarify, are you saying that add_wide3 and add_wide should be added to the spec, with signatures using i64, and this data supports that? Personally I'm all in favor of that 👍

If you're thinking though that i1 should be added to the spec I'm personally pretty set on being against that. My read though is you're saying engines can benefit from add_wide3 + add_wide to more easily perform an i1 analysis, and I'd completely agree with you there.

@jakobkummerow

To clarify:

I think it would be technically ideal to have i1 in the spec, so that the wire bytes can accurately convey information that's relevant for optimal codegen. However, I also think it's not terribly important, because said information can be recovered by engine-side analysis; and I'm certainly sympathetic to the argument that adding a new primitive type is perhaps a bit too much for such an admittedly niche use case, so I won't push for it. However, I think it would be fair to present it as an open question to the larger community. Adding i1 could be rather straightforward: the only instruction we'd really need, I think, would be i1.const; next on the list would be conversions to/from i32 and/or i64, but those are easy to polyfill (using if and i32.const) for the few places that would need them, so while it'd be nice to have them, they're not required.
(For comparison: adding i8 and i16 as storage-only types wasn't a big deal either. It only gets messy when you want to support the full set of arithmetic, but we won't need that for i1 just like we don't have it for i8. The comparison is not perfect: i1 wouldn't be a storage type, instead it would show up in locals and on the value stack and perhaps function signatures; so the commonality is that it would only show up in very few places.)

My thinking on add_wide3+add_wide is similar: ideally we'd have them, because they're the most efficient way (in terms of both wire bytes encoding size, and engine simplicity) to encode the primary use case (bigint addition). But since they can be recovered from add128 through pattern matching, they're also not terribly important to have in the spec. I'm in favor of having them! And they're a nice match with mul_wide.

The rightmost three columns in my table all used the module you provided, i.e. the wire bytes contained only add128, everything else was done engine-side, demonstrating that that's possible with enough implementation effort.

@alexcrichton
Collaborator Author

Ah ok, interesting! I was actually starting to wonder what you were benchmarking there but starting from "only add128" and decompiling/pattern-matching to other instructions makes sense.

On the topic of i1, personally I really don't want to add a new value type to wasm. The i8 and i16 types are only present within GC objects and aren't otherwise value types, which sort of helps, but there are a huge number of things outside of an engine's purview when adding types, for example:

  • Should new if instructions be added to support i1 inputs? Should if be made polymorphic?
  • Every engine now needs to add another case to the union wasm_valtype.
  • All fuzzers need to update to have i1 parameters, locals, results, in functions.
  • Should C ABIs change such that passing bool now uses i1 instead of i32?
  • As mentioned, all the conversion instructions are necessary to interoperate with i1 at a language level.

Of course this can all be done, but at the same time you're saying that you're starting with add128 and with ~250 lines of engine code get optimal results. To me those 250 lines are way way way smaller than the amount of time/effort needed to add a new value type to wasm.

On the topic of add_wide3 and add_wide, this is one reason I was surprised you started with an add128 module. In some sense that indicates to me that add128 is indeed the right primitive here. I can say with certainty that if add_wide3 and add_wide were the primitives then Wasmtime would not at all be able to have anywhere near the performance gains that add128 brings. Getting even remotely close would require the infrastructure you describe of translating adjacent instructions differently than instructions-in-isolation which brings up all sorts of questions about brittle optimizations and brittle patterns of "what's emitted by LLVM today".

I realize you're saying that V8 probably doesn't care if it's add_wide3+add_wide or add128 or everything. It might be an interesting data point to remove add128 entirely and instead have the input exclusively contain add_wide3 and add_wide and benchmark that on V8, but I'm also not sure what we'd do with the result, so perhaps not too interesting of a data point.

Personally I wouldn't feel too swayed by binary size here. I don't disagree that add_wide3 is smaller than the equivalent encoding with add128 + i64.const 0 in a few places, but at the same time it's not like the majority of any wasm module is going to be using these instructions. If you don't feel too strongly about add_wide3 or add_wide then I might lean towards keeping just add128. It's not a perfect swiss army knife but in my experience it's been quite close to the abstraction that works well without too much effort most of the time. (e.g. it was easy to integrate everywhere in Rust/LLVM/Wasmtime and it sounds like it was also easy to integrate into V8 with not really much effort needed to optimize it either)

@alexcrichton
Collaborator Author

Also, on i1, is that actually sufficient? Wouldn't add_wide3 produce an i2?

@jakobkummerow

I was surprised you started with an add128 module.

It was the only module I had available.

Wouldn't add_wide3 produce an i2?

That is exactly the crux of the matter: it produces an i1 if and only if it consumes an i1. So i1 would not only be "sufficient", it would be exactly what we need: using i2 or i8 would be pointless, because those two can't convey the crucial information, so they would provide no benefit over just sticking with i64.

if add_wide3 and add_wide were the primitives then Wasmtime would not at all be able to have anywhere near the performance gains that add128 brings. Getting even remotely close would require the infrastructure you describe of translating adjacent instructions differently than instructions-in-isolation

That's difficult to imagine. Supporting an add_wide Wasm instruction in an engine/compiler is no harder than supporting mul_wide. In V8, I had to add a new type of IR node for it; looks like Cranelift even already has it -- and it even has the add_wide3 variant too! I don't think it should be hard to map Wasm instructions directly onto these existing IR nodes? And it shouldn't be slow either? Am I misunderstanding where you see difficulty?

@alexcrichton
Collaborator Author

That's difficult to imagine. Supporting an add_wide Wasm instruction in an engine/compiler is no harder than supporting mul_wide.

I apologize if I sound like a broken record, but this is what I linked to above: this section of the overview coupled with this historic CG presentation, notably the slide with this table:

(image: the benchmark table from the linked slides)

I implemented the rough equivalent of add_wide and add_wide3 in Wasmtime WITHOUT add128, and this is the performance. You're correct that it's trivial to simply get things working in Wasmtime using the instructions you link to. The result was that it was slower than add128. The notes, slides, and overview contain more information, but to rehash them here: the reason is the overflow flag. The performance here basically relies on not materializing the overflow flag in a general-purpose register, and doing that when the producer and consumer are two separate wasm instructions is not something Wasmtime or Cranelift can do.

I know I've basically been repeating this many many times to you but I'm not really sure how else to say it? I can try to dig up the branches and show you why it's slow but that would both take a bit to dig up and you'd have to boot up on Wasmtime/Cranelift to understand why generating more optimal code is hard. An alternative is I could try to better understand what V8 is doing and try to understand why it's easy in V8. Do you have a link to a diff for the work you've done that I can review?

@jakobkummerow

I know you've said all that before, and I'm sorry that I'm probably coming across as just not listening; I just can't reconcile it with my own implementation experience. I intentionally go from two nested add128s to a single add_wide3 on the fly, and I do materialize the carry flag in a GP register (because the loop forces that anyway), and I'm not doing any fancy compiler magic, and it's still a little faster than plain add128, not twice as slow.

(You can count the i1 analysis as "fancy", but that doesn't make anywhere near the material difference that your table shows, so the statement above holds even without that particular optimization.)

I don't have a good guess how to explain the difference in our respective experiments. One detail I've noticed before in your machine code snippets is an explicit instruction to put the carry value back into the flags register. That seems inefficient, and I don't understand why a compiler would do that, and I'd be quite surprised if that's what Cranelift emitted for uadd_overflow_cin, but if that is indeed what it does, then that could explain at least some of the difference... and perhaps it would be easy-ish to fix? But it shouldn't make a huge difference, I guess, as it's just one instruction.

My prototype compiles add_wide3 (with computed i1 annotation) to the following, where out_high, out_low (== in1), in2, and in3 are all general-purpose registers (the out_low == in1 aliasing is ensured by the register allocator upon explicit request):

xor out_high, out_high
add out_low, in2
adc out_low, in3
setb out_high

and that's pretty much hard-coded in the code generator (i.e. the logic is basically "see the add3 IR node, emit that sequence"). Without i1 analysis, there's one more adc to account for the second possible carry bit, so we go from four instructions to five.

In your experiment, did you replace each add128(add128(a, 0, b, 0), c, 0) on or near the Wasm level with a single add_wide3(a, b, c), or did you replace it with the much bigger sequence of operations that'd be required to handle the full 128-bit input range if you didn't know that the three high inputs are all zero? If the latter, that could be an explanation for the massive performance difference. I mean, if you saw it as add128(add128(a, x, b, y), c, z) and replaced that with something like

temp_low, carry1 = add_wide(a, b)
temp_high, ignored1 = add_wide3(x, y, carry1)
out_low, carry2 = add_wide(temp_low, c)
out_high, ignored2 = add_wide3(temp_high, z, carry2)

then I could certainly see how that'd be much slower than a single out_low, out_high = add_wide3(a, b, c).

Also, without having looked at its code, I'm pretty sure that Cranelift internally replaces add128(add128(a, 0, b, 0), c, 0) with add_wide3(a, b, c), because that's the only explanation I can think of why you'd be seeing the performance (and generated machine code) that you're seeing with add128. If the compiler can do that replacement internally, then it must be feasible to feed it some equivalent input directly -- perhaps not with zero effort, depending on the compiler's existing frontend, but certainly with not much effort, because the internals are already there.

I'll get my hacky experimental code uploaded when I get a chance so you can take a look at it in full detail.

@jakobkummerow

OK, here you go: patch

Applies to current tip-of-tree V8, if you want to compile it. Only x64 is implemented (and even there, i64.sub2_wide is still missing).

Quick guidance:

  • src/wasm/wasm-opcodes.h: the opcodes I've assigned to the additional instructions (consecutive to the ones you assigned in this proposal).
  • src/wasm/function-body-decoder-impl.h: decoding. Straightforward.
  • src/wasm/baseline/*: Baseline compiler. Can be entirely ignored for performance considerations.
  • src/wasm/turboshaft-graph-interface.cc: entry point to the optimizing compiler. Straightforward, just creating the IR nodes.
  • src/compiler/turboshaft/operations.h: definition of those IR nodes (all the ops are just parameterized versions of Word64PairOp)
  • src/compiler/turboshaft/machine-optimization-reducer.h: where the add128(add128(...)) -> add3 replacement happens. (The reverse is sketched out as a problem description but not implemented.)
  • src/compiler/turboshaft/carry-propagation-phase.cc: the i1 analysis.
  • src/compiler/backend/x64/instruction-selector-x64.cc: creating input for the register allocator. It's a lot of code, but not very interesting algorithmically, and no optimizations happen there (except for checking whether the high output happens to be entirely unused; I haven't checked whether that's ever actually the case for the benchmark at hand).
  • src/compiler/backend/x64/code-generator-x64.cc: emitting machine code. Looks pretty verbose because it deals with memory operands (not sure if that's even relevant for the benchmark at hand, I just followed other existing instructions) and contains a bunch of fairly paranoid CHECKs to make sure I'm not screwing up register aliasing. The high-level summary is what I described in my previous post.

@alexcrichton
Collaborator Author

Thanks for the link and the explanations! That definitely all appears as I'd expect, and while I don't fully grok the i1 optimization the purpose of removing an adc in the lowering of add_wide3 makes sense to me.

I believe the answer to your confusion lies in the primitives that the wasm I originally tested was using and what was implemented in Cranelift. I outlined this on this slide, but what I was testing was:

i64.add_overflow_{u,s} : [i64 i64] → [i64 $t]
i64.add_with_carry_{u,s} : [i64 i64 $t] → [i64 $t]

The primitives in Cranelift I used at the time were, in V8 terms, IR nodes of uadd_overflow and uadd_overflow_cin. There was no IR node for add_wide3. There was no combining of uadd_overflow or uadd_overflow_cin into something higher level (not even 128-bit addition, I didn't implement that, and definitely not add_wide3 as it doesn't exist in Cranelift). This means that the codegen ended up being terrible, because between IR nodes the overflow flag is a value, not a bit in a register, hence the movement in and out.

Does that help explain the discrepancies in performance? Basically, in my experience the performance of a compiler here is heavily dependent on how overflow-flag-using operations are modeled. I have no doubt that if I implemented add_wide3 in Cranelift it would perform basically the same as your add_wide3 column above. It would be nontrivial for us to implement the i1 optimization, however. Regardless, I did not measure such a hypothetical Cranelift implementation before; I measured wasm/Cranelift using lower-level instructions.

The benchmark here, bignum addition, is basically add_wide3 in a loop, so it naturally benefits from a single IR node representing the entire loop body. Originally I was trying to be as general as possible and ended up settling instead on add128 as a balance between generality and ease of implementation.


FWIW a few days ago on my llvm/rust/wasmtime forks I filled out more implementation, notably I got LLVM to do that double-add128-fold-to-add_wide3 optimization so the input wasm itself had add_wide3. In Wasmtime I translated that right back to add128(add128(..), ..) and got tests passing and the benchmark ran the same as before. The same-performance was expected though given the implementation.

One thing I'm a bit hung up on and haven't bottomed out: why did you do the i1 optimization? Did you find a tight loop that had add_wide3 but one input was a 1-bit value? If so can you share the shape of the wasm doing that? (e.g. ideally a C/Rust source function + what you would translate to add_wide3/add_wide ops)

@jakobkummerow

I don't fully grok the i1 optimization

Yeah, I know it needs a descriptive comment :-)
It finds (potentially looped and/or branched) chains of nodes that only consist of the following:

  • i64.const with values 0 or 1
  • high output from add_wide (guaranteed to be in i1 range)
  • high output from add_wide3 that are already in this chain (because having at least one i1 input guarantees that their high output will be in i1 range again)
  • Phi nodes that are already in this chain (so they're only propagating i1-ranged values)

It does this by first finding all add_wide3 nodes, and then walking backwards along the value computation chains that form their inputs, until it finds one that doesn't meet the above criteria, and then propagates that information forwards to the add_wide3 again. The "that are already in this chain" conditions are implemented by first assuming them to be true on the backwards pass, and then correcting that optimistic assumption if necessary on the forwards pass.
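
A simplified sketch of that analysis, with assumed node shapes rather than V8's actual Turboshaft IR: instead of the backwards walk plus forwards correction described above, this version just iterates the same optimistic assumption to a fixed point, which also takes care of loop phis.

use std::collections::HashSet;

enum Node {
    Const(u64),                          // i1 if the value is 0 or 1
    AddWideHigh,                         // high output of add_wide: always 0 or 1
    Add3WideHigh { inputs: [usize; 3] }, // i1 if at least one input is i1
    Phi { inputs: Vec<usize> },          // i1 only if every input is i1
    Other,
}

// Returns the set of node ids whose values are provably in i1 range.
fn i1_ranged(graph: &[Node]) -> HashSet<usize> {
    // Optimistic start: assume every node is i1, then strip violations until
    // nothing changes.
    let mut i1: HashSet<usize> = (0..graph.len()).collect();
    loop {
        let mut changed = false;
        for (id, node) in graph.iter().enumerate() {
            let still_i1 = match node {
                Node::Const(v) => *v <= 1,
                Node::AddWideHigh => true,
                Node::Add3WideHigh { inputs } => inputs.iter().any(|i| i1.contains(i)),
                Node::Phi { inputs } => inputs.iter().all(|i| i1.contains(i)),
                Node::Other => false,
            };
            if !still_i1 && i1.remove(&id) {
                changed = true;
            }
        }
        if !changed {
            break;
        }
    }
    i1
}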

why did you do the i1 optimization? Did you find a tight loop that had add_wide3 but one input was a 1-bit value?

I mostly did it for exploratory reasons: I wanted to see what it takes to generate optimal code. Practical relevance is given by every single bigint addition loop, e.g. the one you linked before. Let me rephrase that slightly for clarity:

let mut carry = 0;
for (...) {
  (output, carry) = adc(input1, input2, carry);
}

So carry is the "loop phi", it starts out with what could be i1.const 0, and in every iteration it's both input to and (high) output from an add3(input1, input2, carry), so its value is guaranteed to be always 0 or 1, IOW it's in i1 range. And as we discussed before, practically every implementation of bigint addition will share this basic structure.
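
(For reference, here is a minimal sketch of what such an adc helper looks like; the helper in the actual benchmark source may differ in details, but the key point is that on the Rust level it has to be expressed via u128 arithmetic:)

// Sketch only: add two 64-bit limbs plus a carry, returning the low 64
// bits and the carry out. Without an add_wide3 primitive, toolchains
// have to express this via u128, i.e. two u128 additions.
fn adc(a: u64, b: u64, carry: u64) -> (u64, u64) {
    let sum = a as u128 + b as u128 + carry as u128;
    (sum as u64, (sum >> 64) as u64)
}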

With the state of this proposal as it is, the emitted Wasm for the loop body will be (output, carry) = add128(add128(input1, 0, input2, 0), carry, 0), but that is, as discussed for the past dozen posts in this thread, exactly the pattern that either a Wasm AOT optimizer or an optimizing Wasm engine can easily recognize and convert to an add_wide3(input1, input2, carry) (aka uadd_overflow_cin(input1, input2, carry)), so the subsequent i1 analysis can run on the add_wide3 that the human author of this code clearly had in mind when they wrote adc.

Does that help explain the discrepancies in performance?

Not really, because:

IR nodes of uadd_overflow and uadd_overflow_cin. There was no IR node for add_wide3

My turn to sound like a broken record: "uadd_overflow_cin" and "add_wide3 with i1 analysis" are two names for the same thing! Both take [i64, i64, carry] inputs, add them all up, and return [i64, carry] outputs. Minor implementation differences in various code generation backends might cause minor performance differences (e.g. I've found claims that xor hi, hi; setb hi; is slightly faster than setb hi; movzxbl hi, hi; I wouldn't be surprised if the specifics depended on microarchitecture though), but either way they should be within a few percent of each other.
And, of course, uadd_overflow and add_wide are also the same thing.

There was no combining uadd_overflow or uadd_overflow_cin to something higher level

Yeah, I'm not combining them either. uadd_overflow_cin is add_wide3.

It would be nontrivial for us to implement the i1 optimization

No problem; as my earlier results show, it doesn't have a huge impact.

@alexcrichton
Copy link
Collaborator Author

Ok I've recompiled all the old bits and here's what I have.

Inner loop source code

The rust source code for the inner loop is here - https://github.com/dignifiedquire/num-bigint/blob/699574df3ae665a6d394c6bc94d6d151aa063c25/src/algorithms/add.rs#L13-L35

Native disassembly of inner loop
   e0480:       4b 03 04 de             add    (%r14,%r11,8),%rax
   e0484:       49 83 d2 00             adc    $0x0,%r10
   e0488:       4a 03 04 da             add    (%rdx,%r11,8),%rax
   e048c:       49 83 d2 00             adc    $0x0,%r10
   e0490:       4b 89 04 de             mov    %rax,(%r14,%r11,8)
   e0494:       4d 8d 4b 02             lea    0x2(%r11),%r9
   e0498:       31 c0                   xor    %eax,%eax
   e049a:       4f 03 54 de 08          add    0x8(%r14,%r11,8),%r10
   e049f:       0f 92 c0                setb   %al
   e04a2:       4e 03 54 da 08          add    0x8(%rdx,%r11,8),%r10
   e04a7:       48 83 d0 00             adc    $0x0,%rax
   e04ab:       4f 89 54 de 08          mov    %r10,0x8(%r14,%r11,8)
   e04b0:       41 ba 00 00 00 00       mov    $0x0,%r10d
   e04b6:       4d 89 cb                mov    %r9,%r11
   e04b9:       4d 39 c8                cmp    %r9,%r8
   e04bc:       75 c2                   jne    e0480 <_ZN118_$LT$num_bigint_dig..biguint..BigUint$u20$as$u20$core..ops..arith..Add$LT$$RF$num_bigint_dig..biguint..BigUint$GT$$GT$3add17hedad4a958e88df37E+0x100>
add128: Inner loop wasm
loop ;; label = @13
  local.get 7
  local.get 12
  i64.const 0
  local.get 7
  i64.load
  i64.const 0
  i64.add128
  local.get 15
  i64.load
  i64.const 0
  i64.add128
  local.set 12
  i64.store
  local.get 7
  i32.const 8
  i32.add
  local.tee 16
  local.get 12
  i64.const 0
  local.get 16
  i64.load
  i64.const 0
  i64.add128
  local.get 15
  i32.const 8
  i32.add
  i64.load
  i64.const 0
  i64.add128
  local.set 12
  i64.store
  local.get 15
  i32.const 16
  i32.add
  local.set 15
  local.get 7
  i32.const 16
  i32.add
  local.set 7
  local.get 5
  local.get 13
  i32.const 2
  i32.add
  local.tee 13
  i32.ne
  br_if 0 (;@13;)
end
add128: Inner loop native disassembly
   9505d:       45 8b df                mov    %r15d,%r11d
   95060:       48 31 db                xor    %rbx,%rbx
   95063:       4f 03 34 1c             add    (%r12,%r11,1),%r14
   95067:       48 83 d3 00             adc    $0x0,%rbx
   9506b:       45 8b ea                mov    %r10d,%r13d
   9506e:       4f 03 34 2c             add    (%r12,%r13,1),%r14
   95072:       48 83 d3 00             adc    $0x0,%rbx
   95076:       4f 89 34 1c             mov    %r14,(%r12,%r11,1)
   9507a:       45 8d 5f 08             lea    0x8(%r15),%r11d
   9507e:       4d 31 f6                xor    %r14,%r14
   95081:       4b 03 1c 1c             add    (%r12,%r11,1),%rbx
   95085:       49 83 d6 00             adc    $0x0,%r14
   95089:       45 8d 6a 08             lea    0x8(%r10),%r13d
   9508d:       4b 03 1c 2c             add    (%r12,%r13,1),%rbx
   95091:       49 83 d6 00             adc    $0x0,%r14
   95095:       4b 89 1c 1c             mov    %rbx,(%r12,%r11,1)
   95099:       83 c0 02                add    $0x2,%eax
   9509c:       4d 89 c3                mov    %r8,%r11
   9509f:       41 81 e3 fe ff ff ff    and    $0xfffffffe,%r11d
   950a6:       41 83 c7 10             add    $0x10,%r15d
   950aa:       41 83 c2 10             add    $0x10,%r10d
   950ae:       41 39 c3                cmp    %eax,%r11d
   950b1:       0f 85 a6 ff ff ff       jne    9505d <wasm[0]::function[531]::_ZN118_$LT$num_bigint_dig..biguint..BigUint$u20$as$u20$core..ops..arith..Add$LT$$RF$num_bigint_d+0x15d>
overflow: Inner loop wasm
loop ;; label = @13
  local.get 7
  local.get 12
  local.get 7
  i64.load
  i64.add_overflow_u
  local.set 16
  local.get 15
  i64.load
  i64.add_overflow_u
  local.set 17
  i64.store
  local.get 7
  i32.const 8
  i32.add
  local.tee 18
  i64.const 0
  i64.const 0
  local.get 16
  i64.add_with_carry_u
  drop
  i64.const 0
  local.get 17
  i64.add_with_carry_u
  drop
  local.get 18
  i64.load
  i64.add_overflow_u
  local.set 16
  local.get 15
  i32.const 8
  i32.add
  i64.load
  i64.add_overflow_u
  local.set 17
  i64.store
  local.get 16
  i64.extend_i32_u
  i64.const 0
  local.get 17
  i64.add_with_carry_u
  drop
  local.set 12
  local.get 15
  i32.const 16
  i32.add
  local.set 15
  local.get 7
  i32.const 16
  i32.add
  local.set 7
  local.get 5
  local.get 13
  i32.const 2
  i32.add
  local.tee 13
  i32.ne
  br_if 0 (;@13;)
end
overflow: Inner loop disassembly
   94dd5:       45 8b fe                mov    %r14d,%r15d
   94dd8:       4b 03 3c 3c             add    (%r12,%r15,1),%rdi
   94ddc:       40 0f 92 c3             rex setb %bl
   94de0:       41 8b f2                mov    %r10d,%esi
   94de3:       49 03 3c 34             add    (%r12,%rsi,1),%rdi
   94de7:       40 0f 92 c6             setb   %sil
   94deb:       4b 89 3c 3c             mov    %rdi,(%r12,%r15,1)
   94def:       80 c3 ff                add    $0xff,%bl
   94df2:       4c 89 df                mov    %r11,%rdi
   94df5:       48 83 d7 00             adc    $0x0,%rdi
   94df9:       41 0f 92 c5             setb   %r13b
   94dfd:       40 80 c6 ff             add    $0xff,%sil
   94e01:       48 83 d7 00             adc    $0x0,%rdi
   94e05:       40 0f 92 c6             setb   %sil
   94e09:       41 8d 76 08             lea    0x8(%r14),%esi
   94e0d:       49 03 3c 34             add    (%r12,%rsi,1),%rdi
   94e11:       41 0f 92 c5             setb   %r13b
   94e15:       41 8d 5a 08             lea    0x8(%r10),%ebx
   94e19:       49 03 3c 1c             add    (%r12,%rbx,1),%rdi
   94e1d:       41 0f 92 c7             setb   %r15b
   94e21:       49 89 3c 34             mov    %rdi,(%r12,%rsi,1)
   94e25:       49 0f b6 fd             movzbq %r13b,%rdi
   94e29:       41 80 c7 ff             add    $0xff,%r15b
   94e2d:       48 83 d7 00             adc    $0x0,%rdi
   94e31:       40 0f 92 c6             setb   %sil
   94e35:       41 83 c1 02             add    $0x2,%r9d
   94e39:       48 89 c6                mov    %rax,%rsi
   94e3c:       83 e6 fe                and    $0xfffffffe,%esi
   94e3f:       41 83 c6 10             add    $0x10,%r14d
   94e43:       41 83 c2 10             add    $0x10,%r10d
   94e47:       44 39 ce                cmp    %r9d,%esi
   94e4a:       0f 85 85 ff ff ff       jne    94dd5 <wasm[0]::function[528]::_ZN118_$LT$num_bigint_dig..biguint..BigUint$u20$as$u20$core..ops..arith..Add$LT$$RF$num_bigint_d+0x155>

The add128 loop is 16% slower than native. The overflow loop is 101% slower than native. My rationale for why the overflow loop is so much slower is that the codegen is terrible: the overflow flag keeps getting moved in and out of EFLAGS. Why? Because Wasmtime and Cranelift generate code for each wasm instruction effectively in isolation in this case. The carry flag in EFLAGS is defined in one wasm instruction and consumed in another, and Wasmtime and Cranelift have no means of seeing that the definition of the carry flag is adjacent to its use, and therefore that there's no need to move it through a general-purpose register.

The reason add128 is fast is that the carry flag is never live between wasm instructions. All it takes is a lowering rule or two to fit the constants into adc.

@alexcrichton
Copy link
Collaborator Author

Also, if it helps, overflow.wasm.gz is the wasm file I was benchmarking for "overflow". I was evaluating a lot of different things in flight so the opcodes aren't the same. Notably this is the decoding, this is the validation, and this is the translation to Cranelift.

@jakobkummerow
Copy link

I'm not doubting that you saw what you saw.

I think the discrepancy between your experimental results and mine boils down to one crucial difference:

  • You've assumed that a Wasm add_with_carry instruction should be compiled to a single adc x64 machine instruction that takes its third input as the actual carry flag. We are in agreement that while that perhaps sounds like a good idea at first (WebAssembly and all), it doesn't work out well in practice because the flags register is too tricky to work with.
  • I've assumed that a Wasm add_wide3 instruction is the same opaque/abstract operation as the adc Rust helper function, and may well be compiled to a sequence of 4-5 machine instructions, so the flags register is only used within that sequence.

Interestingly and confusingly, in both of these perspectives the signature of the operation in question is [i64, i64, carry] -> [i64, carry], and of course its behavior is the same too: it adds up its three inputs.

I think the biggest reason for the performance differences we saw is actually not the flags register (although that probably contributes), but rather suboptimal module generation:
The Rust source contains one call to adc in the loop body. So if you hand-wrote the Wasm, you could simply use a single add_wide3 for the loop body; that'd be the "golden" solution.
Since Rust/LLVM doesn't offer add_wide3 as a primitive, we need to emulate it with u128 arithmetic, which is precisely what the adc helper function does. Notably, we need two u128 additions for one add_wide3.
Now, in a world where Wasm doesn't offer add128 but only add_wide3, we luckily find that u128 arithmetic can, in turn, be emulated in terms of add_wide3 -- but in the general case of unconstrained inputs, we need two add_wide3 for each u128 addition. (For the sake of brevity, I'm not distinguishing between add_wide and add_wide3 here.)
So, by naively chaining those emulations, we get four add_wide3 Wasm instructions for the two add128 instructions that emulated the original single add_wide3.
It's unsurprising that this is slow: we're still doing the same work as the original code, but using four times as many operations to do it.
Inspecting the snippets you just posted reveals that this is almost exactly what actually happened: due to loop unrolling, everything shows up twice as often; the "add128" Wasm contains four add128 instructions representing two calls to the adc helper, which the "overflow" Wasm replaces with seven i64.add_* instructions; apparently the unrolling allowed it to optimize out one of the eight we would have predicted by the line of thought above.
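
To make that chain of emulations concrete, here's a sketch in Rust (add_wide/add_wide3 here are just helper functions modeling the semantics of the proposed Wasm operations, not anything that exists today):

// Sketch only: model the proposed wide ops via u128, then express one
// u128 addition in terms of them (general case, upper halves unconstrained).
fn add_wide(a: u64, b: u64) -> (u64, u64) {
    let s = a as u128 + b as u128;
    (s as u64, (s >> 64) as u64)
}
fn add_wide3(a: u64, b: u64, c: u64) -> (u64, u64) {
    let s = a as u128 + b as u128 + c as u128;
    (s as u64, (s >> 64) as u64)
}
// One u128 (or add128) addition becomes one add_wide plus one add_wide3,
// so the original adc (two u128 additions) naively becomes four wide ops.
fn add_u128(a: (u64, u64), b: (u64, u64)) -> (u64, u64) {
    let (lo, carry) = add_wide(a.0, b.0);
    let (hi, _overflow) = add_wide3(a.1, b.1, carry);
    (lo, hi)
}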

So, I think to get your results to match mine, two things would be required:

  • When producing the Wasm module, detect the pattern add128(add128(a, 0, b, 0), c, 0) and emit a single add_wide3(a, b, c) for it (or add_with_carry(a, b, c) if you prefer the other name). If that's too hard to do directly in your toolchain, it could also be done in a standalone Wasm-to-Wasm optimizer.
  • In the engine, compile the add_wide3 Wasm instruction to a short sequence (like the xor; add; adc; setb mentioned earlier, or similar).

I guessed before that Cranelift's existing uadd_overflow_cin would do the latter (it has the right signature), but it's entirely possible that I was overly optimistic there and that internal operation is indeed hell-bent on stuffing its third input into the flags register first, in which case it wouldn't be a good fit. But it can still be a useful building block, I guess something like this would do the trick then (using setb; movzx instead of xor; setb, because I have no idea how to express the latter):

        Operator::I64AddWide3 => {
            let (arg1, arg2, arg3) = state.pop3();
            let (tmp1, tmp2) = builder.ins().uadd_overflow(arg1, arg2);
            let (result, oflow) = builder.ins().uadd_overflow_cin(tmp1, arg3, tmp2);
            state.push1(result);
            state.push1(builder.ins().uextend(I64, oflow));
        }

Codegen sidenote: the sequence

   94dfd:       40 80 c6 ff             add    $0xff,%sil
   94e01:       48 83 d7 00             adc    $0x0,%rdi

really makes very little sense. If you have a carry bit from a previous operation in %sil and want to add it to the value in %rdi, just use add %sil,%rdi, don't bother with the flags register!


add128 is fast [because] the carry flag is never live between wasm instructions

Well, in a sequence

(low_temp, carry) = add128(a, 0, b, 0)
(out_low, out_high) = add128(low_temp, carry, c, 0)

the value carry is, in fact, the carry flag of the a + b addition, and looks quite live between wasm instructions to me.

But sure, the general point stands that having way too many i64.add_* instructions on the Wasm level also causes way too many things to be live between instructions.

@alexcrichton
Copy link
Collaborator Author

You're right, yeah, the wasm itself is not very well optimized (nor are the lowering patterns for uadd_overflow_cin). After working on them a bit more, though, I'm a little surprised that the performance is still quite bad.

To clarify, though, your translation of I64AddWide3 isn't quite right; what actually needs to happen is:

add_wide3(a, b, c):
  v0, v1 = uadd_overflow(a, b)
  v2, v3 = uadd_overflow(v0, c)
  v4 = iadd(v1, v3)
  return v2, v4

Notably uadd_overflow_cin is not part of the most general form of I64AddWide3 (this has tripped me up when thinking about it in the past). Only when one of the inputs to add_wide3 is proven to be in [0, 1] is it equivalent to a single uadd_overflow_cin; otherwise I don't think it can help the implementation (I've been wrong before though...)
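
For concreteness, here's that general form in the builder style of your earlier snippet (an untested sketch; the exact types of the overflow-flag values may need adjusting):

        Operator::I64AddWide3 => {
            // General form: no input is assumed to be in [0, 1].
            let (arg1, arg2, arg3) = state.pop3();
            let (v0, c0) = builder.ins().uadd_overflow(arg1, arg2);
            let (v2, c1) = builder.ins().uadd_overflow(v0, arg3);
            let c0 = builder.ins().uextend(I64, c0);
            let c1 = builder.ins().uextend(I64, c1);
            let high = builder.ins().iadd(c0, c1);
            state.push1(v2);
            state.push1(high);
        }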

After applying some basic LLVM optimizations the new wasm is:

take 2: overflow loop
loop ;; label = @13
  local.get 7
  local.get 12
  local.get 7
  i64.load
  i64.add_overflow_u
  local.set 12
  local.get 15
  i64.load
  i64.add_overflow_u
  local.set 16
  i64.store
  local.get 7
  i32.const 8
  i32.add
  local.tee 17
  local.get 16
  local.get 12
  i64.const 0
  i64.add
  i64.add
  local.get 17
  i64.load
  i64.add_overflow_u
  local.set 12
  local.get 15
  i32.const 8
  i32.add
  i64.load
  i64.add_overflow_u
  local.set 16
  i64.store
  local.get 12
  local.get 16
  i64.add
  local.set 12
  local.get 15
  i32.const 16
  i32.add
  local.set 15
  local.get 7
  i32.const 16
  i32.add
  local.set 7
  local.get 5
  local.get 13
  i32.const 2
  i32.add
  local.tee 13
  i32.ne
  br_if 0 (;@13;)
end

It's still unrolled twice with 3 adds per unroll (6 total). The two add_overflow_u plus the add at the end are the equivalent of add_wide3, so I believe this is a "more basic" translation of add_wide3.

The generated machine code is:

take 2: overflow machine code
152:┌─→mov    ecx,r14d
    │  add    r9,QWORD PTR [r15+rcx*1]
    │  rex    setb bl
    │  mov    r13d,r12d
    │  add    r9,QWORD PTR [r15+r13*1]
    │  setb   r13b
    │  mov    QWORD PTR [r15+rcx*1],r9
    │  lea    r9d,[r14+0x8]
    │  movzx  rcx,r13b
    │  movzx  rbx,bl
    │  add    rbx,rcx
    │  add    rbx,QWORD PTR [r15+r9*1]
    │  setb   r13b
    │  lea    ecx,[r12+0x8]
    │  add    rbx,QWORD PTR [r15+rcx*1]
    │  rex    setb cl
    │  mov    QWORD PTR [r15+r9*1],rbx
    │  add    esi,0x2
    │  mov    rbx,rdi
    │  and    ebx,0xfffffffe
    │  add    r14d,0x10
    │  movzx  r9,r13b
    │  movzx  rcx,cl
    │  add    r9,rcx
    │  add    r12d,0x10
    ├──cmp    ebx,esi
    └──jne    152

This has the "most naive lowering" here, where each add_overflow becomes add + setb and each i64.add becomes add. It's true that xor ; ... ; setb is indeed more efficient than this lowering, and that's something we can improve in Wasmtime, but this loop is still 90% slower than native. I was expecting something more in the 50%-ish range.

Basically what I'm getting at here, which is still true of the original prototype, is that if performance relies on the overflow bit being carried between wasm instructions I don't think this proposal is going to work. The add_wide3 instruction works because, as you say, it keeps the overflow bit within the instruction.

One thing I still don't understand: why did you get a speedup from your i1 optimization? This loop shouldn't benefit from or trigger i1 optimizations, should it? Perhaps it was measurement noise of up to 10% (which wouldn't surprise me; it's a noisy microbenchmark).


Also, to clarify, none of this is working with add_wide3. LLVM has no knowledge of it, nor does Wasmtime. In this "overflow" world Wasmtime has no knowledge of 128-bit arithmetic either, only LLVM does. In some sense nothing has add_wide3 as a primitive operation: not x64, LLVM, Rust, or wasm (the variant I'm testing here).

Personally I still feel very lost with respect to conclusions on this thread. None of this feels like it motivates add_wide3 to me in particular over add128. We have a ton of discussion about various iterations, but it's all variants of which primitives are in play and who's expected to do which optimization. I still personally feel that "wasm just has add128" is the best balance so far of net total complexity between engines, toolchains, languages, etc. To me the major points are:

  • With just add128, Wasmtime and V8 both perform quite well with a modest amount of engine work.
  • No measurement has been done of "just add_wide + add_wide3" in the wasm module itself, but it's expected that a sufficiently smart toolchain would perform the same as add128 on the bigint-summation test case here.
  • With just i64.add_overflow_u I'm measuring it's only marginally better than the status quo before wide-arithmetic, notably because the engine would have to do work to combine 2 x i64.add_overflow_u plus 1 x i64.add into an internal add_wide3 node. Without this optimization it seems most performance is lost.

The only major open question I have is:

  • Why did the i1 optimization in V8 seemingly improve the bignum benchmark when this loop shouldn't be triggering the i1 optimization? The loop requires the most general form of add_wide3 where the i1 optimization is not applicable.

Do you have other conclusions at this point, though?

@jakobkummerow
Copy link

your translation of I64AddWide3 isn't quite right

You're right, I realized after posting it that I gave the i1-optimized variant. The general case needs one more addition, I'd express it as:

        Operator::I64AddWide3 => {
            let (arg1, arg2, arg3) = state.pop3();
            let (tmp1, tmp2) = builder.ins().uadd_overflow(arg1, arg2);
            let tmp3 = builder.ins().uextend(I64, tmp2);  // Ideally replaced with `xor; setb` sequence but I don't know how
            let (result, tmp4) = builder.ins().uadd_overflow(tmp1, arg3);
            let (high, ignored) = builder.ins().uadd_overflow_cin(tmp3, 0, tmp4);
            state.push1(result);
            state.push1(high);
        }

That should result in add; setb; movzx; add; adc, so it only needs to setb-materialize one carry flag, while being able to use the flags register for the second. I don't know how to specify a constant/immediate in that code, so I just wrote a literal 0. Probably something with builder.ins().iconst(...)?
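
i.e. presumably something along these lines for the last two steps (an untested sketch):

            let zero = builder.ins().iconst(I64, 0);
            let (high, ignored) = builder.ins().uadd_overflow_cin(tmp3, zero, tmp4);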

take 2: overflow loop

Looks better, but seems to get only halfway there. You really want to have one add3 for every two add128 in the "add128" version. With four add_overflow_u I agree that it needs way too much heavy lifting in the engine, or produces too slow code.

why did you get speedup with your i1 optimization? This loop doesn't benefit or trigger i1 optimizations?

Yes, it does. Every bigint addition loop I've ever seen does. They all start with let mut carry = 0 (or some equivalent phrasing), and then in every loop iteration, unrolled or not, they do (current_output, carry) = adc(in1, in2, carry). So since the first iteration's adc gets the constant 0 as input (which is clearly in [0..1] range), it produces a carry in [0..1] range, and by induction so does the adc in every subsequent iteration too. It's easier to see/analyze if you express the loop in terms of add_wide3.

I'm sure it wasn't measurement noise. It was pretty stable across several runs, fluctuating by 1-2% at most. (Some of the other tests fluctuate much more wildly.)

conclusions on this thread

My take:

  • For bigint addition, add_wide + add_wide3 perform at least as well as add128 if the module producer does its job right (which is no harder than what engines have to do for emitting optimal code for nested add128 instructions). For engines that so far have none of the primitives under consideration, these two are (slightly) simpler to implement than add128.
  • add128 is fine. For "real i128" arithmetic (i.e. not bigint), it makes optimal codegen easier because it's (obviously) a perfect fit.
  • I agree that just i64.add_overflow_u (which, IIUC, is another name for add_wide) isn't a good contender to replace add128 and/or add_wide3. It would nicely complement the latter though.
  • i1 analysis (or annotation) is useful but the difference isn't huge. Notably it's useful for both add_wide3 and add128 (in case of the latter: after the engine pattern-match-replaced it with the former), so it's not an argument for or against either of these instructions.
  • If it was up to me, I'd standardize all three: add128, add_wide, add_wide3.
  • I'm ambivalent about standardizing i1. Simpler engines would benefit (quote from one of them: "It would be nontrivial for us to implement the i1 optimization"), V8 doesn't need it, and I think its spec complexity would be low (only a small handful of instructions would operate on it); overall I wouldn't mind having it, but I also wouldn't oppose a decision that it isn't worth having.

Taking a closer look at the native codegen, here's an interesting observation: the specific example you shared before is written in a way that makes rustc/LLVM not emit optimal code. In fact, we can use it to illustrate the difference between add128(add128(...), ...) and add_wide3 in a simple engine. In this godbolt link, function add2 is the original, with only two irrelevant changes:
(1) I've set -C opt-level=z to disable loop unrolling just to make the output easier to parse for us humans; the main observation below holds even if you change it back to opt-level=3.
(2) I've dropped the second loop that deals with inputs of unequal length, which doesn't change the code that's generated for the first, more interesting loop, and again makes the output easier to understand.
Observe how in the loop there are four additions add, adc, add, adc. That's two full-width 128-bit additions; LLVM didn't take advantage of the upper halves being zero, nor did it perform an i1 analysis. So this exemplifies what a Wasm engine would emit for an add128 pair, if it didn't bother trying to optimize any particular patterns in it. Compare that with function add2_opt below it, which computes exactly the same result, but phrases its helper function slightly differently. It ends up modeling what a simple Wasm engine would emit for add_wide3: in the loop, there are only three additions and a setb. So in this case, despite the additions still using u128 on the Rust level, LLVM did notice and exploit the fact that their upper halves are all zero, and replaced them with some variety of "64-bit add with overflow". Notably, LLVM again doesn't appear to have performed an i1 analysis, otherwise there'd be only two additions in the loop.
To tie all this back to our Wasm experiments: a particular input that makes even native LLVM struggle to emit great code is perhaps not the easiest example for a Wasm toolchain to generate optimal Wasm for. That can be seen as a benefit: it's a stress test of sorts; a toolchain that can handle this will likely also be able to handle friendlier formulations of the same algorithm. But if you're experiencing difficulty producing an optimal Wasm module, you may want to start with an easier example, such as the add2_opt rephrasing I just linked.

@alexcrichton
Copy link
Collaborator Author

That should result in add; setb; movzx; add; adc, so it only needs to setb-materialize one carry flag

To clarify, though, avoiding the setb after the second add in this code sequence requires "fusing" the uadd_overflow and the uadd_overflow_cin. That's what I've said a number of times is not possible for Cranelift at this time. Also, to clarify, feeding the two overflow flags into uadd_overflow_cin is semantically the same as iadd-ing them together. You're assuming uadd_overflow_cin has fancy codegen which realizes the carry flag comes from uadd_overflow, but it does not.

Looks better, but seems to get only halfway there

I don't think that's quite right? The loop is still unrolled twice from the original source code. That means there are two conceptual add_wide3 operations, which means a total of four add_overflow_u and two add instructions for the pairs of overflow flags. The i64.add is a bit buried, but that's what's happening there.

Specifically, the heavy lifting here is that if you add one carry flag to another you can use adc $original_flag, 0, but you ALSO have to skip the setb from the previous uadd_overflow. That's the hard part Cranelift doesn't do.

Yes, it does. Every bigint addition loop I've ever seen does

Ah yes, good point, but at the same time there's no way for the loop to be a single adc. This is something I don't really understand about you pushing for add_wide3 -- it sounds like you actually want uadd_overflow_cin, which is pre-constrained to take an i1 input. Why do you want add_wide3 when it's not even what the bigint case wants? (in that you want a more specific version where the third operand is an i1).

In the end, though, it's basically impossible for the bigint summation loop to be a single adc. Something in the loop is almost surely going to clobber flags, meaning you can't preserve the carry bit in EFLAGS across loop iterations. Once you setb to capture the carry flag from the adc, you're back to the same problem of how to get it back into EFLAGS efficiently for the next iteration of the loop. In the end you reach the pessimal codegen of uadd_overflow_cin in isolation.

Observe how in the loop there are four additions add, adc, add, adc

The loops there are relatively bad in the sense that they have two branches inside them, one to exit at the top and one to jump back at the bottom. You can manually disable unrolling with LLVM flags, which gets you the loops with full optimizations but no unrolling, to make them easier to read.

One thing I'm not clear on - are you saying that add2_opt is optimal? The difference here is xor; add; setb vs add; adc, where the adc has prior knowledge that the destination register is zero-extended. Are you saying half these instructions can be removed? What is the target codegen you're looking to see here?

But if you're experiencing difficulty producing an optimal Wasm module, you may want to start with an easier example

Assuming add_wide3, which with i1 analysis is just uadd_overflow_cin, it sounds like what you're expecting is:

let mut carry: i1 = 0;
for (a, b) in a.iter_mut().zip(b) {
    (*a, carry) = uadd_overflow_cin(*a, *b, carry);
}

Which for x64 is something like:

start:
  ;; assume `carry` is already in the carry bit 
  mov %a, (%addr_of_a)
  adc %a, (%addr_of_b)
  mov (%addr_of_a), %a
  ;; somehow `carry` stays in the carry bit
  jne start

What I don't understand is what you're assuming goes in place of those two comments. Everything I've seen historically is that performance critically depends on what those two comments turn into, and that's why LLVM is emitting the code it is.

@jakobkummerow
Copy link

avoiding the setb after the second add in this code sequence requires "fusing"

I see. Clearly I don't have a good understanding of Cranelift. First I thought uadd_overflow_cin was smarter/"higher-level" than it actually is; then I thought it was a platform-independent abstraction of an x64 adc instruction; apparently it's neither, but somewhere in between. I was hoping you'd have some way to, given (the Cranelift-internal representation of) some Wasm opcode, choose the specific sequence of machine instructions you want to generate for that opcode.
Perhaps that snippet of code is the wrong place to implement it then. Perhaps it would work better (at similar effort of just a handful of lines, just in some different place) to introduce an alternative to uadd_overflow_cin that has the same signature (3 values in, 2 values out) but is handled by some platform-specific code generation backend where you can emit a hand-crafted sequence of instructions such that you can put an add and an adc right behind each other, so the latter takes the former's flags register. (Presumably such a technique must exist, right? Otherwise how do you pull off generating the sequence you want for add128?)

two conceptual add_wide3 operations, meaning that there are a total of four add_overflow_u and two add for the pairs of overflow flags.

Yeah, to make it easy for an engine to emit optimal code, you want two add_wide3 Wasm instructions for two conceptual add_wide3 operations. That's why I described it as "halfway there", it's better than before, but still not ideal. But I suppose in a way you're also not even trying to be ideal there in the sense that the toolchain that produced this version didn't even attempt to emit add_wide3 instructions, right?

there's no way for the loop to be a single adc.

I fully agree, the loop body cannot be a single adc machine instruction. The (not unrolled) loop body can be a single add_wide3 Wasm instruction though. And then any compiler that's expressive enough to let you say "for an add_wide3 Wasm instruction, generate the following block of five machine instructions: add; setb; movzx; add; adc" will generate near-optimal code for that loop, without needing any fancy logic or analysis or fusion or pattern matching or messing with the flags register. Just straightforward replacement: 1 wasm instruction -> 5 machine instructions.
And, of course, if the loop body is unrolled then you just get two (or more) copies of that in a row.

You can manually disable unrolling with llvm flags

Thanks, good to know.
Anyway, my observation still holds: the fn add2 version has add, adc, add, adc to perform two full 128-bit wide additions (it did put the upper halves as constant zeroes in there, but it didn't optimize them out), whereas the fn add2_opt version needs one less addition instruction because it's able to prove that the upper halves of the inputs are all zero. The suffix _opt was meant to be short for "optimized", I'm not sure whether it deserves to be called "optimal" :-)

it sounds like what you're expecting is...

On the Wasm level, in Rust syntax, I ideally want to see:

let mut carry: i1 = 0;
for (a, b) in a.iter_mut().zip(b) {
    (*a, carry) = add_wide3(*a, *b, carry);
}

Which, on x64 (following your syntax of Intel operand order with AT&T register names, i.e. mov %dst, %src) should ideally get compiled to:

mov %c, 0  ;; register %c holds the carry between loop iterations
start:
  mov %a, %c  ;; free up %c as early as possible to help the pipeline
  xor %c, %c  ;; prepare for `setb`
  add %a, (%addr_of_a)
  adc %a, (%addr_of_b)
  setb %c
  mov (%addr_of_a), %a

  ;; ...bookkeeping to figure out if we're at the end of the loop...
  jne start

This is now hand-optimized; I don't necessarily expect compilers to achieve exactly that, but they shouldn't be far off:
Without i1 annotation or analysis, you'd need a third addition to handle the two possible carry bits.
If you can't get a temp register to write into early, you have to replace xor; setb with setb; movzx; then it also doesn't matter whether %c is processed first or last.

@alexcrichton
Copy link
Collaborator Author

Ok, so to recap my understanding (sorry if I'm restating the obvious, this is a long thread and my head can only hold so much):

  • With add_wide3 there is no "single instruction" on native architectures or in most programming languages, but sufficiently smart compilers can probably emit it (e.g. LLVM pattern-matching add128(add128(..),..), which I've done already). This achieves the result where the hot loop body is a single add_wide3 wasm instruction, and there is no difficulty translating that to a template of add; setb; movzx; add; adc (in both Wasmtime and V8; I'd have to do work in Wasmtime, but adding an add_wide3 IR node is easy)
  • With add_wide3 alone this does NOT achieve the theoretical optimal loop because the third input is i64, not i1. Achieving an optimal loop requires additional analysis which you've prototyped in V8.
  • Native code is itself already non-optimal (i.e. not at the theoretical optimum), and is not the "gold standard" to cross-reference
  • Both of us (I think) consider it unreasonable to expect IR nodes to be able to keep the carry flag in EFLAGS between IR nodes, instead the inputs/outputs of any IR node are always general-purpose registers.
    • A small corollary to this is that the primitives of add_wide3 with an i1 input are uadd_overflow and uadd_overflow_cin. If these two instructions were the foundation instead of add_wide3, then getting optimal codegen would require fusing uadd_overflow_cin(uadd_overflow(...), ..) into an add_wide3 instruction. This "fusion" is hard for Wasmtime/Cranelift, but isn't necessary if add_wide3 is the wasm primitive.
  • The only two possibilities being considered at this point are add128 alone or add128 + add_wide + add_wide3. My old versions of i64.add_{overflow,with_carry}_u are not contenders at this point.

Or at least that's my current attempt to summarize facts without any opinions. My hope is that you now have a better understanding of why, when I chose the abstractions of i64.add_{overflow,with_carry}_u, performance was not great and I consequently moved to add128. Performance relies on keeping the carry flag in EFLAGS between wasm instructions. That in turn requires some sort of fusion to add_wide3, which I did not see how to implement in Wasmtime or Cranelift.

Going back to a question I had above though:

it sounds like you actually want uadd_overflow_cin which is pre-constrained to take an i1 input. Why do you want add_wide3 when it's not even what bigint case wants? (in that you want a more specific version where the third operand is a i1).

I still don't understand why you think we should have add_wide3. My understanding is:

add128(a, 0, b, 0) ≡ add_wide(a, b)
add128(add128(a, 0, b, 0), c, 0) ≡ add_wide3(a, b, c)

Given this equivalence it's trivial for LLVM to emit add_wide and add_wide3, so I'm not worried about that. Engines could go from add128 to add_wide/add_wide3 as well as needed, but wouldn't need to as it's reasonable to expect toolchains to emit this.

All that said, we're also not (as far as I know) discussing removing add128. That's coupled with the fact that I'm under the impression you don't actually want either add_wide or add_wide3 with i64 inputs/outputs, but instead versions with i1. If that's true, would you still want these instructions with i64 types instead of i1?

Personally, given that I really don't want to add a new value type to wasm, and given the trivial equivalences here, it feels like six of one, half a dozen of the other. I don't have a strong reason to leave them out, but I also don't think there's a strong reason to keep them. For example, leaving them out cuts down on opcodes, while keeping them in brings some slight code-size benefits. Neither of these feels strong enough to give a definitive answer either way, so I'd lean towards the conservative side where they can be added in the future with i1 signatures if an i1 type is added.

@jakobkummerow
Copy link

Yes, that all matches my thoughts, with very few clarifying footnotes:

Whether "native code is optimal" or not depends on which specific Rust snippet you compile with rustc/LLVM. Some of them come out closer to optimal than others.

As potential Wasm instructions, add_wide3 and add_with_carry_u have the same signature and the same behavior. (And the same equality holds for add_wide vs add_overflow_u.) So between the instructions themselves, there is no difference; the difference that matters is how they're used: the key point is that Wasm module producers shouldn't attempt to model machine instructions; instead they should consider the instruction a high-level primitive that will be lowered to a handful of immediately consecutive machine instructions by the engine.

For the usefulness of add_wide3 in addition to add128, I don't disagree about the facts; it just so happens that my conclusion is a weak leaning towards having it for the minor benefit it brings. But I can also live with having only add128 for now. I do think that add_wide3 with i64 inputs is better than not having it; however the idea of leaving it out for now and then adding it later with an i1 input, which would make it even better, is appealing.

@alexcrichton
Copy link
Collaborator Author

Oh, for add_with_carry and add_overflow they were subtly different: the overflow flag was modeled as i32. Both were guaranteed to produce 0: i32 or 1: i32, but the carry input to add_with_carry_u was interpreted as "zero or nonzero". In essence it was a poor man's i1 without actually adding i1, and it made using them quite awkward. Notably, though, I never prototyped something like add_wide3 where the semantics are "add all 3 full-width numbers and return a full-width carry".

Overall though would you feel ok deferring add_wide3 and add_wide to a future where i1 were added? That would basically mean reviving the original instructions I worked with but with i1 in the signatures instead of i32 or i64.

@jakobkummerow
Copy link

Yeah, sure.

Realistically there's a good chance that that future will never happen, and then it'll be sad to not have add_wide3 at all; but my prototype has shown that V8 can get the performance it wants via engine smarts, and I should now stop spending my time arguing for things that would help other implementations.

@alexcrichton
Copy link
Collaborator Author

alexcrichton commented Mar 21, 2025

Ok I've opened up #36 to capture the discussion here and I've annotated that to close this issue as well. I'll leave that open for a week or so to let folks comment on it.

alexcrichton added a commit to alexcrichton/llvm-project that referenced this issue Mar 21, 2025
This commit is the result of investigation and discussion on
WebAssembly/wide-arithmetic#6 where alternatives to the `i64.add128`
instruction were discussed but ultimately deferred to a future proposal.
In spite of this though I wanted to apply a few changes to the LLVM
backend here with `wide-arithmetic` enabled for a few minor changes:

* A lowering for the `ISD::UADDO` node is added which uses `add128`
  where the upper bits of the two operands are constant zeros and the
  result of the 128-bit addition is the result of the overflowing addition.
* The high bits of a `I64_ADD128` node are now flagged as "known zero"
  if the upper bits of the inputs are also zero, assisting this `UADDO`
  lowering to ensure the backend knows that the carry result is a 1-bit
  result.

A few tests were then added to showcase various lowerings for various
operations that can be done with wide-arithmetic. They don't all
optimize super well at this time but I wanted to add them as a reference
here regardless to have them on-hand for future evaluations if
necessary.
alexcrichton added a commit to alexcrichton/llvm-project that referenced this issue Mar 21, 2025
alexcrichton added a commit to alexcrichton/llvm-project that referenced this issue Mar 31, 2025
dschuff pushed a commit to llvm/llvm-project that referenced this issue Mar 31, 2025
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this issue Mar 31, 2025