-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alternative or addition: instructions with overflow/carry flags #6
Comments
I've opened #10 with some more discussion behind the background of this proposal. It notably outlines the possible shape of these instructions plus why they had bad performance in Wasmtime with the initial implementation. |
One concern I might have with going exclusively with overflow/carry flags is that platforms like RISC-V don't have the same instructions as x86/aarch64 and they might be slower. For example implementing |
It is more likely to be a problem with the design of RISC-V ISA itself, as suggested by this discussion. It could be a big sacrifice for WASM if we give up such common operations on other hardware platforms, just because RISC-V people currently don't care about this. |
That's a good point! I should clarify that it's something I think is worth measuring, but I don't think it's a showstopper preventing such an alternative such as this. Orthogonally I wanted to flesh out some more numbers on this of what I'm measuring locally. Using the benchmarks in https://github.com/alexcrichton/wasm-benchmark-i128 I measured the slowdown relative-to-native of wasm today, with this proposal as-is with 128-bit ops, and then with this alternative instead of overflow/carry flags (plus
Summary,:
The most notable result is that add128 is performing much better in addition operations than overflow/carry operations. This may be counterintuitive to expectation. I've outlined the reason as to why in the explainer a bit but to summarize that here the most general lowerings of overflow/carry ops require moving the carry flag out of the EFLAGS register and back into it. That in turn causes significant slowdown if it's not optimized away, and Wasmtime/Cranelift do not have a means of optimizing it away. In the 128-bit op case that more closely matches the original program semantics |
Some recent [benchmarking] had a surprising result I wasn't trying to dig for. Notably as summarized in #11 some more use cases of widening multiplication being more optimal than 128-by-128 bit multiplication have started to arise. Coupled with local benchmarking confirming that both on x64 and aarch64 that widening multiplication has more support in LLVM for more optimal lowerings and was easier to implement in Wasmtime than the 128-by-128 bit multiplication once various optimizations were implemented. In the end `i64.mul128`, which was primarily motivated by "feels cleaner" and "should have the same performance" as widening multiplication, does not appear to have the expected performance/implementation tradeoff. Getting an as-performant `i64.mul128` instruction relative to `i64.mul_wide_{s,u}` has required more work than expected and so the balance of concerns has me now tipping away from `i64.mul128`, despite it being "less clean" compared to the add/sub opcodes proposed in this PR. Closes #11 [benchmarking]: #6 (comment)
I've posted some more thoughts to record in the overview in #18 |
Basically, I agree with your opinion. The correct modeling of how the carry flag works in the compiler is rare. There is a trade-off between less effort and optimal performance. The instructions to be added can not be perfect semantically, especially considering there aren't many people dedicated to the work. The current speedup is already considerable, so please go ahead and push this proposal upstream! |
I'm not an x86 person, so could you clarify how your are lowering these operations? Can't a 'wide' add be lowered to I mainly ask because the |
Hello! And sure! I might also point towards the overview docs which is a summary of the phase 2 presentation I last gave to the CG. You're correct that |
Ah yes, I understand now. Flags are a pain... :) Thanks. |
While I'm strongly in favor of making wide arithmetic faster in Wasm, I continue to be unconvinced by the claim that " The "bignum" use case is, in general and regardless of specific library or language, an addition loop: given two vectors A, B of integers, add them element-by-element, propagating the carry, storing each result element in a result vector S (much like we all did by hand in elementary school). The integers will typically have register width, i.e. be
So, how can we represent A.
For easier comparison with the following options, let's write this in one line and with abbreviated names:
That works fine, but the nesting is unsightly, leading to an idea to streamline that: B.
That's more elegant, but why should we have to emit so many zero-constants? Let's streamline those too, leading to the next idea: C.
D. Notably, all four options should get compiled to the same machine instruction sequence. In fact, engines have the hardest job for (A): they need to special-case the fact that two I think all options are implementable. I do think that (A) is significantly more complex, because it has to be prepared to handle the case where none of the inputs are constant Of course, if you pick addition of two 128-bit integers with ignored/truncated overflow into the 129th bit as the example you study, then (A) will be the best fit, because it does exactly that. But for the "bignum" use case (as exemplified in the "fib_10000" benchmark quoted above), there should be no performance difference between all four options -- if there is one, then something's wrong with the implementation(s) of toolchains or engines or both. In summary, I see a pleasing symmetry here: My pseudocode example above assumed for simplicity that both inputs have the same vector length. If they have unequal lengths, then a two-operand addition with wide output is useful too. So, my suggestion for maximum performance and simplicity would be to add the following six instructions:
(I didn't see any detailed discussion of |
To clarify I'm not claiming that this proposal as-is is the fastest of all possible designs/use-cases. I've primarily been comparing against the idea of representing the overflow flag as a value and that's what's been causing such a slowdown. Your proposed instructions don't do that, however, so I think it's very much worth considering! For reference the code that I've been benchmarking (with Native x64 compilationfull output on godbolt but the main loop is:
Wasm code generationfull output online but the main loop is:
Compiled code on x64 with WasmtimeHere I've just snipped the main loop, not the whole binary.
You're correct that the inner loop here is Basically I wanted to point out that at least from a purely technical point of view For what you're proposing I want to confirm, are you thinking that the translation for
If so that seems reasonable to me and it avoids the problem of dealing with flags-as-a-value where other drafts of this proposal have tried to interpret the One problem is that the In contrast with So overall, again assuming I've understood the lowering of As a final note, I also ideally don't want to over-rotate on bignum arithmetic as well. Lowering an optimal sequence of basic 128-bit addition is naturally easy with As a side note:
I've kept around |
Yes, with enough effort all four options I described lead to the same generated code.
Yes, modulo perhaps replacing
Yes. The pattern they'd match is In other words, the #[inline]
fn adc(a: BigDigit, b: BigDigit, c: BigDigit, acc: &mut BigDigit) -> BigDigit {
let mut tmp = c as DoubleBigDigit;
tmp += a as DoubleBigDigit;
tmp += b as DoubleBigDigit;
let lo = tmp as BigDigit;
*acc = tmp >> BITS;
lo
} then it would more obviously equal the
Generally, Wasm tries to do as much work as possible as early as possible, i.e. it prefers doing things in toolchains rather than in engines (when possible). There's an explicit goal that it should be possible to implement simple engines that still perform well. As an alternative phrasing, it should be possible to implement engines with high peak performance and low latency -- because most optimizations are done ahead of time.
I do think that bignum is the primary use case. In fact, while I'm sure it exists somewhere, I don't recall ever encountering an "actual i128" use case myself. But that said:
is very straightforward on the producer side:
That's even pretty concise in actual wat syntax: (func add128 (param i64 i64 i64 i64) (result i64 i64)
local.get 0
local.get 2
i64.add_wide ;; value stack: res_lo, tmp_hi
local.get 1
local.get 3 ;; value stack: res_lo, tmp_hi, a_hi, b_hi
i64.add3_wide ;; value stack: res_lo, res_hi, discard
drop
) A baseline/single-pass compiler clearly won't emit the best possible machine code for this, but that's okay (firstly because baseline compilers are almost never optimal, and secondly because this use case is likely rare). Come to think of it, this could even be done peephole-style very early on, possibly even in the decoder or its interface to the baseline compiler (in V8 we already do similar tricks there to optimize That said, I also wouldn't mind having both ( |
Do you have a source link on-hand for that? I'd like to add that to the reperotoire of "things to study". Or do you mean V8 has the equivalent of
Oh I'm no stranger to this. I'm talking from a practical "this is all working today" point of view. The work required on behalf of Cranelift is objectively very small to pattern match. I am (again, possibly naively) assuming that the work required to have a few extra patterns in other compilers is also going to be quite small. I'm not proposing we require LICM to have fast wasm compilers, instead just a pattern match or two.
This section of the overview, the slides from the February CG meeting, and these slides from the CG meeting last October outline the case for why I think that the optimization you describe here I believe is infeasible. Empirically in Cranelift this optimization is not possible without significant refactoring/restructuring. Such refactoring is not worth the value of optimizing 128-bit addition we've concluded. You're definitely correct though in that other compilers might find such a transformation easy to implement. I mostly think that if performance relied on such an optimization happening we would find such a proposal not-useful in Wasmtime/Cranelift.
Personally I would also have no issue with this. It's something I've wondered as well where if we admit that the lowest-level primitive of |
I'd call it an exact equivalent of that To save you the trouble of looking up the typedefs: |
Thanks! I've created a dedicated issue for evaluating these instructions. It'll take me some time to get around to doing this in LLVM. |
It turns out there is actually value to an As discussed above, the loop body of a bignum addition needs to sum up three values in each iteration. If these three have unconstrained
So we could type the |
Oh I don't disagree that I don't mean to say it's not useful though, just that the calculus of complexity/performance tradeoff is something where I don't think it's quite worth it at this time. |
You do want the I'm not saying we need to add |
Right yeah I definitely agree that engines could make use of Although you've helped me crystallize another constraint I don't think I've necessarily written down:
I've been additionally thinking about this proposal from the perspective of engines should not need any sort of new analysis pass to make these instructions fast. My goal has been to integrate into existing engines easily with carefully selected instructions and still close the native/wasm performance gap significantly. For example that's what ended up panning out in Wasmtime, just Put another way, if ONLY |
In order to emit optimal code (which is a more ambitious goal than to "close the perf gap significantly"), My prototype of an Now I can quantify its impact. Explanation of the columns:
(Engine: V8 random daily snapshot shortly before 13.5 branch + early prototype implementation of this proposal, Compared to the status quo, anything's a huge improvement. But if we want to aim for optimal code, the I think this table definitively puts to rest last year's claim that " |
Thanks for prototyping and gathering numbers! To clarify, are you saying that If you're thinking though that |
To clarify: I think it would be technically ideal to have My thinking on The rightmost three columns in my table all used the module you provided, i.e. the wire bytes contained only |
Ah ok, interesting! I was actually starting to wonder what you were benchmarking there but starting from "only On the topic of
Of course this can all be done, but at the same time you're saying that you're starting with On the topic of I realize you're saying that V8 probably doesn't care if it's Personally I wouldn't feel too swayed by binary size here. I don't disagree that |
Also, on |
It was the only module I had available.
That is exactly the crux of the matter: it produces an
That's difficult to imagine. Supporting an |
I apologize if I sound like a broken record, but this is what I linked to above, notably this section of the overview coupled with this historic CG presentation. Notably the slide with this table: ![]() I implemented the rough equivalent of I know I've basically been repeating this many many times to you but I'm not really sure how else to say it? I can try to dig up the branches and show you why it's slow but that would both take a bit to dig up and you'd have to boot up on Wasmtime/Cranelift to understand why generating more optimal code is hard. An alternative is I could try to better understand what V8 is doing and try to understand why it's easy in V8. Do you have a link to a diff for the work you've done that I can review? |
I know you've said all that before, and I'm sorry that I'm probably coming across as just not listening; I just can't reconcile it with my own implementation experience. I intentionally go from two nested (You can count the I don't have a good guess how to explain the difference in our respective experiments. One detail I've noticed before in your machine code snippets is an explicit instruction to put the carry value back into the flags register. That seems inefficient, and I don't understand why a compiler would do that, and I'd be quite surprised if that's what Cranelift emitted for My prototype compiles
and that's pretty much hard-coded in the code generator (i.e. the logic is basically "see the In your experiment, did you replace each
then I could certainly see how that'd be much slower than a single Also, without having looked at its code, I'm pretty sure that Cranelift internally replaces I'll get my hacky experimental code uploaded when I get a chance so you can take a look at it in full detail. |
OK, here you go: patch Applies to current tip-of-tree V8, if you want to compile it. Only x64 is implemented (and even there, Quick guidance:
|
Thanks for the link and the explanations! That definitely all appears as I'd expect, and while I don't fully grok the i1 optimization the purpose of removing an I believe the answer to your confusion lies in the primitives that the wasm I originally tested was using and what was implemented in Cranelift. I outlined on this slide but what I was testing was:
The primitives in Cranelift I used at the time were, in V8 terms, IR nodes of Does that help explain the discrepancies in performance? Basically in my experience the performance of a compiler here is heavily dependent on how overflow-flag-using operations are modeled. I have no doubt that if I implemented The benchmark here, bignum addition, is basically FWIW a few days ago on my llvm/rust/wasmtime forks I filled out more implementation, notably I got LLVM to do that double-add128-fold-to-add_wide3 optimization so the input wasm itself had add_wide3. In Wasmtime I translated that right back to add128(add128(..), ..) and got tests passing and the benchmark ran the same as before. The same-performance was expected though given the implementation. One thing I'm a bit hung up on and haven't bottomed out: why did you do the i1 optimization? Did you find a tight loop that had |
Yeah, I know it needs a descriptive comment :-)
It does this by first finding all
I mostly did it for exploratory reasons: I wanted to see what it takes to generate optimal code. Practical relevance is given by every single bigint addition loop, e.g. the one you linked before. Let me rephrase that slightly for clarity:
So With the state of this proposal as it is, the emitted Wasm for the loop body will be
Not really, because:
My turn to sound like a broken record: "
Yeah, I'm not combining them either.
No problem; as my earlier results show, it doesn't have a huge impact. |
Ok I've recompiled all the old bits and here's what I have. Inner loop source codeThe rust source code for the inner loop is here - https://github.com/dignifiedquire/num-bigint/blob/699574df3ae665a6d394c6bc94d6d151aa063c25/src/algorithms/add.rs#L13-L35 Native disassembly of inner loop
add128: Inner loop wasmloop ;; label = @13
local.get 7
local.get 12
i64.const 0
local.get 7
i64.load
i64.const 0
i64.add128
local.get 15
i64.load
i64.const 0
i64.add128
local.set 12
i64.store
local.get 7
i32.const 8
i32.add
local.tee 16
local.get 12
i64.const 0
local.get 16
i64.load
i64.const 0
i64.add128
local.get 15
i32.const 8
i32.add
i64.load
i64.const 0
i64.add128
local.set 12
i64.store
local.get 15
i32.const 16
i32.add
local.set 15
local.get 7
i32.const 16
i32.add
local.set 7
local.get 5
local.get 13
i32.const 2
i32.add
local.tee 13
i32.ne
br_if 0 (;@13;)
end add128: Inner loop native disassembly
overflow: Inner loop wasm
overflow: Inner loop disassembly
The add128 loop is 16% slower than native. The overflow loops is 101% slower than native. My rationale for why the overflow loop is so much slwoer is that the codegen is terrible, the overflow flag keeps getting moved in and out of EFLAGS. Why? It's because Wasmtime and Cranelift generate code for each wasm instruction effectively in isolation in this case. The carry flag in EFLAGS is defined in one wasm instruction and consumed in another. Wasmtime and Cranelift have no means of seeing that the definition of the carry flag is adjacent to the use therefore there's no need to take it out of a register. The reason add128 is fast is the carry flag is never live between wasm instructions. All it takes is a lowering rule or two to fit constants into |
Also if it helps overflow.wasm.gz is the wasm file I was benchmarking for "overflow". I was evaluating a lot of different things in flight so the opcodes aren't the same. Notably this is the decoding, this is the validation, and this is the translation to Cranelift |
I'm not doubting that you saw what you saw. I think the discrepancy between your experimental results and mine boils down to one crucial difference:
Interestingly and confusingly, in both of these perspectives the signature of the operation in question is I think the biggest reason for the performance differences we saw is actually not the flags register (although that probably contributes), but rather suboptimal module generation: So, I think to get your results to match mine, two things would be required:
I guessed before that Cranelift's existing
Codegen sidenote: the sequence
really makes very little sense. If you have a carry bit from a previous operation in
Well, in a sequence
the value But sure, the general point stands that having way too many |
You're right yeah the wasm itself is not very well optimized (nor lowering patterns for To clarify though your translation of
Notably After applying some basic LLVM optimizations the new wasm is: take 2: overflow loop
It's still unrolled twice with 3 adds per unroll (total 6). The two add_overflow_u + add at the end is the equivalent of add_wide3, so I believe this is a "more basic" translation of add_wide3. The generated machine code is: take 2: overflow machine code
This has the "most naive lowering" here which is each Basically what I'm getting at here, which is still true of the original prototype, is that if performance relies on the overflow bit being carried between wasm instructions I don't think this proposal is going to work. The add_wide3 instruction works because, as you say, it keeps the overflow bit within the instruction. One thing I still don't understand: why did you get speedup with your i1 optimization? This loop doesn't benefit or trigger i1 optimizations? Perhaps measurement noise of up to 10% (which wouldn't surprise me, it's a noisy microbenchmark). Also, to clarify, none of this is working with add_wide3. LLVM has no knowledge of it nor does Wasmtime. In this "overflow" world Wasmtime has no nowledge of 128-bit arithmetic either, only LLVM does. In some sense nothing has add_wide3 as a primitive operation, not x64, LLVM, Rust, or wasm (the variant I'm testing here). Personally I still feel very lost with respect to conclusions on this thread. None of this feels like it motivates
The only major open question I have is:
Do you have other conslusions at this point though? |
You're right, I realized after posting it that I gave the
That should result in
Looks better, but seems to get only halfway there. You really want to have one
Yes, it does. Every bigint addition loop I've ever seen does. They all start with I'm sure it wasn't measurement noise. It was pretty stable across several runs, fluctuating by 1-2% at most. (Some of the other tests fluctuate much more wildly.)
My take:
Taking a closer look at the native codegen, here's an interesting observation: the specific example you shared before is written in a way that makes rustc/LLVM not emit optimal code. In fact, we can use it to illustrate the difference between |
To clarify, though, avoiding the
I don't think that's quite right? The loop is still unrolled twice from the original source code. That means there's two conceptual Specifically the heavy lifting here is that if you add a carry flag to another you can use
Ah yes good point, but at the same time there's no way for the loop to be a single In the end though it's basically impossible for the bigint summation loop to be a single
The loops there are relatively bad in the sense the have two branches inside them, one to exit at the top and one to jump back at the bottom. You can manually disable unrolling with llvm flags which gets the loops back with full optimizations but no unrolling to make it easier to read. One thing I'm not clear on - are you saying that
Assuming
Which for x64 is something like:
What I don't understand is what you're assuming those two comments are. Everything I've seen historically is that performance is critical depending on what those two comments are and it's why LLVM is emitting the code it is. |
I see. Clearly I don't have a good understanding of Cranelift. First I thought
Yeah, to make it easy for an engine to emit optimal code, you want two
I fully agree, the loop body cannot be a single
Thanks, good to know.
On the Wasm level, in Rust syntax, I ideally want to see:
Which, on x64 (following your syntax of Intel operand order with AT&T register names, i.e.
This is now hand-optimized; I don't necessarily expect compilers to achieve exactly that, but they shouldn't be far off: |
Ok so to recap my understanding (sorry if I'm restating the obvious, this is a long thread and my head can only hold but so much):
Or at least that's my current attempt to summarize facts without any opinions. My hope is that you have a better understanding at this point of when I chose the abstractions of Going back to a question I had above though:
I still don't understand why you think we should have
Given this equivalence it's trivial for LLVM to emit All that said we're also not (or as far as I know) discussing removing Personally given that I really don't want to add a new value type to wasm and the trivial equivalences here it feels six-to-one-half-dozen-or-the-other here. I don't have a strong reason to leave them out, but I also don't think there's a strong reason to keep them. For example leaving them out cuts down on opcodes. Keeping them in brings some slight size benefits. I feel like neither of these is really strong enough to give a definitive answer either way, so I'd lean more towards the conservative side where they can be added in the future with |
Yes, that all matches my thoughts, with very few clarifying footnotes: Whether "native code is optimal" or not depends on which specific Rust snippet you compile with rustc/LLVM. Some of them are more optimal than others. As potential Wasm instructions, For the usefulness of |
Oh for add_with_carry and add_overflow they were subtly different, the overflow flag was modeled as Overall though would you feel ok deferring |
Yeah, sure. Realistically there's a good chance that that future will never happen, and then it'll be sad to not have |
Ok I've opened up #36 to capture the discussion here and I've annotated that to close this issue as well. I'll leave that open for a week or so to let folks comment on it. |
This commit is the result of investigation and discussion on WebAssembly/wide-arithmetic#6 where alternatives to the `i64.add128` instruction were discussed but ultimately deferred to a future proposal. In spite of this though I wanted to apply a few changes to the LLVM backend here with `wide-arithmetic` enabled for a few minor changes: * A lowering for the `ISD::UADDO` node is added which uses `add128` where the upper bits of the two operands are constant zeros and the result of the 128-bit addition is the result of the overflowing addition. * The high bits of a `I64_ADD128` node are now flagged as "known zero" if the upper bits of the inputs are also zero, assisting this `UADDO` lowering to ensure the backend knows that the carry result is a 1-bit result. A few tests were then added to showcase various lowerings for various operations that can be done with wide-arithmetic. They don't all optimize super well at this time but I wanted to add them as a reference here regardless to have them on-hand for future evaluations if necessary.
This commit is the result of investigation and discussion on WebAssembly/wide-arithmetic#6 where alternatives to the `i64.add128` instruction were discussed but ultimately deferred to a future proposal. In spite of this though I wanted to apply a few changes to the LLVM backend here with `wide-arithmetic` enabled for a few minor changes: * A lowering for the `ISD::UADDO` node is added which uses `add128` where the upper bits of the two operands are constant zeros and the result of the 128-bit addition is the result of the overflowing addition. * The high bits of a `I64_ADD128` node are now flagged as "known zero" if the upper bits of the inputs are also zero, assisting this `UADDO` lowering to ensure the backend knows that the carry result is a 1-bit result. A few tests were then added to showcase various lowerings for various operations that can be done with wide-arithmetic. They don't all optimize super well at this time but I wanted to add them as a reference here regardless to have them on-hand for future evaluations if necessary.
This commit is the result of investigation and discussion on WebAssembly/wide-arithmetic#6 where alternatives to the `i64.add128` instruction were discussed but ultimately deferred to a future proposal. In spite of this though I wanted to apply a few changes to the LLVM backend here with `wide-arithmetic` enabled for a few minor changes: * A lowering for the `ISD::UADDO` node is added which uses `add128` where the upper bits of the two operands are constant zeros and the result of the 128-bit addition is the result of the overflowing addition. * The high bits of a `I64_ADD128` node are now flagged as "known zero" if the upper bits of the inputs are also zero, assisting this `UADDO` lowering to ensure the backend knows that the carry result is a 1-bit result. A few tests were then added to showcase various lowerings for various operations that can be done with wide-arithmetic. They don't all optimize super well at this time but I wanted to add them as a reference here regardless to have them on-hand for future evaluations if necessary.
This commit is the result of investigation and discussion on WebAssembly/wide-arithmetic#6 where alternatives to the `i64.add128` instruction were discussed but ultimately deferred to a future proposal. In spite of this though I wanted to apply a few changes to the LLVM backend here with `wide-arithmetic` enabled for a few minor changes: * A lowering for the `ISD::UADDO` node is added which uses `add128` where the upper bits of the two operands are constant zeros and the result of the 128-bit addition is the result of the overflowing addition. * The high bits of a `I64_ADD128` node are now flagged as "known zero" if the upper bits of the inputs are also zero, assisting this `UADDO` lowering to ensure the backend knows that the carry result is a 1-bit result. A few tests were then added to showcase various lowerings for various operations that can be done with wide-arithmetic. They don't all optimize super well at this time but I wanted to add them as a reference here regardless to have them on-hand for future evaluations if necessary.
…430) This commit is the result of investigation and discussion on WebAssembly/wide-arithmetic#6 where alternatives to the `i64.add128` instruction were discussed but ultimately deferred to a future proposal. In spite of this though I wanted to apply a few changes to the LLVM backend here with `wide-arithmetic` enabled for a few minor changes: * A lowering for the `ISD::UADDO` node is added which uses `add128` where the upper bits of the two operands are constant zeros and the result of the 128-bit addition is the result of the overflowing addition. * The high bits of a `I64_ADD128` node are now flagged as "known zero" if the upper bits of the inputs are also zero, assisting this `UADDO` lowering to ensure the backend knows that the carry result is a 1-bit result. A few tests were then added to showcase various lowerings for various operations that can be done with wide-arithmetic. They don't all optimize super well at this time but I wanted to add them as a reference here regardless to have them on-hand for future evaluations if necessary.
Perhaps the primary alternative to this proposal that would try to solve the same original problem would be to add instructions that manipulate/expose the overflow flag directly from native hardware. This is lower level than 128-bit operations themselves and the theory is that 128-bit operations could be built from these lower level instructions. The original proposal explicitly did not propose this due to complexity of optimizing these instructions and the ease of achieving performance on desired benchmarks with 128-bit operations. In the CG meeting however it was brought up that these lower-level primitives might be a better choice.
This issue is intended to be a discussion location for continuing to flesh out the flags-vs-128-bit-ops story. More thorough rationale will be needed in later phases to either explicitly use flag-based ops or not. This will likely be informed through experience in implementations in other engines in addition to performance numbers of relevant benchmarks.
The text was updated successfully, but these errors were encountered: