Skip to content

Remove i64.mul128, add i64.mul_wide_{s,u} #13

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Sep 10, 2024
Merged

Conversation

alexcrichton
Copy link
Collaborator

Some recent benchmarking had a surprising result I wasn't trying to dig for. Notably as summarized in #11 some more use cases of widening multiplication being more optimal than 128-by-128 bit multiplication have started to arise. Coupled with local benchmarking confirming that both on x64 and aarch64 that widening multiplication has more support in LLVM for more optimal lowerings and was easier to implement in Wasmtime than the 128-by-128 bit multiplication once various optimizations were implemented.

In the end i64.mul128, which was primarily motivated by "feels cleaner" and "should have the same performance" as widening multiplication, does not appear to have the expected performance/implementation tradeoff. Getting an as-performant i64.mul128 instruction relative to i64.mul_wide_{s,u} has required more work than expected and so the balance of concerns has me now tipping away from i64.mul128, despite it being "less clean" compared to the add/sub opcodes proposed in this PR.

Closes #11

Some recent [benchmarking] had a surprising result I wasn't trying to
dig for. Notably as summarized in #11 some more use cases of widening
multiplication being more optimal than 128-by-128 bit multiplication
have started to arise. Coupled with local benchmarking confirming that
both on x64 and aarch64 that widening multiplication has more support in
LLVM for more optimal lowerings and was easier to implement in Wasmtime
than the 128-by-128 bit multiplication once various optimizations were
implemented.

In the end `i64.mul128`, which was primarily motivated by "feels
cleaner" and "should have the same performance" as widening
multiplication, does not appear to have the expected
performance/implementation tradeoff. Getting an as-performant
`i64.mul128` instruction relative to `i64.mul_wide_{s,u}` has required
more work than expected and so the balance of concerns has me now
tipping away from `i64.mul128`, despite it being "less clean" compared
to the add/sub opcodes proposed in this PR.

Closes #11

[benchmarking]: #6 (comment)
@alexcrichton alexcrichton merged commit 91e8f31 into main Sep 10, 2024
@alexcrichton alexcrichton deleted the remove-mul128 branch September 10, 2024 17:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Is mul128 going to be as fast as widening multiplication?
1 participant