Hardware-accelerated inverse floating point square root #156
https://github.com/Aikku93 wrote a software divide that can keep up with the square root unit and wrote this implementation in assembly.
They asked me to make a pull request for it. This is a draft for now because I also need to update the hw_math test.
Some general information on the characteristics of this function:
Execution time seems to be about 37 bus cycles on DSi and 61-62 bus cycles on DS.
Accuracy should be within 1 ulp of the result you would get by doing the calculation with doubles and then casting to single precision, which should be more accurate than a simple 1.0f/sqrtf(x).
The result is exact (well, identical to that double-precision reference) for ~89% of possible mantissas.
Considerations: In this version the LUT is in main memory, but it may be desirable to place it in TCM on DS. I believe on DSi it doesn't matter where you put it, because the overall execution time is limited by the hardware square root unit. Alignment to cache lines could also be a consideration.
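As a rough illustration of the alignment point, a C-side table could be aligned to the 32-byte cache line of the ARM946E-S as sketched below. The table name `rsqrt_lut`, its entry count and its zero contents are placeholders, not the LUT from this PR; in the assembly source the equivalent would presumably be a `.balign 32` directive before the table.

```c
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32   /* ARM946E-S cache line size */
#define RSQRT_LUT_ENTRIES 64  /* hypothetical size, for illustration only */

/* Placeholder table: the contents are zeroed, only the alignment matters here. */
static alignas(CACHE_LINE_BYTES) const uint32_t rsqrt_lut[RSQRT_LUT_ENTRIES] = {0};

/* Returns nonzero if p starts on a cache line boundary. */
static int is_cache_line_aligned(const void *p) {
    return ((uintptr_t)p % CACHE_LINE_BYTES) == 0;
}
```

With the table starting on a line boundary and its size a multiple of the line size, it occupies the minimum number of cache lines.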