Add SIMD optimization for int_to_float conversion #580
base: main
Conversation
Add SIMD fast paths for converting custom bit-depth floats to f32:

- 32-bit float passthrough: simple bitcast using SIMD.
- 16-bit float (f16, half-precision): SIMD conversion with a scalar fallback for subnormal values.

The 16-bit float SIMD path handles normal, zero, and inf/NaN cases directly, falling back to scalar for the rare subnormal case, which requires variable-iteration normalization.

Also adds a `BitDepth::f16()` test helper and comprehensive unit tests for the conversion functions.
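As context for the fallback described above, here is a minimal scalar sketch of the binary16 → f32 conversion, including the variable-iteration normalization loop for subnormals that the SIMD path avoids. This is an illustrative standalone function, not the PR's actual implementation:

```rust
/// Convert IEEE 754 binary16 bits (1 sign, 5 exponent, 10 mantissa bits)
/// to f32. Illustrative sketch, not the PR's code.
fn f16_bits_to_f32(bits: u16) -> f32 {
    let sign = ((bits as u32) >> 15) << 31;
    let exp = ((bits >> 10) & 0x1f) as u32;
    let frac = (bits & 0x03ff) as u32;
    let bits32 = match (exp, frac) {
        (0, 0) => sign,                                 // signed zero
        (0x1f, 0) => sign | 0x7f80_0000,                // infinity
        (0x1f, _) => sign | 0x7f80_0000 | (frac << 13), // NaN (payload kept)
        (0, _) => {
            // Subnormal: renormalize by shifting the mantissa left until the
            // implicit leading bit appears -- a variable number of iterations,
            // which is why the SIMD path punts this case to scalar code.
            let mut e: i32 = -14; // exponent of all f16 subnormals
            let mut m = frac;
            while m & 0x0400 == 0 {
                m <<= 1;
                e -= 1;
            }
            m &= 0x03ff; // drop the now-implicit leading bit
            sign | (((e + 127) as u32) << 23) | (m << 13)
        }
        // Normal value: rebias the exponent from 15 (f16) to 127 (f32).
        _ => sign | ((exp + 112) << 23) | (frac << 13),
    };
    f32::from_bits(bits32)
}
```

All other cases (zero, inf/NaN, normal) are fixed bit manipulation and vectorize cleanly; only the subnormal arm contains a data-dependent loop.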
Benchmark @ fca2520
```rust
// SIMD 16-bit float (half-precision) to 32-bit float conversion
// This handles IEEE 754 binary16 format: 1 sign bit, 5 exponent bits, 10 mantissa bits
simd_function!(
```
I think I would prefer to have a pair of functions I32Vec::store_u16() and F32Vec::load_f16_bits() instead. Those functions can use _mm256_cvtph_ps on AVX2 (by also requiring the F16C target feature, which is common) and vcvt_f32_f16 on NEON (although the Rust definition erroneously requires the f16 target feature, so we'd have to use inline assembly for now -- fixed in rust-lang/stdarch#1978), and fall back to scalar on SSE4.2.
We could then add store_f16() -- implemented in a similar way -- and use that to speed up the f16 conversion code.
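A sketch of what the proposed F32Vec::load_f16_bits could look like on x86-64, written here as a free function for illustration (the name comes from the comment above; the signature and the scalar fallback are assumptions). It uses the real `_mm256_cvtph_ps` intrinsic behind runtime F16C detection and falls back to a scalar bit-level conversion elsewhere; the NEON path is omitted:

```rust
/// Scalar fallback: convert one binary16 value to f32 (subnormals included).
fn f16_bits_to_f32(bits: u16) -> f32 {
    let sign = ((bits as u32) >> 15) << 31;
    let exp = ((bits >> 10) & 0x1f) as u32;
    let frac = (bits & 0x03ff) as u32;
    let bits32 = match (exp, frac) {
        (0, 0) => sign,
        (0x1f, 0) => sign | 0x7f80_0000,
        (0x1f, _) => sign | 0x7f80_0000 | (frac << 13),
        (0, _) => {
            // Subnormal: variable-iteration renormalization.
            let (mut e, mut m) = (-14i32, frac);
            while m & 0x0400 == 0 {
                m <<= 1;
                e -= 1;
            }
            sign | (((e + 127) as u32) << 23) | ((m & 0x03ff) << 13)
        }
        _ => sign | ((exp + 112) << 23) | (frac << 13),
    };
    f32::from_bits(bits32)
}

/// Sketch of an F32Vec::load_f16_bits-style helper (name taken from the
/// review comment): eight f16 bit patterns -> eight f32 values.
fn load_f16_bits(src: &[u16; 8]) -> [f32; 8] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("f16c") && is_x86_feature_detected!("avx") {
            // SAFETY: F16C and AVX availability checked at runtime.
            unsafe {
                use std::arch::x86_64::{
                    __m128i, _mm256_cvtph_ps, _mm256_storeu_ps, _mm_loadu_si128,
                };
                let halves = _mm_loadu_si128(src.as_ptr() as *const __m128i);
                let floats = _mm256_cvtph_ps(halves); // hardware f16 -> f32
                let mut out = [0.0f32; 8];
                _mm256_storeu_ps(out.as_mut_ptr(), floats);
                return out;
            }
        }
    }
    // Scalar fallback (e.g. SSE4.2-only targets).
    let mut out = [0.0f32; 8];
    for (o, &h) in out.iter_mut().zip(src.iter()) {
        *o = f16_bits_to_f32(h);
    }
    out
}
```

Note that `_mm256_cvtph_ps` handles subnormal inputs in hardware, so the F16C path needs no separate subnormal fallback; only the scalar path reproduces that logic explicitly.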
SIMD fast paths for the `int_to_float` function, which converts custom bit-depth floats stored as i32 back to f32.

- 32-bit float: straightforward bitcast via SIMD.
- 16-bit float (f16): SIMD handles normal values, zeros, and inf/NaN. Subnormals fall back to scalar since they need a variable-iteration normalization loop.
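The SIMD/scalar dispatch hinges on spotting subnormal lanes. A portable sketch of that check (the function name is illustrative, not from the PR): a binary16 value is subnormal exactly when its exponent field is zero and its mantissa is nonzero, so a chunk can take the all-SIMD path only if no lane matches that pattern.

```rust
/// True if any lane in a chunk of binary16 bit patterns is subnormal
/// (exponent field == 0, mantissa != 0) -- the case the SIMD fast path
/// hands off to scalar code. Illustrative helper, not the PR's code.
fn chunk_has_subnormal(chunk: &[u16]) -> bool {
    chunk
        .iter()
        .any(|&b| (b & 0x7c00) == 0 && (b & 0x03ff) != 0)
}
```

In a vectorized implementation the same test is a pair of masked compares over the whole register, with a single branch on the combined mask deciding between the fast path and the scalar loop.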
Waiting for perf CI to see the impact.