Conversation

@hjanuschka hjanuschka (Collaborator) commented Dec 22, 2025

Adds SIMD fast paths to the int_to_float function, which converts custom bit-depth floats stored as i32 back to f32.

32-bit float: straightforward bitcast via SIMD.
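The 32-bit passthrough really is just a bit reinterpretation: the i32 already holds the f32 bit pattern. A minimal scalar sketch (the helper name is hypothetical, not the PR's actual code):

```rust
/// 32-bit passthrough: the i32 already carries the f32 bit pattern,
/// so the conversion is a pure bitcast (hypothetical helper name).
fn i32_bits_to_f32(v: i32) -> f32 {
    f32::from_bits(v as u32)
}

fn main() {
    assert_eq!(i32_bits_to_f32(0x3F80_0000), 1.0); // bit pattern of 1.0f32
    assert_eq!(i32_bits_to_f32(1.5f32.to_bits() as i32), 1.5);
    println!("ok");
}
```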

16-bit float (f16): SIMD handles normal values, zeros, and inf/nan. Subnormals fall back to scalar since they need a variable-iteration normalization loop.

Waiting for perf CI to see the impact.

Add SIMD fast paths for converting custom bit-depth floats to f32:
- 32-bit float passthrough: Simple bitcast using SIMD
- 16-bit float (f16/half-precision): SIMD conversion with scalar fallback
  for subnormal values

The 16-bit float SIMD path handles normal, zero, and inf/nan cases directly,
falling back to scalar for the rare subnormal case which requires
variable-iteration normalization.

Also adds BitDepth::f16() test helper and comprehensive unit tests for
the conversion functions.
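The case split described above (normal, zero, inf/NaN handled directly; subnormals via a variable-iteration loop) can be sketched in scalar Rust. This is an illustrative reconstruction of the binary16 -> binary32 bit manipulation, not the PR's actual code; the function name is hypothetical:

```rust
/// Scalar sketch of IEEE 754 binary16 -> binary32 conversion:
/// 1 sign bit, 5 exponent bits, 10 mantissa bits, bias 15 -> bias 127.
/// (Hypothetical helper, for illustration only.)
fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = (h as u32 & 0x8000) << 16;
    let exp = (h >> 10) & 0x1F;
    let mant = (h & 0x3FF) as u32;
    let bits = if exp == 0x1F {
        // Inf/NaN: map max exponent to the f32 max exponent, keep mantissa bits.
        sign | 0x7F80_0000 | (mant << 13)
    } else if exp != 0 {
        // Normal: re-bias the exponent (127 - 15 = 112), widen the mantissa.
        sign | ((exp as u32 + 112) << 23) | (mant << 13)
    } else if mant == 0 {
        // Signed zero passes through.
        sign
    } else {
        // Subnormal: the variable-iteration normalization loop that forces
        // the SIMD path to fall back to scalar.
        let mut e: u32 = 113; // f32 exponent field before any shifts
        let mut m = mant;
        while m & 0x400 == 0 {
            m <<= 1;
            e -= 1;
        }
        sign | (e << 23) | ((m & 0x3FF) << 13) // drop the implicit leading 1
    };
    f32::from_bits(bits)
}

fn main() {
    assert_eq!(f16_bits_to_f32(0x3C00), 1.0);
    assert_eq!(f16_bits_to_f32(0xC000), -2.0);
    assert!(f16_bits_to_f32(0x7C00).is_infinite());
    assert!(f16_bits_to_f32(0x7E00).is_nan());
    assert_eq!(f16_bits_to_f32(0x0001), 2f32.powi(-24)); // smallest subnormal
    assert_eq!(f16_bits_to_f32(0x8000), -0.0);
    println!("all conversions ok");
}
```

The subnormal branch shifts the mantissa left until its leading 1 reaches the implicit-bit position, decrementing the exponent each step; since the iteration count depends on the value, it does not vectorize cleanly, which is why the SIMD path punts those lanes to scalar.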
@github-actions

Benchmark @ fca2520


====================================================================================================
MULTI-FILE BENCHMARK RESULTS (4 files, 5 revisions)
  https://github.com/zond/jxl-perfhistory
  CPU architecture: x86_64
  WARNING: System appears noisy: high system load (2.14). Results may be unreliable.
====================================================================================================

Statistics:
  Revisions:                        5
  Confidence:                    99.0%
  Max relative error:             3.0%

[ 1] ca95cbd3 Merge fca2520d9603cee41c8f04102560191c0c57bee0 into 35103...
     vs fca2520d Fix clippy excessive precision warnings in f16 tests
----------------------------------------------------------------------------------------------------------------------------------------------------------------
     bike.jxl                   |                     ├────────────────│─┤                                         |     23268897 px/s | 0.995 / prev 
     green_queen_modular_e3.jxl |                                       ├╂│─┤                                      |      6149105 px/s | 1.003 / prev 
     green_queen_vardct_e3.jxl  |                       ├──────────│─────╂─┤                                       |     19659119 px/s | 0.984 / prev 
     sunset_logo.jxl            |                                      ├─╂│┤                                       |      2260622 px/s | 1.004 / prev 
                                  Scale: 0.89 to 1.11 (1.0 = same speed, '┃' marks 1.0)

[ 2] fca2520d Fix clippy excessive precision warnings in f16 tests
     vs b289ed54 Add SIMD optimization for int_to_float conversion
----------------------------------------------------------------------------------------------------------------------------------------------------------------
     bike.jxl                   |                         ├──────────────╂│────┤                                   |     23387634 px/s | 1.003 / prev 
     green_queen_modular_e3.jxl |                                   ├──│─┤                                         |      6133575 px/s | 0.995 / prev 
     green_queen_vardct_e3.jxl  |                                      ├─╂───│──┤                                  |     19974922 px/s | 1.012 / prev 
     sunset_logo.jxl            |                         ├─────────────│╂─┤                                       |      2252733 px/s | 0.997 / prev 
                                  Scale: 0.89 to 1.11 (1.0 = same speed, '┃' marks 1.0)

[ 3] b289ed54 Add SIMD optimization for int_to_float conversion
     vs 35103ba7 Bump version to 0.2.0.
----------------------------------------------------------------------------------------------------------------------------------------------------------------
     bike.jxl                   |                        ├────────────│──╂───┤                                     |     23317401 px/s | 0.993 / prev 
     green_queen_modular_e3.jxl |                       ├───│┤           ┃                                         |      6161652 px/s | 0.965 / prev ▼ slower (0.96 / prev)
     green_queen_vardct_e3.jxl  |                             ├────│─────╂──┤                                      |     19742170 px/s | 0.983 / prev 
     sunset_logo.jxl            |                                   ├───│┤                                         |      2260277 px/s | 0.996 / prev 
                                  Scale: 0.89 to 1.11 (1.0 = same speed, '┃' marks 1.0)

[ 4] 35103ba7 Bump version to 0.2.0.
     vs eaf69230 Implement `AtomicRefCell`
----------------------------------------------------------------------------------------------------------------------------------------------------------------
     bike.jxl                   |                           ├────────────╂──│───┤                                  |     23487626 px/s | 1.008 / prev 
     green_queen_modular_e3.jxl |                                        ┃      ├──│─┤                             |      6387331 px/s | 1.027 / prev ▲ faster (1.03 / prev)
     green_queen_vardct_e3.jxl  |                               ├────────╂─│───┤                                   |     20076927 px/s | 1.007 / prev 
     sunset_logo.jxl            |                            ├───────────╂│─┤                                      |      2268814 px/s | 1.004 / prev 
                                  Scale: 0.89 to 1.11 (1.0 = same speed, '┃' marks 1.0)

[ 5] eaf69230 Implement `AtomicRefCell` (oldest, baseline for comparisons)

================================================================================================================================================================


// SIMD 16-bit float (half-precision) to 32-bit float conversion
// This handles IEEE 754 binary16 format: 1 sign bit, 5 exponent bits, 10 mantissa bits
simd_function!(
@veluca93 veluca93 (Member) commented Dec 22, 2025
I think I would prefer to have a pair of functions, I32Vec::store_u16() and F32Vec::load_f16_bits(), instead. Those functions can use _mm256_cvtph_ps on AVX2 (by also requiring the F16C target feature, which is common) and vcvt_f32_f16 on NEON (although the Rust definition erroneously requires the f16 target feature, so we'd have to use inline assembly for now -- fixed in rust-lang/stdarch#1978), and fall back to scalar on SSE4.2.

We could then add store_f16() -- implemented in a similar way -- and use that to speed up the f16 conversion code.
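For reference, a minimal sketch of what a load_f16_bits-style helper could look like on x86-64 with F16C. The function name and slice-based signature here are assumptions for illustration, not the proposed API; note that the hardware instruction also converts subnormal inputs correctly, so this route would eliminate the scalar fallback entirely:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m128i, _mm256_cvtph_ps, _mm256_storeu_ps, _mm_loadu_si128};

/// Convert 8 packed binary16 bit patterns to f32 with one F16C instruction.
/// (Hypothetical helper, not the proposed API.)
/// Safety: the caller must verify the "f16c" feature at runtime.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "f16c")]
unsafe fn cvt8_f16_to_f32(src: &[u16; 8]) -> [f32; 8] {
    let halves = _mm_loadu_si128(src.as_ptr() as *const __m128i);
    let mut dst = [0f32; 8];
    _mm256_storeu_ps(dst.as_mut_ptr(), _mm256_cvtph_ps(halves));
    dst
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("f16c") {
            // 1.0, 2.0, +inf, -0.0, smallest subnormal, -1.0, 0.5, +0.0
            let src = [0x3C00u16, 0x4000, 0x7C00, 0x8000, 0x0001, 0xBC00, 0x3800, 0x0000];
            let dst = unsafe { cvt8_f16_to_f32(&src) };
            assert_eq!(dst[0], 1.0);
            assert_eq!(dst[1], 2.0);
            assert!(dst[2].is_infinite());
            assert_eq!(dst[4], 2f32.powi(-24)); // hardware handles subnormals too
            assert_eq!(dst[6], 0.5);
        }
    }
    println!("ok");
}
```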
