NEON-optimize 5-3 IDWT #1630

Merged
rouault merged 1 commit into uclouvain:master from nico:neon-reversible
Apr 24, 2026

nico commented Apr 12, 2026

Takes `bin/bench_dwt` from 1.618 s to 0.432 s on my system.


Sadly, despite being a much bigger win in `bench_dwt` time, it's a much smaller end-to-end improvement than #1629: for lossless files, less time is spent in the IDWT than in T1 / MQC decoding (as there's more data). Still, it seems nice to have a fast 5-3 IDWT anyway.
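
For readers who don't have the 5-3 transform in their head, here is a minimal scalar sketch of the reversible 5-3 lifting steps that the IDWT undoes. It is not the NEON code from this PR and not OpenJPEG's implementation: the function names (`sketch_dwt53_forward`, `sketch_dwt53_inverse`), the helpers, and the in-place interleaved layout (low-pass at even indices, high-pass at odd indices) are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Floor division by 2^shift, written so it does not depend on the
   implementation-defined result of >> on negative values. */
static int32_t floor_shift(int32_t v, int shift)
{
    if (v >= 0) {
        return v >> shift;
    }
    return -((-v + (1 << shift) - 1) >> shift);
}

/* Whole-sample symmetric extension: reflect an index that is one step
   outside [0, n) back inside. Only valid for n >= 2. */
static int mirror(int i, int n)
{
    if (i < 0) {
        return -i;
    }
    if (i >= n) {
        return 2 * (n - 1) - i;
    }
    return i;
}

/* Hypothetical forward 5-3 lifting, in place. Afterwards even indices
   hold low-pass samples and odd indices hold high-pass samples. */
void sketch_dwt53_forward(int32_t *x, int n)
{
    int i;
    assert(n >= 1);
    if (n < 2) {
        return; /* a single sample passes through unchanged */
    }
    for (i = 1; i < n; i += 2) { /* predict: odd minus average of even neighbors */
        x[i] -= floor_shift(x[i - 1] + x[mirror(i + 1, n)], 1);
    }
    for (i = 0; i < n; i += 2) { /* update: even plus rounded quarter of odd neighbors */
        x[i] += floor_shift(x[mirror(i - 1, n)] + x[mirror(i + 1, n)] + 2, 2);
    }
}

/* Hypothetical inverse 5-3 lifting: the same two steps, undone in
   reverse order with the signs flipped. */
void sketch_dwt53_inverse(int32_t *x, int n)
{
    int i;
    assert(n >= 1);
    if (n < 2) {
        return;
    }
    for (i = 0; i < n; i += 2) { /* undo update */
        x[i] -= floor_shift(x[mirror(i - 1, n)] + x[mirror(i + 1, n)] + 2, 2);
    }
    for (i = 1; i < n; i += 2) { /* undo predict */
        x[i] += floor_shift(x[i - 1] + x[mirror(i + 1, n)], 1);
    }
}
```

Because each lifting step only adds or subtracts an integer computed from the other parity, the inverse reconstructs the input exactly. Each loop also touches every sample of one parity independently, which is the kind of structure that lends itself to SIMD.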

For a large lossless file I have, `bin/opj_decompress -i balloon-reversible.jp2 -o test.ppm -threads 0` goes from printing "decode time: 588 ms" to "decode time: 559 ms", around a 5% speedup for decoding. It's not nothing, but much less than the other PR.
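
Spelling out the arithmetic behind those numbers: the last line is only a rough estimate, and it assumes (this was not measured) that the `bench_dwt` speedup carries over unchanged to the IDWT portion of a real decode.

```latex
\begin{align*}
\text{bench\_dwt speedup } s &= \frac{1.618\ \text{s}}{0.432\ \text{s}} \approx 3.75\times \\
\text{end-to-end time saved} &= \frac{588 - 559}{588} = \frac{29}{588} \approx 4.9\% \\
\text{implied IDWT share of decode time } f &\approx \frac{29/588}{1 - 1/s} \approx \frac{0.0493}{0.733} \approx 6.7\%
\end{align*}
```

Under that assumption, the IDWT accounted for only about 7% of the decode time before the change, which is consistent with the small end-to-end gain.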

I created that input file with `bin/opj_compress -i image.ppm -o balloon-reversible.jp2 -M 1`, where image.ppm is the decompressed balloon.jp2 (and balloon.jp2 is the usual jp2 balloon test file).


nico commented Apr 24, 2026

For this one, I also verified that `ctest --parallel` has the same 50 failures as it did without the PR, and I checked that dwt.c compiles fine with `--target=armv7a-linux-gnueabihf --sysroot ~/src/chrome/src/build/linux/debian_bullseye_armhf-sysroot` (to test 32-bit arm).

rouault merged commit 530bebd into uclouvain:master on Apr 24, 2026
14 checks passed