How is it decided how to optimize images? #1487

homocomputeris · 2025-02-22T14:49:59Z

As an example, I scan documents in grayscale TIFF, combine them with img2pdf and then run

ocrmypdf --tesseract-timeout=0 --optimize 3 --remove-background input.pdf output.pdf

The CLI output shows that some files are optimized as JPEG, some as JBIG2, and some as PNG (although initially all were TIFFs).

How does OCRmyPDF decide how to optimize images and to which formats to convert them?

jbarlow83 (Maintainer) · 2025-02-22T19:13:36Z

The process is complicated.

In the default mode we process with Ghostscript, which converts to PDF/A. In the process, Ghostscript will optimize and change some image formats. This can convert lossless images to lossy ones in some cases. This behavior can be disabled with --output-type pdf or by tweaking --pdfa-image-compression.
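As an illustration of the two options just mentioned, here is a minimal sketch that assembles the command line in Python. The helper `build_ocrmypdf_args` and its parameters are hypothetical names made up for this example. The flags are the ones named in this thread; `lossless` as a value for `--pdfa-image-compression` is an assumption to check against `ocrmypdf --help` for your version.

```python
import shlex


def build_ocrmypdf_args(input_pdf, output_pdf, *, skip_pdfa=False,
                        lossless_pdfa=False, optimize=3):
    """Build an ocrmypdf argument list (hypothetical helper, not part of OCRmyPDF)."""
    args = ["ocrmypdf", "--optimize", str(optimize)]
    if skip_pdfa:
        # Skip the PDF/A conversion, and with it Ghostscript's image re-encoding.
        args += ["--output-type", "pdf"]
    elif lossless_pdfa:
        # Keep PDF/A output but ask for lossless image compression
        # (assumed value; verify with `ocrmypdf --help`).
        args += ["--pdfa-image-compression", "lossless"]
    args += [input_pdf, output_pdf]
    return args


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_ocrmypdf_args("input.pdf", "output.pdf", skip_pdfa=True)))
```

The list can be passed to `subprocess.run(args, check=True)` once it looks right.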

The second pass optimizer then reviews each image and tries a variety of optimization strategies. If an image can be quantized to 1-bit PNG without much loss, then you might see an original JPEG/TIFF converted to PNG then to JBIG2, provided all the dependencies are available. If an attempt to optimize an image results in a larger byte count, the optimization is discarded. If JBIG2 is available, all monochrome images get converted to JBIG2 or CCITT. All palette images get converted to PNG. Most JPEG2000 images get converted to JPEG.
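The rules in the paragraph above can be read as a rough decision table. The sketch below is a simplified paraphrase of that description, not OCRmyPDF's actual code: `ImageInfo`, `candidate_codecs`, `pick_if_smaller`, and the `kind`/`codec` labels are all hypothetical, "most JPEG2000 images" is treated as "all", and the branches for a missing JBIG2 encoder are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ImageInfo:
    kind: str                        # "mono", "palette", "gray" or "color" (made-up labels)
    codec: str                       # current encoding, e.g. "jpeg", "jpeg2000", "flate"
    quantizes_to_1bit: bool = False  # True if 1-bit quantization loses little


def candidate_codecs(img, jbig2_available):
    """Return the encodings worth trying for one image, per the description above."""
    if img.kind == "mono":
        # Assumption: without a JBIG2 encoder the image is left alone.
        return ["jbig2", "ccitt"] if jbig2_available else []
    if img.quantizes_to_1bit:
        # Goes through a 1-bit PNG first, then JBIG2 if the dependency exists.
        return ["jbig2"] if jbig2_available else ["png"]
    if img.kind == "palette":
        return ["png"]
    if img.codec == "jpeg2000":
        return ["jpeg"]
    return []


def pick_if_smaller(original_size, encoded_sizes):
    """Keep the smallest candidate, but only if it beats the original size."""
    if not encoded_sizes:
        return None
    best = min(encoded_sizes, key=encoded_sizes.get)
    return best if encoded_sizes[best] < original_size else None


if __name__ == "__main__":
    scan = ImageInfo(kind="gray", codec="jpeg", quantizes_to_1bit=True)
    print(candidate_codecs(scan, jbig2_available=True))
    print(pick_if_smaller(1000, {"jbig2": 1200}))
```

This would be one way to explain the mix of formats in the CLI output from the original question: each image takes its own branch, and any attempt that grows the file is dropped.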

There are all kinds of exceptions for nonstandard images.

homocomputeris (Author) · 2025-02-22T19:57:11Z

> monochrome images

Are BW images detected using metadata or by heuristics? Say, if it's a BW image saved in grayscale, will it be optimized "back" to BW? Or is this part done by GS?

2 replies

jbarlow83 Feb 22, 2025
Maintainer

In most cases a BW image is only BW if the metadata says it is.

If pngquant is installed and enabled by the --optimize setting, and it is able to quantize an image to 1 bpp, then a grayscale or color image could become BW. I'm fairly, though not completely, certain that Ghostscript never quantizes images.
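To make the distinction between metadata and content concrete, here is a minimal sketch. Both function names are hypothetical, and the metadata check is reduced to a single bits-per-sample value, which is a simplification of what a real PDF image dictionary carries.

```python
def is_bw_by_metadata(bits_per_component):
    """Metadata view: the image declares one bit per sample."""
    return bits_per_component == 1


def uses_only_two_levels(samples):
    """Content view: the decoded samples take at most two distinct values."""
    return len(set(samples)) <= 2


# An 8-bit grayscale scan that happens to contain only pure black and white.
scan_bits = 8
scan_samples = [0, 255, 255, 0, 255, 0]

print(is_bw_by_metadata(scan_bits))        # what a metadata check reports
print(uses_only_two_levels(scan_samples))  # what the pixel data says
```

The two checks disagree for this scan: it is declared as 8-bit grayscale, so only a quantization step of the kind described above would turn it into a true 1-bit image.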

homocomputeris Feb 22, 2025
Author

Thanks for the answer.
So, in practice, does it make sense to convert images to, say, PNG+BW before feeding them to img2pdf if I know they are BW, so that ocrmypdf can make a better decision when optimizing?
