dpsprep
[options] src [dest]
This tool, initially made specifically for use with Sony's Digital Paper System (DPS), is now a general-purpose DjVu to PDF converter with a focus on small output size and the ability to preserve document outlines (e.g. TOC) and text layers (e.g. OCR).
-q
,--quality
: Quality of images in output. Used only for JPEG compression, i.e. RGB and Grayscale images. Passed directly to Pillow and to OCRmyPDF's optimizer.-p
,--pool-size
: Size of MultiProcessing pool for handling page-by-page operations.-v
,--verbose
: Display debug messages.-o
,--overwrite
: Overwrite destination file.-w
,--preserve-working
: Preserve the working directory after script termination.-d
,--delete-working
: Delete any existing files in the working directory prior to writing to it.-t
,--no-text
: Disable the generation of text layers. Implied by --ocr.-m
,--mode
: Image mode. The default is to ask libdjvu for the image mode of every page. It sometimes makes sense to force bitonal images since they compress well.--ocr
Perform OCR via OCRmyPDF rather than trying to convert the text layer. If this parameter has a value, it should be a JSON dictionary of options to be passed to OCRmyPDF.-O1
: Use the lossless PDF image optimization from OCRmyPDF (without performing OCR).-O2
: Use the PDF image optimization from OCRmyPDF.-O3
: Use the aggressive lossy PDF image optimization from OCRmyPDF.--help
: Show this message and exit.
Produce file.pdf
in the current directory:
dpsprep /wherever/file.djvu
Produce output.pdf
with reduced image quality and aggressive PDF image optimizations:
dpsprep ---quality=30 -O3 input.djvu output.pdf
Produce an output file using a large pool of workers:
dpsprep --pool=16 input.djvu
Force bitonal images:
dpsprep --mode bitonal input.djvu
Produce an output file by disregarding the text layer and running OCRmyPDF instead:
dpsprep --ocr '{"language": ["rus", "eng"]}' input.djvu
Or simply disregard the text layer without OCR:
dpsprep --no-text input.djvu
We perform compression in two stages:
-
The first one is the default compression provided by Pillow. For bitonal images, the PDF generation code says that, if
libtiff
is available,group4
compression is used. -
If OCRmyPDF is installed, its PDF optimization can be used via the flags
-O1
to-O3
(this involves no OCR). This allows us to use advanced techniques, including JBIG2 compression viajbig2enc
.
If manually running OCRmyPDF, note that the optimization command suggested in the documentation (setting --tesseract-timeout
to 0
) may ruin existing text layers. To perform only PDF optimization you can use the following undocumented tool instead:
python -m ocrmypdf.optimize <input_file> <level> <output_file>