I had been using pdfsandwich to create searchable PDFs from non-searchable
ones. However, collecting all of its dependencies is a pain if, for example,
you don't have root access, so I thought I would package them up with Julia's
BinaryBuilder to make installation simple. I wasn't able to cross-compile
pdfsandwich itself, but since tesseract does the hard work anyway, I decided to
write the glue script myself. It turns out there are several of these already.
I have probably diverged from the pdfsandwich implementation, since I don't use
ImageMagick's convert, which is one of pdfsandwich's dependencies. Since the job
can be done very simply, e.g.
- convert each page of the PDF to an image
- possibly clean it up with unpaper
- use tesseract to create a single-page searchable PDF
- combine the PDFs,
I decided not to look at the source of pdfsandwich when creating my
implementation, so that I can stick to an MIT license, which is the usual one in
the Julia community.
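
The four steps listed above can be sketched in a few lines of Julia. This is only an illustration under stated assumptions, not the code in this package: the function names (`rasterize_cmd`, `unpaper_cmd`, `tesseract_cmd`, `unite_cmd`, `ocr_sketch`) are hypothetical, and it assumes that `pdftoppm` and `pdfunite` (from Poppler), `unpaper` and `tesseract` are available on `PATH`. Poppler is a stand-in here for the rasterizing and combining steps; other tools would work as well.

```julia
# Hypothetical helpers: each one only builds the external command for a step.

# Rasterize every page of `pdf` to `<prefix>-<n>.ppm` at `dpi` dots per inch.
rasterize_cmd(pdf, prefix; dpi=300) = `pdftoppm -r $dpi $pdf $prefix`

# Clean up a single page image (unpaper reads and writes PNM files).
unpaper_cmd(img, cleaned) = `unpaper $img $cleaned`

# OCR one page image into the single-page searchable PDF `<outbase>.pdf`.
tesseract_cmd(img, outbase) = `tesseract $img $outbase pdf`

# Concatenate the single-page PDFs, in order, into `out`.
unite_cmd(pages, out) = `pdfunite $pages $out`

# Run the whole pipeline in a temporary directory (assumes the tools are on PATH).
function ocr_sketch(pdf, out; clean=false)
    mktempdir() do dir
        run(rasterize_cmd(pdf, joinpath(dir, "page")))
        # pdftoppm zero-pads the page numbers, so sorting by name keeps page order.
        images = sort(filter(endswith(".ppm"), readdir(dir; join=true)))
        pages = map(images) do img
            base = first(splitext(img))
            if clean
                cleaned = base * "-clean.ppm"
                run(unpaper_cmd(img, cleaned))
                img = cleaned
            end
            run(tesseract_cmd(img, base))
            base * ".pdf"
        end
        if length(pages) == 1
            cp(pages[1], out; force=true)   # nothing to combine
        else
            run(unite_cmd(pages, out))
        end
    end
    return out
end
```

Keeping command construction separate from `run` makes it possible to inspect each step (for example `rasterize_cmd("in.pdf", "page")`) without having the tools installed.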
It more or less works on macOS (both Intel and Apple Silicon) and Linux.
Next steps:
- Allow choice of training data used for tesseract
- Look at what settings should be used for unpaper
- Robustify and test on more files
- Add better tests?
using SearchablePDFs
file = ocr("test/test_rasterized.pdf")
or use searchable.
TODO: CLI using @main.