Create searchable PDFs via optical character recognition

ericphanson/SearchablePDFs.jl

SearchablePDFs

I had been using pdfsandwich to create searchable PDFs from non-searchable PDFs. However, collecting all of its dependencies is a pain if, e.g., you don't have root access, so I thought to package them up with Julia's BinaryBuilder to make installation simple. I wasn't able to cross-compile pdfsandwich itself, but since tesseract does the hard work anyway, I decided to write the glue script myself. It turns out there are several of these already.

I believe I have likely diverged from the pdfsandwich implementation, since I haven't used ImageMagick's convert, which is one of pdfsandwich's dependencies. The job can be done very simply, e.g.

  1. convert each page of the PDF to an image
  2. possibly clean it up with unpaper
  3. use tesseract to create a single-page searchable PDF
  4. combine the PDFs.

For that reason, I decided not to look at the source of pdfsandwich when creating my implementation, so I can stick to an MIT license, which is the usual one in the Julia community.
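
The four steps above can be sketched as a list of external commands. This is only an illustrative sketch, not necessarily how this package does it: it assumes `pdftoppm` and `pdfunite` (from Poppler), `unpaper`, and `tesseract` are available on the PATH, that the page count is already known, and the helper names `page_commands` and `ocr_plan` are hypothetical. The commands are only built here, not run.

```julia
# Hypothetical sketch: build (but do not run) the commands for each step.
# Assumes the pdftoppm, unpaper, tesseract, and pdfunite executables exist.

# Commands turning page `page` of `pdf` into a one-page searchable PDF.
# Returns the commands and the path of the PDF that tesseract would write.
function page_commands(pdf::AbstractString, page::Integer, workdir::AbstractString;
                       clean::Bool=false, lang::AbstractString="eng")
    raw = joinpath(workdir, "page_$(page)")
    img = raw * ".ppm"  # pdftoppm writes PPM by default and appends ".ppm"
    # Step 1: rasterize a single page at 300 DPI.
    cmds = Cmd[`pdftoppm -f $page -l $page -r 300 -singlefile $pdf $raw`]
    if clean
        # Step 2 (optional): clean up the scan with unpaper's default settings.
        cleaned = joinpath(workdir, "page_$(page)_clean.ppm")
        push!(cmds, `unpaper $img $cleaned`)
        img = cleaned
    end
    # Step 3: OCR the image; the `pdf` config makes tesseract write `out.pdf`.
    out = joinpath(workdir, "page_$(page)_ocr")
    push!(cmds, `tesseract $img $out -l $lang pdf`)
    return cmds, out * ".pdf"
end

# Full plan for a PDF with `npages` pages, ending with the merge step.
function ocr_plan(pdf::AbstractString, npages::Integer, workdir::AbstractString,
                  output::AbstractString; clean::Bool=false)
    cmds = Cmd[]
    pages = String[]
    for p in 1:npages
        page_cmds, page_pdf = page_commands(pdf, p, workdir; clean=clean)
        append!(cmds, page_cmds)
        push!(pages, page_pdf)
    end
    # Step 4: combine the single-page PDFs in page order.
    push!(cmds, `pdfunite $pages $output`)
    return cmds
end
```

With the tools installed, the plan could be executed with something like `foreach(run, ocr_plan("input.pdf", 3, mktempdir(), "output.pdf"))`. Keeping command construction separate from execution makes the sequence easy to inspect before anything is run.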

Status

It more or less works on macOS (both Intel and Apple Silicon) and Linux.

Next steps:

  • Allow choice of training data used for tesseract
  • Look at what settings should be used for unpaper
  • Robustify and test on more files
  • Add better tests?

Usage

using SearchablePDFs
file = ocr("test/test_rasterized.pdf")

or use searchable.

TODO: CLI using @main.
