A simple python tool to extract the contents of PDF documents into text files

gameboy88/pdf2txt

pdf2txt / img2txt

pdf2txt is a Python tool that converts PDF content to text. It also includes img2txt, a companion tool that converts image content to text.

Install External Packages

Install Poppler and Tesseract, then set their locations in config.yml:

    poppler_path: your_path\poppler\Library\bin
    tesseract_cmd: your_path\Tesseract-OCR\tesseract.exe
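As an illustration of how these two settings might be consumed, here is a minimal sketch that reads flat `key: value` lines using only the standard library. The helper name `parse_config` is hypothetical and not taken from the repository, which may well use a full YAML library instead. The key names match the `poppler_path` argument of pdf2image's `convert_from_path` and the `pytesseract.pytesseract.tesseract_cmd` setting, but whether the tool uses those libraries is an assumption.

```python
def parse_config(text):
    """Parse flat `key: value` lines into a dict.

    Hypothetical helper for illustration; it handles only the simple
    two-key layout shown above, not general YAML.
    """
    config = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first colon only, so Windows drive letters
        # such as "C:\..." in the value are preserved.
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip()
    return config


sample = (
    "poppler_path: C:\\tools\\poppler\\Library\\bin\n"
    "tesseract_cmd: C:\\tools\\Tesseract-OCR\\tesseract.exe\n"
)
config = parse_config(sample)

# Assumed usage (requires the third-party packages pdf2image and pytesseract):
#   pages = convert_from_path("data/input.pdf", poppler_path=config["poppler_path"])
#   pytesseract.pytesseract.tesseract_cmd = config["tesseract_cmd"]
```

The paths in `sample` are placeholders; substitute the locations where Poppler and Tesseract are installed on your machine.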

Usage

Specify the input and output file locations with --input and --output. If they are omitted, the input defaults to data/input.pdf (pdf2txt) or data/input.jpg (img2txt), and the output defaults to data/output.txt or data/page_no.txt.

Command and Parameters

    pdf2txt.py [-h] [-v] [--type TYPE] [--input INPUT] [--output OUTPUT] [--thresh THRESH] [--maxval MAXVAL]

    options:
      -h, --help       show this help message and exit
      -v, --verbose    print output
      --type TYPE      content type of pdf file: text or image
      --input INPUT    input pdf file
      --output OUTPUT  prefix name of output files
      --thresh THRESH  used for thresholding image
      --maxval MAXVAL  used for Thresholding image
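For readers who want to see how an interface like this fits together, the following is a sketch of an `argparse` parser that produces the same usage line. It is not the repository's code. The default values for `--type`, `--output`, `--thresh`, and `--maxval`, and the integer types, are assumptions made for illustration; only the `data/input.pdf` default comes from the text above.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented pdf2txt.py options (illustrative)."""
    parser = argparse.ArgumentParser(prog="pdf2txt.py")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print output")
    parser.add_argument("--type", default="text",  # assumed default
                        help="content type of pdf file: text or image")
    parser.add_argument("--input", default="data/input.pdf",
                        help="input pdf file")
    parser.add_argument("--output", default="data/page",  # assumed default
                        help="prefix name of output files")
    parser.add_argument("--thresh", type=int, default=127,  # assumed default
                        help="used for thresholding image")
    parser.add_argument("--maxval", type=int, default=255,  # assumed default
                        help="used for thresholding image")
    return parser


# Example: a scanned PDF that needs OCR, with a custom threshold.
args = build_parser().parse_args(
    ["--type", "image", "--input", "scan.pdf", "--thresh", "100"]
)
```

`-h/--help` is added by `argparse` automatically, which is why it does not appear in the sketch.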

    img2txt.py [-h] [-v] [--input INPUT] [--output OUTPUT] [--thresh THRESH] [--maxval MAXVAL]

    options:
      -h, --help       show this help message and exit
      -v, --verbose    print output
      --input INPUT    input image file
      --output OUTPUT  output text file
      --thresh THRESH  used for thresholding image
      --maxval MAXVAL  used for Thresholding image
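The help text does not say what kind of thresholding `--thresh` and `--maxval` control. One plausible reading, given that the names match the `thresh` and `maxval` parameters of OpenCV's `cv2.threshold`, is a binary threshold applied before OCR: pixels brighter than `thresh` become `maxval`, and all others become 0. The pure-Python sketch below illustrates that rule under this assumption; the function name `binary_threshold` is hypothetical.

```python
def binary_threshold(pixels, thresh, maxval):
    """Apply a binary threshold to a grayscale image given as nested lists.

    Assumed rule (same as OpenCV's THRESH_BINARY):
    a pixel strictly greater than `thresh` becomes `maxval`, otherwise 0.
    """
    return [
        [maxval if pixel > thresh else 0 for pixel in row]
        for row in pixels
    ]


# A tiny 2x2 "image": dark text pixels drop to 0, light background goes to 255.
image = [
    [10, 200],
    [128, 127],
]
binarized = binary_threshold(image, thresh=127, maxval=255)
# binarized == [[0, 255], [255, 0]]
```

Under this reading, lowering `--thresh` keeps more faint strokes as foreground-versus-background contrast, while `--maxval` is normally left at 255 for 8-bit images.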
