pdf2txt / img2txt

pdf2txt is a Python tool that converts PDF content to text. It ships with a companion tool, img2txt, that converts image content to text.

Install External Packages

Install poppler and tesseract and set them up for your operating system. Then point config.yml at the two installations:

    poppler_path: your_path\poppler\Library\bin
    tesseract_cmd: your_path\Tesseract-OCR\tesseract.exe
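Before a first run, it can help to check that both entries point at real locations. The sketch below is not part of pdf2txt. It assumes config.yml holds only flat `key: value` lines like the two above, and the helper names `read_simple_config` and `missing_paths` are hypothetical. It uses only the standard library, so it does not handle quoted values or nested YAML.

```python
from pathlib import Path


def read_simple_config(text: str) -> dict[str, str]:
    """Parse flat `key: value` lines; this is not a general YAML parser."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so Windows drive letters (C:\...) survive.
        key, sep, value = line.partition(":")
        if sep:
            config[key.strip()] = value.strip()
    return config


def missing_paths(config: dict[str, str]) -> list[str]:
    """Return the expected keys that are unset or point at a path that does not exist."""
    required = ("poppler_path", "tesseract_cmd")
    return [k for k in required if not config.get(k) or not Path(config[k]).exists()]


if __name__ == "__main__":
    cfg_file = Path("config.yml")  # assumed to sit in the current directory
    if cfg_file.exists():
        cfg = read_simple_config(cfg_file.read_text(encoding="utf-8"))
        for key in missing_paths(cfg):
            print(f"{key}: not set, or path not found")
    else:
        print("config.yml not found in the current directory")
```

If the script prints nothing, both paths were found.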

Usage

Specify the input and output file locations with --input and --output. If you omit them, the input defaults to data/input.pdf (pdf2txt.py) or data/input.jpg (img2txt.py), and the output is written to data/output.txt or data/page_no.txt.

Command and Parameters

    pdf2txt.py [-h] [-v] [--type TYPE] [--input INPUT] [--output OUTPUT] [--thresh THRESH] [--maxval MAXVAL]

    options:
      -h, --help       show this help message and exit
      -v, --verbose    print output
      --type TYPE      content type of pdf file: text or image
      --input INPUT    input pdf file
      --output OUTPUT  prefix name of output files
      --thresh THRESH  used for thresholding image
      --maxval MAXVAL  used for Thresholding image
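The option list above is the kind of help text `argparse` generates. As a rough illustration, here is how such a parser could be declared. This is a reconstruction from the help output, not the actual source of pdf2txt.py. The integer types for --thresh and --maxval, the absence of defaults for them, and the `build_parser` name are assumptions. Only the data/input.pdf default comes from this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a parser that accepts the pdf2txt.py options listed above."""
    parser = argparse.ArgumentParser(prog="pdf2txt.py")
    parser.add_argument("-v", "--verbose", action="store_true", help="print output")
    # The README names two content types; restricting to them is an assumption.
    parser.add_argument("--type", metavar="TYPE", choices=["text", "image"],
                        help="content type of pdf file: text or image")
    parser.add_argument("--input", default="data/input.pdf", help="input pdf file")
    # Default left unset here; the README lists data/output.txt or data/page_no.txt.
    parser.add_argument("--output", help="prefix name of output files")
    # Integer thresholds are assumed; the real tool may accept other types.
    parser.add_argument("--thresh", type=int, help="used for thresholding image")
    parser.add_argument("--maxval", type=int, help="used for thresholding image")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With a parser like this, a scanned PDF would be processed with something like `python pdf2txt.py --type image --input data/input.pdf --thresh 127 --maxval 255`, where the two numeric values are example choices rather than documented defaults.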

    img2txt.py [-h] [-v] [--input INPUT] [--output OUTPUT] [--thresh THRESH] [--maxval MAXVAL]

    options:
      -h, --help       show this help message and exit
      -v, --verbose    print output
      --input INPUT    input image file
      --output OUTPUT  output text file
      --thresh THRESH  used for thresholding image
      --maxval MAXVAL  used for Thresholding image
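The README does not say how --thresh and --maxval are applied. The names match the `thresh` and `maxval` parameters of OpenCV's `cv2.threshold`, so the sketch below assumes plain binary thresholding: every grayscale pixel above `thresh` becomes `maxval`, and every other pixel becomes 0. It is written in pure Python to show the arithmetic only. The `binary_threshold` function is hypothetical and is not taken from the tool.

```python
def binary_threshold(pixels, thresh, maxval):
    """Map each grayscale value to maxval if it exceeds thresh, else to 0."""
    return [[maxval if p > thresh else 0 for p in row] for row in pixels]


# A tiny 2x3 grayscale "image" with values in the 0-255 range.
image = [
    [12, 200, 130],
    [127, 128, 90],
]
print(binary_threshold(image, thresh=127, maxval=255))
# prints [[0, 255, 255], [0, 255, 0]]
```

Under this assumption, a higher --thresh turns more of the page black, which can remove faint background noise but may also erase light text before OCR.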
