PDF Extractor

A Streamlit web application that extracts and displays metadata and text content from PDF files.

Features

- Upload Multiple PDFs: Upload one or more PDF files through a simple interface
- Extract Metadata: Automatically extract all available metadata from each PDF
- View PDF Content: View the full text content of each PDF
- Tab Navigation: Easily navigate between multiple PDFs using tabs
- Export to CSV: Export all metadata to a CSV file for further analysis
- Clean UI: Streamlined user interface with custom styling

Installation

Prerequisites

Python 3.7 or higher
pip (Python package installer)

Setup

Clone this repository:

```
git clone https://github.com/username/pdf-extractor.git
cd pdf-extractor
```

Install the required dependencies:
```
pip install -r requirements.txt
```

Usage

Running the Application

Run the application with the following command:

```
streamlit run pdf_extractor.py
```

For deployment with a custom base URL path:

```
streamlit run pdf_extractor.py --server.baseUrlPath="/pdf"
```

Using the Application

1. Upload PDF Files:
   - Click the "Choose PDF files" button in the sidebar
   - Select one or more PDF files from your computer
2. View Metadata:
   - The application will automatically extract and display metadata for each PDF
   - Navigate between PDFs using the tabs at the top
3. View PDF Content:
   - Click the "PDF DATA" expander to view the full text content of the PDF
4. Export Metadata:
   - Use the "Export Metadata" button in the sidebar to download a CSV file
   - Optionally include the full PDF text content in the export
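
The export step can be pictured with a small sketch. This is not the application's actual code: the function name `metadata_to_csv`, the `text` key, and the record layout are assumptions made for illustration, and the sketch uses the standard-library `csv` module rather than pandas to stay dependency-free.

```python
import csv
import io
from typing import Dict, List


def metadata_to_csv(records: List[Dict[str, str]], include_text: bool = False) -> str:
    """Serialize one row per PDF; the hypothetical 'text' column is dropped unless requested."""
    rows = [
        {key: value for key, value in record.items() if include_text or key != "text"}
        for record in records
    ]
    # PDFs rarely share the same metadata keys, so collect the union in first-seen order.
    fieldnames: List[str] = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    # restval fills in an empty cell when a PDF lacks a given metadata field.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Building the header from the union of keys keeps every PDF in a single table even when the files expose different metadata fields; missing fields simply become empty cells.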

Docker Support

A Dockerfile is included for containerized deployment:

```
docker build -t pdf-extractor .
docker run -p 8501:8501 pdf-extractor
```

To run the application with a custom base URL path in Docker:

```
docker run -p 8501:8501 -e BASE_URL_PATH="/pdf" pdf-extractor
```

The BASE_URL_PATH environment variable is optional. If not specified, the application will run at the root path.
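
How the container maps `BASE_URL_PATH` onto Streamlit is not described here. The following is a minimal sketch of one plausible approach, assuming a launcher that appends the `--server.baseUrlPath` flag only when the variable is set. The function `build_streamlit_command` is hypothetical and not part of this repository.

```python
import os
from typing import List, Mapping, Optional


def build_streamlit_command(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Build the `streamlit run` argument list, honouring BASE_URL_PATH if it is set."""
    env = os.environ if env is None else env
    command = ["streamlit", "run", "pdf_extractor.py"]
    base_path = env.get("BASE_URL_PATH", "").strip()
    if base_path:
        # Only pass the flag when a path was given, so the default stays the root path.
        command.append(f"--server.baseUrlPath={base_path}")
    return command
```

A launcher script could pass the resulting list to `subprocess.run`; with the variable unset, the command is identical to the plain `streamlit run pdf_extractor.py` invocation.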

Technical Details

Dependencies

streamlit: Web application framework
pdfminer.six: PDF parsing and text extraction
pandas: Data manipulation and CSV export

Code Structure

pdf_extractor.py: Main application file containing:
- PDF metadata extraction functions
- Text content extraction
- Streamlit UI components
- CSV export functionality
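
As a rough illustration of the metadata and text extraction steps, the sketch below shows how they might be done with pdfminer.six. The helper names `normalize_info` and `read_pdf` are hypothetical, the Latin-1 fallback is a simplification of PDF string decoding, and the real `pdf_extractor.py` may be organized differently.

```python
from typing import Any, Dict, List


def normalize_info(info: List[Dict[str, Any]]) -> Dict[str, str]:
    """Flatten a list of PDF info dictionaries into one str -> str mapping."""
    flat: Dict[str, str] = {}
    for entry in info:
        for key, value in entry.items():
            if isinstance(value, bytes):
                # UTF-16 strings carry a byte-order mark; otherwise assume Latin-1
                # (an approximation of PDFDocEncoding, chosen for this sketch).
                if value.startswith((b"\xfe\xff", b"\xff\xfe")):
                    text = value.decode("utf-16", errors="replace")
                else:
                    text = value.decode("latin-1")
            else:
                # Non-bytes values (e.g. unresolved references) are shown as-is.
                text = str(value)
            flat[key] = text.strip()
    return flat


def read_pdf(path: str) -> Dict[str, str]:
    """Return metadata plus full text for one PDF (the 'text' key is an assumption)."""
    # Imported here so normalize_info stays usable without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfparser import PDFParser

    with open(path, "rb") as handle:
        document = PDFDocument(PDFParser(handle))
        record = normalize_info(document.info)
    record["text"] = extract_text(path)
    return record
```

Keeping the decoding in a separate pure function makes it easy to check against sample values without needing a PDF file on disk.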

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

