A comprehensive Python-based document processing toolkit for OCR, text extraction, NLP analysis, and document classification.
- OCR Processing: Extract text from PDF documents using Tesseract OCR
- Text Preprocessing: Tokenization, stopword removal, lemmatization
- Named Entity Recognition: Extract persons, organizations, locations, and custom entities
- Sentiment Analysis: Multiple engines (TextBlob, Flair, VADER, HuggingFace)
- Document Similarity: Compare documents using Word2Vec, TF-IDF, and GZIP-based methods
- Document Clustering: Group similar documents using K-means and LSA
- Text Summarization: Automatic text summarization using LSA
- Document Classification: Classify documents into categories
- Email Processing: Download and process emails from Gmail
- Key-Value Extraction: Extract structured data from documents
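The GZIP-based similarity mentioned above typically relies on normalized compression distance (NCD): two texts that share vocabulary compress better together than apart. A minimal stdlib-only sketch of the idea (the repo's `gzip_knn_similarity.py` may differ in detail):

```python
import gzip

def ncd(a: str, b: str) -> float:
    """Normalized compression distance: near 0 = very similar, near 1 = unrelated."""
    ca = len(gzip.compress(a.encode()))
    cb = len(gzip.compress(b.encode()))
    cab = len(gzip.compress((a + " " + b).encode()))
    return (cab - min(ca, cb)) / max(ca, cb)

doc1 = "The quarterly invoice covers software licensing fees."
doc2 = "The quarterly invoice covers software licensing fees and support."
doc3 = "Photosynthesis converts sunlight into chemical energy."

# Similar documents share long substrings, so concatenating them adds
# little to the compressed size; unrelated documents do not.
assert ncd(doc1, doc2) < ncd(doc1, doc3)
```

Because NCD needs no training or vocabulary, it works as a drop-in distance for a k-NN classifier over raw document text.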
- Python 3.8 or higher
- Tesseract OCR (for optical character recognition)
- Linux: `sudo apt-get install tesseract-ocr`
- macOS: `brew install tesseract`
- Windows: download the installer from GitHub
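The changelog below notes that the Tesseract path is now platform-aware. One way this can be done is sketched here; `resolve_tesseract_path` and the fallback locations are illustrative assumptions, not the toolkit's exact code:

```python
import platform
import shutil

def resolve_tesseract_path() -> str:
    """Return a best-guess Tesseract binary path for the current platform."""
    found = shutil.which("tesseract")
    if found:                      # already on PATH (typical on Linux/macOS)
        return found
    if platform.system() == "Windows":
        # Default install location of the Windows Tesseract build
        return r"C:\Program Files\Tesseract-OCR\tesseract.exe"
    return "/usr/bin/tesseract"    # common Linux package location

# Typical usage with pytesseract:
# pytesseract.pytesseract.tesseract_cmd = resolve_tesseract_path()
path = resolve_tesseract_path()
```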
```bash
pip install -r requirements.txt
python -m spacy download en_core_web_sm
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')"
python -m textblob.download_corpora
```

Copy the example configuration file and customize it:

```bash
cp settings.ini.example settings.ini
```

Edit settings.ini to configure:
- Document paths
- Similarity thresholds
- Document categories for classification
Note: If settings.ini is not found, scripts will use sensible defaults.
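The fallback behavior for a missing `settings.ini` can be sketched with the stdlib `configparser`; the section name and default keys below (`documents_path`, `similarity_threshold`) are illustrative assumptions, not the project's actual schema:

```python
import configparser
import os

DEFAULTS = {  # hypothetical fallback values, mirroring settings.ini.example
    "documents_path": "documents",
    "similarity_threshold": "0.75",
}

def load_settings(path: str = "settings.ini") -> dict:
    """Read settings.ini if present; otherwise fall back to defaults."""
    settings = dict(DEFAULTS)
    if os.path.exists(path):
        parser = configparser.ConfigParser()
        parser.read(path)
        if parser.has_section("general"):
            settings.update(parser["general"])
    return settings

cfg = load_settings("nonexistent.ini")  # no file → defaults are used
```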
For email downloading features, you'll need Google API credentials:
1. Go to the Google Cloud Console
2. Create a new project
3. Enable the Gmail API
4. Create OAuth 2.0 credentials
5. Download the credentials file as `credentials.json` into the project root
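A missing `credentials.json` should fail fast with clear instructions rather than a raw traceback. A minimal sketch; `require_credentials` is a hypothetical helper, not the project's actual function:

```python
import os

CREDENTIALS_FILE = "credentials.json"

def require_credentials(path: str = CREDENTIALS_FILE) -> str:
    """Fail fast with setup instructions when Google API credentials are absent."""
    if not os.path.exists(path):
        raise FileNotFoundError(
            f"{path} not found. Create OAuth 2.0 credentials in the "
            "Google Cloud Console, enable the Gmail API, and save the "
            "downloaded file to the project root."
        )
    return path

try:
    require_credentials("missing-credentials.json")  # deliberately absent
except FileNotFoundError as exc:
    message = str(exc)
```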
document_intelligence/
├── documents/ # Input PDF documents
├── txt_output/ # Extracted text files
├── category/ # Clustered documents
├── NER/ # Named entity extraction results
├── sentiments/ # Sentiment analysis results
├── summarization/ # Document summaries
├── document_classification/ # Classification results
├── FL_sentiment/ # Flair sentiment analysis results
├── kvextract/ # Key-value extraction results
└── extract/ # Pattern extraction results
Run the main document processing pipeline:

```bash
python main.py
```

This will:
- Process PDFs with OCR
- Extract and preprocess text
- Generate document vectors
- Cluster similar documents
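The extract-and-preprocess step above can be illustrated with a stdlib-only sketch; the real pipeline uses NLTK and spaCy, and the stopword list here is a toy stand-in:

```python
import re
from typing import List

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}  # toy list

def preprocess(text: str) -> List[str]:
    """Lowercase, tokenize on alphanumeric runs, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

tokens = preprocess("The invoice is due in the first week of March.")
# → ['invoice', 'due', 'first', 'week', 'march']
```

The resulting token lists feed vectorization (TF-IDF, Word2Vec) and clustering downstream.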
```bash
python optical_character_recognition.py document1.pdf document2.pdf
```

```bash
python sentiment_analysis.py              # TextBlob
python sentiment_analysis_using_flair.py  # Flair
python sentiment_analysis_using_vader.py  # VADER
```

```bash
python extract_named_entities.py
```

```bash
python cluster_documents.py
python fuzzy_categorize_documents.py
```

```bash
python summarize_text.py
```

```bash
python document_similarity.py
python text_similarity.py
python gzip_knn_similarity.py
```

```bash
python download_email.py
python dl_email.py
```

This release includes comprehensive bug fixes that resolve all execution-blocking issues:
- ✅ Fixed syntax error in `optical_character_recognition.py` (en-dash → hyphen in Tesseract config)
- ✅ Added missing `logging` import in `optical_character_recognition.py`
- ✅ Fixed module-level model downloads in `extract_features_from_text.py` and `sentiment_analysis_using_flair.py` (now uses lazy loading)
- ✅ Added all missing dependencies to `requirements.txt`:
  - google-auth, google-auth-oauthlib, google-api-python-client
  - textblob, flair, sumy, fuzzywuzzy
  - scikit-learn, pandas, torch, transformers
- ✅ Replaced all hardcoded Windows paths with cross-platform `os.path.join()`
- ✅ Made Tesseract path platform-aware (Windows vs Linux/Mac)
- ✅ All output directories now created automatically with `os.makedirs(exist_ok=True)`
- ✅ Graceful handling of missing `settings.ini` (uses sensible defaults)
- ✅ Graceful handling of missing `credentials.json` (clear error message with instructions)
- ✅ Created `settings.ini.example` template for easy configuration
- ✅ Fixed type error in `document_classification.py` (convert numpy array to string)
- ✅ Fixed logic error in `document_similarity.py` (now reads file contents instead of comparing file paths)
- ✅ Fixed newline escaping in `gzip_knn_similarity.py` (`\\n` → `\n`)
- ✅ Fixed missing output directory creation in multiple scripts
- ✅ Moved module-level script logic into `main()` functions
- ✅ Added `if __name__ == "__main__"` guards to prevent execution on import
- ✅ Optimized spaCy model loading (load once at module level instead of per function call)
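The lazy-loading and import-guard patterns from the fix list can be sketched together; `get_model` here is a hypothetical stand-in for an expensive loader such as a Flair classifier:

```python
import functools

@functools.lru_cache(maxsize=None)
def get_model():
    """Load the heavy model on first use only (lazy loading).

    Stand-in for e.g. flair's TextClassifier.load(...), which would
    otherwise download weights at import time.
    """
    return object()  # placeholder for an expensive model load

def analyze(text: str):
    model = get_model()  # first call loads; later calls hit the cache
    return model

def main():
    analyze("example")

if __name__ == "__main__":  # prevents execution when imported as a module
    main()
```

With this structure, `import sentiment_analysis_using_flair` is cheap; the model download happens only when a function that needs it is actually called.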
See requirements.txt for the complete list. Major dependencies include:
- NLP: spaCy, NLTK, Flair, TextBlob, Gensim, Transformers
- ML: scikit-learn, PyTorch, pandas, numpy, scipy
- OCR: pytesseract, pdf2image, PyMuPDF, PyPDF4, Pillow
- Other: google-api-python-client, fuzzywuzzy, sumy
- First run may take longer due to model downloads (Word2Vec, Flair, etc.)
- Models are cached after first download
- Word2Vec model (~1.6GB) is downloaded on-demand when needed
- Use the `--verbose` flag (where available) for detailed progress
Ensure Tesseract is installed and on your system PATH, or edit the path in `optical_character_recognition.py`.
Run: `python -m spacy download en_core_web_sm`
Ensure `credentials.json` is present and the Gmail API is enabled in the Google Cloud Console.
For large document sets, process in smaller batches or increase system RAM
Contributions are welcome! Please ensure all code:
- Uses cross-platform paths (`os.path.join()`)
- Includes error handling
- Uses lazy loading for large models
- Has proper documentation
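The first guideline, cross-platform paths plus automatic output directory creation, can be sketched as follows; `output_path` is a hypothetical helper, not part of the codebase:

```python
import os
import tempfile

def output_path(category: str, filename: str, root: str = ".") -> str:
    """Build a cross-platform output path and ensure its directory exists."""
    directory = os.path.join(root, category)
    os.makedirs(directory, exist_ok=True)  # safe to call repeatedly
    return os.path.join(directory, filename)

with tempfile.TemporaryDirectory() as tmp:
    path = output_path("sentiments", "report.txt", root=tmp)
    created = os.path.isdir(os.path.dirname(path))
```

Using `os.path.join()` instead of hardcoded separators keeps the same code working on Windows, macOS, and Linux.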
[Add your license here]
For issues and questions, please open an issue on the project repository.