GraphifyPDF is an open-source Python framework that automatically extracts, structures, and visualizes knowledge from unstructured documents (such as PDFs) as an interactive knowledge graph. It combines large language models (LLMs), OCR, and Neo4j graph technology to make AI-driven document understanding and transformation simple, robust, and explorable for everyone.

GraphifyPDF turns complex, unstructured documents (such as OCR-processed PDFs) into knowledge graphs. An LLM pipeline automatically identifies entities and their relationships in the text, then inserts this structured data into a Neo4j graph database. Visualize, query, and explore your documents, turning raw PDFs into actionable, interconnected knowledge.
- Automated Entity & Relationship Extraction: Uses advanced LLMs (OpenAI GPT-4o via DSPy) for high-accuracy semantic extraction.
- Semantic Chunking: Chunks documents at the concept/paragraph level for smarter context processing.
- Graph Database Integration: Seamlessly pushes results into a Neo4j database for lightning-fast, flexible querying.
- Instant Visualization: Built-in Streamlit UI with pyvis lets users interact with and explore the knowledge graph.
- End-to-End Processing: Complete workflow from PDF (via OCR) → JSON → graph database → visual UI.
- Next-Gen Document AI: Most solutions stop at data extraction. GraphifyPDF goes further by automatically mapping extracted knowledge into a graph structure, powering downstream semantic applications.
- LLM Power: The system is built on modern LLMs, which outperform traditional regex-, POS-, and rule-based NLP extraction methods.
- Full Stack for Knowledge Graphs: You get extraction, structuring, storage, and visualization out-of-the-box.
- Open, Extensible, and Fast: Built for real developers and researchers. Modern Python. Pluggable analysis.
```
graphifypdf/
├── agents/         # LLM-based entity/relation extraction modules
├── configs/        # Configuration and secrets
├── ocr_extractor/  # OCR-specific helpers
├── schemas/        # Data schemas for validation and modeling
├── ui/             # Streamlit/pyvis front-end for graph viz
├── utils/          # Chunking, Neo4j connectors, helpers
└── main.py         # Core workflow script
```
1. Document Ingestion: An uploaded PDF is converted via OCR into JSON with Markdown-formatted text.
2. Semantic Chunking: The `DocumentChunker` breaks the document into meaningful text chunks (usually by paragraphs, logical sections, or topics).
3. Entity & Relation Extraction: Each chunk is processed by the modular `ExtractionPipeline` (built with DSPy and LLMs):
   - Entity Extraction: Detects real-world objects, concepts, people, events, etc.
   - Relation Extraction: Finds how entities relate to each other.
   - Entity Reflection & Refinement: Uses AI feedback to improve extraction accuracy.
   - Description Generation: Automatically builds natural-language descriptions.
   - Data Unification: Cleans and prepares data for graph ingestion.
4. Graph Construction/Storage: Detected entities become Neo4j nodes and relationships become Neo4j edges, structuring your document as a real knowledge graph.
5. Visualization: Use the Streamlit-powered UI to interactively explore and search your new knowledge graph with pyvis.
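The semantic-chunking step described above can be sketched in plain Python. This is an illustrative stand-in that assumes blank-line paragraph boundaries and a character budget (`max_chars` is a hypothetical parameter); it is not the project's actual `DocumentChunker` implementation.

```python
import re

def chunk_paragraphs(text: str, max_chars: int = 1500) -> list[str]:
    """Split text into paragraph-level chunks, merging consecutive
    paragraphs until a rough size budget is reached."""
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Start a new chunk once adding this paragraph would exceed the budget.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk is small enough to fit comfortably in an LLM context window while still keeping related sentences together.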
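The graph-construction step maps extracted entities to Neo4j nodes and relations to edges. The sketch below shows one way to generate the corresponding Cypher `MERGE` statements; the `Entity`/`Relation` shapes and the `Entity` node label are illustrative assumptions, not GraphifyPDF's actual schema (the project validates its data with pydantic models under `schemas/`).

```python
from dataclasses import dataclass

@dataclass
class Entity:
    name: str
    etype: str          # e.g. "Person", "Concept"
    description: str = ""

@dataclass
class Relation:
    source: str         # source entity name
    target: str         # target entity name
    rtype: str          # e.g. "WORKS_FOR"

def entity_to_cypher(e: Entity) -> str:
    # MERGE is idempotent: re-ingesting the same document
    # will not create duplicate nodes for the same entity name.
    return (
        "MERGE (n:Entity {name: $name}) "
        "SET n.type = $etype, n.description = $description"
    )

def relation_to_cypher(r: Relation) -> str:
    # Relationship types cannot be passed as query parameters in Cypher,
    # so the type is sanitized and interpolated into the query text.
    rtype = "".join(c for c in r.rtype.upper() if c.isalnum() or c == "_")
    return (
        "MATCH (a:Entity {name: $source}), (b:Entity {name: $target}) "
        f"MERGE (a)-[:{rtype}]->(b)"
    )
```

Each query string would then be executed through the official `neo4j` Python driver, passing the entity fields as query parameters (e.g. `session.run(query, name=e.name, ...)`).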
```bash
# Clone the repo
git clone https://github.com/yellowberard/GraphifyPDF.git && cd GraphifyPDF

# Install Python dependencies
pip install -r requirements.txt   # or use pyproject.toml / uv

# Configure Neo4j and OpenAI/Mistral API keys in configs

# Launch visualization UI
streamlit run graphifypdf/ui/graphify_ui.py
```

(Instructions may be updated based on future CLI improvements.)
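The exact configuration format depends on the files under `configs/`. As an illustration only, the required credentials typically look like the following environment variables (the variable names here are hypothetical, not necessarily the keys GraphifyPDF reads):

```shell
# Hypothetical example values: replace with your own credentials
export NEO4J_URI="bolt://localhost:7687"
export NEO4J_USER="neo4j"
export NEO4J_PASSWORD="your-password"
export OPENAI_API_KEY="sk-..."       # or a Mistral API key for Mistral models
```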
- Python 3.10+
- DSPy (LLM pipeline framework), OpenAI GPT-4o / Mistral (LLMs)
- Neo4j (graph database)
- Streamlit + pyvis (interactive graph UI)
- pydantic (schema, config)
- Research paper knowledge mining
- Legal document intelligence
- Automated business process understanding
- Healthcare or scientific document structuring
- Semantic information retrieval across large archives
PRs and issues are welcome! See CONTRIBUTING.md (to be created) for guidelines.
MIT License
GraphifyPDF uniquely empowers you to transform raw, unsearchable documents into rich, structured, explorable knowledge graphs—automatically and at scale.