GraphifyPDF is an open-source Python framework that automatically extracts, structures, and visualizes knowledge from unstructured documents (such as PDFs) as an interactive knowledge graph. It combines large language models (LLMs), OCR, and Neo4j graph technology to make AI-driven document understanding and transformation simple, robust, and explorable for everyone.

GraphifyPDF turns complex, unstructured documents (such as OCR-processed PDFs) into knowledge graphs. An LLM pipeline automatically identifies entities and their relationships in the text, then inserts this structured data into a Neo4j graph database. Visualize, query, and explore your documents, turning raw PDFs into actionable, interconnected knowledge.
- Automated Entity & Relationship Extraction: Uses advanced LLMs (OpenAI GPT-4o via DSPy) for high-accuracy semantic extraction.
- Semantic Chunking: Chunks documents at the concept/paragraph level for smarter context processing.
- Graph Database Integration: Seamlessly pushes results into a Neo4j database for lightning-fast, flexible querying.
- Instant Visualization: Built-in Streamlit UI with pyvis lets users interact with and explore the knowledge graph.
- End-to-End Processing: Complete workflow from PDF (via OCR) → JSON → graph database → visual UI.
- Next-Gen Document AI: Most solutions stop at data extraction. GraphifyPDF goes further by automatically mapping extracted knowledge into a graph structure, powering downstream semantic applications.
- LLM Power: The system is built on modern LLMs, which outperform traditional regex-, POS-, and rule-based NLP extraction methods.
- Full Stack for Knowledge Graphs: You get extraction, structuring, storage, and visualization out-of-the-box.
- Open, Extensible, and Fast: Built for real developers and researchers. Modern Python. Pluggable analysis.
```
graphifypdf/
├── agents/         # LLM-based entity/relation extraction modules
├── configs/        # Configuration and secrets
├── ocr_extractor/  # OCR-specific helpers
├── schemas/        # Data schemas for validation and modeling
├── ui/             # Streamlit/pyvis front-end for graph viz
├── utils/          # Chunking, Neo4j connectors, helpers
└── main.py         # Core workflow script
```
1. Document Ingestion: An uploaded PDF is converted via OCR into JSON with Markdown-formatted text.
2. Semantic Chunking: The `DocumentChunker` breaks the document into meaningful text chunks (usually by paragraphs, logical sections, or topics).
3. Entity & Relation Extraction: Each chunk is processed by the modular `ExtractionPipeline` (built with DSPy and LLMs):
   - Entity Extraction: Detects real-world objects, concepts, people, events, etc.
   - Relation Extraction: Finds how entities relate to each other.
   - Entity Reflection & Refinement: Uses AI feedback to improve extraction accuracy.
   - Description Generation: Automatically builds natural-language descriptions.
   - Data Unification: Cleans and prepares data for graph ingestion.
4. Graph Construction/Storage: Detected entities become Neo4j nodes and relationships become Neo4j edges, structuring your document as a real knowledge graph.
5. Visualization: Use the Streamlit-powered UI to interactively explore and search your new knowledge graph with pyvis.
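The semantic-chunking step described above can be sketched in plain Python. This is an illustrative stand-in that assumes blank-line paragraph boundaries and a character budget (`max_chars` is a hypothetical parameter); it is not the project's actual `DocumentChunker` implementation.

```python
import re

def chunk_paragraphs(text: str, max_chars: int = 1500) -> list[str]:
    """Split text into paragraph-level chunks, merging consecutive
    paragraphs until a rough size budget is reached."""
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Start a new chunk once adding this paragraph would exceed the budget.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk is small enough to fit comfortably in an LLM context window while still keeping related sentences together.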
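The graph-construction step maps extracted entities to Neo4j nodes and relations to edges. The sketch below shows one way to generate the corresponding Cypher `MERGE` statements; the `Entity`/`Relation` shapes and the `Entity` node label are illustrative assumptions, not GraphifyPDF's actual schema (the project validates its data with pydantic models under `schemas/`).

```python
from dataclasses import dataclass

@dataclass
class Entity:
    name: str
    etype: str          # e.g. "Person", "Concept"
    description: str = ""

@dataclass
class Relation:
    source: str         # source entity name
    target: str         # target entity name
    rtype: str          # e.g. "WORKS_FOR"

def entity_to_cypher(e: Entity) -> str:
    # MERGE is idempotent: re-ingesting the same document
    # will not create duplicate nodes for the same entity name.
    return (
        "MERGE (n:Entity {name: $name}) "
        "SET n.type = $etype, n.description = $description"
    )

def relation_to_cypher(r: Relation) -> str:
    # Relationship types cannot be passed as query parameters in Cypher,
    # so the type is sanitized and interpolated into the query text.
    rtype = "".join(c for c in r.rtype.upper() if c.isalnum() or c == "_")
    return (
        "MATCH (a:Entity {name: $source}), (b:Entity {name: $target}) "
        f"MERGE (a)-[:{rtype}]->(b)"
    )
```

Each query string would then be executed through the official `neo4j` Python driver, passing the entity fields as query parameters (e.g. `session.run(query, name=e.name, ...)`).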
```bash
# Clone the repo
git clone https://github.com/yellowberard/GraphifyPDF.git && cd GraphifyPDF

# Install Python dependencies
pip install -r requirements.txt   # or use pyproject.toml / uv

# Configure Neo4j and OpenAI/Mistral API keys in configs

# Launch visualization UI
streamlit run graphifypdf/ui/graphify_ui.py
```

(Instructions may be updated based on future CLI improvements.)
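The exact configuration format depends on the files under `configs/`. As an illustration only, the required credentials typically look like the following environment variables (the variable names here are hypothetical, not necessarily the keys GraphifyPDF reads):

```shell
# Hypothetical example values: replace with your own credentials
export NEO4J_URI="bolt://localhost:7687"
export NEO4J_USER="neo4j"
export NEO4J_PASSWORD="your-password"
export OPENAI_API_KEY="sk-..."       # or a Mistral API key for Mistral models
```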
- Python 3.10+
- DSPy (LLM pipeline framework), OpenAI GPT-4o / Mistral (LLMs)
- Neo4j (graph database)
- Streamlit + pyvis (interactive graph UI)
- pydantic (schema, config)
- Research paper knowledge mining
- Legal document intelligence
- Automated business process understanding
- Healthcare or scientific document structuring
- Semantic information retrieval across large archives
PRs and issues are welcome! See CONTRIBUTING.md (to be created) for guidelines.
MIT License
GraphifyPDF uniquely empowers you to transform raw, unsearchable documents into rich, structured, explorable knowledge graphs—automatically and at scale.