GraphifyPDF is an AI Agents-powered framework for turning raw PDF text into rich knowledge graphs—making document understanding, querying, and visualization easy from start to finish.

GraphifyPDF

GraphifyPDF is an open-source Python framework that automatically extracts, structures, and visualizes knowledge from unstructured documents (such as PDFs) as an interactive knowledge graph. It combines large language models (LLMs), OCR, and the Neo4j graph database to make AI-driven document understanding simple, robust, and explorable.


🚀 What Does GraphifyPDF Do?

GraphifyPDF transforms complex, unstructured documents (such as PDFs processed with OCR) into knowledge graphs. An LLM pipeline identifies entities and their relationships in the text, then inserts this structured data into a Neo4j graph database, where you can visualize, query, and explore it.


🧩 Core Features

  • Automated Entity & Relationship Extraction
    • Uses advanced LLMs (OpenAI GPT-4o via DSPy) for high-accuracy semantic extraction.
  • Semantic Chunking
    • Chunks documents at the concept/paragraph level for smarter context processing.
  • Graph Database Integration
    • Seamlessly pushes results into a Neo4j database for lightning-fast, flexible querying.
  • Instant Visualization
    • Built-in Streamlit UI with pyvis lets users interact with and explore the knowledge graph.
  • End-to-End Processing
    • Complete workflow from PDF (via OCR) → JSON → graph database → visual UI.
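To make the Semantic Chunking feature above more concrete, here is a minimal sketch of paragraph-level chunking. It is not the project's `DocumentChunker`; the function name `chunk_by_paragraph` and the `max_chars` parameter are hypothetical, and the real implementation may also split by sections or topics.

```python
import re


def chunk_by_paragraph(text: str, max_chars: int = 1000) -> list[str]:
    """Split text on blank lines, then pack paragraphs into chunks.

    Hypothetical helper: each chunk holds whole paragraphs and stays within
    max_chars where possible.
    """
    # A paragraph boundary is one or more blank (or whitespace-only) lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks: list[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            # Keep growing the current chunk. A single oversized paragraph
            # is kept whole rather than cut mid-sentence.
            current = candidate
        else:
            # The chunk is full: store it and start a new one.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Packing whole paragraphs keeps each chunk self-contained, which tends to give the extraction step enough context to resolve references inside a paragraph.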

📈 Why Is GraphifyPDF Unique?

  • Next-Gen Document AI: Most solutions stop at data extraction. GraphifyPDF goes further by automatically mapping extracted knowledge into a graph structure, powering downstream semantic applications.
  • LLM Power: Extraction is driven by LLMs rather than hand-written regex or POS-tagging rules, so it can pick up entities and relations that fixed patterns tend to miss.
  • Full Stack for Knowledge Graphs: You get extraction, structuring, storage, and visualization out-of-the-box.
  • Open, Extensible, and Fast: Built for real developers and researchers. Modern Python. Pluggable analysis.

🗂️ Project Structure

graphifypdf/
├── agents/             # LLM-based entity/relation extraction modules
├── configs/            # Configuration and secrets
├── ocr_extractor/      # OCR-specific helpers
├── schemas/            # Data schemas for validation and modeling
├── ui/                 # Streamlit/pyvis front-end for graph viz
├── utils/              # Chunking, Neo4j connectors, helpers
└── main.py             # Core workflow script

🔬 How Does It Work? — Basic Flow

  1. Document Ingestion:
    After you upload a document, OCR converts the PDF into JSON with Markdown content.

  2. Semantic Chunking:
    The DocumentChunker breaks your document into meaningful text chunks (usually by paragraphs, logical sections, or topics).

  3. Entity & Relation Extraction:
    Each chunk is processed by the modular ExtractionPipeline (built with DSPy and LLMs):

    • Entity Extraction: Detects real-world objects, concepts, people, events, etc.
    • Relation Extraction: Finds how entities relate to each other.
    • Entity Reflection & Refinement: Uses AI feedback to improve extraction accuracy.
    • Description Generation: Automatically builds natural-language descriptions.
    • Data Unification: Cleans and prepares data for graph ingestion.

  4. Graph Construction/Storage:
    Detected entities are turned into Neo4j nodes, relationships into Neo4j edges—structuring your document as a real knowledge graph.

  5. Visualization:
    Use the Streamlit-powered UI to interactively explore and search your new knowledge graph using pyvis.


Quickstart

# Clone the repo
git clone https://github.com/yellowberard/GraphifyPDF.git && cd GraphifyPDF

# Install Python dependencies
pip install -r requirements.txt  # or use pyproject.toml / uv

# Configure Neo4j and OpenAI/Mistral API keys in configs

# Launch visualization UI
streamlit run graphifypdf/ui/graphify_ui.py

(Instructions may be updated based on future CLI improvements.)


Tech Stack

  • Python 3.10+
  • DSPy (LLM pipeline framework), OpenAI GPT-4o/Mistral (LLMs)
  • Neo4j (graph database)
  • Streamlit + pyvis (interactive graph UI)
  • pydantic (schema, config)

🌍 Use Cases

  • Research paper knowledge mining
  • Legal document intelligence
  • Automated business process understanding
  • Healthcare or scientific document structuring
  • Semantic information retrieval across large archives

🙌 Contributing

PRs and issues are welcome! See CONTRIBUTING.md (to be created) for guidelines.


📝 License

MIT License


GraphifyPDF uniquely empowers you to transform raw, unsearchable documents into rich, structured, explorable knowledge graphs—automatically and at scale.
