📄 DocVision Parser

A document parsing framework powered by Vision Language Models (VLMs) and native PDF extraction.

Overview

DocVision Parser is a robust Python library designed to extract high-quality structured text and markdown from documents (images and PDFs). It combines the speed of native PDF extraction with the reasoning power of Vision Language Models (like GPT-4o, Claude, or Llama 3.2).

The framework provides three powerful parsing modes:

PDF (Native): Ultra-fast extraction of text and tables using deterministic rules.
VLM Mode: High-fidelity single-shot parsing using Vision models to understand layout and context.
Agentic Mode: A self-correcting, iterative workflow that handles long documents and complex layouts by automatically detecting truncation or repetition.
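
The truncation and repetition checks mentioned for Agentic Mode can be pictured with a small sketch. This is not DocVision's actual implementation: the function names `looks_truncated` and `has_repetition` and the `max_repeats` threshold are hypothetical, chosen only to illustrate the idea. The one external fact relied on is that OpenAI-compatible chat APIs report `finish_reason == "length"` when a response stops at the token limit.

```python
from typing import Optional


def looks_truncated(finish_reason: Optional[str]) -> bool:
    """Hypothetical check: the model stopped because it hit the token limit."""
    # OpenAI-compatible APIs set finish_reason to "length" in that case.
    return finish_reason == "length"


def has_repetition(text: str, max_repeats: int = 3) -> bool:
    """Hypothetical check: the same non-empty line appears more than
    max_repeats times in a row, a common sign of a model stuck in a loop."""
    run = 0
    previous = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines do not break or extend a run
        if line == previous:
            run += 1
        else:
            previous = line
            run = 1
        if run > max_repeats:
            return True
    return False
```

A real detector would likely need to be more forgiving, for example with tables whose rows legitimately repeat.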

Features

Hybrid PDF Parsing: Extract native text and tables, and optionally use a VLM to describe charts and images in place.
Agentic/Iterative Workflow: Self-correcting loop that handles model token limits and ensures complete transcription for long pages.
Intelligent Vision Pipeline: Automatic image rotation correction, DPI management, and dynamic optimization for the best VLM input.
Async-First: High-throughput processing with built-in concurrency control (Semaphores).
Structured Output: Native Pydantic support for extracting structured JSON data from any document.
Production-Ready: Automatic retries, error handling, and direct export to Markdown or JSON files.

Installation

Install using pip:

pip install docvision

Or using uv (recommended):

uv add docvision

Quick Start

Basic Usage

Initialize the DocumentParser and parse an image into Markdown.

import asyncio
from docvision import DocumentParser

async def main():
    # Initialize the parser
    parser = DocumentParser(
        vlm_base_url="https://api.openai.com/v1",
        vlm_model="gpt-4o-mini",
        vlm_api_key="your_api_key"
    )

    # Parse an image
    result = await parser.parse_image("document.jpg")
    
    print(result.content)
    print(f"ID: {result.id}")

if __name__ == "__main__":
    asyncio.run(main())

Parsing PDFs

The parser can handle PDFs using different strategies.

from docvision import DocumentParser, ParsingMode

async def parse_doc():
    parser = DocumentParser(vlm_base_url=..., vlm_model=..., vlm_api_key=...)

    # Mode 1: Native PDF (Fastest, no Vision costs)
    results = await parser.parse_pdf("report.pdf", parsing_mode=ParsingMode.PDF)

    # Mode 2: VLM (Best for complex layouts/handwriting)
    results = await parser.parse_pdf("scanned.pdf", parsing_mode=ParsingMode.VLM)

    # Mode 3: AGENTIC (Self-correcting for long tables/text)
    results = await parser.parse_pdf("dense.pdf", parsing_mode=ParsingMode.AGENTIC)

    # Save results directly to file
    await parser.parse_pdf("input.pdf", save_path="./output/results.md")

Advanced Features

Structured Output (JSON)

Extract data directly into Pydantic models.

import asyncio
from typing import List

from pydantic import BaseModel

from docvision import DocumentParser

class Item(BaseModel):
    description: str
    price: float

class Invoice(BaseModel):
    invoice_no: str
    items: List[Item]

async def main():
    # Note: system_prompt is required when using structured output
    parser = DocumentParser(
        vlm_api_key="...",
        system_prompt="Extract invoice details correctly."
    )

    result = await parser.parse_image("invoice.png", output_schema=Invoice)
    print(result.content.invoice_no)  # Content is now a Pydantic object

if __name__ == "__main__":
    asyncio.run(main())

Hybrid Parsing (Native + VLM)

Use native extraction for text but let the VLM describe the charts.

import asyncio
from docvision import DocumentParser, ParsingMode

async def main():
    parser = DocumentParser(
        vlm_api_key="...",
        chart_description=True  # Enables the VLM hybrid for native mode
    )

    # Text and tables are extracted natively, but <chart> tags
    # will contain VLM-generated descriptions.
    results = await parser.parse_pdf("chart_heavy.pdf", parsing_mode=ParsingMode.PDF)

if __name__ == "__main__":
    asyncio.run(main())
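
Once hybrid output is available, the chart descriptions may need to be pulled out of the Markdown. The sketch below assumes the descriptions are wrapped as plain `<chart>...</chart>` pairs with no attributes. The README only says that `<chart>` tags contain the descriptions, so the exact tag shape is an assumption, and `extract_chart_descriptions` is a hypothetical helper, not part of the library.

```python
import re

# Assumed tag shape: <chart>description</chart>, possibly spanning several lines.
CHART_PATTERN = re.compile(r"<chart>(.*?)</chart>", re.DOTALL)


def extract_chart_descriptions(markdown: str) -> list[str]:
    """Return the text inside every <chart>...</chart> pair, in document order."""
    return [match.strip() for match in CHART_PATTERN.findall(markdown)]


sample = (
    "# Q3 Report\n\n"
    "Revenue grew steadily.\n\n"
    "<chart>Bar chart of quarterly revenue, rising each quarter.</chart>\n\n"
    "<chart>\nPie chart of revenue by region.\n</chart>\n"
)
descriptions = extract_chart_descriptions(sample)
```

If the real tags carry attributes or nest, the pattern would need adjusting.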

Configuration

The DocumentParser is configured during initialization.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `vlm_base_url` | `str` | `None` | OpenAI-compatible API base URL. |
| `vlm_model` | `str` | `None` | Model name (e.g., `gpt-4o`). |
| `vlm_api_key` | `str` | `None` | Your API key. |
| `temperature` | `float` | `0.7` | Model sampling temperature. |
| `max_tokens` | `int` | `4096` | Max tokens per VLM call. |
| `max_iterations` | `int` | `3` | Max retries/loops in Agentic mode. |
| `max_concurrency` | `int` | `5` | Max concurrent pages being processed. |
| `enable_rotate` | `bool` | `True` | Auto-fix image orientation. |
| `chart_description` | `bool` | `False` | Use VLM to describe charts in Native mode. |
| `render_zoom` | `float` | `2.0` | DPI multiplier for PDF rendering. |
| `debug_dir` | `str` | `None` | Directory to save debug images. |
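
Because every option is a constructor keyword, one convenient pattern is to collect them from environment variables and pass them through as keyword arguments. The sketch below is only an illustration. The `DOCVISION_*` variable names and the `parser_kwargs_from_env` helper are hypothetical and not something the library is known to read. The parameter names and types are taken from the table above, and boolean options are left out for brevity.

```python
import os
from typing import Any, Mapping, Optional

# Parameter names and types from the configuration table.
_CASTS = {
    "vlm_base_url": str,
    "vlm_model": str,
    "vlm_api_key": str,
    "temperature": float,
    "max_tokens": int,
    "max_iterations": int,
    "max_concurrency": int,
    "render_zoom": float,
}


def parser_kwargs_from_env(env: Optional[Mapping[str, str]] = None) -> dict[str, Any]:
    """Build constructor kwargs from hypothetical DOCVISION_* variables.

    Variables that are not set are skipped, so the parser's own defaults apply.
    """
    env = os.environ if env is None else env
    kwargs: dict[str, Any] = {}
    for name, cast in _CASTS.items():
        raw = env.get(f"DOCVISION_{name.upper()}")
        if raw is not None:
            kwargs[name] = cast(raw)
    return kwargs
```

Usage would then look like `DocumentParser(**parser_kwargs_from_env())`, which keeps API keys out of the source code.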

Architecture

DocVision Parser is built for reliability and scale:

VLMClient: Handles asynchronous communication with OpenAI/Groq/OpenRouter with built-in retries and timeout management.
NativePDFParser: Uses pdfplumber to extract structured text and complex tables while maintaining reading order.
ImageProcessor: A high-performance pipeline for converting PDFs and optimizing images (resizing, padding, rotating).
AgenticWorkflow: A state machine that manages long-running generation tasks, ensuring complete document transcription.
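
The iterative behaviour attributed to AgenticWorkflow can be sketched as a bounded continuation loop. This is a simplified illustration under stated assumptions, not the library's state machine: `transcribe_with_continuation`, `GenerateFn`, and `make_fake_model` are hypothetical names, and the fake model stands in for a real VLM call. Only the `max_iterations` cap mirrors a documented parameter.

```python
from typing import Callable, List, Tuple

# generate(text_written_so_far) -> (next_chunk, finished)
GenerateFn = Callable[[str], Tuple[str, bool]]


def transcribe_with_continuation(
    generate: GenerateFn, max_iterations: int = 3
) -> Tuple[str, bool]:
    """Call generate repeatedly, feeding back what has been written so far.

    Returns (text, complete). complete is False when the iteration budget
    ran out before the model reported that it was finished.
    """
    parts: List[str] = []
    for _ in range(max_iterations):
        chunk, finished = generate("".join(parts))
        parts.append(chunk)
        if finished:
            return "".join(parts), True
    return "".join(parts), False


def make_fake_model(document: str, chunk_size: int) -> GenerateFn:
    """Fake model that emits the document chunk_size characters at a time."""

    def generate(written: str) -> Tuple[str, bool]:
        start = len(written)
        chunk = document[start:start + chunk_size]
        return chunk, start + chunk_size >= len(document)

    return generate
```

With a 23-character document and 10-character chunks, three iterations recover the full text, while a budget of two returns a partial result flagged as incomplete.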

Development

# Setup
uv sync --dev

# Run Tests
make test

# Lint & Format
make lint
make format

License

Apache 2.0 License. See LICENSE for details.

Author

Fahmi Aziz Fadhil

GitHub: @fahmiaziz98
Email: [email protected]
