Transform PDF documents into structured data using Groq's vision models. Extract tables, images, text, and metadata with AI processing and receive structured JSON output with automatic error handling and retry mechanisms.
- Key Features
- Requirements
- Installation
- Configuration
- Quick Start
- Usage
- Table Extraction Capabilities
- Schema Building
- Integration Examples
- Example Schema
- API Reference
- Output Structure
- Performance
- Testing
- Recent Improvements
- High-Speed Processing: Powered by Groq's optimized inference infrastructure
- Comprehensive Data Extraction: Tables, text, images, charts, and document metadata
- Automatic Configuration: Processing parameters optimized based on document characteristics
- Batch Processing: Efficient handling of multi-page documents with retry mechanisms
- Flexible Schema System: Customizable extraction schemas for domain-specific requirements
- Multiple Interfaces: Command-line interface, Python API, and web application
- Advanced Table Processing: Support for both array-based and object-based table structures
- Error Handling: Comprehensive retry logic with exponential backoff
- Progress Tracking: Real-time processing status with estimated completion times
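The retry-with-exponential-backoff behavior listed above can be sketched as follows. This is an illustrative pattern only, not the library's internal implementation; the function name and delay constants are hypothetical:

```python
import random
import time

def retry_with_backoff(fn, max_retries=3, base_delay=1.0):
    """Call fn(), retrying on failure with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # Out of retries: surface the last error
            # Delay doubles each attempt; jitter avoids synchronized retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient API error")
    return "ok"

print(retry_with_backoff(flaky, base_delay=0.01))  # retries twice, then prints "ok"
```

The same shape applies whether the failure is a rate limit, a timeout, or a transient service error.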
- Python 3.8+
- Groq API key
- Dependencies:
`groq`, `pypdfium2`, `streamlit` (for the web interface)
```bash
# Create and activate a virtual environment (recommended)
python -m venv groq_pdf_vision_env
source groq_pdf_vision_env/bin/activate  # On Windows: groq_pdf_vision_env\Scripts\activate

# Clone and install from source
pip install -e .
```

```bash
# Required: Groq API key
export GROQ_API_KEY="your-api-key-here"

# Optional: Custom model (default: meta-llama/llama-4-scout-17b-16e-instruct)
export GROQ_MODEL="meta-llama/llama-4-scout-17b-16e-instruct"
```

- Get your API key from console.groq.com
- Set it as an environment variable:

```bash
export GROQ_API_KEY="your-api-key-here"
```

```python
from groq_pdf_vision import extract_pdf

# Extract data from a PDF (limit pages for testing)
result = extract_pdf("document.pdf", start_page=1, end_page=5)

# Access page-level results
for page in result["page_results"]:
    print(f"Page {page['page_number']}: {page['content'][:100]}...")

# Access accumulated data
print(f"Total pages: {len(result['page_results'])}")
print(f"Images found: {len(result['accumulated_data']['image_descriptions'])}")
```

Synchronous with Progress:
```python
from groq_pdf_vision import extract_pdf

def progress_callback(message, current, total):
    percentage = (current / total) * 100
    print(f"Processing [{current}/{total}] ({percentage:.1f}%) {message}")

result = extract_pdf(
    "large_document.pdf",
    progress_callback=progress_callback
)
print(f"Completed in {result['metadata']['processing_time_seconds']:.1f} seconds")
```

Async with Progress:
```python
import asyncio
from groq_pdf_vision import extract_pdf_async

def progress_callback(message, current, total):
    percentage = (current / total) * 100
    print(f"Processing [{current}/{total}] ({percentage:.1f}%) {message}")

async def main():
    result, metadata = await extract_pdf_async(
        "large_document.pdf",
        progress_callback=progress_callback
    )
    print(f"Completed in {metadata['processing_time_seconds']:.1f} seconds")

asyncio.run(main())
```

```python
import os
from groq_pdf_vision import extract_pdf

# Set your API key
os.environ["GROQ_API_KEY"] = "your-api-key-here"

# Extract from PDF
result = extract_pdf("financial_report.pdf", save_results=True)

# Process results
for page in result["page_results"]:
    print(f"Page {page['page_number']}:")
    print(f"  Content: {len(page['content'])} characters")
    print(f"  Images: {len(page['image_descriptions'])} found")
    print(f"  Tables: {len(page['tables_data'])} found")
```

```python
import asyncio
from groq_pdf_vision import extract_pdf_async

async def process_document():
    result, metadata = await extract_pdf_async("large_document.pdf")
    print(f"Processed in {metadata['processing_time_seconds']:.1f} seconds")
    return result

result = asyncio.run(process_document())
```

```python
from groq_pdf_vision import extract_pdf
from groq_pdf_vision.schema_helpers import create_base_schema, add_custom_fields

# Use the default comprehensive schema (recommended for most cases)
result = extract_pdf("document.pdf")

# Create a custom schema by extending the base
base_schema = create_base_schema()
custom_fields = {
    "product_names": {
        "type": "array",
        "items": {"type": "string"},
        "description": "Product names mentioned"
    },
    "prices": {
        "type": "array",
        "items": {"type": "string"},
        "description": "Prices and costs mentioned"
    }
}
custom_schema = add_custom_fields(base_schema, custom_fields)
result = extract_pdf("catalog.pdf", schema=custom_schema)

# Or define a completely custom schema
minimal_schema = {
    "type": "object",
    "properties": {
        "page_number": {"type": "integer"},
        "summary": {"type": "string"},
        "key_points": {
            "type": "array",
            "items": {"type": "string"}
        }
    }
}
result = extract_pdf("document.pdf", schema=minimal_schema)
```

```bash
# Basic processing with default comprehensive schema
groq-pdf document.pdf --save

# Specific page range
groq-pdf document.pdf --start-page 1 --end-page 10

# Custom schema file
groq-pdf document.pdf --schema my_schema.json

# Inline JSON schema
groq-pdf document.pdf --schema '{"type":"object","properties":{"summary":{"type":"string"}}}'

# Get document info and processing estimates
groq-pdf document.pdf --info-only
```

Launch the Streamlit web interface:

```bash
streamlit run app.py
```

Then open http://localhost:8501 for drag-and-drop PDF processing.
The library features advanced table extraction that handles multiple data structures:
Array-based Tables (Traditional):
```json
{
  "headers": ["Column 1", "Column 2", "Column 3"],
  "rows": [
    ["Value 1", "Value 2", "Value 3"],
    ["Value 4", "Value 5", "Value 6"]
  ]
}
```

Object-based Tables (Financial/Complex):
```json
{
  "headers": ["Share capital", "Subscribed", "Callable", "Total"],
  "rows": [
    {
      "Subscribed share capital": "2,288,500",
      "Callable share capital": "(1,601,950)",
      "Total": "885,722"
    }
  ]
}
```

- Smart Structure Detection: Automatically handles both array- and object-based row formats
- Intelligent Column Mapping: Maps dictionary keys to table headers with fuzzy matching
- Data Quality Filtering: Removes placeholder and example data automatically
- Consistent Output: Converts all formats to standardized DataFrames for display
- Error Recovery: Graceful handling of malformed or incomplete table data
```python
from groq_pdf_vision import extract_pdf

# Extract tables from financial documents
result = extract_pdf("financial_report.pdf")

# Access table data
for table in result["accumulated_data"]["tables_data"]:
    print(f"Table: {table['table_title']}")
    print(f"Structure: {len(table['headers'])} columns × {len(table['rows'])} rows")

    # Both formats work seamlessly
    if table['headers'] and table['rows']:
        # Process table data regardless of internal structure
        headers = table['headers']
        rows = table['rows']  # Can be arrays or objects
```

The library provides flexible schema-building helpers to extract exactly what you need:
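Because `rows` may arrive as either lists or dicts, downstream code often normalizes them before display. Here is a minimal sketch of that normalization — my own helper, not part of the library, with `difflib` standing in for whatever fuzzy header matching the library actually uses:

```python
import difflib

def normalize_rows(headers, rows):
    """Convert array- or object-based rows to lists aligned with headers."""
    normalized = []
    for row in rows:
        if isinstance(row, dict):
            cells = []
            for header in headers:
                if header in row:
                    # Exact key match
                    cells.append(row[header])
                else:
                    # Fuzzy match, e.g. "Subscribed" vs "Subscribed share capital"
                    match = difflib.get_close_matches(header, row.keys(), n=1, cutoff=0.4)
                    cells.append(row[match[0]] if match else "")
            normalized.append(cells)
        else:
            # Array-based row: pad or truncate to the header length
            cells = list(row)[:len(headers)]
            cells += [""] * (len(headers) - len(cells))
            normalized.append(cells)
    return normalized

rows = [
    ["Value 1", "Value 2"],
    {"Subscribed share capital": "2,288,500", "Total": "885,722"},
]
print(normalize_rows(["Subscribed", "Total"], rows))
# → [['Value 1', 'Value 2'], ['2,288,500', '885,722']]
```

Both row formats come out as uniform lists, ready to feed into a DataFrame constructor.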
```python
from groq_pdf_vision import extract_pdf
from groq_pdf_vision.schema_helpers import create_base_schema, add_custom_fields

# Default schema works for most documents
result = extract_pdf("document.pdf")
```

```python
# Financial document extraction
base = create_base_schema()
financial_fields = {
    "financial_figures": {
        "type": "array",
        "items": {"type": "string"},
        "description": "Revenue, profit, costs, and other financial amounts"
    },
    "companies_mentioned": {
        "type": "array",
        "items": {"type": "string"},
        "description": "Company names and organizations"
    }
}
financial_schema = add_custom_fields(base, financial_fields)

# Research document extraction
research_fields = {
    "methodology": {"type": "string", "description": "Research methodology"},
    "findings": {
        "type": "array",
        "items": {"type": "string"},
        "description": "Key findings and results"
    }
}
research_schema = add_custom_fields(base, research_fields)

# Product catalog extraction
product_fields = {
    "product_names": {
        "type": "array",
        "items": {"type": "string"},
        "description": "Product names mentioned"
    },
    "specifications": {
        "type": "array",
        "items": {"type": "string"},
        "description": "Technical specifications"
    }
}
product_schema = add_custom_fields(base, product_fields)
```

```python
from groq_pdf_vision.schema_helpers import (
    create_base_schema,
    add_custom_fields,
    create_entity_extraction_fields,
    create_list_field,
    create_object_field
)

# Build a schema step by step
schema = create_base_schema(include_images=True, include_tables=False)

# Add entity extraction for any domain
entity_fields = create_entity_extraction_fields(["person", "company", "location"])
schema = add_custom_fields(schema, entity_fields)

# Add custom list fields
contact_fields = create_list_field("contact_emails", "Email addresses found")
schema = add_custom_fields(schema, contact_fields)
```

```python
from flask import Flask, request, jsonify
from groq_pdf_vision import extract_pdf_async
import asyncio

app = Flask(__name__)

@app.route('/process-pdf', methods=['POST'])
def process_pdf():
    file = request.files['file']
    filepath = f"uploads/{file.filename}"
    file.save(filepath)

    async def process():
        return await extract_pdf_async(filepath)

    result, metadata = asyncio.run(process())
    return jsonify({
        "pages": len(result["page_results"]),
        "processing_time": metadata["processing_time_seconds"],
        "data": result["accumulated_data"]
    })
```

```python
from fastapi import FastAPI, UploadFile, File
from groq_pdf_vision import extract_pdf_async
import tempfile

app = FastAPI()

@app.post("/process-pdf/")
async def process_pdf(file: UploadFile = File(...)):
    with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp:
        content = await file.read()
        tmp.write(content)
        tmp_path = tmp.name

    result, metadata = await extract_pdf_async(tmp_path)
    return {
        "filename": file.filename,
        "pages_processed": len(result["page_results"]),
        "processing_time": metadata["processing_time_seconds"]
    }
```

```python
import asyncio
from pathlib import Path
from groq_pdf_vision import extract_pdf_async

async def process_batch(input_dir, output_dir):
    pdf_files = list(Path(input_dir).glob("*.pdf"))
    for pdf_file in pdf_files:
        print(f"Processing {pdf_file.name}")
        result, metadata = await extract_pdf_async(
            str(pdf_file),
            save_results=True,
            output_filename=f"{output_dir}/{pdf_file.stem}_results.json"
        )
        print(f"  Completed: {len(result['page_results'])} pages in {metadata['processing_time_seconds']:.1f}s")

# Process all PDFs in a directory
asyncio.run(process_batch("./input", "./output"))
```

The example_docs/ directory contains a comprehensive example schema (example_custom_schema.json) that demonstrates various field types and extraction patterns.
Method 1: Load JSON Schema Directly
```python
import json
from groq_pdf_vision import extract_pdf

# Load the example schema
with open('example_docs/example_custom_schema.json', 'r') as f:
    schema = json.load(f)

# Use it for extraction
result = extract_pdf("example_docs/example.pdf", schema=schema)
```

Method 2: Build with Schema Helpers (Recommended)
```python
from groq_pdf_vision import extract_pdf
from groq_pdf_vision.schema_helpers import create_base_schema, add_custom_fields

# Start with the base schema
base = create_base_schema()

# Add your custom fields
custom_fields = {
    "document_type": {
        "type": "string",
        "description": "Type of document (financial, technical, academic, etc.)"
    },
    "key_findings": {
        "type": "array",
        "items": {"type": "string"},
        "description": "Most important findings or insights from this page"
    },
    "sentiment": {
        "type": "string",
        "description": "Overall sentiment of the page content"
    }
}

# Combine them
schema = add_custom_fields(base, custom_fields)
result = extract_pdf("example_docs/example.pdf", schema=schema)
```

- Start Simple - Begin with `create_base_schema()` and add only what you need
- Clear Descriptions - Good field descriptions help the AI understand what to extract
- Appropriate Types - Use arrays for lists, objects for structured data, strings for text
- Required Fields - Always include `page_number` and `content` as required
- Test Iteratively - Start with a few pages to test your schema before processing large documents
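Before running a custom schema against a large document, it can help to sanity-check it locally against these practices. The helper below is my own illustrative sketch, not a library function; it only inspects the schema dict and flags likely problems:

```python
def check_schema(schema):
    """Flag common problems in a page-level extraction schema."""
    problems = []
    if schema.get("type") != "object":
        problems.append("top-level type should be 'object'")
    props = schema.get("properties", {})
    # Recommended baseline fields from the best practices above
    for required in ("page_number", "content"):
        if required not in props:
            problems.append(f"missing recommended field: {required}")
    # Descriptions guide the model toward the right content
    for name, field in props.items():
        if "description" not in field and name not in ("page_number", "content"):
            problems.append(f"field '{name}' has no description to guide extraction")
    return problems

schema = {"type": "object", "properties": {"summary": {"type": "string"}}}
for problem in check_schema(schema):
    print(problem)
```

Running the checker before a paid extraction run costs nothing and catches most schema mistakes early.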
Extract data from a PDF file synchronously with comprehensive error handling and progress tracking.
Syntax:
```python
extract_pdf(
    pdf_path: str,
    schema: Optional[Dict] = None,
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
    save_results: bool = False,
    output_filename: Optional[str] = None,
    progress_callback: Optional[Callable] = None
) -> Dict[str, Any]
```

Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `pdf_path` | `str` | Yes | - | Path to the PDF file to process. Must be a valid file path. |
| `schema` | `Dict` | No | `None` | Custom JSON schema for extraction. If None, uses the comprehensive default schema. |
| `start_page` | `int` | No | `1` | Starting page number (1-indexed). Must be ≥ 1. |
| `end_page` | `int` | No | `None` | Ending page number (1-indexed). If None, processes to the end of the document. |
| `save_results` | `bool` | No | `False` | Whether to save results to a JSON file automatically. |
| `output_filename` | `str` | No | `None` | Custom output filename. If None and save_results=True, auto-generates a filename. |
| `progress_callback` | `Callable` | No | `None` | Callback function for progress updates: `callback(message, current, total)` |
Returns:
- Type: `Dict[str, Any]`
- Structure: Complete extraction results containing:
  - `page_results`: List of per-page extraction data
  - `accumulated_data`: Aggregated data across all pages
  - `processing_stats`: Performance and timing information
  - `metadata`: Processing configuration and timestamps
Example Usage:
```python
# Basic extraction
result = extract_pdf("document.pdf")

# With custom schema and page range
custom_schema = {"type": "object", "properties": {"summary": {"type": "string"}}}
result = extract_pdf(
    "document.pdf",
    schema=custom_schema,
    start_page=1,
    end_page=10,
    save_results=True
)

# With progress tracking
def progress_handler(message, current, total):
    print(f"[{current}/{total}] {message}")

result = extract_pdf("document.pdf", progress_callback=progress_handler)
```

Exceptions:

- `FileNotFoundError`: PDF file does not exist
- `ValueError`: Invalid page range or malformed schema
- `PermissionError`: Insufficient permissions to read PDF or write output
- `APIError`: Groq API authentication or rate limit issues
Extract data from a PDF file asynchronously for better performance with large documents.
Syntax:
```python
async extract_pdf_async(
    pdf_path: str,
    schema: Optional[Dict] = None,
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
    save_results: bool = False,
    output_filename: Optional[str] = None,
    progress_callback: Optional[Callable] = None
) -> Tuple[Dict[str, Any], Dict[str, Any]]
```

Parameters:
Same as extract_pdf() above.
Returns:
- Type: `Tuple[Dict[str, Any], Dict[str, Any]]`
- Structure:
  - `[0]`: Extraction results (same structure as `extract_pdf`)
  - `[1]`: Metadata dictionary with processing statistics and configuration
Example Usage:
```python
import asyncio
from groq_pdf_vision import extract_pdf_async

async def process_document():
    result, metadata = await extract_pdf_async("large_document.pdf")
    print(f"Processed {metadata['total_pages']} pages in {metadata['processing_time_seconds']:.1f}s")
    return result

# Run async processing
result = asyncio.run(process_document())
```

Exceptions:
Same as extract_pdf() above.
Create a comprehensive base schema suitable for most document types.
Syntax:
```python
create_base_schema(
    include_images: bool = True,
    include_tables: bool = True
) -> Dict[str, Any]
```

Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `include_images` | `bool` | `True` | Whether to include image analysis fields in the schema |
| `include_tables` | `bool` | `True` | Whether to include table extraction fields in the schema |
Returns:
- Type: `Dict[str, Any]`
- Description: JSON schema with standard fields for page content, metadata, and optional image/table analysis
Example Usage:
```python
# Full schema with all features
schema = create_base_schema()

# Text-only schema (faster processing)
text_schema = create_base_schema(include_images=False, include_tables=False)
```

Add custom extraction fields to an existing schema.
Syntax:
```python
add_custom_fields(
    base_schema: Dict[str, Any],
    custom_fields: Dict[str, Any]
) -> Dict[str, Any]
```

Parameters:
| Parameter | Type | Description |
|---|---|---|
| `base_schema` | `Dict` | Base schema to extend (typically from `create_base_schema()`) |
| `custom_fields` | `Dict` | Dictionary of custom field definitions following JSON Schema format |
Returns:
- Type: `Dict[str, Any]`
- Description: Extended schema with the custom fields merged into the base schema
Example Usage:
```python
base = create_base_schema()
custom_fields = {
    "financial_figures": {
        "type": "array",
        "items": {"type": "string"},
        "description": "Revenue, profit, and cost figures mentioned"
    },
    "risk_factors": {
        "type": "array",
        "items": {"type": "string"},
        "description": "Risk factors and warnings identified"
    }
}
schema = add_custom_fields(base, custom_fields)
```

Generate schema fields for extracting specific entity types.
Syntax:
```python
create_entity_extraction_fields(
    entity_types: List[str]
) -> Dict[str, Any]
```

Parameters:
| Parameter | Type | Description |
|---|---|---|
| `entity_types` | `List[str]` | List of entity types to extract. Supported: `["person", "company", "location", "date", "money", "product", "technology", "email"]` |
Returns:
- Type: `Dict[str, Any]`
- Description: Schema fields for extracting the specified entity types
Example Usage:
```python
# Extract people and companies
entity_fields = create_entity_extraction_fields(["person", "company", "location"])
schema = add_custom_fields(create_base_schema(), entity_fields)
```

Validate a JSON schema for compatibility with the extraction system.
Syntax:
```python
validate_schema(schema: Dict[str, Any]) -> bool
```

Parameters:
| Parameter | Type | Description |
|---|---|---|
| `schema` | `Dict` | JSON schema to validate |
Returns:
- Type: `bool`
- Description: `True` if the schema is valid; raises an exception if invalid
Exceptions:
- `ValueError`: Schema format is invalid or incompatible
Get metadata and information about a PDF file.
Syntax:
```python
get_pdf_info(pdf_path: str) -> Dict[str, Any]
```

Parameters:
| Parameter | Type | Description |
|---|---|---|
| `pdf_path` | `str` | Path to PDF file |
Returns:
- Type: `Dict[str, Any]`
- Structure:

```python
{
    "total_pages": int,
    "file_size_mb": float,
    "estimated_processing_time": float,
    "recommended_batch_size": int,
    "estimated_cost": float
}
```
Example Usage:
```python
info = get_pdf_info("large_document.pdf")
print(f"Document has {info['total_pages']} pages")
print(f"Estimated processing time: {info['estimated_processing_time']:.1f} minutes")
print(f"Estimated cost: ${info['estimated_cost']:.2f}")
```

All functions may raise the following common exceptions:
| Exception | Description | Common Causes |
|---|---|---|
| `FileNotFoundError` | PDF file not found | Invalid file path, file moved or deleted |
| `PermissionError` | Insufficient file permissions | Read-only files, permission restrictions |
| `ValueError` | Invalid parameter values | Negative page numbers, malformed schema |
| `APIError` | Groq API issues | Invalid API key, rate limits, service unavailable |
| `ProcessingError` | PDF processing failures | Corrupted PDF, unsupported format |
Example Error Handling:
```python
try:
    result = extract_pdf("document.pdf", start_page=1, end_page=10)
except FileNotFoundError:
    print("PDF file not found. Please check the file path.")
except ValueError as e:
    print(f"Invalid parameters: {e}")
except APIError as e:
    print(f"API error: {e}. Check your API key and rate limits.")
except Exception as e:
    print(f"Unexpected error: {e}")
```

```json
{
  "source_pdf": "document.pdf",
  "page_results": [
    {
      "page_number": 1,
      "content": "Extracted text content...",
      "image_descriptions": [
        {
          "image_type": "chart",
          "description": "Bar chart showing quarterly revenue",
          "location": "center",
          "content_relation": "Supports revenue discussion"
        }
      ],
      "tables_data": [
        {
          "table_content": "Q1: $1M, Q2: $1.2M...",
          "table_structure": "2x4 table with headers"
        }
      ]
    }
  ],
  "accumulated_data": {
    "total_content": "All extracted text...",
    "all_image_descriptions": [...],
    "all_tables_data": [...],
    "visual_summary": "Document contains 5 charts and 3 tables"
  },
  "processing_stats": {
    "total_pages": 10,
    "pages_with_images": 3,
    "pages_with_tables": 5,
    "processing_time_seconds": 45.2
  }
}
```

- Processing Speed: 1,300-2,000+ tokens/second (optimized batch processing)
- Intelligent Auto-Configuration: Batch sizes automatically scale based on document size
- Reliability: Retry mechanisms with graceful error handling across test scenarios
- Memory Usage: Optimized for large documents up to 200+ pages
- Enhanced Table Extraction: Supports both array-based and object-based table row structures
- Improved Consistency: Lower temperature settings (0.05) for more reliable extraction results
- Real-time Progress: Live progress tracking with terminal logging and ETA calculations
| Document Type | Pages | Processing Time | Throughput |
|---|---|---|---|
| Government Reports | 85-118 pages | 1.8-3.4 minutes | 1,700+ tok/sec |
| Financial Documents | 76-88 pages | 2.4-2.9 minutes | 1,400+ tok/sec |
| Technical Documents | 50+ pages | 1-2 minutes | 1,500+ tok/sec |
- Small PDFs (≤10 pages): batch_size=2, high quality processing
- Medium PDFs (11-50 pages): batch_size=3, balanced processing
- Large PDFs (51-200 pages): batch_size=4, efficient batch processing
- Enterprise PDFs (>200 pages): batch_size=5, maximum batch efficiency
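The tiers above amount to a simple lookup by page count. This sketch mirrors the documented thresholds; the function itself is illustrative, not the library's actual auto-configuration code:

```python
def auto_batch_size(total_pages):
    """Pick a batch size from the documented page-count tiers."""
    if total_pages <= 10:
        return 2   # small PDFs: high quality processing
    if total_pages <= 50:
        return 3   # medium PDFs: balanced processing
    if total_pages <= 200:
        return 4   # large PDFs: efficient batch processing
    return 5       # enterprise PDFs: maximum batch efficiency

print(auto_batch_size(118))  # → 4
```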
- 50% fewer API calls through intelligent batching
- 38% faster processing with optimized auto-configuration
- Real-time progress tracking with ETA calculations
- Automatic retry logic with exponential backoff
- Enhanced table extraction supporting multiple data structures
- Improved data filtering to remove placeholder content
- Lower temperature settings for more consistent results
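The ETA shown during progress tracking can be approximated from elapsed time and pages completed. A hedged sketch of that arithmetic (a simple linear extrapolation, not the library's internal code):

```python
def estimate_remaining_seconds(elapsed_seconds, pages_done, pages_total):
    """Linear ETA: assume remaining pages take as long per page as completed ones."""
    if pages_done == 0:
        return None  # No data yet to extrapolate from
    per_page = elapsed_seconds / pages_done
    return per_page * (pages_total - pages_done)

# e.g. 30s for the first 10 of 85 pages -> ~225s remaining
print(estimate_remaining_seconds(30.0, 10, 85))  # → 225.0
```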
The repository includes a comprehensive test suite with both quick validation and full document stress testing.
Quick Test (Recommended):
```bash
cd tests
python run_all_tests.py
```

Expected Output:
```
Running Basic Tests...
README Examples: All Python SDK examples working
Integration Tests: Flask/FastAPI patterns verified
Schema Tests: Custom schema functionality working
CLI Tests: All command-line options working

Results: 4/4 basic test suites passed (completed in ~30 seconds)

Would you like to run full document tests? (y/N): N
Skipping full document tests (saves 15-20 minutes and ~$1.00 in API costs)

All basic tests passed! The library is working correctly.
```
- README Examples (`test_readme_examples.py`): All Python SDK examples
- Integration Tests (`test_flask_integration.py`): Flask/FastAPI patterns
- Schema Tests (`test_example_schema.py`): Custom schema functionality
- CLI Tests: All command-line options and parameters
- Vision 2030 (85 pages): Saudi government document processing
- Example Financial (76 pages): Financial document with heavy table content
- Americas Children (118 pages): US government statistical report
- Fed Economic Wellbeing (88 pages): Federal Reserve economic research
Our test suite includes comprehensive performance validation:
| Document | Pages | Time | Tokens | Speed | Cost Est. |
|---|---|---|---|---|---|
| Vision 2030 | 85 | 1.9 min | 230K | 2,033 tok/sec | ~$0.46 |
| Example Financial | 76 | 3.3 min | 187K | 1,354 tok/sec | ~$0.37 |
| Americas Children | 118 | 3.3 min | 352K | 1,761 tok/sec | ~$0.70 |
| Fed Economic | 88 | 3.0 min | 259K | 1,433 tok/sec | ~$0.52 |
Run specific test categories:
```bash
# Basic functionality tests
python tests/test_readme_examples.py
python tests/test_flask_integration.py
python tests/test_example_schema.py

# Full document stress tests (API costs apply)
python tests/test_vision2030_full_async.py
python tests/test_americas_children_full_async.py
python tests/test_fed_economic_wellbeing_full_async.py
```

The test suite verifies:
- All README Python SDK examples work correctly
- All CLI commands and options function properly
- Schema creation, validation, and custom field usage
- Synchronous and asynchronous processing modes
- Progress callbacks and real-time updates
- Integration patterns (Flask, FastAPI, batch processing)
- File I/O, path handling, and error conditions
- Performance benchmarks and auto-configuration
- Large document processing (up to 118 pages)
- Enhanced table extraction with multiple data structures
- Streamlit UI functionality without nested expander errors
- Data quality filtering and placeholder content removal
- API Key: `export GROQ_API_KEY="your-key-here"`
- Package Installation: `pip install -e .`
- Virtual Environment (recommended)
The test runner includes built-in cost awareness:
- Basic Tests: Free validation using small document excerpts
- Full Tests: User confirmation with clear cost warnings ($0.20-0.35 per test)
- Default Behavior: Skips expensive tests unless explicitly confirmed
- Graceful Handling: Missing test files don't break the test suite
For detailed performance analysis, auto-configuration results, and quality metrics, see tests/README.md.
Major Enhancements:
- Enhanced Table Extraction: Now supports both array-based and object-based table row structures
- Improved Data Quality: Better filtering of placeholder and example data
- Lower Temperature: Reduced to 0.05 for more consistent extraction results
- Real-time Progress: Enhanced progress tracking with terminal logging and ETA calculations
Bug Fixes:
- Fixed Streamlit UI: Resolved nested expander errors in table display
- Improved Error Handling: Better graceful recovery from processing failures
- Enhanced Data Filtering: More effective removal of placeholder content
Performance Improvements:
- Faster Processing: Optimized batch processing with intelligent auto-configuration
- Better Reliability: Enhanced retry mechanisms with exponential backoff
- Comprehensive Testing: All test suites pass including full document processing
Test Suite Validation:
- All 8 test categories pass (README examples, integrations, schemas, CLI, full documents)
- 367 total pages processed across 4 different document types
- Performance benchmarks updated with latest results
- Cost estimates and processing speeds validated
