Transform PDF documents into structured data using Groq's vision models. Extract tables, images, text, and metadata with AI processing and receive structured JSON output with automatic error handling and retry mechanisms.
- Key Features
- Requirements
- Installation
- Configuration
- Quick Start
- Usage
- Table Extraction Capabilities
- Schema Building
- Integration Examples
- Example Schema
- API Reference
- Output Structure
- Performance
- Testing
- Recent Improvements
- High-Speed Processing: Powered by Groq's optimized inference infrastructure
- Comprehensive Data Extraction: Tables, text, images, charts, and document metadata
- Automatic Configuration: Processing parameters optimized based on document characteristics
- Batch Processing: Efficient handling of multi-page documents with retry mechanisms
- Flexible Schema System: Customizable extraction schemas for domain-specific requirements
- Multiple Interfaces: Command-line interface, Python API, and web application
- Advanced Table Processing: Support for both array-based and object-based table structures
- Error Handling: Comprehensive retry logic with exponential backoff
- Progress Tracking: Real-time processing status with estimated completion times
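The retry-with-exponential-backoff behavior listed above can be sketched as follows. This is an illustrative pattern only, not the library's internal implementation; the function name and delay constants are hypothetical:

```python
import random
import time

def retry_with_backoff(fn, max_retries=3, base_delay=1.0):
    """Call fn(), retrying on failure with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # Out of retries: surface the last error
            # Delay doubles each attempt; jitter avoids synchronized retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient API error")
    return "ok"

print(retry_with_backoff(flaky, base_delay=0.01))  # retries twice, then prints "ok"
```

The same shape applies whether the failure is a rate limit, a timeout, or a transient service error.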
- Python 3.8+
- Groq API key
- Dependencies:
`groq`, `pypdfium2`, `streamlit` (for the web interface)
```bash
# Create and activate a virtual environment (recommended)
python -m venv groq_pdf_vision_env
source groq_pdf_vision_env/bin/activate  # On Windows: groq_pdf_vision_env\Scripts\activate

# Clone and install from source
pip install -e .
```

```bash
# Required: Groq API key
export GROQ_API_KEY="your-api-key-here"

# Optional: Custom model (default: meta-llama/llama-4-scout-17b-16e-instruct)
export GROQ_MODEL="meta-llama/llama-4-scout-17b-16e-instruct"
```

- Get your API key from console.groq.com
- Set it as an environment variable:

```bash
export GROQ_API_KEY="your-api-key-here"
```

```python
from groq_pdf_vision import extract_pdf

# Extract data from a PDF (limit pages for testing)
result = extract_pdf("document.pdf", start_page=1, end_page=5)

# Access page-level results
for page in result["page_results"]:
    print(f"Page {page['page_number']}: {page['content'][:100]}...")

# Access accumulated data
print(f"Total pages: {len(result['page_results'])}")
print(f"Images found: {len(result['accumulated_data']['image_descriptions'])}")
```

Synchronous with Progress:
```python
from groq_pdf_vision import extract_pdf

def progress_callback(message, current, total):
    percentage = (current / total) * 100
    print(f"Processing [{current}/{total}] ({percentage:.1f}%) {message}")

result = extract_pdf(
    "large_document.pdf",
    progress_callback=progress_callback
)
print(f"Completed in {result['metadata']['processing_time_seconds']:.1f} seconds")
```

Async with Progress:
```python
import asyncio
from groq_pdf_vision import extract_pdf_async

def progress_callback(message, current, total):
    percentage = (current / total) * 100
    print(f"Processing [{current}/{total}] ({percentage:.1f}%) {message}")

async def main():
    result, metadata = await extract_pdf_async(
        "large_document.pdf",
        progress_callback=progress_callback
    )
    print(f"Completed in {metadata['processing_time_seconds']:.1f} seconds")

asyncio.run(main())
```

```python
import os
from groq_pdf_vision import extract_pdf

# Set your API key
os.environ["GROQ_API_KEY"] = "your-api-key-here"

# Extract from PDF
result = extract_pdf("financial_report.pdf", save_results=True)

# Process results
for page in result["page_results"]:
    print(f"Page {page['page_number']}:")
    print(f"  Content: {len(page['content'])} characters")
    print(f"  Images: {len(page['image_descriptions'])} found")
    print(f"  Tables: {len(page['tables_data'])} found")
```

```python
import asyncio
from groq_pdf_vision import extract_pdf_async

async def process_document():
    result, metadata = await extract_pdf_async("large_document.pdf")
    print(f"Processed in {metadata['processing_time_seconds']:.1f} seconds")
    return result

result = asyncio.run(process_document())
```

```python
from groq_pdf_vision import extract_pdf
from groq_pdf_vision.schema_helpers import create_base_schema, add_custom_fields

# Use the default comprehensive schema (recommended for most cases)
result = extract_pdf("document.pdf")

# Create a custom schema by extending the base
base_schema = create_base_schema()
custom_fields = {
    "product_names": {
        "type": "array",
        "items": {"type": "string"},
        "description": "Product names mentioned"
    },
    "prices": {
        "type": "array",
        "items": {"type": "string"},
        "description": "Prices and costs mentioned"
    }
}
custom_schema = add_custom_fields(base_schema, custom_fields)
result = extract_pdf("catalog.pdf", schema=custom_schema)

# Or define a completely custom schema
minimal_schema = {
    "type": "object",
    "properties": {
        "page_number": {"type": "integer"},
        "summary": {"type": "string"},
        "key_points": {
            "type": "array",
            "items": {"type": "string"}
        }
    }
}
result = extract_pdf("document.pdf", schema=minimal_schema)
```

```bash
# Basic processing with default comprehensive schema
groq-pdf document.pdf --save

# Specific page range
groq-pdf document.pdf --start-page 1 --end-page 10

# Custom schema file
groq-pdf document.pdf --schema my_schema.json

# Inline JSON schema
groq-pdf document.pdf --schema '{"type":"object","properties":{"summary":{"type":"string"}}}'

# Get document info and processing estimates
groq-pdf document.pdf --info-only
```

Launch the Streamlit web interface:

```bash
streamlit run app.py
```

Then open http://localhost:8501 for drag-and-drop PDF processing.
The library features advanced table extraction that handles multiple data structures:
Array-based Tables (Traditional):
```json
{
  "headers": ["Column 1", "Column 2", "Column 3"],
  "rows": [
    ["Value 1", "Value 2", "Value 3"],
    ["Value 4", "Value 5", "Value 6"]
  ]
}
```

Object-based Tables (Financial/Complex):
```json
{
  "headers": ["Share capital", "Subscribed", "Callable", "Total"],
  "rows": [
    {
      "Subscribed share capital": "2,288,500",
      "Callable share capital": "(1,601,950)",
      "Total": "885,722"
    }
  ]
}
```

- Smart Structure Detection: Automatically handles both array- and object-based row formats
- Intelligent Column Mapping: Maps dictionary keys to table headers with fuzzy matching
- Data Quality Filtering: Removes placeholder and example data automatically
- Consistent Output: Converts all formats to standardized DataFrames for display
- Error Recovery: Graceful handling of malformed or incomplete table data
```python
from groq_pdf_vision import extract_pdf

# Extract tables from financial documents
result = extract_pdf("financial_report.pdf")

# Access table data
for table in result["accumulated_data"]["tables_data"]:
    print(f"Table: {table['table_title']}")
    print(f"Structure: {len(table['headers'])} columns × {len(table['rows'])} rows")

    # Both formats work seamlessly
    if table['headers'] and table['rows']:
        # Process table data regardless of internal structure
        headers = table['headers']
        rows = table['rows']  # Can be arrays or objects
```

The library provides flexible schema-building helpers to extract exactly what you need:
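Because `rows` may arrive as either lists or dicts, downstream code often normalizes them before display. Here is a minimal sketch of that normalization — my own helper, not part of the library, with `difflib` standing in for whatever fuzzy header matching the library actually uses:

```python
import difflib

def normalize_rows(headers, rows):
    """Convert array- or object-based rows to lists aligned with headers."""
    normalized = []
    for row in rows:
        if isinstance(row, dict):
            cells = []
            for header in headers:
                if header in row:
                    # Exact key match
                    cells.append(row[header])
                else:
                    # Fuzzy match, e.g. "Subscribed" vs "Subscribed share capital"
                    match = difflib.get_close_matches(header, row.keys(), n=1, cutoff=0.4)
                    cells.append(row[match[0]] if match else "")
            normalized.append(cells)
        else:
            # Array-based row: pad or truncate to the header length
            cells = list(row)[:len(headers)]
            cells += [""] * (len(headers) - len(cells))
            normalized.append(cells)
    return normalized

rows = [
    ["Value 1", "Value 2"],
    {"Subscribed share capital": "2,288,500", "Total": "885,722"},
]
print(normalize_rows(["Subscribed", "Total"], rows))
# → [['Value 1', 'Value 2'], ['2,288,500', '885,722']]
```

Both row formats come out as uniform lists, ready to feed into a DataFrame constructor.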
```python
from groq_pdf_vision import extract_pdf
from groq_pdf_vision.schema_helpers import create_base_schema, add_custom_fields

# Default schema works for most documents
result = extract_pdf("document.pdf")
```

```python
# Financial document extraction
base = create_base_schema()
financial_fields = {
    "financial_figures": {
        "type": "array",
        "items": {"type": "string"},
        "description": "Revenue, profit, costs, and other financial amounts"
    },
    "companies_mentioned": {
        "type": "array",
        "items": {"type": "string"},
        "description": "Company names and organizations"
    }
}
financial_schema = add_custom_fields(base, financial_fields)

# Research document extraction
research_fields = {
    "methodology": {"type": "string", "description": "Research methodology"},
    "findings": {
        "type": "array",
        "items": {"type": "string"},
        "description": "Key findings and results"
    }
}
research_schema = add_custom_fields(base, research_fields)

# Product catalog extraction
product_fields = {
    "product_names": {
        "type": "array",
        "items": {"type": "string"},
        "description": "Product names mentioned"
    },
    "specifications": {
        "type": "array",
        "items": {"type": "string"},
        "description": "Technical specifications"
    }
}
product_schema = add_custom_fields(base, product_fields)
```

```python
from groq_pdf_vision.schema_helpers import (
    create_base_schema,
    add_custom_fields,
    create_entity_extraction_fields,
    create_list_field,
    create_object_field
)

# Build a schema step by step
schema = create_base_schema(include_images=True, include_tables=False)

# Add entity extraction for any domain
entity_fields = create_entity_extraction_fields(["person", "company", "location"])
schema = add_custom_fields(schema, entity_fields)

# Add custom list fields
contact_fields = create_list_field("contact_emails", "Email addresses found")
schema = add_custom_fields(schema, contact_fields)
```

```python
from flask import Flask, request, jsonify
from groq_pdf_vision import extract_pdf_async
import asyncio

app = Flask(__name__)

@app.route('/process-pdf', methods=['POST'])
def process_pdf():
    file = request.files['file']
    filepath = f"uploads/{file.filename}"
    file.save(filepath)

    async def process():
        return await extract_pdf_async(filepath)

    result, metadata = asyncio.run(process())
    return jsonify({
        "pages": len(result["page_results"]),
        "processing_time": metadata["processing_time_seconds"],
        "data": result["accumulated_data"]
    })
```

```python
from fastapi import FastAPI, UploadFile, File
from groq_pdf_vision import extract_pdf_async
import tempfile

app = FastAPI()

@app.post("/process-pdf/")
async def process_pdf(file: UploadFile = File(...)):
    with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp:
        content = await file.read()
        tmp.write(content)
        tmp_path = tmp.name

    result, metadata = await extract_pdf_async(tmp_path)
    return {
        "filename": file.filename,
        "pages_processed": len(result["page_results"]),
        "processing_time": metadata["processing_time_seconds"]
    }
```

```python
import asyncio
from pathlib import Path
from groq_pdf_vision import extract_pdf_async

async def process_batch(input_dir, output_dir):
    pdf_files = list(Path(input_dir).glob("*.pdf"))
    for pdf_file in pdf_files:
        print(f"Processing {pdf_file.name}")
        result, metadata = await extract_pdf_async(
            str(pdf_file),
            save_results=True,
            output_filename=f"{output_dir}/{pdf_file.stem}_results.json"
        )
        print(f"  Completed: {len(result['page_results'])} pages in {metadata['processing_time_seconds']:.1f}s")

# Process all PDFs in a directory
asyncio.run(process_batch("./input", "./output"))
```

The example_docs/ directory contains a comprehensive example schema (example_custom_schema.json) that demonstrates various field types and extraction patterns.
Method 1: Load JSON Schema Directly
```python
import json
from groq_pdf_vision import extract_pdf

# Load the example schema
with open('example_docs/example_custom_schema.json', 'r') as f:
    schema = json.load(f)

# Use it for extraction
result = extract_pdf("example_docs/example.pdf", schema=schema)
```

Method 2: Build with Schema Helpers (Recommended)
```python
from groq_pdf_vision import extract_pdf
from groq_pdf_vision.schema_helpers import create_base_schema, add_custom_fields

# Start with the base schema
base = create_base_schema()

# Add your custom fields
custom_fields = {
    "document_type": {
        "type": "string",
        "description": "Type of document (financial, technical, academic, etc.)"
    },
    "key_findings": {
        "type": "array",
        "items": {"type": "string"},
        "description": "Most important findings or insights from this page"
    },
    "sentiment": {
        "type": "string",
        "description": "Overall sentiment of the page content"
    }
}

# Combine them
schema = add_custom_fields(base, custom_fields)
result = extract_pdf("example_docs/example.pdf", schema=schema)
```

- Start Simple - Begin with `create_base_schema()` and add only what you need
- Clear Descriptions - Good field descriptions help the AI understand what to extract
- Appropriate Types - Use arrays for lists, objects for structured data, strings for text
- Required Fields - Always include `page_number` and `content` as required
- Test Iteratively - Start with a few pages to test your schema before processing large documents
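Before running a custom schema against a large document, it can help to sanity-check it locally against these practices. The helper below is my own illustrative sketch, not a library function; it only inspects the schema dict and flags likely problems:

```python
def check_schema(schema):
    """Flag common problems in a page-level extraction schema."""
    problems = []
    if schema.get("type") != "object":
        problems.append("top-level type should be 'object'")
    props = schema.get("properties", {})
    # Recommended baseline fields from the best practices above
    for required in ("page_number", "content"):
        if required not in props:
            problems.append(f"missing recommended field: {required}")
    # Descriptions guide the model toward the right content
    for name, field in props.items():
        if "description" not in field and name not in ("page_number", "content"):
            problems.append(f"field '{name}' has no description to guide extraction")
    return problems

schema = {"type": "object", "properties": {"summary": {"type": "string"}}}
for problem in check_schema(schema):
    print(problem)
```

Running the checker before a paid extraction run costs nothing and catches most schema mistakes early.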
Extract data from a PDF file synchronously with comprehensive error handling and progress tracking.
Syntax:
```python
extract_pdf(
    pdf_path: str,
    schema: Optional[Dict] = None,
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
    save_results: bool = False,
    output_filename: Optional[str] = None,
    progress_callback: Optional[Callable] = None
) -> Dict[str, Any]
```

Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `pdf_path` | `str` | Yes | - | Path to the PDF file to process. Must be a valid file path. |
| `schema` | `Dict` | No | `None` | Custom JSON schema for extraction. If None, uses the comprehensive default schema. |
| `start_page` | `int` | No | `1` | Starting page number (1-indexed). Must be ≥ 1. |
| `end_page` | `int` | No | `None` | Ending page number (1-indexed). If None, processes to the end of the document. |
| `save_results` | `bool` | No | `False` | Whether to save results to a JSON file automatically. |
| `output_filename` | `str` | No | `None` | Custom output filename. If None and save_results=True, auto-generates a filename. |
| `progress_callback` | `Callable` | No | `None` | Callback function for progress updates: `callback(message, current, total)` |
Returns:
- Type: `Dict[str, Any]`
- Structure: Complete extraction results containing:
  - `page_results`: List of per-page extraction data
  - `accumulated_data`: Aggregated data across all pages
  - `processing_stats`: Performance and timing information
  - `metadata`: Processing configuration and timestamps
Example Usage:
```python
# Basic extraction
result = extract_pdf("document.pdf")

# With custom schema and page range
custom_schema = {"type": "object", "properties": {"summary": {"type": "string"}}}
result = extract_pdf(
    "document.pdf",
    schema=custom_schema,
    start_page=1,
    end_page=10,
    save_results=True
)

# With progress tracking
def progress_handler(message, current, total):
    print(f"[{current}/{total}] {message}")

result = extract_pdf("document.pdf", progress_callback=progress_handler)
```

Exceptions:

- `FileNotFoundError`: PDF file does not exist
- `ValueError`: Invalid page range or malformed schema
- `PermissionError`: Insufficient permissions to read PDF or write output
- `APIError`: Groq API authentication or rate limit issues
Extract data from a PDF file asynchronously for better performance with large documents.
Syntax:
```python
async extract_pdf_async(
    pdf_path: str,
    schema: Optional[Dict] = None,
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
    save_results: bool = False,
    output_filename: Optional[str] = None,
    progress_callback: Optional[Callable] = None
) -> Tuple[Dict[str, Any], Dict[str, Any]]
```

Parameters:
Same as extract_pdf() above.
Returns:
- Type: `Tuple[Dict[str, Any], Dict[str, Any]]`
- Structure:
  - `[0]`: Extraction results (same structure as `extract_pdf`)
  - `[1]`: Metadata dictionary with processing statistics and configuration
Example Usage:
```python
import asyncio
from groq_pdf_vision import extract_pdf_async

async def process_document():
    result, metadata = await extract_pdf_async("large_document.pdf")
    print(f"Processed {metadata['total_pages']} pages in {metadata['processing_time_seconds']:.1f}s")
    return result

# Run async processing
result = asyncio.run(process_document())
```

Exceptions:
Same as extract_pdf() above.
Create a comprehensive base schema suitable for most document types.
Syntax:
```python
create_base_schema(
    include_images: bool = True,
    include_tables: bool = True
) -> Dict[str, Any]
```

Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `include_images` | `bool` | `True` | Whether to include image analysis fields in the schema |
| `include_tables` | `bool` | `True` | Whether to include table extraction fields in the schema |
Returns:
- Type: `Dict[str, Any]`
- Description: JSON schema with standard fields for page content, metadata, and optional image/table analysis
Example Usage:
```python
# Full schema with all features
schema = create_base_schema()

# Text-only schema (faster processing)
text_schema = create_base_schema(include_images=False, include_tables=False)
```

Add custom extraction fields to an existing schema.
Syntax:
```python
add_custom_fields(
    base_schema: Dict[str, Any],
    custom_fields: Dict[str, Any]
) -> Dict[str, Any]
```

Parameters:
| Parameter | Type | Description |
|---|---|---|
| `base_schema` | `Dict` | Base schema to extend (typically from `create_base_schema()`) |
| `custom_fields` | `Dict` | Dictionary of custom field definitions following JSON Schema format |
Returns:
- Type: `Dict[str, Any]`
- Description: Extended schema with the custom fields merged into the base schema
Example Usage:
```python
base = create_base_schema()
custom_fields = {
    "financial_figures": {
        "type": "array",
        "items": {"type": "string"},
        "description": "Revenue, profit, and cost figures mentioned"
    },
    "risk_factors": {
        "type": "array",
        "items": {"type": "string"},
        "description": "Risk factors and warnings identified"
    }
}
schema = add_custom_fields(base, custom_fields)
```

Generate schema fields for extracting specific entity types.
Syntax:
```python
create_entity_extraction_fields(
    entity_types: List[str]
) -> Dict[str, Any]
```

Parameters:
| Parameter | Type | Description |
|---|---|---|
| `entity_types` | `List[str]` | List of entity types to extract. Supported: `["person", "company", "location", "date", "money", "product", "technology", "email"]` |
Returns:
- Type: `Dict[str, Any]`
- Description: Schema fields for extracting the specified entity types
Example Usage:
```python
# Extract people and companies
entity_fields = create_entity_extraction_fields(["person", "company", "location"])
schema = add_custom_fields(create_base_schema(), entity_fields)
```

Validate a JSON schema for compatibility with the extraction system.
Syntax:
```python
validate_schema(schema: Dict[str, Any]) -> bool
```

Parameters:
| Parameter | Type | Description |
|---|---|---|
| `schema` | `Dict` | JSON schema to validate |
Returns:
- Type: `bool`
- Description: `True` if the schema is valid; raises an exception if invalid
Exceptions:
- `ValueError`: Schema format is invalid or incompatible
Get metadata and information about a PDF file.
Syntax:
```python
get_pdf_info(pdf_path: str) -> Dict[str, Any]
```

Parameters:
| Parameter | Type | Description |
|---|---|---|
| `pdf_path` | `str` | Path to PDF file |
Returns:
- Type: `Dict[str, Any]`
- Structure:

```python
{
    "total_pages": int,
    "file_size_mb": float,
    "estimated_processing_time": float,
    "recommended_batch_size": int,
    "estimated_cost": float
}
```
Example Usage:
```python
info = get_pdf_info("large_document.pdf")
print(f"Document has {info['total_pages']} pages")
print(f"Estimated processing time: {info['estimated_processing_time']:.1f} minutes")
print(f"Estimated cost: ${info['estimated_cost']:.2f}")
```

All functions may raise the following common exceptions:
| Exception | Description | Common Causes |
|---|---|---|
| `FileNotFoundError` | PDF file not found | Invalid file path, file moved or deleted |
| `PermissionError` | Insufficient file permissions | Read-only files, permission restrictions |
| `ValueError` | Invalid parameter values | Negative page numbers, malformed schema |
| `APIError` | Groq API issues | Invalid API key, rate limits, service unavailable |
| `ProcessingError` | PDF processing failures | Corrupted PDF, unsupported format |
Example Error Handling:
```python
try:
    result = extract_pdf("document.pdf", start_page=1, end_page=10)
except FileNotFoundError:
    print("PDF file not found. Please check the file path.")
except ValueError as e:
    print(f"Invalid parameters: {e}")
except APIError as e:
    print(f"API error: {e}. Check your API key and rate limits.")
except Exception as e:
    print(f"Unexpected error: {e}")
```

```json
{
  "source_pdf": "document.pdf",
  "page_results": [
    {
      "page_number": 1,
      "content": "Extracted text content...",
      "image_descriptions": [
        {
          "image_type": "chart",
          "description": "Bar chart showing quarterly revenue",
          "location": "center",
          "content_relation": "Supports revenue discussion"
        }
      ],
      "tables_data": [
        {
          "table_content": "Q1: $1M, Q2: $1.2M...",
          "table_structure": "2x4 table with headers"
        }
      ]
    }
  ],
  "accumulated_data": {
    "total_content": "All extracted text...",
    "all_image_descriptions": [...],
    "all_tables_data": [...],
    "visual_summary": "Document contains 5 charts and 3 tables"
  },
  "processing_stats": {
    "total_pages": 10,
    "pages_with_images": 3,
    "pages_with_tables": 5,
    "processing_time_seconds": 45.2
  }
}
```

- Processing Speed: 1,300-2,000+ tokens/second (optimized batch processing)
- Intelligent Auto-Configuration: Batch sizes automatically scale based on document size
- Reliability: Retry mechanisms with graceful error handling across test scenarios
- Memory Usage: Optimized for large documents up to 200+ pages
- Enhanced Table Extraction: Supports both array-based and object-based table row structures
- Improved Consistency: Lower temperature settings (0.05) for more reliable extraction results
- Real-time Progress: Live progress tracking with terminal logging and ETA calculations
| Document Type | Pages | Processing Time | Throughput |
|---|---|---|---|
| Government Reports | 85-118 pages | 1.8-3.4 minutes | 1,700+ tok/sec |
| Financial Documents | 76-88 pages | 2.4-2.9 minutes | 1,400+ tok/sec |
| Technical Documents | 50+ pages | 1-2 minutes | 1,500+ tok/sec |
- Small PDFs (≤10 pages): batch_size=2, high quality processing
- Medium PDFs (11-50 pages): batch_size=3, balanced processing
- Large PDFs (51-200 pages): batch_size=4, efficient batch processing
- Enterprise PDFs (>200 pages): batch_size=5, maximum batch efficiency
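The tiers above amount to a simple lookup by page count. This sketch mirrors the documented thresholds; the function itself is illustrative, not the library's actual auto-configuration code:

```python
def auto_batch_size(total_pages):
    """Pick a batch size from the documented page-count tiers."""
    if total_pages <= 10:
        return 2   # small PDFs: high quality processing
    if total_pages <= 50:
        return 3   # medium PDFs: balanced processing
    if total_pages <= 200:
        return 4   # large PDFs: efficient batch processing
    return 5       # enterprise PDFs: maximum batch efficiency

print(auto_batch_size(118))  # → 4
```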
- 50% fewer API calls through intelligent batching
- 38% faster processing with optimized auto-configuration
- Real-time progress tracking with ETA calculations
- Automatic retry logic with exponential backoff
- Enhanced table extraction supporting multiple data structures
- Improved data filtering to remove placeholder content
- Lower temperature settings for more consistent results
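The ETA shown during progress tracking can be approximated from elapsed time and pages completed. A hedged sketch of that arithmetic (a simple linear extrapolation, not the library's internal code):

```python
def estimate_remaining_seconds(elapsed_seconds, pages_done, pages_total):
    """Linear ETA: assume remaining pages take as long per page as completed ones."""
    if pages_done == 0:
        return None  # No data yet to extrapolate from
    per_page = elapsed_seconds / pages_done
    return per_page * (pages_total - pages_done)

# e.g. 30s for the first 10 of 85 pages -> ~225s remaining
print(estimate_remaining_seconds(30.0, 10, 85))  # → 225.0
```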
The repository includes a comprehensive test suite with both quick validation and full document stress testing.
Quick Test (Recommended):
```bash
cd tests
python run_all_tests.py
```

Expected Output:
```
Running Basic Tests...
README Examples: All Python SDK examples working
Integration Tests: Flask/FastAPI patterns verified
Schema Tests: Custom schema functionality working
CLI Tests: All command-line options working

Results: 4/4 basic test suites passed (completed in ~30 seconds)

Would you like to run full document tests? (y/N): N
Skipping full document tests (saves 15-20 minutes and ~$1.00 in API costs)

All basic tests passed! The library is working correctly.
```
- README Examples (`test_readme_examples.py`): All Python SDK examples
- Integration Tests (`test_flask_integration.py`): Flask/FastAPI patterns
- Schema Tests (`test_example_schema.py`): Custom schema functionality
- CLI Tests: All command-line options and parameters
- Vision 2030 (85 pages): Saudi government document processing
- Example Financial (76 pages): Financial document with heavy table content
- Americas Children (118 pages): US government statistical report
- Fed Economic Wellbeing (88 pages): Federal Reserve economic research
Our test suite includes comprehensive performance validation:
| Document | Pages | Time | Tokens | Speed | Cost Est. |
|---|---|---|---|---|---|
| Vision 2030 | 85 | 1.9 min | 230K | 2,033 tok/sec | ~$0.46 |
| Example Financial | 76 | 3.3 min | 187K | 1,354 tok/sec | ~$0.37 |
| Americas Children | 118 | 3.3 min | 352K | 1,761 tok/sec | ~$0.70 |
| Fed Economic | 88 | 3.0 min | 259K | 1,433 tok/sec | ~$0.52 |
Run specific test categories:
```bash
# Basic functionality tests
python tests/test_readme_examples.py
python tests/test_flask_integration.py
python tests/test_example_schema.py

# Full document stress tests (API costs apply)
python tests/test_vision2030_full_async.py
python tests/test_americas_children_full_async.py
python tests/test_fed_economic_wellbeing_full_async.py
```

The test suite verifies:
- All README Python SDK examples work correctly
- All CLI commands and options function properly
- Schema creation, validation, and custom field usage
- Synchronous and asynchronous processing modes
- Progress callbacks and real-time updates
- Integration patterns (Flask, FastAPI, batch processing)
- File I/O, path handling, and error conditions
- Performance benchmarks and auto-configuration
- Large document processing (up to 118 pages)
- Enhanced table extraction with multiple data structures
- Streamlit UI functionality without nested expander errors
- Data quality filtering and placeholder content removal
- API Key: `export GROQ_API_KEY="your-key-here"`
- Package Installation: `pip install -e .`
- Virtual Environment (recommended)
The test runner includes built-in cost awareness:
- Basic Tests: Free validation using small document excerpts
- Full Tests: User confirmation with clear cost warnings ($0.20-0.35 per test)
- Default Behavior: Skips expensive tests unless explicitly confirmed
- Graceful Handling: Missing test files don't break the test suite
For detailed performance analysis, auto-configuration results, and quality metrics, see tests/README.md.
Major Enhancements:
- Enhanced Table Extraction: Now supports both array-based and object-based table row structures
- Improved Data Quality: Better filtering of placeholder and example data
- Lower Temperature: Reduced to 0.05 for more consistent extraction results
- Real-time Progress: Enhanced progress tracking with terminal logging and ETA calculations
Bug Fixes:
- Fixed Streamlit UI: Resolved nested expander errors in table display
- Improved Error Handling: Better graceful recovery from processing failures
- Enhanced Data Filtering: More effective removal of placeholder content
Performance Improvements:
- Faster Processing: Optimized batch processing with intelligent auto-configuration
- Better Reliability: Enhanced retry mechanisms with exponential backoff
- Comprehensive Testing: All test suites pass including full document processing
Test Suite Validation:
- All 8 test categories pass (README examples, integrations, schemas, CLI, full documents)
- 367 total pages processed across 4 different document types
- Performance benchmarks updated with latest results
- Cost estimates and processing speeds validated
