Skip to content

OpenEnergyPlatform/municipal-heat-planning-pdf-processing

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

OpenEnergyPlatform

Open Energy Family – municipal-heat-planning-pdf-processing

Tools and scripts developed to support the MHPO development by automating data extraction and processing of municipal heat planning documents are stored here.

Key Dependencies and Models

Inference Runtime

Tool Purpose
Ollama Local model serving and inference runtime for all LLM and vision-language model interactions

Models

Model Provider Parameters Usage in Pipeline
PP-DocLayoutV3 PaddlePaddle Document layout detection (Stage 2): identifies tables, figures, titles, headers, footers, and other structural elements in PDF pages
gpt-oss:120b OpenAI (open-weight) 120B LLM-based section refinement (Stage 4): cleans extraction artefacts, normalizes titles and captions, removes directory pages, converts bibliographies to BibTeX
Qwen3-VL Alibaba / Qwen 32B (Q8) Vision-language model for image processing: converts table images to structured Markdown, generates detailed textual descriptions of figures, and produces captions where missing

Python Libraries

Library Purpose
PyMuPDF (fitz) PDF text extraction (rawdict mode) and page rendering to PNG
Pillow Image handling, cropping detected layout regions
NumPy Array operations for image crop processing
Transformers Loading and running the PP-DocLayoutV3 layout detection model
ollama (Python) Python client for Ollama API (chat, model management)
httpx HTTP client with timeout control, used internally by the Ollama Python library
spaCy NLP processing for text analysis and entity recognition

Collaboration

Everyone is invited to develop this repository with good intentions.

About

Tools and scripts developped to automate data extraction from municipal heat planning documents and processing are stored here.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages