This project uses Docling to parse documents for LLMWare. The overall objective is to bring Docling's accuracy to the llmware RAG framework for parsing tasks.
- Processes PDF documents using the Docling library
- Extracts text, tables, and images with proper formatting
- Generates hierarchical breadcrumb paths based on section headers
- Saves images as external files instead of embedding them in JSON
- Filters out furniture elements (headers, footers) from context snippets
- Outputs in various formats: JSON, Markdown, HTML, CSV
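To give an idea of how the hierarchical breadcrumb paths mentioned above can be derived from section headers, here is a minimal sketch; the (level, text) input shape is an assumption for illustration, not the project's actual data model:

```python
# Minimal sketch of hierarchical breadcrumb building from section headers.
# The (level, text) input shape is an assumption, not the project's data model.
def build_breadcrumbs(headers):
    """headers: list of (level, text) tuples in document order."""
    stack = []        # header texts on the path from the document root
    breadcrumbs = []
    for level, text in headers:
        del stack[level - 1:]   # drop headers at the same or deeper level
        stack.append(text)
        breadcrumbs.append(" > ".join(stack))
    return breadcrumbs

print(build_breadcrumbs([(1, "Introduction"), (2, "Background"), (2, "Scope"), (1, "Methods")]))
# ['Introduction', 'Introduction > Background', 'Introduction > Scope', 'Methods']
```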
This version includes important fixes for three major issues:
- External Image Storage: Images are now saved as external files instead of being embedded as base64 data in the JSON. This significantly reduces the size of the output JSON file and improves performance (see the sketch after this list).
- Complete Element Information: All document elements (text, tables, images) are now properly identified and included in the element map, ensuring no content is missed.
- Improved Breadcrumbs and Context:
  - Breadcrumbs now correctly represent the document's hierarchical structure based on section headers
  - Furniture elements (headers, footers, page numbers) are filtered out from context snippets to provide cleaner, more relevant context
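As a rough illustration of the external image storage fix, the sketch below writes embedded base64 image data to disk and replaces it with file references; the JSON layout (a top-level "pictures" list with a base64 "data" field) and the helper name are assumptions, not the project's actual API.

```python
# Rough sketch of externalizing embedded images; the "pictures"/"data"/"image_ref"
# keys are assumptions for illustration, not the project's actual JSON schema.
import base64
import json
from pathlib import Path

def externalize_images(document, images_dir):
    images_dir.mkdir(parents=True, exist_ok=True)
    for i, picture in enumerate(document.get("pictures", [])):
        data = picture.pop("data", None)          # hypothetical base64 payload
        if not data:
            continue
        image_path = images_dir / f"image_{i}.png"
        image_path.write_bytes(base64.b64decode(data))
        picture["image_ref"] = str(image_path)    # external reference instead of inline data
    return document

if __name__ == "__main__":
    doc = json.loads(Path("output/docling_document.json").read_text())
    fixed = externalize_images(doc, Path("output/images"))
    Path("output/fixed_document.json").write_text(json.dumps(fixed, indent=2))
```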
- Python 3.7+
- Docling library (included)
- Clone the repository:
  git clone https://github.com/yourusername/docling_parse.git
  cd docling_parse
- Install requirements:
  pip install -r requirements.txt
Use the run_parser.py script for the simplest way to run the parser:
python run_parser.py input.pdf output_dir --format json

- pdf_path: Path to the input PDF file (required)
- output_dir: Directory for output files (required)
- --format: Output format (choices: json, md, html, csv; default: json)
- --log_level: Logging verbosity level (choices: DEBUG, INFO, WARNING, ERROR, CRITICAL; default: INFO)
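If you want to drive the same CLI from another script, for example to batch-process a folder of PDFs, a minimal wrapper might look like this (the folder names are placeholders):

```python
# Batch-run run_parser.py over a folder of PDFs via its command line.
# The "pdfs" and "parsed" folder names are placeholders.
import subprocess
import sys
from pathlib import Path

input_dir = Path("pdfs")
output_root = Path("parsed")

for pdf in sorted(input_dir.glob("*.pdf")):
    out_dir = output_root / pdf.stem
    subprocess.run(
        [sys.executable, "run_parser.py", str(pdf), str(out_dir), "--format", "json"],
        check=True,
    )
```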
For more advanced options, you can use the main parser directly:
python parse_main.py --pdf_path input.pdf --output_dir output --output_format json

Additional options:
- --include_metadata/--no_metadata: Include/exclude metadata in output
- --include_page_breaks/--no_page_breaks: Include/exclude page break markers
- --include_captions/--no_captions: Include/exclude captions for tables and images
- --image_base_url: Base URL for image links in output
- --config_file: Path to additional configuration file
The parser generates several output files:
- docling_document.json: Raw output from the Docling library
- fixed_document.json: The document with metadata fixes applied (external image references, proper breadcrumbs, filtered context)
- document.json (or other format based on --format): The final formatted output
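One quick way to sanity-check the fixed output is to load it and print a few fields; the keys used here (an element map with breadcrumb and image_ref entries) are assumptions about the JSON layout, not a documented schema:

```python
# Peek at fixed_document.json; the key names below are assumptions for illustration.
import json
from pathlib import Path

doc = json.loads(Path("output/fixed_document.json").read_text())

for element_id, element in doc.get("element_map", {}).items():
    kind = element.get("type", "unknown")
    breadcrumb = element.get("breadcrumb", "")
    line = f"{element_id}: {kind}  [{breadcrumb}]"
    if element.get("image_ref"):
        line += f"  -> {element['image_ref']}"
    print(line)
```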
You can set the following environment variables in a .env file:
- DOCLING_PDF_PATH: Default path to input PDF file
- DOCLING_OUTPUT_DIR: Default directory for output files
- DOCLING_LOG_LEVEL: Default logging verbosity
- DOCLING_CONFIG_FILE: Default path to a configuration file
- DOCLING_OUTPUT_FORMAT: Default output format
- DOCLING_IMAGE_BASE_URL: Default base URL for image links
- DOCLING_INCLUDE_METADATA: Whether to include metadata in output
- DOCLING_INCLUDE_PAGE_BREAKS: Whether to include page break markers
- DOCLING_INCLUDE_CAPTIONS: Whether to include captions for tables and images
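A minimal sketch of how these defaults could be read from a .env file, assuming python-dotenv is available (pip install python-dotenv); the fallback values are placeholders:

```python
# Read the defaults from a .env file; assumes python-dotenv is installed.
# Fallback values are placeholders, not necessarily the parser's real defaults.
import os
from dotenv import load_dotenv

load_dotenv()  # loads .env from the current directory if present

pdf_path = os.getenv("DOCLING_PDF_PATH", "input.pdf")
output_dir = os.getenv("DOCLING_OUTPUT_DIR", "output")
output_format = os.getenv("DOCLING_OUTPUT_FORMAT", "json")
log_level = os.getenv("DOCLING_LOG_LEVEL", "INFO")
include_metadata = os.getenv("DOCLING_INCLUDE_METADATA", "true").lower() == "true"

print(pdf_path, output_dir, output_format, log_level, include_metadata)
```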
Run the tests to verify that the parser is working correctly:
cd tests
python -m test_integration

This will run integration tests that verify:
- Images are properly saved as external files
- All element types are correctly identified
- Breadcrumbs are generated properly and furniture is filtered from context
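A stripped-down version of this kind of check might look like the following; the output paths and JSON keys are assumptions, not the project's actual test code:

```python
# Simplified integration-style check; output paths and JSON keys are assumptions.
import json
from pathlib import Path

def check_output(output_dir):
    doc = json.loads((output_dir / "fixed_document.json").read_text())

    for element in doc.get("element_map", {}).values():
        # 1. Images should be external files, not inline base64 data.
        assert "data" not in element, "found embedded image data"
        ref = element.get("image_ref")
        if ref:
            assert (output_dir / ref).exists() or Path(ref).exists(), f"missing image {ref}"

        # 2. Content elements should carry a breadcrumb; furniture is excluded.
        if element.get("type") != "furniture":
            assert "breadcrumb" in element, "content element without breadcrumb"

if __name__ == "__main__":
    check_output(Path("output"))
    print("all checks passed")
```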
- parse_main.py: Main entry point for the application
- src/: Directory containing the parser modules
  - json_metadata_fixer.py: Module to fix metadata issues in the parsed document
  - content_extractor.py: Extract content from different element types
  - metadata_extractor.py: Extract and format metadata
  - element_map_builder.py: Build a map of elements from the document
  - pdf_image_extractor.py: Extract images from PDF documents
  - parse_helper.py: Helper functions for parsing
  - output_formatter.py: Format output in different formats
- tests/: Directory containing tests
- run_parser.py: Simple script to run the parser
Contributions are welcome! Please feel free to submit a Pull Request.