Conversation
966859f to
d4d5549
Compare
core/doc_parser.py
Outdated
| if image: | ||
| # Get bounding box if available | ||
| bbox = None | ||
| page_height = 792.0 # Default letter size |
There was a problem hiding this comment.
I would suggest calling this DEFAULT_PAGE_HEIGHT and DEFAULT_PAGE_WIDTH and then when used below it is more logical (instead of replacing it with the if/else)
core/doc_parser.py
Outdated
| # Store image binary data | ||
| with open(image_path, 'rb') as fp: | ||
| image_binary = fp.read() | ||
| image_id = f"docling_stitched_pages_{'_'.join(map(str, pages))}" |
There was a problem hiding this comment.
Should we have a different image_id here? (not to mention docling - just call it by the original caption without the parts for example)?
| logger.info("Failed to retrieve image") | ||
|
|
||
| # Apply multi-page image stitching if enabled | ||
| if self.stitch_config.enabled and image_fragments: |
There was a problem hiding this comment.
This code is now becoming quite complex
- Does it make sense to refactor the stitching stuff into a helper function
- Why don't we check for stitch_config.enabled at the start and not look for image fragments at all if it's set to False?
core/doc_parser.py
Outdated
|
|
||
| # Find the chunked element with closest position for context extraction | ||
| best_context_idx = 0 | ||
| for chunk_idx, chunk_elem in enumerate(elements): |
There was a problem hiding this comment.
Isn't this code a bit repetitive - it's also the same above
ofermend
left a comment
There was a problem hiding this comment.
This is a great start Adeel and I'm glad it's also working properly. But please see my comments - it seems quite a lot of duplicate code and I'm concerned about maintainability - can we simplify and remove repetition while maintaining the functionality?
|
@ofermend please review |
|
|
||
| ### Filters | ||
|
|
||
| - **Page limit**: Maximum 2 consecutive pages stitched together (configurable) |
There was a problem hiding this comment.
Why is the default "2" - maybe make the max "3" if it's not too expensive in performance?
| ### Overlap Detection | ||
|
|
||
| When stitching images, the system: | ||
| 1. Searches for overlapping pixels between the bottom of the first image and top of the second |
There was a problem hiding this comment.
Commenting here but this may require code changes elsewhere: why does it only check for overlapping pixels between the bottom of the first and top of 2nd? I've seen examples where the stiching is from right of first image to left of second image. or other stitching ways. Can we support all modes?
| if doc.core_properties.title: | ||
| title = doc.core_properties.title.strip() | ||
| if title: # Only return non-empty titles | ||
| logger.info(f"Extracted DOCX document title: '{title}' from file {filename}") |
There was a problem hiding this comment.
Why are these log msgs removed? Maybe move to DEBUG level instead?
| if image_summary: | ||
| metadata = { | ||
| 'element_type': 'image', | ||
| 'pages': pages, |
There was a problem hiding this comment.
If we change to "pages" instead of "page" for the metadata, that might break any downstream processing that depends on "page". Why not pick the page of the first fragment as the page for the whole image, and then you can also add "pages" in addition to be the set of all pages?
| logger.error(f"Error parsing HTML table: {err}. Skipping...") | ||
| continue | ||
|
|
||
| def _calculate_element_position(self, element: Any, index: int = 0) -> Tuple[int, int]: |
There was a problem hiding this comment.
Can u pls explain why these changes are needed in unstructured? I thought this change should only impact Docling - is this additional changes beyond the multi-image? If so - can u pls explain what they do?
There was a problem hiding this comment.
These changes are not related to the image stitching. I did refactor the code to remove the duplication and redundancy on the multiple places.
What it does:
- Extracts page number from element metadata:
page_num = getattr(element.metadata, 'page_number', 1) or 1
- Gets the page_number attribute from element metadata
- Defaults to 1 if not present or if value is None - Calculates position using a standard formula:
position = page_num * 1000 + index
- Formula explanation: page_num * 1000 creates "buckets" of 1000 positions per page
- Adding index places the element within its page's bucket
- Example: Element at index 5 on page 3 → position = 3000 + 5 = 3005
- This ensures elements are sorted by page first, then by order within the page - Returns both values as a tuple:
return page_num, position
- Convenient for callers who need both values
Usage examples:
Get base position for a page (index defaults to 0)
page_num, base_position = self._calculate_element_position(element)
base_position = page_num * 1000
Get specific position for an element within a page
page_num, position = self._calculate_element_position(element, idx=5)
position = page_num * 1000 + 5
| @@ -0,0 +1,320 @@ | |||
| """ | |||
There was a problem hiding this comment.
Can we please add more unit tests esp for this new model?
Adds automatic detection and stitching of images/diagrams that span multiple pages in PDF documents for docling only.
page with similar dimensions
thing on the next page (excluding headers/footers)
for full-page diagrams