adding multipage images support by adeelehsan · Pull Request #291 · vectara/vectara-ingest

adeelehsan · 2025-10-17T10:57:19Z

Adds automatic detection and stitching of images and diagrams that span multiple pages in PDF documents. This applies to the docling parser only.

- Caption Intelligence: Looks for a figure caption such as "Figure 3-1" followed by another image on the next page with similar dimensions
- Positional Awareness: Detects when an image is the last thing on one page and another image is the first thing on the next page (excluding headers/footers)
- Visual Matching: Checks if images have similar widths and horizontal alignment, with special handling for full-page diagrams
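The visual-matching heuristic described above could look roughly like the sketch below. It is only an illustration: the `Box` type, the `looks_like_continuation` name, and the 5% tolerance are assumptions made here, not values taken from the PR.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Horizontal extent of an image on its page, in PDF points (hypothetical type)."""
    left: float
    right: float

    @property
    def width(self) -> float:
        return self.right - self.left


def looks_like_continuation(a: Box, b: Box, tolerance: float = 0.05) -> bool:
    """Return True when two images have similar widths and left edges.

    The tolerance is relative to the wider of the two images; 5% is an
    assumed value, not one stated in the PR.
    """
    widest = max(a.width, b.width)
    if widest <= 0:
        return False
    similar_width = abs(a.width - b.width) <= tolerance * widest
    aligned = abs(a.left - b.left) <= tolerance * widest
    return similar_width and aligned
```

A relative tolerance keeps the check independent of page size; full-page diagrams would need the separate handling the description mentions.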

ofermend · 2025-10-17T16:36:48Z

core/doc_parser.py

+                if image:
+                    # Get bounding box if available
+                    bbox = None
+                    page_height = 792.0  # Default letter size


I would suggest calling this DEFAULT_PAGE_HEIGHT and DEFAULT_PAGE_WIDTH and then when used below it is more logical (instead of replacing it with the if/else)
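One way to read this suggestion is as a pair of module-level constants plus a small fallback helper. The 792.0 height comes from the diff above; the 612.0 width is the matching US Letter width in points (8.5 in × 72), and `resolve_page_size` is a hypothetical helper name used only for this sketch.

```python
from typing import Optional, Tuple

# US Letter in PDF points (72 points per inch): 8.5 in x 11 in.
DEFAULT_PAGE_WIDTH = 612.0
DEFAULT_PAGE_HEIGHT = 792.0


def resolve_page_size(page_size: Optional[Tuple[float, float]]) -> Tuple[float, float]:
    """Return (width, height), falling back to the defaults when a value is missing."""
    if page_size is None:
        return DEFAULT_PAGE_WIDTH, DEFAULT_PAGE_HEIGHT
    width, height = page_size
    # A zero or None dimension is treated as "unknown" and replaced by the default.
    return (width or DEFAULT_PAGE_WIDTH, height or DEFAULT_PAGE_HEIGHT)
```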

ofermend · 2025-10-17T16:44:58Z

core/doc_parser.py

+                # Store image binary data
+                with open(image_path, 'rb') as fp:
+                    image_binary = fp.read()
+                image_id = f"docling_stitched_pages_{'_'.join(map(str, pages))}"


Should we have a different image_id here, one that doesn't mention docling? For example, just use the original caption without the parts.

ofermend · 2025-10-17T16:46:41Z

core/doc_parser.py

+                    logger.info("Failed to retrieve image")
+
+        # Apply multi-page image stitching if enabled
+        if self.stitch_config.enabled and image_fragments:


This code is now becoming quite complex.

Does it make sense to refactor the stitching logic into a helper function?

Why don't we check stitch_config.enabled at the start and skip looking for image fragments entirely when it's set to False?
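A minimal sketch of the early-exit shape being asked for, assuming the fragment search and the stitching step can be passed in as callables. `StitchConfig`, `maybe_stitch`, `find_fragments`, and `stitch` are stand-in names for illustration, not the PR's actual API.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class StitchConfig:
    """Hypothetical stand-in for the PR's stitch_config object."""
    enabled: bool = False


def maybe_stitch(
    images: List[Any],
    config: StitchConfig,
    find_fragments: Callable[[List[Any]], List[Any]],
    stitch: Callable[[List[Any]], List[Any]],
) -> List[Any]:
    """Run the fragment search and stitching only when the feature is enabled."""
    if not config.enabled:
        # Feature is off: do no fragment detection at all.
        return list(images)
    fragments = find_fragments(images)
    if not fragments:
        return list(images)
    return stitch(fragments)
```

With the guard first, the disabled path never touches the fragment-detection code, which also keeps the main parsing loop shorter.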

ofermend · 2025-10-17T16:49:46Z

core/doc_parser.py

+
+                    # Find the chunked element with closest position for context extraction
+                    best_context_idx = 0
+                    for chunk_idx, chunk_elem in enumerate(elements):


Isn't this code a bit repetitive? It's the same as the code above.

ofermend

This is a great start Adeel, and I'm glad it's also working properly. But please see my comments: there seems to be quite a lot of duplicate code and I'm concerned about maintainability. Can we simplify and remove repetition while maintaining the functionality?

adeelehsan · 2025-10-19T05:20:10Z

@ofermend please review

ofermend · 2025-10-19T15:32:58Z

docs/multipage-image-stitching.md

+
+### Filters
+
+- **Page limit**: Maximum 2 consecutive pages stitched together (configurable)


Why is the default "2" - maybe make the max "3" if it's not too expensive in performance?

ofermend · 2025-10-19T15:34:48Z

docs/multipage-image-stitching.md

+### Overlap Detection
+
+When stitching images, the system:
+1. Searches for overlapping pixels between the bottom of the first image and top of the second


Commenting here, but this may require code changes elsewhere: why does it only check for overlapping pixels between the bottom of the first image and the top of the second? I've seen examples where the stitching goes from the right of the first image to the left of the second, or in other directions. Can we support all modes?
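To illustrate how one overlap search could serve both directions, here is a small pure-Python sketch that treats an image as a list of pixel rows. It assumes exact pixel equality, which real scans would not satisfy without a tolerance, and it is not how the PR's `image_stitcher.py` is known to work; all function names are hypothetical.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale pixel values


def vertical_overlap(top: Image, bottom: Image, max_rows: int) -> int:
    """Largest n <= max_rows where the last n rows of `top` equal the first n rows of `bottom`."""
    limit = min(max_rows, len(top), len(bottom))
    for n in range(limit, 0, -1):
        if top[-n:] == bottom[:n]:
            return n
    return 0


def transpose(image: Image) -> Image:
    """Swap rows and columns so that columns can be compared as rows."""
    return [list(column) for column in zip(*image)]


def horizontal_overlap(left: Image, right: Image, max_cols: int) -> int:
    """Same search, applied to the right edge of `left` and the left edge of `right`."""
    return vertical_overlap(transpose(left), transpose(right), max_cols)
```

Reusing the vertical search on transposed images keeps the two modes consistent; a caller could try both and stitch along whichever axis yields an overlap.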

ofermend · 2025-10-19T17:38:43Z

core/doc_parser.py

        if doc.core_properties.title:
            title = doc.core_properties.title.strip()
            if title:  # Only return non-empty titles
-                logger.info(f"Extracted DOCX document title: '{title}' from file {filename}")


Why are these log msgs removed? Maybe move to DEBUG level instead?

ofermend · 2025-10-19T17:42:58Z

core/doc_parser.py

+            if image_summary:
+                metadata = {
+                    'element_type': 'image',
+                    'pages': pages,


If we change to "pages" instead of "page" for the metadata, that might break any downstream processing that depends on "page". Why not pick the page of the first fragment as the page for the whole image, and then also add "pages" as the set of all pages?
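A sketch of the backward-compatible metadata this comment proposes, keeping `page` and adding `pages`. The helper name `build_image_metadata` is hypothetical, and treating the lowest page number as "the first fragment" is an assumption.

```python
from typing import Any, Dict, Iterable


def build_image_metadata(pages: Iterable[int]) -> Dict[str, Any]:
    """Build metadata for a stitched image that spans one or more pages."""
    page_list = sorted(set(pages))
    if not page_list:
        raise ValueError("at least one page is required")
    return {
        "element_type": "image",
        # Existing consumers keep reading 'page' (assumed: the first fragment's page).
        "page": page_list[0],
        # New consumers can read the full span from 'pages'.
        "pages": page_list,
    }
```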

ofermend · 2025-10-19T17:45:51Z

core/doc_parser.py

                    logger.error(f"Error parsing HTML table: {err}. Skipping...")
                    continue

+    def _calculate_element_position(self, element: Any, index: int = 0) -> Tuple[int, int]:


Can you please explain why these changes are needed in unstructured? I thought this change should only impact Docling. Are these additional changes beyond the multi-page image stitching? If so, can you please explain what they do?

These changes are not related to the image stitching. I refactored the code to remove duplication and redundancy in multiple places.

What it does:

Extracts page number from element metadata:
page_num = getattr(element.metadata, 'page_number', 1) or 1
- Gets the page_number attribute from element metadata
- Defaults to 1 if not present or if value is None

Calculates position using a standard formula:
position = page_num * 1000 + index
- Formula explanation: page_num * 1000 creates "buckets" of 1000 positions per page
- Adding index places the element within its page's bucket
- Example: Element at index 5 on page 3 → position = 3000 + 5 = 3005
- This ensures elements are sorted by page first, then by order within the page

Returns both values as a tuple:
return page_num, position
- Convenient for callers who need both values

Usage examples:

# Get base position for a page (index defaults to 0)
page_num, base_position = self._calculate_element_position(element)
# base_position = page_num * 1000

# Get specific position for an element within a page
page_num, position = self._calculate_element_position(element, index=5)
# position = page_num * 1000 + 5
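The explanation above can be condensed into a standalone sketch. It is written as a free function rather than the method in `core/doc_parser.py`, and `POSITIONS_PER_PAGE` is a name introduced here for the constant 1000 in the formula.

```python
from types import SimpleNamespace
from typing import Any, Tuple

POSITIONS_PER_PAGE = 1000  # bucket size from the formula above


def calculate_element_position(element: Any, index: int = 0) -> Tuple[int, int]:
    """Return (page_num, position) so that elements sort by page, then by index."""
    # Missing or None page numbers fall back to page 1.
    page_num = getattr(element.metadata, "page_number", 1) or 1
    return page_num, page_num * POSITIONS_PER_PAGE + index


# Stand-in element with only the attribute the function reads.
element = SimpleNamespace(metadata=SimpleNamespace(page_number=3))
print(calculate_element_position(element, index=5))  # (3, 3005)
```

One property of this formula worth keeping in mind: the ordering only holds while a page has fewer than 1000 elements, because index 1000 on page 3 yields the same position (4000) as index 0 on page 4.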

ofermend · 2025-10-19T17:47:08Z

core/image_stitcher.py

@@ -0,0 +1,320 @@
+"""


Can we please add more unit tests, especially for this new module?

adeelehsan requested review from aamirbutt and ofermend October 17, 2025 11:05

adding multipage images support

d4d5549

adeelehsan force-pushed the multipage-images branch from 966859f to d4d5549 Compare October 17, 2025 11:08

ofermend reviewed Oct 17, 2025

ofermend requested changes Oct 17, 2025

adeelehsan added 3 commits October 18, 2025 10:48

updating

657170c

updating

dd4267c

refactoring the code

b86ae19

ofermend reviewed Oct 19, 2025

adeelehsan requested a review from waqqas-vectara October 20, 2025 19:04

