Skip to content

adding multipage images support#291

Open
adeelehsan wants to merge 4 commits intomainfrom
multipage-images
Open

adding multipage images support#291
adeelehsan wants to merge 4 commits intomainfrom
multipage-images

Conversation

@adeelehsan
Copy link
Contributor

@adeelehsan adeelehsan commented Oct 17, 2025

Adds automatic detection and stitching of images/diagrams that span multiple pages in PDF documents for docling only.

  1. Caption Intelligence: Looks for figure captions like "Figure 3-1" followed by another image on the next
    page with similar dimensions
  2. Positional Awareness: Detects when an image is the last thing on one page and another image is the first
    thing on the next page (excluding headers/footers)
  3. Visual Matching: Checks if images have similar widths and horizontal alignment, with special handling
    for full-page diagrams

if image:
# Get bounding box if available
bbox = None
page_height = 792.0 # Default letter size
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest calling this DEFAULT_PAGE_HEIGHT and DEFAULT_PAGE_WIDTH and then when used below it is more logical (instead of replacing it with the if/else)

# Store image binary data
with open(image_path, 'rb') as fp:
image_binary = fp.read()
image_id = f"docling_stitched_pages_{'_'.join(map(str, pages))}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a different image_id here? (not to mention docling - just call it by the original caption without the parts for example)?

logger.info("Failed to retrieve image")

# Apply multi-page image stitching if enabled
if self.stitch_config.enabled and image_fragments:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is now becoming quite complex

  1. Does it make sense to refactor the stitching stuff into a helper function
  2. Why don't we check for stitch_config.enabled at the start and not look for image fragments at all if it's set to False?


# Find the chunked element with closest position for context extraction
best_context_idx = 0
for chunk_idx, chunk_elem in enumerate(elements):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this code a bit repetitive - it's also the same above

Copy link
Collaborator

@ofermend ofermend left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great start Adeel and I'm glad it's also working properly. But please see my comments - it seems quite a lot of duplicate code and I'm concerned about maintainability - can we simplify and remove repetition while maintaining the functionality?

@adeelehsan
Copy link
Contributor Author

@ofermend please review


### Filters

- **Page limit**: Maximum 2 consecutive pages stitched together (configurable)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the default "2" - maybe make the max "3" if it's not too expensive in performance?

### Overlap Detection

When stitching images, the system:
1. Searches for overlapping pixels between the bottom of the first image and top of the second
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commenting here but this may require code changes elsewhere: why does it only check for overlapping pixels between the bottom of the first and top of 2nd? I've seen examples where the stiching is from right of first image to left of second image. or other stitching ways. Can we support all modes?

if doc.core_properties.title:
title = doc.core_properties.title.strip()
if title: # Only return non-empty titles
logger.info(f"Extracted DOCX document title: '{title}' from file {filename}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these log msgs removed? Maybe move to DEBUG level instead?

if image_summary:
metadata = {
'element_type': 'image',
'pages': pages,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we change to "pages" instead of "page" for the metadata, that might break any downstream processing that depends on "page". Why not pick the page of the first fragment as the page for the whole image, and then you can also add "pages" in addition to be the set of all pages?

logger.error(f"Error parsing HTML table: {err}. Skipping...")
continue

def _calculate_element_position(self, element: Any, index: int = 0) -> Tuple[int, int]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can u pls explain why these changes are needed in unstructured? I thought this change should only impact Docling - is this additional changes beyond the multi-image? If so - can u pls explain what they do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes are not related to the image stitching. I did refactor the code to remove the duplication and redundancy on the multiple places.

What it does:

  1. Extracts page number from element metadata:
    page_num = getattr(element.metadata, 'page_number', 1) or 1
    - Gets the page_number attribute from element metadata
    - Defaults to 1 if not present or if value is None
  2. Calculates position using a standard formula:
    position = page_num * 1000 + index
    - Formula explanation: page_num * 1000 creates "buckets" of 1000 positions per page
    - Adding index places the element within its page's bucket
    - Example: Element at index 5 on page 3 → position = 3000 + 5 = 3005
    - This ensures elements are sorted by page first, then by order within the page
  3. Returns both values as a tuple:
    return page_num, position
    - Convenient for callers who need both values

Usage examples:

Get base position for a page (index defaults to 0)

page_num, base_position = self._calculate_element_position(element)

base_position = page_num * 1000

Get specific position for an element within a page

page_num, position = self._calculate_element_position(element, idx=5)

position = page_num * 1000 + 5

@@ -0,0 +1,320 @@
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please add more unit tests esp for this new model?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants