fix: Fix for the crash when encountering WMF images in pptx and docx (#837)

* Fix for the crash when encountering WMF images in pptx and docx backends on non-Windows platforms

Signed-off-by: Maksym Lysak <[email protected]>

* Updated faq

Signed-off-by: Maksym Lysak <[email protected]>

---------

Signed-off-by: Maksym Lysak <[email protected]>
Co-authored-by: Maksym Lysak <[email protected]>
maxmnemonic and Maksym Lysak authored Jan 30, 2025
1 parent d01a2e7 commit fea0a99
Showing 3 changed files with 14 additions and 7 deletions.
9 changes: 4 additions & 5 deletions docling/backend/mspowerpoint_backend.py
@@ -271,13 +271,12 @@ def handle_title(self, shape, parent_slide, slide_ind, doc):
         return
 
     def handle_pictures(self, shape, parent_slide, slide_ind, doc):
-        # Get the image bytes
-        image = shape.image
-        image_bytes = image.blob
-        im_dpi, _ = image.dpi
-
         # Open it with PIL
         try:
+            # Get the image bytes
+            image = shape.image
+            image_bytes = image.blob
+            im_dpi, _ = image.dpi
             pil_image = Image.open(BytesIO(image_bytes))
 
             # shape has picture
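
The change above moves image extraction into the existing `try` block, so a failure while reading the image is handled the same way as a failure while decoding it. A minimal, standard-library-only sketch of that pattern follows. The helper `load_or_skip` and its `extract`/`decode` callables are hypothetical names used for illustration. The `except` clause of the real backend is not visible in this diff, so the broad `except Exception` here is an assumption.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
_log = logging.getLogger(__name__)


def load_or_skip(extract: Callable[[], bytes], decode: Callable[[bytes], T]) -> Optional[T]:
    """Run extraction and decoding inside one try block.

    Returns the decoded object, or None if either step raises.
    """
    try:
        raw = extract()     # may raise, e.g. while reading image metadata
        return decode(raw)  # may raise, e.g. for an unsupported format
    except Exception as exc:  # broad on purpose in this sketch (assumption)
        _log.warning("Skipping image: %s", exc)
        return None
```

With this shape, a caller can treat `None` as "no picture to add" and continue with the rest of the document instead of aborting the whole conversion.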
4 changes: 2 additions & 2 deletions docling/backend/msword_backend.py
@@ -520,11 +520,11 @@ def get_docx_image(element, drawing_blip):
                 image_data = image_part.blob # Get the binary image data
             return image_data
 
-        image_data = get_docx_image(element, drawing_blip)
-        image_bytes = BytesIO(image_data)
         level = self.get_level()
         # Open the BytesIO object with PIL to create an Image
         try:
+            image_data = get_docx_image(element, drawing_blip)
+            image_bytes = BytesIO(image_data)
             pil_image = Image.open(image_bytes)
             doc.add_picture(
                 parent=self.parents[level - 1],
8 changes: 8 additions & 0 deletions docs/faq.md
@@ -151,3 +151,11 @@ This is a collection of FAQ collected from the user questions on <https://github
     pipeline_options = PdfPipelineOptions()
     pipeline_options.ocr_options.lang = ["fr", "de", "es", "en"] # example of languages for EasyOCR
     ```
+
+
+??? "Some images are missing from MS Word and PowerPoint"
+
+    ### Some images are missing from MS Word and PowerPoint
+
+    The image processing library used by Docling can handle embedded WMF images only on the Windows platform.
+    If you are on another operating system, these images will be ignored.
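
To check in advance whether an embedded image is likely to be affected, the image bytes can be inspected for the placeable WMF header. The sketch below is not part of Docling, and the function names are hypothetical. It relies only on the documented placeable-metafile key `0x9AC6CDD7`, stored little-endian, and on `platform.system()` from the standard library. WMF files without the placeable header are not detected by this check.

```python
import platform

# Placeable WMF files begin with the key 0x9AC6CDD7 in little-endian byte order.
PLACEABLE_WMF_MAGIC = b"\xd7\xcd\xc6\x9a"


def looks_like_placeable_wmf(data: bytes) -> bool:
    """Return True if the bytes start with the placeable WMF key."""
    return data[:4] == PLACEABLE_WMF_MAGIC


def wmf_likely_ignored(data: bytes) -> bool:
    """Return True for a placeable WMF image on a non-Windows system."""
    return looks_like_placeable_wmf(data) and platform.system() != "Windows"
```

A caller could run `wmf_likely_ignored` over the image blobs of a document to list the pictures that would be missing from the converted output on Linux or macOS.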
