fix: Fix for the crash when encountering WMF images in pptx and docx #837

Merged 2 commits on Jan 30, 2025
9 changes: 4 additions & 5 deletions docling/backend/mspowerpoint_backend.py
@@ -271,13 +271,12 @@ def handle_title(self, shape, parent_slide, slide_ind, doc):
         return

     def handle_pictures(self, shape, parent_slide, slide_ind, doc):
-        # Get the image bytes
-        image = shape.image
-        image_bytes = image.blob
-        im_dpi, _ = image.dpi
-
         # Open it with PIL
         try:
+            # Get the image bytes
+            image = shape.image
+            image_bytes = image.blob
+            im_dpi, _ = image.dpi
             pil_image = Image.open(BytesIO(image_bytes))

             # shape has picture
4 changes: 2 additions & 2 deletions docling/backend/msword_backend.py
@@ -520,11 +520,11 @@ def get_docx_image(element, drawing_blip):
             image_data = image_part.blob  # Get the binary image data
             return image_data

-        image_data = get_docx_image(element, drawing_blip)
-        image_bytes = BytesIO(image_data)
         level = self.get_level()
         # Open the BytesIO object with PIL to create an Image
         try:
+            image_data = get_docx_image(element, drawing_blip)
+            image_bytes = BytesIO(image_data)
             pil_image = Image.open(image_bytes)
             doc.add_picture(
                 parent=self.parents[level - 1],
8 changes: 8 additions & 0 deletions docs/faq.md
@@ -151,3 +151,11 @@ This is a collection of FAQ collected from the user questions on <https://github
     pipeline_options = PdfPipelineOptions()
     pipeline_options.ocr_options.lang = ["fr", "de", "es", "en"]  # example of languages for EasyOCR
     ```
+
+
+??? question "Some images are missing from MS Word and PowerPoint"
+
+    ### Some images are missing from MS Word and PowerPoint
+
+    The image processing library used by Docling can handle embedded WMF images only on the Windows platform.
+    On other operating systems, these images are ignored.
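
For readers trying to work out which embedded images are affected, WMF data can usually be recognised from its first bytes before any decoding is attempted. The sketch below is not part of Docling or of this PR. `looks_like_wmf` is a hypothetical helper, and it assumes the two common header layouts: the placeable header, which starts with the key 0x9AC6CDD7, and the standard header, whose first two 16-bit words are the file type (1 or 2) and a header size of 9.

```python
# Hypothetical helper, not part of Docling: guess whether raw image bytes
# are a Windows Metafile (WMF) by looking at the first four bytes.

# Placeable WMF files start with the key 0x9AC6CDD7, stored little-endian.
PLACEABLE_WMF_KEY = b"\xd7\xcd\xc6\x9a"

# Standard WMF headers start with a 16-bit type (1 = memory, 2 = disk)
# followed by a 16-bit header size of 9 words, both little-endian.
STANDARD_WMF_PREFIXES = (b"\x01\x00\x09\x00", b"\x02\x00\x09\x00")


def looks_like_wmf(image_bytes: bytes) -> bool:
    """Return True if the data begins with a known WMF header."""
    head = image_bytes[:4]
    return head == PLACEABLE_WMF_KEY or head in STANDARD_WMF_PREFIXES


if __name__ == "__main__":
    placeable_wmf = PLACEABLE_WMF_KEY + bytes(18)  # key plus zero padding
    png_signature = b"\x89PNG\r\n\x1a\n"
    print(looks_like_wmf(placeable_wmf))  # True
    print(looks_like_wmf(png_signature))  # False
```

A check like this only identifies the format. It does not make the image decodable on non-Windows systems, so it is mainly useful for logging or reporting which images were skipped.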