The PDF file extracted many black images, but these black images are not actually present in the PDF file. #30963

shy20221121 · 2026-01-14T06:09:32Z

Self Checks

I have searched for existing issues, including closed ones.
I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
Please do not modify this template :) and fill in all the required fields.

1. Is this request related to a challenge you're experiencing? Tell me about your story.

Extracting the PDF file produced many black images, but these black images are not actually present in the PDF file.

How can I turn off automatic extraction of image content from PDFs?

Dify version: 1.11.3

2. Additional context or comments

No response

shy20221121 · 2026-01-14T06:13:54Z

dosubot[bot]
bot Jan 14, 2026

Dify currently extracts all images from PDFs automatically using the pypdfium2 library, and there is no built-in option or environment variable to disable this behavior in v1.11.3. The black images you're seeing are likely artifacts of the extraction process rather than actual images in your PDF (see implementation).

To prevent image extraction, you would need to modify the code. In api/core/rag/extractor/pdf_extractor.py, you can comment out or remove the lines in the parse method that call self._extract_images(page) and append the result to content. For example:

# image_content = self._extract_images(page)
# if image_content:
#     content += "\n" + image_content

This will skip image extraction and only extract text from PDFs.

If you want a more flexible solution, you could add a flag to the PdfExtractor class to control image extraction, and set it when instantiating the extractor.
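
As a rough illustration of that flag idea, here is a minimal, self-contained sketch. It does not use Dify's real code: the class name `FlaggedPdfExtractor`, the `extract_images` flag, and the `text_fn` / `image_fn` callables are all hypothetical stand-ins for the extractor's text and image steps, so treat it as a pattern to adapt rather than a patch to apply.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class FlaggedPdfExtractor:
    """Hypothetical stand-in for a PDF extractor; NOT Dify's actual PdfExtractor."""

    text_fn: Callable[[object], str]   # assumed: returns the text of one page
    image_fn: Callable[[object], str]  # assumed: returns image markup for one page
    extract_images: bool = True        # assumed flag; True mirrors the current behavior

    def parse(self, pages: Iterable[object]) -> list[str]:
        results = []
        for page in pages:
            content = self.text_fn(page)
            # Only run the image step when the flag is enabled.
            if self.extract_images:
                image_content = self.image_fn(page)
                if image_content:
                    content += "\n" + image_content
            results.append(content)
        return results


# Usage: with the flag off, only the text of each page is returned.
text_only = FlaggedPdfExtractor(
    text_fn=lambda page: page,
    image_fn=lambda page: "![image](extracted.png)",
    extract_images=False,
)
print(text_only.parse(["page one text", "page two text"]))
```

Defaulting the flag to `True` keeps existing behavior unchanged for callers that do not set it, which is usually the safer choice when patching a deployment.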

Let me know if you need step-by-step guidance for this code change or want help making it configurable.


2 replies

shy20221121 Jan 14, 2026
Author

@dosu
Will this be optimized in the next version?

dosubot[bot] bot Jan 14, 2026

There are currently no plans or ongoing work to add a configuration option or optimization for disabling PDF image extraction in upcoming Dify versions. Image extraction is always enabled, and turning it off would require a code change in the extractor logic (see details). If you want, I can guide you on how to implement a workaround in your deployment.
