Extraction of text #2626
-
page = doc[0]
I used the code above to extract text from page 1 of the PDF below. BNL-76953-2006-CP appears only once visually, but when extracting the text spans I see it more than once. Can you please let me know the reason?
-
Clicking on your link doesn't do anything, so I cannot look at the file.
-
32542.pdf
-
Actually, if I were to make such a PDF, I would create
I would never ever write standard text underneath such a field, as has happened here! What purpose does that serve?! But as usual with PDF, Murphy's Law applies: whatever is possible will happen sooner or later.
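To see where a duplicated string comes from, it can help to list every span that contains it together with its bounding box. The following is only a minimal sketch: `find_spans` is a hypothetical helper name, and it assumes its input is the dictionary returned by `page.get_text("dict")`.

```python
def find_spans(page_dict, needle):
    """Return (text, bbox) for every text span that contains `needle`.

    `page_dict` is assumed to be the result of page.get_text("dict").
    """
    hits = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # type 0 = text block, type 1 = image block
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                if needle in span.get("text", ""):
                    hits.append((span["text"], tuple(span["bbox"])))
    return hits
```

With a page loaded as `page = doc[0]`, calling `find_spans(page.get_text("dict"), "BNL-76953-2006-CP")` returns one entry per occurrence. Two entries with nearly identical bounding boxes would suggest that the same text was written twice at the same position.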
-
Can you let me know what the rectangles on page 7 of the PDF below represent? Are they graphics?

No, they are not graphics but so-called inline images: all of the image information is part of the page's /Contents object. Because they have no xref, you don't see them using page.get_images(). But PyMuPDF doesn't let you down!
Information about all page images: page.get_image_info(...). To extract them, use text extraction: page.get_text("dict", ...).
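The two calls above could be combined roughly as follows. This is a minimal sketch, not a definitive recipe: the helper names, the output file naming, and the example path are all hypothetical. It relies on image blocks in the "dict" output having type 1 and carrying their bytes under "image" and a file extension under "ext", and on get_image_info(xrefs=True) reporting xref 0 for inline images.

```python
from pathlib import Path


def image_blocks(page_dict):
    """Return the image blocks (type 1) from a page.get_text("dict") result."""
    return [b for b in page_dict.get("blocks", []) if b.get("type") == 1]


def save_image_blocks(page_dict, out_dir, stem="img"):
    """Write each image block's bytes to out_dir and return the file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, block in enumerate(image_blocks(page_dict)):
        path = out_dir / f"{stem}-{i}.{block['ext']}"
        path.write_bytes(block["image"])
        paths.append(path)
    return paths


def extract_page_images(pdf_path, page_index, out_dir):
    """List all images on one page, then save them via text extraction."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    doc = fitz.open(pdf_path)
    page = doc[page_index]
    # Inline images have no xref, so they are reported with xref 0.
    for info in page.get_image_info(xrefs=True):
        kind = "inline" if info["xref"] == 0 else f"xref {info['xref']}"
        print(kind, info["bbox"], info["width"], info["height"])
    return save_image_blocks(page.get_text("dict"), out_dir, stem=f"page{page_index}")
```

For page 7 of a file, the call would be something like `extract_page_images("input.pdf", 6, "images")`, since page indices are zero-based.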