[bug] Markitdown failed to convert pdf that contains image #217

Open
Drjunchenfeng opened this issue Dec 26, 2024 · 2 comments

Comments

@Drjunchenfeng

Attached file: cn_dissertation_1st_page.pdf
When I try to convert the attached file with

    result = md.convert(file_path)
    return result.text_content

I get the following error:

Traceback (most recent call last):
  File "/Users/fengjunchen/Library/Caches/pypoetry/virtualenvs/ai-advisor-IuYXIAy1-py3.10/lib/python3.10/site-packages/markitdown/_markitdown.py", line 1239, in _convert
    res = converter.convert(local_path, **_kwargs)
  File "/Users/fengjunchen/Library/Caches/pypoetry/virtualenvs/ai-advisor-IuYXIAy1-py3.10/lib/python3.10/site-packages/markitdown/_markitdown.py", line 490, in convert
    text_content=pdfminer.high_level.extract_text(local_path),
  File "/Users/fengjunchen/Library/Caches/pypoetry/virtualenvs/ai-advisor-IuYXIAy1-py3.10/lib/python3.10/site-packages/pdfminer/high_level.py", line 169, in extract_text
    for page in PDFPage.get_pages(
  File "/Users/fengjunchen/Library/Caches/pypoetry/virtualenvs/ai-advisor-IuYXIAy1-py3.10/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 171, in get_pages
    for (pageno, page) in enumerate(cls.create_pages(doc)):
  File "/Users/fengjunchen/Library/Caches/pypoetry/virtualenvs/ai-advisor-IuYXIAy1-py3.10/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 127, in create_pages
    yield cls(document, objid, tree, next(page_labels))
  File "/Users/fengjunchen/Library/Caches/pypoetry/virtualenvs/ai-advisor-IuYXIAy1-py3.10/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 64, in __init__
    resolve1(mediabox_param) for mediabox_param in self.attrs["MediaBox"]
KeyError: 'MediaBox'

I am running Python 3.10 on macOS 15.2 (24C101).

@Drjunchenfeng
Author

After consulting o1 and tinkering with it, I realized the cause: I was using pymupdf to reconstruct the PDF page, so the rebuilt page is missing this metadata (the MediaBox entry named in the KeyError).
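
For anyone hitting the same KeyError, here is a minimal sketch of the lookup that fails. In the PDF format, MediaBox is an inheritable page attribute: a page may carry it directly or pick it up from an ancestor in the page tree. The sketch assumes the page dictionaries are already loaded as plain Python dicts with an optional "Parent" link. That representation and the name `resolve_media_box` are hypothetical, for illustration only; this is not the pdfminer or pymupdf API.

```python
from typing import Any, Dict, List, Optional


def resolve_media_box(page: Dict[str, Any]) -> Optional[List[float]]:
    """Return the page's MediaBox, walking up "Parent" links if needed.

    Hypothetical helper: pages are modelled as plain dicts, not as
    pdfminer or pymupdf objects.
    """
    node: Optional[Dict[str, Any]] = page
    seen = set()  # guard against a cyclic Parent chain in a broken file
    while node is not None and id(node) not in seen:
        seen.add(id(node))
        if "MediaBox" in node:
            return node["MediaBox"]
        node = node.get("Parent")
    return None  # neither the page nor any ancestor defines MediaBox


# A page that inherits MediaBox from its parent node resolves fine.
root = {"MediaBox": [0, 0, 595, 842]}
inherited_page = {"Parent": root}
# A rebuilt page with no MediaBox anywhere is the failing case.
broken_page: Dict[str, Any] = {}

print(resolve_media_box(inherited_page))
print(resolve_media_box(broken_page))
```

If a check like this returns None for a page, the file most likely needs to be regenerated with its page size set before it is passed to `md.convert`.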

@l-lumin
Contributor

l-lumin commented Dec 26, 2024

Check out #139.
