magic_pdf.tools.cli:parse_doc:96 - code=8: invalid key in dict #595

Closed
hwf1324 opened this issue Sep 12, 2024 · 6 comments
Labels
bug Something isn't working

Comments

hwf1324 commented Sep 12, 2024

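Description of the bug

This error occurs late in the processing of the document.
It reproduces every time this document is processed.

Note: the document has many pages, so processing takes a long time.

How to reproduce the bug

Process the file:

magic-pdf -p COM本质论.pdf -o output

PDF: COM本质论.pdf

Error output: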

2024-09-12 00:42:38.097 | INFO     | magic_pdf.para.para_split_v2:__connect_middle_align_text:684 - 2.0615528128088303
2024-09-12 00:43:01.422 | ERROR    | magic_pdf.tools.cli:parse_doc:96 - code=8: invalid key in dict
Traceback (most recent call last):

  File "C:\Users\hwf1324\.conda\envs\MinerU\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
           │         │     └ {'__name__': '__main__', '__doc__': None, '__package__': '', '__loader__': <zipimporter object "C:\Users\hwf1324\.conda\envs\...
           │         └ <code object <module> at 0x000001C38A827EC0, file "C:\Users\hwf1324\.conda\envs\MinerU\Scripts\magic-pdf.exe\__main__.py", li...
           └ <function _run_code at 0x000001C38A811510>

  File "C:\Users\hwf1324\.conda\envs\MinerU\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
         │     └ {'__name__': '__main__', '__doc__': None, '__package__': '', '__loader__': <zipimporter object "C:\Users\hwf1324\.conda\envs\...
         └ <code object <module> at 0x000001C38A827EC0, file "C:\Users\hwf1324\.conda\envs\MinerU\Scripts\magic-pdf.exe\__main__.py", li...

  File "C:\Users\hwf1324\.conda\envs\MinerU\Scripts\magic-pdf.exe\__main__.py", line 7, in <module>
    sys.exit(cli())
    │   │    └ <Command cli>
    │   └ <built-in function exit>
    └ <module 'sys' (built-in)>

  File "C:\Users\hwf1324\.conda\envs\MinerU\lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           │    │     │       └ {}
           │    │     └ ()
           │    └ <function BaseCommand.main at 0x000001C38ACA5240>
           └ <Command cli>

  File "C:\Users\hwf1324\.conda\envs\MinerU\lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
         │    │      └ <click.core.Context object at 0x000001C38A878D00>
         │    └ <function Command.invoke at 0x000001C38ACA5CF0>
         └ <Command cli>

  File "C:\Users\hwf1324\.conda\envs\MinerU\lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           │   │      │    │           │   └ {'path': 'COM本质论.pdf', 'output_dir': 'output', 'method': 'auto', 'debug_able': False, 'start_page_id': 0, 'end_page_id': None}
           │   │      │    │           └ <click.core.Context object at 0x000001C38A878D00>
           │   │      │    └ <function cli at 0x000001C3E15D08B0>
           │   │      └ <Command cli>
           │   └ <function Context.invoke at 0x000001C38ACA4A60>
           └ <click.core.Context object at 0x000001C38A878D00>

  File "C:\Users\hwf1324\.conda\envs\MinerU\lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
                       │       └ {'path': 'COM本质论.pdf', 'output_dir': 'output', 'method': 'auto', 'debug_able': False, 'start_page_id': 0, 'end_page_id': None}
                       └ ()

  File "C:\Users\hwf1324\.conda\envs\MinerU\lib\site-packages\magic_pdf\tools\cli.py", line 102, in cli
    parse_doc(path)
    │         └ 'COM本质论.pdf'
    └ <function cli.<locals>.parse_doc at 0x000001C38A863910>

> File "C:\Users\hwf1324\.conda\envs\MinerU\lib\site-packages\magic_pdf\tools\cli.py", line 84, in parse_doc
    do_parse(
    └ <function do_parse at 0x000001C38E712B00>

  File "C:\Users\hwf1324\.conda\envs\MinerU\lib\site-packages\magic_pdf\tools\common.py", line 88, in do_parse
    draw_layout_bbox(pdf_info, pdf_bytes, local_md_dir, pdf_file_name)
    │                │         │          │             └ 'COM本质论'
    │                │         │          └ 'output\\COM本质论\\auto'
    │                │         └ b'%PDF-1.4\n%adultpdf.com\n1 0 obj\r<< \r/Type /Page \r/Parent 3008 0 R \r/MediaBox [ 0 0 468 710 ] \r/Resources 2 0 R \r/Con...
    │                └ [{'preproc_blocks': [{'type': 'image', 'bbox': [33, 34, 444, 264], 'blocks': []}, {'type': 'title', 'bbox': [163, 273, 449, 2...
    └ <function draw_layout_bbox at 0x000001C38E712830>

  File "C:\Users\hwf1324\.conda\envs\MinerU\lib\site-packages\magic_pdf\libs\draw_bbox.py", line 157, in draw_layout_bbox
    pdf_docs.save(f'{out_path}/{filename}_layout.pdf')
    │        └ <function Document.save at 0x000001C38D7CB520>
    └ Document('', <memory, doc# 6>)

  File "C:\Users\hwf1324\.conda\envs\MinerU\lib\site-packages\pymupdf\__init__.py", line 5452, in save
    mupdf.pdf_save_document(pdf, filename, opts)
    │     │                 │    │         └ (do_incremental=0 do_pretty=0 do_ascii=0 do_compress=0 do_compress_images=0 do_compress_fonts=0 do_decompress=0 do_garbage=0 ...
    │     │                 │    └ 'output\\COM本质论\\auto/COM本质论_layout.pdf'
    │     │                 └ <pymupdf.mupdf.PdfDocument; proxy of <Swig Object of type 'mupdf::PdfDocument *' at 0x000001C40D2611D0> >
    │     └ <function pdf_save_document at 0x000001C38D8F9360>
    └ <module 'pymupdf.mupdf' from 'C:\\Users\\hwf1324\\.conda\\envs\\MinerU\\lib\\site-packages\\pymupdf\\mupdf.py'>

  File "C:\Users\hwf1324\.conda\envs\MinerU\lib\site-packages\pymupdf\mupdf.py", line 50693, in pdf_save_document
    return _mupdf.pdf_save_document(doc, filename, opts)
           │      │                 │    │         └ (do_incremental=0 do_pretty=0 do_ascii=0 do_compress=0 do_compress_images=0 do_compress_fonts=0 do_decompress=0 do_garbage=0 ...
           │      │                 │    └ 'output\\COM本质论\\auto/COM本质论_layout.pdf'
           │      │                 └ <pymupdf.mupdf.PdfDocument; proxy of <Swig Object of type 'mupdf::PdfDocument *' at 0x000001C40D2611D0> >
           │      └ <built-in function pdf_save_document>
           └ <module 'pymupdf._mupdf' from 'C:\\Users\\hwf1324\\.conda\\envs\\MinerU\\lib\\site-packages\\pymupdf\\_mupdf.pyd'>

pymupdf.mupdf.FzErrorSyntax: code=8: invalid key in dict

Operating system

Windows

Python version

3.10

Software version (magic-pdf --version)

0.8.x

Device mode

cuda

hwf1324 added the bug label Sep 12, 2024
Collaborator

myhloli commented Sep 12, 2024

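This is the same problem as #572.
My guess is that the PDF file is damaged, so writing the output fails when the program draws the bounding boxes after it has finished parsing.
You can re-print the PDF to a local file from a browser and try again; I tried that here, and the re-printed file was processed without problems.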
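The traceback in this issue fails inside draw_layout_bbox, on the call that saves the layout overlay PDF, which is a debug output written after parsing. A minimal sketch of one way to keep such an optional write from aborting the whole run is shown below. It is not the fix that was applied in MinerU; `save_debug_output` and `failing_save` are hypothetical names introduced for illustration, and the simulated failure stands in for the pymupdf error so the example runs without PyMuPDF.

```python
import logging
from typing import Callable

logger = logging.getLogger("layout_debug")


def save_debug_output(save_fn: Callable[[str], None], out_path: str) -> bool:
    """Try to write an optional debug file; report failure instead of raising.

    save_fn is a hypothetical stand-in for a call such as the
    pdf_docs.save(...) line shown in the traceback.
    """
    try:
        save_fn(out_path)
    except Exception as exc:  # e.g. the FzErrorSyntax raised for a damaged PDF
        logger.warning("could not write %s: %s", out_path, exc)
        return False
    return True


def failing_save(path: str) -> None:
    # Simulates the failure from this issue without needing PyMuPDF.
    raise RuntimeError("code=8: invalid key in dict")


if __name__ == "__main__":
    # The failed write is logged and reported as False; nothing is raised.
    print(save_debug_output(failing_save, "COM本质论_layout.pdf"))
```

With a guard like this, a damaged input would still lose the overlay file, so re-printing the PDF as suggested above remains the way to get a complete set of outputs.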

Collaborator

myhloli commented Sep 12, 2024

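When testing on version 0.8.x or later, you can add the -e 5 argument to parse only the first few pages instead of the whole book, which speeds up testing.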
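As a small sketch of how the -e option could be added to the command from the report when scripting a quick test: the -p, -o and -e flags are the ones that appear in this thread, while `build_magic_pdf_command` is a hypothetical helper name. The helper only builds the argument list and does not run anything.

```python
from typing import List, Optional


def build_magic_pdf_command(
    pdf_path: str, output_dir: str, end_page: Optional[int] = None
) -> List[str]:
    """Build the magic-pdf argument list, optionally limiting the page range."""
    cmd = ["magic-pdf", "-p", pdf_path, "-o", output_dir]
    if end_page is not None:
        # -e limits parsing to the first few pages, as suggested above.
        cmd += ["-e", str(end_page)]
    return cmd


if __name__ == "__main__":
    print(build_magic_pdf_command("COM本质论.pdf", "output", end_page=5))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where magic-pdf is installed.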

Author

hwf1324 commented Sep 12, 2024

Strange: after removing the open-issues filter and searching for the error code, I still couldn't find that issue?

Author

hwf1324 commented Sep 12, 2024

OK, I'll try it later.

hwf1324 closed this as completed Sep 12, 2024
Collaborator

myhloli commented Sep 12, 2024

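> Strange: after removing the open-issues filter and searching for the error code, I still couldn't find that issue?

The error codes are not exactly the same, but both show up as a failure to write the output, so they should be the same class of problem.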

Collaborator

myhloli commented Nov 14, 2024

#957 fixed
