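I tested with your PDF using both the tablemaster and rapidtable models, and neither run hit the `IndexError: index 0 is out of bounds for axis 0 with size 0` error.
The footer being recognized as a table comes from the layout model's output, and that probably cannot be fixed in the short term. If all of your documents use this format, you can try covering the footers in bulk with white rectangles at the footer position in the PDF, and then run recognition again.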
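The footer-masking workaround suggested above could be sketched roughly as follows. This is only an illustration, not part of magic-pdf: the helper names `footer_rect` and `mask_footers` and the 40-point default footer height are assumptions, and the sketch relies on PyMuPDF (`fitz`) being installed. It also assumes unrotated pages whose page rectangle starts at the origin. The rectangle only hides the footer visually; any text underneath remains in the file.

```python
from typing import Tuple


def footer_rect(page_width: float, page_height: float,
                footer_height: float) -> Tuple[float, float, float, float]:
    """Return (x0, y0, x1, y1) of a full-width strip at the bottom of a page.

    Coordinates follow PyMuPDF's page convention: origin at the top-left
    corner, y increasing downward, units in PDF points.
    """
    if not 0 < footer_height < page_height:
        raise ValueError("footer_height must be between 0 and the page height")
    return (0.0, page_height - footer_height, page_width, page_height)


def mask_footers(src_path: str, dst_path: str,
                 footer_height: float = 40.0) -> None:
    """Draw a white filled rectangle over the footer area of every page.

    footer_height is a guess (hypothetical default); measure it on your
    own documents before batch-processing them.
    """
    import fitz  # PyMuPDF; imported lazily so footer_rect works without it

    doc = fitz.open(src_path)
    try:
        for page in doc:
            x0, y0, x1, y1 = footer_rect(page.rect.width,
                                         page.rect.height,
                                         footer_height)
            # White border and white fill, so the strip renders as blank paper.
            page.draw_rect(fitz.Rect(x0, y0, x1, y1),
                           color=(1, 1, 1), fill=(1, 1, 1))
        doc.save(dst_path)
    finally:
        doc.close()
```

For a 595 x 841 point page, like the `/BBox` visible in the traceback, `footer_rect(595, 841, 40)` covers the bottom 40 points. Calling `mask_footers("in.pdf", "out.pdf")` (placeholder file names) would write a masked copy that can then be passed to the parser.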
Description of the bug
An error is raised while parsing a PDF.
', '', '', '', '', '', '', '', '</e...app-1 | 2024-11-06 10:42:24.790 | INFO | magic_pdf.model.pdf_extract_kit:call:490 - table time: 0.0
app-1 | │ │ │ │ └ b'%PDF-1.7\n%\xe4\xe3\xcf\xd2\n4 0 obj\n<</Type/XObject\n/Subtype/Form\n/FormType 1\n/Matrix[1 0 0 1 0 0]\n/BBox[0 0 595 841]...
app-1 | │ │ │ └ <magic_pdf.pipe.OCRPipe.OCRPipe object at 0x7f19c9e5f370>
app-1 | │ │ └ <function doc_analyze at 0x7f1b4175c160>
app-1 | │ └ []
app-1 | └ <magic_pdf.pipe.OCRPipe.OCRPipe object at 0x7f19c9e5f370>
app-1 | File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/model/doc_analyze_by_custom_model.py", line 166, in doc_analyze
app-1 | result = custom_model(img)
app-1 | │ └ array([[[255, 255, 255],
app-1 | │ [255, 255, 255],
app-1 | │ [255, 255, 255],
app-1 | │ ...,
app-1 | │ [255, 255, 255],
app-1 | │ [255...
app-1 | └ <magic_pdf.model.pdf_extract_kit.CustomPEKModel object at 0x7f19c9e5dcc0>
app-1 | File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/model/pdf_extract_kit.py", line 468, in call
app-1 | html_code = self.table_model.img2html(new_image)
app-1 | │ │ │ └ <PIL.Image.Image image mode=RGB size=1283x457 at 0x7F19C9E5FA60>
app-1 | │ │ └ <function ppTableModel.img2html at 0x7f1a144c48b0>
app-1 | │ └ <magic_pdf.model.ppTableModel.ppTableModel object at 0x7f19e1bfbb20>
app-1 | └ <magic_pdf.model.pdf_extract_kit.CustomPEKModel object at 0x7f19c9e5dcc0>
app-1 | File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/model/ppTableModel.py", line 42, in img2html
app-1 | pred_res, _ = self.table_sys(image)
app-1 | │ │ └ array([[[255, 255, 255],
app-1 | │ │ [255, 255, 255],
app-1 | │ │ [255, 255, 255],
app-1 | │ │ ...,
app-1 | │ │ [ 67, 67, 67],
app-1 | │ │ [ 67...
app-1 | │ └ <paddleocr.ppstructure.table.predict_table.TableSystem object at 0x7f19e1bfbd00>
app-1 | └ <magic_pdf.model.ppTableModel.ppTableModel object at 0x7f19e1bfbb20>
app-1 | File "/opt/mineru_venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/predict_table.py", line 100, in call
app-1 | pred_html = self.match(structure_res, dt_boxes, rec_res)
app-1 | │ │ │ │ └ []
app-1 | │ │ │ └ array([], dtype=float64)
app-1 | │ │ └ (['', '', '
app-1 | │ └ <ppstructure.table.table_master_match.TableMasterMatcher object at 0x7f19c9d3fb80>
app-1 | └ <paddleocr.ppstructure.table.predict_table.TableSystem object at 0x7f19e1bfbd00>
app-1 | File "/opt/mineru_venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/table_master_match.py", line 949, in call
app-1 | match_results = self.match()
app-1 | │ └ <function Matcher.match at 0x7f1a1448cd30>
app-1 | └ <ppstructure.table.table_master_match.TableMasterMatcher object at 0x7f19c9d3fb80>
app-1 | File "/opt/mineru_venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/table_master_match.py", line 769, in match
app-1 | get_bboxes_list(end2end_result, structure_master_result)
app-1 | │ │ └ {'text': ',,,,,,,,,,,,<e...
app-1 | │ └ []
app-1 | └ <function get_bboxes_list at 0x7f1a1448c3a0>
app-1 | File "/opt/mineru_venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/table_master_match.py", line 302, in get_bboxes_list
app-1 | xywh_bbox = xyxy2xywh(src_bboxes)
app-1 | │ └ array([], dtype=float64)
app-1 | └ <function xyxy2xywh at 0x7f1a14693d00>
app-1 | File "/opt/mineru_venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/table_master_match.py", line 71, in xyxy2xywh
app-1 | new_bboxes[0] = bboxes[0] + (bboxes[2] - bboxes[0]) / 2
app-1 | │ │ │ └ array([], dtype=float64)
app-1 | │ │ └ array([], dtype=float64)
app-1 | │ └ array([], dtype=float64)
app-1 | └ array([], dtype=float64)
app-1 |
app-1 | IndexError: index 0 is out of bounds for axis 0 with size 0
app-1 | INFO: 10.0.104.3:53724 - "POST /pdf_parse?parse_method=ocr&is_json_md_dump=True&output_dir=output HTTP/1.1" 500 Internal Server Error
How to reproduce the bug
These are two screenshots of the PDF that needs to be parsed; it is not convenient to upload the whole file.
Operating system
Linux
Python version
3.10
Software version (magic-pdf --version)
0.9.x
Device mode
cuda