add skip no bbox logic #95

Merged
merged 2 commits into from
Apr 30, 2024

12 changes: 12 additions & 0 deletions magic_pdf/pdf_parse_union_core.py
@@ -126,6 +126,13 @@ def parse_page_core(pdf_docs, magic_model, page_id, pdf_bytes_md5, imageWriter,
img_blocks, table_blocks, discarded_blocks, text_blocks, title_blocks,
interline_equations, page_w, page_h)

'''Skip the current page if it has no bboxes'''
if len(all_bboxes) == 0:
    logger.warning(f"skip this page, not found bbox, page_id: {page_id}")
    return ocr_construct_page_component_v2([], [], page_id, page_w, page_h, [],
                                           [], [], interline_equations, discarded_blocks,
                                           need_drop, drop_reason)

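"""Before splitting, check whether any bboxes overlap left-to-right. If they do, treat this PDF as one we cannot yet handle well; such overlaps are most likely caused by interline equations or tables in the PDF that were not recognized correctly."""

while True:  # Keep checking for left-right overlaps; when one exists, delete the smaller bbox, until no overlaps remain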
@@ -178,6 +185,7 @@ def parse_page_core(pdf_docs, magic_model, page_id, pdf_bytes_md5, imageWriter,
need_drop, drop_reason)
return page_info


def pdf_parse_union(pdf_bytes,
model_list,
imageWriter,
@@ -225,3 +233,7 @@ def pdf_parse_union(pdf_bytes,
}

return new_pdf_info_dict


if __name__ == '__main__':
pass
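
For reviewers who want to see the guard in isolation, here is a minimal sketch of the early-return pattern this change adds. It is not the code from `magic_pdf`: `build_page`, `layout_blocks`, and `parse_page` are hypothetical stand-ins (for `ocr_construct_page_component_v2`, the splitting stage, and `parse_page_core` respectively), the bbox tuple layout `(x0, y0, x1, y1)` is an assumption, and the standard-library `logging` module is used only to keep the sketch self-contained.

```python
import logging

logger = logging.getLogger(__name__)


def build_page(blocks, page_id, page_w, page_h):
    """Hypothetical stand-in for ocr_construct_page_component_v2."""
    return {"page_idx": page_id, "page_size": [page_w, page_h], "blocks": blocks}


def layout_blocks(all_bboxes):
    """Hypothetical stand-in for the splitting stage; it needs at least one bbox."""
    # min() raises ValueError on an empty list, which is why a guard is needed.
    x0 = min(b[0] for b in all_bboxes)
    # Order bboxes top-to-bottom, then left-to-right relative to the left margin.
    return sorted(all_bboxes, key=lambda b: (b[1], b[0] - x0))


def parse_page(all_bboxes, page_id, page_w, page_h):
    """Hypothetical, heavily simplified analogue of parse_page_core."""
    # Guard: a page with no bboxes is returned as an empty page, not split.
    if len(all_bboxes) == 0:
        logger.warning("skip this page, not found bbox, page_id: %s", page_id)
        return build_page([], page_id, page_w, page_h)
    return build_page(layout_blocks(all_bboxes), page_id, page_w, page_h)
```

Calling `parse_page([], 3, 595.0, 842.0)` in this sketch logs a warning and returns a page with an empty `blocks` list, so the downstream stage never sees an empty bbox list.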