Release 1.1.0 #1614

Merged 77 commits on Jan 23, 2025

Commits (77)
9f12c39
Update pdf_parse_union_core_v2.py
myhloli Jan 14, 2025
1b0ef29
Merge pull request #1534 from myhloli/dev
myhloli Jan 14, 2025
ee9340e
build(deps): add upper version limit for PyMuPDF
myhloli Jan 14, 2025
bbe0276
Merge pull request #1535 from myhloli/doclayoutyolo-fix
myhloli Jan 14, 2025
c20e9a1
feat(layout): improve title block handling and layout detection
myhloli Jan 14, 2025
902dcd2
refactor(BatchAnalyze): comment out image rotation logic in doclayout…
myhloli Jan 14, 2025
206fcb3
Merge pull request #1537 from myhloli/doclayoutyolo-fix
myhloli Jan 14, 2025
bbd8695
feat(post_proc): enhance title block processing with average line height
myhloli Jan 14, 2025
60054fe
Merge pull request #1538 from myhloli/doclayoutyolo-fix
myhloli Jan 14, 2025
f37b14b
refactor(pre_proc): adjust IOU threshold for character overlap detection
myhloli Jan 15, 2025
e139e5b
Merge pull request #1542 from myhloli/dev
myhloli Jan 15, 2025
916ced9
docs(magic_pdf): update llm_aided.py prompt for title list optimization
myhloli Jan 15, 2025
1954573
Merge pull request #1543 from myhloli/dev
myhloli Jan 15, 2025
1a549a0
fix(language): remove invalid UTF-16 surrogate pairs from input text
myhloli Jan 15, 2025
fbc7611
Merge pull request #1546 from myhloli/dev
myhloli Jan 15, 2025
75a61fc
update logo
myhloli Jan 15, 2025
316f3b9
Merge pull request #1547 from myhloli/dev
myhloli Jan 15, 2025
84f808f
build(docker): update doclayout-yolo dependency
myhloli Jan 15, 2025
f405cc2
Merge pull request #1549 from myhloli/dev
myhloli Jan 15, 2025
f350222
feat(model): improve batch analysis logic and support npu
myhloli Jan 15, 2025
852ae37
Merge pull request #1550 from myhloli/dev
myhloli Jan 15, 2025
8570e00
refactor(magic_pdf): improve title block merging logic
myhloli Jan 15, 2025
2aea5d6
Merge pull request #1551 from myhloli/dev
myhloli Jan 15, 2025
f209dde
fix(magic_pdf): correct end page index and improve error handling
myhloli Jan 16, 2025
d08fe27
Merge pull request #1553 from myhloli/dev
myhloli Jan 16, 2025
cebaffb
docs(README): update demo badges
myhloli Jan 16, 2025
443966d
docs(README): update demo badges
myhloli Jan 16, 2025
46ce94e
Merge branch 'opendatalab:dev' into dev
myhloli Jan 16, 2025
63c267f
Merge pull request #1554 from myhloli/dev
myhloli Jan 16, 2025
79c8a5c
feat(table): upgrade RapidTable to 1.0.3 and add sub-model support
myhloli Jan 16, 2025
61f75b3
build(docker): update rapid-table dependency
myhloli Jan 16, 2025
452a9c0
refactor(model): update batch analyze logic for rapid table model
myhloli Jan 16, 2025
230191c
Merge pull request #1556 from myhloli/dev
myhloli Jan 16, 2025
48c2051
docs(README): update WeChat group link
myhloli Jan 16, 2025
fd5427a
Merge pull request #1557 from myhloli/dev
myhloli Jan 16, 2025
e64d4fe
refactor(table): add device configuration for Unitable model
myhloli Jan 17, 2025
af3ec55
Merge pull request #1567 from myhloli/dev
myhloli Jan 17, 2025
59502e5
refactor(model): update config version check to 1.1.1
myhloli Jan 17, 2025
9da857d
Merge pull request #1569 from myhloli/dev
myhloli Jan 17, 2025
db8be97
fix(magic_pdf): limit batch ratio for GPU memory
myhloli Jan 17, 2025
b894b78
Merge pull request #1570 from myhloli/dev
myhloli Jan 17, 2025
d986e39
feat(llm_aided): add reasonability check and fine-tuning guidelines
myhloli Jan 17, 2025
48a4337
Merge pull request #1571 from myhloli/dev
myhloli Jan 17, 2025
fbf1c4b
Fix ocr utils
moria97 Jan 20, 2025
96d5bc0
Merge pull request #1578 from moria97/fix-ocr
myhloli Jan 20, 2025
ba6c17a
feat(pdf_parse): remove tilted lines for better text extraction
myhloli Jan 20, 2025
f473028
Merge pull request #1580 from myhloli/dev
myhloli Jan 20, 2025
b3d60b9
fix(ocr): improve ONNX model initialization and error handling
myhloli Jan 20, 2025
58d28c9
Merge pull request #1582 from myhloli/dev
myhloli Jan 20, 2025
2a3a006
fix(models): update unimernet_small model path
myhloli Jan 21, 2025
6be3277
Merge pull request #1591 from myhloli/dev
myhloli Jan 21, 2025
49d140c
perf(model): adjust batch size for layout and formula detection
myhloli Jan 21, 2025
052a4d7
perf(magic_pdf): optimize batch ratio calculation for GPU
myhloli Jan 21, 2025
e74a296
refactor(magic_pdf): adjust VRAM allocation and MFR batch size- Updat…
myhloli Jan 21, 2025
636d78a
Merge pull request #1593 from myhloli/dev
myhloli Jan 21, 2025
037736f
perf(magic_pdf): adjust batch ratio calculation for GPU memory
myhloli Jan 21, 2025
08a9558
Merge pull request #1594 from myhloli/dev
myhloli Jan 21, 2025
55447c8
perf(magic_pdf): optimize batch processing for GPU
myhloli Jan 21, 2025
b6710b9
fix(magic_pdf): correct batch ratio conditions for GPU memory
myhloli Jan 21, 2025
98c0568
Merge pull request #1595 from myhloli/dev
myhloli Jan 21, 2025
1d08865
refactor(pdf_parse): uncomment char bbox validation logic
myhloli Jan 22, 2025
c38060d
fix(boxbase): handle cases where bounding box area is zero
myhloli Jan 22, 2025
c7a3a68
Merge pull request #1601 from myhloli/dev
myhloli Jan 22, 2025
10e848b
feat(pdf_parse_union_core_v2): add timing log for LLM aided processes
myhloli Jan 22, 2025
4fe89d5
docs(readme):update readme for 1.1.0
myhloli Jan 22, 2025
5115d00
Merge pull request #1602 from myhloli/dev
myhloli Jan 22, 2025
af4b209
docs(url): update Miners links in header
myhloli Jan 22, 2025
dc41636
Merge pull request #1606 from myhloli/dev
myhloli Jan 22, 2025
2eb50da
Merge pull request #1607 from opendatalab/dev
myhloli Jan 22, 2025
6ff18b1
feat(table-config): add sub_model configuration for rapid_table
myhloli Jan 23, 2025
f101826
Merge pull request #1612 from myhloli/dev
myhloli Jan 23, 2025
235f341
Merge pull request #1613 from opendatalab/dev
myhloli Jan 23, 2025
ab263aa
docs(readme): update changelog for v1.1.0 release- Update model capab…
myhloli Jan 23, 2025
5c4c79e
Merge pull request #1616 from myhloli/dev
myhloli Jan 23, 2025
30ac4d0
docs(README): update online demo links and enhance documentation read…
myhloli Jan 23, 2025
24f352f
Merge pull request #1617 from myhloli/dev
myhloli Jan 23, 2025
adcace4
Merge pull request #1618 from opendatalab/dev
myhloli Jan 23, 2025

19 changes: 14 additions & 5 deletions README.md

Large diffs are not rendered by default.

19 changes: 14 additions & 5 deletions README_zh-CN.md

Large diffs are not rendered by default.

7 changes: 3 additions & 4 deletions docker/ascend_npu/requirements.txt
@@ -1,7 +1,7 @@
 boto3>=1.28.43
 Brotli>=1.1.0
 click>=8.1.7
-PyMuPDF>=1.24.9
+PyMuPDF>=1.24.9,<=1.24.14
 loguru>=0.6.0
 numpy>=1.21.6,<2.0.0
 fast-langdetect>=0.2.3,<0.3.0
@@ -17,10 +17,9 @@ paddlepaddle==3.0.0b1
 struct-eqtable==0.3.2
 einops
 accelerate
-doclayout_yolo==0.0.2
 rapidocr-paddle
 rapidocr-onnxruntime
-rapid_table==0.3.0
-doclayout-yolo==0.0.2
+rapid-table>=1.0.3,<2.0.0
+doclayout-yolo==0.0.2b1
 openai
 detectron2
7 changes: 3 additions & 4 deletions docker/china/requirements.txt
@@ -1,7 +1,7 @@
 boto3>=1.28.43
 Brotli>=1.1.0
 click>=8.1.7
-PyMuPDF>=1.24.9
+PyMuPDF>=1.24.9,<=1.24.14
 loguru>=0.6.0
 numpy>=1.21.6,<2.0.0
 fast-langdetect>=0.2.3,<0.3.0
@@ -16,10 +16,9 @@ paddleocr==2.7.3
 struct-eqtable==0.3.2
 einops
 accelerate
-doclayout_yolo==0.0.2
 rapidocr-paddle
 rapidocr-onnxruntime
-rapid_table==0.3.0
-doclayout-yolo==0.0.2
+rapid-table>=1.0.3,<2.0.0
+doclayout-yolo==0.0.2b1
 openai
 detectron2
7 changes: 3 additions & 4 deletions docker/global/requirements.txt
@@ -1,7 +1,7 @@
 boto3>=1.28.43
 Brotli>=1.1.0
 click>=8.1.7
-PyMuPDF>=1.24.9
+PyMuPDF>=1.24.9,<=1.24.14
 loguru>=0.6.0
 numpy>=1.21.6,<2.0.0
 fast-langdetect>=0.2.3,<0.3.0
@@ -16,10 +16,9 @@ paddleocr==2.7.3
 struct-eqtable==0.3.2
 einops
 accelerate
-doclayout_yolo==0.0.2
 rapidocr-paddle
 rapidocr-onnxruntime
-rapid_table==0.3.0
-doclayout-yolo==0.0.2
+rapid-table>=1.0.3,<2.0.0
+doclayout-yolo==0.0.2b1
 openai
 detectron2
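
The three requirements files now bound PyMuPDF on both sides (`>=1.24.9,<=1.24.14`). As a rough illustration of what that range admits, here is a minimal sketch that compares plain dotted versions as integer tuples. The helper names are hypothetical, and this is a simplification: real installers apply the full PEP 440 rules, which also cover pre-releases and other suffixes.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a plain dotted release such as '1.24.9' into (1, 24, 9)."""
    return tuple(int(part) for part in version.split("."))


def in_pinned_range(version: str, lower: str = "1.24.9", upper: str = "1.24.14") -> bool:
    """Check lower <= version <= upper, mirroring `>=1.24.9,<=1.24.14`."""
    return parse_version(lower) <= parse_version(version) <= parse_version(upper)


for candidate in ("1.24.8", "1.24.10", "1.24.14", "1.25.0"):
    print(candidate, in_pinned_range(candidate))
```

Comparing integer tuples rather than raw strings matters here: as strings, "1.24.10" sorts before "1.24.9".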
3 changes: 2 additions & 1 deletion magic-pdf.template.json
@@ -16,6 +16,7 @@
     },
     "table-config": {
         "model": "rapid_table",
+        "sub_model": "slanet_plus",
         "enable": true,
         "max_time": 400
     },
@@ -39,5 +40,5 @@
             "enable": false
         }
     },
-    "config_version": "1.1.0"
+    "config_version": "1.1.1"
 }
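
To show how the new `sub_model` key sits next to the existing table settings, here is a minimal sketch that parses a fragment shaped like the template and reads the keys back. The `read_table_settings` helper is hypothetical and is not how `magic_pdf` loads its configuration; the diff does not show that code.

```python
import json

# Fragment shaped like the template above; only keys visible in the diff are used.
CONFIG_TEXT = """
{
    "table-config": {
        "model": "rapid_table",
        "sub_model": "slanet_plus",
        "enable": true,
        "max_time": 400
    },
    "config_version": "1.1.1"
}
"""


def read_table_settings(config: dict) -> dict:
    """Collect the table-related settings, tolerating configs without sub_model."""
    table = config.get("table-config", {})
    return {
        "model": table.get("model"),
        "sub_model": table.get("sub_model"),  # None when the key is absent
        "enabled": table.get("enable", False),
        "config_version": config.get("config_version"),
    }


settings = read_table_settings(json.loads(CONFIG_TEXT))
print(settings)
```

A config file written before this change has no `sub_model` key, so the sketch returns `None` for it rather than raising.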
7 changes: 5 additions & 2 deletions magic_pdf/libs/boxbase.py
@@ -185,10 +185,13 @@ def calculate_iou(bbox1, bbox2):
     bbox1_area = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
     bbox2_area = (bbox2[2] - bbox2[0]) * (bbox2[3] - bbox2[1])

+    if any([bbox1_area == 0, bbox2_area == 0]):
+        return 0
+
     # Compute the intersection over union by taking the intersection area
     # and dividing it by the sum of both areas minus the intersection area
-    iou = intersection_area / float(bbox1_area + bbox2_area -
-                                    intersection_area)
+    iou = intersection_area / float(bbox1_area + bbox2_area - intersection_area)
+
     return iou


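
The zero-area guard can be seen in isolation with a self-contained IoU function. This is a minimal sketch, not the project's `calculate_iou`: the part of that function before the hunk is not shown, so the intersection computation here is an assumption based on the usual `(x0, y0, x1, y1)` box layout.

```python
def iou(bbox1, bbox2):
    """IoU of two (x0, y0, x1, y1) boxes; 0.0 for disjoint or degenerate boxes."""
    # Corners of the candidate intersection rectangle (assumed layout).
    x_left = max(bbox1[0], bbox2[0])
    y_top = max(bbox1[1], bbox2[1])
    x_right = min(bbox1[2], bbox2[2])
    y_bottom = min(bbox1[3], bbox2[3])
    if x_right < x_left or y_bottom < y_top:
        return 0.0

    intersection = (x_right - x_left) * (y_bottom - y_top)
    area1 = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
    area2 = (bbox2[2] - bbox2[0]) * (bbox2[3] - bbox2[1])

    # Same guard as in the diff: a zero-area box cannot overlap anything.
    if area1 == 0 or area2 == 0:
        return 0.0

    return intersection / float(area1 + area2 - intersection)


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlapping boxes
print(iou((1, 1, 1, 1), (1, 1, 1, 1)))  # two coinciding zero-area boxes
```

Without the guard, the second call in this sketch would divide zero by zero and raise `ZeroDivisionError`, because both areas and the intersection are all zero.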
16 changes: 14 additions & 2 deletions magic_pdf/libs/draw_bbox.py
@@ -362,12 +362,24 @@ def draw_line_sort_bbox(pdf_info, pdf_bytes, out_path, filename):
     for page in pdf_info:
         page_line_list = []
         for block in page['preproc_blocks']:
-            if block['type'] in [BlockType.Text, BlockType.Title, BlockType.InterlineEquation]:
+            if block['type'] in [BlockType.Text]:
                 for line in block['lines']:
                     bbox = line['bbox']
                     index = line['index']
                     page_line_list.append({'index': index, 'bbox': bbox})
-            if block['type'] in [BlockType.Image, BlockType.Table]:
+            elif block['type'] in [BlockType.Title, BlockType.InterlineEquation]:
+                if 'virtual_lines' in block:
+                    if len(block['virtual_lines']) > 0 and block['virtual_lines'][0].get('index', None) is not None:
+                        for line in block['virtual_lines']:
+                            bbox = line['bbox']
+                            index = line['index']
+                            page_line_list.append({'index': index, 'bbox': bbox})
+                else:
+                    for line in block['lines']:
+                        bbox = line['bbox']
+                        index = line['index']
+                        page_line_list.append({'index': index, 'bbox': bbox})
+            elif block['type'] in [BlockType.Image, BlockType.Table]:
                 for sub_block in block['blocks']:
                     if sub_block['type'] in [BlockType.ImageBody, BlockType.TableBody]:
                         if len(sub_block['virtual_lines']) > 0 and sub_block['virtual_lines'][0].get('index', None) is not None:
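
The new branch for title and interline-equation blocks prefers `virtual_lines` and falls back to `lines`. The selection logic can be sketched on plain dictionaries as below. The `collect_sortable_lines` helper is hypothetical, and the sketch assumes the `else` branch pairs with the `'virtual_lines' in block` check, so a block that has the key but no indexed virtual lines contributes nothing.

```python
def collect_sortable_lines(block: dict) -> list:
    """Return [{'index', 'bbox'}] entries from virtual_lines if present, else from lines."""
    if "virtual_lines" in block:
        virtual = block["virtual_lines"]
        # Use virtual lines only when the first one already carries a sort index.
        if len(virtual) > 0 and virtual[0].get("index") is not None:
            source = virtual
        else:
            source = []
    else:
        source = block["lines"]
    return [{"index": line["index"], "bbox": line["bbox"]} for line in source]


title_with_virtual = {
    "lines": [{"index": 9, "bbox": [0, 0, 100, 40]}],
    "virtual_lines": [
        {"index": 3, "bbox": [0, 0, 100, 20]},
        {"index": 4, "bbox": [0, 20, 100, 40]},
    ],
}
title_plain = {"lines": [{"index": 5, "bbox": [0, 50, 100, 70]}]}

print(collect_sortable_lines(title_with_virtual))
print(collect_sortable_lines(title_plain))
```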
9 changes: 9 additions & 0 deletions magic_pdf/libs/language.py
@@ -12,12 +12,20 @@
 from fast_langdetect import detect_language


+def remove_invalid_surrogates(text):
+    # Remove invalid UTF-16 surrogate pairs
+    return ''.join(c for c in text if not (0xD800 <= ord(c) <= 0xDFFF))
+
+
 def detect_lang(text: str) -> str:

     if len(text) == 0:
         return ""

     text = text.replace("\n", "")
+    text = remove_invalid_surrogates(text)
+
+    # print(text)
     try:
         lang_upper = detect_language(text)
     except:
@@ -37,3 +45,4 @@ def detect_lang(text: str) -> str:
     print(detect_lang("<html>This is a test</html>"))
     print(detect_lang("这个是中文测试。"))
     print(detect_lang("<html>这个是中文测试。</html>"))
+    print(detect_lang("〖\ud835\udc46\ud835〗这是个包含utf-16的中文测试"))
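
The effect of `remove_invalid_surrogates` can be checked without `fast_langdetect`. The sketch below copies the one-line filter from the diff and applies it to a string containing lone surrogate code points. The sample strings are illustrative assumptions, not taken from the project's tests.

```python
def remove_invalid_surrogates(text: str) -> str:
    """Drop every code point in the UTF-16 surrogate range U+D800..U+DFFF."""
    return "".join(c for c in text if not (0xD800 <= ord(c) <= 0xDFFF))


# In a Python str, each \udXXX escape is a separate lone surrogate code point.
sample = "[\ud835\udc46\ud835] mixed text"
cleaned = remove_invalid_surrogates(sample)
print(cleaned)

# Lone surrogates cannot be encoded as UTF-8; the cleaned string can.
try:
    sample.encode("utf-8")
    encodable_before = True
except UnicodeEncodeError:
    encodable_before = False
print(encodable_before, len(cleaned.encode("utf-8")))

# A properly decoded astral character is a single code point and is kept.
print(remove_invalid_surrogates("\U0001D446") == "\U0001D446")
```

The filter only targets surrogate code points, so text that was decoded correctly, including characters outside the Basic Multilingual Plane, passes through unchanged.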