Releases · opendatalab/MinerU

23 Jan 10:01

myhloli

magic_pdf-1.1.0-released

19f72c2

magic_pdf-1.1.0-released Latest


What's Changed

In this version we have focused on improving parsing accuracy and efficiency:

Model capability upgrade (requires re-executing the model download process to obtain incremental updates of model files)
- The layout recognition model has been upgraded to the latest doclayout_yolo(2501) model, improving layout recognition accuracy.
- The formula parsing model has been upgraded to the latest unimernet(2501) model, improving formula recognition accuracy.
Performance optimization
- On devices with 16GB+ VRAM, overall parsing speed has increased by more than 50%, achieved by optimizing resource usage and restructuring the processing pipeline.

New Contributors

@moria97 made their first contribution in #1578

Full Changelog: magic_pdf-1.0.1-released...magic_pdf-1.1.0-released

Contributors

moria97


10 Jan 10:50

myhloli

magic_pdf-1.0.1-released

24b77bb

magic_pdf-1.0.1-released

What's Changed

New API Interface
- For the data-side API, we have introduced the Dataset class, designed to provide a robust and flexible data processing framework. This framework currently supports a variety of document formats, including images (.jpg and .png), PDFs, Word documents (.doc and .docx), and PowerPoint presentations (.ppt and .pptx). It ensures effective support for data processing tasks ranging from simple to complex.
- For the user-side API, we have meticulously designed the MinerU processing workflow as a series of composable Stages. Each Stage represents a specific processing step, allowing users to define new Stages according to their needs and creatively combine these stages to customize their data processing workflows.
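The composable-Stage idea described above can be illustrated with a minimal sketch. The names `Stage`, `StripStage`, `DropEmptyStage`, and `Pipeline` are hypothetical and are not MinerU's actual classes; the sketch only shows how independent processing steps might be defined and chained.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class Stage(ABC):
    """Hypothetical base class for one processing step (not MinerU's real API)."""

    @abstractmethod
    def run(self, data: Any) -> Any:
        ...


class StripStage(Stage):
    """Trim surrounding whitespace from every line."""

    def run(self, data: List[str]) -> List[str]:
        return [line.strip() for line in data]


class DropEmptyStage(Stage):
    """Remove lines that are empty."""

    def run(self, data: List[str]) -> List[str]:
        return [line for line in data if line]


class Pipeline:
    """Run stages in order, feeding each stage's output to the next."""

    def __init__(self, stages: List[Stage]) -> None:
        self.stages = stages

    def run(self, data: Any) -> Any:
        for stage in self.stages:
            data = stage.run(data)
        return data


pipeline = Pipeline([StripStage(), DropEmptyStage()])
result = pipeline.run(["  Title ", "   ", "Body text"])
print(result)  # ['Title', 'Body text']
```

Because every step shares the same `run` interface, a user-defined step can be added to the list, or the order changed, without touching the other steps.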
Enhanced Compatibility
- By optimizing the dependency environment and configuration items, we ensure stable and efficient operation on ARM architecture Linux systems.
- We have integrated deeply with Huawei Ascend NPU acceleration, providing independently controllable high-performance computing capabilities. This supports the localization and development of AI application platforms in China. See: Ascend NPU Acceleration.
Automatic Language Identification
- A new language recognition model has been introduced. Setting the lang configuration to auto during document parsing now automatically selects the appropriate OCR language model, improving the accuracy of scanned document parsing.
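The `auto` behaviour can be pictured as a small dispatch step: an explicit language is used as given, and only `auto` triggers detection. The sketch below rests on assumptions throughout. `detect_language` is a toy character-range heuristic standing in for the real language recognition model, `resolve_ocr_lang` is a hypothetical helper, and the labels `"ch"` and `"en"` are illustrative.

```python
def detect_language(text: str) -> str:
    """Toy stand-in for a language-identification model (assumption)."""
    # Count characters in the CJK Unified Ideographs block.
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return "ch" if cjk > len(text) / 2 else "en"


def resolve_ocr_lang(lang: str, sample_text: str) -> str:
    """Return the OCR language to use; only 'auto' triggers detection."""
    if lang == "auto":
        return detect_language(sample_text)
    return lang


print(resolve_ocr_lang("auto", "文档解析"))  # ch
print(resolve_ocr_lang("fr", "文档解析"))    # fr
```

Keeping the explicit setting authoritative means users who already know the document language skip the detection step.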
Other Changes
- Supported MPS acceleration on Apple silicon chips for certain supported tasks (such as layout detection and formula detection).
- Converted the OCR model to ONNX format to improve OCR performance on ARM CPUs.

New Contributors

@IMSUVEN made their first contribution in #1281
@pangguosheng1106 made their first contribution in #1325
@beholder91 made their first contribution in #1479

Full Changelog: magic_pdf-0.10.6-released...magic_pdf-1.0.1-released

Contributors

pangguosheng1106, IMSUVEN, and beholder91


11 Dec 10:58

myhloli

magic_pdf-0.10.6-released

613074b

magic_pdf-0.10.6-released

What's Changed

perf(model): optimize model initialization by @myhloli in #1198
fix: update notify by @dt-yy in #1201
fix(model): simplify model initialization logic by @myhloli in #1207
feat: update test case by @dt-yy in #1209
build(deps): specify minimum version for ultralytics by @myhloli in #1212
Refactor/add user api by @icecraft in #1178
fix(dict2md): add space for inline equations in CJK contexts by @myhloli in #1222
fix: 1. ocr txt mode error 2. lose pdf_parse_type field by @icecraft in #1224
fix: add parse_pdf_type and version by @icecraft in #1228
fix: unicode decode error by @icecraft in #1231
fix(detect_invalid_chars):fix the stack error caused by multiple memory releases in PyMuPDF by @myhloli in #1252
fix: dup classify pdf type by @icecraft in #1258
feat(layout): improve layout detection for DocLayout_YOLO model by @myhloli in #1259
refactor(draw_bbox): remove redundant '_line_sort' suffix from output filename by @myhloli in #1263
build(docker): add torch and torchvision dependencies by @myhloli in #1264

Full Changelog: magic_pdf-0.10.5-released...magic_pdf-0.10.6-released

Contributors

myhloli, icecraft, and dt-yy


02 Dec 06:16

myhloli

magic_pdf-0.10.5-released

c175001

magic_pdf-0.10.5-released

What's Changed

fix: fix filename error by @LollipopsAndWine in #1154
refactor(para): adjust line height multiplier for block splitting by @myhloli in #1156
fix(pre_proc): prevent errors when imageWriter is None by @myhloli in #1166

Full Changelog: magic_pdf-0.10.4-released...magic_pdf-0.10.5-released

Contributors

myhloli and LollipopsAndWine


29 Nov 18:50

myhloli

magic_pdf-0.10.4-released

b03a7fa

magic_pdf-0.10.4-released

What's Changed

fix(mkcontent): optimize paragraph text merging and language detection by @myhloli in #1152

Full Changelog: magic_pdf-0.10.3-released...magic_pdf-0.10.4-released

Contributors

myhloli


29 Nov 08:05

myhloli

magic_pdf-0.10.3-released

b3fbedf

magic_pdf-0.10.3-released

What's Changed

fix(Hybrid OCR):Enable Hybrid OCR for Empty Spans That Contain a Certain Number of Placeholders but No Actual Text by @myhloli in #1132
refactor(para): improve language detection and block splitting by @myhloli in #1134
feat(pdf_parse): filter out skewed text lines by @myhloli in #1135
refactor(ocr): improve text processing and span handling by @myhloli in #1136
refactor(pdf_check): improve character detection using PyMuPDF by @myhloli in #1137
feat(pdf_parse): add line start flag detection and optimize line stop flag logic by @myhloli in #1138
fix(ocr_mkcontent): handle empty paragraphs on pages by @myhloli in #1139
refactor(pdf_parse): adjust character-axis alignment algorithm by @myhloli in #1140
refactor(ocr): Fix the error of paddleocr failing to initialize in a multi-threaded environment by @myhloli in #1141

Full Changelog: magic_pdf-0.10.2-released...magic_pdf-0.10.3-released

Contributors

myhloli


27 Nov 10:33

myhloli

magic_pdf-0.10.2-released

8afff9a

magic_pdf-0.10.2-released

What's Changed

fix(pdf_parse): Move the logic for filling text content into spans before the discarded_block recognition to fix the issue of empty text blocks in discarded_block. by @myhloli in #1082
refactor(txt_spans_extract_v2): optimize span processing and OCR logic by @myhloli in #1086
feat(ocr): filter out low confidence ocr results by @myhloli in #1088
feat(pdf_parse): add OCR score to span data by @myhloli in #1089
fix: test_rag by @icecraft in #1105
perf(image_processing): reduce maximum image size for analysis by @myhloli in #1106
fix: test_tools unittest by @icecraft in #1104
refactor(libs): remove unused imports and functions by @myhloli in #1112
Feat/add s3 read write example by @icecraft in #1117

Full Changelog: magic_pdf-0.10.1-released...magic_pdf-0.10.2-released

Contributors

myhloli and icecraft


25 Nov 03:41

myhloli

magic_pdf-0.10.1-released

4dcf31b

magic_pdf-0.10.1-released

What's Changed

Fix/demo by @icecraft in #1071
feat(demo): add visualization bbox parameter and refactor parsing process by @myhloli in #1074
demo: batch process demo PDFs by @myhloli in #1075

Full Changelog: magic_pdf-0.10.0-released...magic_pdf-0.10.1-released

Contributors

myhloli and icecraft


22 Nov 09:54

myhloli

magic_pdf-0.10.0-released

158e556

magic_pdf-0.10.0-released

What's Changed

fix: fix issue #715 by @LollipopsAndWine in #971
docs(README): update GPU hardware recommendations and table recognition options by @myhloli in #973
docs: improve GPU support list formatting in README_zh-CN.md by @myhloli in #974
docs: update feature description for table conversion by @myhloli in #975
docs: update readme by @myhloli in #977
update ci by @dt-yy in #986
test(unitest): Restore unit test cases by @myhloli in #998
refactor(tests): extract common test utilities into test_commons.py by @myhloli in #1001
feat(ocr): improve handling of angled text boxes by @myhloli in #1010
refactor(para): improve paragraph splitting logic by @myhloli in #1013
build(setup): add old_linux specific dependencies by @myhloli in #1016
refactor(para): adjust right margin threshold based on block width by @myhloli in #1018
fix: using new data api replace old rw api by @icecraft in #1006
delete unused pipeline file by @liugongjian in #1024
refactor: move some constants or enums defs to config folder by @icecraft in #1027
fix: remove test code by @icecraft in #1036
fix(tools): handle empty language string in common.py by @myhloli in #1045
refactor(ocr_dict_merge): add threshold parameter for line merging by @myhloli in #1046
fix(ocr_mkcontent): improve hyphen handling at line ends by @myhloli in #1047
fix(remove_overlaps_min_spans): optimize overlap detection in OCR span list modification by @myhloli in #1048
feat(ocr): improve text detection and OCR accuracy by @myhloli in #1049
refactor(txt_parse): improve text extraction accuracy with new algorithm by @myhloli in #1050
fix: use concrete class instead of abstract class by @icecraft in #1052
fix(pdf_parse): improve line stop flag detection accuracy by @myhloli in #1053
test: comment out assertions for metascan classify and meta scan tests by @myhloli in #1054
Add test cases to json compressor util by @liugongjian in #1056
refactor(para): improve line stop flag and remove unused debug mode by @myhloli in #1058
fix(table): add null check for OCR result in rapid table prediction by @myhloli in #1060
refactor(model): move page total time logging to custom model analysis by @myhloli in #1061
fix(table): add null check for OCR result in rapid table prediction by @myhloli in #1062
fix(pdf_parse): improve OCR result handling by @myhloli in #1064

New Contributors

@liugongjian made their first contribution in #1024

Full Changelog: magic_pdf-0.9.3-released...magic_pdf-0.10.0-released

Contributors

liugongjian, myhloli, and 3 other contributors


15 Nov 11:27

myhloli

magic_pdf-0.9.3-released

845a3ff

magic_pdf-0.9.3-released

What's Changed

feat(model): add xycut algorithm for block sorting by @myhloli in #898
refactor(pdf_parse): adjust line count threshold for layoutreader by @myhloli in #902
Feat/add en docs by @icecraft in #906
feat: using next_docs by @icecraft in #907
feat(table): integrate RapidTable model for table recognition by @myhloli in #910
fix(gradio-app): add missing file type in upload by @myhloli in #911
refactor(magic_pdf_parse_main): optimize model data handling and JSON output by @myhloli in #912
Modify the test directory by @DTwz in #913
test(table): improve ppTableModel test coverage by @myhloli in #914
feat(table): add RapidOCR support for RapidTable model by @myhloli in #915
Add DocLayout-YOLO hyperlink by @qiangqiang199 in #889
fix: remove classes hierarchy diagram by @icecraft in #919
refactor(model download script) by @myhloli in #922
docs(readme): update table recognition configuration and documentation by @myhloli in #924
docs(README_ja-JP.md): update warning message and remove outdated content by @myhloli in #925
Update para_split_v3.py by @hyastar in #923
Style/docs by @icecraft in #927
docs: rewrite zh_cn docs without translate by @icecraft in #928
fix: typo by @icecraft in #931
fix: fix the download_models.py script path in the Dockerfile by @kimi360 in #938
build(Dockerfile): update model download script and dependencies by @myhloli in #941
fix(ocr_mkcontent): improve handling of single-character content #937 by @myhloli in #943
feat: tune docs by @icecraft in #948
fix(parse_pipeline): Resolve post-processing exceptions caused by partial PDFs due to file corruption or non-standard format by forcing a re-print. by @myhloli in #957
refactor(model): rename and restructure model modules by @myhloli in #964
docs: update docs for 0.9.3 by @myhloli in #965
docs(README): update project references and translations by @myhloli in #967
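The list above mentions an xycut algorithm for block sorting (#898). As background, here is a simplified sketch of the general recursive XY-cut technique for ordering layout blocks. It is not MinerU's implementation; the `Box` layout and the function names `split_by_gaps` and `xy_cut` are assumptions for illustration. The idea is to look for empty gaps in the projection of the boxes onto one axis, split there, and recurse. This variant tries column cuts before row cuts and can misorder pages where no clean gap exists.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward


def split_by_gaps(boxes: List[Box], axis: int) -> List[List[Box]]:
    """Group boxes separated by empty gaps in their projection onto one axis.

    axis=0 projects onto x (yields columns), axis=1 onto y (yields rows).
    """
    ordered = sorted(boxes, key=lambda b: b[axis])
    groups: List[List[Box]] = [[ordered[0]]]
    reach = ordered[0][axis + 2]  # far edge covered so far
    for box in ordered[1:]:
        if box[axis] > reach:  # empty gap found: start a new group
            groups.append([box])
        else:
            groups[-1].append(box)
        reach = max(reach, box[axis + 2])
    return groups


def xy_cut(boxes: List[Box]) -> List[Box]:
    """Return boxes in reading order via a simplified recursive XY-cut."""
    if len(boxes) <= 1:
        return list(boxes)
    for axis in (0, 1):  # try column cuts first, then row cuts
        groups = split_by_gaps(boxes, axis)
        if len(groups) > 1:
            return [b for g in groups for b in xy_cut(g)]
    # No clean cut on either axis: fall back to top-to-bottom, left-to-right.
    return sorted(boxes, key=lambda b: (b[1], b[0]))


# A full-width title above two columns, given in shuffled order.
title = (0, 0, 100, 10)
left_1, left_2 = (0, 20, 45, 50), (0, 55, 45, 90)
right_1, right_2 = (55, 20, 100, 60), (55, 65, 100, 90)
page = [right_2, left_1, title, right_1, left_2]
reading_order = xy_cut(page)
print(reading_order == [title, left_1, left_2, right_1, right_2])  # True
```

In the example, the title blocks any column cut at the top level, so the page is first cut into a title row and a body row; the body is then cut into two columns, and each column into its paragraphs.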

New Contributors

@DTwz made their first contribution in #913
@qiangqiang199 made their first contribution in #889
@hyastar made their first contribution in #923
@kimi360 made their first contribution in #938

Full Changelog: magic_pdf-0.9.2-released...magic_pdf-0.9.3-released

Contributors

kimi360, myhloli, and 4 other contributors
