Releases: opendatalab/MinerU

magic_pdf-1.1.0-released

23 Jan 10:01
19f72c2

What's Changed

This release focuses on improving parsing accuracy and efficiency:

  • Model capability upgrade (re-run the model download process to fetch the incremental model file updates)
    • The layout recognition model has been upgraded to the latest doclayout_yolo(2501) model, improving layout recognition accuracy.
    • The formula parsing model has been upgraded to the latest unimernet(2501) model, improving formula recognition accuracy.
  • Performance optimization
    • On devices that meet the configuration requirements (16 GB or more of VRAM), optimized resource usage and a restructured processing pipeline increase overall parsing speed by more than 50%.
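
As a rough reading of the speed figure above: a speed increase of more than 50% means the same document takes less than two thirds of its previous wall-clock time. The sketch below works through that arithmetic. The function name and the 90-second baseline are invented for illustration and are not measured MinerU results.

```python
def time_after_speedup(baseline_seconds: float, speedup_fraction: float) -> float:
    """Return the wall-clock time after throughput rises by speedup_fraction.

    A 50% speed increase (speedup_fraction=0.5) means 1.5x the throughput,
    so the same work takes baseline / 1.5 of the original time.
    """
    if speedup_fraction <= -1.0:
        raise ValueError("speedup_fraction must be greater than -1")
    return baseline_seconds / (1.0 + speedup_fraction)


# Hypothetical baseline: a document that took 90 s to parse before the upgrade.
new_time = time_after_speedup(90.0, 0.5)
print(f"{new_time:.1f} s")  # 60.0 s
```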

Full Changelog: magic_pdf-1.0.1-released...magic_pdf-1.1.0-released

magic_pdf-1.0.1-released

10 Jan 10:50
24b77bb

What's Changed

  • New API Interface
    • Data-side API: the new Dataset class provides a flexible data processing framework. It currently supports images (.jpg and .png), PDFs, Word documents (.doc and .docx), and PowerPoint presentations (.ppt and .pptx), and covers data processing tasks ranging from simple to complex.
    • User-side API: the MinerU processing workflow is now organized as a series of composable Stages. Each Stage represents one processing step, so users can define new Stages and combine them to customize their data processing workflows.
  • Enhanced Compatibility
    • Optimized the dependency environment and configuration items so that MinerU runs stably and efficiently on ARM architecture Linux systems.
    • Deeply integrated Huawei Ascend NPU acceleration, providing autonomous and controllable high-performance computing capabilities and supporting the localization and development of AI application platforms in China. See: Ascend NPU Acceleration.
  • Automatic Language Identification
    • A new language identification model lets you set the lang configuration to auto during document parsing. The appropriate OCR language model is then selected automatically, improving the accuracy of scanned document parsing.
  • Other Changes
    • Added MPS acceleration on Apple silicon chips for supported tasks (such as layout detection and formula detection).
    • Converted the OCR model to ONNX format to improve OCR performance on ARM CPUs.
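
The composable Stage idea from the user-side API item above can be pictured with a small, self-contained sketch. This is not MinerU's actual API: the `compose` helper, the two stage functions, and the dict-based document are all hypothetical stand-ins chosen to show how independent steps chain into one workflow.

```python
from typing import Callable, Dict

# A document is modeled as a plain dict here; this is an assumption for the
# sketch and not how the real Dataset class represents documents.
Document = Dict[str, object]
Stage = Callable[[Document], Document]


def compose(*stages: Stage) -> Stage:
    """Chain stages left to right into a single callable."""
    def run(doc: Document) -> Document:
        for stage in stages:
            doc = stage(doc)
        return doc
    return run


def detect_layout(doc: Document) -> Document:
    # Placeholder stage: pretend every page holds exactly one text block.
    return {**doc, "blocks": [f"page-{i}-text" for i in range(doc["pages"])]}


def to_markdown(doc: Document) -> Document:
    # Placeholder stage: join the blocks into a Markdown string.
    return {**doc, "markdown": "\n\n".join(doc["blocks"])}


pipeline = compose(detect_layout, to_markdown)
result = pipeline({"pages": 2})
print(result["markdown"])
```

Because each stage only takes and returns a document, a user-defined stage can be added to `compose(...)` without touching the existing ones.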

Full Changelog: magic_pdf-0.10.6-released...magic_pdf-1.0.1-released

magic_pdf-0.10.6-released

11 Dec 10:58
613074b

What's Changed

  • perf(model): optimize model initialization by @myhloli in #1198
  • fix: update notify by @dt-yy in #1201
  • fix(model): simplify model initialization logic by @myhloli in #1207
  • feat: update test case by @dt-yy in #1209
  • build(deps): specify minimum version for ultralytics by @myhloli in #1212
  • Refactor/add user api by @icecraft in #1178
  • fix(dict2md): add space for inline equations in CJK contexts by @myhloli in #1222
  • fix: 1. ocr txt mode error 2. lose pdf_parse_type field by @icecraft in #1224
  • fix: add parse_pdf_type and version by @icecraft in #1228
  • fix: unicode decode error by @icecraft in #1231
  • fix(detect_invalid_chars):fix the stack error caused by multiple memory releases in PyMuPDF by @myhloli in #1252
  • fix: dup classify pdf type by @icecraft in #1258
  • feat(layout): improve layout detection for DocLayout_YOLO model by @myhloli in #1259
  • refactor(draw_bbox): remove redundant '_line_sort' suffix from output filename by @myhloli in #1263
  • build(docker): add torch and torchvision dependencies by @myhloli in #1264
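
The dict2md fix in #1222 concerns spacing around inline equations in CJK text. A minimal sketch of that kind of rule follows. It is not the code from the PR: the function name is hypothetical, only the basic CJK Unified Ideographs range is checked, and equations are assumed to be delimited by single `$` characters.

```python
import re

# Basic CJK Unified Ideographs only; a fuller check would cover more blocks.
CJK = r"\u4e00-\u9fff"


def pad_inline_equations(text: str) -> str:
    """Insert a space between an inline $...$ equation and adjacent CJK characters."""
    # CJK character directly before an equation.
    text = re.sub(rf"([{CJK}])(\$[^$\n]+\$)", r"\1 \2", text)
    # CJK character directly after an equation.
    text = re.sub(rf"(\$[^$\n]+\$)([{CJK}])", r"\1 \2", text)
    return text


print(pad_inline_equations("能量$E=mc^2$公式"))  # 能量 $E=mc^2$ 公式
```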

Full Changelog: magic_pdf-0.10.5-released...magic_pdf-0.10.6-released

magic_pdf-0.10.5-released

02 Dec 06:16
c175001

Full Changelog: magic_pdf-0.10.4-released...magic_pdf-0.10.5-released

magic_pdf-0.10.4-released

29 Nov 18:50
b03a7fa

What's Changed

  • fix(mkcontent): optimize paragraph text merging and language detection by @myhloli in #1152

Full Changelog: magic_pdf-0.10.3-released...magic_pdf-0.10.4-released

magic_pdf-0.10.3-released

29 Nov 08:05
b3fbedf

What's Changed

  • fix(Hybrid OCR):Enable Hybrid OCR for Empty Spans That Contain a Certain Number of Placeholders but No Actual Text by @myhloli in #1132
  • refactor(para): improve language detection and block splitting by @myhloli in #1134
  • feat(pdf_parse): filter out skewed text lines by @myhloli in #1135
  • refactor(ocr): improve text processing and span handling by @myhloli in #1136
  • refactor(pdf_check): improve character detection using PyMuPDF by @myhloli in #1137
  • feat(pdf_parse): add line start flag detection and optimize line stop flag logic by @myhloli in #1138
  • fix(ocr_mkcontent): handle empty paragraphs on pages by @myhloli in #1139
  • refactor(pdf_parse): adjust character-axis alignment algorithm by @myhloli in #1140
  • refactor(ocr): Fix the error of paddleocr failing to initialize in a multi-threaded environment by @myhloli in #1141

Full Changelog: magic_pdf-0.10.2-released...magic_pdf-0.10.3-released

magic_pdf-0.10.2-released

27 Nov 10:33
8afff9a

What's Changed

  • fix(pdf_parse): Move the logic for filling text content into spans before the discarded_block recognition to fix the issue of empty text blocks in discarded_block. by @myhloli in #1082
  • refactor(txt_spans_extract_v2): optimize span processing and OCR logic by @myhloli in #1086
  • feat(ocr): filter out low confidence ocr results by @myhloli in #1088
  • feat(pdf_parse): add OCR score to span data by @myhloli in #1089
  • fix: test_rag by @icecraft in #1105
  • perf(image_processing): reduce maximum image size for analysis by @myhloli in #1106
  • fix: test_tools unittest by @icecraft in #1104
  • refactor(libs): remove unused imports and functions by @myhloli in #1112
  • Feat/add s3 read write example by @icecraft in #1117
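
Two entries above (#1088 and #1089) attach an OCR score to each span and drop low-confidence results. The sketch below shows what such a filter could look like. The `score` and `content` field names, the function name, and the 0.5 threshold are assumptions for illustration, not values taken from the MinerU source.

```python
from typing import Dict, List


def filter_low_confidence(spans: List[Dict], min_score: float = 0.5) -> List[Dict]:
    """Keep only spans whose OCR score reaches min_score.

    Spans without a score are treated as score 0.0 and dropped.
    """
    return [s for s in spans if s.get("score", 0.0) >= min_score]


# Hypothetical OCR output: one clean span and one noisy one.
spans = [
    {"content": "Introduction", "score": 0.98},
    {"content": "l|1I", "score": 0.21},
]
kept = filter_low_confidence(spans)
print([s["content"] for s in kept])  # ['Introduction']
```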

Full Changelog: magic_pdf-0.10.1-released...magic_pdf-0.10.2-released

magic_pdf-0.10.1-released

25 Nov 03:41
4dcf31b

Full Changelog: magic_pdf-0.10.0-released...magic_pdf-0.10.1-released

magic_pdf-0.10.0-released

22 Nov 09:54
158e556

What's Changed

  • fix: fix issue #715 by @LollipopsAndWine in #971
  • docs(README): update GPU hardware recommendations and table recognition options by @myhloli in #973
  • docs: improve GPU support list formatting in README_zh-CN.md by @myhloli in #974
  • docs: update feature description for table conversion by @myhloli in #975
  • docs: update readme by @myhloli in #977
  • update ci by @dt-yy in #986
  • test(unitest): Restore unit test cases by @myhloli in #998
  • refactor(tests): extract common test utilities into test_commons.py by @myhloli in #1001
  • feat(ocr): improve handling of angled text boxes by @myhloli in #1010
  • refactor(para): improve paragraph splitting logic by @myhloli in #1013
  • build(setup): add old_linux specific dependencies by @myhloli in #1016
  • refactor(para): adjust right margin threshold based on block width by @myhloli in #1018
  • fix: using new data api replace old rw api by @icecraft in #1006
  • delete unused pipeline file by @liugongjian in #1024
  • refactor: move some constants or enums defs to config folder by @icecraft in #1027
  • fix: remove test code by @icecraft in #1036
  • fix(tools): handle empty language string in common.py by @myhloli in #1045
  • refactor(ocr_dict_merge): add threshold parameter for line merging by @myhloli in #1046
  • fix(ocr_mkcontent): improve hyphen handling at line ends by @myhloli in #1047
  • fix(remove_overlaps_min_spans): optimize overlap detection in OCR span list modification by @myhloli in #1048
  • feat(ocr): improve text detection and OCR accuracy by @myhloli in #1049
  • refactor(txt_parse): improve text extraction accuracy with new algorithm by @myhloli in #1050
  • fix: use concrete class instead of abstract class by @icecraft in #1052
  • fix(pdf_parse): improve line stop flag detection accuracy by @myhloli in #1053
  • test: comment out assertions for metascan classify and meta scan tests by @myhloli in #1054
  • Add test cases to json compressor util by @liugongjian in #1056
  • refactor(para): improve line stop flag and remove unused debug mode by @myhloli in #1058
  • fix(table): add null check for OCR result in rapid table prediction by @myhloli in #1060
  • refactor(model): move page total time logging to custom model analysis by @myhloli in #1061
  • fix(table): add null check for OCR result in rapid table prediction by @myhloli in #1062
  • fix(pdf_parse): improve OCR result handling by @myhloli in #1064

Full Changelog: magic_pdf-0.9.3-released...magic_pdf-0.10.0-released

magic_pdf-0.9.3-released

15 Nov 11:27
845a3ff

What's Changed

  • feat(model): add xycut algorithm for block sorting by @myhloli in #898
  • refactor(pdf_parse): adjust line count threshold for layoutreader by @myhloli in #902
  • Feat/add en docs by @icecraft in #906
  • feat: using next_docs by @icecraft in #907
  • feat(table): integrate RapidTable model for table recognition by @myhloli in #910
  • fix(gradio-app): add missing file type in upload by @myhloli in #911
  • refactor(magic_pdf_parse_main): optimize model data handling and JSON output by @myhloli in #912
  • Modify the test directory by @DTwz in #913
  • test(table): improve ppTableModel test coverage by @myhloli in #914
  • feat(table): add RapidOCR support for RapidTable model by @myhloli in #915
  • Add DocLayout-YOLO hyperlink by @qiangqiang199 in #889
  • fix: remove classes hierarchy diagram by @icecraft in #919
  • refactor(model download script) by @myhloli in #922
  • docs(readme): update table recognition configuration and documentation by @myhloli in #924
  • docs(README_ja-JP.md): update warning message and remove outdated content by @myhloli in #925
  • Update para_split_v3.py by @hyastar in #923
  • Style/docs by @icecraft in #927
  • docs: rewrite zh_cn docs without translate by @icecraft in #928
  • fix: typo by @icecraft in #931
  • fix: fix the download_models.py script path in the Dockerfile by @kimi360 in #938
  • build(Dockerfile): update model download script and dependencies by @myhloli in #941
  • fix(ocr_mkcontent): improve handling of single-character content #937 by @myhloli in #943
  • feat: tune docs by @icecraft in #948
  • fix(parse_pipeline): Resolve post-processing exceptions caused by partial PDFs due to file corruption or non-standard format by forcing a re-print. by @myhloli in #957
  • refactor(model): rename and restructure model modules by @myhloli in #964
  • docs:update docs for 0.9.3 by @myhloli in #965
  • docs(README): update project references and translations by @myhloli in #967
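
The first entry in this list (#898) adds an xycut algorithm for block sorting. For readers unfamiliar with the technique, here is a textbook-style recursive XY-cut sketch. It is not the code from the PR: the box format, the function names, and the choice to try vertical cuts before horizontal ones are all assumptions made for this example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward


def _split(boxes: List[Box], axis: int) -> List[List[Box]]:
    """Group boxes separated by empty gaps along one axis (0 = x, 1 = y)."""
    ordered = sorted(boxes, key=lambda b: b[axis])
    groups, current = [], [ordered[0]]
    reach = ordered[0][axis + 2]  # furthest extent covered so far
    for box in ordered[1:]:
        if box[axis] >= reach:  # nothing overlaps this position: cut here
            groups.append(current)
            current = [box]
        else:
            current.append(box)
        reach = max(reach, box[axis + 2])
    groups.append(current)
    return groups


def xy_cut(boxes: List[Box]) -> List[Box]:
    """Return boxes in reading order by recursively cutting along gaps."""
    if len(boxes) <= 1:
        return list(boxes)
    for axis in (0, 1):  # try splitting into columns first, then into rows
        groups = _split(boxes, axis)
        if len(groups) > 1:
            return [b for g in groups for b in xy_cut(g)]
    # No gap on either axis: fall back to top-to-bottom, left-to-right.
    return sorted(boxes, key=lambda b: (b[1], b[0]))


# Hypothetical page: a full-width title above two columns of two blocks each.
title = (0, 0, 100, 10)
left1, left2 = (0, 20, 45, 50), (0, 60, 45, 90)
right1, right2 = (55, 20, 100, 40), (55, 45, 100, 90)
order = xy_cut([right2, left2, title, right1, left1])
print(order == [title, left1, left2, right1, right2])  # True
```

The title spans both columns, so no vertical cut exists at the top level and the page is first cut into rows. The remaining blocks are then cut into a left and a right column, which yields the column-by-column reading order.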

Full Changelog: magic_pdf-0.9.2-released...magic_pdf-0.9.3-released