Releases: opendatalab/MinerU
magic_pdf-0.9.2-released
magic_pdf-0.9.1-released
What's Changed
- Feat/tune docs by @icecraft in #833
- fix(ocr_mkcontent): improve content handling for different languages and equation types by @myhloli in #839
- feat(list): improve list detection algorithm & fix(list): improve list identification accuracy by @myhloli in #843
- docs(tutorial): update magic-pdf command with output directory by @myhloli in #844
- feat(para_split_v3): improve list identification with block aspect ratio by @myhloli in #845
- fix(dict2md): improve text concatenation logic by @myhloli in #847
- Update pdf_extract_kit.py by @CiaranYoung in #853
- feat(table): upgrade StructEqTable model and integrate into PDF Extract Kit by @myhloli in #854
- feat(model): add HTML minification to StructTableModel by @myhloli in #855
- chore: add .gitattributes to configure file linguist attributes by @myhloli in #856
- fix(merge_text): add ligature replacement functionality #305 #241 by @myhloli in #857
- chore: add CSS and SCSS files to linguist-vendored; update .gitattributes to mark CSS and SCSS files as vendored by @myhloli in #858
- docs(README): update Colab demo link by @myhloli in #860
- fix(table): improve table image processing by @myhloli in #866
- docs(faq): add troubleshooting for illegal instruction error on Linux servers by @myhloli in #867
- feat: replace the mineru_demo API documentation with a link by @LollipopsAndWine in #871
- test(table): improve HTML validation for table extraction by @myhloli in #874
- docs: update arXiv paper link in README files by @myhloli in #875
- docs(README): update changelog for v0.9.1 release by @myhloli in #877
New Contributors
- @CiaranYoung made their first contribution in #853
Full Changelog: magic_pdf-0.9.0-released...magic_pdf-0.9.1-released
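The 0.9.1 notes above list a ligature replacement fix in merge_text (#857). The notes do not show how it is done, so the following is only a minimal sketch of the general technique: typographic ligature code points that PDFs often embed are mapped back to plain letters. The function name `replace_ligatures` and the mapping table are assumptions made for illustration, not code taken from MinerU.

```python
# Hypothetical mapping of Unicode ligature code points to plain letters.
# The table actually used by the project may contain different entries.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb06": "st",
}

# str.maketrans accepts a dict whose keys are single characters and
# whose values are replacement strings of any length.
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def replace_ligatures(text: str) -> str:
    """Return text with known ligature characters expanded to letters."""
    return text.translate(_LIGATURE_TABLE)


if __name__ == "__main__":
    print(replace_ligatures("e\ufb03cient \ufb01le"))
```

Expanding ligatures before text is concatenated keeps extracted words searchable, since a query for "file" would otherwise not match a string containing the single "fi" glyph.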
magic_pdf-0.9.0-released
What's Changed
- Update README_zh-CN.md (#404) by @drunkpig in #409
- feat: add dockerfile by @Lincyaw in #189
- fix(ocr_mkcontent): improve language detection and content formatting by @myhloli in #458
- fix(self_modify): merge detection boxes for optimized text region detection by @myhloli in #448
- fix(pdf-extract): adjust box threshold for OCR detection to fix OCR mode losing some lines by @myhloli in #447
- feat: rename the file generated by command line tools by @icecraft in #401
- fix(ocr_mkcontent): revise table caption output by @myhloli in #397
- build(docker): update docker build step by @myhloli in #471
- upload an introduction about chemical formula and update readme.md by @GDDGCZ518 in #489
- fix: remove the default value of output option in tools/cli.py and to… by @icecraft in #494
- feat: add test case by @dt-yy in #499
- fixes #492 decrease span threshold for block filling by @myhloli in #500
- fix(detect_all_bboxes): remove small overlapping blocks by merging by @myhloli in #501
- feat(cli&analyze&pipeline): add start_page and end_page args for pagination by @myhloli in #507
- Feat/support rag by @icecraft in #510
- feat(gradio): add app by gradio by @myhloli in #512
- fix: replace \u0002, \u0003 in common text by @drunkpig in #521
- fix(end_page_id):Fix the issue where end_page_id is corrected to len-1 when its input is 0. by @myhloli in #518
- fix(para): When an English line ends with a hyphen, do not add a space at the end. by @drunkpig in #523
- Release: Release 0.7.1 version, update dev by @dt-yy in #527
- Hotfix readme 0.7.1 by @Focusshang in #529
- fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 by @papayalove in #542
- fix: typo error in markdown by @icecraft in #536
- fix(gradio): remove unused imports and simplify pdf display by @myhloli in #534
- Feat/support footnote in figure by @icecraft in #532
- refactor(pdf_extract_kit): implement singleton pattern for atomic models by @myhloli in #533
- feat: mineru_web by @LollipopsAndWine in #555
- features@add mineru gpu&web_api by @yanqiangmiffy in #568
- docs(models_download): update model download instructions to use python script by @myhloli in #560
- fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 by @papayalove in #574
- feat(ocr): supports minority languages by @myhloli in #577
- refactor(pdf_extract_kit): update model config and weight paths for UniMERNet-0.2.0 by @myhloli in #584
- feat(gradio_app): add web app with PDF processing as a project by @myhloli in #579
- fix: web_api by @LollipopsAndWine in #580
- Release 0.8.0 by @drunkpig in #587
- fix: 1. resolve incorrect pair relation of figure and footnote, 2. re… by @icecraft in #603
- fix: restore the lang option in tools/cli.py by @icecraft in #604
- fix: solve conflicts by @myhloli in #607
- fix: remove useless files by @myhloli in #608
- feat(gradio_app): add examples accordion to the PDF conversion interface by @myhloli in #597
- feat(pipeline): pass language parameter for parsing and markdown conversion by @myhloli in #602
- feat(ocr_mkcontent): support drop reason in none_with_reason mode by @myhloli in #630
- feat(UNIPipe): change default drop_mode to NONE_WITH_REASON by @myhloli in #631
- refactor(pdf_extract): use Image.crop directly with layout detection by @myhloli in #635
- fix(pdf-extract): ensure model is set to evaluation mode before processing by @myhloli in #636
- fix(pdf_extract_kit):change unimernet base -> small by @myhloli in #639
- feat: add test case by @dt-yy in #645
- feat: integrate the front-end interface and configure one-click startup by @LollipopsAndWine in #668
- feat: delete unused files and update the front-end style by @LollipopsAndWine in #669
- docs: update project lists in README files to include web_api by @myhloli in #670
- feat:add layoutreader to sort blocks by @myhloli in #672
- refactor(model): improve timing information and performance by @myhloli in #690
- feat: add arXiv paper link to header and adjust PDF parsing logic by @myhloli in #693
- perf(pdf_extract_kit): conditional memory cleanup based on GPU capacity by @myhloli in #694
- fix: caption or footnote match algorithm by @icecraft in #695
- fix: caption|footnote match algorithm by @icecraft in #696
- feat(layoutreader): support local model directory and improve model loading by @myhloli in #698
- feat(docs): automate model download and configuration by @myhloli in #699
- docs: add filename to wget command in model download scripts by @myhloli in #700
- docs: update CUDA acceleration guides and README content by @myhloli in #701
- Update README_Windows_CUDA_Acceleration_en_US.md by @myhloli in #706
- feat(pdf_parse_union_core_v2): reintegrate para_split_v3 and add page range support by @myhloli in #716
- Update how_to_download_models_zh_cn.md by @myhloli in #717
- fix: Solving the Grouping Anomaly Issue with Multiple Consecutive Non-Text Blocks by @myhloli in #718
- feat: manage docs with Sphinx by @icecraft in #737
- feat(list&index block): detect and merge list and index blocks by @myhloli in #740
- refactor(para_split_v3): merge list and index block detection by @myhloli in #743
- fix(para_split_v3): refine list block detection in paragraph splitting by @myhloli in #744
- update example files by @myhloli in #747
- refactor(ocr):Increase the dilation factor in OCR to address the issue of word concatenation. by @myhloli in #753
- refactor(para): improve paragraph splitting algorithm by @myhloli in #765
- docs:Update the driver requirements on the Ubuntu system. by @myhloli in #766
- update:update config json by @myhloli in #769
- feat(model): add support for DocLayout-YOLO model by @myhloli in #773
- build(setup): add doclayout_yolo dependency by @myhloli in #774
- build(docker): add doclayout-yolo dependency by @myhloli in #776
- feat: add support for non-PDF file conversion to PDF by @myhloli in #777
- Feat/data api by @icecraft in #782
- Feat/new table caption match by @icecraft in #784
- refactor(parse_core): improve image and table block handling by @myhloli in #785
- refactor(ocr): adjust OCR processing parameters by @myhloli in #786
- fix: add init to magic_pdf.config by @myhloli in #788
- fix: add init to magic_pdf.utils by @myhloli in #789
- feat(draw_bbox): update bounding box drawing for tables and images by @myhloli in #791
- Add multi_gpu process project by @randydl in #79...
magic_pdf-0.8.1-update-docs
What's Changed
Full Changelog: magic_pdf-0.8.1-released...magic_pdf-0.8.1-update-docs
magic_pdf-0.8.1-released
What's Changed
fix:
- resolve incorrect pair relation of figure and footnote
- resolve incorrect pair relation of table and caption #590 by @icecraft in #599
Full Changelog: magic_pdf-0.8.0-released...magic_pdf-0.8.1-released
magic_pdf-0.8.0-released
What's Changed
feat:
- Add RAG API
- Integration of RAG into llama_index project
- Update Dockerfile
- Fine-grained model singletons, reducing memory usage and speeding up initialization
- CLI and API now accept parsing range parameters, allowing custom start and end pages
- Support image footnotes
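The parsing range parameters in the list above can be illustrated with a small sketch. It assumes 0-based, inclusive page indices and treats a missing or negative end page as "through the last page"; those conventions, the function name `resolve_page_range`, and the parameter name `start_page_id` are assumptions for illustration (only `end_page_id` appears in these notes, in #518). The sketch also shows the pitfall that #518 describes: an end page of 0 must be tested with `is not None` rather than by truthiness, or it is silently replaced by the last page.

```python
from typing import Optional, Tuple


def resolve_page_range(
    start_page_id: Optional[int],
    end_page_id: Optional[int],
    page_count: int,
) -> Tuple[int, int]:
    """Clamp a requested page range to the pages that exist.

    Indices are assumed to be 0-based and inclusive.
    """
    if page_count <= 0:
        raise ValueError("document has no pages")

    last = page_count - 1
    start = start_page_id if start_page_id is not None and start_page_id >= 0 else 0

    # Compare against None explicitly: `end_page_id or last` would turn a
    # legitimate request for page 0 into a request for the whole document.
    end = end_page_id if end_page_id is not None and end_page_id >= 0 else last
    end = min(end, last)

    if start > end:
        raise ValueError(f"start page {start} is after end page {end}")
    return start, end


if __name__ == "__main__":
    print(resolve_page_range(0, 0, 10))     # only the first page
    print(resolve_page_range(2, None, 10))  # page 2 through the last page
```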
bugfix:
- When removing the smaller of two overlapping blocks, retain that block's boundary information
- Lowered the span-filling threshold for blocks from 0.6 to 0.3
- Low-score lines were lost in the OCR detection stage
- Merge multiple spans of a single line in the OCR detection stage
- Improved segmentation of English words that were run together
- Inaccurate layout boxes
- Words split across line breaks were merged incorrectly
- The final output contained certain special characters
Full Changelog: magic_pdf-0.7.1-released...magic_pdf-0.8.0-released
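Two items in the 0.8.0 bugfix list above concern words broken across line breaks and special characters in the output; the related pull requests (#521 and #523) describe them as replacing \u0002 and \u0003 in common text and not adding a space after an English line that ends with a hyphen. The sketch below shows one way those two rules could be combined when joining lines into a paragraph. The function name `join_lines` and the exact handling (control characters are dropped, the hyphen is kept) are assumptions, not the project's implementation.

```python
import re
from typing import Iterable

# Control characters named in #521; here they are simply removed.
_CONTROL_CHARS = re.compile("[\u0002\u0003]")


def join_lines(lines: Iterable[str]) -> str:
    """Join text lines into one paragraph string.

    A space is inserted between consecutive lines unless the earlier
    line ends with a hyphen, in which case the next line follows directly.
    """
    cleaned = [_CONTROL_CHARS.sub("", line).strip() for line in lines]
    cleaned = [line for line in cleaned if line]

    parts = []
    for index, line in enumerate(cleaned):
        parts.append(line)
        is_last = index == len(cleaned) - 1
        if not is_last and not line.endswith("-"):
            parts.append(" ")
    return "".join(parts)


if __name__ == "__main__":
    print(join_lines(["state-of-the-", "art results", "are\u0002 shown"]))
```

Keeping the hyphen is a deliberate simplification: deciding whether a trailing hyphen is a soft line-break hyphen or part of a compound word needs a dictionary or language model, which this sketch does not attempt.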
magic_pdf-0.7.1-released
What's Changed
- feat: add tablemaster_paddle by @papayalove in #463
- (para_split_v2): index out of range issue of span_text first char by @papayalove in #396
Full Changelog: magic_pdf-0.7.0b1-released...magic_pdf-0.7.1-released
magic_pdf-0.7.0b1-released
What's Changed
- feat: add table recognition success detect by @papayalove in #354
- fix: #366 by @icecraft in #371
- fix&refactor(pdf-extract-kit): table recognition and ocr by @myhloli in #374
- fix(doc-analyze): adjust image scaling limit to 9000 pixels by @myhloli in #379
- feat(draw_bbox): add model bbox drawing functionality by @myhloli in #386
New Contributors
- @zuanzuanshao made their first contribution in #355
Full Changelog: magic_pdf-0.7.0a1-released...magic_pdf-0.7.0b1-released
magic_pdf-0.6.2b1-released
What's Changed
- Optimized model loading logic, now requiring only a single load during batch processing.
- Command-line interface now supports batch input.
- When import fails, prints complete error messages to facilitate troubleshooting.
- Fixed a bug where overlapping spans were incorrectly removed multiple times.
- Improved OCR recognition areas, doubling the OCR speed.
- Embedded language identification models within the whl package for easier offline deployment.
- Replaced interline_equation_blocks with interline_equations to enhance interline formula recognition capabilities in non-academic paper scenarios.
- Added page number indexing to the output results of content_list.
- Locked some dependency versions and adjusted the dependency installation logic to reduce conflicts and redundant installations, cutting down the number of packages by 30% and improving the initial installation success rate.
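The first item in the list above says the model is now loaded only once during batch processing. A generic way to get that behavior is a small registry that caches each model by name, sketched below. The class `ModelRegistry`, the `process_batch` helper, and the dictionary standing in for a model are all hypothetical and only illustrate the load-once idea; they are not the project's code.

```python
from typing import Any, Callable, Dict, List, Tuple


class ModelRegistry:
    """Load each named model at most once and reuse it afterwards."""

    def __init__(self) -> None:
        self._models: Dict[str, Any] = {}
        self.load_calls = 0  # counts real loads, for demonstration only

    def get(self, name: str, loader: Callable[[], Any]) -> Any:
        if name not in self._models:
            self._models[name] = loader()
            self.load_calls += 1
        return self._models[name]


registry = ModelRegistry()


def load_layout_model() -> Dict[str, str]:
    # Stand-in for an expensive model load (weights, GPU transfer, ...).
    return {"name": "layout"}


def process_batch(pdf_names: List[str]) -> List[Tuple[str, str]]:
    """Process every file in the batch with the same cached model."""
    results = []
    for pdf in pdf_names:
        model = registry.get("layout", load_layout_model)
        results.append((pdf, model["name"]))
    return results


if __name__ == "__main__":
    process_batch(["a.pdf", "b.pdf", "c.pdf"])
    print(registry.load_calls)
```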
New Contributors
- @yzztin made their first contribution in #214
- @eltociear made their first contribution in #231
Full Changelog: magic_pdf-0.6.1-released...magic_pdf-0.6.2b1-released
magic_pdf-0.6.1-released
fix:Add two spaces at the end of an image or table row to ensure that…