Merge pull request #1602 from myhloli/dev

docs(readme):update readme for 1.1.0
opendatalab · Jan 22, 2025 · 5115d00
2 parents c7a3a68 + 4fe89d5
commit 5115d00

Showing 3 changed files with 18 additions and 2 deletions.
diff --git a/README.md b/README.md
@@ -47,6 +47,11 @@ Easier to use: Just grab MinerU Desktop. No coding, no login, just a simple inte
 </div>
 
 # Changelog
+- 2025/01/22 1.1.0 released. In this version we have focused on improving parsing accuracy and efficiency:
+  - Upgraded to the latest doclayout_yolo(2501) model, enhancing layout recognition accuracy.
+  - Upgraded to the latest unimernet(2501) model, improving formula recognition accuracy.
+  - On devices that meet certain configuration requirements (16GB+ VRAM), by optimizing resource usage and restructuring the processing pipeline, overall parsing speed has been increased by more than 50%.
+  - Added a new heading classification feature (testing version, enabled by default) to the online demo, which supports hierarchical classification of headings, thereby enhancing document structuring.
 - 2025/01/10 1.0.1 released. This is our first official release, where we have introduced a completely new API interface and enhanced compatibility through extensive refactoring, as well as a brand new automatic language identification feature:
   - New API Interface
     - For the data-side API, we have introduced the Dataset class, designed to provide a robust and flexible data processing framework. This framework currently supports a variety of document formats, including images (.jpg and .png), PDFs, Word documents (.doc and .docx), and PowerPoint presentations (.ppt and .pptx). It ensures effective support for data processing tasks ranging from simple to complex.
@@ -356,6 +361,7 @@ TODO
 - [x] Reading order based on the model  
 - [x] Recognition of `index` and `list` in the main text  
 - [x] Table recognition
+- [x] Heading Classification
 - [ ] Code block recognition in the main text
 - [ ] [Chemical formula recognition](docs/chemical_knowledge_introduction/introduction.pdf)
 - [ ] Geometric shape recognition
@@ -365,7 +371,6 @@ TODO
 - Reading order is determined by the model based on the spatial distribution of readable content, and may be out of order in some areas under extremely complex layouts.
 - Vertical text is not supported.
 - Tables of contents and lists are recognized through rules, and some uncommon list formats may not be recognized.
-- Only one level of headings is supported; hierarchical headings are not currently supported.
 - Code blocks are not yet supported in the layout model.
 - Comic books, art albums, primary school textbooks, and exercises cannot be parsed well.
 - Table recognition may result in row/column recognition errors in complex tables.

diff --git a/README_zh-CN.md b/README_zh-CN.md
@@ -46,6 +46,11 @@
 </div>
 
 # 更新记录
+- 2025/01/22 1.1.0 发布，在这个版本我们重点提升了解析的精度与效率：
+  - 升级了最新版的doclayout_yolo(2501)模型，提升了layout识别精度
+  - 升级了最新版的unimernet(2501)模型，提升了公式识别精度
+  - 在配置满足一定条件（显存16GB+）的设备上，通过优化资源占用和重构处理流水线，整体解析速度提升50%以上
+  - 在线demo上新增标题分级功能（测试版本，默认开启），支持对标题进行分级，提升文档结构化程度
 - 2025/01/10 1.0.1 发布，这是我们的第一个正式版本，在这个版本中，我们通过大量重构带来了全新的API接口和更广泛的兼容性，以及全新的自动语言识别功能：
   - 全新API接口 
     - 对于数据侧API，我们引入了Dataset类，旨在提供一个强大而灵活的数据处理框架。该框架当前支持包括图像（.jpg及.png）、PDF、Word（.doc及.docx）、以及PowerPoint（.ppt及.pptx）在内的多种文档格式，确保了从简单到复杂的数据处理任务都能得到有效的支持。
@@ -359,6 +364,7 @@ TODO
 - [x] 基于模型的阅读顺序  
 - [x] 正文中目录、列表识别  
 - [x] 表格识别
+- [x] 标题分级
 - [ ] 正文中代码块识别
 - [ ] [化学式识别](docs/chemical_knowledge_introduction/introduction.pdf)
 - [ ] 几何图形识别
@@ -368,7 +374,6 @@ TODO
 - 阅读顺序基于模型对可阅读内容在空间中的分布进行排序，在极端复杂的排版下可能会部分区域乱序
 - 不支持竖排文字
 - 目录和列表通过规则进行识别，少部分不常见的列表形式可能无法识别
-- 标题只有一级，目前不支持标题分级
 - 代码块在layout模型里还没有支持
 - 漫画书、艺术图册、小学教材、习题尚不能很好解析
 - 表格识别在复杂表格上可能会出现行/列识别错误

diff --git a/magic_pdf/pdf_parse_union_core_v2.py b/magic_pdf/pdf_parse_union_core_v2.py
@@ -957,17 +957,23 @@ def pdf_parse_union(
         formula_aided_config = llm_aided_config.get('formula_aided', None)
         if formula_aided_config is not None:
             if formula_aided_config.get('enable', False):
+                llm_aided_formula_start_time = time.time()
                 llm_aided_formula(pdf_info_dict, formula_aided_config)
+                logger.info(f'llm aided formula time: {round(time.time() - llm_aided_formula_start_time, 2)}')
         """文本优化"""
         text_aided_config = llm_aided_config.get('text_aided', None)
         if text_aided_config is not None:
             if text_aided_config.get('enable', False):
+                llm_aided_text_start_time = time.time()
                 llm_aided_text(pdf_info_dict, text_aided_config)
+                logger.info(f'llm aided text time: {round(time.time() - llm_aided_text_start_time, 2)}')
         """标题优化"""
         title_aided_config = llm_aided_config.get('title_aided', None)
         if title_aided_config is not None:
             if title_aided_config.get('enable', False):
+                llm_aided_title_start_time = time.time()
                 llm_aided_title(pdf_info_dict, title_aided_config)
+                logger.info(f'llm aided title time: {round(time.time() - llm_aided_title_start_time, 2)}')
 
     """dict转list"""
     pdf_info_list = dict_to_list(pdf_info_dict)
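
The hunk above repeats one pattern three times, for the formula, text, and title stages: record a start time, call the LLM-aided step, then log the elapsed seconds rounded to two decimals. A minimal sketch of how that pattern could be factored out follows. The names `log_elapsed` and `run_if_enabled` are hypothetical and do not exist in `magic_pdf`. The sketch uses the standard `logging` module so that it is self-contained, whereas the project's own `logger` may be a different object. Unlike the committed code, it also logs the elapsed time when the step raises.

```python
import logging
import time
from contextlib import contextmanager

# Assumption: a stdlib logger stands in for the project's own `logger`.
logger = logging.getLogger(__name__)


@contextmanager
def log_elapsed(label, log=logger):
    """Log how long the wrapped block took, rounded to two decimals."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the wrapped block raises, so failures are timed too.
        log.info(f'{label} time: {round(time.perf_counter() - start, 2)}')


def run_if_enabled(label, step, pdf_info_dict, config):
    """Run step(pdf_info_dict, config) under a timer when config enables it.

    Mirrors the nested `is not None` / `.get('enable', False)` checks in the
    diff. Returns True if the step ran, False if it was skipped.
    """
    if config is None or not config.get('enable', False):
        return False
    with log_elapsed(label):
        step(pdf_info_dict, config)
    return True
```

Under these assumptions, each of the three blocks would reduce to a single call such as `run_if_enabled('llm aided formula', llm_aided_formula, pdf_info_dict, llm_aided_config.get('formula_aided', None))`. The sketch uses `time.perf_counter()` rather than `time.time()` because it is monotonic and intended for measuring intervals.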