master->dev #1490

Merged 32 commits on Jan 10, 2025
26200c4
fix: remove deprecated demo code
Dec 12, 2024
3456b9e
Merge pull request #1278 from icecraft/fix/remove_demo
myhloli Dec 12, 2024
08e2df5
fix: AbsPipe initial error
Dec 17, 2024
35eb3bd
Merge pull request #1312 from icecraft/fix/abs_pipe
myhloli Dec 17, 2024
c4f252d
Add files via upload
myhloli Dec 19, 2024
81fcef8
@MatthewZMD has signed the CLA in opendatalab/MinerU#1379
github-actions[bot] Dec 30, 2024
7d9d8a2
@yzztin has signed the CLA in opendatalab/MinerU#1397
github-actions[bot] Jan 3, 2025
c35aa79
@utopia2077 has signed the CLA in opendatalab/MinerU#1412
github-actions[bot] Jan 5, 2025
3c07c62
Merge pull request #1426 from opendatalab/dev
myhloli Jan 6, 2025
71800fc
Merge pull request #1429 from opendatalab/dev
myhloli Jan 6, 2025
2cbabe9
Merge pull request #1431 from opendatalab/dev
myhloli Jan 6, 2025
43cdaa5
Delete magic_pdf/pipe/AbsPipe.py
myhloli Jan 6, 2025
a53a467
Delete next_docs/en/user_guide/quick_start/to_markdown.rst
myhloli Jan 6, 2025
580d013
Merge pull request #1437 from opendatalab/dev
myhloli Jan 7, 2025
11847b4
Merge pull request #1439 from opendatalab/dev
myhloli Jan 7, 2025
62bb89b
Merge pull request #1449 from opendatalab/dev
myhloli Jan 8, 2025
4fc835f
Merge pull request #1452 from opendatalab/dev
myhloli Jan 8, 2025
55a5d9a
Merge pull request #1454 from opendatalab/dev
myhloli Jan 9, 2025
6750b6c
Merge pull request #1460 from opendatalab/dev
myhloli Jan 9, 2025
7ed89fb
Merge pull request #1465 from opendatalab/dev
myhloli Jan 9, 2025
1b654fc
Merge branch 'master' into release-1.0.0
myhloli Jan 9, 2025
e778264
Merge pull request #1469 from opendatalab/dev
myhloli Jan 9, 2025
c6d3e87
Merge pull request #1471 from opendatalab/dev
myhloli Jan 9, 2025
72514b4
Merge pull request #1473 from opendatalab/dev
myhloli Jan 9, 2025
69ea74f
Merge pull request #1475 from opendatalab/dev
myhloli Jan 10, 2025
1916412
Merge pull request #1477 from opendatalab/dev
myhloli Jan 10, 2025
04f084a
@beholder91 has signed the CLA in opendatalab/MinerU#1479
github-actions[bot] Jan 10, 2025
1c9f994
Merge pull request #1483 from opendatalab/dev
myhloli Jan 10, 2025
4bb5439
Merge pull request #1427 from opendatalab/release-1.0.0
myhloli Jan 10, 2025
4638513
Merge pull request #1486 from opendatalab/dev
myhloli Jan 10, 2025
eb2e213
Merge pull request #1487 from opendatalab/release-1.0.0
myhloli Jan 10, 2025
2c4a586
Update version.py with new version
myhloli Jan 10, 2025
2 changes: 1 addition & 1 deletion magic_pdf/libs/version.py
@@ -1 +1 @@
-__version__ = "0.10.6"
+__version__ = "1.0.0"
42 changes: 31 additions & 11 deletions next_docs/zh_cn/user_guide/quick_start/to_markdown.rst
@@ -12,6 +12,7 @@
 from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
 from magic_pdf.data.dataset import PymuDocDataset
 from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
+from magic_pdf.config.enums import SupportedPdfParseMethod

 # args
 pdf_file_name = "abc.pdf" # replace with the real pdf path
@@ -36,15 +37,22 @@
 ## Create Dataset Instance
 ds = PymuDocDataset(pdf_bytes)

-## inference
-infer_result = ds.apply(doc_analyze, ocr=True)
+## inference
+if ds.classify() == SupportedPdfParseMethod.OCR:
+    infer_result = ds.apply(doc_analyze, ocr=True)
+
+    ## pipeline
+    pipe_result = infer_result.pipe_ocr_mode(image_writer)
+
+else:
+    infer_result = ds.apply(doc_analyze, ocr=False)
+
+    ## pipeline
+    pipe_result = infer_result.pipe_txt_mode(image_writer)

 ### draw model result on each page
 infer_result.draw_model(os.path.join(local_md_dir, f"{name_without_suff}_model.pdf"))

-## pipeline
-pipe_result = infer_result.pipe_ocr_mode(image_writer)

 ### draw layout result on each page
 pipe_result.draw_layout(os.path.join(local_md_dir, f"{name_without_suff}_layout.pdf"))

@@ -54,6 +62,9 @@
 ### dump markdown
 pipe_result.dump_md(md_writer, f"{name_without_suff}.md", image_dir)

+### dump content list
+pipe_result.dump_content_list(md_writer, f"{name_without_suff}_content_list.json", image_dir)
+

 Object storage file example
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -92,23 +103,32 @@
 ## Create Dataset Instance
 ds = PymuDocDataset(pdf_bytes)

-## inference
-infer_result = ds.apply(doc_analyze, ocr=True)
+## inference
+if ds.classify() == SupportedPdfParseMethod.OCR:
+    infer_result = ds.apply(doc_analyze, ocr=True)
+
+    ## pipeline
+    pipe_result = infer_result.pipe_ocr_mode(image_writer)
+
+else:
+    infer_result = ds.apply(doc_analyze, ocr=False)
+
+    ## pipeline
+    pipe_result = infer_result.pipe_txt_mode(image_writer)

 ### draw model result on each page
 infer_result.draw_model(os.path.join(local_dir, f'{name_without_suff}_model.pdf')) # dump to local

-## pipeline
-pipe_result = infer_result.pipe_ocr_mode(image_writer)

 ### draw layout result on each page
 pipe_result.draw_layout(os.path.join(local_dir, f'{name_without_suff}_layout.pdf')) # dump to local

 ### draw spans result on each page
-pipe_result.draw_span(os.path.join(local_dir, f'{name_without_suff}_spans.pdf')) # dump to local
+pipe_result.draw_span(os.path.join(local_dir, f'{name_without_suff}_spans.pdf')) # dump to local

 ### dump markdown
 pipe_result.dump_md(writer, f'{name_without_suff}.md', "unittest/tmp/images") # dump to remote s3

+### dump content list
+pipe_result.dump_content_list(md_writer, f"{name_without_suff}_content_list.json", image_dir)

 See :doc:`../data/data_reader_writer` for more **read/write** examples
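
The documentation change in this file replaces an unconditional OCR run with a branch on `ds.classify()`. A minimal, self-contained sketch of that dispatch pattern is given here for illustration. `ParseMethod`, `StubDataset`, `StubInferResult`, `stub_doc_analyze`, and `run_pipeline` are hypothetical stand-ins and are not part of `magic_pdf`; they only mimic the call shapes visible in the diff, so the sketch runs without the library installed.

```python
from enum import Enum


class ParseMethod(Enum):
    """Hypothetical stand-in for SupportedPdfParseMethod."""
    OCR = "ocr"
    TXT = "txt"


class StubInferResult:
    """Hypothetical stand-in for the result of ds.apply(doc_analyze, ...)."""

    def __init__(self, ocr):
        self.ocr = ocr

    def pipe_ocr_mode(self, image_writer):
        return "ocr-pipeline"

    def pipe_txt_mode(self, image_writer):
        return "txt-pipeline"


class StubDataset:
    """Hypothetical stand-in for a dataset that can classify itself."""

    def __init__(self, method):
        self._method = method

    def classify(self):
        return self._method

    def apply(self, func, **kwargs):
        # Assumption: apply() passes the dataset itself as the first argument.
        return func(self, **kwargs)


def stub_doc_analyze(ds, ocr=False):
    """Hypothetical stand-in for doc_analyze; only records the ocr flag."""
    return StubInferResult(ocr)


def run_pipeline(ds, analyze, image_writer):
    """Choose OCR or text mode once, then use it for inference and the pipeline."""
    use_ocr = ds.classify() == ParseMethod.OCR
    infer_result = ds.apply(analyze, ocr=use_ocr)
    if use_ocr:
        pipe_result = infer_result.pipe_ocr_mode(image_writer)
    else:
        pipe_result = infer_result.pipe_txt_mode(image_writer)
    return infer_result, pipe_result
```

Computing the mode once and reusing it keeps the `ocr` flag passed to inference consistent with the pipeline method that is called afterwards, which is the property the two branches in the diff maintain by hand.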
32 changes: 32 additions & 0 deletions signatures/version1/cla.json
@@ -103,6 +103,38 @@
       "created_at": "2024-11-19T07:28:12Z",
       "repoId": 765083837,
       "pullRequestNo": 1024
+    },
+    {
+      "name": "MatthewZMD",
+      "id": 12422335,
+      "comment_id": 2565021810,
+      "created_at": "2024-12-30T04:46:33Z",
+      "repoId": 765083837,
+      "pullRequestNo": 1379
+    },
+    {
+      "name": "yzztin",
+      "id": 99233593,
+      "comment_id": 2568773016,
+      "created_at": "2025-01-03T07:02:55Z",
+      "repoId": 765083837,
+      "pullRequestNo": 1397
+    },
+    {
+      "name": "utopia2077",
+      "id": 78017255,
+      "comment_id": 2571704177,
+      "created_at": "2025-01-05T17:57:17Z",
+      "repoId": 765083837,
+      "pullRequestNo": 1412
+    },
+    {
+      "name": "beholder91",
+      "id": 113708464,
+      "comment_id": 2581919559,
+      "created_at": "2025-01-10T06:58:05Z",
+      "repoId": 765083837,
+      "pullRequestNo": 1479
     }
   ]
 }