How can PDF files be batch-processed in the magic_pdf_parse_main demo? #513

Closed
chenliutiao opened this issue Aug 31, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@chenliutiao

Is your feature request related to a problem? Please describe.
How can I modify the code in the magic_pdf_parse_main.py demo to batch-process local PDF files? Batch processing works from the command line, but I don't know how to do it through the API.

Describe the solution you'd like
Batch-process every PDF file in a folder.

Describe alternatives you've considered

Additional context

@chenliutiao chenliutiao added the enhancement New feature or request label Aug 31, 2024
@HaoRenkk123

+1, same question.

@SaraiQX

SaraiQX commented Sep 19, 2024

+1, same question.

@guozhetao

import csv
import os
import subprocess

from tqdm import tqdm

# Folder to process
input_directory = ''  # replace with your folder path
csv_file_path = 'processing_results.csv'  # path of the output CSV file

# Collect all PDF files
pdf_files = []
for root, dirs, files in os.walk(input_directory):
    for file in files:
        if file.endswith('.pdf'):
            pdf_files.append(os.path.join(root, file))

# Open the CSV file for writing
with open(csv_file_path, mode='w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile)
    csv_writer.writerow(['File Name', 'Processed'])  # header row

    # Process each file while the CSV is still open
    for pdf_path in tqdm(pdf_files, desc='Processing PDFs'):
        # Output folder sits next to the source file and shares its name
        output_path = os.path.splitext(pdf_path)[0]
        command = f'magic-pdf -p "{pdf_path}" -o "{output_path}" -m auto'

        try:
            subprocess.run(command, shell=True, check=True)  # run the CLI
            csv_writer.writerow([os.path.basename(pdf_path), True])  # record success
        except subprocess.CalledProcessError:
            csv_writer.writerow([os.path.basename(pdf_path), False])  # record failure

print(f'Processing complete. Results saved in {csv_file_path}.')
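
One possible hardening of the script above: file names containing quotes or shell metacharacters can break a command built as a single string and run with shell=True. The sketch below passes the same `-p`, `-o` and `-m auto` flags as an argument list instead. The helper names `find_pdfs`, `build_command` and `run_one` are made up for illustration, and the `runner` parameter exists only so the logic can be exercised without the CLI installed.

```python
import subprocess
from pathlib import Path


def find_pdfs(input_directory):
    """Return all PDF paths under input_directory (extension matched case-insensitively)."""
    return sorted(
        p for p in Path(input_directory).rglob('*')
        if p.is_file() and p.suffix.lower() == '.pdf'
    )


def build_command(pdf_path):
    """Build the magic-pdf invocation as an argument list, so no shell quoting is needed."""
    pdf_path = Path(pdf_path)
    output_path = pdf_path.with_suffix('')  # same location as the source, minus ".pdf"
    return ['magic-pdf', '-p', str(pdf_path), '-o', str(output_path), '-m', 'auto']


def run_one(pdf_path, runner=subprocess.run):
    """Run the CLI for one file; return True on success, False on failure."""
    try:
        runner(build_command(pdf_path), check=True)
        return True
    except (subprocess.CalledProcessError, FileNotFoundError):
        # FileNotFoundError covers the case where magic-pdf is not on PATH
        return False
```

A loop such as `for pdf in find_pdfs(input_directory): run_one(pdf)` could then replace the string-based command while keeping the CSV logging unchanged.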

@reneliury

reneliury commented Oct 28, 2024

This approach seems to re-initialize the model for every file, so batch runs are too slow.

@myhloli
Collaborator

myhloli commented Oct 28, 2024

Actually, it doesn't. With the current logic, only the first init is a real init; every later one reads from the cache.
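
If the cache is an in-memory object, the files would need to be parsed inside one Python process to benefit from it. A minimal sketch of such a loop follows. It assumes a hypothetical `parse_fn` callable that stands in for whatever per-file parse function the demo exposes; `batch_parse` is likewise an illustrative name and not part of the library.

```python
import os


def batch_parse(input_directory, parse_fn):
    """Call parse_fn(pdf_path) for every PDF under input_directory, in this process.

    parse_fn is a hypothetical stand-in for the demo's per-file parse function.
    Returns a list of (pdf_path, succeeded) tuples.
    """
    results = []
    for root, _dirs, files in os.walk(input_directory):
        for name in sorted(files):
            if not name.lower().endswith('.pdf'):
                continue
            pdf_path = os.path.join(root, name)
            try:
                parse_fn(pdf_path)
                results.append((pdf_path, True))
            except Exception:
                # Keep going so that one bad file does not stop the batch
                results.append((pdf_path, False))
    return results
```

Because every call happens in the same interpreter, anything the parse function caches after its first call stays available for the remaining files.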

@banianzr

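Hello, I don't quite follow this part. In the current demo each file gets its own pipe, and the do_parse step in the pipe appears to call init every time (init instantiates ModelSingleton() and then initializes the model). I don't see where the cache is read.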

@myhloli
Collaborator

myhloli commented Nov 14, 2024

A singleton is never created more than once, so every call uses the model object that was already initialized.
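
The singleton behavior described here can be illustrated with a toy class. This is only a sketch of the general pattern and not the project's actual ModelSingleton code: the `get_model` method, the `"layout"` key and the `load_count` counter are assumptions made for the example.

```python
class ModelSingleton:
    """Toy singleton for illustration; not the library's real implementation."""

    _instance = None
    _models = {}
    load_count = 0  # counts the real (expensive) initializations

    def __new__(cls):
        # Every ModelSingleton() call returns the same object
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def get_model(self, key):
        # Load the model only on first request; afterwards serve it from the cache
        if key not in self._models:
            ModelSingleton.load_count += 1
            self._models[key] = object()  # stand-in for loading model weights
        return self._models[key]


def parse_one(pdf_name):
    # Each per-file "pipe" asks the singleton for the model again
    model = ModelSingleton().get_model("layout")
    return pdf_name, id(model)


results = [parse_one(name) for name in ["a.pdf", "b.pdf", "c.pdf"]]
print(ModelSingleton.load_count)
```

An in-memory singleton like this is shared only within one Python process. A script that launches `magic-pdf` through subprocess once per file starts a fresh process each time, so it would not reuse the object.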

@myhloli myhloli closed this as completed Nov 14, 2024