1.0.1: Incorrect text extraction for PDF fonts with ligature features #1599

Closed
klizet opened this issue Jan 22, 2025 · 5 comments
Labels
bug Something isn't working

Comments


klizet commented Jan 22, 2025

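Description of the bug

Typical cases are character sequences such as ff, fi, and fl, which some fonts encode as a single character. When extracting text from a PDF stream, most PDF parsing libraries apply a reverse ToUnicode mapping that restores such a character to its separate component characters. In 1.0.1, the trailing characters of the restored ligature are dropped (fi -> f, ffi -> f), so the extracted text is wrong.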

0.9.2 does not have this problem.
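
For reference, the expansion expected from a single-code-point ligature can be sketched with the Python standard library. This is only an illustration of the expected result, not the code magic-pdf uses; `expand_ligatures` is a hypothetical helper name, and NFKC normalization is a stand-in for the font's ToUnicode mapping.

```python
import unicodedata

# Single-code-point Latin ligatures (Alphabetic Presentation Forms block):
# ff, fi, fl, ffi, ffl
LIGATURES = ["\ufb00", "\ufb01", "\ufb02", "\ufb03", "\ufb04"]


def expand_ligatures(text: str) -> str:
    """Hypothetical helper: expand compatibility ligatures via NFKC."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    for lig in LIGATURES:
        # Print the code point next to its expanded letter sequence.
        print(f"U+{ord(lig):04X} -> {expand_ligatures(lig)!r}")
```

Note that NFKC also rewrites other compatibility characters (for example superscript digits), so applying it to a whole document is a broader change than ligature expansion alone.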

How to reproduce the bug

The model version is configured in magic-pdf.json as follows:

"models-dir": "...\\.cache\\huggingface\\hub\\models--opendatalab--PDF-Extract-Kit-1.0\\snapshots\\60416a2cabad3f7b7284b43ce37a99864484fba2/models",

Source: https://github.com/MiniMax-AI/MiniMax-01/blob/main/MiniMax-01.pdf

The resulting markdown contains errors such as find -> fnd, different -> diferent, and efficient -> efcient.

Operating system

Windows

Python version

3.11

Software version (magic-pdf --version)

1.0.x

Device mode

cuda

@klizet klizet added the bug Something isn't working label Jan 22, 2025
Collaborator

myhloli commented Jan 22, 2025

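This document differs from the usual ligature case: each ligature is stored as two separate characters, and the layout is adjusted by setting the width of the trailing character to 0.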


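In the latest version we added a step that removes zero-width characters, which is why these zero-width characters were dropped. We will disable the zero-width character removal and test again.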
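
The effect described here can be reproduced with a minimal sketch. The character records, the `bbox` layout `(x0, y0, x1, y1)`, and the names `char_width` and `join_chars` are all assumptions made for illustration; they are not taken from the magic-pdf source.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical character record: {"c": <char>, "bbox": (x0, y0, x1, y1)}
Char = Dict[str, Union[str, Tuple[float, float, float, float]]]


def char_width(ch: Char) -> float:
    """Horizontal extent of a character's bounding box."""
    x0, _, x1, _ = ch["bbox"]
    return x1 - x0


def join_chars(chars: List[Char], drop_zero_width: bool) -> str:
    """Concatenate characters, optionally skipping zero-width ones."""
    out = []
    for ch in chars:
        if drop_zero_width and char_width(ch) <= 0:
            continue  # this skip is what loses the trailing ligature letter
        out.append(ch["c"])
    return "".join(out)


# "find" where the "i" after "f" was emitted with zero width (assumed data).
chars: List[Char] = [
    {"c": "f", "bbox": (0.0, 0.0, 5.5, 10.0)},
    {"c": "i", "bbox": (5.5, 0.0, 5.5, 10.0)},
    {"c": "n", "bbox": (5.5, 0.0, 11.0, 10.0)},
    {"c": "d", "bbox": (11.0, 0.0, 16.5, 10.0)},
]

if __name__ == "__main__":
    print(join_chars(chars, drop_zero_width=True))
    print(join_chars(chars, drop_zero_width=False))
```

With the filter enabled the zero-width "i" is skipped, matching the find -> fnd symptom in the report; with it disabled the word survives intact.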

Collaborator

myhloli commented Jan 22, 2025

fixed at #1601

Author

klizet commented Jan 22, 2025

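In my testing, #1601 prevents the program from running and raises a division-by-zero error. A quick workaround may be simply to comment out the `continue`; more thorough testing would require deeper changes.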

Collaborator

myhloli commented Jan 22, 2025

Did you also update the IoU code? I believe #1601 fixed the division by zero there as well.

Author

klizet commented Jan 23, 2025

Yes, that was my oversight.
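
For readers wondering why retaining zero-width characters can trigger a division by zero in an IoU-style check: a box with zero width has zero area, so any ratio that divides by that area fails. The sketch below shows one possible guard. How #1601 actually handles this is not shown in this thread; `overlap_ratio` and the centre-point fallback are assumptions for illustration.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def overlap_ratio(char_box: BBox, span_box: BBox) -> float:
    """Fraction of char_box's area that lies inside span_box.

    Hypothetical guard: a degenerate (zero-area) char box is scored by
    whether its centre point falls inside span_box, instead of dividing
    by its zero area.
    """
    cx0, cy0, cx1, cy1 = char_box
    sx0, sy0, sx1, sy1 = span_box
    char_area = (cx1 - cx0) * (cy1 - cy0)
    if char_area <= 0:
        mid_x, mid_y = (cx0 + cx1) / 2, (cy0 + cy1) / 2
        inside = sx0 <= mid_x <= sx1 and sy0 <= mid_y <= sy1
        return 1.0 if inside else 0.0
    inter_w = max(0.0, min(cx1, sx1) - max(cx0, sx0))
    inter_h = max(0.0, min(cy1, sy1) - max(cy0, sy0))
    return inter_w * inter_h / char_area


if __name__ == "__main__":
    span = (0.0, 0.0, 20.0, 10.0)
    print(overlap_ratio((5.5, 0.0, 5.5, 10.0), span))   # zero-width char
    print(overlap_ratio((0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 20.0, 10.0)))
```

The centre-point fallback keeps a zero-width character attached to the span it sits in, which is the behaviour needed for the trailing ligature letter to survive.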
