Why can't I convert a Chinese PDF successfully? All I get is garbled text. #748

Open
madinshibo opened this issue Jan 14, 2025 · 3 comments
Labels
question Further information is requested

Comments

@madinshibo

Question

I have a Chinese PDF that I want to convert to a text file, but the output is garbled. I have already set lang=['en','ch_sim'] in EasyOcrOptions.

My code is:

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

ocr_options = EasyOcrOptions(lang=['en','ch_sim'])

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.ocr_options = ocr_options

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

source = "C:/Users/53559/PycharmProjects/pythonProject/动手学深度学习.pdf"  # document per local path or URL
# source = "https://arxiv.org/pdf/2408.09869"
result = converter.convert(source)
markdown_content = result.document.export_to_markdown()
print(markdown_content)
with open("C:/Users/53559/PycharmProjects/pythonProject/output4.txt", "w", encoding="utf-8") as file:
    file.write(markdown_content) 

docling 2.15.1
docling-core 2.14.0
docling-ibm-models 3.1.2
docling-parse 3.0.0
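
To put a number on "garbled", one rough check is to measure how much of the exported text is actually made of Chinese characters. The sketch below is only a heuristic and not part of Docling: `cjk_ratio` is a hypothetical helper, and it only counts the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), so punctuation, digits, and Latin text lower the score.

```python
def cjk_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # Basic CJK Unified Ideographs block only; extension blocks are ignored here.
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)
```

A value near zero for a document known to be mostly Chinese would suggest that the text layer or the OCR output is not being decoded correctly.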

madinshibo added the "question" (Further information is requested) label on Jan 14, 2025
@RuiZheZhangQ

Same here. Docling may not handle this well.

@dolfim-ibm
Contributor

Do you know if your PDF document is programmatic or scanned? In the first case, we are currently updating the set of supported fonts in the parser, so it might improve soon.
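
One quick way to tell the two cases apart is to see whether the first few pages carry an embedded text layer at all. This is a minimal sketch, assuming the third-party pypdf package is installed; `sample_page_texts` and `has_text_layer` are hypothetical helper names, and the 20-character threshold is an arbitrary choice.

```python
def has_text_layer(page_texts, min_chars=20):
    """Return True if the sampled pages contain a meaningful amount of text."""
    total = sum(len(text.strip()) for text in page_texts)
    return total >= min_chars


def sample_page_texts(pdf_path, max_pages=3):
    """Extract the embedded text of the first few pages (empty for pure scans)."""
    # Imported here so has_text_layer can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    texts = []
    for index, page in enumerate(reader.pages):
        if index >= max_pages:
            break
        texts.append(page.extract_text() or "")
    return texts
```

Calling `has_text_layer(sample_page_texts("PDF_FILE"))` returns False for a pure scan. A True result only says that a text layer exists; with unsupported fonts that layer can still come out garbled.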

We also found that, in rare cases, the previous parser works better. You could give it a try with:

docling --pdf-backend=dlparse_v1 PDF_FILE
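
For anyone calling Docling from Python instead of the CLI, a rough equivalent of that flag might look like the sketch below. It assumes that the v1 parser is exposed as `DoclingParseDocumentBackend` in `docling.backend.docling_parse_backend` and that `PdfFormatOption` accepts a `backend` argument, which is how docling 2.x appears to work; `make_converter` is a hypothetical helper name, so please check it against your installed version.

```python
def make_converter(use_v1_parser=True):
    """Build a DocumentConverter for PDFs, optionally with the previous (v1) parser."""
    # Imported inside the function so the helper can be defined
    # even in an environment where docling is not installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.ocr_options = EasyOcrOptions(lang=["en", "ch_sim"])

    format_kwargs = {"pipeline_options": pipeline_options}
    if use_v1_parser:
        # Assumption: this class corresponds to --pdf-backend=dlparse_v1.
        from docling.backend.docling_parse_backend import DoclingParseDocumentBackend

        format_kwargs["backend"] = DoclingParseDocumentBackend

    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(**format_kwargs)}
    )
```

Usage would then be `result = make_converter().convert("PDF_FILE")`, followed by `result.document.export_to_markdown()` as in the original post.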

@RuiZheZhangQ

I solved part of the problem. When your PDF is encrypted, you need to set some parameters, such as:

pipeline_options.ocr_options.lang = ["chi_sim"]
pipeline_options.ocr_options.force_full_page_ocr = True

With these settings you get better Markdown, but the result is still not satisfactory. English documents clearly convert better.
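
One detail worth checking with these settings is that the language code depends on the OCR engine: EasyOCR names Simplified Chinese "ch_sim", while Tesseract uses "chi_sim". The sketch below shows one way to keep the two apart when forcing full-page OCR. It assumes `EasyOcrOptions` and `TesseractCliOcrOptions` in `docling.datamodel.pipeline_options` both accept `lang` and `force_full_page_ocr`; `OCR_LANGS` and `build_pipeline_options` are hypothetical names.

```python
# Simplified Chinese plus English, written in each engine's own code scheme.
OCR_LANGS = {
    "easyocr": ["ch_sim", "en"],
    "tesseract": ["chi_sim", "eng"],
}


def build_pipeline_options(engine="easyocr"):
    """Return PDF pipeline options that OCR every page with the chosen engine."""
    # Imported inside the function so OCR_LANGS is usable without docling installed.
    from docling.datamodel.pipeline_options import (
        EasyOcrOptions,
        PdfPipelineOptions,
        TesseractCliOcrOptions,
    )

    langs = OCR_LANGS[engine]
    if engine == "easyocr":
        ocr_options = EasyOcrOptions(lang=langs, force_full_page_ocr=True)
    else:
        # Assumption: the tesseract CLI and its chi_sim data are installed.
        ocr_options = TesseractCliOcrOptions(lang=langs, force_full_page_ocr=True)

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.ocr_options = ocr_options
    return pipeline_options
```

Forcing full-page OCR ignores the embedded text layer entirely, which is why it can help when that layer is unreadable, at the cost of slower conversion and OCR-level accuracy.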
