why i can't convert chinese pdf successfully, all i got is garbled. #748

madinshibo · 2025-01-14T15:15:46Z

Question

I have a Chinese PDF that I want to convert to a txt file, but the output is garbled. I have already set lang=['en','ch_sim'] in EasyOcrOptions.

My code is:

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, EasyOcrOptions
from docling.document_converter import PdfFormatOption, DocumentConverter

ocr_options = EasyOcrOptions(lang=['en','ch_sim'])

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.ocr_options = ocr_options

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

source = "C:/Users/53559/PycharmProjects/pythonProject/动手学深度学习.pdf"  # document per local path or URL
# source = "https://arxiv.org/pdf/2408.09869"
result = converter.convert(source)
markdown_content = result.document.export_to_markdown()
print(markdown_content)
with open("C:/Users/53559/PycharmProjects/pythonProject/output4.txt", "w", encoding="utf-8") as file:
    file.write(markdown_content)

docling 2.15.1
docling-core 2.14.0
docling-ibm-models 3.1.2
docling-parse 3.0.0


RuiZheZhangQ · 2025-01-15T09:55:28Z

Me too.
Docling may not be good at this.

dolfim-ibm · 2025-01-15T17:26:51Z

Do you know if your PDF document is programmatic or scanned? In the first case, we are just updating the set of supported fonts in the parser, and it might improve soon.

We also found that sometimes (in rare cases) the previous parser works better. You could give it a try with
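One rough way to tell the two cases apart is to convert with OCR switched off and see whether any text comes out. The sketch below is only a heuristic, not an official docling check: the helper names `has_text_layer` and `probe_pdf` and the 50-character threshold are made up for illustration, and it assumes the Markdown export marks pictures with the `<!-- image -->` placeholder.

```python
def has_text_layer(markdown_without_ocr, min_chars=50):
    """Heuristic: substantial text without OCR suggests a programmatic PDF.

    min_chars is an arbitrary, illustrative threshold.
    """
    # Drop picture placeholders (assumed format) so image-only pages count as empty.
    stripped = markdown_without_ocr.replace("<!-- image -->", "")
    visible = "".join(ch for ch in stripped if not ch.isspace())
    return len(visible) >= min_chars


def probe_pdf(source):
    """Convert `source` with OCR disabled and report whether text was found."""
    # Local imports: the helper above stays usable without docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = False  # rely on the embedded text layer only
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
    result = converter.convert(source)
    return has_text_layer(result.document.export_to_markdown())
```

If `probe_pdf(...)` returns True but the text is garbled, the problem is more likely in the parser's font handling than in OCR; if it returns False, the document is probably scanned and OCR settings matter.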

docling --pdf-backend=dlparse_v1 PDF_FILE
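For anyone using the Python API rather than the CLI, the same backend choice can probably be expressed as below. This is a sketch that assumes the module layout of the docling 2.15 series, where `DoclingParseDocumentBackend` is the v1 parser; the function name `build_v1_converter` is invented here.

```python
def build_v1_converter():
    """Return a DocumentConverter that parses PDFs with the older v1 backend."""
    # Local imports so this module loads even where docling is not installed.
    # Module paths are assumed from the docling 2.15 series.
    from docling.backend.docling_parse_backend import DoclingParseDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
                backend=DoclingParseDocumentBackend,  # v1 parser instead of the default
            )
        }
    )
```

Usage would then be `build_v1_converter().convert(source)`, with the rest of the original script unchanged.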

RuiZheZhangQ · 2025-01-16T08:29:41Z

I solved part of the problem. When your PDF is encrypted, you need to set some parameters, such as:
pipeline_options.ocr_options.lang = ["chi_sim"]
pipeline_options.ocr_options.force_full_page_ocr = True
With these settings you get better Markdown, but it is still not satisfactory. English conversion is clearly better in this respect.
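Put together as a complete configuration, the full-page OCR setup might look like the sketch below. Note that language codes depend on the OCR engine: `chi_sim` is the Tesseract spelling, while EasyOCR (used in the original question) expects `ch_sim`. The `OCR_LANGS` table and the function name are illustrative, and the option classes are assumed to be importable from `docling.datamodel.pipeline_options` as in the docling 2.15 series.

```python
# Simplified Chinese + English language codes, per OCR engine.
OCR_LANGS = {
    "easyocr": ["ch_sim", "en"],
    "tesseract": ["chi_sim", "eng"],
}


def build_full_page_ocr_options(engine="easyocr"):
    """Return PdfPipelineOptions that OCR every page in full."""
    # Local import so the language table is usable without docling installed.
    from docling.datamodel.pipeline_options import (
        EasyOcrOptions,
        PdfPipelineOptions,
        TesseractOcrOptions,
    )

    option_classes = {"easyocr": EasyOcrOptions, "tesseract": TesseractOcrOptions}
    ocr_options = option_classes[engine](
        lang=OCR_LANGS[engine],
        force_full_page_ocr=True,  # ignore the embedded text layer, OCR whole pages
    )

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.ocr_options = ocr_options
    return pipeline_options
```

The returned object can be passed as `PdfFormatOption(pipeline_options=...)` exactly as in the original script.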

madinshibo added the question Further information is requested label Jan 14, 2025
