docx parsing headings with outline format numbered-style applied is not working properly. #795
Comments
I found the same problem when parsing another file that combines both outline and non-outline formats. The presence of the non-outline format means the outline-format output is not generated correctly either. Docling version
Update: It seems the problem is not the outline. Testing with different outlines and formats doesn't resolve the issue. The point is the label parsing:
Applying the changes suggested by @jbtelice locally worked for me, but it would be nice to include them in the core code.
Hi!
Yep, indeed, that's what I mean (in the Additional Context section). In fact, the mechanism for assigning label_str and label_level is language-dependent if you rely on "style_id", and could be fixed by taking name instead of style_id (the label name seems to be language-invariant, but I didn't have time to check it carefully).
Thanks @MiguelAngelTorres for pointing out the reference. 👀
Same issue: #612
Thanks @jbtelice, @MiguelAngelTorres,
Bug
msword_backend.py doesn't properly parse docx files with headings ("Heading 1", "Heading 2", etc.). When Outline format is applied to a heading:
and "Numbering style" is activated as follows:
the parser labels that paragraph as a list (which is true) but not as a section_header (which is true as well). That's ambiguous and prone to unnoticed errors (imagine having to process a lot of docx files that you are NOT allowed to edit...).
The consequences are:
Here is a screenshot of what I mean:
Clean headings (No outline format)
Headings with outline format
Steps to reproduce
Here are the docx samples:
bug_example.docx
bug_example_with_list_headings.docx
Here is the output:
Docling version
Python version
Additional Context
Digging into the code, I noticed some other things worth exploring too:
[split_text_and_number](https://github.com/DS4SD/docling/blob/1976584be15b6dea7451001adbfd8b9dc3422235/docling/backend/msword_backend.py#L170): that regex is not trimming the match.groups().
This leads to wrong label parsing.
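To illustrate the trimming point, here is a minimal sketch, not the actual docling code. The regex shape is an assumption about what such a splitter might look like. The only idea it demonstrates is stripping whitespace from each captured group, so that a style such as "Heading 1" yields the label "Heading" rather than "Heading " with a trailing space.

```python
import re
from typing import List

# Assumed pattern for illustration: text followed by digits, or digits followed by text.
_TEXT_NUMBER = re.compile(r"(\D+)(\d+)$|^(\d+)(\D+)")


def split_text_and_number(input_string: str) -> List[str]:
    """Split a label such as 'Heading 1' into its text and number parts."""
    match = _TEXT_NUMBER.match(input_string)
    if match is None:
        # Nothing to split: return the input unchanged.
        return [input_string]
    # Drop the groups of the alternative that did not match, then trim each part.
    parts = [group.strip() for group in match.groups() if group is not None]
    # Discard parts that were only whitespace.
    return [part for part in parts if part]
```

With the trimming in place, a later comparison like `label_str == "Heading"` would no longer fail just because of a trailing space.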
Regarding style, line 200:
Shouldn't it be name instead of style_id?
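As a rough sketch of the name-based approach, the helper below derives a heading level from a style name string. The function `heading_level_from_style_name` is hypothetical and not part of docling. With python-docx, the input would presumably come from `paragraph.style.name` rather than `paragraph.style.style_id`. It relies on the assumption stated above that the name stays "Heading N" regardless of the document language, which has not been verified here.

```python
import re
from typing import Optional

# Matches "Heading 1", "heading 2", "Heading3", and so on (case-insensitive).
_HEADING_NAME = re.compile(r"^heading\s*(\d+)$", re.IGNORECASE)


def heading_level_from_style_name(style_name: Optional[str]) -> Optional[int]:
    """Return the heading level encoded in a style name, or None if it is not a heading."""
    if not style_name:
        return None
    match = _HEADING_NAME.match(style_name.strip())
    if match is None:
        return None
    return int(match.group(1))
```

A localized style_id (for example one derived from the Spanish "Título 1") would not match this pattern, which is why the sketch expects the style name as input.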
Hope it helps,
Let me know if you need more information.
Have a nice day!