Wrong text output #99
Comments
I've bisected this on a document that is giving me the same problem and it seems to be due to a change that happened between 3.1.2 and 3.2.0. My document and the one above seem to have in common that their text objects consist of individual characters positioned absolutely, like this (you can use
|
Without actually building docling-parse here, if I had to guess, I'd bet that #91 is somehow responsible (but GitHub won't let me see the diff for some reason...) |
Confirmed that it is #91, when comparing
|
Interestingly enough there are no differences in the output for the individual glyphs, just the merged cells. So there isn't any issue in the actual positioning of the text, just the merging. This makes sense since it looks like that's what changed in #91 ... |
The glyphs are also in the right order in |
As I suspected it's this specific commit which caused the problem: 75a28e5 |
If I had to guess (my allergies to C++ are really flaring up here) then I would say that the problem is the function |
Can further confirm that switching back to
@@ -63350,30 +63304,7 @@
740.399,
27.75,
740.399,
- "C hp t e 2 - Z N G RI NNE",
- -1,
- 4.445,
- "",
- "STANDARD",
- "/F5",
- "/BAAAAA+Open-Sans-Light",
- false,
- true
- ],
- [
- 38.137,
- 740.399,
- 160.864,
- 757.627,
- 38.137,
- 757.627,
- 160.864,
- 757.627,
- 160.864,
- 740.399,
- 38.137,
- 740.399,
- "a r 8 OIN ODAC",
+ "Chapter 28 - ZONING ORDINANCE",
-1,
4.445,
"",
|
I've gone through the differences between the two methods - clearly the
Unfortunately the documents in question have a peculiarity which is that the bounding boxes of their text cells overlap. This is probably a bug in
But if we look at the output for the first two characters in "Chapter 28", we see this in docling-parse's output (some lines snipped) - note how the
[
27.75,
740.399,
38.528,
"C",
],
[
33.013,
740.399,
43.507,
"h",
],
The correct bounding boxes (from
{
"chars": "C",
"bbox": [
27.74999884375,
753.5077582758436,
33.01271874446999,
761.8628360717153
]
},
{
"chars": "h",
"bbox": [
33.01271875938978,
753.5077582758436,
38.13673131388924,
761.8628360717153
]
},
What this means in practice is that
This means that the effect of "sanitization" is to partition the text into non-overlapping subsets of cells, with spaces in between them, as you can see if you look closely at the results a little ways above:
This is clearly not what you want if you want to actually produce readable text ;-) PR |
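To make the partitioning effect concrete, here is a minimal sketch. It first checks the two pairs of boxes quoted above for horizontal overlap (assuming the first and third numbers in the docling-parse rows are the left and right x coordinates), then runs a toy greedy merge over cells whose widths are overstated. The `Cell` class, `greedy_merge` and `make_cells` are hypothetical helpers written for illustration; this is not docling-parse's actual sanitization code.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    text: str
    x0: float  # left edge
    x1: float  # right edge


def overlaps(a: Cell, b: Cell) -> bool:
    """True if the horizontal extents of the two cells intersect."""
    return b.x0 < a.x1 and a.x0 < b.x1


def greedy_merge(cells, eps=1e-6):
    """Toy merge (an assumption, not the real algorithm): sweep left to
    right, keeping only cells that start at or after the end of the
    previously kept cell; repeat on whatever was skipped."""
    remaining = list(cells)
    merged = []
    while remaining:
        line = [remaining[0]]
        skipped = []
        for cell in remaining[1:]:
            if cell.x0 >= line[-1].x1 - eps:
                line.append(cell)
            else:
                skipped.append(cell)
        merged.append("".join(c.text for c in line))
        remaining = skipped
    return merged


def make_cells(text, advance, width):
    """Place one cell per character, `advance` apart, each `width` wide."""
    return [Cell(ch, i * advance, i * advance + width) for i, ch in enumerate(text)]


# Boxes quoted in this thread (x coordinates only).
reported = [Cell("C", 27.75, 38.528), Cell("h", 33.013, 43.507)]
correct = [
    Cell("C", 27.74999884375, 33.01271874446999),
    Cell("h", 33.01271875938978, 38.13673131388924),
]

print(overlaps(*reported))  # the overstated boxes intersect
print(overlaps(*correct))   # the correct boxes do not

# Hypothetical layout: every glyph advances 5 units.
print(greedy_merge(make_cells("Chapter", 5.0, 5.0)))   # widths equal to the advance
print(greedy_merge(make_cells("Chapter", 5.0, 10.0)))  # widths twice the advance
```

With widths twice the advance, every other character ends up in a separate merged cell, which is the same kind of interleaving seen in the two cells of the diff above.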
Fixes: docling-project#99 Signed-off-by: David Huggins-Daines <[email protected]>
In looking more closely at the problem, I think the underlying issue causing overlapping character cells is the handling of Type3 fonts, which might be internal to QPDF. If we look at the first page of the document in question, we see correctly sanitized text "Stafford County, VA Code of Ordinances", which uses the font
All of the text with overlapping cells appears to be in |
Yes, the underlying problem here is that
First note that the correct width for the "C" in "Chapter 28 - ZONING ORDINANCE" on the first page (the glyph noted above) should be
If we look inside the font in question, we can see how the width and height should be calculated. First we get the width of the character in glyph space:
And now in text space:
And now in default user space (NOTE: this isn't right if there's rotation but it's sufficient for this example):
If we have the wrong font matrix (that is, if we ignore the
|
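The three steps described above (glyph space, text space, default user space) can be sketched as follows. In PDF, a Type3 font's widths are given in glyph space and its `/FontMatrix` maps glyph space to text space, whereas other simple fonts implicitly use 1/1000. All numbers below are hypothetical, since the actual font dictionary values are not reproduced here. The 1/2048 matrix is only a guess, prompted by the ratio of the two widths for "C" quoted earlier (10.778 / 5.263 ≈ 2.048). Like the original note, this ignores rotation and skew.

```python
from typing import Sequence

# Implicit glyph-to-text scaling used for non-Type3 simple fonts.
DEFAULT_FONT_MATRIX = (0.001, 0.0, 0.0, 0.001, 0.0, 0.0)


def glyph_to_text(width: float, font_matrix: Sequence[float]) -> float:
    """Scale a horizontal glyph-space width into unscaled text space.
    Without skew, only the first matrix entry affects the x direction."""
    return width * font_matrix[0]


def text_to_user(width: float, font_size: float, x_scale: float = 1.0) -> float:
    """Apply the font size and a horizontal scale standing in for the
    text matrix and CTM (assumes no rotation)."""
    return width * font_size * x_scale


# Hypothetical values, for illustration only.
glyph_width = 1300.0
type3_matrix = (1 / 2048, 0.0, 0.0, 1 / 2048, 0.0, 0.0)
font_size = 8.0

right = text_to_user(glyph_to_text(glyph_width, type3_matrix), font_size)
wrong = text_to_user(glyph_to_text(glyph_width, DEFAULT_FONT_MATRIX), font_size)

print(f"with the font's own matrix: {right:.3f}")
print(f"with the default matrix:    {wrong:.3f}")
print(f"ratio:                      {wrong / right:.3f}")
```

Under these assumptions, ignoring the font's own matrix inflates every advance by the same factor, so neighbouring character cells overlap even though their starting positions are correct.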
I would find it hard to believe that QPDF interprets Type3 fonts wrong, since doing so would completely break PDF rendering ;-) And yes indeed it looks like the problem is in docling-parse, here: https://github.com/docling-project/docling-parse/blob/main/src/v2/pdf_resources/page_font.h#L244 This looks quite wrong: |
Probably related, but with this PDF I also get weirdly spaced-out text like: ## OVERVIEW OF THE SECURITISATIO N TRA N SACTIO N It renders correctly in macOS Preview, and when I copy the text from there, those spaces aren't present. |
Probably not the same bug in this case - those are TrueType fonts in that document, and the text isn't partitioned between two sanitized cells in the same way (I didn't look at the output of parse_v2.exe yet though...) It's still a bug in docling-parse, though, because there's no reason why it should insert spaces there (PLAYA does not, for instance, nor does the old reliable
$ playa --page 11 --content-streams ../Prospectus-_2024-10-16.pdf
# ... snipped ...
/Span <</MCID 11/Lang (en-GB)>> BDC q
0.000008871 0 595.32 841.92 re
W* n
BT
/F8 11.04 Tf
1 0 0 1 56.88 657.58 Tm
0 g
0 G
[(O)-4(V)-5(E)4(R)15(V)-5(IEW)13( O)-4(F T)17(H)-4(E)4( SE)16(C)5(U)5(R)5(ITISA)6(T)4(IO)-6(N)5( T)4(R)5(A)5(N)5(SA)6(C)5(T)4(IO)-6(N)] TJ
ET
Q
EMC
Can you file this as a separate bug? |
Stafford.County.-.VA.Zoning.Ordinance.pdf