Wrong text output #99

PeterStaar-IBM · 2025-02-06T09:26:48Z

Stafford.County.-.VA.Zoning.Ordinance.pdf

dhdaines · 2025-02-27T20:55:47Z

I've bisected this on a document that gives me the same problem, and it seems to be due to a change that happened between 3.1.2 and 3.2.0. My document and the one above have in common that their text objects consist of individual characters positioned absolutely, like this (you can use playa --content-streams to see this easily 😉):

BT
/F4 10.6599998 Tf
1 0 0 -1 34.015625 29 Tm
<0016> Tj
5.9285736 0 Td <0012> Tj
2.9616852 0 Td <0014> Tj
5.9285736 0 Td <001B> Tj
5.9285736 0 Td <0012> Tj
2.9616852 0 Td <0015> Tj
5.9285736 0 Td <0017> Tj
5.9285736 0 Td <000F> Tj
2.9616852 0 Td <0003> Tj
2.9616852 0 Td <0014> Tj
5.9285736 0 Td <0015> Tj
5.9285736 0 Td <001D> Tj
2.9616852 0 Td <0015> Tj
5.9285736 0 Td <0019> Tj
5.9285736 0 Td <0003> Tj
2.9616852 0 Td <0033> Tj
7.1101227 0 Td <0030> Tj
ET

~~Probably there is some bug in the way docling-parse is handling the text matrix that is causing the characters to end up in the wrong place when they are positioned on the page.~~ They are nonetheless in the correct reading order in the content stream.
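To make the absolute positioning concrete, here is a minimal sketch (not code from docling-parse or PLAYA) that walks the first few operators of the text object above and records where each glyph starts. The helper name `glyph_origins` is made up for illustration, and it only understands the Tm, Td and Tj operators, which is all this stream uses.

```python
# First four glyphs of the text object quoted above.
CONTENT = """\
BT
/F4 10.6599998 Tf
1 0 0 -1 34.015625 29 Tm
<0016> Tj
5.9285736 0 Td <0012> Tj
2.9616852 0 Td <0014> Tj
5.9285736 0 Td <001B> Tj
ET
"""


def glyph_origins(stream):
    """Return (hex string, x, y) for every Tj, using the text line matrix.

    Hypothetical helper: handles only Tm, Td and Tj. Td offsets are relative
    to the start of the current line, so Tj advances do not affect them.
    """
    tlm = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]  # text line matrix [a b c d e f]
    operands = []
    origins = []
    for token in stream.split():
        if token == "Tm":
            tlm = [float(v) for v in operands[-6:]]
            operands.clear()
        elif token == "Td":
            tx, ty = (float(v) for v in operands[-2:])
            a, b, c, d, e, f = tlm
            # Pre-multiply the line matrix by a translation of (tx, ty).
            tlm = [a, b, c, d, tx * a + ty * c + e, tx * b + ty * d + f]
            operands.clear()
        elif token == "Tj":
            origins.append((operands[-1], tlm[4], tlm[5]))
            operands.clear()
        else:
            operands.append(token)
    return origins


origins = glyph_origins(CONTENT)
for code, x, y in origins:
    print(code, round(x, 4), y)
```

Each glyph gets its own origin from the accumulated Td offsets, and the x values increase from one Tj to the next, which matches the observation that the content stream is already in reading order.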

dhdaines · 2025-02-27T21:01:27Z

Without actually building docling-parse here, if I had to guess, I'd bet that #91 is somehow responsible (but GitHub won't let me see the diff for some reason...)

dhdaines · 2025-02-27T21:34:46Z

Confirmed that it is #91, when comparing parse_v2.exe output between 760b932..v3.2.0 I get this sort of difference:

@@ -35424,7 +35512,139 @@
               648.32,
               37.074,
               648.32,
-              "Consult the following schematics for information about the various ports used by TrueSight Infrastructure Management. The port numbers",
+              "C os u t   t h  f o o i n  shm t i c  f o   i n o m t i o  aot   t h  vr os prs u e  b  T uS gt   I n r at r u t u e M ngm n .   Te pr t   nm e s",
+              -1,
+              4.149,
+              "",
+              "STANDARD",
+              "/F4",
+              "/AAAAAA+OpenSans",
+              false
+            ],
+            [

dhdaines · 2025-02-27T21:36:47Z

Interestingly enough there are no differences in the output for the individual glyphs, just the merged cells. So there isn't any issue in the actual positioning of the text, just the merging.

This makes sense since it looks like that's what changed in #91 ...

dhdaines · 2025-02-28T02:51:06Z

The glyphs are also in the right order in ["pages"][0]["original"]["cells"]["data"], and the x-positions are correct. So it is just the "sanitizing" that is not working correctly for some reason.

dhdaines · 2025-02-28T19:19:42Z

As I suspected it's this specific commit which caused the problem: 75a28e5

dhdaines · 2025-02-28T19:21:35Z

If I had to guess (my allergies to C++ are really flaring up here) then I would say that the problem is the function pdf_sanitator<PAGE_CELLS>::contract_cells_into_lines_v2

dhdaines · 2025-03-05T16:39:28Z

Can further confirm that switching back to contract_cells_into_lines_v1 in https://github.com/DS4SD/docling-parse/blob/main/src/v2/pdf_sanitators/cells.h#L153 solves the problem, for instance:

@@ -63350,30 +63304,7 @@
               740.399,
               27.75,
               740.399,
-              "C hp t e   2  -   Z N G RI NNE",
-              -1,
-              4.445,
-              "",
-              "STANDARD",
-              "/F5",
-              "/BAAAAA+Open-Sans-Light",
-              false,
-              true
-            ],
-            [
-              38.137,
-              740.399,
-              160.864,
-              757.627,
-              38.137,
-              757.627,
-              160.864,
-              757.627,
-              160.864,
-              740.399,
-              38.137,
-              740.399,
-              "a r 8 OIN ODAC",
+              "Chapter 28 - ZONING ORDINANCE",
               -1,
               4.445,
               "",

~~Can you explain the difference between the two contract_cells_into_lines methods? They are not quite obfuscated C++ but they are not exactly clearly written either.~~

dhdaines · 2025-03-05T17:42:53Z

I've gone through the differences between the two methods - clearly the v2 has been refactored to be much clearer and more general with respect to RTL and rotated text.

Unfortunately the documents in question have a peculiarity which is that the bounding boxes of their text cells overlap. This is probably a bug in docling-parse and not inherent to the actual PDF content, because if I plot the character cells corresponding to the Tj operators with PAVÉS we can see that (a) they are in the correct reading order, and (b) their actual bounding boxes, as defined by the font's character widths, do not overlap:

But if we look at the output for the first two characters in "Chapter 28", we see this in docling-parse's output (some lines snipped). Note how the x1 value for the first character (38.528) is actually greater than the x0 value for the second character (33.013).

The correct bounding boxes (from playa --text-objects, also trimmed) should be:

{
  "chars": "C",
  "bbox": [
    27.74999884375,
    753.5077582758436,
    33.01271874446999,
    761.8628360717153
  ]
},
{
  "chars": "h",
  "bbox": [
    33.01271875938978,
    753.5077582758436,
    38.13673131388924,
    761.8628360717153
  ]
},

What this means in practice is that pdf_resource<PAGE_CELL>.is_adjacent_to and pdf_resource<PAGE_CELL>.merge_with do not do the right thing, because they assume that cells do not overlap. ~~This means that spaces get inserted everywhere, and also, presumably, that cells get put in wacky orders (actually I am not totally sure why that happens).~~
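As a quick check of the numbers quoted in this comment, the sketch below compares the right edge of "C" with the left edge of "h" for both sets of boxes. The function name `overlaps` and the tolerance are illustrative choices, not anything taken from docling-parse.

```python
def overlaps(prev_x1, next_x0, tol=1e-3):
    """True if the previous cell extends past the start of the next one."""
    return prev_x1 - next_x0 > tol


# Bounding boxes from playa --text-objects (quoted above).
playa_c_x1 = 33.01271874446999
playa_h_x0 = 33.01271875938978

# Values reported for docling-parse's output in this comment.
docling_c_x1 = 38.528
docling_h_x0 = 33.013

print("playa boxes overlap:  ", overlaps(playa_c_x1, playa_h_x0))
print("docling boxes overlap:", overlaps(docling_c_x1, docling_h_x0))
print("docling overlap in points:", round(docling_c_x1 - docling_h_x0, 3))
```

With the PLAYA boxes the two edges coincide to within rounding, while the docling-parse box for "C" reaches about 5.5 points into the box for "h".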

This means that the effect of "sanitization" is to partition the text into non-overlapping subsets of cells, with spaces in between them, as you can see if you look closely at the results a little ways above:

"C hp t e   2  -   Z N G RI NNE",
   "a r 8 OIN ODAC",

This is clearly not what you want if you want to actually produce readable text ;-)
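The partitioning effect can be reproduced with a toy model. This is only a sketch under strong assumptions: it is not the logic of contract_cells_into_lines_v2, the names `Cell`, `make_cells` and `greedy_lines` are invented here, every glyph gets the same width, and space insertion is left out. It simply merges a cell into the current line when it starts at or after the right edge of the last merged cell.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    text: str
    x0: float
    x1: float


def make_cells(text, width=1.0, inflate=1.0):
    """Lay glyphs out left to right; `inflate` stretches each box to the right."""
    cells = []
    x = 0.0
    for ch in text:
        cells.append(Cell(ch, x, x + width * inflate))
        x += width
    return cells


def greedy_lines(cells, tol=1e-6):
    """Toy merge: a cell joins the line only if it does not overlap the last one."""
    remaining = list(cells)
    lines = []
    while remaining:
        line = [remaining[0]]
        skipped = []
        for cell in remaining[1:]:
            if cell.x0 >= line[-1].x1 - tol:
                line.append(cell)
            else:
                skipped.append(cell)
        lines.append("".join(c.text for c in line))
        remaining = skipped
    return lines


print(greedy_lines(make_cells("Chapter")))               # boxes as they should be
print(greedy_lines(make_cells("Chapter", inflate=2.0)))  # boxes twice too wide
```

With correct boxes the toy yields one line; with boxes twice as wide it splits the word into interleaved pieces, which is the same kind of partition as the two strings quoted above.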

PR ~~forthcoming~~ here: #105

dhdaines · 2025-03-25T14:40:12Z

In looking more closely at the problem, I think the underlying issue causing overlapping character cells is the handling of Type3 fonts, which might be internal to QPDF. If we look at the first page of the document in question, we see correctly sanitized text "Stafford County, VA Code of Ordinances", which uses the font AAAAAA+ArialMT, which is a subsetted TrueType font.

All of the text with overlapping cells appears to be in BAAAAA+Open-Sans-Light which is Type3.

dhdaines · 2025-03-25T15:44:23Z

Yes, the underlying problem here is that ~~someone (maybe qpdf, maybe~~ docling-parse ~~)~~ is not actually looking at the FontMatrix of Type3 fonts, and just assumes that it is [0.001 0 0 0.001 0 0]. In this case, the Type3 font in question has a FontMatrix of [0.00048828125 0 0 -0.00048828125 0 0].

First note that the correct width for the "C" in "Chapter 28 - ZONING ORDINANCE" on the first page (the glyph noted above) should be 33.01 - 27.75 = 5.26. But docling-parse thinks it is 38.528 - 27.75 = 10.778.

If we look inside the font in question, we can see how the width and height should be calculated. First we get the width of the character in glyph space:

cid, chars = next(t.textstate.font.decode(t.args[0]))
cwidth = t.textstate.font.widths[cid]  # = 1290

And now in text space:

gwidth = cwidth * t.textstate.font.matrix[0]  # = 0.630
twidth = gwidth * t.textstate.fontsize * t.textstate.scaling / 100 # = 8.818

And now in default user space (NOTE: this isn't right if there's rotation but it's sufficient for this example):

width = twidth * t.ctm[0] # = 5.26

If we have the wrong font matrix (that is, if we ignore the FontMatrix and consider Type3 fonts to be in the same glyph space as Type1 fonts, namely [0.001 0 0 0.001 0 0]), then instead we get:

gwidth = cwidth * 0.001  # = 1.29
twidth = gwidth * t.textstate.fontsize * t.textstate.scaling / 100  # = 18.06
width = twidth * t.ctm[0] # = 10.778
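The two calculations can be put side by side in a standalone form that does not need PLAYA. The character width and both matrices come from the comment above; the font size of 14 and scaling of 100 are implied by the quoted intermediate values (1.29 × 14 = 18.06), and the CTM scale of 0.5968 is an assumption back-computed from 10.778 / 18.06. The function name is made up for this sketch.

```python
CWIDTH = 1290                   # glyph-space width of "C", from the font
FONT_SIZE = 14.0                # implied by 1.29 * fontsize = 18.06
SCALING = 100.0                 # horizontal scaling in percent (implied)
CTM_A = 0.5968                  # assumption: back-computed from 10.778 / 18.06
TYPE3_SCALE = 0.00048828125     # FontMatrix[0] of the Type3 font
ASSUMED_SCALE = 0.001           # what is being used instead


def user_space_width(cwidth, glyph_scale, font_size, scaling, ctm_a):
    """Glyph space -> text space -> default user space (no rotation)."""
    gwidth = cwidth * glyph_scale
    twidth = gwidth * font_size * scaling / 100
    return twidth * ctm_a


correct_width = user_space_width(CWIDTH, TYPE3_SCALE, FONT_SIZE, SCALING, CTM_A)
wrong_width = user_space_width(CWIDTH, ASSUMED_SCALE, FONT_SIZE, SCALING, CTM_A)
inflation = wrong_width / correct_width

print(round(correct_width, 3), round(wrong_width, 3), round(inflation, 3))
```

The ratio between the two widths depends only on the two scale factors (0.001 / 0.00048828125 = 2.048), so every glyph in this font ends up with a box roughly twice as wide as it should be, whatever the font size or CTM.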

dhdaines · 2025-03-25T15:53:57Z

I would find it hard to believe that QPDF interprets Type3 fonts wrong, since doing so would completely break PDF rendering ;-) And yes indeed it looks like the problem is in docling-parse, here:

https://github.com/docling-project/docling-parse/blob/main/src/v2/pdf_resources/page_font.h#L244

This looks quite wrong: double unit = 1.0; // 1000.0
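For illustration only, here is what "looking at the FontMatrix" amounts to, written in Python rather than as a patch to page_font.h. The dictionary layout and the helper name `horizontal_glyph_scale` are assumptions made for this sketch. It relies on two facts from the PDF specification: Type3 fonts must carry a FontMatrix, and other simple fonts use a glyph space of 1/1000 of a text space unit.

```python
DEFAULT_GLYPH_SCALE = 0.001  # glyph space for Type1/TrueType: 1000 units per text unit


def horizontal_glyph_scale(font_dict):
    """Return the factor that converts glyph-space widths to text space.

    Hypothetical helper: only the horizontal scale (FontMatrix[0]) is used,
    so skewed or rotated font matrices are not handled.
    """
    if font_dict.get("Subtype") == "Type3":
        return font_dict["FontMatrix"][0]
    return DEFAULT_GLYPH_SCALE


type3_font = {
    "Subtype": "Type3",
    "FontMatrix": [0.00048828125, 0, 0, -0.00048828125, 0, 0],
}
truetype_font = {"Subtype": "TrueType"}

print(horizontal_glyph_scale(type3_font))     # 1/2048
print(horizontal_glyph_scale(truetype_font))  # 1/1000
```

The value 0.00048828125 is exactly 1/2048, so this particular font was designed on a 2048-unit em rather than a 1000-unit one.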

daaain · 2025-03-25T22:07:49Z

Probably related, but with this PDF I also get weirdly spaced out text like:

## OVERVIEW OF THE SECURITISATIO N TRA N SACTIO N

It renders correctly in macOS Preview, though, and those spaces aren't present when I copy the text from there.

dhdaines · 2025-03-26T11:16:02Z

> Probably related, but with this PDF I also get weirdly spaced out text like:
>
> OVERVIEW OF THE SECURITISATIO N TRA N SACTIO N
>
> Even though it renders correctly in Mac OS Preview and copying the text from there, those spaces aren't present.

Probably not the same bug in this case - those are TrueType fonts in that document, and the text isn't partitioned between two sanitized cells in the same way (I didn't look at the output of parse_v2.exe yet though...)

It's still a bug in docling-parse, though, because there's no reason why it should insert spaces there (PLAYA does not, for instance, nor does the old reliable pdftotext tool from Poppler). It's a tagged PDF, with explicit spaces in the text objects:

$ playa --page 11 --content-streams ../Prospectus-_2024-10-16.pdf
# ... snipped ...
/Span <</MCID 11/Lang (en-GB)>> BDC q
0.000008871 0 595.32 841.92 re
W* n
BT
/F8 11.04 Tf
1 0 0 1 56.88 657.58 Tm
0 g
0 G
[(O)-4(V)-5(E)4(R)15(V)-5(IEW)13( O)-4(F T)17(H)-4(E)4( SE)16(C)5(U)5(R)5(ITISA)6(T)4(IO)-6(N)5( T)4(R)5(A)5(N)5(SA)6(C)5(T)4(IO)-6(N)] TJ
ET
Q
 EMC
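A rough way to see that nothing in this TJ array calls for extra spaces is to separate the strings from the numeric adjustments. In a TJ array each number is expressed in thousandths of a text space unit and is subtracted from the current position, so these values are small kerning shifts. The sketch below uses a simple regular expression that is enough for this array (no escaped or nested parentheses) and assumes 100% horizontal scaling; `split_tj` is a name invented here.

```python
import re

TJ_ARRAY = (
    "[(O)-4(V)-5(E)4(R)15(V)-5(IEW)13( O)-4(F T)17(H)-4(E)4( SE)16(C)5(U)5(R)5"
    "(ITISA)6(T)4(IO)-6(N)5( T)4(R)5(A)5(N)5(SA)6(C)5(T)4(IO)-6(N)]"
)
FONT_SIZE = 11.04  # from "/F8 11.04 Tf"

# Either a parenthesised string or a number; good enough for this array only.
TOKEN = re.compile(r"\(([^()]*)\)|(-?\d+(?:\.\d+)?)")


def split_tj(array):
    """Split a TJ array into its string pieces and numeric adjustments."""
    strings, adjustments = [], []
    for text, number in TOKEN.findall(array):
        if number:
            adjustments.append(float(number))
        else:
            strings.append(text)
    return strings, adjustments


strings, adjustments = split_tj(TJ_ARRAY)
text = "".join(strings)
largest_shift = max(abs(a) for a in adjustments) / 1000 * FONT_SIZE

print(text)
print("largest adjustment in points:", round(largest_shift, 3))
```

The only spaces in the joined text are the explicit ones inside the strings, and the largest adjustment is under a fifth of a point at this font size, far too small to be a word gap.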

Can you file this as a separate bug?

PeterStaar-IBM added the bug label Feb 6, 2025

PeterStaar-IBM self-assigned this Feb 6, 2025

PeterStaar-IBM mentioned this issue Feb 6, 2025

PDF to MD Conversion with Docling v2.18 is Incomprehensible docling-project/docling#888

Closed

dhdaines pushed a commit to dhdaines/docling-parse that referenced this issue Mar 5, 2025

fix: handle case of overlapping cells in contract_cells_into_lines_v2

950fb12

Fixes: docling-project#99

dhdaines linked a pull request Mar 5, 2025 that will close this issue

fix: handle case of overlapping cells in contract_cells_into_lines_v2 #105

Open

dhdaines pushed a commit to dhdaines/docling-parse that referenced this issue Mar 5, 2025

fix: handle case of overlapping cells in contract_cells_into_lines_v2

b2141c2

Fixes: docling-project#99 Signed-off-by: David Huggins-Daines <[email protected]>

dhdaines pushed a commit to dhdaines/docling-parse that referenced this issue Mar 5, 2025

fix: handle case of overlapping cells in contract_cells_into_lines_v2

5b3dae1

Fixes: docling-project#99 Signed-off-by: David Huggins-Daines <[email protected]>

dhdaines mentioned this issue Mar 6, 2025

docling_parse_v2 split/connect words docling-project/docling#952

Open
