Wrong text output #99

Open
PeterStaar-IBM opened this issue Feb 6, 2025 · 14 comments · May be fixed by #105
Labels: bug (Something isn't working)

@PeterStaar-IBM

Stafford.County.-.VA.Zoning.Ordinance.pdf

@dhdaines

dhdaines commented Feb 27, 2025

I've bisected this on a document that is giving me the same problem and it seems to be due to a change that happened between 3.1.2 and 3.2.0. My document and the one above seem to have in common that their text objects consist of individual characters positioned absolutely, like this (you can use playa --content-streams to see this easily 😉)

BT
/F4 10.6599998 Tf
1 0 0 -1 34.015625 29 Tm
<0016> Tj
5.9285736 0 Td <0012> Tj
2.9616852 0 Td <0014> Tj
5.9285736 0 Td <001B> Tj
5.9285736 0 Td <0012> Tj
2.9616852 0 Td <0015> Tj
5.9285736 0 Td <0017> Tj
5.9285736 0 Td <000F> Tj
2.9616852 0 Td <0003> Tj
2.9616852 0 Td <0014> Tj
5.9285736 0 Td <0015> Tj
5.9285736 0 Td <001D> Tj
2.9616852 0 Td <0015> Tj
5.9285736 0 Td <0019> Tj
5.9285736 0 Td <0003> Tj
2.9616852 0 Td <0033> Tj
7.1101227 0 Td <0030> Tj
ET

Probably there is some bug in the way docling-parse is handling the text matrix that is causing the characters to end up in the wrong place when they are positioned on the page. They are nonetheless in the correct reading order in the content stream.
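To make the "positioned absolutely" point concrete, here is a minimal sketch of how the glyph origins fall out of the Tm and Td operators in the stream above. It is not docling-parse or PLAYA code; `glyph_origins` is a hypothetical helper that only understands Tm, Td and Tj, and it relies on Td being relative to the start of the current line rather than to the end of the previous glyph.

```python
# First few operators of the content stream quoted above.
STREAM = """
1 0 0 -1 34.015625 29 Tm
<0016> Tj
5.9285736 0 Td <0012> Tj
2.9616852 0 Td <0014> Tj
"""

def glyph_origins(stream):
    """Return (x, y, hex_code) for each Tj, tracking only Tm and Td.

    Simplification: the advance caused by Tj itself is ignored, which is
    fine here because every Tj is preceded by an explicit Tm or Td.
    """
    a, b, c, d, e, f = 1.0, 0.0, 0.0, 1.0, 0.0, 0.0  # text line matrix
    origins = []
    stack = []
    for tok in stream.split():
        if tok == "Tm":
            # Tm replaces the text (line) matrix outright.
            a, b, c, d, e, f = (float(v) for v in stack[-6:])
            stack.clear()
        elif tok == "Td":
            # Td translates the line matrix by (tx, ty) in text space.
            tx, ty = (float(v) for v in stack[-2:])
            e, f = tx * a + ty * c + e, tx * b + ty * d + f
            stack.clear()
        elif tok == "Tj":
            origins.append((e, f, stack[-1].strip("<>")))
            stack.clear()
        else:
            stack.append(tok)
    return origins

for x, y, code in glyph_origins(STREAM):
    print(f"{code}: x={x:.4f} y={y:.4f}")
```

Each glyph gets its own x origin from the accumulated Td offsets, so any tool that merges glyphs into lines has to rely on those positions plus the glyph widths.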

@dhdaines

Without actually building docling-parse here, if I had to guess, I'd bet that #91 is somehow responsible (but GitHub won't let me see the diff for some reason...)

@dhdaines

Confirmed that it is #91. Comparing parse_v2.exe output across 760b932..v3.2.0, I get this sort of difference:

@@ -35424,7 +35512,139 @@
               648.32,
               37.074,
               648.32,
-              "Consult the following schematics for information about the various ports used by TrueSight Infrastructure Management. The port numbers",
+              "C os u t   t h  f o o i n  shm t i c  f o   i n o m t i o  aot   t h  vr os prs u e  b  T uS gt   I n r at r u t u e M ngm n .   Te pr t   nm e s",
+              -1,
+              4.149,
+              "",
+              "STANDARD",
+              "/F4",
+              "/AAAAAA+OpenSans",
+              false
+            ],
+            [

@dhdaines

Interestingly enough there are no differences in the output for the individual glyphs, just the merged cells. So there isn't any issue in the actual positioning of the text, just the merging.

This makes sense since it looks like that's what changed in #91 ...

@dhdaines

The glyphs are also in the right order in ["pages"][0]["original"]["cells"]["data"], and the x-positions are correct. So it is just the "sanitizing" that is not working correctly for some reason.

@dhdaines

As I suspected it's this specific commit which caused the problem: 75a28e5

@dhdaines

If I had to guess (my allergies to C++ are really flaring up here) then I would say that the problem is the function pdf_sanitator<PAGE_CELLS>::contract_cells_into_lines_v2

@dhdaines

dhdaines commented Mar 5, 2025

Can further confirm that switching back to contract_cells_into_lines_v1 in https://github.com/DS4SD/docling-parse/blob/main/src/v2/pdf_sanitators/cells.h#L153 solves the problem, for instance:

@@ -63350,30 +63304,7 @@
               740.399,
               27.75,
               740.399,
-              "C hp t e   2  -   Z N G RI NNE",
-              -1,
-              4.445,
-              "",
-              "STANDARD",
-              "/F5",
-              "/BAAAAA+Open-Sans-Light",
-              false,
-              true
-            ],
-            [
-              38.137,
-              740.399,
-              160.864,
-              757.627,
-              38.137,
-              757.627,
-              160.864,
-              757.627,
-              160.864,
-              740.399,
-              38.137,
-              740.399,
-              "a r 8 OIN ODAC",
+              "Chapter 28 - ZONING ORDINANCE",
               -1,
               4.445,
               "",

Can you explain the difference between the two contract_cells_into_lines methods? They are not quite obfuscated C++ but they are not exactly clearly written either.

@dhdaines

dhdaines commented Mar 5, 2025

I've gone through the differences between the two methods - clearly the v2 has been refactored to be much clearer and more general with respect to RTL and rotated text.

Unfortunately the documents in question have a peculiarity which is that the bounding boxes of their text cells overlap. This is probably a bug in docling-parse and not inherent to the actual PDF content, because if I plot the character cells corresponding to the Tj operators with PAVÉS we can see that (a) they are in the correct reading order, and (b) their actual bounding boxes, as defined by the font's character widths, do not overlap:

[Image: PAVÉS plot of the character cells corresponding to the Tj operators]

But if we look at the output for the first two characters in "Chapter 28", we see this in docling-parse's output (some lines snipped). Note how the x1 value for the first character (38.528) is actually greater than the x0 value for the second character (33.013):

            [
              27.75,
              740.399,
              38.528,
              "C",
            ],
            [
              33.013,
              740.399,
              43.507,
              "h",
            ],

The correct bounding boxes (from playa --text-objects, also trimmed) should be:

{
  "chars": "C",
  "bbox": [
    27.74999884375,
    753.5077582758436,
    33.01271874446999,
    761.8628360717153
  ]
},
{
  "chars": "h",
  "bbox": [
    33.01271875938978,
    753.5077582758436,
    38.13673131388924,
    761.8628360717153
  ]
},

What this means in practice is that pdf_resource<PAGE_CELL>.is_adjacent_to and pdf_resource<PAGE_CELL>.merge_with do not do the right thing, because they assume that cells do not overlap. This means that spaces get inserted everywhere, and also, presumably, that cells get put in wacky orders (actually I am not totally sure why that happens).
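As a rough illustration (not the actual is_adjacent_to logic, which I have not reproduced here), the overlap can be measured directly from the numbers quoted above. The helper `x_overlap` and the tolerance value are assumptions made for this sketch.

```python
def x_overlap(left, right, tol=1e-6):
    """Horizontal overlap between two (x0, x1) spans; 0.0 if within tol."""
    overlap = left[1] - right[0]
    return overlap if overlap > tol else 0.0

# (x0, x1) for "C" and "h" as reported by docling-parse (values from above).
docling_c, docling_h = (27.75, 38.528), (33.013, 43.507)

# (x0, x1) for the same glyphs as reported by playa --text-objects.
playa_c = (27.74999884375, 33.01271874446999)
playa_h = (33.01271875938978, 38.13673131388924)

print("docling-parse overlap:", x_overlap(docling_c, docling_h))
print("playa overlap:", x_overlap(playa_c, playa_h))
```

With the docling-parse boxes, "C" extends about 5.5 units into "h", which is more than the whole width of the correct "C" box. With the playa boxes the two cells simply touch.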

This means that the effect of "sanitization" is to partition the text into non-overlapping subsets of cells, with spaces in between them, as you can see if you look closely at the results a little way above:

"C hp t e   2  -   Z N G RI NNE",
   "a r 8 OIN ODAC",

This is clearly not what you want if you want to actually produce readable text ;-)
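The partition claim can be checked mechanically. The sketch below strips the inserted spaces from the two sanitized cells and tests whether they interleave, in order, into the expected line. `is_interleaving` is a hypothetical helper written for this check, not something from docling-parse.

```python
from functools import lru_cache

def strip_spaces(s):
    """Drop all whitespace so only the glyph characters remain."""
    return "".join(s.split())

def is_interleaving(a, b, target):
    """True if target can be formed by merging a and b, keeping each one's order."""
    if len(a) + len(b) != len(target):
        return False

    @lru_cache(maxsize=None)
    def go(i, j):
        k = i + j
        if k == len(target):
            return True
        # Try taking the next character from a, then from b.
        if i < len(a) and a[i] == target[k] and go(i + 1, j):
            return True
        if j < len(b) and b[j] == target[k] and go(i, j + 1):
            return True
        return False

    return go(0, 0)

cell_1 = strip_spaces("C hp t e   2  -   Z N G RI NNE")
cell_2 = strip_spaces("a r 8 OIN ODAC")
expected = strip_spaces("Chapter 28 - ZONING ORDINANCE")

print(is_interleaving(cell_1, cell_2, expected))
```

So no characters are lost or reordered within each cell; they are only split across two cells.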

PR forthcoming here: #105

dhdaines pushed a commit to dhdaines/docling-parse that referenced this issue Mar 5, 2025
Fixes: docling-project#99
Signed-off-by: David Huggins-Daines <[email protected]>
dhdaines pushed a commit to dhdaines/docling-parse that referenced this issue Mar 25, 2025
@dhdaines

dhdaines commented Mar 25, 2025

In looking more closely at the problem, I think the underlying issue causing overlapping character cells is the handling of Type3 fonts, which might be internal to QPDF. If we look at the first page of the document in question, we see correctly sanitized text "Stafford County, VA Code of Ordinances", which uses the font AAAAAA+ArialMT, which is a subsetted TrueType font.

All of the text with overlapping cells appears to be in BAAAAA+Open-Sans-Light which is Type3.

@dhdaines

dhdaines commented Mar 25, 2025

Yes, the underlying problem here is that someone (maybe qpdf, maybe docling-parse) is not actually looking at the FontMatrix of Type3 fonts, and just assumes that it is [0.001 0 0 0.001 0 0]. In this case, the Type3 font in question has a FontMatrix of [0.00048828125 0 0 -0.00048828125 0 0].

First note that the correct width for the "C" in "Chapter 28 - ZONING ORDINANCE" on the first page (the glyph noted above) should be 33.01 - 27.75 = 5.26. But docling-parse thinks it is 38.528 - 27.75 = 10.778.

If we look inside the font in question, we can see how the width and height should be calculated. First we get the width of the character in glyph space:

cid, chars = next(t.textstate.font.decode(t.args[0]))
cwidth = t.textstate.font.widths[cid]  # = 1290

And now in text space:

gwidth = cwidth * t.textstate.font.matrix[0]  # = 0.630
twidth = gwidth * t.textstate.fontsize * t.textstate.scaling / 100 # = 8.818

And now in default user space (NOTE: this isn't right if there's rotation but it's sufficient for this example):

width = twidth * t.ctm[0] # = 5.26

If we have the wrong font matrix (that is, if we ignore the FontMatrix and consider Type3 fonts to be in the same glyph space as Type1 fonts, namely [0.001 0 0 0.001 0 0]), then instead we get:

gwidth = cwidth * 0.001  # = 1.29
twidth = gwidth * t.textstate.fontsize * t.textstate.scaling / 100  # = 18.06
width = twidth * t.ctm[0] # = 10.778
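The fragments above need a live PLAYA text object `t`, so here is a self-contained sanity check that uses only numbers already quoted in this thread. Since font size, scaling and CTM are the same in both calculations, the wrong width should just be the correct width multiplied by the ratio of the two glyph-space scales. The variable names are mine, chosen for illustration.

```python
# Glyph-space scale assumed when the FontMatrix is ignored (Type1-style).
ASSUMED_SCALE = 0.001
# First entry of the actual Type3 FontMatrix quoted above (1/2048).
FONT_MATRIX_A = 0.00048828125

# Correct width of "C" from the playa --text-objects bbox.
correct_width = 33.01271874446999 - 27.74999884375

# Everything else (font size, scaling, CTM) cancels out of the ratio.
inflation = ASSUMED_SCALE / FONT_MATRIX_A
predicted_wrong_width = correct_width * inflation

# Width of "C" implied by the docling-parse output.
observed_wrong_width = 38.528 - 27.75

print(f"inflation factor:      {inflation:.3f}")
print(f"predicted wrong width: {predicted_wrong_width:.3f}")
print(f"observed wrong width:  {observed_wrong_width:.3f}")
```

The predicted and observed values agree to within rounding of the three-decimal output, which is consistent with the FontMatrix being ignored.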

@dhdaines

I would find it hard to believe that QPDF interprets Type3 fonts wrong, since doing so would completely break PDF rendering ;-) And yes indeed it looks like the problem is in docling-parse, here:

https://github.com/docling-project/docling-parse/blob/main/src/v2/pdf_resources/page_font.h#L244

This looks quite wrong: double unit = 1.0; // 1000.0

@daaain

daaain commented Mar 25, 2025

Probably related, but with this PDF I also get weirdly spaced out text like:

## OVERVIEW OF THE SECURITISATIO N TRA N SACTIO N

Even though it renders correctly in Mac OS Preview and copying the text from there, those spaces aren't present.

@dhdaines

dhdaines commented Mar 26, 2025

> Probably related, but with this PDF I also get weirdly spaced out text like:
>
> OVERVIEW OF THE SECURITISATIO N TRA N SACTIO N
>
> Even though it renders correctly in Mac OS Preview and copying the text from there, those spaces aren't present.

Probably not the same bug in this case - those are TrueType fonts in that document, and the text isn't partitioned between two sanitized cells in the same way (I didn't look at the output of parse_v2.exe yet though...)

It's still a bug in docling-parse, though, because there's no reason why it should insert spaces there (PLAYA does not, for instance, nor does the old reliable pdftotext tool from Poppler). It's a tagged PDF, with explicit spaces in the text objects:

$ playa --page 11 --content-streams ../Prospectus-_2024-10-16.pdf
# ... snipped ...
/Span <</MCID 11/Lang (en-GB)>> BDC q
0.000008871 0 595.32 841.92 re
W* n
BT
/F8 11.04 Tf
1 0 0 1 56.88 657.58 Tm
0 g
0 G
[(O)-4(V)-5(E)4(R)15(V)-5(IEW)13( O)-4(F T)17(H)-4(E)4( SE)16(C)5(U)5(R)5(ITISA)6(T)4(IO)-6(N)5( T)4(R)5(A)5(N)5(SA)6(C)5(T)4(IO)-6(N)] TJ
ET
Q
 EMC
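To show what "explicit spaces in the text objects" means, here is a small sketch that pulls the literal strings and the numeric adjustments out of the TJ array above. The regex is a simplification that I am assuming is good enough for this one array: it does not handle escaped or nested parentheses, or hex strings.

```python
import re

TJ_ARRAY = (
    "[(O)-4(V)-5(E)4(R)15(V)-5(IEW)13( O)-4(F T)17(H)-4(E)4( SE)16(C)5(U)5"
    "(R)5(ITISA)6(T)4(IO)-6(N)5( T)4(R)5(A)5(N)5(SA)6(C)5(T)4(IO)-6(N)]"
)

# Either a literal string "(...)" or a number between strings.
TOKEN = re.compile(r"\(([^()\\]*)\)|(-?\d+(?:\.\d+)?)")

def parse_tj(array):
    """Split a TJ array into its concatenated text and its adjustments."""
    strings, adjustments = [], []
    for m in TOKEN.finditer(array):
        if m.group(1) is not None:
            strings.append(m.group(1))
        else:
            adjustments.append(float(m.group(2)))
    return "".join(strings), adjustments

text, adjustments = parse_tj(TJ_ARRAY)
print(repr(text))
print("largest adjustment:", max(abs(v) for v in adjustments))
```

The only spaces in the text are the ones inside the string operands. The numbers are kerning adjustments in thousandths of a text-space unit, and the largest one here is 17, which at an 11.04 point font is a fraction of a point.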

Can you file this as a separate bug?
