Replies: 1 comment
The PDF text is written in the Base-14 font Helvetica with an array of explicitly given character widths, in which the width of the space character is not given ... and is therefore 0!

In [8]: page.get_text("words")
Out[8]:
[(100.0,
  270.20001220703125,
  154.6719970703125,  # x1 of "Hello"
  303.1759948730469,
  'Hello',
  0,
  0,
  0),
 (154.6719970703125,  # x0 of "World"
  270.20001220703125,
  217.33599853515625,
  303.1759948730469,
  'World',
  0,
  1,
  0)]

... we see that the end coordinate of "Hello" equals the start coordinate of "World", which is correct for a space of width 0.

In [4]: blocks=page.get_text("dict")["blocks"]
In [5]: [s for b in blocks for l in b["lines"] for s in l["spans"]]
Out[5]:
[{'size': 24.0,
  'flags': 0,
  'font': 'Helvetica',
  'color': 0,
  'ascender': 1.0750000476837158,
  'descender': -0.29899999499320984,
  'text': 'Hello World',
  'origin': (100.0, 296.0),
  'bbox': (100.0, 270.20001220703125, 217.33599853515625, 303.1759948730469)}]

Whereas version 1.24 gives us 2 spans:

In [9]: blocks=page.get_text("dict")["blocks"]
In [10]: [s for b in blocks for l in b["lines"] for s in l["spans"]]
Out[10]:
[{'size': 24.0,
  'flags': 0,
  'font': 'Helvetica',
  'color': 0,
  'ascender': 1.0750000476837158,
  'descender': -0.29899999499320984,
  'text': 'Hello ',
  'origin': (100.0, 296.0),
  'bbox': (100.0, 270.20001220703125, 161.343994140625, 303.1759948730469)},
 {'size': 24.0,
  'flags': 0,
  'font': 'Helvetica',
  'color': 0,
  'ascender': 1.0750000476837158,
  'descender': -0.29899999499320984,
  'text': 'World',
  'origin': (154.6719970703125, 296.0),
  'bbox': (154.6719970703125,
   270.20001220703125,
   217.33599853515625,
   303.1759948730469)}]

But however you view it, this behavior is based on a design decision taken in MuPDF, not in PyMuPDF. MuPDF's CLI tool also produces the following when executing
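The coordinates quoted in this reply can be reproduced by hand. The following is a minimal sketch, not PyMuPDF code: it assumes the standard Helvetica advance widths (in 1/1000 em) for the glyphs involved, plus the 24 pt font size and the x origin of 100 shown in the span output. The helper name `advance` is hypothetical.

```python
# Standard Helvetica advance widths in 1/1000 em (assumed from the
# standard Helvetica font metrics; only the glyphs needed here).
HELVETICA_WIDTHS = {
    "H": 722, "e": 556, "l": 222, "o": 556,
    "W": 944, "r": 333, "d": 556, " ": 278,
}

FONT_SIZE = 24.0   # 'size' in the span output
X_ORIGIN = 100.0   # x of 'origin' in the span output


def advance(text, widths, size=FONT_SIZE):
    """Horizontal advance of `text` in points (hypothetical helper)."""
    return sum(widths[ch] for ch in text) * size / 1000.0


# The PDF's own width array omits the space, so its width counts as 0.
pdf_widths = dict(HELVETICA_WIDTHS)
pdf_widths[" "] = 0

hello_end = X_ORIGIN + advance("Hello", pdf_widths)
world_start = X_ORIGIN + advance("Hello ", pdf_widths)
world_end = world_start + advance("World", pdf_widths)

# With the font's standard space width instead of 0:
hello_space_end = X_ORIGIN + advance("Hello ", HELVETICA_WIDTHS)

print(round(hello_end, 3), round(world_start, 3), round(world_end, 3))
print(round(hello_space_end, 3))
```

Under these assumptions, "Hello" ends and "World" starts at about 154.672, and "World" ends at about 217.336, matching the word boxes above. The value of about 161.344 obtained with the standard space width matches the right edge of the 'Hello ' span reported by version 1.24, which would explain why that span's bbox overlaps the start of the 'World' span.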
Description of the bug
File: Simple PDF 2.0 file.pdf (taken from the PDF Association's GitHub page of example PDFs)
Since version 1.24.0, I see an unexpected newline in the parsed text. Here is a text object from the PDF above:
How to reproduce the bug
To reproduce
Version 1.23.26:
Version 1.24.0:
Expected behaviour
I would say that the additional newline should not be there.
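Until the behaviour changes, one possible workaround is to rebuild lines from word positions instead of relying on the reported line numbers. This is only a sketch under assumptions: `lines_from_words` is a hypothetical helper, not a PyMuPDF function, and it expects tuples in the `(x0, y0, x1, y1, word, block_no, line_no, word_no)` layout that `page.get_text("words")` returns. The sample data is the word list quoted in the reply.

```python
from collections import defaultdict


def lines_from_words(words, ndigits=1):
    """Group word tuples into text lines by vertical extent.

    Ignores the reported line_no and treats words whose rounded
    (y0, y1) values agree as belonging to the same line.
    """
    rows = defaultdict(list)
    for x0, y0, x1, y1, text, *_ in words:
        key = (round(y0, ndigits), round(y1, ndigits))
        rows[key].append((x0, text))
    # Sort lines top to bottom, and words left to right within a line.
    return [
        " ".join(text for _, text in sorted(items))
        for _, items in sorted(rows.items())
    ]


# Word tuples as quoted in the reply: note the differing line_no (0 vs 1).
words = [
    (100.0, 270.20001220703125, 154.6719970703125, 303.1759948730469,
     "Hello", 0, 0, 0),
    (154.6719970703125, 270.20001220703125, 217.33599853515625,
     303.1759948730469, "World", 0, 1, 0),
]

print(lines_from_words(words))  # ['Hello World']
```

Rounding is a crude way to match vertical positions; values that fall close to a rounding boundary can still be split, so a tolerance-based comparison may be preferable for real documents.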
PyMuPDF version
1.24.1
Operating system
Linux
Python version
3.10