-
| Hi. I am very new to programming and PyMuPDF. I was wondering if it is possible to extract underlined text next to section numbers? For example: 1.1.1 Testing – This sentence is for testing. The bolded word represents underlined text. I found some posts stating it might not be possible directly, but is it possible to search for underlined words, highlight them, and then extract those words? | 
Replies: 5 comments 20 replies
-
| First of all: | 
-
| It depends on which PDF mechanism was used for the underlining. There are two alternatives: 
 Alternative 2 is rare and, although also possible with PyMuPDF, clearly beyond the reach of a beginner. In contrast, detecting and extracting text marked with annotations (alternative 1) is easy. Text marking annotations are of type underline, highlight, strikeout and squiggly (a zigzag underline):
for annot in page.annots(types=(fitz.PDF_ANNOT_HIGHLIGHT,
          fitz.PDF_ANNOT_UNDERLINE,
          fitz.PDF_ANNOT_SQUIGGLY,
          fitz.PDF_ANNOT_STRIKE_OUT)):
    clip = annot.rect + (-2, -2, 2, 2)  # annot rectangle ... enlarged by 2 points in every direction
    text = page.get_textbox(clip)
    print(f"this text is marked: '{text}'.") | 
-
| Here is an idea you could at least start with. Example PDF: an export of a Word document made from a German Wikipedia article:
In [1]: import fitz
In [2]: doc = fitz.open("test.pdf")
In [3]: page = doc[0]
In [4]: paths = page.get_drawings()  # get drawings on the page
In [5]: len(paths)  # see how many
Out[5]: 9
In [6]: # subselect things we may regard as lines
In [7]: lines = []
   ...: for p in paths:
   ...:     for item in p["items"]:
   ...:         if item[0] == "l":  # an actual line
   ...:             p1, p2 = item[1:]  # a line item is ("l", p1, p2)
   ...:             if p1.y == p2.y:
   ...:                 lines.append((p1, p2))
   ...:         elif item[0] == "re":  # a rectangle: check if height is small
   ...:             r = item[1]
   ...:             if r.width > r.height and r.height <= 2:
   ...:                 lines.append((r.tl, r.tr))  # take top left / right points
In [8]: len(lines)  # confirm we got everything
Out[8]: 9
In [9]: # example:
In [10]: lines[0]
Out[10]:
(Point(336.9100036621094, 98.300048828125),
 Point(373.6300048828125, 98.300048828125))
In [11]: # make a list of words
In [12]: words = page.get_text("words", sort=True)
In [13]: # if underlined, the bottom left / right of a word
In [14]: # should not be too far away from left / right end of some line:
In [15]: for w in words:  # w[4] is the actual word string
    ...:     r = fitz.Rect(w[:4])  # first 4 items are the word bbox
    ...:     for p1, p2 in lines:  # check distances for start / end points
    ...:         if abs(r.bl - p1) <= 4 and abs(r.br - p2) <= 4:
    ...:             print(f"Word '{w[4]}' is underlined!")
    ...:             break  # don't check more lines
Word 'Familie' is underlined!
Word 'Delfine' is underlined!
Word 'DNA-Analysen' is underlined!
Word 'Unterarten' is underlined!
Word 'Bartenwale' is underlined!
Word 'Spitzenprädatoren' is underlined!
Word 'Fressfeinde' is underlined!
Word 'Walfang' is underlined!
Word 'Delfinarien.' is underlined!
In [16]: # heureka!
Be a little generous with the distance checking here: for example, the extracted word is 'Delfinarien.' (including the dot), but the underlining does not include the dot, and there are more cases of that sort. | 
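One way to be "generous" in the sense described above is to compare horizontal overlap instead of both end points. The following is only a sketch under assumptions: `underlined_words`, `y_tol` and `min_overlap` are hypothetical names and values, not part of PyMuPDF, and the input is taken as plain coordinate tuples (word tuples shaped like the output of `page.get_text("words")`, lines as pairs of `(x, y)` points).

```python
def underlined_words(words, lines, y_tol=4.0, min_overlap=0.6):
    """Return the words whose bottom edge sits on a horizontal line.

    words: tuples (x0, y0, x1, y1, text, ...) in page coordinates
    lines: pairs ((x_start, y), (x_end, y)) of plain coordinates
    y_tol, min_overlap: assumed tolerances, tune them for your documents
    """
    hits = []
    for x0, y0, x1, y1, text, *_ in words:
        width = x1 - x0
        if width <= 0:
            continue  # skip degenerate boxes
        for (lx0, ly0), (lx1, ly1) in lines:
            left, right = min(lx0, lx1), max(lx0, lx1)
            line_y = (ly0 + ly1) / 2
            # vertical check: the line must be close to the word's bottom edge
            if abs(line_y - y1) > y_tol:
                continue
            # horizontal check: share of the word width covered by the line
            overlap = min(x1, right) - max(x0, left)
            if overlap / width >= min_overlap:
                hits.append(text)
                break  # don't check more lines for this word
    return hits


# made-up coordinates: the line stops before the trailing dot of the word
words = [
    (100.0, 80.0, 160.0, 92.0, "Delfinarien.", 0, 0, 0),
    (170.0, 80.0, 200.0, 92.0, "plain", 0, 0, 1),
]
lines = [((100.0, 93.0), (155.0, 93.0))]
print(underlined_words(words, lines))
```

With an overlap ratio, a word still counts as underlined when the line is a little shorter than its bounding box, which is what happens with trailing punctuation.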
-
| Another idea: | 
-
| If however we have regular text and lines:
drawn_lines=[...]  # your identified lines
blocks=page.get_text("dict",flags=fitz.TEXTFLAGS_TEXT)["blocks"]
max_lineheight=0
for b in blocks:
    for l in b["lines"]:
        bbox=fitz.Rect(l["bbox"])
        if bbox.height > max_lineheight:
            max_lineheight = bbox.height
# we now have the max lineheight on this page
for p1, p2 in drawn_lines:
    rect = fitz.Rect(p1.x, p1.y - max_lineheight, p2.x, p2.y) # the rectangle "above" a drawn line
    text = page.get_textbox(rect)
    print(f"Underlined: '{text}'.") | 
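To tie this back to the original question (underlined text right after a section number such as 1.1.1), here is a rough sketch of the last step. It assumes you already have the page's words in reading order, each tagged as underlined or not by one of the approaches above. The helper name `titles_after_section_numbers` and the `SECTION_NUMBER` pattern are made up for illustration; the pattern would also match ordinary decimals like "3.14", so adapt it to your numbering scheme.

```python
import re

# assumed numbering scheme: at least two dot-separated levels, e.g. "1.1" or "1.1.1"
SECTION_NUMBER = re.compile(r"^\d+(\.\d+)+$")


def titles_after_section_numbers(tagged_words):
    """tagged_words: (text, is_underlined) pairs in reading order.

    Returns {section number: underlined words directly following it}.
    """
    titles = {}
    current = None  # section number whose underlined run is still open
    for text, is_underlined in tagged_words:
        if SECTION_NUMBER.match(text):
            current = text
            titles[current] = []
        elif current is not None and is_underlined:
            titles[current].append(text)
        else:
            current = None  # first non-underlined word closes the run
    # drop section numbers that had no underlined words after them
    return {num: " ".join(ws) for num, ws in titles.items() if ws}


tagged = [
    ("1.1.1", False), ("Testing", True), ("–", False),
    ("This", False), ("sentence", False), ("is", False),
    ("for", False), ("testing.", False),
]
print(titles_after_section_numbers(tagged))
```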

