fix: handle case of overlapping cells in contract_cells_into_lines_v2 #105

Closed

dhdaines commented Mar 5, 2025

Fixes: #99

As noted there, the (nicely refactored!) implementation no longer correctly handles the case where text cells overlap.

The fact that they overlap in the first place is most likely a separate and more complex bug, but this fixes the symptoms, at least.


mergify bot commented Mar 5, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
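
A quick way to check a title locally against the rule above is sketched below. The pattern is copied verbatim from the rule; treating `~=` as an unanchored regex search is an assumption, and `is_conventional` and the scoped example title are hypothetical names made up for illustration.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix.

    Assumption: the `~=` operator behaves like re.search, so only the
    leading "type(scope)!:" part of the title is constrained.
    """
    return CONVENTIONAL_TITLE.search(title) is not None


# The title of this pull request.
ok = is_conventional(
    "fix: handle case of overlapping cells in contract_cells_into_lines_v2"
)
# Hypothetical title with a scope and a breaking-change marker.
scoped = is_conventional("feat(parser)!: hypothetical breaking change")
# Hypothetical title without a type prefix.
bad = is_conventional("Handle overlapping cells")
```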


dhdaines commented Mar 26, 2025

Here is a Python implementation of convex polygon intersection (based on https://www.gorillasun.de/blog/an-algorithm-for-polygon-intersections/). Updated to slim it down since we only care if they intersect, not where, so the point-in-polygon test is sufficient.

It's simple enough that I should be able to implement it in C++ without losing my patience with the compiler...

from typing import Iterator, Sequence, Tuple
Point = Tuple[float, float]

def edges(poly: Sequence[Point]) -> Iterator[Tuple[Point, Point]]:
    """Iterate over edges of a polygon."""
    for i in range(len(poly)):
        yield poly[i-1], poly[i]

def pnpoly(point: Point, poly: Sequence[Point]) -> bool:
    """Use even-odd rule to determine point-in-polygon."""
    x, y = point
    inside = False
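    for (xi, yi), (xj, yj) in edges(poly):
        # Only edges that straddle the horizontal line through the point count.
        if (yi > y) != (yj > y):
            # Toggle if the edge crosses that line to the right of the point.
            if x < (xj - xi) * (y - yi) / (yj - yi) + xi:
                inside = not inside
    return inside
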
def polys_intersect(a: Sequence[Point], b: Sequence[Point]) -> bool:
    """Determine if two convex polygons intersect (including the case
    where one is contained entirely in the other)."""
    for point in a:
        if pnpoly(point, b):
            return True
    for point in b:
        if pnpoly(point, a):
            return True
    return False
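
One case that may be worth checking before porting this: two convex polygons can overlap without either containing a vertex of the other, for example two rectangles arranged as a plus sign. The sketch below repeats `edges` and `pnpoly` so that it runs on its own, then adds an edge-crossing test. `orient`, `segments_cross` and `polys_intersect_with_edges` are hypothetical names, not part of this PR, and the crossing test is strict, so polygons that merely touch are not handled carefully.

```python
from typing import Iterator, Sequence, Tuple

Point = Tuple[float, float]


def edges(poly: Sequence[Point]) -> Iterator[Tuple[Point, Point]]:
    """Iterate over edges of a polygon (the closing edge comes first)."""
    for i in range(len(poly)):
        yield poly[i - 1], poly[i]


def pnpoly(point: Point, poly: Sequence[Point]) -> bool:
    """Even-odd point-in-polygon test, same logic as the snippet above."""
    x, y = point
    inside = False
    for (xi, yi), (xj, yj) in edges(poly):
        if (yi > y) != (yj > y):
            if x < (xj - xi) * (y - yi) / (yj - yi) + xi:
                inside = not inside
    return inside


def orient(p: Point, q: Point, r: Point) -> float:
    """Cross product of (q - p) and (r - p); its sign gives the turn direction."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def segments_cross(p1: Point, p2: Point, q1: Point, q2: Point) -> bool:
    """True if the two segments cross strictly (touching does not count)."""
    d1 = orient(q1, q2, p1)
    d2 = orient(q1, q2, p2)
    d3 = orient(p1, p2, q1)
    d4 = orient(p1, p2, q2)
    return d1 * d2 < 0 and d3 * d4 < 0


def polys_intersect_with_edges(a: Sequence[Point], b: Sequence[Point]) -> bool:
    """Hypothetical variant: vertex containment or any pair of crossing edges."""
    if any(pnpoly(p, b) for p in a) or any(pnpoly(p, a) for p in b):
        return True
    return any(
        segments_cross(p1, p2, q1, q2)
        for p1, p2 in edges(a)
        for q1, q2 in edges(b)
    )


# Two rectangles forming a plus sign: no vertex of one lies inside the other.
wide = [(0.0, 1.0), (3.0, 1.0), (3.0, 2.0), (0.0, 2.0)]
tall = [(1.0, 0.0), (2.0, 0.0), (2.0, 3.0), (1.0, 3.0)]

# The vertex-only test, written out inline, misses the overlap.
vertex_only = any(pnpoly(p, tall) for p in wide) or any(pnpoly(p, wide) for p in tall)
# The variant with the edge-crossing test detects it.
with_edges = polys_intersect_with_edges(wide, tall)
```

Whether this case matters for text cells depends on how the cell polygons are shaped; for glyph boxes of similar height on the same line it may be rare.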

dhdaines force-pushed the issue-99 branch 3 times, most recently from 65445ed to 0db1e30, March 27, 2025 02:51
dhdaines commented

So, I've implemented the overlap detection, but it doesn't actually fix the issue, because the character bboxes are still wrong for Type3 fonts. It also causes some problems, for instance in https://github.com/docling-project/docling-parse/blob/main/docs/visualisations/table_of_contents_01.pdf.page_2.word.png, where the characters in the figures legitimately have overlapping bboxes.

I'm closing this PR for the moment, though you may wish to keep the code for future use. I think a fix for the underlying Type3 font problem should be fairly easy so I'll make a new PR for that.

dhdaines closed this Mar 27, 2025
Development

Successfully merging this pull request may close these issues.

Wrong text output