fix: improve HTML layer detection, various MD fixes #1241

vagenas · 2025-03-25T20:09:47Z

Primary bug being addressed:
During HTML parsing (which BTW is also triggered by Markdown parsing when HTML elements are present), in case there were headings (<h*> tags) but no <h1> tags, all content was being marked as furniture.

Now the heuristic is being adapted so that:

🆕 if any heading is present (regardless of level h1, h2 etc.), the first one of them marks the beginning of the body; before that it's furniture
else, i.e. if no headings are present at all, all content is marked as body

Markdown fixes:

properly propagate section header levels
improve handling of list subroots without text

mergify · 2025-03-25T20:10:22Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

#approved-reviews-by >= 2

PeterStaar-IBM

Nice!

ceberam

A couple of comments:

Following up with the discussion thread, would it be better to set the heuristic parameter to False by default? In this way, we ensure that no content is lost for users who are not aware of this feature
In the heuristic rule, I would switch to ContentLayer.BODY as soon as we see an h* tag. Currently this only happens once we see <h1> and the specifications recommend that you do not skip heading levels but just in case we see documents starting with, say, <h2> and skipping <h1>.

docling/backend/html_backend.py

Markdown fixes: - properly propagate section header levels - improve handling of list subroots without text Signed-off-by: Panos Vagenas <[email protected]>

vagenas added html issue related to html backend markdown issue related to markdown backend labels Mar 25, 2025

vagenas requested a review from cau-git March 25, 2025 20:09

vagenas force-pushed the fix-html-furniture branch from 24a73e3 to 333eb34 Compare March 25, 2025 21:19

dolfim-ibm previously approved these changes Mar 26, 2025

View reviewed changes

PeterStaar-IBM previously approved these changes Mar 26, 2025

View reviewed changes

ceberam reviewed Mar 26, 2025

View reviewed changes

docling/backend/html_backend.py Outdated Show resolved Hide resolved

fix: improve HTML furniture detection, various MD fixes

40c099e

Markdown fixes: - properly propagate section header levels - improve handling of list subroots without text Signed-off-by: Panos Vagenas <[email protected]>

vagenas dismissed stale reviews from PeterStaar-IBM and dolfim-ibm via 40c099e March 26, 2025 14:31

vagenas force-pushed the fix-html-furniture branch from 333eb34 to 40c099e Compare March 26, 2025 14:31

cau-git approved these changes Mar 26, 2025

View reviewed changes

ceberam approved these changes Mar 26, 2025

View reviewed changes

vagenas changed the title ~~fix(markdown): fix layer detection in case of mixed content~~ fix: improve HTML layer detection, various MD fixes Mar 26, 2025

vagenas merged commit 9210812 into main Mar 26, 2025
10 checks passed

vagenas deleted the fix-html-furniture branch March 26, 2025 15:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: improve HTML layer detection, various MD fixes #1241

fix: improve HTML layer detection, various MD fixes #1241

vagenas commented Mar 25, 2025 •

edited

Loading

mergify bot commented Mar 25, 2025 •

edited

Loading

PeterStaar-IBM left a comment

ceberam left a comment

fix: improve HTML layer detection, various MD fixes #1241

fix: improve HTML layer detection, various MD fixes #1241

Conversation

vagenas commented Mar 25, 2025 • edited Loading

mergify bot commented Mar 25, 2025 • edited Loading

Merge Protections

🟢 Enforce conventional commit

🟢 Require two reviewer for test updates

PeterStaar-IBM left a comment

Choose a reason for hiding this comment

ceberam left a comment

Choose a reason for hiding this comment

vagenas commented Mar 25, 2025 •

edited

Loading

mergify bot commented Mar 25, 2025 •

edited

Loading