Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: improve HTML layer detection, various MD fixes #1241

Merged
merged 1 commit into from
Mar 26, 2025
Merged

Conversation

vagenas
Copy link
Contributor

@vagenas vagenas commented Mar 25, 2025

Primary bug being addressed:
During HTML parsing (which BTW is also triggered by Markdown parsing when HTML elements are present), in case there were headings (<h*> tags) but no <h1> tags, all content was being marked as furniture.

Now the heuristic is being adapted so that:

  • 🆕 if any heading is present (regardless of level h1, h2 etc.), the first one of them marks the beginning of the body; before that it's furniture
  • else, i.e. if no headings are present at all, all content is marked as body

Markdown fixes:

  • properly propagate section header levels
  • improve handling of list subroots without text

@vagenas vagenas added html issue related to html backend markdown issue related to markdown backend labels Mar 25, 2025
@vagenas vagenas requested a review from cau-git March 25, 2025 20:09
Copy link

mergify bot commented Mar 25, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

@vagenas vagenas force-pushed the fix-html-furniture branch from 24a73e3 to 333eb34 Compare March 25, 2025 21:19
dolfim-ibm
dolfim-ibm previously approved these changes Mar 26, 2025
PeterStaar-IBM
PeterStaar-IBM previously approved these changes Mar 26, 2025
Copy link
Contributor

@PeterStaar-IBM PeterStaar-IBM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Copy link
Contributor

@ceberam ceberam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of comments:

  • Following up with the discussion thread, would it be better to set the heuristic parameter to False by default? In this way, we ensure that no content is lost for users who are not aware of this feature
  • In the heuristic rule, I would switch to ContentLayer.BODY as soon as we see an h* tag. Currently this only happens once we see <h1> and the specifications recommend that you do not skip heading levels but just in case we see documents starting with, say, <h2> and skipping <h1>.

Markdown fixes:
- properly propagate section header levels
- improve handling of list subroots without text

Signed-off-by: Panos Vagenas <[email protected]>
@vagenas vagenas dismissed stale reviews from PeterStaar-IBM and dolfim-ibm via 40c099e March 26, 2025 14:31
@vagenas vagenas force-pushed the fix-html-furniture branch from 333eb34 to 40c099e Compare March 26, 2025 14:31
@vagenas vagenas changed the title fix(markdown): fix layer detection in case of mixed content fix: improve HTML layer detection, various MD fixes Mar 26, 2025
@vagenas vagenas merged commit 9210812 into main Mar 26, 2025
10 checks passed
@vagenas vagenas deleted the fix-html-furniture branch March 26, 2025 15:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
html issue related to html backend markdown issue related to markdown backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants