fix(markdown): add support for HTML content #855

Merged
vagenas merged 2 commits into main from parse-md-html-mix on Feb 3, 2025

Conversation

vagenas
Contributor

@vagenas vagenas commented Jan 31, 2025

Addresses #734.


mergify bot commented Jan 31, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

@vagenas vagenas linked an issue Jan 31, 2025 that may be closed by this pull request
PeterStaar-IBM
PeterStaar-IBM previously approved these changes Jan 31, 2025
Contributor

@PeterStaar-IBM PeterStaar-IBM left a comment

🚀

@vagenas vagenas changed the title from "fix(markdown): add support for HTML content [skip ci]" to "fix(markdown): add support for HTML content" Feb 3, 2025
@vagenas vagenas marked this pull request as ready for review February 3, 2025 09:40
dolfim-ibm
dolfim-ibm previously approved these changes Feb 3, 2025
@dolfim-ibm
Contributor

I suggest making a reminder issue about HTML features which cannot be converted to Markdown, e.g. we will have to deal with tables having merged cells, which are only possible via HTML code blocks.

cau-git
cau-git previously approved these changes Feb 3, 2025
Signed-off-by: Panos Vagenas <[email protected]>
@vagenas vagenas dismissed stale reviews from cau-git and dolfim-ibm via f4b30fe February 3, 2025 10:21
@vagenas
Contributor Author

vagenas commented Feb 3, 2025

> I suggest making a reminder issue about HTML features which cannot be converted to Markdown, e.g. we will have to deal with tables having merged cells, which are only possible via HTML code blocks.

@dolfim-ibm can you elaborate? Unless you are referring to Markdown export in general, this PR does not involve HTML-to-Markdown conversion. What it does (when needed) is convert Markdown to HTML and then parse that HTML, in order to streamline HTML content processing.
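
For illustration only, here is a minimal standard-library sketch of that Markdown-to-HTML-then-parse flow. It is not the code in this PR: `md_to_html`, `TextCollector`, and `parse_markdown` are hypothetical names, and the converter assumes a tiny Markdown subset (ATX headings and paragraphs) with raw HTML blocks passed through unchanged.

```python
import re
from html.parser import HTMLParser


def md_to_html(md: str) -> str:
    """Hypothetical, deliberately tiny Markdown-to-HTML step.

    Handles only ATX headings and plain paragraphs. Blocks that already
    start with an HTML tag are passed through unchanged.
    """
    parts = []
    for block in re.split(r"\n\s*\n", md.strip()):
        block = block.strip()
        heading = re.match(r"(#{1,6})\s+(.*)", block)
        if block.startswith("<"):
            parts.append(block)  # raw HTML embedded in the Markdown
        elif heading:
            level = len(heading.group(1))
            parts.append(f"<h{level}>{heading.group(2)}</h{level}>")
        else:
            parts.append(f"<p>{block}</p>")
    return "\n".join(parts)


class TextCollector(HTMLParser):
    """Collects (tag, text) pairs; a stand-in for a real HTML backend."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags
        self.items = []  # (innermost tag, text) in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.items.append((self.stack[-1], text))


def parse_markdown(md: str):
    """Convert everything to HTML first, then run a single HTML parser."""
    collector = TextCollector()
    collector.feed(md_to_html(md))
    collector.close()
    return collector.items


sample = "# Title\n\nSome text.\n\n<table><tr><td>A</td><td>B</td></tr></table>"
items = parse_markdown(sample)
```

The point of the sketch is that the embedded `<table>` and the native Markdown blocks end up going through the same HTML parsing path, so mixed content needs no separate handling.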

@vagenas vagenas merged commit 94751a7 into main Feb 3, 2025
10 checks passed
@vagenas vagenas deleted the parse-md-html-mix branch February 3, 2025 11:21
@dolfim-ibm
Contributor

> > I suggest making a reminder issue about HTML features which cannot be converted to Markdown, e.g. we will have to deal with tables having merged cells, which are only possible via HTML code blocks.
>
> @dolfim-ibm can you elaborate? Unless you are referring to Markdown export in general, this PR does not involve HTML-to-Markdown conversion. What it does (when needed) is convert Markdown to HTML and then parse that HTML, in order to streamline HTML content processing.

I'm referring to Markdown documents that use HTML <table> elements to represent tables which the native Markdown format does not support, e.g. tables with merged cells.

We could enhance the Markdown export of DoclingDocument with an option to output HTML tables when we detect that the Markdown format would be "too lossy". Something like tables_as_html: bool = False.
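
A rough sketch of how such an option might behave, purely as an assumption for discussion: only the `tables_as_html` name comes from the comment above, while `Cell`, `has_merged_cells`, `table_to_html`, `table_to_markdown`, and `export_table` are hypothetical helpers and not part of the DoclingDocument API. "Too lossy" is approximated here as "the table contains at least one merged cell".

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Cell:
    """Hypothetical table cell with optional spans."""
    text: str
    row_span: int = 1
    col_span: int = 1


def has_merged_cells(rows):
    """True if any cell spans more than one row or column."""
    return any(c.row_span > 1 or c.col_span > 1 for row in rows for c in row)


def table_to_html(rows):
    """Emit an HTML table that keeps rowspan/colspan information."""
    lines = ["<table>"]
    for row in rows:
        cells = []
        for c in row:
            attrs = ""
            if c.row_span > 1:
                attrs += f' rowspan="{c.row_span}"'
            if c.col_span > 1:
                attrs += f' colspan="{c.col_span}"'
            cells.append(f"<td{attrs}>{escape(c.text)}</td>")
        lines.append("<tr>" + "".join(cells) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


def table_to_markdown(rows):
    """Emit a pipe table; spans are dropped, so this path is lossy."""
    out = []
    for i, row in enumerate(rows):
        out.append("| " + " | ".join(c.text for c in row) + " |")
        if i == 0:
            # The first row is treated as the header row.
            out.append("|" + " --- |" * len(row))
    return "\n".join(out)


def export_table(rows, tables_as_html: bool = False):
    """Fall back to HTML only when requested and Markdown would lose spans."""
    if tables_as_html and has_merged_cells(rows):
        return table_to_html(rows)
    return table_to_markdown(rows)


merged = [[Cell("Header", col_span=2)], [Cell("a"), Cell("b")]]
plain = [[Cell("x"), Cell("y")], [Cell("1"), Cell("2")]]
```

With the default `tables_as_html=False`, existing output would stay unchanged. Tables without merged cells would still be exported as plain Markdown even when the flag is set.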

Development

Successfully merging this pull request may close these issues.

Support for Mixed Document Types
4 participants