feat: Enable markdown text formatting for docx #630

Open: SimJeg wants to merge 1 commit into main from docx-markdown-formatting

@SimJeg SimJeg commented Dec 19, 2024

Hi,

This PR adds markdown text formatting for docx documents (italic, bold, underline and hyperlinks). I included a new tests/data/docx/unit_test_formatting.docx document to illustrate it. With the latest docling main, the output of export_to_markdown is:

italic
bold
underline
hyperlink
italic and bold hyperlink
italic bold underline and hyperlink on the same line

with this PR it becomes:

italic
bold
underline
hyperlink
italic and bold hyperlink
italic bold underline and hyperlink on the same line
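
A minimal sketch of the kind of mapping described above, not the code from this PR. The helper name `format_run_markdown`, its parameters, and the example URL are hypothetical; the only detail taken from the thread is that underline uses `<u>` / `</u>` tags, since markdown has no native underline syntax.

```python
from typing import Optional


def format_run_markdown(
    text: str,
    bold: bool = False,
    italic: bool = False,
    underline: bool = False,
    hyperlink: Optional[str] = None,
) -> str:
    """Wrap one run of text in markdown markers (hypothetical helper)."""
    if not text.strip():
        # Leave empty or whitespace-only runs untouched to avoid stray markers.
        return text
    if bold:
        text = f"**{text}**"
    if italic:
        text = f"*{text}*"
    if underline:
        # Markdown has no underline syntax, so fall back to inline HTML.
        text = f"<u>{text}</u>"
    if hyperlink:
        # The link wraps the already-styled text.
        text = f"[{text}]({hyperlink})"
    return text


print(format_run_markdown("italic", italic=True))
print(format_run_markdown("bold", bold=True))
print(format_run_markdown("underline", underline=True))
print(format_run_markdown("hyperlink", hyperlink="https://example.com"))
```

Applying the link last keeps the style markers inside the link text, which is one reasonable ordering among several.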


mergify bot commented Dec 19, 2024

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

SimJeg force-pushed the docx-markdown-formatting branch from a221428 to 7f9464b on December 19, 2024 11:09

SimJeg commented Dec 19, 2024

Note: for underline I used the <u> / </u> tags that are not rendered on GitHub 😅


SimJeg commented Dec 26, 2024

@maxmnemonic @PeterStaar-IBM do you need any additional info for this PR?

dolfim-ibm (Contributor) commented

@SimJeg this is an interesting feature, but we should introduce it with an option to enable or disable it, because not all output formats will be compatible with markdown styling. We could also consider whether to propagate text styling in the Docling document format, but the option will be needed either way.


SimJeg commented Jan 6, 2025

Hi @dolfim-ibm,

Indeed, a different function should be applied for HTML, for instance. I can add an argument to the convert function (e.g. style=[None, "markdown", "html"]).

As there are several ways to do this and I don't know the docling API very well, I'll wait for your confirmation before pushing updates.
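
A rough sketch of what such a `style` switch could look like, assuming one table of wrappers per output style. The names `apply_style` and `_WRAPPERS` are hypothetical and are not part of docling; with `style=None` the text passes through unchanged.

```python
from typing import Callable, Dict, Optional, Set

# Hypothetical wrapper tables, one per output style.
_WRAPPERS: Dict[str, Dict[str, Callable[[str], str]]] = {
    "markdown": {
        "bold": lambda t: f"**{t}**",
        "italic": lambda t: f"*{t}*",
        "underline": lambda t: f"<u>{t}</u>",
    },
    "html": {
        "bold": lambda t: f"<b>{t}</b>",
        "italic": lambda t: f"<i>{t}</i>",
        "underline": lambda t: f"<u>{t}</u>",
    },
}


def apply_style(text: str, attrs: Set[str], style: Optional[str] = None) -> str:
    """Wrap text according to the chosen style; None leaves it unchanged."""
    if style is None:
        return text
    if style not in _WRAPPERS:
        raise ValueError(f"unsupported style: {style!r}")
    # Fixed order so the nesting of markers is deterministic.
    for name in ("bold", "italic", "underline"):
        if name in attrs:
            text = _WRAPPERS[style][name](text)
    return text


print(apply_style("bold", {"bold"}))                    # no styling by default
print(apply_style("bold", {"bold"}, style="markdown"))
print(apply_style("bold", {"bold"}, style="html"))
```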


SimJeg commented Jan 13, 2025

@dolfim-ibm any update on this?

dolfim-ibm (Contributor) commented

We actually are considering something similar to what you are proposing.

Adding the option for the format at convert time (with default None) is good, but we would like to have it in the PipelineOptions for the MS Word backend, since it will be specific to that backend.

We will soon post more details, but the above is the general idea.
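
To make the general idea concrete, here is a small sketch of a backend-specific options object with a format field that defaults to None. `WordBackendOptions`, `text_style`, and `render_bold` are hypothetical names used only for illustration; they are not the actual docling `PipelineOptions` API, whose final shape is not settled in this thread.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed set of supported values; None means "no inline styling".
ALLOWED_STYLES = (None, "markdown", "html")


@dataclass
class WordBackendOptions:
    """Hypothetical options holder for a Word backend."""

    text_style: Optional[str] = None

    def __post_init__(self) -> None:
        # Fail early on values the backend would not know how to render.
        if self.text_style not in ALLOWED_STYLES:
            raise ValueError(f"unsupported text_style: {self.text_style!r}")


def render_bold(text: str, options: WordBackendOptions) -> str:
    """Render a bold run according to the options (illustrative only)."""
    if options.text_style == "markdown":
        return f"**{text}**"
    if options.text_style == "html":
        return f"<b>{text}</b>"
    return text  # default: keep the current plain-text behaviour


print(render_bold("bold", WordBackendOptions()))
print(render_bold("bold", WordBackendOptions(text_style="markdown")))
```

Keeping None as the default means existing users see no change in output unless they opt in.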
