-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: fix single newline handling in MD backend #824
Conversation
Signed-off-by: Panos Vagenas <[email protected]>
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
🟢 Require two reviewer for test updatesWonderful, this rule succeeded.When test data is updated, we require two reviewers
|
For reference, Marko (the Markdown parsing lib employed under the hood) handles new lines as follows:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this PR fixes the problem described in this issue, it may at the same time disregard hard line breaks, which could be important for the document author.
For instance, consider the following text (rendered here as text):
First line with no space after.
The first line continues.
First line with two spaces after.
And the next line.
According to the markdown syntax and, for instance, GitHub Flavored Markdown the first block represents a single line while the second block should have 2 lines:
First line with no space after. The first line continues.
First line with two spaces after.
And the next line.
However this docling fix puts the second block as a single line and thus remove the hard line break. Running docling --from md --to md
on the document above would result in:
First line with no space after. The first line continues.
First line with two spaces after. And the next line.
@ceberam I would anyway regard the hard line break only as a visual wrap / formatting, in which case it should not correspond to a new DoclingDocument text item / paragraph. To make it clear, in the discussed "rendering" example:
these are conceptually still only two paragraphs. In this sense, the produced MD export is as expected (given that DoclingDocument will capture the paragraphs):
|
Resolves #822.