
Incorrect Table Formatting in Converting Word Documents To HTML #791

Open
jrsperry opened this issue Jan 23, 2025 · 0 comments
Labels
bug Something isn't working

jrsperry commented Jan 23, 2025

Bug

Tables with merged cells are not converted properly: when converting DOCX to HTML, the content of a merged cell is repeated in every column it spans.
...

Steps to reproduce

from io import BytesIO

from docling.document_converter import DocumentConverter, DocumentStream

converter = DocumentConverter()

# Read the .docx into memory and wrap it in a DocumentStream.
with open('/Users/joshuasperry/Downloads/josh_test_document.docx', 'rb') as f:
    content = f.read()

doc_stream = DocumentStream(
    stream=BytesIO(content),
    name='some-document.docx',
    content_type='application/vnd.openxmlformats-officedocument.wordprocessingml.document'
)
res = converter.convert(doc_stream)

# The exported HTML contains the repeated columns shown in the notes below.
doc_html = res.document.export_to_html()
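
To confirm that the input table really contains horizontally merged cells, the .docx can be inspected directly with the standard library. This is a minimal sketch, not part of Docling: the helper names `merged_cells` and `merged_cells_in_docx` are made up for illustration. It relies on WordprocessingML storing a horizontal merge as a `w:gridSpan` element inside the cell properties (`w:tcPr`). Vertical merges (`w:vMerge`) are not checked.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace in Clark notation, e.g. "{...}tbl".
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def merged_cells(document_xml):
    """Return (table, row, cell, span, text) for each cell with gridSpan > 1.

    Hypothetical helper: `document_xml` is the content of word/document.xml.
    """
    root = ET.fromstring(document_xml)
    found = []
    for t_idx, table in enumerate(root.iter(f"{W}tbl")):
        for r_idx, row in enumerate(table.findall(f"{W}tr")):
            for c_idx, cell in enumerate(row.findall(f"{W}tc")):
                span_el = cell.find(f"{W}tcPr/{W}gridSpan")
                span = 1 if span_el is None else int(span_el.get(f"{W}val", "1"))
                if span > 1:
                    # Concatenate all text runs inside the cell.
                    text = "".join(t.text or "" for t in cell.iter(f"{W}t"))
                    found.append((t_idx, r_idx, c_idx, span, text))
    return found


def merged_cells_in_docx(path):
    """Hypothetical convenience wrapper: read document.xml out of a .docx archive."""
    with zipfile.ZipFile(path) as archive:
        return merged_cells(archive.read("word/document.xml"))
```

Calling `merged_cells_in_docx(...)` on the test document should list every cell that spans more than one grid column, which can then be compared against the repeated `<td>` entries in the exported HTML.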

...

Docling version

Docling version: 2.15.1
Docling Core version: 2.14.0
Docling IBM Models version: 3.1.2
Docling Parse version: 3.1.0

...

Python version

3.10.16
...

NOTES:

I'm aware that the table has merged cells, and I believe this is the source of the problem. The test document is similar to the documents that I ingest. Instead of emitting one cell that spans several columns, the output HTML repeats the merged cell's content in every column it covers. The generated HTML is included below, since I can't attach HTML files.

<!DOCTYPE html>
<html lang="en">
<head>
    <link rel="icon" type="image/png"
    href="https://ds4sd.github.io/docling/assets/logo.png"/>
    <meta charset="UTF-8">
    <title>
    Powered by Docling
    </title>
    <style>
    html {
    background-color: LightGray;
    }
    body {
    margin: 0 auto;
    width:800px;
    padding: 30px;
    background-color: White;
    font-family: Arial, sans-serif;
    box-shadow: 10px 10px 10px grey;
    }
    figure{
    display: block;
    width: 100%;
    margin: 0px;
    margin-top: 10px;
    margin-bottom: 10px;
    }
    img {
    display: block;
    margin: auto;
    margin-top: 10px;
    margin-bottom: 10px;
    max-width: 640px;
    max-height: 640px;
    }
    table {
    min-width:500px;
    background-color: White;
    border-collapse: collapse;
    cell-padding: 5px;
    margin: auto;
    margin-top: 10px;
    margin-bottom: 10px;
    }
    th, td {
    border: 1px solid black;
    padding: 8px;
    }
    th {
    font-weight: bold;
    }
    table tr:nth-child(even) td{
    background-color: LightGray;
    }
    </style>
    </head>
<p>Let’s make some annoying tables</p>
<p></p>
<p></p>
<table><tbody><tr><td>Summary</td><td>Some summary description</td><td>Some summary description</td></tr><tr><td>This is some text that will be repeated</td><td>This is some text that will be repeated</td><td>This is some text that will be repeated</td></tr><tr><td>Purpose</td><td>Some purpose description</td><td>Some purpose description</td></tr><tr><td>Second bundle of text to be repeated</td><td>Second bundle of text to be repeated</td><td>Second bundle of text to be repeated</td></tr><tr><td>Context</td><td>some context stuff</td><td>some context stuff</td></tr><tr><td>This is the 3rd section</td><td>This is the 3rd section</td><td>This is the 3rd section</td></tr><tr><td>Audience</td><td>Please provide the specific audience for your selected text.</td><td>Please provide the specific audience for your selected text.</td></tr><tr><td>So much audience</td><td>So much audience</td><td>So much audience</td></tr><tr><td>Appeals</td><td>stuff</td><td>stuff</td></tr><tr><td>stuff<br>even more stuff</td><td>stuff<br>even more stuff</td><td>stuff<br>even more stuff</td></tr><tr><td>Sources</td><td>Sources</td><td>So much stuff</td></tr><tr><td>Blarghhh stuff</td><td>Blarghhh stuff</td><td>Blarghhh stuff</td></tr></tbody></table>
<p></p>
</html>

josh_test_document.docx
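
Until the export handles merged cells, a post-processing step can approximate the intended layout. The following is a rough workaround sketch, not Docling's behavior: the function name `collapse_repeated_cells` is hypothetical, and it assumes the simple attribute-free `<tr><td>...</td></tr>` markup shown above, with no nested tables. It folds runs of identical adjacent cells in a row into a single `<td colspan="N">`. It is a heuristic: it cannot tell a merged cell apart from neighboring cells that genuinely hold the same text.

```python
import re

# Assumes plain <tr>/<td> tags without attributes, as in the output above.
ROW_RE = re.compile(r"<tr>(.*?)</tr>", re.DOTALL)
CELL_RE = re.compile(r"<td>(.*?)</td>", re.DOTALL)


def collapse_repeated_cells(html):
    """Fold runs of identical adjacent <td> cells into one cell with colspan."""

    def rebuild_row(match):
        inner = match.group(1)
        # Leave rows untouched if they contain anything other than plain <td> cells.
        if CELL_RE.sub("", inner).strip():
            return match.group(0)

        runs = []  # list of [content, span]
        for content in CELL_RE.findall(inner):
            if runs and runs[-1][0] == content:
                runs[-1][1] += 1
            else:
                runs.append([content, 1])

        cells = []
        for content, span in runs:
            attr = f' colspan="{span}"' if span > 1 else ""
            cells.append(f"<td{attr}>{content}</td>")
        return "<tr>" + "".join(cells) + "</tr>"

    return ROW_RE.sub(rebuild_row, html)
```

Applied to `doc_html` from the reproduction steps, a row such as the three repeated "This is some text that will be repeated" cells becomes a single cell with `colspan="3"`.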
