feat: Expose equation exports (#869)
* pin new docling-core and exploit it via assembler changes

Signed-off-by: Michele Dolfi <[email protected]>

* update test results

Signed-off-by: Michele Dolfi <[email protected]>

* update with docling-core release

Signed-off-by: Michele Dolfi <[email protected]>

---------

Signed-off-by: Michele Dolfi <[email protected]>
dolfim-ibm authored Feb 3, 2025
1 parent 0cd81a8 commit 6a76b49
Showing 19 changed files with 138 additions and 122 deletions.
4 changes: 4 additions & 0 deletions docling/utils/glm_utils.py
@@ -307,6 +307,10 @@ def to_docling_document(doc_glm, update_name_label=False) -> DoclingDocument:
current_list = None

doc.add_code(text=text, prov=prov)
elif label == DocItemLabel.FORMULA:
current_list = None

doc.add_text(label=DocItemLabel.FORMULA, text="", orig=text, prov=prov)
else:
current_list = None

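The hunk above is the core of the change: formula blocks coming out of the GLM assembler are now added to the DoclingDocument as FORMULA text items with an empty rendered body and the raw equation text preserved in `orig`. Below is a minimal sketch of writing and reading back such an item, assuming the docling-core >= 2.17 API pinned in pyproject.toml below; the document name and equation string are placeholders, not part of this commit.

from docling_core.types.doc import DocItemLabel, DoclingDocument

# Build a toy document and add a formula the same way the assembler now does:
# empty rendered text, raw equation kept in `orig`.
doc = DoclingDocument(name="example")
doc.add_text(
    label=DocItemLabel.FORMULA,
    text="",            # rendered text left empty, matching the hunk above
    orig="E = mc^2",    # placeholder raw equation text preserved in `orig`
)

# Read the formula items back from the document tree.
for item, _level in doc.iterate_items():
    if getattr(item, "label", None) == DocItemLabel.FORMULA:
        print(item.orig)  # prints the preserved raw equation text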
38 changes: 25 additions & 13 deletions poetry.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -26,7 +26,7 @@ packages = [{include = "docling"}]
######################
python = "^3.9"
pydantic = "^2.0.0"
docling-core = {version = "^2.16.1", extras = ["chunking"]}
docling-core = {extras = ["chunking"], version = "^2.17.0"}
docling-ibm-models = "^3.3.0"
deepsearch-glm = "^1.0.0"
docling-parse = "^3.1.0"
8 changes: 4 additions & 4 deletions tests/data/groundtruth/docling_v2/2203.01017v2.doctags.txt
@@ -106,12 +106,12 @@
<text><location><page_6><loc_8><loc_70><loc_47><loc_80></location>The output features for each table cell are then fed into the feed-forward network (FFN). The FFN consists of a Multi-Layer Perceptron (3 layers with ReLU activation function) that predicts the normalized coordinates for the bounding box of each table cell. Finally, the predicted bounding boxes are classified based on whether they are empty or not using a linear layer.</text>
<text><location><page_6><loc_8><loc_44><loc_47><loc_69></location>Loss Functions. We formulate a multi-task loss Eq. 2 to train our network. The Cross-Entropy loss (denoted as l$_{s}$ ) is used to train the Structure Decoder which predicts the structure tokens. As for the Cell BBox Decoder it is trained with a combination of losses denoted as l$_{box}$ . l$_{box}$ consists of the generally used l$_{1}$ loss for object detection and the IoU loss ( l$_{iou}$ ) to be scale invariant as explained in [25]. In comparison to DETR, we do not use the Hungarian algorithm [15] to match the predicted bounding boxes with the ground-truth boxes, as we have already achieved a one-toone match through two steps: 1) Our token input sequence is naturally ordered, therefore the hidden states of the table data cells are also in order when they are provided as input to the Cell BBox Decoder , and 2) Our bounding boxes generation mechanism (see Sec. 3) ensures a one-to-one mapping between the cell content and its bounding box for all post-processed datasets.</text>
<text><location><page_6><loc_8><loc_41><loc_47><loc_43></location>The loss used to train the TableFormer can be defined as following:</text>
<formula><location><page_6><loc_20><loc_35><loc_47><loc_38></location>l$_{box}$ = λ$_{iou}$l$_{iou}$ + λ$_{l}$$_{1}$ l = λl$_{s}$ + (1 - λ ) l$_{box}$ (1)</formula>
<formula><location><page_6><loc_20><loc_35><loc_47><loc_38></location></formula>
<text><location><page_6><loc_8><loc_32><loc_46><loc_33></location>where λ ∈ [0, 1], and λ$_{iou}$, λ$_{l}$$_{1}$ ∈$_{R}$ are hyper-parameters.</text>
<section_header_level_1><location><page_6><loc_8><loc_28><loc_28><loc_30></location>5. Experimental Results</section_header_level_1>
<section_header_level_1><location><page_6><loc_8><loc_26><loc_29><loc_27></location>5.1. Implementation Details</section_header_level_1>
<text><location><page_6><loc_8><loc_19><loc_47><loc_25></location>TableFormer uses ResNet-18 as the CNN Backbone Network . The input images are resized to 448*448 pixels and the feature map has a dimension of 28*28. Additionally, we enforce the following input constraints:</text>
<formula><location><page_6><loc_15><loc_14><loc_47><loc_17></location>Image width and height ≤ 1024 pixels Structural tags length ≤ 512 tokens. (2)</formula>
<formula><location><page_6><loc_15><loc_14><loc_47><loc_17></location></formula>
<text><location><page_6><loc_8><loc_10><loc_47><loc_13></location>Although input constraints are used also by other methods, such as EDD, ours are less restrictive due to the improved</text>
<text><location><page_6><loc_50><loc_86><loc_89><loc_91></location>runtime performance and lower memory footprint of TableFormer. This allows to utilize input samples with longer sequences and images with larger dimensions.</text>
<text><location><page_6><loc_50><loc_59><loc_89><loc_85></location>The Transformer Encoder consists of two "Transformer Encoder Layers", with an input feature size of 512, feed forward network of 1024, and 4 attention heads. As for the Transformer Decoder it is composed of four "Transformer Decoder Layers" with similar input and output dimensions as the "Transformer Encoder Layers". Even though our model uses fewer layers and heads than the default implementation parameters, our extensive experimentation has proved this setup to be more suitable for table images. We attribute this finding to the inherent design of table images, which contain mostly lines and text, unlike the more elaborate content present in other scopes (e.g. the COCO dataset). Moreover, we have added ResNet blocks to the inputs of the Structure Decoder and Cell BBox Decoder. This prevents a decoder having a stronger influence over the learned weights which would damage the other prediction task (structure vs bounding boxes), but learn task specific weights instead. Lastly our dropout layers are set to 0.5.</text>
@@ -122,7 +122,7 @@
<text><location><page_6><loc_50><loc_10><loc_89><loc_14></location>We also share our baseline results on the challenging SynthTabNet dataset. Throughout our experiments, the same parameters stated in Sec. 5.1 are utilized.</text>
<section_header_level_1><location><page_7><loc_8><loc_89><loc_27><loc_91></location>5.3. Datasets and Metrics</section_header_level_1>
<text><location><page_7><loc_8><loc_83><loc_47><loc_88></location>The Tree-Edit-Distance-Based Similarity (TEDS) metric was introduced in [37]. It represents the prediction, and ground-truth as a tree structure of HTML tags. This similarity is calculated as:</text>
<formula><location><page_7><loc_14><loc_78><loc_47><loc_81></location>TEDS ( T$_{a}$, T$_{b}$ ) = 1 - EditDist ( T$_{a}$, T$_{b}$ ) max ( | T$_{a}$ | , | T$_{b}$ | ) (3)</formula>
<formula><location><page_7><loc_14><loc_78><loc_47><loc_81></location></formula>
<text><location><page_7><loc_8><loc_73><loc_47><loc_77></location>where T$_{a}$ and T$_{b}$ represent tables in tree structure HTML format. EditDist denotes the tree-edit distance, and | T | represents the number of nodes in T .</text>
<section_header_level_1><location><page_7><loc_8><loc_70><loc_28><loc_72></location>5.4. Quantitative Analysis</section_header_level_1>
<text><location><page_7><loc_8><loc_50><loc_47><loc_69></location>Structure. As shown in Tab. 2, TableFormer outperforms all SOTA methods across different datasets by a large margin for predicting the table structure from an image. All the more, our model outperforms pre-trained methods. During the evaluation we do not apply any table filtering. We also provide our baseline results on the SynthTabNet dataset. It has been observed that large tables (e.g. tables that occupy half of the page or more) yield poor predictions. We attribute this issue to the image resizing during the preprocessing step, that produces downsampled images with indistinguishable features. This problem can be addressed by treating such big tables with a separate model which accepts a large input image size.</text>
@@ -304,7 +304,7 @@
<list_item><location><page_12><loc_8><loc_29><loc_47><loc_33></location>3.a. If all IOU scores in a column are below the threshold, discard all predictions (structure and bounding boxes) for that column.</list_item>
<list_item><location><page_12><loc_8><loc_24><loc_47><loc_28></location>4. Find the best-fitting content alignment for the predicted cells with good IOU per each column. The alignment of the column can be identified by the following formula:</list_item>
</unordered_list>
<formula><location><page_12><loc_18><loc_17><loc_47><loc_21></location>alignment = arg min c { D$_{c}$ } D$_{c}$ = max { x$_{c}$ } - min { x$_{c}$ } (4)</formula>
<formula><location><page_12><loc_18><loc_17><loc_47><loc_21></location></formula>
<text><location><page_12><loc_8><loc_13><loc_47><loc_16></location>where c is one of { left, centroid, right } and x$_{c}$ is the xcoordinate for the corresponding point.</text>
<unordered_list>
<list_item><location><page_12><loc_8><loc_10><loc_47><loc_13></location>5. Use the alignment computed in step 4, to compute the median x -coordinate for all table columns and the me-</list_item>
2 changes: 1 addition & 1 deletion tests/data/groundtruth/docling_v2/2203.01017v2.json

Large diffs are not rendered by default.

