feat: New TableFormer model weights #1100

Merged
merged 3 commits into from
Mar 11, 2025
2 changes: 1 addition & 1 deletion docling/cli/main.py
@@ -210,7 +210,7 @@ def convert(
     table_mode: Annotated[
         TableFormerMode,
         typer.Option(..., help="The mode to use in the table structure model."),
-    ] = TableFormerMode.FAST,
+    ] = TableFormerMode.ACCURATE,
     enrich_code: Annotated[
         bool,
         typer.Option(..., help="Enable the code enrichment model in the pipeline."),
2 changes: 1 addition & 1 deletion docling/datamodel/pipeline_options.py
@@ -99,7 +99,7 @@ class TableStructureOptions(BaseModel):
         # are merged across table columns.
         # False: Let table structure model define the text cells, ignore PDF cells.
     )
-    mode: TableFormerMode = TableFormerMode.FAST
+    mode: TableFormerMode = TableFormerMode.ACCURATE
 
 
 class OcrOptions(BaseModel):
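The default change in `TableStructureOptions` affects every caller that never sets `mode`. The sketch below illustrates that effect with standard-library stand-ins only: the `TableFormerMode` and `TableStructureOptions` classes here are hypothetical re-creations that borrow their names from the diff, the enum string values are assumptions, and the real class is a pydantic `BaseModel` rather than a dataclass.

```python
from dataclasses import dataclass
from enum import Enum


class TableFormerMode(str, Enum):
    # Stand-in enum: member names come from the diff, string values are assumed.
    FAST = "fast"
    ACCURATE = "accurate"


@dataclass
class TableStructureOptions:
    # Stand-in for the options model; the default mirrors the new value in this PR.
    mode: TableFormerMode = TableFormerMode.ACCURATE


# A caller that never sets `mode` picks up the new default.
implicit = TableStructureOptions()

# A caller that wants the previous behaviour pins FAST explicitly.
pinned = TableStructureOptions(mode=TableFormerMode.FAST)

print(implicit.mode.name, pinned.mode.name)
```

Pinning the mode explicitly, as in `pinned`, keeps results stable across default changes like this one, at the cost of not picking up future improvements automatically.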
2 changes: 1 addition & 1 deletion docling/models/table_structure_model.py
@@ -95,7 +95,7 @@ def download_models(
             repo_id="ds4sd/docling-models",
             force_download=force,
             local_dir=local_dir,
-            revision="v2.1.0",
+            revision="v2.2.0",
         )
 
         return Path(download_path)
2 changes: 1 addition & 1 deletion docs/usage/index.md
@@ -135,7 +135,7 @@ doc_converter = DocumentConverter(
 )
 ```
 
-Since docling 1.16.0: You can control which TableFormer mode you want to use. Choose between `TableFormerMode.FAST` (default) and `TableFormerMode.ACCURATE` (better, but slower) to receive better quality with difficult table structures.
+Since docling 1.16.0: You can control which TableFormer mode you want to use. Choose between `TableFormerMode.FAST` (faster but less accurate) and `TableFormerMode.ACCURATE` (default) to receive better quality with difficult table structures.
 
 ```python
 from docling.datamodel.base_models import InputFormat
20 changes: 10 additions & 10 deletions tests/data/groundtruth/docling_v1/2203.01017v2.doctags.txt
@@ -12,7 +12,7 @@
</figure>
<table>
<location><page_1><loc_52><loc_62><loc_88><loc_71></location>
-<row_0><col_0><col_header>3</col_0><col_1><col_header>1</col_1></row_0>
+<row_0><col_0><col_header>1</col_0></row_0>
</table>
<paragraph><location><page_1><loc_52><loc_58><loc_79><loc_60></location>- b. Red-annotation of bounding boxes, Blue-predictions by TableFormer</paragraph>
<paragraph><location><page_1><loc_52><loc_46><loc_80><loc_47></location>- c. Structure predicted by TableFormer:</paragraph>
@@ -25,11 +25,11 @@
</figure>
<table>
<location><page_1><loc_52><loc_37><loc_88><loc_45></location>
-<row_0><col_0><col_header>0</col_0><col_1><col_header>1</col_1><col_2><col_header>1</col_2><col_3><col_header>2 1</col_3><col_4><col_header>2 1</col_4><col_5><body></col_5></row_0>
-<row_1><col_0><body>3</col_0><col_1><body>4</col_1><col_2><body>5 3</col_2><col_3><body>6</col_3><col_4><body>7</col_4><col_5><body></col_5></row_1>
-<row_2><col_0><body>8</col_0><col_1><body>9</col_1><col_2><body>10</col_2><col_3><body>11</col_3><col_4><body>12</col_4><col_5><body>2</col_5></row_2>
-<row_3><col_0><body></col_0><col_1><body>13</col_1><col_2><body>14</col_2><col_3><body>15</col_3><col_4><body>16</col_4><col_5><body>2</col_5></row_3>
-<row_4><col_0><body></col_0><col_1><body>17</col_1><col_2><body>18</col_2><col_3><body>19</col_3><col_4><body>20</col_4><col_5><body>2</col_5></row_4>
+<row_0><col_0><body>0</col_0><col_1><body>1 2 1</col_1><col_2><body>1 2 1</col_2><col_3><body>1 2 1</col_3><col_4><body>1 2 1</col_4></row_0>
+<row_1><col_0><body>3</col_0><col_1><body>4 3</col_1><col_2><body>5</col_2><col_3><body>6</col_3><col_4><body>7</col_4></row_1>
+<row_2><col_0><body>8 2</col_0><col_1><body>9</col_1><col_2><body>10</col_2><col_3><body>11</col_3><col_4><body>12</col_4></row_2>
+<row_3><col_0><body>13</col_0><col_1><body></col_1><col_2><body>14</col_2><col_3><body>15</col_3><col_4><body>16</col_4></row_3>
+<row_4><col_0><body>17</col_0><col_1><body>18</col_1><col_2><body></col_2><col_3><body>19</col_3><col_4><body>20</col_4></row_4>
</table>
<paragraph><location><page_1><loc_50><loc_16><loc_89><loc_26></location>Recently, significant progress has been made with vision based approaches to extract tables in documents. For the sake of completeness, the issue of table extraction from documents is typically decomposed into two separate challenges, i.e. (1) finding the location of the table(s) on a document-page and (2) finding the structure of a given table in the document.</paragraph>
<paragraph><location><page_1><loc_50><loc_10><loc_89><loc_16></location>The first problem is called table-location and has been previously addressed [30, 38, 19, 21, 23, 26, 8] with stateof-the-art object-detection networks (e.g. YOLO and later on Mask-RCNN [9]). For all practical purposes, it can be</paragraph>
@@ -138,9 +138,9 @@
<location><page_7><loc_50><loc_62><loc_87><loc_69></location>
<caption>Table 3: Cell Bounding Box detection results on PubTabNet, and FinTabNet. PP: Post-processing.</caption>
<row_0><col_0><col_header>Model</col_0><col_1><col_header>Dataset</col_1><col_2><col_header>mAP</col_2><col_3><col_header>mAP (PP)</col_3></row_0>
-<row_1><col_0><body>EDD+BBox</col_0><col_1><body>PubTabNet</col_1><col_2><body>79.2</col_2><col_3><body>82.7</col_3></row_1>
-<row_2><col_0><body>TableFormer</col_0><col_1><body>PubTabNet</col_1><col_2><body>82.1</col_2><col_3><body>86.8</col_3></row_2>
-<row_3><col_0><body>TableFormer</col_0><col_1><body>SynthTabNet</col_1><col_2><body>87.7</col_2><col_3><body>-</col_3></row_3>
+<row_1><col_0><row_header>EDD+BBox</col_0><col_1><body>PubTabNet</col_1><col_2><body>79.2</col_2><col_3><body>82.7</col_3></row_1>
+<row_2><col_0><row_header>TableFormer</col_0><col_1><body>PubTabNet</col_1><col_2><body>82.1</col_2><col_3><body>86.8</col_3></row_2>
+<row_3><col_0><row_header>TableFormer</col_0><col_1><body>SynthTabNet</col_1><col_2><body>87.7</col_2><col_3><body>-</col_3></row_3>
</table>
<caption><location><page_7><loc_50><loc_57><loc_89><loc_60></location>Table 3: Cell Bounding Box detection results on PubTabNet, and FinTabNet. PP: Post-processing.</caption>
<paragraph><location><page_7><loc_50><loc_34><loc_89><loc_54></location>Cell Content. In this section, we evaluate the entire pipeline of recovering a table with content. Here we put our approach to test by capitalizing on extracting content from the PDF cells rather than decoding from images. Tab. 4 shows the TEDs score of HTML code representing the structure of the table along with the content inserted in the data cell and compared with the ground-truth. Our method achieved a 5.3% increase over the state-of-the-art, and commercial solutions. We believe our scores would be higher if the HTML ground-truth matched the extracted PDF cell content. Unfortunately, there are small discrepancies such as spacings around words or special characters with various unicode representations.</paragraph>
@@ -179,7 +179,7 @@
<row_6><col_0><row_header>第 17 回人工知能学会全国大会 (2003)</col_0><col_1><body>208</col_1><col_2><body>5</col_2><col_3><body>203</col_3><col_4><body>152</col_4><col_5><body>244</col_5></row_6>
<row_7><col_0><row_header>自然言語処理研究会第 146 〜 155 回</col_0><col_1><body>98</col_1><col_2><body>2</col_2><col_3><body>96</col_3><col_4><body>150</col_4><col_5><body>232</col_5></row_7>
<row_8><col_0><row_header>WWW から収集した論文</col_0><col_1><body>107</col_1><col_2><body>73</col_2><col_3><body>34</col_3><col_4><body>147</col_4><col_5><body>96</col_5></row_8>
-<row_9><col_0><body></col_0><col_1><body>945</col_1><col_2><body>294</col_2><col_3><body>651</col_3><col_4><body>1122</col_4><col_5><body>955</col_5></row_9>
+<row_9><col_0><row_header>計</col_0><col_1><body>945</col_1><col_2><body>294</col_2><col_3><body>651</col_3><col_4><body>1122</col_4><col_5><body>955</col_5></row_9>
</table>
<caption><location><page_8><loc_62><loc_62><loc_90><loc_63></location>Text is aligned to match original for ease of viewing</caption>
<table>
2 changes: 1 addition & 1 deletion tests/data/groundtruth/docling_v1/2203.01017v2.json

Large diffs are not rendered by default.

14 changes: 7 additions & 7 deletions tests/data/groundtruth/docling_v1/2203.01017v2.md
@@ -25,12 +25,12 @@ The occurrence of tables in documents is ubiquitous. They often summarise quanti
Figure 1: Picture of a table with subtle, complex features such as (1) multi-column headers, (2) cell with multi-row text and (3) cells with no content. Image from PubTabNet evaluation set, filename: 'PMC2944238 004 02'.
<!-- image -->

-| 0 | 1 | 1 | 2 1 | 2 1 | |
-|-----|-----|-----|-------|-------|----|
-| 3 | 4 | 5 3 | 6 | 7 | |
-| 8 | 9 | 10 | 11 | 12 | 2 |
-| | 13 | 14 | 15 | 16 | 2 |
-| | 17 | 18 | 19 | 20 | 2 |
+| 0 | 1 2 1 | 1 2 1 | 1 2 1 | 1 2 1 |
+|-----|---------|---------|---------|---------|
+| 3 | 4 3 | 5 | 6 | 7 |
+| 8 2 | 9 | 10 | 11 | 12 |
+| 13 | | 14 | 15 | 16 |
+| 17 | 18 | | 19 | 20 |

Recently, significant progress has been made with vision based approaches to extract tables in documents. For the sake of completeness, the issue of table extraction from documents is typically decomposed into two separate challenges, i.e. (1) finding the location of the table(s) on a document-page and (2) finding the structure of a given table in the document.

@@ -241,7 +241,7 @@ Text is aligned to match original for ease of viewing
| 第 17 回人工知能学会全国大会 (2003) | 208 | 5 | 203 | 152 | 244 |
| 自然言語処理研究会第 146 〜 155 回 | 98 | 2 | 96 | 150 | 232 |
| WWW から収集した論文 | 107 | 73 | 34 | 147 | 96 |
-| | 945 | 294 | 651 | 1122 | 955 |
+| | 945 | 294 | 651 | 1122 | 955 |

| | Shares (in millions) | Shares (in millions) | Weighted Average Grant Date Fair Value | Weighted Average Grant Date Fair Value |
|--------------------------|------------------------|------------------------|------------------------------------------|------------------------------------------|
2 changes: 1 addition & 1 deletion tests/data/groundtruth/docling_v1/2203.01017v2.pages.json

Large diffs are not rendered by default.

18 changes: 9 additions & 9 deletions tests/data/groundtruth/docling_v1/2206.01062.doctags.txt
@@ -56,7 +56,7 @@
<table>
<location><page_4><loc_16><loc_63><loc_84><loc_83></location>
<caption>Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as % of row "Total") in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.</caption>
-<row_0><col_0><body></col_0><col_1><body></col_1><col_2><col_header>% of Total</col_2><col_3><col_header>% of Total</col_3><col_4><col_header>% of Total</col_4><col_5><col_header>% of Total</col_5><col_6><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_6><col_7><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_7><col_8><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_8><col_9><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_9><col_10><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_10><col_11><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_11></row_0>
+<row_0><col_0><body></col_0><col_1><body></col_1><col_2><col_header>% of Total</col_2><col_3><col_header>% of Total</col_3><col_4><col_header>% of Total</col_4><col_5><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_5><col_6><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_6><col_7><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_7><col_8><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_8><col_9><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_9><col_10><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_10><col_11><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_11></row_0>
<row_1><col_0><col_header>class label</col_0><col_1><col_header>Count</col_1><col_2><col_header>Train</col_2><col_3><col_header>Test</col_3><col_4><col_header>Val</col_4><col_5><col_header>All</col_5><col_6><col_header>Fin</col_6><col_7><col_header>Man</col_7><col_8><col_header>Sci</col_8><col_9><col_header>Law</col_9><col_10><col_header>Pat</col_10><col_11><col_header>Ten</col_11></row_1>
<row_2><col_0><row_header>Caption</col_0><col_1><body>22524</col_1><col_2><body>2.04</col_2><col_3><body>1.77</col_3><col_4><body>2.32</col_4><col_5><body>84-89</col_5><col_6><body>40-61</col_6><col_7><body>86-92</col_7><col_8><body>94-99</col_8><col_9><body>95-99</col_9><col_10><body>69-78</col_10><col_11><body>n/a</col_11></row_2>
<row_3><col_0><row_header>Footnote</col_0><col_1><body>6318</col_1><col_2><body>0.60</col_2><col_3><body>0.31</col_3><col_4><body>0.58</col_4><col_5><body>83-91</col_5><col_6><body>n/a</col_6><col_7><body>100</col_7><col_8><body>62-88</col_8><col_9><body>85-94</col_9><col_10><body>n/a</col_10><col_11><body>82-97</col_11></row_3>
@@ -102,7 +102,7 @@
<table>
<location><page_6><loc_10><loc_56><loc_47><loc_75></location>
<row_0><col_0><body></col_0><col_1><col_header>human</col_1><col_2><col_header>MRCNN</col_2><col_3><col_header>MRCNN</col_3><col_4><col_header>FRCNN</col_4><col_5><col_header>YOLO</col_5></row_0>
-<row_1><col_0><body></col_0><col_1><col_header>human</col_1><col_2><col_header>R50</col_2><col_3><col_header>R101</col_3><col_4><col_header>R101</col_4><col_5><col_header>v5x6</col_5></row_1>
+<row_1><col_0><body></col_0><col_1><body></col_1><col_2><col_header>R50</col_2><col_3><col_header>R101</col_3><col_4><col_header>R101</col_4><col_5><col_header>v5x6</col_5></row_1>
<row_2><col_0><row_header>Caption</col_0><col_1><body>84-89</col_1><col_2><body>68.4</col_2><col_3><body>71.5</col_3><col_4><body>70.1</col_4><col_5><body>77.7</col_5></row_2>
<row_3><col_0><row_header>Footnote</col_0><col_1><body>83-91</col_1><col_2><body>70.9</col_2><col_3><body>71.8</col_3><col_4><body>73.7</col_4><col_5><body>77.2</col_5></row_3>
<row_4><col_0><row_header>Formula</col_0><col_1><body>83-85</col_1><col_2><body>60.1</col_2><col_3><body>63.4</col_3><col_4><body>63.5</col_4><col_5><body>66.2</col_5></row_4>
@@ -130,7 +130,7 @@
<paragraph><location><page_7><loc_9><loc_84><loc_48><loc_89></location>Table 3: Performance of a Mask R-CNN R50 network in mAP@0.5-0.95 scores trained on DocLayNet with different class label sets. The reduced label sets were obtained by either down-mapping or dropping labels.</paragraph>
<table>
<location><page_7><loc_13><loc_63><loc_44><loc_81></location>
-<row_0><col_0><col_header>Class-count</col_0><col_1><col_header>11</col_1><col_2><col_header>6</col_2><col_3><col_header>5</col_3><col_4><col_header>4</col_4></row_0>
+<row_0><col_0><body>Class-count</col_0><col_1><col_header>11</col_1><col_2><col_header>6</col_2><col_3><col_header>5</col_3><col_4><col_header>4</col_4></row_0>
<row_1><col_0><row_header>Caption</col_0><col_1><body>68</col_1><col_2><body>Text</col_2><col_3><body>Text</col_3><col_4><body>Text</col_4></row_1>
<row_2><col_0><row_header>Footnote</col_0><col_1><body>71</col_1><col_2><body>Text</col_2><col_3><body>Text</col_3><col_4><body>Text</col_4></row_2>
<row_3><col_0><row_header>Formula</col_0><col_1><body>60</col_1><col_2><body>Text</col_2><col_3><body>Text</col_3><col_4><body>Text</col_4></row_3>
@@ -178,17 +178,17 @@
<row_1><col_0><col_header>Training on</col_0><col_1><col_header>labels</col_1><col_2><col_header>PLN</col_2><col_3><col_header>DB</col_3><col_4><col_header>DLN</col_4></row_1>
<row_2><col_0><row_header>PubLayNet (PLN)</col_0><col_1><row_header>Figure</col_1><col_2><body>96</col_2><col_3><body>43</col_3><col_4><body>23</col_4></row_2>
<row_3><col_0><row_header>PubLayNet (PLN)</col_0><col_1><row_header>Sec-header</col_1><col_2><body>87</col_2><col_3><body>-</col_3><col_4><body>32</col_4></row_3>
-<row_4><col_0><row_header>PubLayNet (PLN)</col_0><col_1><row_header>Table</col_1><col_2><body>95</col_2><col_3><body>24</col_3><col_4><body>49</col_4></row_4>
-<row_5><col_0><row_header>PubLayNet (PLN)</col_0><col_1><row_header>Text</col_1><col_2><body>96</col_2><col_3><body>-</col_3><col_4><body>42</col_4></row_5>
-<row_6><col_0><row_header>PubLayNet (PLN)</col_0><col_1><row_header>total</col_1><col_2><body>93</col_2><col_3><body>34</col_3><col_4><body>30</col_4></row_6>
+<row_4><col_0><body></col_0><col_1><row_header>Table</col_1><col_2><body>95</col_2><col_3><body>24</col_3><col_4><body>49</col_4></row_4>
+<row_5><col_0><body></col_0><col_1><row_header>Text</col_1><col_2><body>96</col_2><col_3><body>-</col_3><col_4><body>42</col_4></row_5>
+<row_6><col_0><body></col_0><col_1><row_header>total</col_1><col_2><body>93</col_2><col_3><body>34</col_3><col_4><body>30</col_4></row_6>
<row_7><col_0><row_header>DocBank (DB)</col_0><col_1><row_header>Figure</col_1><col_2><body>77</col_2><col_3><body>71</col_3><col_4><body>31</col_4></row_7>
<row_8><col_0><row_header>DocBank (DB)</col_0><col_1><row_header>Table</col_1><col_2><body>19</col_2><col_3><body>65</col_3><col_4><body>22</col_4></row_8>
<row_9><col_0><row_header>DocBank (DB)</col_0><col_1><row_header>total</col_1><col_2><body>48</col_2><col_3><body>68</col_3><col_4><body>27</col_4></row_9>
<row_10><col_0><row_header>DocLayNet (DLN)</col_0><col_1><row_header>Figure</col_1><col_2><body>67</col_2><col_3><body>51</col_3><col_4><body>72</col_4></row_10>
<row_11><col_0><row_header>DocLayNet (DLN)</col_0><col_1><row_header>Sec-header</col_1><col_2><body>53</col_2><col_3><body>-</col_3><col_4><body>68</col_4></row_11>
-<row_12><col_0><row_header>DocLayNet (DLN)</col_0><col_1><row_header>Table</col_1><col_2><body>87</col_2><col_3><body>43</col_3><col_4><body>82</col_4></row_12>
-<row_13><col_0><row_header>DocLayNet (DLN)</col_0><col_1><row_header>Text</col_1><col_2><body>77</col_2><col_3><body>-</col_3><col_4><body>84</col_4></row_13>
-<row_14><col_0><row_header>DocLayNet (DLN)</col_0><col_1><row_header>total</col_1><col_2><body>59</col_2><col_3><body>47</col_3><col_4><body>78</col_4></row_14>
+<row_12><col_0><body></col_0><col_1><row_header>Table</col_1><col_2><body>87</col_2><col_3><body>43</col_3><col_4><body>82</col_4></row_12>
+<row_13><col_0><body></col_0><col_1><row_header>Text</col_1><col_2><body>77</col_2><col_3><body>-</col_3><col_4><body>84</col_4></row_13>
+<row_14><col_0><body></col_0><col_1><row_header>total</col_1><col_2><body>59</col_2><col_3><body>47</col_3><col_4><body>78</col_4></row_14>
</table>
<paragraph><location><page_8><loc_9><loc_44><loc_48><loc_51></location>Section-header , Table and Text . Before training, we either mapped or excluded DocLayNet's other labels as specified in table 3, and also PubLayNet's List to Text . Note that the different clustering of lists (by list-element vs. whole list objects) naturally decreases the mAP score for Text .</paragraph>
<paragraph><location><page_8><loc_9><loc_26><loc_48><loc_44></location>For comparison of DocBank with DocLayNet, we trained only on Picture and Table clusters of each dataset. We had to exclude Text because successive paragraphs are often grouped together into a single object in DocBank. This paragraph grouping is incompatible with the individual paragraphs of DocLayNet. As can be seen in Table 5, DocLayNet trained models yield better performance compared to the previous datasets. It is noteworthy that the models trained on PubLayNet and DocBank perform very well on their own test set, but have a much lower performance on the foreign datasets. While this also applies to DocLayNet, the difference is far less pronounced. Thus we conclude that DocLayNet trained models are overall more robust and will produce better results for challenging, unseen layouts.</paragraph>
2 changes: 1 addition & 1 deletion tests/data/groundtruth/docling_v1/2206.01062.json

Large diffs are not rendered by default.
