mkrssg
diff --git a/‎poetry.lock
+413-425 b/‎poetry.lock
+413-425
diff --git a/‎tests/data/groundtruth/docling_v1/2305.03393v1-pg9.json
+1-1 b/‎tests/data/groundtruth/docling_v1/2305.03393v1-pg9.json
+1-1
diff --git a/‎tests/data/groundtruth/docling_v1/2305.03393v1-pg9.pages.json
+1-1 b/‎tests/data/groundtruth/docling_v1/2305.03393v1-pg9.pages.json
+1-1
diff --git a/‎tests/data/groundtruth/docling_v2/2203.01017v2.doctags.txt
+4-4 b/‎tests/data/groundtruth/docling_v2/2203.01017v2.doctags.txt
+4-4
diff --git a/‎tests/data/groundtruth/docling_v2/2203.01017v2.md
+10 b/‎tests/data/groundtruth/docling_v2/2203.01017v2.md
+10
diff --git a/‎tests/data/groundtruth/docling_v2/2206.01062.md
+8 b/‎tests/data/groundtruth/docling_v2/2206.01062.md
+8
diff --git a/‎tests/data/groundtruth/docling_v2/2305.03393v1-pg9.json
+1-1 b/‎tests/data/groundtruth/docling_v2/2305.03393v1-pg9.json
+1-1
diff --git a/‎tests/data/groundtruth/docling_v2/2305.03393v1-pg9.pages.json
+1-1 b/‎tests/data/groundtruth/docling_v2/2305.03393v1-pg9.pages.json
+1-1
diff --git a/‎tests/data/groundtruth/docling_v2/code_and_formula.doctags.txt
+1-2 b/‎tests/data/groundtruth/docling_v2/code_and_formula.doctags.txt
+1-2
diff --git a/‎tests/data/groundtruth/docling_v2/code_and_formula.md
+2 b/‎tests/data/groundtruth/docling_v2/code_and_formula.md
+2
diff --git a/‎tests/data/groundtruth/docling_v2/redp5110_sampled.doctags.txt
+5-6 b/‎tests/data/groundtruth/docling_v2/redp5110_sampled.doctags.txt
+5-6
@@ -106,8 +106,8 @@
 <text><loc_252><loc_232><loc_445><loc_328>Cell Content. In this section, we evaluate the entire pipeline of recovering a table with content. Here we put our approach to test by capitalizing on extracting content from the PDF cells rather than decoding from images. Tab. 4 shows the TEDs score of HTML code representing the structure of the table along with the content inserted in the data cell and compared with the ground-truth. Our method achieved a 5.3% increase over the state-of-the-art, and commercial solutions. We believe our scores would be higher if the HTML ground-truth matched the extracted PDF cell content. Unfortunately, there are small discrepancies such as spacings around words or special characters with various unicode representations.</text>
 <otsl><loc_272><loc_341><loc_426><loc_406><fcel>Model<ched>Simple<ched>TEDS Complex<ched>All<nl><rhed>Tabula<fcel>78.0<fcel>57.8<fcel>67.9<nl><rhed>Traprange<fcel>60.8<fcel>49.9<fcel>55.4<nl><rhed>Camelot<fcel>80.0<fcel>66.0<fcel>73.0<nl><rhed>Acrobat Pro<fcel>68.9<fcel>61.8<fcel>65.3<nl><rhed>EDD<fcel>91.2<fcel>85.4<fcel>88.3<nl><rhed>TableFormer<fcel>95.4<fcel>90.1<fcel>93.6<nl><caption><loc_252><loc_415><loc_445><loc_435>Table 4: Results of structure with content retrieved using cell detection on PubTabNet. In all cases the input is PDF documents with cropped tables.</caption></otsl>
 <page_footer><loc_241><loc_463><loc_245><loc_469>7</page_footer>
-<unordered_list><page_break>
-<list_item><loc_44><loc_50><loc_50><loc_55>a.</list_item>
+<page_break>
+<unordered_list><list_item><loc_44><loc_50><loc_50><loc_55>a.</list_item>
 <list_item><loc_54><loc_50><loc_408><loc_55>Red - PDF cells, Green - predicted bounding boxes, Blue - post-processed predictions matched to PDF cells</list_item>
 </unordered_list>
 <section_header_level_1><loc_44><loc_60><loc_232><loc_64>Japanese language (previously unseen by TableFormer):</section_header_level_1>
@@ -127,8 +127,8 @@
 <unordered_list><list_item><loc_256><loc_438><loc_445><loc_450>[1] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-</list_item>
 </unordered_list>
 <page_footer><loc_241><loc_463><loc_245><loc_469>8</page_footer>
-<unordered_list><page_break>
-<list_item><loc_57><loc_48><loc_234><loc_74>end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 , pages 213-229, Cham, 2020. Springer International Publishing. 5</list_item>
+<page_break>
+<unordered_list><list_item><loc_57><loc_48><loc_234><loc_74>end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 , pages 213-229, Cham, 2020. Springer International Publishing. 5</list_item>
 <list_item><loc_45><loc_76><loc_234><loc_95>[2] Zewen Chi, Heyan Huang, Heng-Da Xu, Houjin Yu, Wanxuan Yin, and Xian-Ling Mao. Complicated table structure recognition. arXiv preprint arXiv:1908.04729 , 2019. 3</list_item>
 <list_item><loc_45><loc_97><loc_234><loc_116>[3] Bertrand Couasnon and Aurelie Lemaitre. Recognition of Tables and Forms , pages 647-677. Springer London, London, 2014. 2</list_item>
 <list_item><loc_45><loc_118><loc_234><loc_143>[4] Herv´e D´ejean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), Apr. 2019. http://sac.founderit.com/. 2</list_item>
 
@@ -51,6 +51,8 @@ To meet the design criteria listed above, we developed a new model called TableF
 
 The paper is structured as follows. In Sec. 2, we give a brief overview of the current state-of-the-art. In Sec. 3, we describe the datasets on which we train. In Sec. 4, we introduce the TableFormer model-architecture and describe
 
+$^{1}$https://github.com/IBM/SynthTabNet
+
 its results &amp; performance in Sec. 5. As a conclusion, we describe how this new model-architecture can be re-purposed for other tasks in the computer-vision community.
 
 ## 2. Previous work and State of the Art
@@ -393,6 +395,8 @@ phan cell.
 
 Aditional images with examples of TableFormer predictions and post-processing can be found below.
 
+Figure 8: Example of a table with multi-line header.
+
 Figure 9: Example of a table with big empty distance between cells.
 
 <!-- image -->
@@ -403,8 +407,12 @@ Figure 10: Example of a complex table with empty cells.
 
 <!-- image -->
 
+Figure 11: Simple table with different style and empty cells.
+
 <!-- image -->
 
+Figure 12: Simple table predictions and post processing.
+
 Figure 13: Table predictions example on colorful table.
 
 <!-- image -->
@@ -425,6 +433,8 @@ Figure 15: Example with triangular table.
 
 <!-- image -->
 
+Figure 16: Example of how post-processing helps to restore mis-aligned bounding boxes prediction artifact.
+
 Figure 17: Example of long table. End-to-end example from initial PDF cells to prediction of bounding boxes, post processing and prediction of structure.
 
 <!-- image -->
@@ -53,6 +53,8 @@ In this paper, we present the DocLayNet dataset. It provides pageby-page layout
 - (3) Detailed Label Set : We define 11 class labels to distinguish layout features in high detail. PubLayNet provides 5 labels; DocBank provides 13, although not a superset of ours.
 - (4) Redundant Annotations : A fraction of the pages in the DocLayNet data set carry more than one human annotation.
 
+$^{1}$https://developer.ibm.com/exchanges/data/all/doclaynet
+
 This enables experimentation with annotation uncertainty and quality control analysis.
 
 - (5) Pre-defined Train-, Test- &amp; Validation-set : Like DocBank, we provide fixed train-, test- &amp; validation-sets to ensure proportional representation of the class-labels. Further, we prevent leakage of unique layouts across sets, which has a large effect on model accuracy scores.
@@ -85,6 +87,8 @@ We did not control the document selection with regard to language. The vast majo
 
 To ensure that future benchmarks in the document-layout analysis community can be easily compared, we have split up DocLayNet into pre-defined train-, test- and validation-sets. In this way, we can avoid spurious variations in the evaluation scores due to random splitting in train-, test- and validation-sets. We also ensured that less frequent labels are represented in train and test sets in equal proportions.
 
+$^{2}$e.g. AAPL from https://www.annualreports.com/
+
 Table 1 shows the overall frequency and distribution of the labels among the different sets. Importantly, we ensure that subsets are only split on full-document boundaries. This avoids that pages of the same document are spread over train, test and validation set, which can give an undesired evaluation advantage to models and lead to overestimation of their prediction accuracy. We will show the impact of this decision in Section 5.
 
 In order to accommodate the different types of models currently in use by the community, we provide DocLayNet in an augmented COCO format [16]. This entails the standard COCO ground-truth file (in JSON format) with the associated page images (in PNG format, 1025 × 1025 pixels). Furthermore, custom fields have been added to each COCO record to specify document category, original document filename and page number. In addition, we also provide the original PDF pages, as well as sidecar files containing parsed PDF text and text-cell coordinates (in JSON). All additional files are linked to the primary page images by their matching filenames.
@@ -125,6 +129,8 @@ Preparation work included uploading and parsing the sourced PDF documents in the
 
 Phase 2: Label selection and guideline. We reviewed the collected documents and identified the most common structural features they exhibit. This was achieved by identifying recurrent layout elements and lead us to the definition of 11 distinct class labels. These 11 class labels are Caption , Footnote , Formula , List-item , Pagefooter , Page-header , Picture , Section-header , Table , Text , and Title . Critical factors that were considered for the choice of these class labels were (1) the overall occurrence of the label, (2) the specificity of the label, (3) recognisability on a single page (i.e. no need for context from previous or next page) and (4) overall coverage of the page. Specificity ensures that the choice of label is not ambiguous, while coverage ensures that all meaningful items on a page can be annotated. We refrained from class labels that are very specific to a document category, such as Abstract in the Scientific Articles category. We also avoided class labels that are tightly linked to the semantics of the text. Labels such as Author and Affiliation , as seen in DocBank, are often only distinguishable by discriminating on
 
+$^{3}$https://arxiv.org/
+
 the textual content of an element, which goes beyond visual layout recognition, in particular outside the Scientific Articles category.
 
 At first sight, the task of visual document-layout interpretation appears intuitive enough to obtain plausible annotations in most cases. However, during early trial-runs in the core team, we observed many cases in which annotators use different annotation styles, especially for documents with challenging layouts. For example, if a figure is presented with subfigures, one annotator might draw a single figure bounding-box, while another might annotate each subfigure separately. The same applies for lists, where one might annotate all list items in one block or each list item separately. In essence, we observed that challenging layouts would be annotated in different but plausible ways. To illustrate this, we show in Figure 4 multiple examples of plausible but inconsistent annotations on the same pages.
@@ -146,6 +152,8 @@ Phase 3: Training. After a first trial with a small group of people, we realised
 
 05237a14f2524e3f53c8454b074409d05078038a6a36b770fcc8ec7e540deae0
 
+Figure 4: Examples of plausible annotation alternatives for the same page. Criteria in our annotation guideline can resolve cases A to C, while the case D remains ambiguous.
+
 were carried out over a timeframe of 12 weeks, after which 8 of the 40 initially allocated annotators did not pass the bar.
 
 Phase 4: Production annotation. The previously selected 80K pages were annotated with the defined 11 class labels by 32 annotators. This production phase took around three months to complete. All annotations were created online through CCS, which visualises the programmatic PDF text-cells as an overlay on the page. The page annotation are obtained by drawing rectangular bounding-boxes, as shown in Figure 3. With regard to the annotation practices, we implemented a few constraints and capabilities on the tooling level. First, we only allow non-overlapping, vertically oriented, rectangular boxes. For the large majority of documents, this constraint was sufficient and it speeds up the annotation considerably in comparison with arbitrary segmentation shapes. Second, annotator staff were not able to see each other's annotations. This was enforced by design to avoid any bias in the annotation, which could skew the numbers of the inter-annotator agreement (see Table 1). We wanted
 
@@ -1,8 +1,7 @@
 <doctag><section_header_level_1><loc_109><loc_79><loc_258><loc_87>JavaScript Code Example</section_header_level_1>
 <text><loc_109><loc_94><loc_390><loc_183>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.</text>
 <text><loc_109><loc_185><loc_390><loc_213>Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet,</text>
-<code><loc_110><loc_231><loc_215><loc_257><_unknown_>function add(a, b) { return a + b; } console.log(add(3, 5));</code>
-<caption><loc_182><loc_221><loc_317><loc_226>Listing 1: Simple JavaScript Program</caption>
+<code><loc_110><loc_231><loc_215><loc_257><_unknown_>function add(a, b) { return a + b; } console.log(add(3, 5));<caption><loc_182><loc_221><loc_317><loc_226>Listing 1: Simple JavaScript Program</caption></code>
 <text><loc_109><loc_265><loc_390><loc_353>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.</text>
 <text><loc_109><loc_355><loc_390><loc_383>Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet,</text>
 <page_footer><loc_248><loc_439><loc_252><loc_445>1</page_footer>
 
@@ -8,6 +8,8 @@ Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie co
 function add(a, b) { return a + b; } console.log(add(3, 5));
 ```
 
+Listing 1: Simple JavaScript Program
+
 Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.
 
 Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet,
 
@@ -179,12 +179,11 @@
 <list_item><loc_124><loc_232><loc_433><loc_238>-Any other person sees the entire TAX_ID as masked, for example, XXX-XX-XXXX.</list_item>
 <list_item><loc_124><loc_243><loc_433><loc_249>To implement this column mask, run the SQL statement that is shown in Example 3-9.</list_item>
 </unordered_list>
-<code><loc_112><loc_267><loc_430><loc_432><_unknown_>CREATE MASK HR_SCHEMA.MASK_TAX_ID_ON_EMPLOYEES ON HR_SCHEMA.EMPLOYEES AS EMPLOYEES FOR COLUMN TAX_ID RETURN CASE WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'HR' ) = 1 THEN EMPLOYEES . TAX_ID WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER = EMPLOYEES . USER_ID THEN EMPLOYEES . TAX_ID WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER <> EMPLOYEES . USER_ID THEN ( 'XXX-XX-' CONCAT QSYS2 . SUBSTR ( EMPLOYEES . TAX_ID , 8 , 4 ) ) WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'EMP' ) = 1 THEN EMPLOYEES . TAX_ID ELSE 'XXX-XX-XXXX' END ENABLE ;</code>
-<caption><loc_112><loc_257><loc_288><loc_262>Example 3-9 Creating a mask on the TAX_ID column</caption>
+<code><loc_112><loc_267><loc_430><loc_432><_unknown_>CREATE MASK HR_SCHEMA.MASK_TAX_ID_ON_EMPLOYEES ON HR_SCHEMA.EMPLOYEES AS EMPLOYEES FOR COLUMN TAX_ID RETURN CASE WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'HR' ) = 1 THEN EMPLOYEES . TAX_ID WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER = EMPLOYEES . USER_ID THEN EMPLOYEES . TAX_ID WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'MGR' ) = 1 AND SESSION_USER <> EMPLOYEES . USER_ID THEN ( 'XXX-XX-' CONCAT QSYS2 . SUBSTR ( EMPLOYEES . TAX_ID , 8 , 4 ) ) WHEN VERIFY_GROUP_FOR_USER ( SESSION_USER , 'EMP' ) = 1 THEN EMPLOYEES . TAX_ID ELSE 'XXX-XX-XXXX' END ENABLE ;<caption><loc_112><loc_257><loc_288><loc_262>Example 3-9 Creating a mask on the TAX_ID column</caption></code>
 <page_footer><loc_282><loc_477><loc_428><loc_482>Chapter 3. Row and Column Access Control</page_footer>
 <page_footer><loc_438><loc_477><loc_447><loc_482>27</page_footer>
-<unordered_list><page_break>
-<list_item><loc_112><loc_45><loc_368><loc_51>3. Figure 3-10 shows the masks that are created in the HR_SCHEMA.</list_item>
+<page_break>
+<unordered_list><list_item><loc_112><loc_45><loc_368><loc_51>3. Figure 3-10 shows the masks that are created in the HR_SCHEMA.</list_item>
 </unordered_list>
 <picture><loc_52><loc_60><loc_447><loc_107><caption><loc_53><loc_110><loc_239><loc_115>Figure 3-10 Column masks shown in System i Navigator</caption></picture>
 <section_header_level_1><loc_53><loc_128><loc_167><loc_135>3.6.6 Activating RCAC</section_header_level_1>
@@ -204,8 +203,8 @@
 <picture><loc_52><loc_270><loc_433><loc_408><caption><loc_53><loc_410><loc_284><loc_415>Figure 3-11 Selecting the EMPLOYEES table from System i Navigator</caption></picture>
 <page_footer><loc_53><loc_477><loc_64><loc_482>28</page_footer>
 <page_footer><loc_76><loc_477><loc_273><loc_482>Row and Column Access Control Support in IBM DB2 for i</page_footer>
-<unordered_list><page_break>
-<list_item><loc_112><loc_45><loc_420><loc_66>2. Figure 4-68 shows the Visual Explain of the same SQL statement, but with RCAC enabled. It is clear that the implementation of the SQL statement is more complex because the row permission rule becomes part of the WHERE clause.</list_item>
+<page_break>
+<unordered_list><list_item><loc_112><loc_45><loc_420><loc_66>2. Figure 4-68 shows the Visual Explain of the same SQL statement, but with RCAC enabled. It is clear that the implementation of the SQL statement is more complex because the row permission rule becomes part of the WHERE clause.</list_item>
 <list_item><loc_112><loc_320><loc_447><loc_341>3. Compare the advised indexes that are provided by the Optimizer without RCAC and with RCAC enabled. Figure 4-69 shows the index advice for the SQL statement without RCAC enabled. The index being advised is for the ORDER BY clause.</list_item>
 </unordered_list>
 <picture><loc_112><loc_75><loc_446><loc_301><caption><loc_112><loc_303><loc_267><loc_309>Figure 4-68 Visual Explain with RCAC enabled</caption></picture>