page json: Why do not all paragraphs need to have coordinates #45

simon-clematide · 2025-03-06T21:33:02Z

For some reason I don't understand, the page JSON schema does not require coordinates for paragraphs, although they are required for regions, lines and tokens.

https://github.com/impresso/impresso-schemas/blob/master/json/newspaper/page.schema.json

E.g. JDG-1936-07-19-a-p0001, IMP-2016-09-28-a-p0002 etc.
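To find which paragraphs on a page lack coordinates, a small check along these lines may help. It is only a sketch: the `r` key for regions appears in the validation output in this thread, but the `p` key for a region's paragraphs and the `c` key for coordinates are assumptions about the page JSON layout, and the sample page is made up.

```python
from typing import Any, Iterator


def paragraphs_without_coords(page: dict[str, Any]) -> Iterator[tuple[int, int]]:
    """Yield (region_index, paragraph_index) for paragraphs that have no coordinates.

    Assumes regions live under "r", paragraphs under "p" and coordinates
    under "c"; only "r" is confirmed by the error output in this thread.
    """
    for r_idx, region in enumerate(page.get("r", [])):
        for p_idx, paragraph in enumerate(region.get("p", [])):
            # A missing key, None, or an empty list all count as "no coordinates".
            if not paragraph.get("c"):
                yield r_idx, p_idx


# Made-up page: the second paragraph of the only region has no "c".
page = {
    "id": "example-page",  # placeholder, not a real canonical ID
    "r": [
        {
            "c": [0, 0, 100, 50],
            "p": [
                {"c": [0, 0, 100, 20], "l": []},
                {"l": []},
            ],
        }
    ],
}

missing = list(paragraphs_without_coords(page))
print(missing)  # [(0, 1)]
```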

simon-clematide · 2025-03-06T22:05:30Z

Some pages also fail validation against the page JSON schema, e.g.:

2025-03-06 23:04:53,514 cli_facsimile_assessment.py:527 ERROR: JSON schema validation error for facsimile EXP-1820-03-02-a-p0002: None is not of type 'string'

Failed validating 'type' in schema['properties']['r']['items']['properties']['pOf']:
    {'type': 'string',
     'description': 'The canonical ID of the content item to which the '
                    'page region belongs.'}

On instance['r'][1]['pOf']:
    None

piconti · 2025-03-07T16:13:23Z

As for making paragraph coordinates required: I cannot be sure of the exact motivation, but I don't think it was a simple omission. It was probably necessary because of some OCR formats.
For instance, the SwissInfo bulletins data has no paragraph structure, and thus no paragraph coordinates either, so I don't think adding this as a requirement is the best idea.

As for the "pOf" property: indeed, from Danae's project we learned that not all bounding boxes are linked to a content item.
This often happens when the OCR system identifies small or numerous bounding boxes in areas of the page where there are no articles (e.g. earrings, page numbers, or small things here and there).

All of this certainly needs some cleaning up, but with the information available in the OCR output it is really hard to know whether a bounding box should be ignored, and there are certainly cases where no amount of heuristics can help disentangle it.
It is simply a matter of doing the best possible with the input format in this case.

In the example you provided where pOf is not defined, I think we can indeed adapt the schema to allow None (or modify the code so that the property is not added when it is None). And in general, adding schema validation is a good idea.
From what I remember of the code I inherited last fall, schema validation had been removed because it created a huge overhead. I use a different approach for manifests, though, so it may not be as bad anymore.
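The two options for a missing pOf could look roughly like this. This is a sketch, not the actual impresso code: `POF_SCHEMA_NULLABLE` and `drop_null_pof` are hypothetical names, and the sample page is made up. The second option only makes such pages validate if pOf is not listed as a required property of regions, which would need checking against the schema.

```python
import copy
from typing import Any

# Option 1: relax the schema fragment so that pOf may be a string or null.
# In JSON Schema, "type" accepts a list of allowed types.
POF_SCHEMA_NULLABLE = {
    "type": ["string", "null"],
    "description": "The canonical ID of the content item to which the "
                   "page region belongs.",
}


# Option 2: leave the schema as it is and omit pOf when it has no value.
def drop_null_pof(page: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the page in which regions with pOf=None have no pOf key."""
    cleaned = copy.deepcopy(page)  # do not modify the caller's object
    for region in cleaned.get("r", []):
        if region.get("pOf") is None:
            region.pop("pOf", None)
    return cleaned


# Made-up page with one linked and one unlinked region.
page = {
    "r": [
        {"c": [0, 0, 10, 10], "pOf": "some-content-item-id"},  # placeholder ID
        {"c": [5, 5, 2, 2], "pOf": None},
    ]
}

cleaned = drop_null_pof(page)
print(cleaned["r"][1])  # {'c': [5, 5, 2, 2]}
```

Dropping the key keeps the meaning of pOf unchanged for downstream code that only tests for its presence, whereas allowing null means every consumer has to handle both cases.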

simon-clematide added the question Further information is requested label Mar 6, 2025

simon-clematide assigned piconti Mar 7, 2025
