Some pages also fail validation against the page JSON schema, e.g.:
2025-03-06 23:04:53,514 cli_facsimile_assessment.py:527 ERROR: JSON schema validation error for facsimile EXP-1820-03-02-a-p0002: None is not of type 'string'
Failed validating 'type' in schema['properties']['r']['items']['properties']['pOf']:
{'type': 'string',
'description': 'The canonical ID of the content item to which the '
'page region belongs.'}
On instance['r'][1]['pOf']:
None
Regarding paragraph coordinates not being required: I can't be sure of the exact motivation, but I don't think it was a simple omission. It was probably necessary because of some OCR formats.
For instance, the SwissInfo bulletins data has no paragraph structure, and therefore no paragraph coordinates either, so I don't think making them a requirement is the best idea.
Regarding the "pOf" property: indeed, from Danae's project we learned that not all bounding boxes are linked to a content item.
This often happens when the OCR system identifies small or numerous bounding boxes in areas of the page where there are no articles (e.g. earrings, page numbers, or small things here and there).
All of this certainly needs some cleaning up, but with the information available in the OCR it's really hard to know whether a bounding box should be ignored, and there are certainly cases where no amount of heuristics can help disentangle it...
Just a situation of doing the best possible with the input format in this case...
In the example you provided where pOf is not defined, I think we can indeed adapt the schema for cases where it's None (or modify the code so that it's not added if it's None). And in general adding the schema validation is a good idea.
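A minimal sketch of the two options mentioned above, for illustration only. The nullable schema fragment and the helper `drop_null_pof` are hypothetical and not part of the existing code or schema; the sample page dict is made up apart from the ID taken from the log. Dropping the key only helps if "pOf" is not listed as required for regions, which is assumed here and not verified.

```python
from copy import deepcopy

# Option 1 (schema side): allow null in addition to string for "pOf".
# In JSON Schema, "type" may be a list of type names.
POF_SCHEMA_NULLABLE = {
    "type": ["string", "null"],
    "description": "The canonical ID of the content item to which the "
                   "page region belongs.",
}


def drop_null_pof(page: dict) -> dict:
    """Option 2 (code side): return a copy of the page in which every
    region whose "pOf" is None has that key removed.

    Hypothetical helper; assumes regions live under the "r" key, as in
    the validation error above.
    """
    cleaned = deepcopy(page)  # leave the caller's data untouched
    for region in cleaned.get("r", []):
        if "pOf" in region and region["pOf"] is None:
            del region["pOf"]
    return cleaned


# Made-up page with one linked and one unlinked region.
page = {
    "id": "EXP-1820-03-02-a-p0002",
    "r": [{"pOf": "some-content-item-id"}, {"pOf": None}],
}
cleaned_page = drop_null_pof(page)
```

Option 1 keeps the unlinked regions explicit in the output, while option 2 keeps the schema strict and simply omits the missing link.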
From what I remember of the code I inherited last fall, the schema validation had been removed because it created a huge overhead. I use a different approach for manifests, though, so maybe it's not as bad anymore.
For some reason I don't understand, the page JSON schema does not require coordinates for paragraphs. They are required for regions, lines and tokens.
https://github.com/impresso/impresso-schemas/blob/master/json/newspaper/page.schema.json
E.g. JDG-1936-07-19-a-p0001, IMP-2016-09-28-a-p0002 etc.