page json: Why do not all paragraphs need to have coordinates #45

Open
simon-clematide opened this issue Mar 6, 2025 · 2 comments

@simon-clematide
Contributor

For reasons I don't understand, the page JSON schema does not require coordinates for paragraphs, although they are required for regions, lines and tokens.

https://github.com/impresso/impresso-schemas/blob/master/json/newspaper/page.schema.json

E.g. JDG-1936-07-19-a-p0001, IMP-2016-09-28-a-p0002 etc.
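
To gauge how often paragraphs actually lack coordinates, a page JSON can be scanned directly. The following is a minimal sketch, not the project's code. It assumes regions are stored under "r", and it further assumes that paragraphs sit under a "p" key and coordinates under a "c" key; those key names, the function name and the sample page are assumptions made for illustration.

```python
from typing import Any, Iterator


def paragraphs_without_coords(page: dict[str, Any]) -> Iterator[tuple[int, int]]:
    """Yield (region_index, paragraph_index) for paragraphs lacking coordinates.

    Assumed layout (hypothetical key names): page["r"] is a list of regions,
    each region may hold paragraphs under "p", and coordinates live under "c".
    """
    for r_idx, region in enumerate(page.get("r", [])):
        for p_idx, paragraph in enumerate(region.get("p", [])):
            # A missing or empty "c" both count as "no coordinates".
            if not paragraph.get("c"):
                yield r_idx, p_idx


# Made-up page for illustration only.
sample_page = {
    "r": [
        {"c": [0, 0, 100, 50], "p": [{"c": [0, 0, 100, 20]}, {"l": []}]},
        {"c": [0, 60, 100, 90], "p": [{"l": []}]},
    ]
}

missing = list(paragraphs_without_coords(sample_page))
print(missing)
```

Running such a scan over the pages listed above would show whether missing paragraph coordinates are isolated cases or tied to particular OCR sources.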

simon-clematide added the "question" label (Further information is requested) on Mar 6, 2025
@simon-clematide
Contributor Author

Some pages also fail validation against the page JSON schema, e.g.:

2025-03-06 23:04:53,514 cli_facsimile_assessment.py:527 ERROR: JSON schema validation error for facsimile EXP-1820-03-02-a-p0002: None is not of type 'string'

Failed validating 'type' in schema['properties']['r']['items']['properties']['pOf']:
    {'type': 'string',
     'description': 'The canonical ID of the content item to which the '
                    'page region belongs.'}

On instance['r'][1]['pOf']:
    None
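
The failure above comes from a region whose "pOf" is null. A quick way to list such regions without running a full schema validation is to walk the "r" array directly. This is a hedged sketch: only the "r" and "pOf" keys are taken from the error message, while the function name and the sample data are illustrative assumptions.

```python
from typing import Any


def regions_with_null_pof(page: dict[str, Any]) -> list[int]:
    """Return indices of regions whose "pOf" key is present but None."""
    return [
        idx
        for idx, region in enumerate(page.get("r", []))
        if "pOf" in region and region["pOf"] is None
    ]


# Made-up page: the second region mirrors instance['r'][1]['pOf'] == None.
# The content item ID below is a placeholder, not a real identifier.
sample_page = {"r": [{"pOf": "placeholder-content-item-id"}, {"pOf": None}, {}]}

null_indices = regions_with_null_pof(sample_page)
print(null_indices)
```

Regions that omit "pOf" entirely are deliberately not reported, since the error concerns the type of the value rather than its absence.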

@piconti
Member

piconti commented Mar 7, 2025

Regarding whether paragraph coordinates should be required: I can't be sure of the exact motivation, but I don't think it was a simple omission. It was probably necessary because of some OCR formats.
For instance, the SwissInfo bulletins data does not have paragraphs as a structure, and therefore no paragraph coordinates either, so I don't think making them required is the best idea.

Regarding the "pOf" property: indeed, from Danae's project we learned that not all bounding boxes are linked to a content item.
This often happens when the OCR system identifies small or numerous bounding boxes in areas of the page where there are no articles (e.g. earrings, page numbers, or small things here and there).

All of this certainly needs some cleaning up, but with the information available in the OCR output it is really hard to know whether a bounding box should be ignored, and there are certainly cases where no amount of heuristics can help disentangle it...
In this case it is just a matter of doing the best possible with the input format.

In the example you provided where pOf is not defined, I think we can indeed adapt the schema to allow None (or modify the code so that the property is not added when it is None). In general, adding schema validation is a good idea.
From what I remember of the code I inherited last fall, schema validation had been removed because it created a huge overhead. I use a different approach for manifests, though, so maybe it's not as bad anymore.
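
The two options mentioned for "pOf" (relaxing the schema, or not writing the property when it is None) could look roughly like the sketch below. It is an illustration under assumptions, not the actual impresso code: the relaxed "type" is a proposed change rather than the current schema, and `drop_null_pof` and the sample page are hypothetical. The code-side option only helps if "pOf" is not listed as required for regions, which is not verified here.

```python
from typing import Any

# Option 1 (schema side): JSON Schema lets "type" be a list, so the property
# could accept both strings and null. The description is copied from the
# validation error; the list-valued "type" is the suggested change.
RELAXED_POF_SCHEMA = {
    "type": ["string", "null"],
    "description": "The canonical ID of the content item to which the "
                   "page region belongs.",
}


# Option 2 (code side): omit "pOf" when there is no linked content item.
def drop_null_pof(page: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the page whose regions carry no None-valued "pOf"."""
    cleaned = dict(page)
    cleaned["r"] = [
        {key: value for key, value in region.items()
         if not (key == "pOf" and value is None)}
        for region in page.get("r", [])
    ]
    return cleaned


# Made-up page; the ID and coordinates are placeholders.
sample_page = {
    "r": [
        {"pOf": "placeholder-content-item-id"},
        {"pOf": None, "c": [1, 2, 3, 4]},
    ]
}
cleaned_page = drop_null_pof(sample_page)
print(cleaned_page["r"])
```

Dropping the key keeps the schema strict for linked regions, while relaxing the type keeps unlinked regions explicit in the data; which is preferable depends on how downstream code treats a missing "pOf".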
