Renaming/fixing of content-item schema #42

piconti · 2024-12-16T09:02:59Z

The Content-item schema is meant to represent the rebuilt content-items, but it's not very clear or explicit, and it might be outdated. It should be updated and made clearer.

In addition, as pointed out by @simon-clematide, the property "t" is present twice in the schema, once for title and once for token.
The two occurrences are not at the same level of the schema hierarchy, so this has not caused too many issues so far, but it is not ideal and should probably be changed.
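To make the duplication concrete, here is a minimal sketch of how one could list property names that are declared at more than one level of a JSON-Schema-like document. The `toy_schema` below is a hypothetical, heavily simplified stand-in and not the actual content-item schema; the `tokens` key and the helper names `property_paths` and `duplicated_names` are assumptions made for illustration only.

```python
from collections import defaultdict


def property_paths(schema, path=""):
    """Yield (name, path) for each property declared anywhere in the schema."""
    if not isinstance(schema, dict):
        return
    # Properties declared directly on this object.
    for name, sub in schema.get("properties", {}).items():
        here = f"{path}/{name}"
        yield name, here
        yield from property_paths(sub, here)
    # Properties declared on the elements of an array.
    items = schema.get("items")
    if isinstance(items, dict):
        yield from property_paths(items, path + "[]")


def duplicated_names(schema):
    """Map each property name used more than once to the paths where it occurs."""
    seen = defaultdict(list)
    for name, where in property_paths(schema):
        seen[name].append(where)
    return {name: paths for name, paths in seen.items() if len(paths) > 1}


# Hypothetical, simplified schema: "t" is used for the title at the top
# level and for the token text inside a nested array.
toy_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "t": {"type": "string"},  # title
        "tokens": {  # hypothetical property name
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"t": {"type": "string"}},  # token
            },
        },
    },
}

print(duplicated_names(toy_schema))
```

A check like this could be run against the real schema file to confirm that "t" is the only name reused across levels before deciding what to rename.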

Action points for this issue are thus:

- Making it more explicit that the content-item schema corresponds to the rebuilt, potentially by renaming the schema
- Ensuring it is up to date and still aligned with the actual content of the rebuilt data
- Renaming the title or token property, once it has been modified in the data


simon-clematide · 2025-01-13T07:22:02Z

I propose that renaming the title property (e.g., from t to "title") should coincide with addressing additional title-related issues at the ingestion level. Below are some notes from linguistic processing that highlight key challenges:

- Artificial titles: Completely fabricated titles, such as those found in advertisements (e.g., Adv.7 Page 4).
- Placeholder titles: Artificial placeholders for missing titles (e.g., UNKNOWN, UNTITLED, UNTITLED AD, UNTITLED ARTICLE).
- Overly long titles: Titles that are longer than the associated text.
- Title-fulltext discrepancies:
  - Titles often appear within the fulltext but with variations (e.g., ellipses, missing characters, unresolved XML entities, or inconsistent tokenization).
  - In some cases, the title is entirely absent from the fulltext.

Addressing these issues alongside the property renaming would improve consistency and reliability in title handling.
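As a rough illustration of the categories above, the following sketch flags a title against its fulltext. It is not the code used in the linguistic processing: the function name `title_issues`, the issue labels, and the regular expression for artificial titles (derived only from the single "Adv.7 Page 4" example) are assumptions. It also checks only for an exact match after whitespace and case normalization, so it will not recognize titles that appear in the fulltext with ellipses, missing characters, or unresolved XML entities.

```python
import re

# Placeholder values listed in the notes above.
PLACEHOLDERS = {"UNKNOWN", "UNTITLED", "UNTITLED AD", "UNTITLED ARTICLE"}

# Assumed pattern for fabricated advertisement titles such as "Adv.7 Page 4".
ARTIFICIAL = re.compile(r"^Adv\.\s*\d+\s+Page\s+\d+$")


def _norm(text):
    """Collapse whitespace and ignore case for a lenient comparison."""
    return " ".join(text.split()).casefold()


def title_issues(title, fulltext):
    """Return a list of labels describing possible problems with a title."""
    issues = []
    stripped = title.strip()
    if stripped.upper() in PLACEHOLDERS:
        issues.append("placeholder")
    if ARTIFICIAL.match(stripped):
        issues.append("artificial")
    if len(stripped) > len(fulltext.strip()):
        issues.append("longer_than_text")
    if _norm(stripped) not in _norm(fulltext):
        issues.append("not_in_fulltext")
    return issues


print(title_issues("UNTITLED AD", "Buy our soap today, only 5 cents."))
print(title_issues("Adv.7 Page 4", "Fine hats"))
print(title_issues("Local news", "Local  news\nThe council met."))
```

Counting these labels over a collection would give per-category statistics, which could help decide how each situation should be handled before the property is renamed.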

piconti · 2025-01-13T09:53:09Z

I agree, that's a good idea. We first need to figure out how we want to handle each of these situations, though.
Would this be something we want for the consolidated rebuilt or for the first rebuilt?

simon-clematide · 2025-01-16T22:30:46Z

I think it should be in the first rebuilt.

piconti · 2025-01-17T08:32:37Z

But if this information is obtained thanks to linguistic processing, would it not be more logical or practical to make this change in the consolidated rebuilt?

The process creating the rebuilt is already quite complex and resource-demanding as it needs to handle many issue and page documents at the same time. Since titles are already inherited values, I think it makes more sense to either try to fix it from the start (canonical) or in the consolidated rebuilt.
The argument against the canonical, however, is that there is a lot of different code that generates the canonical, and it's still not clear whether this problem is due to the original OLR or the processing. For things inherited from the original OLR, once again I think it makes more sense to fix them all at the same moment, in a much more unified approach, when creating the consolidated rebuilt. This is especially true if the linguistic processing is what tells us which case applies, since we need to know that in order to fix it.

simon-clematide · 2025-01-21T13:16:42Z

Yes, sure. The lingproc was just the place where the issue caused a bit of a headache. The fix needs to be done earlier. I am currently collecting stats on the different issues for all processed lingproc items (meaning it will be restricted to the supported languages de/fr). I'll put the results on s3 once the processing has gone through (which takes a bit of time).

piconti self-assigned this Dec 16, 2024

piconti mentioned this issue Dec 16, 2024

[Rebuilt] - Fix the possible confusion between title and token properties. impresso/impresso-text-acquisition#135

Open
