Renaming/fixing of content-item schema #42

Open
3 tasks
piconti opened this issue Dec 16, 2024 · 5 comments

@piconti
Member

piconti commented Dec 16, 2024

The content-item schema is meant to represent the rebuilt content-items, but it is not very clear or explicit, and it might be outdated. It should be updated and made clearer.

In addition, as pointed out by @simon-clematide, the property "t" is present twice in the schema, for both title and token.
The two occurrences are not at the same level of the schema hierarchy, so this has not caused many issues so far, but it is not ideal and should probably be changed.

Action points for this issue are thus:

  • Making it more explicit that the content-item schema corresponds to the rebuilt, potentially by renaming the schema
  • Ensuring it is up to date and still aligned with the actual content of the rebuilt data
  • Renaming the title or token property, once it has been modified in the data.
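
To illustrate the duplicated "t" described above, here is a minimal sketch of how one could list property names that are declared at more than one level of a JSON schema. The toy schema, its `lines` field, and the helper names `property_paths` and `duplicated_names` are hypothetical and only serve the example; they are not taken from the actual content-item schema.

```python
from collections import defaultdict


def property_paths(schema, path=""):
    """Yield (name, path) for every key declared under a 'properties' mapping."""
    if not isinstance(schema, dict):
        return
    for name, sub in schema.get("properties", {}).items():
        child = f"{path}/{name}"
        yield name, child
        yield from property_paths(sub, child)
    # Also descend into array items, marking the array level with "[]".
    items = schema.get("items")
    if isinstance(items, dict):
        yield from property_paths(items, path + "[]")


def duplicated_names(schema):
    """Return {property name: [paths]} for names declared more than once."""
    seen = defaultdict(list)
    for name, p in property_paths(schema):
        seen[name].append(p)
    return {n: ps for n, ps in seen.items() if len(ps) > 1}


# Hypothetical toy schema: "t" is used for a title at the top level
# and for a token inside a nested array of objects.
toy_schema = {
    "type": "object",
    "properties": {
        "t": {"type": "string"},
        "lines": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"t": {"type": "string"}},
            },
        },
    },
}

for name, paths in duplicated_names(toy_schema).items():
    print(name, paths)
```

A check like this only covers the `properties` and `items` keywords; a schema that uses `$ref`, `definitions`, or `oneOf` would need extra handling.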
@simon-clematide
Contributor

I propose that renaming the title property (e.g., from t to "title") should coincide with addressing additional title-related issues at the ingestion level. Below are some notes from linguistic processing that highlight key challenges:

  • Artificial titles: Completely fabricated titles, such as those found in advertisements (e.g., Adv.7 Page 4).
  • Placeholder titles: Artificial placeholders for missing titles (e.g., UNKNOWN, UNTITLED, UNTITLED AD, UNTITLED ARTICLE).
  • Overly long titles: Titles that are longer than the associated text.
  • Title-fulltext discrepancies:
    • Titles often appear within the fulltext but with variations (e.g., ellipses, missing characters, unresolved XML entities, or inconsistent tokenization).
    • In some cases, the title is entirely absent from the fulltext.

Addressing these issues alongside the property renaming would improve consistency and reliability in title handling.
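
As a rough illustration of the categories listed above, the following sketch flags a few of these cases for a single title and fulltext pair. Everything in it is an assumption made for the example: the function name `title_issues`, the issue labels, and the regular expression for generated advertisement titles (modeled only on the one example "Adv.7 Page 4"). The substring test is deliberately naive and would not catch the variations mentioned above (ellipses, missing characters, unresolved XML entities, tokenization differences).

```python
import re

# Placeholder titles quoted in the list above.
PLACEHOLDERS = {"UNKNOWN", "UNTITLED", "UNTITLED AD", "UNTITLED ARTICLE"}

# Assumed pattern for generated ad titles, based on the example "Adv.7 Page 4".
ARTIFICIAL_AD = re.compile(r"^Adv\.\s*\d+\s+Page\s+\d+$", re.IGNORECASE)


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def title_issues(title, fulltext):
    """Return a list of (hypothetical) issue labels for one title."""
    issues = []
    stripped = title.strip()
    if stripped.upper() in PLACEHOLDERS:
        issues.append("placeholder")
    if ARTIFICIAL_AD.match(stripped):
        issues.append("artificial")
    if len(stripped) > len(fulltext.strip()):
        issues.append("longer_than_text")
    # Naive exact-substring check after normalization.
    if normalize(stripped) not in normalize(fulltext):
        issues.append("absent_from_fulltext")
    return issues


print(title_issues("UNTITLED AD", "Some advert text here"))
print(title_issues("Adv.7 Page 4", "Buy fresh bread at the market"))
print(title_issues("Le  Conseil fédéral", "Le Conseil fédéral a décidé"))
```

Counting these labels over a collection would give per-category statistics, which could help decide how each case should be handled.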

@piconti
Member Author

piconti commented Jan 13, 2025

I agree, that's a good idea. We first need to figure out how we want to handle each of these situations, though.
Would this be something we want for the consolidated rebuilt or for the first rebuilt?

@simon-clematide
Contributor

I think it should be in the first rebuilt.

@piconti
Member Author

piconti commented Jan 17, 2025

But if this information is obtained thanks to linguistic processing, would it not be more logical or practical to make this change in the consolidated rebuilt?

The process creating the rebuilt is already quite complex and resource-demanding, as it needs to handle many issue and page documents at the same time. Since titles are already inherited values, I think it makes more sense to either try to fix them from the start (canonical) or in the consolidated rebuilt.
The argument against the canonical, however, is that a lot of different code generates the canonical, and it's still not clear whether this problem is due to the original OLR or to the processing. For things inherited from the original OLR, once again I think it makes more sense to fix them all at the same moment, in a much more unified approach, when creating the consolidated rebuilt. This is especially true if the linguistic processing is what tells us which case applies, since that is necessary to know how to fix it.

@simon-clematide
Contributor

Yes, sure. The lingproc was just the place where the issue caused a bit of a headache. The fix needs to be done earlier. I am currently collecting stats on the different issues for all processed lingproc items (meaning it will be restricted to the supported languages, de/fr). I'll put the results on s3 once the processing has gone through (which takes a bit of time).

2 participants