This issue serves as a discussion thread and checklist for the next OSCAR version.
We should aim to fix existing bugs and problems as well as add potential features.
Issues
Increase the robustness of newline handling in documents. We should ensure that newlines and newline characters are stored and read back consistently in both Rust and Python, so that end users don't run into issues (see "OSCAR-2109 huggingface datasets are misaligned and truncated", #18).
Enforce correct language tagging. We should produce tags that are valid BCP-47, with no mismatches such as als/gsw (see "[BUG] Chavacano marked as "cbr" rather than cbk", ungoliant#53). Ideally, add a verification layer after sentence identification to correct potentially erroneous tags, and report the cases where such a tag translation could not be made.
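The verification layer described above could look roughly like the following. This is only a minimal sketch: the names `TAG_CORRECTIONS`, `KNOWN_TAGS` and `verify_tags` are hypothetical, the correction table contains only the two mismatches mentioned in this issue, and `KNOWN_TAGS` is a tiny stand-in for a real list of valid tags (for example one derived from the IANA language subtag registry).

```python
# Hypothetical correction table: label emitted by the identifier -> intended tag.
# Both pairs come from the mismatches referenced in this issue.
TAG_CORRECTIONS = {
    "als": "gsw",  # Alemannic; "als" is Tosk Albanian in ISO 639-3
    "cbr": "cbk",  # Chavacano; "cbr" is Cashibo-Cacataibo in ISO 639-3
}

# Stand-in for the full set of tags we accept (assumption, not the real list).
KNOWN_TAGS = {"en", "fr", "gsw", "cbk"}


def verify_tags(labels, corrections=TAG_CORRECTIONS, known=KNOWN_TAGS):
    """Correct known-bad labels and report the ones that cannot be resolved.

    Returns (fixed, unresolved): `fixed` has one entry per input label
    (None when no valid tag was found), `unresolved` lists (index, label)
    pairs so they can be reported.
    """
    fixed, unresolved = [], []
    for index, label in enumerate(labels):
        tag = corrections.get(label, label)
        if tag in known:
            fixed.append(tag)
        else:
            fixed.append(None)
            unresolved.append((index, label))
    return fixed, unresolved


fixed, unresolved = verify_tags(["en", "als", "cbr", "xx-bad"])
print(fixed)
print(unresolved)
```

Keeping the unresolved labels as a separate list, rather than silently dropping them, is what would make the "report" part of the item possible.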
Features
Add other blocklists (from UT1). We still need to settle on which ones.
Add KenLM model-based filtering.
Rework the annotation part: the adult label is too strong for something we know has a lot of false positives. Also, with the inclusion of model-based filtering, we'll have to find a way to specify the annotation source.
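One possible way to record the annotation source is to store each annotation as a (label, source) pair instead of a bare label. The sketch below is an assumption for discussion, not the current OSCAR schema: the `Annotation` class, the `labels_from` helper and the source strings `"blocklist:ut1"` and `"model:kenlm"` are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    label: str   # what was detected, e.g. "adult"
    source: str  # what produced it, e.g. "blocklist:ut1" (hypothetical naming)


def labels_from(annotations, source_prefix):
    """Return the labels whose source starts with the given prefix."""
    return [a.label for a in annotations if a.source.startswith(source_prefix)]


# A document flagged by both a blocklist and a model (illustrative values).
annotations = [
    Annotation(label="adult", source="blocklist:ut1"),
    Annotation(label="adult", source="model:kenlm"),
]

print(labels_from(annotations, "blocklist:"))
print(labels_from(annotations, "model:"))
```

With the source attached, end users could choose to trust only some annotation sources, which matters for labels known to have many false positives.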