OSCAR 22.XX scope #21

Uinelj · 2022-09-07T09:52:27Z

This issue serves as a discussion and checklist for the scope of the next OSCAR version.

We should aim to fix existing bugs and problems as well as add potential features.

Issues

Increase robustness of newline handling in documents. Ensure that newlines and newline characters are stored and read back consistently in both Rust and Python, so that end users don't run into issues (see OSCAR-2109 huggingface datasets are misaligned and truncated #18).
Enforce correct language tagging. Tags should be fully BCP-47 valid, with no mismatches (als/gsw) (see [BUG] Chavacano marked as "cbr" rather than cbk ungoliant#53). Ideally, add a verification layer after sentence identification to correct potentially erroneous tags, and report the cases where a tag could not be corrected.
Systematically inspect very low-resource languages and remove subcorpora where the data is not usable (Wu Chinese dataset is of bad quality. #5; Tajik language contains large chunks of Uzbek sentences in Cyrillic script. #6; Quality warning: Chavacano #10; Quality warning: Northern Frisian #11; Quality warning: Somali #12; Quality warning: Neapolitan #13; Scots language corpus is non linguistic? #14).
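
To illustrate the newline item, here is a minimal sketch of one way such misalignment can arise. It assumes the corpus is stored as JSON Lines (one record per physical line) and that some reader splits the file with Python's `str.splitlines()`, which treats characters such as U+2028 as line boundaries in addition to `\n`. The record, the `content` key and both helper functions are hypothetical and only serve the illustration; this is not a description of the actual OSCAR or `datasets` code.

```python
import json

# Hypothetical document; U+2028 (LINE SEPARATOR) sits inside the first line.
record = {"content": "first\u2028still first\nsecond"}

# ensure_ascii=False writes U+2028 verbatim, so the serialized record
# contains a character that str.splitlines() treats as a line boundary.
raw = json.dumps(record, ensure_ascii=False)

# The default (ensure_ascii=True) escapes every non-ASCII character,
# which keeps the record on a single physical line for any reader.
safe = json.dumps(record)


def count_records(jsonl_text: str) -> int:
    """Count records the way a reader relying on str.splitlines() would."""
    return len(jsonl_text.splitlines())


def count_records_lf(jsonl_text: str) -> int:
    """Count records by splitting on LF only, ignoring one trailing newline."""
    return len(jsonl_text.rstrip("\n").split("\n"))
```

With the unescaped form, the two readers disagree on how many records the file holds, while the escaped form is read as one record by both. Splitting on `\n` only on the reading side, or escaping on the writing side, are two possible mitigations.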

Add other blocklists (from UT1), and settle on which ones to use.
Add KenLM model based filtering.
Rework the annotation part: adult is too strong for something we know has a lot of false positives. Also, with the inclusion of model based filtering, we'll have to find a way to specify annotation source.
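
As a starting point for the annotation discussion, here is a minimal sketch of annotations that carry their source alongside the label. Everything in it is an assumption made for illustration: the `Annotation` type, the `"model:kenlm"` source string, the `"high_perplexity"` label, and the convention that a perplexity above the threshold triggers the annotation. The scoring function is passed in as a parameter, so no particular KenLM API is implied.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Annotation:
    label: str   # what was flagged (hypothetical label names)
    source: str  # which component produced it, e.g. a blocklist or a model


def annotate_by_perplexity(
    text: str,
    perplexity: Callable[[str], float],
    threshold: float,
    source: str = "model:kenlm",
) -> List[Annotation]:
    """Return a sourced annotation when the score exceeds the threshold.

    `perplexity` is any callable mapping text to a score; in practice it
    would wrap a language model, here it is left to the caller.
    """
    if perplexity(text) > threshold:
        return [Annotation(label="high_perplexity", source=source)]
    return []
```

Keeping the source next to the label would let downstream users decide which annotations to trust, for example ignoring blocklist-based labels known to have many false positives while keeping model-based ones.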

Uinelj · 2022-09-23T12:33:01Z

Use zstd instead of gzip. Check how we can use dictionaries. Check whether we can do multipart.

ruaMIsmail · 2022-09-23T12:53:55Z

The sampling is already finished, isn't it?

Uinelj · 2022-09-23T13:08:55Z

Sampling is related to oscar-tools and has already been merged: oscar-project/oscar-tools#23

This issue tracks the changes and new features of the output corpus.

Uinelj added this to the OSCAR 22.XX milestone Sep 7, 2022