Resource type: Comparable corpus
Resource name: CALYOU «A Comparable Spoken Algerian Corpus Harvested from YouTube»
Languages: Algerian dialect, Modern Standard Arabic, French and English.
Modality: Written.
Use of the resource: Extraction of bilingual segments and lexicon.
Resource availability: Freely available
License: GNU General Public License v3.0
Resource URL: https://smart.loria.fr/corpora/
Resource description:
CALYOU is a comparable corpus of YouTube comments, extracted automatically using a multilingual word embedding approach. The comments may be written in the local Arabic dialect, Modern Standard Arabic (MSA), French or English.
The source part of the corpus contains comments written in Latin script: Arabizi (Arabic dialect written in Latin characters), French or English. The target part contains comments written in Arabic script (MSA or Algerian dialect).
We offer two versions of CALYOU. The first keeps only the comparable documents with a high degree of comparability, which yields a corpus of 5.19k entries. The second is moderately comparable and contains 38.5k comparable comments.
Some examples are given below.
Source comment | Target comment |
vive tahar misoum rabi yahafdak | تحيا طاهر ميسوم |
Merci chemsou je t'adore | نحبك شمسو بزاف |
10 mai 2017 mazalni nchouf la vidéo chkoun kima ana | لي مزال يعاود ف فيديو يكليكي جام علبالي رانا بزاف |
wch 3nwan song li darha svp jawboni | واش هي الاغنية لي دارها فالفيديو وشكرا |
nice song by hasni and also by zouhair nice cover all the best | اغنية رائعة للمرحوم حسني شكراا زهير سنة مزال حاضر الشاب حسني |
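The source/target split described above (Latin script on the source side, Arabic script on the target side) can be checked with a few lines of Python. This is only a minimal sketch: it assumes rows are laid out as in the examples above (two fields separated by `|`), which may not match the actual format of the distributed files, and the helper names `parse_pair` and `dominant_script` are hypothetical, introduced here for illustration.

```python
import unicodedata


def parse_pair(line: str) -> tuple[str, str]:
    """Split one 'source | target |' row into (source, target).

    Assumption: fields are pipe-separated as in the examples above.
    """
    fields = [f.strip() for f in line.split("|")]
    fields = [f for f in fields if f]  # drop the empty field after a trailing '|'
    if len(fields) != 2:
        raise ValueError(f"expected 2 fields, got {len(fields)}: {line!r}")
    return fields[0], fields[1]


def dominant_script(text: str) -> str:
    """Return 'latin', 'arabic' or 'other' by counting letters per script."""
    counts = {"latin": 0, "arabic": 0}
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits (common in Arabizi), spaces, punctuation
        name = unicodedata.name(ch, "")
        if name.startswith("LATIN"):
            counts["latin"] += 1
        elif name.startswith("ARABIC"):
            counts["arabic"] += 1
    if counts["latin"] == 0 and counts["arabic"] == 0:
        return "other"
    return max(counts, key=counts.get)


if __name__ == "__main__":
    rows = [
        "vive tahar misoum rabi yahafdak | تحيا طاهر ميسوم |",
        "Merci chemsou je t'adore | نحبك شمسو بزاف |",
    ]
    for row in rows:
        source, target = parse_pair(row)
        print(dominant_script(source), "->", dominant_script(target))
```

Counting letters per script rather than testing only the first character keeps the check robust to comments that mix scripts or begin with digits, as Arabizi often does.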
If you use this resource, please cite the following paper:
Karima Abidi, Mohamed Amine Menacer, Kamel Smaili. CALYOU: A Comparable Spoken Algerian Corpus Harvested from YouTube. 18th Annual Conference of the International Speech Communication Association (Interspeech), Aug 2017, Stockholm, Sweden. Pdf: https://hal.archives-ouvertes.fr/hal-01531591/file/KarimaKAmelInterspeech2017.pdf