Kola Saami Christian Text Corpus (KSCTC)

Current version: Alpha (not released yet)

Authors and credits

The originator of the dataset and administrator of this project is Michael Rießler. The second main author is Ilia Egorov.

The paper Rießler 2024 includes a more complete description. If you use the data, please cite this paper (BibTeX entry below).


@incollection{riesler2024a,
	address = {Helsinki},
	author = {Rie{\ss}ler, Michael},
	booktitle = {Proceedings of the 9th International Workshop on Computational Linguistics for Uralic Languages},
	editor = {H{\"a}m{\"a}l{\"a}inen, Mika and Pirinen, Flammie and Macias, Melany and Avila, Mario Crespo},
	pages = {138-144},
	publisher = {ACL},
	title = {Kola Saami Christian Text Corpus},
	url = {https://aclanthology.org/2024.iwclul-1.18},
	year = {2024}}

License, data reuse, and collaboration

The corpus data in this repository are free and open, licensed under CC-BY (see LICENSE). Other parts of the corpus, which are under "bound" licenses, are kept in KSCTC-bound, a private repository visible only to project collaborators.

If you are interested in collaboration, contact Michael Rießler. Collaborators must acknowledge the terms of use by signing the Code of Conduct (in the works).

Data format

The data are modelled in XML using the corpus tool ELAN.

The following tiers are included:

  • "ref" - project-internal ID for each single chunk
  • "ref(orig)" - if applicable: verse number (Gospel of Matthew and Lord's Prayer), used as the anchor for aligning parallel texts
  • "orth(orig_)" - original text in original script (the part after "orig_" refers to an ID in the Bibliography)
  • "orth" - normalized orthography
  • "ft-rus(orig_)" - if applicable: original Russian text on which the translation is based (the part after "orig_" refers to an ID in the Bibliography)
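Because ELAN stores its annotations as XML, the tiers can be read with a standard XML parser. The following is a minimal sketch, not part of the project's tooling: the sample document, its tier IDs, and its values are invented for illustration, and the real session files may name their tiers differently (for example with an originator code attached).

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for an ELAN .eaf file. Tier IDs and values are invented
# for illustration; real KSCTC files may differ.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1000"/>
  </TIME_ORDER>
  <TIER TIER_ID="ref" LINGUISTIC_TYPE_REF="default">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>example-001</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="orth" LINGUISTIC_TYPE_REF="default" PARENT_REF="ref">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a2" ANNOTATION_REF="a1">
        <ANNOTATION_VALUE>example text</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""


def read_tiers(eaf_xml: str) -> dict[str, list[str]]:
    """Map each tier ID to the list of its annotation values, in file order."""
    root = ET.fromstring(eaf_xml)
    tiers: dict[str, list[str]] = {}
    for tier in root.iter("TIER"):
        # Both aligned and reference annotations keep their text
        # in an ANNOTATION_VALUE child element.
        values = [value.text or "" for value in tier.iter("ANNOTATION_VALUE")]
        tiers[tier.get("TIER_ID", "")] = values
    return tiers


if __name__ == "__main__":
    for tier_id, values in read_tiers(SAMPLE_EAF).items():
        print(tier_id, values)
```

For a file on disk, `ET.parse(path).getroot()` could replace the `ET.fromstring` call.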

Metadata

Basic metadata can be extracted from the overview below. Bibliographic metadata about the text sources is found in Bibliography. Personal metadata about the text originators is included in the tier names in each individual session (e.g. "PAP1821m" is a male person born 1821, "AAA1932f" is a female person born 1932).
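The originator codes can be split mechanically. The sketch below assumes, based only on the two examples above, that a code consists of three capital letters, a four-digit birth year, and "m" or "f"; the meaning of the three-letter prefix is not stated here, so it is kept as an opaque string. The names `Originator` and `parse_originator` are hypothetical.

```python
import re
from typing import NamedTuple


class Originator(NamedTuple):
    prefix: str      # three-letter code; its meaning is not specified here
    birth_year: int
    gender: str      # "male" or "female"


# Assumed pattern, inferred from "PAP1821m" and "AAA1932f".
_CODE = re.compile(r"^([A-Z]{3})(\d{4})([mf])$")
_GENDER = {"m": "male", "f": "female"}


def parse_originator(code: str) -> Originator:
    """Split an originator code into prefix, birth year, and gender."""
    match = _CODE.match(code)
    if match is None:
        raise ValueError(f"unexpected originator code: {code!r}")
    prefix, year, gender = match.groups()
    return Originator(prefix, int(year), _GENDER[gender])


if __name__ == "__main__":
    print(parse_originator("PAP1821m"))
    print(parse_originator("AAA1932f"))
```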

Simple corpus statistics

The following table provides an overview of the included data and the state of their completion.

| Session | Description | Tokens | Status | Use |
|---|---|---|---|---|
| sia18760800_bibleMATTHEW1878a-23 | Matthew 23 | | | free |
| sia18760800_bibleMATTHEW1878a-24 | Matthew 24 | | | free |
| sia18760800_bibleMATTHEW1878a-25 | Matthew 25 | | | free |
| sia18760800_bibleMATTHEW1878a-26 | Matthew 26 | | | free |
| sia18760800_bibleMATTHEW1878a-27 | Matthew 27 | | | free |
| sia18760800_bibleMATTHEW1878a-28 | Matthew 28 | | | free |
| sia18760800_genetz1879a-23 | Matthew 23 | | | free |
| sia18760800_genetz1879a-24 | Matthew 24 | | | free |
| sia18760800_genetz1879a-25 | Matthew 25 | | | free |
| sia18760800_genetz1879a-26 | Matthew 26 | | | free |
| sia18760800_genetz1879a-27 | Matthew 27 | | | free |
| sia18760800_genetz1879a-28 | Matthew 28 | | | free |
| sjd18280000_sjogren1828a-LordsPrayer | Lord's Prayer | 59 | finished | free |
| sjd18760800_bibleMATTHEW1878a-00Title | Imprint | 2 | finished | free |
| sjd18760800_bibleMATTHEW1878a-00Titlepage | Imprint | 19 | finished | free |
| sjd18760800_bibleMATTHEW1878a-01 | Matthew 1 | 323 | finished | free |
| sjd18760800_bibleMATTHEW1878a-02 | Matthew 2 | 444 | finished | free |
| sjd18760800_bibleMATTHEW1878a-03 | Matthew 3 | 281 | finished | free |
| sjd18760800_bibleMATTHEW1878a-04 | Matthew 4 | 404 | finished | free |
| sjd18760800_bibleMATTHEW1878a-05 | Matthew 5 | 822 | finished | free |
| sjd18760800_bibleMATTHEW1878a-06 | Matthew 6 | 664 | finished | free |
| sjd18760800_bibleMATTHEW1878a-07 | Matthew 7 | 497 | finished | free |
| sjd18760800_bibleMATTHEW1878a-08 | Matthew 8 | 546 | finished | free |
| sjd18760800_bibleMATTHEW1878a-09 | Matthew 9 | 639 | finished | free |
| sjd18760800_bibleMATTHEW1878a-10 | Matthew 10 | 679 | finished | free |
| sjd18760800_bibleMATTHEW1878a-11 | Matthew 11 | 517 | finished | free |
| sjd18760800_bibleMATTHEW1878a-12 | Matthew 12 | 847 | finished | free |
| sjd18760800_bibleMATTHEW1878a-13 | Matthew 13 | 1002 | finished | free |
| sjd18760800_bibleMATTHEW1878a-14 | Matthew 14 | 528 | finished | free |
| sjd18760800_bibleMATTHEW1878a-15 | Matthew 15 | 583 | finished | free |
| sjd18760800_bibleMATTHEW1878a-16 | Matthew 16 | 517 | finished | free |
| sjd18760800_bibleMATTHEW1878a-17 | Matthew 17 | 483 | finished | free |
| sjd18760800_bibleMATTHEW1878a-18 | Matthew 18 | 639 | finished | free |
| sjd18760800_bibleMATTHEW1878a-19 | Matthew 19 | 556 | finished | free |
| sjd18760800_bibleMATTHEW1878a-20 | Matthew 20 | 579 | finished | free |
| sjd18760800_bibleMATTHEW1878a-21 | Matthew 21 | 822 | finished | free |
| sjd18760800_bibleMATTHEW1878a-22 | Matthew 22 | 625 | finished | free |
| sjd18760800_genetz1879a-01 | Matthew 1 | 328 | finished | free |
| sjd18760800_genetz1879a-05 | Matthew 5 | 825 | finished | free |
| sjd18760800_genetz1879a-07 | Matthew 7 | 498 | finished | free |
| sjd18760800_genetz1879b-01 | Matthew 1 | 328 | finished | free |
| sjd18760800_genetz1879b-02 | Matthew 2 | 450 | finished | free |
| sjd18760800_genetz1879b-05 | Matthew 5 | 825 | finished | free |
| sjd18760800_genetz1879b-07 | Matthew 7 | 498 | finished | free |
| sjd18760800_genetz1891a-kilb05 | Matthew 5 | 824 | finished | free |
| sjd18760800_genetz1891a-kilb06 | Matthew 6 | 633 | finished | free |
| sjd18760800_genetz1891a-kilb07 | Matthew 7 | 497 | finished | free |
| sjd19960000_arapovic1996a-00Content | Imprint | 152 | finished | free |
| sjd19960000_arapovic1996a-00Front | Imprint | 3 | finished | free |
| sjd19960000_arapovic1996a-00Imprint | Imprint | 80 | finished | free |
| sjd19960000_arapovic1996a-00Title | Imprint | 13 | finished | free |
| sjd19960000_arapovic1996a-02Annunciation | Jesus | 260 | finished | acad |
| sjd19960000_arapovic1996a-04Birth | Jesus | 191 | finished | acad |
| sjd19960000_arapovic1996a-06Shepherds | Jesus | 240 | finished | acad |
| sjd19960000_arapovic1996a-08Adoration | Jesus | 251 | finished | acad |
| sjd19960000_arapovic1996a-10Exile | Jesus | 275 | finished | acad |
| sjd19960000_arapovic1996a-12Disciplines | Jesus | 253 | finished | acad |
| sjd19960000_arapovic1996a-14Samaritan | Jesus | 289 | finished | acad |
| sjd19960000_arapovic1996a-16Sermon-Beatitudes | Jesus | 259 | finished | acad |
| sjd19960000_arapovic1996a-19LordsPrayer | Jesus | 63 | finished | acad |
| sjd19960000_arapovic1996a-20Sower | Jesus | 214 | finished | acad |
| sjd19960000_arapovic1996a-22Storm | Jesus | 132 | finished | acad |
| sjd19960000_arapovic1996a-24Raising | Jesus | 282 | finished | acad |
| sjd19960000_arapovic1996a-24Raising-Fn | Jesus | 10 | finished | acad |
| sjd19960000_arapovic1996a-26Feeding | Jesus | 196 | finished | acad |
| sjd19960000_arapovic1996a-28GoodSamaritan | Jesus | 227 | finished | acad |
| sjd19960000_arapovic1996a-28GoodSamaritan-Fn | Jesus | 6 | finished | acad |
| sjd19960000_arapovic1996a-30GoodShepherd | Jesus | 287 | finished | acad |
| sjd19960000_arapovic1996a-32LostSheep | Jesus | 144 | finished | acad |
| sjd19960000_arapovic1996a-34Prodigal | Jesus | 150 | finished | acad |
| sjd19960000_arapovic1996a-36Return | Jesus | 234 | finished | acad |
| sjd19960000_arapovic1996a-38Blesses | Jesus | 122 | finished | acad |
| sjd19960000_arapovic1996a-40Blind | Jesus | 165 | finished | acad |
| sjd19960000_arapovic1996a-42Zacchaeus | Jesus | 197 | finished | acad |
| sjd19960000_arapovic1996a-44Commandment | Jesus | 114 | finished | acad |
| sjd19960000_arapovic1996a-46Supper | Jesus | 164 | finished | acad |
| sjd19960000_arapovic1996a-48Gethsemane | Jesus | 245 | finished | acad |
| sjd19960000_arapovic1996a-50Arrest | Jesus | 208 | finished | acad |
| sjd19960000_arapovic1996a-52Pilatus | Jesus | 310 | finished | acad |
| sjd19960000_arapovic1996a-54Crucifixion | Jesus | 351 | finished | acad |
| sjd19960000_arapovic1996a-56Burial | Jesus | 266 | finished | acad |
| sjd19960000_arapovic1996a-56Burial-Fn | Jesus | 14 | finished | acad |
| sjd19960000_bibleLORDMATTHEW1996a | Lord's Prayer | 63 | finished | fair |
| sjd20050300_saam2005-03Prayer | Prayer | 70 | finished | fair |
| sjd20080000_ermolaeva2010a-00Title | Imprint | 5 | finished | free |
| sjd20080000_ermolaeva2010a-01 | Matthew 1 | 322 | finished | fair |
| sjd20141008_bibleLORDMATTH2022b | Lord's Prayer | 58 | finished | fair |
| sjd20220300_bibleAPOSTLES2022a | Apostles' Creed | 71 | finished | fair |
| sjd20220300_bibleLORDMATTH2022a | Lord's Prayer | 60 | finished | fair |
| sjd20220300_bibleMATTHEW2022a-00Imprint | Imprint | 56 | finished | free |
| sjd20220300_bibleMATTHEW2022a-00Title | Imprint | 10 | finished | free |
| sjd20220300_bibleMATTHEW2022a-01 | Matthew 1 | 315 | finished | bound |
| sjd20220300_bibleMATTHEW2022a-02 | Matthew 2 | 392 | finished | bound |
| sjd20220300_bibleMATTHEW2022a-03 | Matthew 3 | 269 | finished | bound |
| sjd20220300_bibleMATTHEW2022a-04 | Matthew 4 | 385 | finished | bound |
| sjd20220300_bibleMATTHEW2022a-05 | Matthew 5 | 771 | finished | bound |
| sjd20220300_bibleMATTHEW2022a-06 | Matthew 6 | 619 | finished | bound |
| sjd20220300_bibleMATTHEW2022a-07 | Matthew 7 | | | bound |
| sjd20220300_bibleMATTHEW2022a-08 | Matthew 8 | | | bound |
| sjd20220300_bibleMATTHEW2022a-09 | Matthew 9 | | | bound |
| sjd20220300_bibleMATTHEW2022a-10 | Matthew 10 | | | bound |
| sjd20220300_bibleMATTHEW2022a-11 | Matthew 11 | | | bound |
| sjd20220300_bibleMATTHEW2022a-12 | Matthew 12 | | | bound |
| sjd20220300_bibleMATTHEW2022a-13 | Matthew 13 | | | bound |
| sjd20220300_bibleMATTHEW2022a-14 | Matthew 14 | | | bound |
| sjd20220300_bibleMATTHEW2022a-15 | Matthew 15 | | | bound |
| sjd20220300_bibleMATTHEW2022a-16 | Matthew 16 | 464 | finished | bound |
| sjd20220300_bibleMATTHEW2022a-16L2 | Matthew 16:22-28 | 127 | finished | fair |
| sjd20220300_bibleMATTHEW2022a-17 | Matthew 17 | | | bound |
| sjd20220300_bibleMATTHEW2022a-18 | Matthew 18 | | | bound |
| sjd20220300_bibleMATTHEW2022a-19 | Matthew 19 | | | bound |
| sjd20220300_bibleMATTHEW2022a-20 | Matthew 20 | | | bound |
| sjd20220300_bibleMATTHEW2022a-21 | Matthew 21 | | | bound |
| sjd20220300_bibleMATTHEW2022a-22 | Matthew 22 | | | bound |
| sjd20220300_bibleMATTHEW2022a-23 | Matthew 23 | | | bound |
| sjd20220300_bibleMATTHEW2022a-24 | Matthew 24 | | | bound |
| sjd20220300_bibleMATTHEW2022a-25 | Matthew 25 | | | bound |
| sjd20220300_bibleMATTHEW2022a-26 | Matthew 26 | | | bound |
| sjd20220300_bibleMATTHEW2022a-27 | Matthew 27 | 1225 | finished | bound |
| sjd20220300_bibleMATTHEW2022a-28 | Matthew 28 | 306 | finished | bound |
| sjd20220300_scheller2022a | Preface | 559 | finished | bound |

Session

  • Points to the file name
    • The language is indicated by its ISO 639-3 code in the first three letters of the file name.
    • The second section of the file name indicates the (known or estimated) date of origin, at least the year.
    • The third section (after underscore) includes a pointer to the Bibliography, where the original text sources are listed.
    • The fourth section (if applicable, after hyphen) indicates sub-sets (e.g. for footnotes or Matthew 16:22-28).
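The naming scheme above can be unpacked with a regular expression. This is a sketch under stated assumptions: the date section is taken to be eight digits in the form YYYYMMDD with "00" for an unknown month or day (inferred from the names in the table, not documented), and the subset is taken to be everything after the first hyphen. `SessionName` and `parse_session` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class SessionName(NamedTuple):
    language: str          # three-letter language code
    year: int              # first four digits of the date section
    date: str              # full eight-digit date section, as written
    source: str            # pointer to the Bibliography
    subset: Optional[str]  # part after the first hyphen, if any


# Assumed layout: <lang:3 letters><date:8 digits>_<source>[-<subset>]
_SESSION = re.compile(
    r"^(?P<language>[a-z]{3})(?P<date>\d{8})_(?P<source>[^-]+)(?:-(?P<subset>.+))?$"
)


def parse_session(name: str) -> SessionName:
    """Split a session name into its documented sections."""
    match = _SESSION.match(name)
    if match is None:
        raise ValueError(f"unexpected session name: {name!r}")
    date = match.group("date")
    return SessionName(
        language=match.group("language"),
        year=int(date[:4]),
        date=date,
        source=match.group("source"),
        subset=match.group("subset"),
    )


if __name__ == "__main__":
    print(parse_session("sjd19960000_arapovic1996a-16Sermon-Beatitudes"))
    print(parse_session("sjd20220300_scheller2022a"))
```

Splitting on the first hyphen only keeps subsets such as "16Sermon-Beatitudes" or "24Raising-Fn" in one piece.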

Description

  • See more metadata in the respective CMDI files (work in progress).

Tokens

  • Simple word token count
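How the tokens are delimited is not specified here. As one possible reading, the sketch below counts whitespace-separated strings, so punctuation attached to a word is not counted separately; the project's own counts may follow a different rule. `count_tokens` is a hypothetical helper.

```python
from typing import Iterable


def count_tokens(lines: Iterable[str]) -> int:
    """Count whitespace-separated tokens across all lines (assumed rule)."""
    return sum(len(line.split()) for line in lines)


if __name__ == "__main__":
    # Placeholder strings, not corpus text.
    print(count_tokens(["one two three", "", "four  five"]))
```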

Status

  • Data marked as "finished" includes data in the minimal tier set (ref, [ref(orig)], orth(orig), [ft-rus(orig)]; read the paper for more detail).

Use

  • Data marked as "bound" must not be re-published or shared in any way outside this project because copyright applies. These data can, however, be analysed and processed by project collaborators.
  • Data marked as "acad" are under copyright and not made freely available via the present repository. They can, however, be used under an academic license, which is handled by the Language Bank of Finland.
  • Data marked as "fair" are technically also under copyright. In our opinion, however, the nature of these texts (especially the Lord's Prayer), their earlier distribution, or other reasons allow their use as open corpus data in research.
  • Data marked as "free" are either in the public domain (old publications) or do not meet the threshold of originality (book imprints, title pages, etc.).
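Token counts can be tallied per "Use" category with a few lines of code. The sketch below uses five rows copied from the overview table as sample input, with `None` standing for an empty "Tokens" cell; `ROWS` and `tokens_by_use` are hypothetical names, and the totals cover only these sample rows, not the whole corpus.

```python
from collections import defaultdict
from typing import Optional

# (session, tokens, use) for a handful of rows from the overview table.
ROWS: list[tuple[str, Optional[int], str]] = [
    ("sjd18280000_sjogren1828a-LordsPrayer", 59, "free"),
    ("sjd19960000_arapovic1996a-22Storm", 132, "acad"),
    ("sjd20050300_saam2005-03Prayer", 70, "fair"),
    ("sjd20220300_bibleMATTHEW2022a-28", 306, "bound"),
    ("sjd20220300_bibleMATTHEW2022a-07", None, "bound"),  # not counted yet
]


def tokens_by_use(rows: list[tuple[str, Optional[int], str]]) -> dict[str, int]:
    """Sum token counts per use category; empty counts contribute 0."""
    totals: dict[str, int] = defaultdict(int)
    for _session, tokens, use in rows:
        totals[use] += tokens or 0
    return dict(totals)


if __name__ == "__main__":
    for use, total in tokens_by_use(ROWS).items():
        print(f"{use}: {total}")
```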

TODO

Planned improvements in the data

  • finish the corpus

Planned improvements at the platform level

  • Website
  • Pipeline: sjd-fair --> KSCTC
  • validation scripts
  • dynamic corpus statistics
  • metadata links to Wikidata

Questions

  • What to do with the letter η in Cyrillic text (Genetz): encode it as a Greek or a Latin character?