zs2021

This dataset accompanies the following research paper:

Sandra Waldenberger, Stefanie Dipper, and Ilka Lemke (2021). Towards a broad-coverage graphemic analysis of large historical corpora. Zeitschrift für Sprachwissenschaft, Special Issue, 40(3), pp. 401–420. PDF

Tool settings

Norma has been used with default settings, except for WLD's train_ngrams=4 (note that input files for Norma's WLD component have to be preprocessed in that initial and final '#' must be added).

Datasets

The diatopic dataset consists of texts (charters) from the same time period, from different regions.
The diachronic dataset consists of texts from two time spans (13.2 = 1250-1300, 14.1. = 1300-1350), from mainly two different regions (schwaeb/alem and rip).

Type	Text	Time Period	Language Area
diatopic	M345	14.1	alem
	M347	14.1	alem
	M348	14.1	thuer
	M350	14.1	rip
	M351	14.1	mbair
	M352	14.1	rhfrk
	M353	14.1	ofrk
diachronic	M344	13.2	schwaeb
	M346	13.2	alem
	M349	13.2	rip
	M345	14.1	alem
	M347	14.1	alem
	M350	14.1	rip

Directories

orig_REM: contains the original XML files from the REM corpus
training: contains pairs of <wofo1,wofo2>, where wofo1 is from one variety and wofo2 from another variety, and both normalize to the same standardized form in REM. This is the input for Norma.
models: contains the replacement rules and WLD mappings learned by Norma. 'E' means the empty string, '#' denotes the word boundary
1. files *.rules.noident: only contains non-identity rules (i.e. rules like {v->v/#_z} are filtered out); sorted according to their frequency (highest first)
2. files *.wld.sorted: sorted according their weights (lowest first)
3. files *.char_align.enriched: these are actually not models but pairs of <wofo1,wofo2> as in training, enriched with character-based alignments
similarity: contains the union sets used for calculating similarity scores between individual texts

Scripts

similarity.R: calculates pairwise similarity between texts

Acknowledgements

The development of this software was supported by Deutsche Forschungsgemeinschaft (DFG), Grants Nos. 1558/1, 1558/4 and 1558/5 (project IDs 89085660, 179943363, 200609649).

License

All software (in the datasets/ and scripts/ directories) is provided under CC BY 4.0.

The original REM data is licensed under CC BY-SA 4.0 and provided by Klein et al. (2016).

Klein, Thomas; Wegera, Klaus-Peter; Dipper, Stefanie; Wich-Reif, Claudia (2016). Referenzkorpus Mittelhochdeutsch (1050–1350), Version 1.0, https://www.linguistics.ruhr-uni-bochum.de/rem/. ISLRN 332-536-136-099-5.

For questions or problems, feel free to file a GitHub issue, or contact me directly:

Stefanie Dipper ([email protected])

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
datasets		datasets
scripts		scripts
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

zs2021

Tool settings

Datasets

Directories

Scripts

Acknowledgements

License

About

Releases

Packages

Languages

rubcompling/zs2021

Folders and files

Latest commit

History

Repository files navigation

zs2021

Tool settings

Datasets

Directories

Scripts

Acknowledgements

License

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages