katrinortmann/classig-pipeline


CLASSIG Pipeline (Computational Linguistic Analysis of Syntactic Structures In German)

This repository contains the code and documentation for the CLASSIG pipeline. The pipeline can be used to automatically annotate, evaluate, and analyze syntactic structures in modern and historical German texts.

The pipeline was created for my dissertation "Computational Methods for Investigating Syntactic Change: Automatic Identification of Extraposition in Modern and Historical German". It is accompanied by data sets, results, and models in the CLASSIG Data repository.

In this repository, the following resources are provided:

  • The models folder contains parser models for automatic annotation. Due to space constraints, the chunker models cannot be stored in this repository. If you want to use the chunker, the models are available for download at Zenodo.
  • The src folder contains all scripts to use the pipeline. That includes modified versions of the C6C Converter Pipeline, the COAST implementation for orality analysis, the NCRF++ Chunker, the Java-based Berkeley Parser for topological field parsing and constituency analysis, FairEval for evaluation, and the code to run the pipeline.
  • The R folder contains the scripts used to produce the plots and statistics in the thesis.
  • An example configuration file is provided in the config folder.

Contents of this Documentation

  1. Requirements
  2. Basic Usage
  3. Configuration
  4. Supported Functions
    4.1 Annotate
    4.2 Evaluate
    4.3 Calculate Data Statistics
    4.4 Create Language Models
    4.5 Create a Variant Corpus
    4.6 Surprisal Calculation
    4.7 DORM
    4.8 Orality Analysis
    4.9 RelC Analysis
    4.10 Create Tables
  5. Models
  6. License
  7. Acknowledgement

Requirements

  • Python 3 to execute the scripts in the src folder
  • R to execute the scripts in the R folder

To use the NCRF++ chunker, the required Python packages must be installed (torch, numpy, etc.).

To invoke the Berkeley parser, you must be able to run Java code, i.e., a Java runtime must be installed.

Basic Usage

The pipeline is called via the command line with a configuration file that specifies the required parameters:

py CLASSIG.py --config ./../config/example.config

Configuration

The configuration file consists of key-value pairs of the form key = value, with one pair per line.
Multiple values for one key are separated by commas, e.g., annotations = chunks, phrases.

Empty lines and lines starting with a hash sign # are ignored.
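The format described above can be read with a few lines of Python. The following is a minimal sketch, not the parser used in CLASSIG.py; the function name parse_config and the example values are hypothetical, and skipping lines without an equals sign is an assumption.

```python
def parse_config(text):
    """Parse 'key = value' lines into a dict mapping each key to a list of values."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip empty lines and comment lines starting with '#'.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # Assumption: lines without '=' are silently skipped.
        # Comma-separated values become a list; surrounding spaces are dropped.
        config[key.strip()] = [v.strip() for v in value.split(",") if v.strip()]
    return config


# Hypothetical configuration text for illustration only.
example = """
# Annotate and evaluate one corpus
action = annotate, evaluate
annotations = chunks, phrases
corpus = MyCorpus
"""

config = parse_config(example)
print(config["annotations"])  # ['chunks', 'phrases']
```

Storing every value as a list keeps single and multiple values uniform, so later code does not need to distinguish the two cases.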

The following keys are recognized:

Key | Function
--- | ---
action | Determines what the pipeline will do, see supported actions.
annotations | If action includes the value annotate, the given annotations are created. all will create all supported annotations.
models | If action includes the value annotate, the given models are used for annotation. all will apply all models.
format_gold | Format of the gold data (conllup or conll2000).
format_in | The input format (conllup or conll2000). The output format is always conllup.
corpus | Name of the corpus to analyze.
in_dir | Folder with input files.
out_dir | Folder where system annotations will be stored.
gold_dir | Folder with gold standard data.
eval_dir | Folder where evaluation results will be stored.
variant_dir | Folder in which the variant corpus should be generated.
train_dir | Folder with training files for language model creation.
lm_dir | Folder where language models are stored.
lm_models_n | N-gram size of language models (1 and/or 2).
lm_models | Type of language models (FORM, LEMMA, WORD, XPOS).
model_dir | Folder with parser/chunker models (default: ./../models/).
norm | Input column to use as normalization (e.g., NORM; if None, defaults to FORM).

The config folder contains an example configuration file, which would execute all possible actions with all supported models on an example corpus.

Actions

The pipeline can be used for different purposes. The following actions are currently supported:

Action | Function
--- | ---
annotate | Creates the annotations listed under annotations with the models listed under models.
evaluate | Evaluates the annotations listed under annotations for each model listed under models. Annotations in out_dir are compared to those in gold_dir with traditional and fair evaluation.
data_stats | Counts documents, sentences, tokens, words, and labels for each annotation in the given corpus.
create_lm | Creates n-gram language model(s) with size lm_models_n. Models are generated for each annotation given in lm_models.
variants | Generates a variant corpus by undoing the extraposition of relative clauses.
surprisal | Calculates n-gram surprisal for original and variant sentences with the given language models. Also calculates mean surprisal for each RelC in the original data.
dorm | Calculates DORM from n-gram surprisal values of original and variant files.
orality | Calculates orality scores with COAST.
analyze_relcs | Collects length, position, and additional information for each annotated relative clause.
tables | Creates LaTeX tables and input for scripts in the R folder from evaluate and data_stats results.
all | Executes everything listed above.

The actions are performed in the given order, so they can build on each other (i.e., the data is annotated before evaluation, etc.).

Different parameters must be set depending on the desired action:

Action | Input from | Output to | Parameters
--- | --- | --- | ---
annotate | in_dir | out_dir | annotations, models, corpus, format_in, model_dir, res_dir, tagger, norm
evaluate | out_dir, gold_dir | eval_dir | annotations, models, corpus, format_in, format_gold
data_stats | gold_dir | eval_dir | annotations, corpus, format_gold
create_lm | train_dir | lm_dir | corpus, lm_models, lm_models_n, format_gold, norm
variants | gold_dir | variant_dir | corpus, format_gold
surprisal | gold_dir, variant_dir, lm_dir | out_dir, eval_dir | corpus, lm_models, lm_models_n, format_gold, norm
dorm | out_dir | eval_dir | corpus, format_gold, models
orality | gold_dir | eval_dir | corpus, format_gold
analyze_relcs | gold_dir | eval_dir | corpus, format_gold
tables | eval_dir | eval_dir | annotations

If you do not have gold data, set gold_dir to the folder where your automatic annotations are stored (usually a sub-directory of out_dir).
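The table of directories and parameters can serve as a simple check before a run. The sketch below is not part of the pipeline: the names REQUIRED_KEYS and missing_keys are hypothetical, and it assumes that every listed key must be present in the configuration, although some keys (such as model_dir) have defaults.

```python
# Directories and parameters per action, copied from the table above.
# Assumption: every listed key must be set, even where the pipeline has defaults.
REQUIRED_KEYS = {
    "annotate": ["in_dir", "out_dir", "annotations", "models", "corpus",
                 "format_in", "model_dir", "res_dir", "tagger", "norm"],
    "evaluate": ["out_dir", "gold_dir", "eval_dir", "annotations", "models",
                 "corpus", "format_in", "format_gold"],
    "data_stats": ["gold_dir", "eval_dir", "annotations", "corpus", "format_gold"],
    "create_lm": ["train_dir", "lm_dir", "corpus", "lm_models", "lm_models_n",
                  "format_gold", "norm"],
    "variants": ["gold_dir", "variant_dir", "corpus", "format_gold"],
    "surprisal": ["gold_dir", "variant_dir", "lm_dir", "out_dir", "eval_dir",
                  "corpus", "lm_models", "lm_models_n", "format_gold", "norm"],
    "dorm": ["out_dir", "eval_dir", "corpus", "format_gold", "models"],
    "orality": ["gold_dir", "eval_dir", "corpus", "format_gold"],
    "analyze_relcs": ["gold_dir", "eval_dir", "corpus", "format_gold"],
    "tables": ["eval_dir", "annotations"],
}


def missing_keys(config):
    """Return {action: [keys that are not set]} for the requested actions."""
    missing = {}
    # Actions not in the table (e.g. "all") are not checked by this sketch.
    for action in config.get("action", []):
        absent = [k for k in REQUIRED_KEYS.get(action, []) if k not in config]
        if absent:
            missing[action] = absent
    return missing


# Hypothetical configuration that lacks two keys needed for 'variants'.
config = {"action": ["variants"], "gold_dir": "./gold", "corpus": "MyCorpus"}
print(missing_keys(config))  # {'variants': ['variant_dir', 'format_gold']}
```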


Annotate

The action annotate creates all annotations listed under annotations with the models listed under models.
As input, it takes the data from in_dir and stores the annotated data in out_dir.

The following annotations are currently supported with the specified models:

Value | Annotation | Tool | Models
--- | --- | --- | ---
brackets | Sentence brackets | Berkeley parser | Punct, News1
topf | Topological fields | Berkeley parser | Punct, News1
chunks | Chunks | NCRF++ | News1, News2, Hist, Mix
phrases | Phrases | Berkeley parser | News1, News2, Hist, Mix
extrap | Extraposition candidates (including relative clauses and their antecedents) | Berkeley parser | News1, News2, Hist, Mix (always uses Punct, too)

A documentation of tagsets and output formats can be found in the CLASSIG Data repository.

Additional remarks:
During annotation, the Berkeley parser often outputs "ROOT has more than one child!" This message can be ignored.


Evaluate

The action evaluate applies traditional and fair evaluation to the annotations listed under annotations for each model listed under models. Annotations in out_dir are compared to those in gold_dir and the results are stored in eval_dir.

The action outputs the numbers of true positives and error types and the calculated metrics (precision, recall, F-score) for the given corpus. Results are created for individual files and labels and overall. Fair evaluation is performed with FairEval.
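For orientation, the traditional metrics can be derived from the counts as sketched below. This is only an illustration of the standard definitions; the function name is hypothetical, and the sketch does not reproduce how FairEval treats the individual error types.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute traditional precision, recall, and F1 from raw counts."""
    # Guard against division by zero when nothing was predicted or expected.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 2 false negatives.
print(precision_recall_f1(8, 2, 2))
```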

The output can be used by the tables action to produce LaTeX tables and input for plotting and statistics with R.


Data Statistics

The action data_stats counts documents, sentences, tokens, words, and labels for each annotation in the given corpus from gold_dir. Statistics are stored in eval_dir and can be used by the tables action to produce input for plotting with R.


Create Language Models

The action create_lm creates n-gram language model(s) with n of size lm_models_n, e.g., 2. Models are generated for each annotation given in lm_models, e.g., FORM, XPOS.

Training data is taken from train_dir and the models are stored in lm_dir. Models are named after n-gram size and annotation, e.g., 1-gram_XPOS for a unigram model based on POS tags. The model files contain two tab-separated columns with n-gram and frequency. For n > 1, the elements of an n-gram (first column) are separated by spaces. #S and #E are used as padding elements.
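The model file layout can be illustrated with a short sketch. It is not the code used by create_lm; the function names are hypothetical, and padding each sentence with n-1 start and n-1 end symbols is an assumption about how #S and #E are applied.

```python
from collections import Counter


def count_ngrams(sentences, n):
    """Count n-grams over tokenized sentences, padded with #S and #E."""
    counts = Counter()
    for tokens in sentences:
        # Assumption: n-1 padding symbols on each side (none for unigrams).
        padded = ["#S"] * (n - 1) + list(tokens) + ["#E"] * (n - 1)
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts


def format_model(counts):
    """Render counts as lines of 'n-gram<TAB>frequency', elements joined by spaces."""
    return "\n".join(f"{' '.join(ngram)}\t{freq}"
                     for ngram, freq in sorted(counts.items()))


# Hypothetical training data with two short sentences.
sentences = [["der", "Hund", "bellt"], ["der", "Hund", "schläft"]]
bigrams = count_ngrams(sentences, 2)
print(format_model(bigrams))
```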

The models from the thesis can be downloaded from Zenodo.


Variant Corpus

The action variants generates a variant corpus by undoing the extraposition of relative clauses.

For each input file from gold_dir, a variant file with the same name is created in variant_dir. The variant file contains all sentences from the original file, but extraposed relative clauses (labeled as RELC-extrap) are moved adjacent to their antecedent and re-labeled as RELC-insitu. Tokens are re-indexed and the sentence attribute #text is regenerated. If relative clauses were preceded by punctuation or coordination, those are moved, too.
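The movement operation can be pictured with a heavily simplified sketch. It works on a plain token list rather than on the pipeline's file format, the function name and index arguments are hypothetical, and it only handles a preceding punctuation mark, not coordination or re-labeling.

```python
def undo_extraposition(tokens, relc_start, relc_end, antecedent_end):
    """Move the span [relc_start, relc_end) directly behind the antecedent.

    antecedent_end is the index right after the antecedent's last token.
    Returns re-indexed (index, form) pairs, with indices starting at 1.
    """
    start = relc_start
    # Assumption: a single punctuation token before the clause moves with it.
    if start > antecedent_end and tokens[start - 1] in {",", ";", ":"}:
        start -= 1
    moved = tokens[start:relc_end]
    rest = tokens[:start] + tokens[relc_end:]
    variant = rest[:antecedent_end] + moved + rest[antecedent_end:]
    return list(enumerate(variant, start=1))


# Hypothetical sentence with an extraposed relative clause (tokens 6 to 10).
tokens = ["Er", "hat", "das", "Buch", "gelesen", ",",
          "das", "ich", "ihm", "empfohlen", "habe", "."]
variant = undo_extraposition(tokens, 6, 11, 4)
# Regenerate the sentence text from the re-indexed tokens.
print(" ".join(form for _, form in variant))
# Er hat das Buch , das ich ihm empfohlen habe gelesen .
```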


Surprisal Calculation

The action surprisal calculates n-gram surprisal for original (gold_dir) and variant sentences (variant_dir) with the given language models. Currently, the action always calculates unigram and bigram surprisal with models for the annotations listed in lm_models. The output is stored in out_dir with surprisal values in columns 'UnigramSurpr' and 'BigramSurpr' followed by the annotation name, e.g., BigramSurprXPOS.

The action also calculates mean surprisal for each RelC in the original data. The results are output to eval_dir.
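As a rough illustration, surprisal is the negative base-2 logarithm of an n-gram probability, and the RelC value is the mean over the tokens of the clause. The sketch below is not the pipeline's implementation: the function names are hypothetical, and add-one smoothing is used here only as an assumption to keep unseen bigrams finite.

```python
import math


def bigram_surprisal(tokens, unigram_counts, bigram_counts):
    """Return -log2 P(token | previous token) for each token of a sentence."""
    vocab_size = len(unigram_counts)
    padded = ["#S"] + list(tokens)
    values = []
    for prev, cur in zip(padded, padded[1:]):
        # Assumption: add-one smoothing; the pipeline may estimate differently.
        numerator = bigram_counts.get((prev, cur), 0) + 1
        denominator = unigram_counts.get(prev, 0) + vocab_size
        values.append(-math.log2(numerator / denominator))
    return values


def mean_surprisal(values, start, end):
    """Mean surprisal over the token span [start, end), e.g. a relative clause."""
    span = values[start:end]
    return sum(span) / len(span)


# Hypothetical frequency tables in the style of the n-gram model files.
unigram_counts = {"#S": 2, "der": 2, "Hund": 2}
bigram_counts = {("#S", "der"): 2, ("der", "Hund"): 2}
values = bigram_surprisal(["der", "Hund"], unigram_counts, bigram_counts)
print(values, mean_surprisal(values, 0, 2))
```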

Caution: Every time the action is performed, the output of the RelC surprisal analysis is appended to the result file! This makes it possible to collect the results from different corpora in the same file. If you want to recalculate the results, you must use another output file or remove the old results beforehand.


DORM

The action dorm calculates DORM from n-gram surprisal values of original and variant files (taken from the output of the surprisal action in out_dir). Calculations are performed for word form and POS bigram surprisal and stored in two output files for token-based and constituent-based analysis in eval_dir.

To determine constituents, the files must contain a constituency analysis (PTBstring) and the model must be specified under models. For TüBa-style trees, the model News1 should be specified. For Tiger-style trees, any other model can be given. If the model cannot be determined, only a token-based analysis is performed.

Caution: Every time the action is performed, the output is appended to the result file! This makes it possible to collect the results from different corpora in the same file. If you want to recalculate the results, you must use another output file or remove the old results beforehand.


Orality Analysis

The action orality calculates orality scores with the integrated version of COAST.

For the given input files from gold_dir, two files are created in eval_dir: one with the raw feature values (_results.csv) and one with scaled results and the orality score for each text (_results_scaled.csv).

Files are expected to contain the lemmas required by the orality analysis.


RelC Analysis

The action analyze_relcs collects length (in words), position (insitu/ambig/extrap), the distance to the antecedent (in words) and the distance to the end of the sentence (in words) for each annotated relative clause in the input data (gold_dir). The results are output to eval_dir.
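The measures can be sketched on a plain token list as follows. This is an illustration, not the code behind analyze_relcs: the function names and index arguments are hypothetical, the position label is assumed to come from the annotation, and counting a token as a word when it contains a letter or digit is an assumption.

```python
def count_words(span):
    """Count tokens that contain at least one letter or digit."""
    # Assumption: pure punctuation tokens do not count as words.
    return sum(1 for token in span if any(ch.isalnum() for ch in token))


def relc_measures(tokens, relc_start, relc_end, antecedent_end):
    """Length and distances (in words) for the clause span [relc_start, relc_end)."""
    return {
        "length": count_words(tokens[relc_start:relc_end]),
        "dist_antecedent": count_words(tokens[antecedent_end:relc_start]),
        "dist_sentence_end": count_words(tokens[relc_end:]),
    }


# Hypothetical sentence; the antecedent "das Buch" ends before index 4.
tokens = ["Er", "hat", "das", "Buch", "gelesen", ",",
          "das", "ich", "ihm", "empfohlen", "habe", "."]
print(relc_measures(tokens, 6, 11, 4))
# {'length': 5, 'dist_antecedent': 1, 'dist_sentence_end': 0}
```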

Caution: Every time the action is performed, the output is appended to the result file! This makes it possible to collect the results from different corpora in the same file. If you want to recalculate the results, you must use another output file or remove the old results beforehand.


Tables

The action tables creates LaTeX tables and input for scripts in the R folder from the results of the actions evaluate and data_stats in the eval_dir. The output is stored in eval_dir, too.

Models

The models folder contains parser models for automatic annotation. For usage with the pipeline, the models must be placed in the specified model_dir.
The following models are available:

Constituency grammars:

  • News1 : constituency_grammars/grammar_tueba_simple.gr
  • News2 : constituency_grammars/grammar_tiger_simple.gr
  • Hist : constituency_grammars/grammar_hist_simple.gr
  • Mix : constituency_grammars/grammar_mix_simple.gr

Chunker models (Zenodo):

  • News1 : ncrfpp/lstmcrf_tueba_pos_pre-trained
  • News2 : ncrfpp/lstmcrf_tiger_pos_pre-trained
  • Hist : ncrfpp/lstmcrf_hist_pos_pre-trained
  • Mix : ncrfpp/lstmcrf_tigerxml_pos_pre-trained

Due to space constraints, the chunker models cannot be stored in this repository. If you want to use the chunker, all models are available for download at Zenodo.

Topological field grammars:

  • Punct : topfgrammars/topfgrammar_punct.gr
  • NoPunct : topfgrammars/topfgrammar_nopunct.gr (Currently not supported by the pipeline.)

R Code

The R folder contains the scripts used to produce the plots and statistics in the thesis.

License

  • The Berkeley parser is licensed under GPL 2.0
  • NCRF++ is licensed under Apache 2.0
  • COAST, C6C, and the remaining code are provided under the MIT license

Acknowledgement

If you use the pipeline, data, or models in your work, please cite:

  • Ortmann, Katrin. 2023. Computational Methods for Investigating Syntactic Change: Automatic Identification of Extraposition in Modern and Historical German. Bochumer Linguistische Arbeitsberichte (BLA), Vol. 25.
