
rubcompling/C6C


C6C

C6C converts historical corpora from various input formats into common, standardized output formats. It also allows custom processing of the input data, e.g., mapping historical POS tagsets to the modern standard tagset.

General Structure

The pipeline takes one or more files in a given input format and imports them into document objects. These documents can then be modified by one or more processors before being exported in the same or a different format.
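The import, process, export flow described above can be illustrated with a minimal sketch. Every name in it (`Document`, `import_text`, `lowercase`, `drop_punctuation`, `export_one_token_per_line`, `run_pipeline`) is hypothetical and chosen only for illustration; none of them is taken from the C6C code base, and the real document objects are likely richer.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

# NOTE: all names here are hypothetical illustrations of the
# import -> process -> export idea, not C6C's actual classes.


@dataclass
class Document:
    """Toy document object: a name plus a flat list of tokens."""
    name: str
    tokens: List[str] = field(default_factory=list)


Processor = Callable[[Document], Document]


def import_text(name: str, raw: str) -> Document:
    """Toy importer: split plain text on whitespace."""
    return Document(name=name, tokens=raw.split())


def lowercase(doc: Document) -> Document:
    """Toy processor: lowercase every token."""
    return Document(doc.name, [t.lower() for t in doc.tokens])


def drop_punctuation(doc: Document) -> Document:
    """Toy processor: keep only tokens containing a letter or digit."""
    return Document(doc.name, [t for t in doc.tokens if any(c.isalnum() for c in t)])


def export_one_token_per_line(doc: Document) -> str:
    """Toy exporter: one token per line."""
    return "\n".join(doc.tokens) + "\n"


def run_pipeline(
    docs: Iterable[Document],
    processors: List[Processor],
    exporter: Callable[[Document], str],
) -> List[str]:
    """Apply the processors in the given order, then export each document."""
    outputs = []
    for doc in docs:
        for process in processors:
            doc = process(doc)
        outputs.append(exporter(doc))
    return outputs


result = run_pipeline(
    [import_text("demo", "Der Hund bellt .")],
    [lowercase, drop_punctuation],
    export_one_token_per_line,
)
```

The point of the sketch is that processors are applied in list order and each one receives the output of the previous one, so reordering them can change the exported result.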

Pipeline structure

Usage

Requirements

Command line usage

The pipeline is called via the command line:

py C6C.py convert -i input_format -e export_format -p "['processor_name', 'processor_name']" input_dir_or_file output_dir_or_file

  • input_dir_or_file: can be a single file or a folder
  • output_dir_or_file: can be a single file or a folder
  • input_format: the following input formats are currently supported (see the documentation below): text, tcfDTA, xmlDTA, tiger, tigerxml, mercuriustigerxml, conlluplus, conllu, conll2000, DTAtsv, tuebadz, annisgrid, webannotopf, webannotsv, coraxmlrem, coraxmlrefbo, coraxmlanselm, tuebadsconll, tuebatrees, ddbtigernegra, fuerstinnenexb, refup, germanc, sdewac, graphvar
  • export_format: the following export formats are currently supported (see the documentation below): conlluplus, conllu, DTAtsv, HIPKONtsv, text, pos, conll2000, ptb
  • processor_name: processors are called in the given order; the following processors are currently implemented (see the documentation below): dtachopper, dtasimplifier, hipkontostts, addmissingstts, topfsimplifier, satzklammertotopf, tsvindexer, hitstostts, tuebadstopf, anselmtostts, topfchopper, conllindexer, refhitstostts, depmanipulator, depprocessor, mercuriustostts, refuptostts, fuerstinnentostts, virgelmapper, pronominaladverb, refupcoding, bracketremover, treetobio
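When calling the pipeline from a script, the quoting of the `-p` argument is easy to get wrong. The following sketch assembles the argument list for the command shown above. The helper `build_convert_command` and the directory names `corpus_in` and `corpus_out` are hypothetical. The format and processor names are picked from the lists above purely as examples, and whether this particular combination is meaningful is an assumption.

```python
import ast
from typing import List, Sequence


def build_convert_command(
    input_format: str,
    export_format: str,
    processors: Sequence[str],
    source: str,
    target: str,
    launcher: str = "py",
) -> List[str]:
    """Build the argument list for the documented `convert` call.

    Hypothetical helper, not part of C6C. The processors are rendered
    as a Python-style list literal, matching the form shown in the
    usage line above.
    """
    return [
        launcher, "C6C.py", "convert",
        "-i", input_format,
        "-e", export_format,
        "-p", repr(list(processors)),
        source, target,
    ]


cmd = build_convert_command(
    "xmlDTA", "conlluplus",
    ["dtachopper", "dtasimplifier"],  # applied in this order
    "corpus_in", "corpus_out",
)

# The processor argument is an ordinary list literal, so it can be
# read back safely with ast.literal_eval (shown only as a sanity check;
# how C6C itself parses the argument is not stated in this README).
processor_arg = cmd[cmd.index("-p") + 1]
parsed = ast.literal_eval(processor_arg)
```

Passing `cmd` as a list to `subprocess.run(cmd, check=True)` avoids shell quoting problems, because the list literal is handed over as a single argument.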

Documentation of pipeline components

  1. Importers
  2. Processors
  3. Exporters

About

C6 Converter for Historical Corpora
