rubcompling/nna-datasets

nna-datasets

Datasets and scripts for the analysis of anaphora with non-nominal antecedents

How to cite

For the pronominal dataset:

Dipper, S. & Zinsmeister, H. (2012). Annotating abstract anaphora. Language Resources and Evaluation, 46, pp. 37–52. (preprint)

For the shell nouns dataset:

Simonjetz, F. & Roussel, A. (2016). Crosslinguistic Annotation of German and English Shell Noun Complexes. Proceedings of the 13th Conference on Natural Language Processing (KONVENS), pp. 265–278.

See also

For an extensive review of theoretical literature and the resources pertaining to anaphora with non-nominal antecedents, see:

Kolhatkar, V., Roussel, A., Dipper, S., & Zinsmeister, H. (2018). Anaphora with Non-nominal Antecedents in Computational Linguistics: A Survey. Computational Linguistics, 44(3).

Further resources and data are available in this repository: https://github.com/kvarada/non-NA_Resources

Usage

The datasets are provided as TSV tables, which can be read directly into R or pandas. Each dataset contains base data in two tables: tokens.tsv, which contains the tokens only, and tokens_parsed.tsv, which contains POS tags, lemmas, and dependency parses from the spaCy toolkit.
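
As a minimal sketch of reading one of these tables in Python, the following uses only the standard library `csv` module. It assumes that each file starts with a header row, which this README does not state, and the helper name `read_tsv` is hypothetical. With pandas, `pd.read_csv(path, sep="\t")` would serve the same purpose.

```python
import csv
from pathlib import Path


def read_tsv(path):
    """Read a tab-separated file into a list of dicts keyed by the header row.

    Assumption: the first line of the file is a header row.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quotation-mark tokens as literal text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)


if __name__ == "__main__":
    # Path follows the directory layout of this repository.
    rows = read_tsv("pronouns/tokens_parsed.tsv")
    print(len(rows), "rows; columns:", list(rows[0].keys()) if rows else [])
```

Quoting is switched off because a token table may well contain bare quotation marks as tokens, which a default CSV reader would try to interpret.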

The ./scripts directory provides R scripts with utility functions (util.R), for loading the data (loadData.R), and for generating tables and graphics (graphics.R).

If you use the loadData.R script to load the data, you should end up with a few useful data tables:

  • pro.tokens -- complete corpus with spaCy annotations (pronouns)
  • sn.tokens -- complete corpus with spaCy annotations (shell nouns)
  • both.sns -- shell nouns (just anaphor annotations from both annotators)
  • both.pro.withante -- pronoun data from both annotators with antecedents
  • both.sn.withcp -- shell noun data from both annotators (only pairs where both annotators marked just one antecedent)
  • gold.pro -- pronouns with content (all pairs)
  • gold.all -- shell nouns with content (all pairs)

Link tables contain rows referring to anaphor and antecedent instances (n:m relation) and can be used to join anaphor and antecedent tables (the loadData.R script takes care of this).
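
For readers working outside R, the join through a link table can be sketched in Python as follows. The column names `id`, `anaphor_id`, and `antecedent_id` are hypothetical placeholders, since this README does not document the TSV headers, and `join_links` is an illustrative helper, not part of the repository. The authoritative logic is in loadData.R.

```python
from collections import defaultdict


def join_links(anaphors, antecedents, links,
               anaphor_key="anaphor_id", antecedent_key="antecedent_id",
               id_key="id"):
    """Pair each anaphor row with its antecedent rows via the link table.

    All three arguments are lists of dicts (one dict per table row).
    The key names are hypothetical placeholders; check the TSV headers.
    """
    antecedent_by_id = {row[id_key]: row for row in antecedents}
    linked = defaultdict(list)
    for link in links:
        # A link whose antecedent id is unknown is skipped.
        antecedent = antecedent_by_id.get(link[antecedent_key])
        if antecedent is not None:
            linked[link[anaphor_key]].append(antecedent)
    # Inner join: anaphors without any link produce no pairs.
    return [(anaphor, antecedent)
            for anaphor in anaphors
            for antecedent in linked.get(anaphor[id_key], [])]


if __name__ == "__main__":
    # Toy rows, invented purely for illustration.
    anaphors = [{"id": "a1"}, {"id": "a2"}]
    antecedents = [{"id": "c1"}, {"id": "c2"}]
    links = [
        {"anaphor_id": "a1", "antecedent_id": "c1"},
        {"anaphor_id": "a1", "antecedent_id": "c2"},
        {"anaphor_id": "a2", "antecedent_id": "c2"},
    ]
    for anaphor, antecedent in join_links(anaphors, antecedents, links):
        print(anaphor["id"], "->", antecedent["id"])
```

Because the relation is n:m, one anaphor can appear in several pairs and one antecedent can be shared by several anaphors, so the result is a list of pairs, not a one-to-one mapping.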

Directory layout

.
├── LICENSE
├── pronouns
│   ├── annotator1
│   │   ├── anaphors.tsv
│   │   ├── antecedents.tsv
│   │   └── linktable.tsv
│   ├── annotator2
│   │   ├── anaphors.tsv
│   │   ├── antecedents.tsv
│   │   └── linktable.tsv
│   ├── gold
│   │   ├── anaphors.tsv
│   │   ├── antecedents.tsv
│   │   └── linktable.tsv
│   ├── tokens_parsed.tsv
│   └── tokens.tsv
├── README.md
├── scripts
│   ├── annotation-nouns.de
│   ├── graphics.R
│   ├── loadData.R
│   └── util.R
├── shellnouns-de
│   ├── annotator1
│   │   ├── contentphrases.tsv
│   │   ├── linktable.tsv
│   │   └── shellnouns.tsv
│   ├── annotator2
│   │   ├── contentphrases.tsv
│   │   ├── linktable.tsv
│   │   └── shellnouns.tsv
│   ├── gold
│   │   ├── contentphrases.tsv
│   │   ├── linktable.tsv
│   │   └── shellnouns.tsv
│   ├── tokens_parsed.tsv
│   └── tokens.tsv
└── shellnouns-en
    ├── annotator1
    │   ├── contentphrases.tsv
    │   ├── linktable.tsv
    │   └── shellnouns.tsv
    ├── annotator2
    │   ├── contentphrases.tsv
    │   ├── linktable.tsv
    │   └── shellnouns.tsv
    ├── gold
    │   ├── contentphrases.tsv
    │   ├── linktable.tsv
    │   └── shellnouns.tsv
    ├── tokens_parsed.tsv
    └── tokens.tsv
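
The layout is regular enough that the expected file paths can be generated instead of typed out. The sketch below lists them for one dataset; the names `TABLES`, `VERSIONS`, and `dataset_files` are hypothetical, and the file names are copied from the tree above.

```python
from pathlib import Path

# Table names per dataset, taken from the directory layout above.
TABLES = {
    "pronouns": ("anaphors.tsv", "antecedents.tsv", "linktable.tsv"),
    "shellnouns-de": ("shellnouns.tsv", "contentphrases.tsv", "linktable.tsv"),
    "shellnouns-en": ("shellnouns.tsv", "contentphrases.tsv", "linktable.tsv"),
}
VERSIONS = ("annotator1", "annotator2", "gold")


def dataset_files(root, dataset):
    """Return the paths of all TSV files that the layout lists for a dataset."""
    base = Path(root) / dataset
    paths = [base / "tokens.tsv", base / "tokens_parsed.tsv"]
    for version in VERSIONS:
        paths.extend(base / version / name for name in TABLES[dataset])
    return paths


if __name__ == "__main__":
    # Report which of the expected files exist in the current checkout.
    for path in dataset_files(".", "shellnouns-de"):
        print(path, "found" if path.exists() else "missing")
```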
