`nna-datasets`

Datasets and scripts for the analysis of anaphora with non-nominal antecedents

How to cite

For the pronominal dataset:

Dipper, S. & Zinsmeister, H. (2012). Annotating abstract anaphora. Language Resources and Evaluation, 46, pp. 37–52. (preprint)

For the shell nouns dataset:

Simonjetz, F. & Roussel, A. (2016). Crosslinguistic Annotation of German and English Shell Noun Complexes. Proceedings of the 13th Conference on Natural Language Processing (KONVENS), pp. 265–278.

Usage

The datasets are provided as TSV tables, which can be read into R or pandas as required. Each dataset contains its base data in two tables: tokens.tsv, which contains the tokens only, and tokens_parsed.tsv, which contains POS tags, lemmas, and dependency parses from the spaCy toolkit.
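As a rough illustration of reading one of these tables outside R, the following Python sketch uses only the standard library. It assumes that each TSV file starts with a header row, which is not confirmed here, and `read_tsv` is a hypothetical helper name, not part of this repository. With pandas, `pandas.read_csv(path, sep="\t", quoting=csv.QUOTE_NONE)` should give a comparable result.

```python
import csv
from pathlib import Path


def read_tsv(path):
    """Read a tab-separated file into a list of dicts keyed by the header row.

    Assumption: the first line of the file is a header row.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps literal quote characters intact; quotation marks
        # are likely to occur as tokens in a tokenized corpus.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)
```

For example, `read_tsv("pronouns/tokens_parsed.tsv")` would return one dict per row, with all values as strings.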

The ./scripts directory contains R scripts with utility functions (util.R), for loading the data (loadData.R), and for generating tables and graphics (graphics.R).

If you load the data with the loadData.R script, you should end up with the following data tables:

pro.tokens -- complete corpus with spaCy annotations (pronouns)
sn.tokens -- complete corpus with spaCy annotations (shell nouns)
both.sns -- shell nouns (just anaphor annotations from both annotators)
both.pro.withante -- pronoun data from both annotators with antecedents
both.sn.withcp -- shell noun data from both annotators (only pairs where both annotators marked just one antecedent)
gold.pro -- pronouns with content (all pairs)
gold.all -- shell nouns with content (all pairs)

Link tables contain rows referring to anaphor and antecedent instances (n:m relation) and can be used to join anaphor and antecedent tables (the loadData.R script takes care of this).
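The join described above can be sketched in Python as follows. The column names `id`, `anaphor_id`, and `antecedent_id` are hypothetical placeholders (check the headers of the actual TSV files), and `join_via_links` is an illustrative helper, not the logic used in loadData.R.

```python
def join_via_links(anaphors, antecedents, links,
                   id_key="id",
                   anaphor_key="anaphor_id",
                   antecedent_key="antecedent_id"):
    """Pair anaphor rows with antecedent rows through an n:m link table.

    Each argument is a list of dicts (one dict per table row).
    All column names are hypothetical defaults, not the real headers.
    """
    anaphor_by_id = {row[id_key]: row for row in anaphors}
    antecedent_by_id = {row[id_key]: row for row in antecedents}

    pairs = []
    for link in links:
        anaphor = anaphor_by_id.get(link[anaphor_key])
        antecedent = antecedent_by_id.get(link[antecedent_key])
        # Inner join: skip links that point at a row missing from either table.
        if anaphor is not None and antecedent is not None:
            pairs.append((anaphor, antecedent))
    return pairs
```

Because the relation is n:m, one anaphor can appear in several pairs, and so can one antecedent.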

Directory layout

.
├── LICENSE
├── pronouns
│   ├── annotator1
│   │   ├── anaphors.tsv
│   │   ├── antecedents.tsv
│   │   └── linktable.tsv
│   ├── annotator2
│   │   ├── anaphors.tsv
│   │   ├── antecedents.tsv
│   │   └── linktable.tsv
│   ├── gold
│   │   ├── anaphors.tsv
│   │   ├── antecedents.tsv
│   │   └── linktable.tsv
│   ├── tokens_parsed.tsv
│   └── tokens.tsv
├── README.md
├── scripts
│   ├── annotation-nouns.de
│   ├── graphics.R
│   ├── loadData.R
│   └── util.R
├── shellnouns-de
│   ├── annotator1
│   │   ├── contentphrases.tsv
│   │   ├── linktable.tsv
│   │   └── shellnouns.tsv
│   ├── annotator2
│   │   ├── contentphrases.tsv
│   │   ├── linktable.tsv
│   │   └── shellnouns.tsv
│   ├── gold
│   │   ├── contentphrases.tsv
│   │   ├── linktable.tsv
│   │   └── shellnouns.tsv
│   ├── tokens_parsed.tsv
│   └── tokens.tsv
└── shellnouns-en
    ├── annotator1
    │   ├── contentphrases.tsv
    │   ├── linktable.tsv
    │   └── shellnouns.tsv
    ├── annotator2
    │   ├── contentphrases.tsv
    │   ├── linktable.tsv
    │   └── shellnouns.tsv
    ├── gold
    │   ├── contentphrases.tsv
    │   ├── linktable.tsv
    │   └── shellnouns.tsv
    ├── tokens_parsed.tsv
    └── tokens.tsv
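
The layout above is regular enough to build file paths programmatically. The sketch below is one possible way to do so in Python; `LAYOUT`, `ANNOTATIONS`, and `annotation_paths` are hypothetical names introduced for illustration, and the table names are copied from the tree above.

```python
from pathlib import Path

# Table names per dataset directory, copied from the directory layout.
LAYOUT = {
    "pronouns": ("anaphors", "antecedents", "linktable"),
    "shellnouns-de": ("shellnouns", "contentphrases", "linktable"),
    "shellnouns-en": ("shellnouns", "contentphrases", "linktable"),
}

# Annotation layers available in every dataset directory.
ANNOTATIONS = ("annotator1", "annotator2", "gold")


def annotation_paths(root, dataset, annotation):
    """Return {table name: path} for one annotation layer of one dataset."""
    if dataset not in LAYOUT:
        raise ValueError(f"unknown dataset: {dataset!r}")
    if annotation not in ANNOTATIONS:
        raise ValueError(f"unknown annotation layer: {annotation!r}")
    base = Path(root) / dataset / annotation
    return {name: base / f"{name}.tsv" for name in LAYOUT[dataset]}
```

For example, `annotation_paths(".", "pronouns", "gold")` would map `"linktable"` to the path `pronouns/gold/linktable.tsv` relative to the repository root.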
