German/English Parallel Shell Noun Corpus

German/English parallel text from the Europarl Corpus with manually annotated shell noun complexes, as described in Simonjetz & Roussel (2016).

Annotations contain a series of noncontiguous turns (turn) from the Europarl corpus, grouped according to plenary session (europarl_chunk). The turns are presented in English/German pairs, with the language of a particular turn marked both in turn_id and in the lang attribute. The pairs were presented to annotators in random order, either German first or English first, in order to minimize bias towards one language or the other, and this is the order they occur in here as well.

The following excerpt shows how the data is organized:

<europarl_chunk source="../../ep-02-03-11.txt">
  <turn turn_id="t_02-03-11_12/en" lang="en">
    <alignUnit al_id="a_02-03-11_12/en.1">
      <sent sent_id="s_1">
        <tok mate_lemma="mr" mate_morph="_" mate_pos="NNP" id="t_1" mate_mother="t_2" mate_rel="NMOD">Mr</tok>

After the corpus base data there are two elements containing the actual shell noun–related data, shellnouns and content_phrase.

Each shellnoun element contains the following attributes:

align_unit : Identifier which all elements aligned to one another will share.

content : Either given (content is marked), external (content is probably present, but not in this turn), or, in rare cases unclear (unclear whether marked phrase is part of shell noun complex).

content_phrases : Reference to id of content_phrase element containing content of this shell noun instance.

id : Identifier for this shell noun instance.

span : Reference to token span corresponding to this shell noun instance.

value : Either true (this is a shell noun), false (not a shell noun), undefined (not annotated), or unclear (not clear whether this is a shell noun instance or not).

Each content_phrase element has the following attributes:

align_unit : As above.

id : Identifier for this content phrase instance.

nominal : If true, then this instance is nominal; if false, then clausal.

span : Reference to tokens. (See above.)

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
annotator1		annotator1
annotator2		annotator2
gold-standard-dataset		gold-standard-dataset
sn_corpus		sn_corpus
src		src
LICENSE		LICENSE
README.md		README.md
annotation-nouns.de		annotation-nouns.de
annotation-nouns.en		annotation-nouns.en

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

German/English Parallel Shell Noun Corpus

About

Uh oh!

Releases

Packages

Languages

License

rubcompling/shell-nouns-data

Folders and files

Latest commit

History

Repository files navigation

German/English Parallel Shell Noun Corpus

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages