German/English parallel text from the Europarl Corpus with manually annotated shell noun complexes, as described in Simonjetz & Roussel (2016).
Annotations contain a series of noncontiguous turns (turn) from the Europarl corpus, grouped according to plenary session (europarl_chunk). The turns are presented in English/German pairs, with the language of a particular turn marked both in turn_id and in the lang attribute. The pairs were presented to annotators in random order, either German first or English first, in order to minimize bias towards one language or the other, and this is the order they occur in here as well.
The following excerpt shows how the data is organized:
<europarl_chunk source="../../ep-02-03-11.txt">
<turn turn_id="t_02-03-11_12/en" lang="en">
<alignUnit al_id="a_02-03-11_12/en.1">
<sent sent_id="s_1">
<tok mate_lemma="mr" mate_morph="_" mate_pos="NNP" id="t_1" mate_mother="t_2" mate_rel="NMOD">Mr</tok>
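The nesting shown in the excerpt can be walked with a standard XML parser. The following is a minimal sketch in Python, not part of the corpus distribution. It assumes that the German turn of a pair carries the same turn_id with a /de suffix, and the closing tags and the German turn in the sample are invented for illustration; only the opening lines are taken from the excerpt above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample: only the English opening lines come from the excerpt.
# The closing tags, the German turn and its "/de" suffix are assumptions.
SAMPLE = """
<europarl_chunk source="../../ep-02-03-11.txt">
  <turn turn_id="t_02-03-11_12/en" lang="en">
    <alignUnit al_id="a_02-03-11_12/en.1">
      <sent sent_id="s_1">
        <tok mate_lemma="mr" mate_morph="_" mate_pos="NNP" id="t_1" mate_mother="t_2" mate_rel="NMOD">Mr</tok>
      </sent>
    </alignUnit>
  </turn>
  <turn turn_id="t_02-03-11_12/de" lang="de">
    <alignUnit al_id="a_02-03-11_12/de.1">
      <sent sent_id="s_1">
        <tok id="t_1">Herr</tok>
      </sent>
    </alignUnit>
  </turn>
</europarl_chunk>
"""


def turn_pairs(chunk):
    """Group the token texts of each turn by the turn_id without its language suffix."""
    pairs = defaultdict(dict)
    for turn in chunk.iter("turn"):
        # "t_02-03-11_12/en" -> base "t_02-03-11_12"; the language is read from lang.
        base, _, _suffix = turn.get("turn_id").rpartition("/")
        pairs[base][turn.get("lang")] = [tok.text for tok in turn.iter("tok")]
    return dict(pairs)


chunk = ET.fromstring(SAMPLE)
pairs = turn_pairs(chunk)
for base, by_lang in pairs.items():
    print(base, by_lang)
```

Because each inner dictionary keeps insertion order, the language that comes first in the file (and so was shown first to annotators) also comes first in the result.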
After the corpus base data there are two elements containing the actual shell noun–related data, shellnouns and content_phrase.
Each shellnoun element contains the following attributes:
align_unit: Identifier which all elements aligned to one another will share.
content: Either given (content is marked), external (content is probably present, but not in this turn), or, in rare cases, unclear (unclear whether the marked phrase is part of the shell noun complex).
content_phrases: Reference to the id of the content_phrase element containing the content of this shell noun instance.
id: Identifier for this shell noun instance.
span: Reference to token span corresponding to this shell noun instance.
value: Either true (this is a shell noun), false (not a shell noun), undefined (not annotated), or unclear (not clear whether this is a shell noun instance or not).
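The closed label sets for value and content make it easy to check an annotation file and count how often each label occurs. The sketch below is one way to do that in Python. The annotations wrapper element, the ids and the span strings in the sample are hypothetical, and the sketch assumes that content may be absent on some elements, which the description above does not state.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Label sets as listed in the attribute descriptions above.
VALUE_LABELS = {"true", "false", "undefined", "unclear"}
CONTENT_LABELS = {"given", "external", "unclear"}

# Hypothetical sample: wrapper element, ids and span strings are invented.
SHELLNOUN_SAMPLE = """
<annotations>
  <shellnoun id="sn_1" align_unit="au_1" value="true" content="given" content_phrases="cp_1" span="t_3"/>
  <shellnoun id="sn_2" align_unit="au_2" value="true" content="external" span="t_9"/>
  <shellnoun id="sn_3" align_unit="au_3" value="false" span="t_14"/>
</annotations>
"""


def tally_shellnouns(root):
    """Count value and content labels, rejecting labels outside the documented sets."""
    values, contents = Counter(), Counter()
    for sn in root.iter("shellnoun"):
        value = sn.get("value")
        if value not in VALUE_LABELS:
            raise ValueError(f"unexpected value {value!r} on {sn.get('id')}")
        values[value] += 1
        content = sn.get("content")
        if content is not None:  # assumption: content can be missing
            if content not in CONTENT_LABELS:
                raise ValueError(f"unexpected content {content!r} on {sn.get('id')}")
            contents[content] += 1
    return values, contents


values, contents = tally_shellnouns(ET.fromstring(SHELLNOUN_SAMPLE))
print(dict(values))
print(dict(contents))
```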
Each content_phrase element has the following attributes:
align_unit: As above.
id: Identifier for this content phrase instance.
nominal: If true, this instance is nominal; if false, it is clausal.
span: Reference to tokens. (See above.)
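The content_phrases attribute of a shell noun and the id attribute of a content phrase together link the two element types. A minimal Python sketch of that lookup follows. It assumes that content_phrases holds a single id (whether it can hold several is not stated here), treats span as an opaque string because its format is not specified, and uses an invented wrapper element and invented ids.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: wrapper element, ids and span strings are invented.
LINK_SAMPLE = """
<annotations>
  <shellnoun id="sn_1" align_unit="au_1" value="true" content="given" content_phrases="cp_1" span="t_3"/>
  <shellnoun id="sn_2" align_unit="au_2" value="true" content="external" span="t_9"/>
  <content_phrase id="cp_1" align_unit="au_1" nominal="false" span="t_5"/>
</annotations>
"""


def link_content(root):
    """Return (shellnoun id, content_phrase id, 'nominal' or 'clausal') triples."""
    phrases = {cp.get("id"): cp for cp in root.iter("content_phrase")}
    links = []
    for sn in root.iter("shellnoun"):
        ref = sn.get("content_phrases")  # assumption: a single id, not a list
        cp = phrases.get(ref) if ref else None
        if cp is None:
            continue  # e.g. external content, which has no phrase in this turn
        kind = "nominal" if cp.get("nominal") == "true" else "clausal"
        links.append((sn.get("id"), cp.get("id"), kind))
    return links


links = link_content(ET.fromstring(LINK_SAMPLE))
print(links)
```

Shell nouns without a resolvable reference are skipped rather than reported as errors, since content marked external is expected to have no content phrase in the same turn.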