Hi,

this PR finally adds support for the recently introduced BarNER dataset, which was proposed in the LREC-COLING 2024 paper "Sebastian, Basti, Wastl?! Recognizing Named Entities in Bavarian Dialectal Data" by Peng et al.

Fixes #3533.
Resources
Entities
Four major entity types are annotated: `PER`, `ORG`, `LOC` and `MISC`. Additionally, `-deriv` and `-part` suffixes are introduced for tokens that are derived from or partly contain NEs. Furthermore, NEs referring to languages (`LANG`), religions (`RELIGION`), events (`EVENT`), and works of art (`WOA`) are also annotated, incl. their `-deriv` and `-part` suffixed variants.

Notice: `PER`/`LOC`/`ORG`/`MISC` are considered coarse-grained, while the extension with `-part`, `-deriv`, and the other entity types is considered fine-grained.

Demo
The dataset can be loaded with:
Additionally, the written dataset loader also supports loading the BarNER dataset with fine-grained labels:
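For illustration, loading could look like the following sketch. Note that the loader class name `NER_BAVARIAN_WIKI` and the `fine_grained` argument are assumptions based on Flair's dataset naming conventions, not verified against the final merged code:

```python
from flair.datasets import NER_BAVARIAN_WIKI  # class name assumed from this PR

# Coarse-grained labels (PER, LOC, ORG, MISC):
corpus = NER_BAVARIAN_WIKI()
print(corpus)

# Fine-grained labels (incl. LANG, RELIGION, EVENT, WOA and
# the -deriv/-part variants) -- the argument name is an assumption:
corpus_fine = NER_BAVARIAN_WIKI(fine_grained=True)
print(corpus_fine)
```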
Here is one example from the training dataset (coarse-grained vs. fine-grained):

- Coarse-grained: "Namibia"/`LOC`
- Fine-grained: "afrikanisch"/`LANG`, "Kapholländisch"/`LANG`, "Kolonial-Niedaländisch"/`LANG`, "Sidafrika"/`LOC`, "Namibia"/`LOC`
Caveats
Unfortunately, only the Wikipedia part of the BarNER dataset is publicly available, as the Twitter data has license restrictions, so I decided to include only the Wikipedia part for now.
It seems that the total numbers do not quite match the paper: Table 1 in the paper reports 75,687 tokens and 3,574 sentences, whereas 75,690 tokens and 3,577 sentences can be found in the original data.
I could reproduce these numbers both with the Flair dataset loader in this PR and on the command line using:
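The exact command is not reproduced here; a minimal pure-Python equivalent for counting tokens and sentences, assuming a CoNLL-style format where blank lines separate sentences and comment lines start with `#`, could look like this:

```python
# Count tokens and sentences in a CoNLL-style file:
# blank lines separate sentences, "#"-prefixed lines are comments.
def count_conll(lines):
    tokens = 0
    sentences = 0
    in_sentence = False
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if in_sentence:
                sentences += 1
                in_sentence = False
        elif not line.startswith("#"):
            tokens += 1
            in_sentence = True
    if in_sentence:  # file may not end with a blank line
        sentences += 1
    return tokens, sentences


# Tiny synthetic example (not actual BarNER data):
sample = """# sentence 1
Namibia B-LOC
is O

Servus O
"""
print(count_conll(sample.splitlines()))  # → (3, 2)
```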
Fine-Tuning
I could successfully fine-tune a model on this dataset (with Flair and the dataset loader in this PR). It is available here:
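For reference, fine-tuning a Flair tagger on such a corpus typically follows the sketch below. The loader class name, base transformer model, and hyperparameters are illustrative assumptions, not the exact setup used for the released model:

```python
from flair.datasets import NER_BAVARIAN_WIKI  # class name assumed from this PR
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Load the corpus and build the label dictionary
corpus = NER_BAVARIAN_WIKI()
label_dict = corpus.make_label_dictionary(label_type="ner")

# Fine-tunable transformer embeddings; the base model is a placeholder choice
embeddings = TransformerWordEmbeddings(
    model="xlm-roberta-large",
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
    use_context=True,
)

# Linear tagger head (no CRF/RNN), the usual setup for transformer fine-tuning
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_type="ner",
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)

trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune(
    "resources/taggers/ner-barner",  # output path, illustrative
    learning_rate=5.0e-6,
    mini_batch_size=16,
    max_epochs=10,
)
```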