
Add BarNER Dataset #3604

Merged
merged 4 commits into from
Feb 3, 2025
Conversation

stefan-it
Member

@stefan-it stefan-it commented Jan 27, 2025



Hi,

this PR finally adds support for the recently introduced BarNER dataset, which was proposed in the LREC-COLING 2024 paper "Sebastian, Basti, Wastl?! Recognizing Named Entities in Bavarian Dialectal Data" by Peng et al.:

> Named Entity Recognition (NER) is a fundamental task to extract key information from texts, but annotated resources are scarce for dialects. This paper introduces the first dialectal NER dataset for German, BarNER, with 161K tokens annotated on Bavarian Wikipedia articles (bar-wiki) and tweets (bar-tweet), using a schema adapted from German CoNLL 2006 and GermEval. The Bavarian dialect differs from standard German in lexical distribution, syntactic construction, and entity information. We conduct in-domain, cross-domain, sequential, and joint experiments on two Bavarian and three German corpora and present the first comprehensive NER results on Bavarian. Incorporating knowledge from the larger German NER (sub-)datasets notably improves on bar-wiki and moderately on bar-tweet. Inversely, training first on Bavarian contributes slightly to the seminal German CoNLL 2006 corpus. Moreover, with gold dialect labels on Bavarian tweets, we assess multi-task learning between five NER and two Bavarian-German dialect identification tasks and achieve NER SOTA on bar-wiki. We substantiate the necessity of our low-resource BarNER corpus and the importance of diversity in dialects, genres, and topics in enhancing model performance.

Fixes #3533.

Resources

Entities

Four major entity types are annotated: PER, ORG, LOC and MISC. Additionally, -deriv and -part suffixes are introduced for tokens that are derived from or partly contain NEs.

Furthermore, NEs referring to languages (LANG), religions (RELIGION), events (EVENT), and works of art (WOA) are also annotated, including their -deriv and -part suffixed variations.

Notice: PER/LOC/ORG/MISC are considered coarse-grained; the extension with -part, -deriv, and the additional entity types is considered fine-grained.
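To make the coarse/fine distinction concrete, here is a minimal sketch (not part of this PR, and the loader's actual mapping may differ) of a hypothetical helper that strips the -deriv/-part suffixes from BIO tags to recover the base entity type; fine-grained-only types such as LANG would need separate handling when mapping down to the coarse schema:

```python
# Hypothetical helper (illustration only): strip the -deriv/-part
# suffixes from a BIO tag, e.g. "B-PER-deriv" -> "B-PER".
def strip_suffix(tag: str) -> str:
    if tag == "O":
        return tag
    prefix, _, entity = tag.partition("-")  # e.g. "B" and "PER-deriv"
    for suffix in ("-deriv", "-part"):
        if entity.endswith(suffix):
            entity = entity[: -len(suffix)]
    return f"{prefix}-{entity}"

print(strip_suffix("B-PER-deriv"))  # B-PER
print(strip_suffix("I-LOC-part"))   # I-LOC
print(strip_suffix("O"))            # O
```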

Demo

The dataset can be loaded with:

```python
from flair.datasets import NER_BAVARIAN_WIKI

corpus = NER_BAVARIAN_WIKI()

print(str(corpus))

# Outputs:
# 'Corpus: 2905 train + 342 dev + 374 test sentences'
```

Additionally, the dataset loader also supports loading BarNER with fine-grained labels:

```python
from flair.datasets import NER_BAVARIAN_WIKI

corpus = NER_BAVARIAN_WIKI(fine_grained=True)
```

Here is one example from the training dataset (coarse-grained vs. fine-grained):

| Sentence | Coarse-grained | Fine-grained |
|---|---|---|
| "Afrikaans ( weatli : afrikanisch ) , friaha aa : Kapholländisch oda Kolonial-Niedaländisch gnennt , is oane vo de ejf Amtssprochn in Sidafrika und a oneakonnte Mindaheitnsproch in Namibia ." | "Sidafrika"/LOC, "Namibia"/LOC | "Afrikaans"/LANG, "afrikanisch"/LANG, "Kapholländisch"/LANG, "Kolonial-Niedaländisch"/LANG, "Sidafrika"/LOC, "Namibia"/LOC |

Caveats

Unfortunately, only the Wikipedia part of the BarNER dataset is publicly available; the Twitter data has license restrictions, so I decided to include only the Wikipedia part for now.

The total number of tokens and sentences does not quite match the numbers in the paper: Table 1 reports 75,687 tokens and 3,574 sentences, whereas the original data contains 75,690 tokens and 3,577 sentences.

I could reproduce the latter numbers both with the Flair dataset loader in this PR and on the command line using:

```bash
$ cat *.tsv | grep -v "^# " | grep -v "^$" | wc -l  # Count tokens
$ cat *.tsv | grep -c "# text"                      # Count sentences
```
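The same counting logic can be sketched in Python (a minimal illustration, assuming the CoNLL-U-plus-style layout of the BarNER .tsv files: comment lines start with "# ", each sentence is introduced by a "# text" line, and every remaining non-empty line is one token):

```python
# Count tokens and sentences in CoNLL-U-style lines, mirroring the
# grep pipelines above: tokens are non-empty, non-comment lines;
# sentences are counted via their "# text" comment lines.
def count_tokens_and_sentences(lines):
    tokens = sum(1 for line in lines if line.strip() and not line.startswith("# "))
    sentences = sum(1 for line in lines if line.startswith("# text"))
    return tokens, sentences

sample = [
    "# text = Servus Minga !",
    "Servus\tO",
    "Minga\tB-LOC",
    "!\tO",
    "",
]
print(count_tokens_and_sentences(sample))  # (3, 1)
```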

Fine-Tuning

I could successfully fine-tune a model on this dataset (with Flair and the dataset loader in this PR). It is available here:

@stefan-it stefan-it force-pushed the add_bavarian_wiki_dataset branch from ab70501 to dcd029b on February 1, 2025 at 10:55
@alanakbik
Collaborator

@stefan-it thanks for adding this! I tested on Ubuntu and Windows. It works on Ubuntu but unfortunately raises an encoding error on Windows.

```
Traceback (most recent call last):
  File "\flair\local_quick.py", line 7, in <module>
    corpus = NER_BAVARIAN_WIKI()
  File "\flair\flair\datasets\sequence_labeling.py", line 5576, in __init__
    for line in f_p:
  File "\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 3393: character maps to <undefined>
```

@stefan-it
Member Author

Hey @alanakbik,

many thanks for testing! I've seen these encoding statements also in PR #3557, so I will add them here soon as well :)
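For context, the usual remedy for this class of error is to pass an explicit encoding when opening the dataset files, so the platform default (cp1252 on Windows) is never used. A minimal sketch of that idea (illustration only; the actual fix pushed to the branch may differ):

```python
import pathlib
import tempfile

# Always pass an explicit encoding instead of relying on the
# platform default, which is cp1252 on Windows.
def read_dataset_file(path):
    with open(path, encoding="utf-8") as f_p:
        return f_p.read()

# Round-trip a line containing a non-ASCII character.
path = pathlib.Path(tempfile.mkdtemp()) / "bar_wiki.tsv"
path.write_text("Kapholländisch\tB-LANG\n", encoding="utf-8")
print(read_dataset_file(path))
```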

@alanakbik
Collaborator

I already pushed a fix to the branch!

@alanakbik alanakbik merged commit 3d24c35 into master Feb 3, 2025
2 checks passed
@alanakbik alanakbik deleted the add_bavarian_wiki_dataset branch February 3, 2025 19:21
Development

Successfully merging this pull request may close these issues.

[Feature]: Add support for Bavarian NER Dataset (BarNER)
2 participants