
Add BarNER Dataset #3604

Merged
merged 4 commits into from
Feb 3, 2025
Conversation

stefan-it
Member

@stefan-it stefan-it commented Jan 27, 2025



Hi,

this PR finally adds support for the recently introduced BarNER dataset, which was proposed in the LREC-COLING 2024 paper "Sebastian, Basti, Wastl?! Recognizing Named Entities in Bavarian Dialectal Data" by Peng et al.:

> Named Entity Recognition (NER) is a fundamental task to extract key information from texts, but annotated resources are scarce for dialects. This paper introduces the first dialectal NER dataset for German, BarNER, with 161K tokens annotated on Bavarian Wikipedia articles (bar-wiki) and tweets (bar-tweet), using a schema adapted from German CoNLL 2006 and GermEval. The Bavarian dialect differs from standard German in lexical distribution, syntactic construction, and entity information. We conduct in-domain, cross-domain, sequential, and joint experiments on two Bavarian and three German corpora and present the first comprehensive NER results on Bavarian. Incorporating knowledge from the larger German NER (sub-)datasets notably improves on bar-wiki and moderately on bar-tweet. Inversely, training first on Bavarian contributes slightly to the seminal German CoNLL 2006 corpus. Moreover, with gold dialect labels on Bavarian tweets, we assess multi-task learning between five NER and two Bavarian-German dialect identification tasks and achieve NER SOTA on bar-wiki. We substantiate the necessity of our low-resource BarNER corpus and the importance of diversity in dialects, genres, and topics in enhancing model performance.

Fixes #3533.

Resources

Entities

Four major entity types are annotated: PER, ORG, LOC and MISC. Additionally, -deriv and -part suffixes are introduced for tokens that are derived from or partly contain NEs.

Furthermore, NEs referring to languages (LANG), religions (RELIGION), events (EVENT), and works of art (WOA) are also annotated, including their -deriv and -part suffixed variations.

Notice: PER/LOC/ORG/MISC are considered coarse-grained; the extension with -part, -deriv, and the additional entity types is considered fine-grained.
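To make the coarse/fine distinction concrete, here is a minimal sketch (not part of this PR, and the loader's actual mapping may differ) of a hypothetical helper that strips the -deriv/-part suffixes from BIO tags to recover the base entity type; fine-grained-only types such as LANG would need separate handling when mapping down to the coarse schema:

```python
# Hypothetical helper (illustration only): strip the -deriv/-part
# suffixes from a BIO tag, e.g. "B-PER-deriv" -> "B-PER".
def strip_suffix(tag: str) -> str:
    if tag == "O":
        return tag
    prefix, _, entity = tag.partition("-")  # e.g. "B" and "PER-deriv"
    for suffix in ("-deriv", "-part"):
        if entity.endswith(suffix):
            entity = entity[: -len(suffix)]
    return f"{prefix}-{entity}"

print(strip_suffix("B-PER-deriv"))  # B-PER
print(strip_suffix("I-LOC-part"))   # I-LOC
print(strip_suffix("O"))            # O
```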

Demo

The dataset can be loaded with:

```python
from flair.datasets import NER_BAVARIAN_WIKI

corpus = NER_BAVARIAN_WIKI()

print(str(corpus))

# Outputs:
# 'Corpus: 2905 train + 342 dev + 374 test sentences'
```

Additionally, the dataset loader also supports loading BarNER with fine-grained labels:

```python
from flair.datasets import NER_BAVARIAN_WIKI

corpus = NER_BAVARIAN_WIKI(fine_grained=True)
```

Here is one example from the training dataset (coarse-grained vs. fine-grained):

| Sentence | Coarse-grained | Fine-grained |
|---|---|---|
| "Afrikaans ( weatli : afrikanisch ) , friaha aa : Kapholländisch oda Kolonial-Niedaländisch gnennt , is oane vo de ejf Amtssprochn in Sidafrika und a oneakonnte Mindaheitnsproch in Namibia ." | "Sidafrika"/LOC, "Namibia"/LOC | "Afrikaans"/LANG, "afrikanisch"/LANG, "Kapholländisch"/LANG, "Kolonial-Niedaländisch"/LANG, "Sidafrika"/LOC, "Namibia"/LOC |

Caveats

Unfortunately, only the Wikipedia part of the BarNER dataset is publicly available; the Twitter data has license restrictions, so I decided to include only the Wikipedia part for now.

The total number of tokens and sentences does not quite match the numbers in the paper: Table 1 reports 75,687 tokens and 3,574 sentences, whereas the original data contains 75,690 tokens and 3,577 sentences.

I could reproduce the latter numbers both with the Flair dataset loader in this PR and on the command line using:

```bash
$ cat *.tsv | grep -v "^# " | grep -v "^$" | wc -l  # Count tokens
$ cat *.tsv | grep -c "# text"                      # Count sentences
```
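The same counting logic can be sketched in Python (a minimal illustration, assuming the CoNLL-U-plus-style layout of the BarNER .tsv files: comment lines start with "# ", each sentence is introduced by a "# text" line, and every remaining non-empty line is one token):

```python
# Count tokens and sentences in CoNLL-U-style lines, mirroring the
# grep pipelines above: tokens are non-empty, non-comment lines;
# sentences are counted via their "# text" comment lines.
def count_tokens_and_sentences(lines):
    tokens = sum(1 for line in lines if line.strip() and not line.startswith("# "))
    sentences = sum(1 for line in lines if line.startswith("# text"))
    return tokens, sentences

sample = [
    "# text = Servus Minga !",
    "Servus\tO",
    "Minga\tB-LOC",
    "!\tO",
    "",
]
print(count_tokens_and_sentences(sample))  # (3, 1)
```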

Fine-Tuning

I could successfully fine-tune a model on this dataset (with Flair and the dataset loader in this PR). It is available here:

@stefan-it stefan-it force-pushed the add_bavarian_wiki_dataset branch from ab70501 to dcd029b on February 1, 2025 at 10:55
@alanakbik
Collaborator

@stefan-it thanks for adding this! I tested on Ubuntu and Windows. It works on Ubuntu but unfortunately raises an encoding error on Windows.

```
Traceback (most recent call last):
  File "\flair\local_quick.py", line 7, in <module>
    corpus = NER_BAVARIAN_WIKI()
  File "\flair\flair\datasets\sequence_labeling.py", line 5576, in __init__
    for line in f_p:
  File "\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 3393: character maps to <undefined>
```

@stefan-it
Member Author

Hey @alanakbik,

many thanks for testing! I've seen these encoding statements also in PR #3557, so I will add them here soon as well :)
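For context, the usual remedy for this class of error is to pass an explicit encoding when opening the dataset files, so the platform default (cp1252 on Windows) is never used. A minimal sketch of that idea (illustration only; the actual fix pushed to the branch may differ):

```python
import pathlib
import tempfile

# Always pass an explicit encoding instead of relying on the
# platform default, which is cp1252 on Windows.
def read_dataset_file(path):
    with open(path, encoding="utf-8") as f_p:
        return f_p.read()

# Round-trip a line containing a non-ASCII character.
path = pathlib.Path(tempfile.mkdtemp()) / "bar_wiki.tsv"
path.write_text("Kapholländisch\tB-LANG\n", encoding="utf-8")
print(read_dataset_file(path))
```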

@alanakbik
Collaborator

I already pushed a fix to the branch!

@alanakbik alanakbik merged commit 3d24c35 into master Feb 3, 2025
2 checks passed
@alanakbik alanakbik deleted the add_bavarian_wiki_dataset branch February 3, 2025 19:21
Development

Successfully merging this pull request may close these issues.

[Feature]: Add support for Bavarian NER Dataset (BarNER)
2 participants