Experimental Min Nan extraction function #397
Conversation
This looks good to me. I am fine with the `--subdialect` proposal but agree that `--dialects` is a superior solution. It should also reduce our need to micromanage this all, right?
LGTM. Kudos to you, Lucas, for working on these challenges! Understood that this Min Nan scrape was arbitrarily set to Hokkien for now, and is subject to change in future PRs based on how we want to handle subdialects.

The `--dialects` proposal does sound great (it might need another flag name, since it's too easily confused with the existing `--dialect`), but I can see the implementation may be tricky (which Lucas has already alluded to). So far we've seen the Chinese-styled and Brazilian Portuguese-styled subdialect formatting on Wiktionary. Are there other flavors we haven't come across yet?
"zyyy": "Common", | ||
"latn": "Latin", | ||
"hira": "Hiragana", | ||
"hani": "Han" |
I need a quick reminder -- where do these come from again?
From our new `languages_update.py` postprocessing step.
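For a quick illustration, here is a minimal sketch of how such a code-to-name mapping might be consumed; `SCRIPT_NAMES` and `script_display_name` are hypothetical names for illustration, not wikipron's actual code:

```python
# Hypothetical sketch of a script-code lookup: lowercased ISO 15924
# codes (as in the snippet above) mapped to human-readable names.
SCRIPT_NAMES = {
    "zyyy": "Common",
    "latn": "Latin",
    "hira": "Hiragana",
    "hani": "Han",
}


def script_display_name(code: str) -> str:
    """Return a human-readable name for an ISO 15924 script code.

    Falls back to the raw code for scripts we have not named yet.
    """
    return SCRIPT_NAMES.get(code.lower(), code)


if __name__ == "__main__":
    print(script_display_name("Hani"))  # -> "Han"
    print(script_display_name("cyrl"))  # -> "cyrl" (no entry yet)
```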
Certainly, though perhaps by introducing something even more demanding of micromanagement!
Not that I'm aware of, though it'd definitely be best to go on a bit of a hunt for them before trying out this approach.
This adds an (incomplete) Min Nan extraction function in the hopes that it will nudge us toward developing solutions to #259 and #329. This extraction function currently only targets entries from the Hokkien 'dialect' (just because it seemed the most prevalent), and a user can then specify 'subdialects' of Hokkien with `--dialect`. I've added some data with the (sub)dialect set as Xiamen.

To improve and expand the coverage of this extraction function we need to settle on a solution to #329. One solution might be to add a `--subdialect` option and have `--dialect` be used for Hokkien/Teochew and `--subdialect` for the nested dialects like Xiamen/Taipei (so for Portuguese we could set `--dialect` as Brazil and `--subdialect` as Paulista/South Brazil); see the sketch below. This would be easy to implement but would probably be confusing to users. It would also require users (or people running the big scrape) to scrape the same language several separate times if they wanted data from all the dialects and subdialects of that language in separate TSVs.
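To make the two-flag idea concrete, here is a minimal sketch assuming argparse-style option handling; `keep_entry` and the example dialect labels are hypothetical, not existing wikipron code:

```python
import argparse

# Hypothetical CLI sketch of the two-flag proposal; wikipron's real
# argument handling is more involved than this.
parser = argparse.ArgumentParser(prog="wikipron")
parser.add_argument("key", help="language key, e.g. 'nan' or 'por'")
parser.add_argument("--dialect", help="top-level dialect, e.g. 'Hokkien' or 'Brazil'")
parser.add_argument("--subdialect", help="nested dialect, e.g. 'Xiamen' or 'Paulista'")


def keep_entry(entry_dialects: set, args: argparse.Namespace) -> bool:
    """Keep an entry only if it matches both requested levels.

    `entry_dialects` is the set of dialect labels parsed from the
    entry's pronunciation section.
    """
    if args.dialect and args.dialect not in entry_dialects:
        return False
    if args.subdialect and args.subdialect not in entry_dialects:
        return False
    return True


# Example: `wikipron nan --dialect Hokkien --subdialect Xiamen`
args = parser.parse_args(["nan", "--dialect", "Hokkien", "--subdialect", "Xiamen"])
print(keep_entry({"Hokkien", "Xiamen"}, args))  # True
print(keep_entry({"Hokkien", "Taipei"}, args))  # False
```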
An alternative solution would be to revamp our dialects system: if a user runs something like `wikipron nan --dialects`, we run through the language once and write as many TSVs as there are dialects/subdialects in that language (easier said than done). If a user runs `wikipron nan`, we run through the language once and they get one TSV containing all the entries from all the dialects/subdialects. There are different ways of doing this, but the coolest would be to 'discover' the dialects/subdialects as we scrape a language and automatically write them to different TSVs; a sketch of that idea follows.
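Here is a minimal sketch of the 'discover as we scrape' idea, assuming entries arrive as (word, pronunciation, dialect) triples; the function name and the entry shape are made up for illustration:

```python
import csv
from typing import Iterable, Tuple

# A rough sketch of "discover as we scrape": one pass over the
# language, lazily opening a new TSV the first time each dialect
# label is seen.


def write_per_dialect_tsvs(
    entries: Iterable[Tuple[str, str, str]], prefix: str
) -> None:
    """Route (word, pron, dialect) triples into one TSV per dialect."""
    files = {}  # dialect label -> open file handle
    writers = {}  # dialect label -> csv writer for that file
    try:
        for word, pron, dialect in entries:
            if dialect not in writers:
                # First time we see this dialect: open its own TSV.
                handle = open(
                    f"{prefix}_{dialect.lower()}.tsv",
                    "w",
                    encoding="utf-8",
                    newline="",
                )
                files[dialect] = handle
                writers[dialect] = csv.writer(handle, delimiter="\t")
            writers[dialect].writerow((word, pron))
    finally:
        for handle in files.values():
            handle.close()


# Example: three scraped entries spread across two Hokkien subdialects
# produce nan_hokkien_xiamen.tsv and nan_hokkien_taipei.tsv.
write_per_dialect_tsvs(
    [
        ("媽媽", "má-má", "Xiamen"),
        ("爸爸", "pa-pa", "Taipei"),
        ("水", "chúi", "Xiamen"),
    ],
    "nan_hokkien",
)
```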