Experimental Min Nan extraction function #397

Merged
merged 8 commits into from
Mar 27, 2021
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -51,6 +51,7 @@ Unreleased
#### Added

- Added test of phones list generation in `test_data/test_summary.py` (\#363)
+- Added Min Nan extraction function. (\#397)

[1.2.0] - 2021-01-30
--------------------
1 change: 1 addition & 0 deletions data/scrape/README.md
@@ -198,6 +198,7 @@
| [TSV](tsv/okm_hang_phonemic.tsv) | okm | Middle Korean (10th-16th cent.) | Middle Korean | Hangul | | False | Phonemic | False | 334 |
| [TSV](tsv/gml_latn_phonemic.tsv) | gml | Middle Low German | Middle Low German | Latin | | False | Phonemic | True | 170 |
| [TSV](tsv/wlm_latn_phonemic.tsv) | wlm | Middle Welsh | Middle Welsh | Latin | | False | Phonemic | True | 144 |
+| [TSV](tsv/nan_hani_xi_phonemic.tsv) | nan | Min Nan Chinese | Min Nan | Han | Xiamen | False | Phonemic | True | 44,588 |
| [TSV](tsv/mdf_cyrl_phonemic.tsv) | mdf | Moksha | Moksha | Cyrillic | | False | Phonemic | True | 117 |
| [TSV](tsv/mnw_mymr_phonemic.tsv) | mnw | Mon | Mon | Myanmar | | False | Phonemic | False | 514 |
| [TSV](tsv/mon_cyrl_phonemic.tsv) | mon | Mongolian | Mongolian | Cyrillic | | False | Phonemic | True | 1,166 |
16 changes: 8 additions & 8 deletions data/scrape/lib/README.md
@@ -1,19 +1,19 @@
"Big scrape" scripts
====================

-[`scrape.py`](scrape.py) calls WikiPron's scraping functions on all Wiktionary
+[`scrape.py`](../scrape.py) calls WikiPron's scraping functions on all Wiktionary
languages with over 100 entries. If a `.phones` file is present for a given
language, the process will generate an additional filtered file only containing
the permitted phones/phonemes.
[`generate_tsv_summary.py`](generate_tsv_summary.py) generates a
[README](../README.md) and a [TSV](../tsv_summary.tsv) with selected information
regarding the contents of the TSVs [`scrape.py`](scrape.py) generated and the
-configuration settings that were passed to scrape. [`postprocess`](postprocess)
+configuration settings that were passed to scrape. [`postprocess`](../postprocess)
sorts and removes entries in each TSV if they have the same graphemic form and
phonetic/phonemic form as a previous entry. In addition it splits TSVs
containing multiple scripts (Arabic, Cyrillic, etc.) into constituent TSVs
containing a single script. [`languages.json`](languages.json) provides
-[`scrape.py`](scrape.py) with a dictionary containing the information it needs
+[`scrape.py`](../scrape.py) with a dictionary containing the information it needs
to call scrape on all Wiktionary languages with over 100 entries and is also
used to generate the previously mentioned [README](../README.md).
[`codes.py`](codes.py) is used to generate [`languages.json`](languages.json).
@@ -37,26 +37,26 @@ Steps used to update the dataset
[`languages.json`](languages.json).
- To find new languages you can run `git diff languages.json` or search
for `null` values within [`languages.json`](languages.json).
-2. Run [`scrape.py`](scrape.py).
+2. Run [`scrape.py`](../scrape.py).
- By default `cut_off_date` in `main()` is set using
`datetime.date.today().isoformat()` but can be set manually using an ISO
formatted string (ex. "2019-10-31").
-3. Run [`postprocess`](postprocess).
+3. Run [`postprocess`](../postprocess).
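
The `cut_off_date` behavior described in step 2 can be sketched as follows. This is a minimal illustration, not the code in `scrape.py`: the helper name `resolve_cut_off_date` and its `manual` parameter are assumptions made for the example.

```python
import datetime


def resolve_cut_off_date(manual=None):
    """Return the cut-off date as an ISO-formatted string (YYYY-MM-DD).

    Hypothetical helper for illustration; not part of scrape.py.
    """
    if manual is None:
        # Default described in the README: today's date in ISO format.
        return datetime.date.today().isoformat()
    # Round-trip through date.fromisoformat so that a malformed string
    # raises ValueError instead of being passed on silently.
    return datetime.date.fromisoformat(manual).isoformat()


print(resolve_cut_off_date())              # today's date
print(resolve_cut_off_date("2019-10-31"))  # 2019-10-31
```

Validating the manual value up front is a design choice of this sketch; it surfaces a mistyped date before any scraping starts.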

Running a subset of languages using the big scrape
--------------------------------------------------

The following steps can be used to run the big scrape procedure for a subset:

-1. Run [`scrape.py`](scrape.py) with `--restriction` flag, followed by command
+1. Run [`scrape.py`](../scrape.py) with `--restriction` flag, followed by command
line arguments for desired languages. Note: languages must be in their ISO
designation and argument string must delineate with comma, semicolon, or
space. E.g. To target only Lithuanian and Spanish:
`./scrape.py --restriction='lit; spa'`
-2. If `cut_off_date` in [`scrape.py`](scrape.py) was set using
+2. If `cut_off_date` in [`scrape.py`](../scrape.py) was set using
`datetime.date.today().isoformat()` and it is important that all the data
you scrape is from before the same date, then manually set `cut_off_date` in
`main()` (using an ISO formatted string) to the date of the original big
scrape run - which can be found in the messages logged to the console or in
`scraping.log`.
-3. Run [`postprocess`](postprocess).
+3. Run [`postprocess`](../postprocess).
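
The delimiter rule in step 1 (comma, semicolon, or space) can be illustrated with a short sketch. The function name `parse_restriction` is hypothetical, and this is one plausible way to split such a string, not necessarily how `scrape.py` does it.

```python
import re


def parse_restriction(restriction):
    """Split a --restriction value into a set of language codes.

    Hypothetical helper for illustration; not part of scrape.py.
    """
    # Treat any run of commas, semicolons, or whitespace as one delimiter,
    # and drop empty strings left by leading or trailing delimiters.
    codes = re.split(r"[;,\s]+", restriction.strip())
    return frozenset(code for code in codes if code)


print(sorted(parse_restriction("lit; spa")))  # ['lit', 'spa']
```

Returning a set makes the later membership check against each language code independent of the order in which the codes were typed.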
12 changes: 11 additions & 1 deletion data/scrape/lib/languages.json
@@ -1263,7 +1263,17 @@
"iso639_name": "Min Nan Chinese",
"wiktionary_name": "Min Nan",
"wiktionary_code": "nan",
-"casefold": true
+"casefold": true,
+"skip_spaces_pron": false,
+"dialect": {
+"xi": "Xiamen"
+},
+"script": {
+"zyyy": "Common",
+"latn": "Latin",
+"hira": "Hiragana",
+"hani": "Han"
Comment on lines +1272 to +1275
Collaborator

I need a quick reminder -- where do these come from again?

Collaborator Author

From our new languages_update.py postprocessing step.

+}
},
"mvi": {
"iso639_name": "Miyako",
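
The new `dialect` and `script` keys line up with the TSV file name added to the summary table (`nan_hani_xi_phonemic.tsv`). The sketch below shows how such a name could be assembled from those keys. The naming pattern is inferred from the table rows in this diff, and `tsv_name` is a hypothetical helper, not a function in the repository.

```python
def tsv_name(code, script, level, dialect=None):
    """Build a TSV file name from language-config keys.

    Hypothetical helper; the pattern code_script[_dialect]_level.tsv is
    inferred from file names such as nan_hani_xi_phonemic.tsv.
    """
    parts = [code, script]
    if dialect:
        # The dialect key (e.g. "xi" for Xiamen) is optional.
        parts.append(dialect)
    parts.append(level)
    return "_".join(parts) + ".tsv"


print(tsv_name("nan", "hani", "phonemic", dialect="xi"))
print(tsv_name("okm", "hang", "phonemic"))
```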