Experimental Min Nan extraction function (#397)
* somewhat acceptable Min Nan extractor

* raw nan scrape before internet cut out

* raw nan scrape

* nan postprocessing, big scrape readme fixes

* updates tests

* cleanup test_scrape

* updates changelog
lfashby authored Mar 27, 2021
1 parent e10c68c commit db093e6
Showing 10 changed files with 44,688 additions and 14 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -51,6 +51,7 @@ Unreleased
#### Added

- Added test of phones list generation in `test_data/test_summary.py` (\#363)
+ - Added Min Nan extraction function. (\#397)

[1.2.0] - 2021-01-30
--------------------
1 change: 1 addition & 0 deletions data/scrape/README.md
@@ -198,6 +198,7 @@
| [TSV](tsv/okm_hang_phonemic.tsv) | okm | Middle Korean (10th-16th cent.) | Middle Korean | Hangul | | False | Phonemic | False | 334 |
| [TSV](tsv/gml_latn_phonemic.tsv) | gml | Middle Low German | Middle Low German | Latin | | False | Phonemic | True | 170 |
| [TSV](tsv/wlm_latn_phonemic.tsv) | wlm | Middle Welsh | Middle Welsh | Latin | | False | Phonemic | True | 144 |
+ | [TSV](tsv/nan_hani_xi_phonemic.tsv) | nan | Min Nan Chinese | Min Nan | Han | Xiamen | False | Phonemic | True | 44,588 |
| [TSV](tsv/mdf_cyrl_phonemic.tsv) | mdf | Moksha | Moksha | Cyrillic | | False | Phonemic | True | 117 |
| [TSV](tsv/mnw_mymr_phonemic.tsv) | mnw | Mon | Mon | Myanmar | | False | Phonemic | False | 514 |
| [TSV](tsv/mon_cyrl_phonemic.tsv) | mon | Mongolian | Mongolian | Cyrillic | | False | Phonemic | True | 1,166 |
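For orientation, here is a minimal sketch of reading the newly added Min Nan data. It assumes the two-column layout used by the scrape TSVs (graphemic form, tab, phonemic pronunciation) and a path relative to `data/scrape/`; both are assumptions for illustration, not part of this commit.

```python
import csv

# Minimal sketch: iterate over the scraped Min Nan (Xiamen, Han script) TSV,
# assuming two tab-separated columns: grapheme and phonemic pronunciation.
with open("tsv/nan_hani_xi_phonemic.tsv", encoding="utf-8") as source:
    for grapheme, pron in csv.reader(source, delimiter="\t"):
        print(grapheme, pron, sep="\t")
```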
16 changes: 8 additions & 8 deletions data/scrape/lib/README.md
@@ -1,19 +1,19 @@
"Big scrape" scripts
====================

- [`scrape.py`](scrape.py) calls WikiPron's scraping functions on all Wiktionary
+ [`scrape.py`](../scrape.py) calls WikiPron's scraping functions on all Wiktionary
languages with over 100 entries. If a `.phones` file is present for a given
language, the process will generate an additional filtered file only containing
the permitted phones/phonemes.
[`generate_tsv_summary.py`](generate_tsv_summary.py) generates a
[README](../README.md) and a [TSV](../tsv_summary.tsv) with selected information
regarding the contents of the TSVs [`scrape.py`](scrape.py) generated and the
- configuration settings that were passed to scrape. [`postprocess`](postprocess)
+ configuration settings that were passed to scrape. [`postprocess`](../postprocess)
sorts and removes entries in each TSV if they have the same graphemic form and
phonetic/phonemic form as a previous entry. In addition it splits TSVs
containing multiple scripts (Arabic, Cyrillic, etc.) into constituent TSVs
containing a single script. [`languages.json`](languages.json) provides
- [`scrape.py`](scrape.py) with a dictionary containing the information it needs
+ [`scrape.py`](../scrape.py) with a dictionary containing the information it needs
to call scrape on all Wiktionary languages with over 100 entries and is also
used to generate the previously mentioned [README](../README.md).
[`codes.py`](codes.py) is used to generate [`languages.json`](languages.json).
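The deduplication that the `postprocess` step performs is described above only in prose. Below is a minimal sketch of that idea, not the actual script: it assumes two-column TSV input and keeps a single copy of each graphemic/phonemic pair, writing the survivors in sorted order.

```python
import csv

def dedupe_tsv(in_path: str, out_path: str) -> None:
    # Drop rows whose (grapheme, pronunciation) pair has already been seen,
    # then write the remaining rows back out sorted.
    seen = set()
    with open(in_path, encoding="utf-8") as source:
        for row in csv.reader(source, delimiter="\t"):
            seen.add(tuple(row))
    with open(out_path, "w", encoding="utf-8", newline="") as sink:
        csv.writer(sink, delimiter="\t").writerows(sorted(seen))
```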
@@ -37,26 +37,26 @@ Steps used to update the dataset
[`languages.json`](languages.json).
- To find new languages you can run `git diff languages.json` or search
for `null` values within [`languages.json`](languages.json).
- 2. Run [`scrape.py`](scrape.py).
+ 2. Run [`scrape.py`](../scrape.py).
- By default `cut_off_date` in `main()` is set using
`datetime.date.today().isoformat()` but can be set manually using an ISO
formatted string (ex. "2019-10-31").
- 3. Run [`postprocess`](postprocess).
+ 3. Run [`postprocess`](../postprocess).

Running a subset of languages using the big scrape
--------------------------------------------------

The following steps can be used to run the big scrape procedure for a subset:

- 1. Run [`scrape.py`](scrape.py) with `--restriction` flag, followed by command
+ 1. Run [`scrape.py`](../scrape.py) with `--restriction` flag, followed by command
line arguments for desired languages. Note: languages must be in their ISO
designation and argument string must delineate with comma, semicolon, or
space. E.g. To target only Lithuanian and Spanish:
`./scrape.py --restriction='lit; spa'`
- 2. If `cut_off_date` in [`scrape.py`](scrape.py) was set using
+ 2. If `cut_off_date` in [`scrape.py`](../scrape.py) was set using
`datetime.date.today().isoformat()` and it is important that all the data
you scrape is from before the same date, then manually set `cut_off_date` in
`main()` (using an ISO formatted string) to the date of the original big
scrape run - which can be found in the messages logged to the console or in
`scraping.log`.
- 3. Run [`postprocess`](postprocess).
+ 3. Run [`postprocess`](../postprocess).
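The `cut_off_date` handling described in step 2 of both lists amounts to the following minimal sketch; the variable name comes from the README text above, and its exact placement inside `main()` is an assumption.

```python
import datetime

# Default in main(): today's date as an ISO-formatted string.
cut_off_date = datetime.date.today().isoformat()  # e.g. "2021-03-27"
# To reproduce an earlier big scrape, pin it manually to that run's date:
# cut_off_date = "2019-10-31"
```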
12 changes: 11 additions & 1 deletion data/scrape/lib/languages.json
@@ -1263,7 +1263,17 @@
"iso639_name": "Min Nan Chinese",
"wiktionary_name": "Min Nan",
"wiktionary_code": "nan",
"casefold": true
"casefold": true,
"skip_spaces_pron": false,
"dialect": {
"xi": "Xiamen"
},
"script": {
"zyyy": "Common",
"latn": "Latin",
"hira": "Hiragana",
"hani": "Han"
}
},
"mvi": {
"iso639_name": "Miyako",
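The new `nan` entry above is what lets the scrape fan out over dialects and scripts. Below is a minimal sketch of reading that entry; the `nan_<script>_<dialect>_phonemic.tsv` naming is an assumption modeled on the `nan_hani_xi_phonemic.tsv` file added in this commit, not taken from `scrape.py` itself.

```python
import json

# Sketch: list the dialect/script combinations declared for Min Nan.
with open("languages.json", encoding="utf-8") as source:
    languages = json.load(source)

nan = languages["nan"]
for script in nan["script"]:        # zyyy, latn, hira, hani
    for dialect in nan["dialect"]:  # xi
        # Assumed output naming, modeled on nan_hani_xi_phonemic.tsv.
        print(f"nan_{script}_{dialect}_phonemic.tsv")
```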
