Experimental Min Nan extraction function (#397)
* somewhat acceptable Min Nan extractor

* raw nan scrape before internet cut out

* raw nan scrape

* nan postprocessing, big scrape readme fixes

* updates tests

* cleanup test_scrape

* updates changelog
lfashby authored Mar 27, 2021
1 parent e10c68c commit db093e6
Showing 10 changed files with 44,688 additions and 14 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -51,6 +51,7 @@ Unreleased
#### Added

- Added test of phones list generation in `test_data/test_summary.py` (\#363)
+ - Added Min Nan extraction function. (\#397)

[1.2.0] - 2021-01-30
--------------------
1 change: 1 addition & 0 deletions data/scrape/README.md
@@ -198,6 +198,7 @@
| [TSV](tsv/okm_hang_phonemic.tsv) | okm | Middle Korean (10th-16th cent.) | Middle Korean | Hangul | | False | Phonemic | False | 334 |
| [TSV](tsv/gml_latn_phonemic.tsv) | gml | Middle Low German | Middle Low German | Latin | | False | Phonemic | True | 170 |
| [TSV](tsv/wlm_latn_phonemic.tsv) | wlm | Middle Welsh | Middle Welsh | Latin | | False | Phonemic | True | 144 |
+ | [TSV](tsv/nan_hani_xi_phonemic.tsv) | nan | Min Nan Chinese | Min Nan | Han | Xiamen | False | Phonemic | True | 44,588 |
| [TSV](tsv/mdf_cyrl_phonemic.tsv) | mdf | Moksha | Moksha | Cyrillic | | False | Phonemic | True | 117 |
| [TSV](tsv/mnw_mymr_phonemic.tsv) | mnw | Mon | Mon | Myanmar | | False | Phonemic | False | 514 |
| [TSV](tsv/mon_cyrl_phonemic.tsv) | mon | Mongolian | Mongolian | Cyrillic | | False | Phonemic | True | 1,166 |
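For orientation, here is a minimal sketch of reading the newly added Min Nan data. It assumes the two-column layout used by the scrape TSVs (graphemic form, tab, phonemic pronunciation) and a path relative to `data/scrape/`; both are assumptions for illustration, not part of this commit.

```python
import csv

# Minimal sketch: iterate over the scraped Min Nan (Xiamen, Han script) TSV,
# assuming two tab-separated columns: grapheme and phonemic pronunciation.
with open("tsv/nan_hani_xi_phonemic.tsv", encoding="utf-8") as source:
    for grapheme, pron in csv.reader(source, delimiter="\t"):
        print(grapheme, pron, sep="\t")
```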
16 changes: 8 additions & 8 deletions data/scrape/lib/README.md
@@ -1,19 +1,19 @@
"Big scrape" scripts
====================

- [`scrape.py`](scrape.py) calls WikiPron's scraping functions on all Wiktionary
+ [`scrape.py`](../scrape.py) calls WikiPron's scraping functions on all Wiktionary
languages with over 100 entries. If a `.phones` file is present for a given
language, the process will generate an additional filtered file only containing
the permitted phones/phonemes.
[`generate_tsv_summary.py`](generate_tsv_summary.py) generates a
[README](../README.md) and a [TSV](../tsv_summary.tsv) with selected information
regarding the contents of the TSVs [`scrape.py`](scrape.py) generated and the
- configuration settings that were passed to scrape. [`postprocess`](postprocess)
+ configuration settings that were passed to scrape. [`postprocess`](../postprocess)
sorts and removes entries in each TSV if they have the same graphemic form and
phonetic/phonemic form as a previous entry. In addition it splits TSVs
containing multiple scripts (Arabic, Cyrillic, etc.) into constituent TSVs
containing a single script. [`languages.json`](languages.json) provides
- [`scrape.py`](scrape.py) with a dictionary containing the information it needs
+ [`scrape.py`](../scrape.py) with a dictionary containing the information it needs
to call scrape on all Wiktionary languages with over 100 entries and is also
used to generate the previously mentioned [README](../README.md).
[`codes.py`](codes.py) is used to generate [`languages.json`](languages.json).
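The deduplication that the `postprocess` step performs is described above only in prose. Below is a minimal sketch of that idea, not the actual script: it assumes two-column TSV input and keeps a single copy of each graphemic/phonemic pair, writing the survivors in sorted order.

```python
import csv

def dedupe_tsv(in_path: str, out_path: str) -> None:
    # Drop rows whose (grapheme, pronunciation) pair has already been seen,
    # then write the remaining rows back out sorted.
    seen = set()
    with open(in_path, encoding="utf-8") as source:
        for row in csv.reader(source, delimiter="\t"):
            seen.add(tuple(row))
    with open(out_path, "w", encoding="utf-8", newline="") as sink:
        csv.writer(sink, delimiter="\t").writerows(sorted(seen))
```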
@@ -37,26 +37,26 @@ Steps used to update the dataset
[`languages.json`](languages.json).
- To find new languages you can run `git diff languages.json` or search
for `null` values within [`languages.json`](languages.json).
- 2. Run [`scrape.py`](scrape.py).
+ 2. Run [`scrape.py`](../scrape.py).
- By default `cut_off_date` in `main()` is set using
`datetime.date.today().isoformat()` but can be set manually using an ISO
formatted string (ex. "2019-10-31").
- 3. Run [`postprocess`](postprocess).
+ 3. Run [`postprocess`](../postprocess).

Running a subset of languages using the big scrape
--------------------------------------------------

The following steps can be used to run the big scrape procedure for a subset:

- 1. Run [`scrape.py`](scrape.py) with `--restriction` flag, followed by command
+ 1. Run [`scrape.py`](../scrape.py) with `--restriction` flag, followed by command
line arguments for desired languages. Note: languages must be in their ISO
designation and argument string must delineate with comma, semicolon, or
space. E.g. To target only Lithuanian and Spanish:
`./scrape.py --restriction='lit; spa'`
- 2. If `cut_off_date` in [`scrape.py`](scrape.py) was set using
+ 2. If `cut_off_date` in [`scrape.py`](../scrape.py) was set using
`datetime.date.today().isoformat()` and it is important that all the data
you scrape is from before the same date, then manually set `cut_off_date` in
`main()` (using an ISO formatted string) to the date of the original big
scrape run - which can be found in the messages logged to the console or in
`scraping.log`.
- 3. Run [`postprocess`](postprocess).
+ 3. Run [`postprocess`](../postprocess).
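The `cut_off_date` handling described in step 2 of both lists amounts to the following minimal sketch; the variable name comes from the README text above, and its exact placement inside `main()` is an assumption.

```python
import datetime

# Default in main(): today's date as an ISO-formatted string.
cut_off_date = datetime.date.today().isoformat()  # e.g. "2021-03-27"
# To reproduce an earlier big scrape, pin it manually to that run's date:
# cut_off_date = "2019-10-31"
```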
12 changes: 11 additions & 1 deletion data/scrape/lib/languages.json
@@ -1263,7 +1263,17 @@
"iso639_name": "Min Nan Chinese",
"wiktionary_name": "Min Nan",
"wiktionary_code": "nan",
"casefold": true
"casefold": true,
"skip_spaces_pron": false,
"dialect": {
"xi": "Xiamen"
},
"script": {
"zyyy": "Common",
"latn": "Latin",
"hira": "Hiragana",
"hani": "Han"
}
},
"mvi": {
"iso639_name": "Miyako",
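The new `nan` entry above is what lets the scrape fan out over dialects and scripts. Below is a minimal sketch of reading that entry; the `nan_<script>_<dialect>_phonemic.tsv` naming is an assumption modeled on the `nan_hani_xi_phonemic.tsv` file added in this commit, not taken from `scrape.py` itself.

```python
import json

# Sketch: list the dialect/script combinations declared for Min Nan.
with open("languages.json", encoding="utf-8") as source:
    languages = json.load(source)

nan = languages["nan"]
for script in nan["script"]:        # zyyy, latn, hira, hani
    for dialect in nan["dialect"]:  # xi
        # Assumed output naming, modeled on nan_hani_xi_phonemic.tsv.
        print(f"nan_{script}_{dialect}_phonemic.tsv")
```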
