Reorganization of phones (#395)
* Adds subdirectories 'lib' and 'phones' to 'data/phones'. Moves '.phones' files into 'data/phones/phones'. Moves relevant scripts from 'data/src' into 'data/phones/lib'. Updates paths in 'data/scrape/lib/codes.py'

* Updates 'data/phones/HOWTO.md' to reflect new locations of files

* Fixes file path in 'data/phones/HOWTO.md'

* Updates outdated components of 'data/phones/HOWTO.md'

* Updates path to phones in 'tests/test_data/test_summary.py'

* Updates changelog
ajmalanoski authored Mar 26, 2021
1 parent 42d362a commit e10c68c
Showing 45 changed files with 23 additions and 14 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -44,6 +44,7 @@ Unreleased
 - Made summary generation in `common_characters.py` optional. (\#382)
 - Fixed phone counting in `data/src/generate_phones_summary.py` (\#390, \#392)
 - Reorganizes scraping scripts under `data/scrape` (\#394)
+- Reorganizes `.phones` files and related scripts under `data/phones` (\#395)

 ### Under `wikipron/` and elsewhere

22 changes: 12 additions & 10 deletions data/phones/HOWTO.md
@@ -36,8 +36,8 @@ We welcome user submissions for `.phones` files from linguists. Note that we use
 the [fork and pull](../../CONTRIBUTING.md) model for contributions.

 1. Make a list of all phones or phonemes, in descending-frequency order, using
-the appropriate file in [`../tsv`](../tsv). The script
-[`list_phones.py`](../src/list_phones.py) is available to facilitate this
+the appropriate file in [`../scrape/tsv`](../scrape/tsv). The script
+[`list_phones.py`](lib/list_phones.py) is available to facilitate this
 step. Running `./list_phones.py ../tsv/<some-TSV-file> > foo.phones`
 generates `foo.phones` that you can edit by the following steps.
 2. Remove typos, invalid IPA transcriptions, and non-native segments. The
@@ -47,21 +47,23 @@ the [fork and pull](../../CONTRIBUTING.md) model for contributions.
 phones/phonemes to remove. For the phones or phonemes to retain, remove the
 comments of counts and example word-pronunciation pairs.
 3. For a phonemic list, add comments about allophony.
-4. In [`../src/`](../src) run
-```./scrape.py --restriction=<your-lang> && ./postprocess && ./generate_tsv_summary.py && ./generate_phones_summary.py```
+4. In [`../scrape`](../scrape) run
+```./scrape.py --restriction=<your-lang> && ./postprocess```
 This may take a while.
-5. Add the `.phones` file, the filtered `.tsv` file(s), and the summary files
+5. In [`../scrape/lib`](../scrape/lib) run `./generate_tsv_summary.py`.
+6. In [`lib`](lib) run `./generate_phones_summary.py`.
+7. Add the `.phones` file, the filtered `.tsv` file(s), and the summary files
 using `git add`. The `.phones` file must use the [NFC Unicode
 normalization](https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization).
 If you used `../src/list_phones.py` to create the `.phones` file, then it
-should be in this form already. Otherwise, in `../src/`, you can run
+should be in this form already. Otherwise, in [`lib`](lib), you can run
 `./normalize.py <your-file> NFC` to put your file in the correct form.
-6. Commit using `git commit`, push to your branch using `git push`, and then
+8. Commit using `git commit`, push to your branch using `git push`, and then
 file a pull request.

 The `.phones` file format is a UTF-8 encoded file with one segment per line,
 with optional comments formatted as two spaces, `#`, one space, and then a
 sentence or sentence fragment with appropriate punctuation (e.g.,
-`tʰ # Allophone of /t/.`). Please do not leave any trailing whitespace. The
-`.phones` file should have the same name as the corresponding TSV file, but with
-a `.phones` extension instead of `.tsv`.
+`tʰ # Allophone of /t/.`). Please include a blank line at the end of the file.
+The `.phones` file should have the same name as the corresponding TSV file, but
+with a `.phones` extension instead of `.tsv`.
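
For readers who want to sanity-check a `.phones` file against the format described in the HOWTO, the following is a minimal sketch and not part of this commit. The function name `check_phones_text` and the regular expression are hypothetical; the sketch assumes that a segment contains no whitespace and that a comment, when present, follows exactly two spaces, `#`, and one space. It uses only the standard library's `unicodedata.normalize` for the NFC check.

```python
import re
import unicodedata

# Assumed line shape: a whitespace-free segment, optionally followed by
# two spaces, "#", one space, and a non-empty comment.
_LINE = re.compile(r"^(?P<segment>\S+)(?:  # (?P<comment>\S.*))?$")


def check_phones_text(text: str) -> list:
    """Returns a list of problems found in the contents of a `.phones` file."""
    problems = []
    # NFC-normalized text is unchanged by a further NFC normalization.
    if unicodedata.normalize("NFC", text) != text:
        problems.append("not NFC-normalized")
    for number, line in enumerate(text.splitlines(), start=1):
        if not line:
            continue  # Blank lines are tolerated in this sketch.
        if _LINE.match(line) is None:
            problems.append(f"line {number}: unexpected format: {line!r}")
    return problems
```

A possible use is `check_phones_text(open("foo.phones", encoding="utf-8").read())`, where an empty list means no problems were detected under the assumptions above.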
41 files renamed without changes.
12 changes: 9 additions & 3 deletions data/scrape/lib/codes.py
@@ -51,9 +51,15 @@
 LOGGING_PATH = os.path.join(SCRAPE_DIRECTORY, "scraping.log")
 README_PATH = os.path.join(SCRAPE_DIRECTORY, "README.md")
 TSV_DIRECTORY = os.path.join(SCRAPE_DIRECTORY, "tsv")
-PHONES_DIRECTORY = os.path.join(os.path.dirname(SCRAPE_DIRECTORY), "phones")
-PHONES_README_PATH = os.path.join(PHONES_DIRECTORY, "README.md")
-PHONES_SUMMARY_PATH = os.path.join(PHONES_DIRECTORY, "phones_summary.tsv")
+PHONES_DIRECTORY = os.path.join(
+    os.path.dirname(SCRAPE_DIRECTORY), "phones/phones"
+)
+PHONES_README_PATH = os.path.join(
+    os.path.dirname(PHONES_DIRECTORY), "README.md"
+)
+PHONES_SUMMARY_PATH = os.path.join(
+    os.path.dirname(PHONES_DIRECTORY), "phones_summary.tsv"
+)
 URL = "https://en.wiktionary.org/w/api.php"


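
To make the effect of the new constants concrete, here is a small sketch of how they resolve. The value of `SCRAPE_DIRECTORY` below is a hypothetical stand-in, since its real definition lies outside this hunk, and `posixpath` is used instead of `os.path` only so that the result is the same on every platform.

```python
import posixpath

# Hypothetical stand-in; the real SCRAPE_DIRECTORY is defined elsewhere in codes.py.
SCRAPE_DIRECTORY = "/repo/data/scrape"

# The `.phones` files now live one level deeper, in `data/phones/phones`.
PHONES_DIRECTORY = posixpath.join(
    posixpath.dirname(SCRAPE_DIRECTORY), "phones/phones"
)
# The README and the summary stay in the parent directory, `data/phones`.
PHONES_README_PATH = posixpath.join(
    posixpath.dirname(PHONES_DIRECTORY), "README.md"
)
PHONES_SUMMARY_PATH = posixpath.join(
    posixpath.dirname(PHONES_DIRECTORY), "phones_summary.tsv"
)

print(PHONES_DIRECTORY)     # /repo/data/phones/phones
print(PHONES_README_PATH)   # /repo/data/phones/README.md
print(PHONES_SUMMARY_PATH)  # /repo/data/phones/phones_summary.tsv
```

Under this assumption the summary path is unchanged by the commit, which is consistent with `_PHONES_SUMMARY` in `tests/test_data/test_summary.py` keeping its old value while `_PHONES_DIRECTORY` gains the extra `phones` component.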
2 changes: 1 addition & 1 deletion tests/test_data/test_summary.py
@@ -6,7 +6,7 @@
 _TSV_SUMMARY = os.path.join(_REPO_DIR, "data/scrape/tsv_summary.tsv")
 _TSV_DIRECTORY = os.path.join(_REPO_DIR, "data/scrape/tsv")
 _PHONES_SUMMARY = os.path.join(_REPO_DIR, "data/phones/phones_summary.tsv")
-_PHONES_DIRECTORY = os.path.join(_REPO_DIR, "data/phones")
+_PHONES_DIRECTORY = os.path.join(_REPO_DIR, "data/phones/phones")


 def test_language_data_matches_summary():
