-
Notifications
You must be signed in to change notification settings - Fork 1
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
6723b86
commit 1b6b76e
Showing
1 changed file
with
171 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,171 @@ | ||
# Transliteration | ||
## Thai W2P | ||
|
||
**Model Details** | ||
|
||
- Developer: Wannaphong Phatthiyaphaibun | ||
- This report author: Wannaphong Phatthiyaphaibun | ||
- Model date: 2020-12-29 | ||
- Model version: 0.1 | ||
- Used in PyThaiNLP version: 2.3+ | ||
- Filename: `~/pythainlp-data/w2p_0.1.npy` | ||
- GitHub: [https://github.com/PyThaiNLP/pythainlp/pull/511](https://github.com/PyThaiNLP/pythainlp/pull/511) | ||
- License: CC0 | ||
- train notebook: [https://github.com/wannaphong/Thai_W2P/blob/main/train.ipynb](https://github.com/wannaphong/Thai_W2P/blob/main/train.ipynb) | ||
|
||
**Intended Use** | ||
|
||
- Converter thai word to thai phoneme | ||
- Not suitable for other language. | ||
|
||
**Factors** | ||
|
||
- Based on thai word to thai phoneme problems. | ||
|
||
**Metrics** | ||
|
||
- Evaluation metrics include phoneme error rate (number error / number phonemes) | ||
|
||
**Training Data** | ||
|
||
Thai W2P (80%) | ||
|
||
**Evaluation Data** | ||
|
||
Thai W2P (20%) | ||
|
||
**Quantitative Analyses** | ||
|
||
``` | ||
epoch: 100 | ||
step: 100, loss: 0.03179970383644104 | ||
step: 200, loss: 0.04126007482409477 | ||
step: 300, loss: 0.01877519115805626 | ||
step: 400, loss: 0.03311225399374962 | ||
per: 0.0432 | ||
per: 0.0419 | ||
``` | ||
|
||
**Ethical Considerations** | ||
|
||
This corpus is based on the website, such as wiktionary, Royal Institute et cetera and more. It may not be the dialect that you use in everyday life. | ||
|
||
**Caveats and Recommendations** | ||
|
||
- 1 Thai word only | ||
|
||
## Thai2Rom | ||
|
||
Thai romanization using LSTM encoder-decoder model with attention mechanism | ||
|
||
### v0.1 | ||
|
||
**Model Details** | ||
|
||
- Developer: Chakri Lowphansirikul | ||
- This report author: Wannaphong Phatthiyaphaibun | ||
- Model date: 2019-08-11 | ||
- Model version: 0.1 | ||
- Used in PyThaiNLP version: 2.1 + | ||
- Filename: `~/pythainlp-data/thai2rom-pytorch-attn-v0.1.tar` | ||
- GitHub: [https://github.com/PyThaiNLP/pythainlp/pull/246](https://github.com/PyThaiNLP/pythainlp/pull/246) | ||
- Train Notebook: [https://github.com/lalital/thai-romanization/blob/master/notebook/thai_romanize_pytorch_seq2seq_attention.ipynb](https://github.com/lalital/thai-romanization/blob/master/notebook/thai_romanize_pytorch_seq2seq_attention.ipynb) | ||
- LSTM Model | ||
- Dataset: [https://github.com/lalital/thai-romanization/blob/master/dataset/data.new](https://github.com/lalital/thai-romanization/blob/master/dataset/data.new) | ||
- License: CC0 | ||
|
||
**Intended Use** | ||
- conversion of thai text to the Roman. | ||
|
||
**Factors** | ||
- Based on known problems with thai natural Language processing. | ||
|
||
**Metrics** | ||
- Evaluation metrics include precision, recall and f1-score. | ||
|
||
**Training Data** | ||
Thai2Rom trainset | ||
|
||
**Evaluation Data** | ||
|
||
Thai2Rom testset | ||
|
||
**Quantitative Analyses** | ||
|
||
The model was evaluated with 3 metrics including F1-score, Exact match, Exact match at character level on the validation set (20% of the dataset or 129,642 examples). | ||
|
||
- F1 (macro-average): 0.987 | ||
- Exact match: 0.883 | ||
- Exact match (Character-level): 0.949 | ||
|
||
|
||
**Ethical Considerations** | ||
|
||
no ideas | ||
|
||
**Caveats and Recommendations** | ||
|
||
- Thai text only | ||
|
||
## Thai G2P | ||
|
||
Thai Grapheme-to-Phoneme (Thai G2P) based on Deep Learning (Seq2Seq model) | ||
|
||
### v0.1 | ||
|
||
**Model Details** | ||
|
||
- Developer: Wannaphong Phatthiyaphaibun | ||
- This report author: Wannaphong Phatthiyaphaibun | ||
- Model date: 2020-08-20 | ||
- Model version: 0.1 | ||
- Used in PyThaiNLP version: 2.2+ | ||
- Filename: `~/pythainlp-data/thaig2p-0.1.tar` | ||
- Pull request GitHub: [https://github.com/PyThaiNLP/pythainlp/pull/377](https://github.com/PyThaiNLP/pythainlp/pull/377) | ||
- GitHub: [https://github.com/wannaphong/thai-g2p](https://github.com/wannaphong/thai-g2p) | ||
- Train notebook: [https://github.com/wannaphong/thai-g2p/blob/master/train.ipynb](https://github.com/wannaphong/thai-g2p/blob/master/train.ipynb) | ||
- Dataset: [wiktionary-11-2-2020.tsv](https://github.com/wannaphong/thai-g2p/blob/master/wiktionary-11-2-2020.tsv) | ||
- Seq2Seq model | ||
- License: CC0 | ||
|
||
**Intended Use** | ||
|
||
Grapheme-to-Phoneme conversion tool. | ||
|
||
**Factors** | ||
|
||
- Based on thai grapheme-to-phoneme conversion problems. | ||
|
||
**Metrics** | ||
|
||
f1-score. | ||
|
||
**Training Data** | ||
|
||
wiktionary trainset | ||
|
||
**Evaluation Data** | ||
|
||
wiktionary testset | ||
|
||
**Quantitative Analyses** | ||
|
||
``` | ||
F1 (macro-average) = 0.9415941561267093 | ||
EM = 0.71 | ||
EM (Character-level) = 0.8660247630539959 | ||
save best model em score=0.71 at epoch=1148 | ||
Save model at epoch 1148 | ||
Epoch: 1149 | Time: 2m 55s | ||
Train Loss: 0.352 | Train PPL: 1.422 | ||
Val. Loss: 0.512 | Val. PPL: 1.669 | ||
epoch=1149, teacher_forcing_ratio=0.4 | ||
``` | ||
|
||
**Ethical Considerations** | ||
|
||
This model is based on the Thai wiktionary Dump (include bias from Thai wiktionary). | ||
|
||
**Caveats and Recommendations** | ||
|
||
- 1 Thai word only |