Skip to content

Commit

Permalink
Create transliteration.md
Browse files Browse the repository at this point in the history
  • Loading branch information
wannaphong authored Dec 19, 2023
1 parent 6723b86 commit 1b6b76e
Showing 1 changed file with 171 additions and 0 deletions.
171 changes: 171 additions & 0 deletions docs/transliteration.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,171 @@
# Transliteration
## Thai W2P

**Model Details**

- Developer: Wannaphong Phatthiyaphaibun
- This report author: Wannaphong Phatthiyaphaibun
- Model date: 2020-12-29
- Model version: 0.1
- Used in PyThaiNLP version: 2.3+
- Filename: `~/pythainlp-data/w2p_0.1.npy`
- GitHub: [https://github.com/PyThaiNLP/pythainlp/pull/511](https://github.com/PyThaiNLP/pythainlp/pull/511)
- License: CC0
- train notebook: [https://github.com/wannaphong/Thai_W2P/blob/main/train.ipynb](https://github.com/wannaphong/Thai_W2P/blob/main/train.ipynb)

**Intended Use**

- Converter thai word to thai phoneme
- Not suitable for other language.

**Factors**

- Based on thai word to thai phoneme problems.

**Metrics**

- Evaluation metrics include phoneme error rate (number error / number phonemes)

**Training Data**

Thai W2P (80%)

**Evaluation Data**

Thai W2P (20%)

**Quantitative Analyses**

```
epoch: 100
step: 100, loss: 0.03179970383644104
step: 200, loss: 0.04126007482409477
step: 300, loss: 0.01877519115805626
step: 400, loss: 0.03311225399374962
per: 0.0432
per: 0.0419
```

**Ethical Considerations**

This corpus is based on the website, such as wiktionary, Royal Institute et cetera and more. It may not be the dialect that you use in everyday life.

**Caveats and Recommendations**

- 1 Thai word only

## Thai2Rom

Thai romanization using LSTM encoder-decoder model with attention mechanism

### v0.1

**Model Details**

- Developer: Chakri Lowphansirikul
- This report author: Wannaphong Phatthiyaphaibun
- Model date: 2019-08-11
- Model version: 0.1
- Used in PyThaiNLP version: 2.1 +
- Filename: `~/pythainlp-data/thai2rom-pytorch-attn-v0.1.tar`
- GitHub: [https://github.com/PyThaiNLP/pythainlp/pull/246](https://github.com/PyThaiNLP/pythainlp/pull/246)
- Train Notebook: [https://github.com/lalital/thai-romanization/blob/master/notebook/thai_romanize_pytorch_seq2seq_attention.ipynb](https://github.com/lalital/thai-romanization/blob/master/notebook/thai_romanize_pytorch_seq2seq_attention.ipynb)
- LSTM Model
- Dataset: [https://github.com/lalital/thai-romanization/blob/master/dataset/data.new](https://github.com/lalital/thai-romanization/blob/master/dataset/data.new)
- License: CC0

**Intended Use**
- conversion of thai text to the Roman.

**Factors**
- Based on known problems with thai natural Language processing.

**Metrics**
- Evaluation metrics include precision, recall and f1-score.

**Training Data**
Thai2Rom trainset

**Evaluation Data**

Thai2Rom testset

**Quantitative Analyses**

The model was evaluated with 3 metrics including F1-score, Exact match, Exact match at character level on the validation set (20% of the dataset or 129,642 examples).

- F1 (macro-average): 0.987
- Exact match: 0.883
- Exact match (Character-level): 0.949


**Ethical Considerations**

no ideas

**Caveats and Recommendations**

- Thai text only

## Thai G2P

Thai Grapheme-to-Phoneme (Thai G2P) based on Deep Learning (Seq2Seq model)

### v0.1

**Model Details**

- Developer: Wannaphong Phatthiyaphaibun
- This report author: Wannaphong Phatthiyaphaibun
- Model date: 2020-08-20
- Model version: 0.1
- Used in PyThaiNLP version: 2.2+
- Filename: `~/pythainlp-data/thaig2p-0.1.tar`
- Pull request GitHub: [https://github.com/PyThaiNLP/pythainlp/pull/377](https://github.com/PyThaiNLP/pythainlp/pull/377)
- GitHub: [https://github.com/wannaphong/thai-g2p](https://github.com/wannaphong/thai-g2p)
- Train notebook: [https://github.com/wannaphong/thai-g2p/blob/master/train.ipynb](https://github.com/wannaphong/thai-g2p/blob/master/train.ipynb)
- Dataset: [wiktionary-11-2-2020.tsv](https://github.com/wannaphong/thai-g2p/blob/master/wiktionary-11-2-2020.tsv)
- Seq2Seq model
- License: CC0

**Intended Use**

Grapheme-to-Phoneme conversion tool.

**Factors**

- Based on thai grapheme-to-phoneme conversion problems.

**Metrics**

f1-score.

**Training Data**

wiktionary trainset

**Evaluation Data**

wiktionary testset

**Quantitative Analyses**

```
F1 (macro-average) = 0.9415941561267093
EM = 0.71
EM (Character-level) = 0.8660247630539959
save best model em score=0.71 at epoch=1148
Save model at epoch 1148
Epoch: 1149 | Time: 2m 55s
Train Loss: 0.352 | Train PPL: 1.422
Val. Loss: 0.512 | Val. PPL: 1.669
epoch=1149, teacher_forcing_ratio=0.4
```

**Ethical Considerations**

This model is based on the Thai wiktionary Dump (include bias from Thai wiktionary).

**Caveats and Recommendations**

- 1 Thai word only

0 comments on commit 1b6b76e

Please sign in to comment.