feat: Add test for tokenizer
soeque1 committed Feb 1, 2021
1 parent 41f3e30 commit 8f29633
Showing 3 changed files with 12 additions and 1 deletion.
2 changes: 1 addition & 1 deletion cfgs/pipelines/word_piece_with_morpheme.yaml
@@ -9,7 +9,7 @@ Path:


Pipelines:
- Tokenizer: WordPieceTokenizer()
+ Tokenizer: WordPieceTokenizer(unk_token='[UNK]')

normalizer: []

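For reference, `unk_token` here corresponds to the unknown-token argument of the WordPiece model in Hugging Face's `tokenizers` library, which the new test imports. A minimal sketch of the equivalent direct construction, assuming the yaml's `WordPieceTokenizer()` wraps that model (the project's actual wrapper may differ):

```python
# Hypothetical sketch: assumes the yaml's WordPieceTokenizer(unk_token='[UNK]')
# builds a Hugging Face tokenizers WordPiece model under the hood.
from tokenizers import Tokenizer
from tokenizers.models import WordPiece

# With unk_token set, out-of-vocabulary tokens are emitted as '[UNK]'
# instead of failing to encode.
tokenizer = Tokenizer(WordPiece(unk_token='[UNK]'))
```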
prepare.sh (empty file): mode changed 100644 → 100755 (made executable); contents unchanged.
11 changes: 11 additions & 0 deletions tests/test_tokenizer.py
@@ -0,0 +1,11 @@
from tokenizers import normalizers


def test_tokenizer(cfg):
    # Assemble the pipeline components from the config fixture.
    tokenizer = cfg['Pipelines']['Tokenizer']
    tokenizer.pre_tokenizer = cfg['Pipelines']['pre_tokenizer']
    tokenizer.normalizer = normalizers.Sequence(cfg['Pipelines']['normalizer'])
    tokenizer.decoder = cfg['Pipelines']['decoder']

    # Train on a single Korean sentence and check the WordPiece split:
    # '안녕' becomes a leading piece plus a '##'-prefixed continuation.
    tokenizer.train_from_iterator(['안녕하세요'])
    assert tokenizer.encode('안녕').tokens == ['안', '##녕']
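
The test depends on a `cfg` pytest fixture that is not part of this commit. A minimal hypothetical `conftest.py` sketch that would satisfy it, assuming the config maps directly to `tokenizers` objects (names and wiring are illustrative only):

```python
# conftest.py -- hypothetical fixture; the real project presumably loads
# cfgs/pipelines/word_piece_with_morpheme.yaml instead of hardcoding this.
import pytest
from tokenizers import Tokenizer, decoders, pre_tokenizers
from tokenizers.models import WordPiece


@pytest.fixture
def cfg():
    return {
        'Pipelines': {
            'Tokenizer': Tokenizer(WordPiece(unk_token='[UNK]')),
            'pre_tokenizer': pre_tokenizers.Whitespace(),
            'normalizer': [],  # matches the yaml's empty normalizer list
            'decoder': decoders.WordPiece(),
        }
    }
```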
