
PyThaiNLP v3.0.0-beta0

Pre-release
@wannaphong wannaphong released this 20 Jan 13:38
· 1460 commits to dev since this release
fae2bf6

PyThaiNLP 3.0 has many improvements and new features to help you with Thai language processing tasks. This release, PyThaiNLP v3.0.0-beta0, is the first beta release of PyThaiNLP 3.0.

You can install it with pip install pythainlp==3.0.0b0.

Documentation: https://pythainlp.github.io/dev-docs/index.html
Report bugs: https://github.com/PyThaiNLP/pythainlp/issues

See PyThaiNLP 3.0 change log #545

If you want to contribute to PyThaiNLP, you can read Contributing to PyThaiNLP.

News

Starting with PyThaiNLP 3.0, we no longer support Python 3.6. Python 3.6 users can use PyThaiNLP 2.3.2.
We have updated the dictionary and rules for newmm. If your model uses newmm for word tokenization, we recommend retraining it.
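
The version split above can be captured in a small helper. This is a minimal sketch: pick_pythainlp_requirement is a hypothetical name, and treating every interpreter newer than Python 3.6 as able to run the 3.0 beta is an assumption drawn from the note above, not a tested compatibility matrix.

```python
import sys


def pick_pythainlp_requirement(version_info=None):
    """Return a pip requirement string for the given Python version.

    Hypothetical helper. The cutoff follows the release note:
    Python 3.6 stays on PyThaiNLP 2.3.2, and newer interpreters are
    assumed (not verified here) to work with the 3.0 beta.
    """
    if version_info is None:
        version_info = sys.version_info
    major_minor = (version_info[0], version_info[1])
    if major_minor == (3, 6):
        return "pythainlp==2.3.2"
    if major_minor > (3, 6):
        return "pythainlp==3.0.0b0"
    # Older interpreters are not covered by the release note.
    raise ValueError("Python %d.%d is not covered by this note" % major_minor)


if __name__ == "__main__":
    # Prints the requirement string for the running interpreter.
    print(pick_pythainlp_requirement())
```

The returned string can be passed to pip as-is, for example in a setup script that has to support both interpreter generations.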

What is new?

Deprecation and other API changes

  • Deprecated syllable_tokenize; use subword_tokenize instead.
  • pythainlp.tag.named_entity.ThaiNameTagger has moved to pythainlp.tag.thainer.ThaiNameTagger. The old class will be deprecated in PyThaiNLP 3.1.
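
Code that has to run on both sides of these changes can probe for the new location first. The sketch below is only an illustration: load_thai_name_tagger and split_subwords are hypothetical helper names, and both return None when PyThaiNLP (or an optional dependency it needs) is not installed. The default subword_tokenize engine is used; these notes do not say which engine best matches the old syllable_tokenize output.

```python
import importlib


def load_thai_name_tagger():
    """Return the ThaiNameTagger class, preferring the new module path.

    Hypothetical helper: tries pythainlp.tag.thainer first, then the
    old pythainlp.tag.named_entity location, and returns None if
    neither can be imported.
    """
    for module_name in ("pythainlp.tag.thainer", "pythainlp.tag.named_entity"):
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue
        tagger_class = getattr(module, "ThaiNameTagger", None)
        if tagger_class is not None:
            return tagger_class
    return None


def split_subwords(text):
    """Tokenize below word level with subword_tokenize (default engine).

    Returns None when PyThaiNLP is not importable.
    """
    try:
        from pythainlp.tokenize import subword_tokenize
    except ImportError:
        return None
    return subword_tokenize(text)
```

Callers should check for None before instantiating the tagger or using the token list.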

Augment

  • Add Thai Text Augmentation

Corpus

  • Fix lots of misspellings in dictionary (words_th.txt)
  • Add get_corpus_default_db and the thainer 1.5 model. You can now add a corpus to default_db.json, and you no longer need to download the latest thainer model from the Internet.

Tag

  • Add tltk (pos_tag and ner): a tltk wrapper for PyThaiNLP functions such as ner, word_tokenize, and more.
  • Add NER class for named-entity recognition tasks.

Translate

  • Add pythainlp.translate.Translate class
  • Add Chinese-Thai Machine Translation

Tokenization

  • Fix token_max_len bug that makes it always zero
  • Tokenize repeating dots and commas from numbers (fix #461)
  • Retrained sentenceseg_crfcut.model for PyThaiNLP 2.4
  • Add SEFR CUT to pythainlp
  • Add tltk (sentence_tokenize and word_tokenize): a tltk wrapper for PyThaiNLP functions such as ner, word_tokenize, and more.
  • Add nlpo3
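
To see how the new engines compare with newmm on your own text, something like the following may help. It is a sketch under assumptions: tokenize_with_engines is a hypothetical helper, and the engine names "nlpo3", "sefr_cut", and "tltk" are assumed to be the identifiers accepted by word_tokenize; each also needs its own optional package installed. Engines that fail to load are reported as None rather than raising.

```python
def tokenize_with_engines(text, engines=("newmm", "nlpo3", "sefr_cut", "tltk")):
    """Run word_tokenize once per engine and collect the token lists.

    Hypothetical helper. Returns an empty dict when PyThaiNLP itself
    is not importable; an engine maps to None when it cannot be used
    (for example, its optional backend is missing or the engine name
    is not recognized by the installed version).
    """
    try:
        from pythainlp.tokenize import word_tokenize
    except ImportError:
        return {}

    results = {}
    for engine in engines:
        try:
            results[engine] = word_tokenize(text, engine=engine)
        except Exception:
            # Optional backend missing or engine unavailable: skip it.
            results[engine] = None
    return results
```

Because the newmm dictionary and rules changed in this release, comparing its output against tokens produced by an older version is one way to judge whether a downstream model needs retraining.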

Transliterate

  • Refactor Royin transliteration: avoid nested if blocks and simplify consonant replacement operations
  • Manually merge update-royin branch with dev branch to add O-ANG rule
  • Add tltk (g2p and ipa): a tltk wrapper for PyThaiNLP functions such as ner, word_tokenize, and more.
  • Add pythainlp.transliterate.puan

Word Vector

  • Fix token_max_len bug that makes it always zero
  • Add pythainlp.word_vector.WordVector

Spell

  • Add more spelling engines
  • Add tltk (spell): a tltk wrapper for PyThaiNLP functions such as ner, word_tokenize, and more.

Generate

  • Add pythainlp.generate

Tool

  • Add misspell module

Other

  • Add tltk: a tltk wrapper for PyThaiNLP functions such as ner, word_tokenize, and more.
  • Update requirements from ssg 0.0.6 to ssg 0.0.8
  • Spoonerism: add support for words with more than 3 syllables
  • Add maiyamok, a function for preprocessing MaiYaMok (ๆ) in Thai sentences.
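
One common way to preprocess mai yamok is to expand the repetition mark into a copy of the preceding token. The sketch below illustrates that idea on an already tokenized sentence. It is not PyThaiNLP's implementation: expand_mai_yamok is a hypothetical function, and these notes do not state exactly what the library's maiyamok does.

```python
MAI_YAMOK = "\u0e46"  # THAI CHARACTER MAIYAMOK, the repetition mark


def expand_mai_yamok(tokens):
    """Replace each mai yamok token with a copy of the token before it.

    Illustrative only. A mai yamok with nothing before it is kept
    unchanged, since there is no token to repeat.
    """
    expanded = []
    for token in tokens:
        if token == MAI_YAMOK and expanded:
            expanded.append(expanded[-1])
        else:
            expanded.append(token)
    return expanded


if __name__ == "__main__":
    # "dek" + repetition mark + "chop": the mark becomes a second "dek".
    print(expand_mai_yamok(["\u0e40\u0e14\u0e47\u0e01", MAI_YAMOK, "\u0e0a\u0e2d\u0e1a"]))
```

The sketch assumes the tokenizer has already split the mark off as its own token; text where the mark is still attached to a word would need that step first.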

Contributors

Thanks to all the contributors. (Image made with contributors-img)


#PyThaiNLP #ThaiNLP