KEINOS edited this page Jan 31, 2023 · 5 revisions

What are the advantages and disadvantages of the different dictionaries?

Dictionaries in kagome are a "set of morphemes" of dict.Dict type, and the differences are the information contained in the dictionary.

IPADIC has a vocabulary of about 400,000 words and UniDic about 750,000. IPADIC is suitable for memory-limited environments and most use cases, while UniDic, with its shorter lexical units, is better suited to splitting words for search.

This means that UniDic is a better choice than IPADIC if the main purpose is to use the Wakati() function.

For more pros and cons of the dictionaries, see the "About the dictionary" page of the kagome wiki on GitHub.

What is the difference between Tokenizer.Analyze() and Tokenizer.Tokenize()?

t.Tokenize(s) is an alias of t.Analyze(s, tokenizer.Normal). The second argument, tokenizer.Normal, selects the segmentation mode used during analysis.

kagome has three segmentation modes.

  • Normal: Regular segmentation
  • Search: Use a heuristic to do additional segmentation useful for search
  • Extended: Similar to Search mode, but also splits unknown words into uni-grams

What is the difference between Tokenizer.Wakati() and Tokenizer.Tokenize()?

As you may know, Japanese text, like text in several other Asian languages, is written without spaces between words. The word "wakati" means "word divide" in Japanese. Thus, Wakati() helps to divide the text into word tokens. Imagine the following English analogy.

  • Wakati("thistextwritingissomewhatsimilartotheasianstyle.") --> this text writing is somewhat similar to the asian style.

Tokenizer.Wakati() simply divides the text into words, which can then be joined with spaces. It is typically used to create metadata for full-text search, e.g. FTS5 in SQLite3.

Tokenizer.Tokenize() is similar to Wakati(), but each resulting token carries more information than just the word itself, such as its part of speech. It is mostly used for grammatical analysis, text linting, and so on.

What are the pros/cons of using the different dictionaries?

To do the wakati thing, a word dictionary is needed to recognize which parts of the text are proper names, nouns, and so on.

The difference between dictionaries is mainly the number of words they contain. The default built-in dictionary supports most of the important proper names, nouns, verbs, etc.

The "pro" of using a different, larger dictionary is therefore that it can separate words more accurately. Imagine the following.

  • Mr.McIntoshandMr.McNamara --> "Mr. Mc Into sh and Mr. Mc Namara" (smaller dictionary) or "Mr. McIntosh and Mr. McNamara" (larger dictionary)

The "cons" are higher memory usage and slower processing.