FAQ

What are the advantages/disadvantages of the different dictionaries?

From issue comment @ issue #277

A dictionary in kagome is a "set of morphemes" of type dict.Dict. Dictionaries differ in the information they contain.

IPADIC has a vocabulary of about 400,000 words and UniDIC about 750,000. IPADIC is suitable for memory-limited environments and most use cases, while UniDIC, with its shorter lexical units, is better suited to splitting words for search.

This means that UniDIC is more suitable than IPADIC if the main purpose is to use the Wakati() function.

For more pros and cons of the dictionaries, see "About the dictionary" | kagome | Wiki @ GitHub

What is the difference between `Tokenizer.Analyze()` and `Tokenizer.Tokenize()`?

From issue #293

t.Tokenize(s) is an alias of t.Analyze(s, tokenizer.Normal). The tokenizer.Normal argument specifies the segmentation mode used during analysis.

kagome has three segmentation modes:

Normal: Regular segmentation
Search: Uses a heuristic to do additional segmentation useful for search
Extended: Similar to Search mode, but also splits unknown words into uni-grams
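
The relationship between the two methods can be sketched with a toy tokenizer. This is a minimal illustration, not kagome's implementation: toyTokenizer, its known map, and the joined helper are hypothetical names, the input is split on spaces instead of being analyzed with a dictionary, and Search mode is not modeled (it behaves like Normal here). The sketch only shows that Tokenize delegates to Analyze with the Normal mode, and that an Extended-style mode breaks unknown words into single characters.

```go
package main

import (
	"fmt"
	"strings"
)

// Mode is a hypothetical stand-in for a segmentation mode type.
type Mode int

const (
	Normal Mode = iota
	Search // not modeled in this toy; treated like Normal
	Extended
)

// toyTokenizer is NOT kagome's tokenizer. It splits on spaces and treats
// any word missing from known as an unknown word.
type toyTokenizer struct {
	known map[string]bool
}

// Analyze segments s according to mode.
func (tk toyTokenizer) Analyze(s string, mode Mode) []string {
	var out []string
	for _, w := range strings.Fields(s) {
		if mode == Extended && !tk.known[w] {
			// Extended: emit an unknown word one character at a time.
			for _, r := range w {
				out = append(out, string(r))
			}
			continue
		}
		out = append(out, w)
	}
	return out
}

// Tokenize is shorthand for Analyze with the Normal mode.
func (tk toyTokenizer) Tokenize(s string) []string {
	return tk.Analyze(s, Normal)
}

// joined renders tokens with "|" separators so results are easy to compare.
func joined(tokens []string) string {
	return strings.Join(tokens, "|")
}

func main() {
	tk := toyTokenizer{known: map[string]bool{"tokyo": true, "tower": true}}
	in := "tokyo qzx tower"
	fmt.Println(joined(tk.Tokenize(in)))          // same result as the next line
	fmt.Println(joined(tk.Analyze(in, Normal)))   // unknown word kept whole
	fmt.Println(joined(tk.Analyze(in, Extended))) // unknown word split into characters
}
```

With the real library, the same choice is made by passing one of the three modes listed above as the second argument to Analyze.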

What is the difference between `Tokenizer.Wakati()` and `Tokenizer.Tokenize()`?

From issue #274

As you may know, Japanese, like several other Asian languages, is written without spaces between words. The word "wakati" means "word division" in Japanese, so Wakati() divides the text into word tokens. Imagine the following.

Wakati("thistextwritingissomewhatsimilartotheasianstyle.") --> this text writing is somewhat similar to the asian style.

Tokenizer.Wakati() simply divides the text into words. It is typically used to create space-separated metadata for full-text search, e.g. FTS5 in SQLite3.

Tokenizer.Tokenize() is similar to Wakati(), but each segmented chunk (token) carries more information, such as its part of speech. It is mostly used for grammatical analysis, text linting, and so on.
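
The difference in what the two calls return can be sketched as follows. This is a toy under stated assumptions, not kagome's code: the token struct, toyDict, tokenize, wakati, and wakatiLine are hypothetical names, the part-of-speech labels are made up for the English example above, and segmentation uses a greedy longest match, whereas a real morphological analyzer such as kagome chooses a segmentation with a cost-based lattice search.

```go
package main

import (
	"fmt"
	"strings"
)

// token is a hypothetical, simplified token: the surface text plus one
// piece of extra information (a part-of-speech label).
type token struct {
	Surface string
	POS     string
}

// toyDict maps each known word to a part of speech. Purely illustrative.
var toyDict = map[string]string{
	"this": "pronoun", "text": "noun", "writing": "noun", "is": "verb",
	"somewhat": "adverb", "similar": "adjective", "to": "preposition",
	"the": "article", "asian": "adjective", "style": "noun",
}

// tokenize does greedy longest-match segmentation against toyDict.
// Characters not covered by any entry become one-character "unknown" tokens.
func tokenize(s string) []token {
	rs := []rune(s)
	var out []token
	for i := 0; i < len(rs); {
		end, pos := i+1, "unknown"
		for j := len(rs); j > i; j-- {
			if p, ok := toyDict[string(rs[i:j])]; ok {
				end, pos = j, p
				break
			}
		}
		out = append(out, token{Surface: string(rs[i:end]), POS: pos})
		i = end
	}
	return out
}

// wakati keeps only the surfaces and drops the extra information.
func wakati(s string) []string {
	var out []string
	for _, tok := range tokenize(s) {
		out = append(out, tok.Surface)
	}
	return out
}

// wakatiLine joins the surfaces with spaces, e.g. for a full-text index.
func wakatiLine(s string) string {
	return strings.Join(wakati(s), " ")
}

func main() {
	s := "thistextwritingissomewhatsimilartotheasianstyle"
	fmt.Println(wakatiLine(s))
	for _, tok := range tokenize(s) {
		fmt.Printf("%s\t%s\n", tok.Surface, tok.POS)
	}
}
```

The first line printed is the space-separated form suited to a search index. The remaining lines pair each surface with its label, which is the kind of detail a grammar checker or linter would consume.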

What are the pros/cons of using the different dictionaries?

From issue #274

To split text into words (wakati), a word dictionary is needed to recognize the proper names, nouns, and other words in the text.

The difference between dictionaries is simply the number of words. The default built-in dictionary supports most of the important proper names, nouns, verbs, etc.

The "pro" of using a larger dictionary is, therefore, that it can separate words more accurately. Imagine the following.

Mr.McIntoshandMr.McNamara --> "Mr. Mc Into sh and Mr. Mc Namara" (smaller dictionary) or "Mr. McIntosh and Mr. McNamara" (larger dictionary)

The "cons" are higher memory usage and slower processing.
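
The effect of dictionary size on the example above can be reproduced with a small sketch. Everything here is an assumption made for illustration: segment, segmentLine, smallDict, and largeDict are hypothetical names, the word lists are invented, and the greedy longest-match rule is a stand-in for the cost-based search a real analyzer performs. The point is only that the same input splits differently once the dictionary knows the full proper names.

```go
package main

import (
	"fmt"
	"strings"
)

// segment performs greedy longest-match segmentation of s against dict.
// Characters not covered by any entry are emitted as one-character words.
func segment(s string, dict map[string]bool) []string {
	rs := []rune(s)
	var words []string
	for i := 0; i < len(rs); {
		end := i + 1
		for j := len(rs); j > i; j-- {
			if dict[string(rs[i:j])] {
				end = j
				break
			}
		}
		words = append(words, string(rs[i:end]))
		i = end
	}
	return words
}

// segmentLine joins the segmented words with spaces.
func segmentLine(s string, dict map[string]bool) string {
	return strings.Join(segment(s, dict), " ")
}

// smallDict is a made-up word list that lacks the full proper names.
var smallDict = map[string]bool{
	"Mr.": true, "Mc": true, "Into": true, "sh": true, "and": true, "Namara": true,
}

// largeDict is the same made-up list plus the two proper names.
var largeDict = map[string]bool{
	"Mr.": true, "Mc": true, "Into": true, "sh": true, "and": true, "Namara": true,
	"McIntosh": true, "McNamara": true,
}

func main() {
	in := "Mr.McIntoshandMr.McNamara"
	fmt.Printf("%d entries: %s\n", len(smallDict), segmentLine(in, smallDict))
	fmt.Printf("%d entries: %s\n", len(largeDict), segmentLine(in, largeDict))
}
```

The larger map gives the more accurate split, at the price of holding more entries in memory, which mirrors the trade-off described above.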
