Improve pinyin fuzzy segement algorithm by wengxt · Pull Request #88 · fcitx/libime

wengxt · 2024-12-06T21:52:17Z

Previously, we blindly chose the segmentation that always prefers the
longer next match. This proves wrong in the case of "sangeren", which
should produce "san ge ren", "sang er en", and "sang e ren".
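
To make the expected result concrete, here is a minimal sketch that
enumerates every split of "sangeren" into complete syllables. It is not
libime's implementation: SYLLABLES is a tiny hypothetical table and
segmentations() is an illustrative helper, both made up for this example.

```python
# Toy syllable table: a hypothetical subset, NOT libime's real pinyin table.
SYLLABLES = {"san", "sang", "ge", "e", "er", "en", "ren"}


def segmentations(text, syllables=SYLLABLES):
    """Return every way to split `text` into syllables from the table."""
    if not text:
        return [[]]  # one way to split the empty string: no syllables
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in syllables:
            # Keep this prefix and split the remainder recursively.
            for rest in segmentations(text[end:], syllables):
                results.append([head] + rest)
    return results


for seg in segmentations("sangeren"):
    print(" ".join(seg))
# san ge ren
# sang e ren
# sang er en
```

With this toy table, a segmenter that only ever keeps the longest first
match ("sang") would never reach "san ge ren".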

Instead, the check becomes: if (current + next match) is valid and is
complete pinyin, accept it as an option, unless (current, next match) is
actually an inner fuzzy split, which is handled separately below.

For example:

1. For "sangeren", this produces both "sang" and "san", since the next
   match after "san", which is "ge", is a complete pinyin.
2. For "hua", this produces only "hua", since "hu a" is an inner fuzzy.
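
The two examples can be sketched as follows. This is one reading of the
rule under stated assumptions, not the code in this PR: SYLLABLES is a
hypothetical toy table, longest_match() and first_segment_options() are
made-up helper names, and "inner fuzzy" is approximated here as "current
+ next match is itself one complete syllable".

```python
# Toy syllable table: a hypothetical subset, NOT libime's real pinyin table.
SYLLABLES = {"san", "sang", "ge", "e", "er", "en", "ren", "hu", "hua", "a"}


def longest_match(text, start, syllables=SYLLABLES):
    """Longest table syllable beginning at `start`, or '' if there is none."""
    for end in range(len(text), start, -1):
        if text[start:end] in syllables:
            return text[start:end]
    return ""


def first_segment_options(text, syllables=SYLLABLES):
    """Acceptable first segments of `text` under the sketched rule."""
    longest = longest_match(text, 0, syllables)
    if not longest:
        return []
    options = [longest]
    # Try each shorter prefix of the longest match as an alternative.
    for cut in range(len(longest) - 1, 0, -1):
        current = text[:cut]
        if current not in syllables:
            continue
        nxt = longest_match(text, cut, syllables)
        if not nxt:
            continue  # what follows is not a complete syllable
        # Assumption: treat the pair as inner fuzzy when joining it gives
        # a single complete syllable (e.g. "hu" + "a" == "hua"); skip it.
        if current + nxt in syllables:
            continue
        options.append(current)
    return options


print(first_segment_options("sangeren"))  # ['sang', 'san']
print(first_segment_options("hua"))       # ['hua']
```

The sketch only decides the first segment; a full segmenter would apply
the same check again at each following position.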

This may produce an "extra" segmentation. For example, "sanger" will
produce "san" "ge" "r", which ends in a partial pinyin. We still
consider this reasonable, since a partial pinyin match is treated as
fuzzy and carries a score penalty.

People may even benefit from such a segmentation, since "san ge r" seems
to be the most likely option.

Fix #87


wengxt merged commit 5bb49ec into master Dec 7, 2024

wengxt deleted the segment branch December 7, 2024 20:33

wengxt restored the segment branch December 20, 2024 18:38

wengxt deleted the segment branch December 20, 2024 18:38
