Improve pinyin fuzzy segment algorithm #88

Merged
merged 1 commit into from
Dec 7, 2024

wengxt (Member) commented Dec 6, 2024

Previously, we blindly chose the segmentation that always prefers the longer
next match. This proves wrong in the case of "sangeren".

That input should produce "san ge ren", "sang er en", and "sang e ren".
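
As a rough illustration of why all three results are expected, the sketch below enumerates every way to split "sangeren" into complete syllables. It is not the library's implementation: `SYLLABLES` is a tiny hypothetical table chosen for this example, and `full_segmentations` is a made-up helper name.

```python
# Hypothetical, deliberately tiny syllable table; a real table covers all pinyin.
SYLLABLES = {"sa", "san", "sang", "ge", "e", "er", "en", "re", "ren"}


def full_segmentations(text):
    """Return every split of text into complete syllables from SYLLABLES."""
    if not text:
        return [[]]  # one way to split the empty string: no pieces
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in SYLLABLES:
            # Keep this head only if the remainder can also be fully split.
            for rest in full_segmentations(text[end:]):
                results.append([head] + rest)
    return results


for seg in full_segmentations("sangeren"):
    print(" ".join(seg))
```

With this table the loop prints the three segmentations listed above. A strategy that only ever takes the longest head ("sang") can never reach "san ge ren".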

Instead, we change the check to be:
if (current + next match) is valid and the next match is a complete pinyin,
make it an acceptable option, unless (current, next match) is actually an
inner fuzzy, which is handled separately below.

For example:

  1. For "sangeren", both "sang" and "san" are produced, since the next match
    after "san", which is "ge", is a complete pinyin.
  2. For "hua", only "hua" is produced, since "hu a" is an inner fuzzy.
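
The rule and the two examples above can be sketched roughly as follows. This is a simplified model, not the code from the commit: `SYLLABLES`, `longest_match`, `is_inner_fuzzy`, and `head_options` are hypothetical names, and "inner fuzzy" is assumed here to mean that the two pieces join into one complete syllable (as "hu" + "a" = "hua").

```python
# Hypothetical, deliberately tiny syllable table for illustration only.
SYLLABLES = {"sa", "san", "sang", "ge", "e", "er", "en", "re", "ren",
             "hu", "hua", "a"}
MAX_LEN = max(len(s) for s in SYLLABLES)


def longest_match(text):
    """Return the longest complete syllable that prefixes text, or None."""
    for n in range(min(len(text), MAX_LEN), 0, -1):
        if text[:n] in SYLLABLES:
            return text[:n]
    return None


def is_inner_fuzzy(head, nxt):
    # Assumption: the pair is "inner fuzzy" when it forms a single syllable.
    return head + nxt in SYLLABLES


def head_options(text):
    """Return the acceptable first segments: the longest match plus any
    shorter complete syllable whose next match is also a complete syllable
    and is not an inner-fuzzy pair."""
    longest = longest_match(text)
    if longest is None:
        return []
    options = [longest]
    for cut in range(len(longest) - 1, 0, -1):
        head = text[:cut]
        if head not in SYLLABLES:
            continue
        nxt = longest_match(text[cut:])
        if nxt is None or is_inner_fuzzy(head, nxt):
            continue  # inner fuzzy is assumed to be handled elsewhere
        options.append(head)
    return options


print(head_options("sangeren"))
print(head_options("hua"))
```

Under these assumptions, "sangeren" yields both "sang" and "san" as first segments, while "hua" yields only "hua" because "hu" + "a" is filtered out as an inner fuzzy.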

This can produce "extra" segments. For example, "sanger" will produce
"san" "ge" "r", which ends in a partial pinyin. We may still consider this
reasonable, since a partial pinyin match is considered fuzzy and will have
a penalty score.
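
A minimal sketch of how such a penalty might be attached to a segmentation is shown below. The table, the `PARTIAL_PENALTY` value, and the function names are all assumptions made for illustration; the actual scoring is not described in this pull request.

```python
# Hypothetical syllable table and penalty value, for illustration only.
SYLLABLES = {"sa", "san", "sang", "ge", "e", "er", "en", "re", "ren"}
PARTIAL_PENALTY = 1


def is_partial(piece):
    """A piece is partial if it is not a complete syllable but is a
    prefix of at least one complete syllable."""
    return piece not in SYLLABLES and any(
        s.startswith(piece) for s in SYLLABLES
    )


def penalty(segmentation):
    """Sum one penalty unit for every partial piece in the segmentation."""
    return PARTIAL_PENALTY * sum(1 for p in segmentation if is_partial(p))


print(penalty(["san", "ge", "r"]))  # "r" is partial (prefix of "re", "ren")
print(penalty(["sang", "er"]))      # every piece is a complete syllable
```

In this model the extra segmentation is still offered, but it carries a cost that a fully complete segmentation does not.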

People may even benefit from such a segmentation, since "san ge r" seems to be
the most likely option.

Fix #87

@wengxt wengxt merged commit 5bb49ec into master Dec 7, 2024
4 checks passed
@wengxt wengxt deleted the segment branch December 7, 2024 20:33
@wengxt wengxt restored the segment branch December 20, 2024 18:38
@wengxt wengxt deleted the segment branch December 20, 2024 18:38
Successfully merging this pull request may close these issues.

sangeren should give some result with san ge ren