feat(nori): add metadata support to Korean tokenizer #14969

Open · wants to merge 4 commits into main

@twosom (Contributor) commented Jul 20, 2025

Description

Summary

Adds metadata support to the Nori Korean analyzer, allowing users to attach additional information to words in the user dictionary.

Changes

  • Adds a MetadataAttribute interface and implementation
  • Extends the user dictionary format to support the word >> metadata syntax
  • Preserves metadata during compound word decomposition
  • Maintains backward compatibility with existing dictionaries (entries without >> still load as before)

Example

Dictionary:

자바 >> computer language
엘라스틱서치 엘라스틱 서치 >> search engine

Result:

  • 자바 → Term: "자바", Metadata: "computer language"
  • 엘라스틱서치 → All decomposed terms ("엘라스틱서치", "엘라스틱", "서치") carry "search engine" metadata
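
For readers who want to see the line format concretely, here is a minimal sketch of how a `word [segments...] >> metadata` line could be split. It is not the code in this PR: `UserDictEntry` and `UserDictLineParser` are hypothetical names, and the handling of whitespace around `>>` is an assumption. It uses only the JDK so it can be run without Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical holder for one parsed user-dictionary line (not a class from this PR). */
final class UserDictEntry {
  final String surface;        // the dictionary word, e.g. "엘라스틱서치"
  final List<String> segments; // decomposition terms; empty when none are listed
  final String metadata;       // text after ">>", or null when the line has no ">>"

  UserDictEntry(String surface, List<String> segments, String metadata) {
    this.surface = surface;
    this.segments = segments;
    this.metadata = metadata;
  }
}

/** Hypothetical parser illustrating the "word >> metadata" syntax described above. */
final class UserDictLineParser {
  private static final String SEPARATOR = ">>";

  static UserDictEntry parse(String line) {
    String wordPart = line;
    String metadata = null;

    // Assumption: the first ">>" splits the word part from the metadata.
    int sep = line.indexOf(SEPARATOR);
    if (sep >= 0) {
      wordPart = line.substring(0, sep);
      metadata = line.substring(sep + SEPARATOR.length()).trim();
    }

    // Existing format: the surface form, then optional whitespace-separated segments.
    String[] tokens = wordPart.trim().split("\\s+");
    List<String> segments =
        new ArrayList<>(Arrays.asList(tokens).subList(1, tokens.length));
    return new UserDictEntry(tokens[0], segments, metadata);
  }
}
```

With this sketch, `UserDictLineParser.parse("엘라스틱서치 엘라스틱 서치 >> search engine")` yields the surface form, the two segments, and one metadata string that a tokenizer could then copy onto every term it emits for that entry. A line without `>>` parses with `metadata == null`, which is one way to keep existing dictionaries working unchanged.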

Fixes #14940

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@github-actions github-actions bot added this to the 11.0.0 milestone Jul 20, 2025
@twosom twosom force-pushed the add_nori_metadata branch from 7be08b0 to 81ce2c8 Compare July 20, 2025 06:46

Successfully merging this pull request may close these issues.

[Nori] Add metadata support for Korean analyzer tokens