Add guide on multilingual datasets #3359
Since the "Guides" section atm is very focussed on integrating Meilisearch with third-party libs/tools (e.g. use meilisearch with mistral, deploy on AWS, etc) I'm thinking that adding it to /learn/indexing
might actually make more sense. What do you think?
Other than that, I made a couple of suggestions to reduce the size of the article a bit.
Handling multilingual datasets in Meilisearch requires careful planning of both indexing and querying. By choosing the right indexing strategy and explicitly configuring languages with `localizedAttributes` and `locales`, you ensure that documents and queries are processed consistently.

This alignment leads to more accurate results, better relevancy across languages, and a smoother search experience for users, no matter what language they search in.
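To make this concrete, here is a minimal sketch using the JavaScript client. The `movies` index, the language-suffixed field names (`title_en`, `title_ja`), the host, and the API key are all hypothetical, and the example assumes a Meilisearch version (v1.10 or later) and a client release that support the `localizedAttributes` setting and the `locales` search parameter:

```ts
import { MeiliSearch } from 'meilisearch'

const client = new MeiliSearch({ host: 'http://localhost:7700', apiKey: 'aSampleMasterKey' })
const index = client.index('movies')

// Declare which locales each attribute contains so that documents
// are tokenized with the right pipeline at indexing time.
// (Settings updates are queued as tasks; in real code, wait for the
// task to finish before relying on the new settings.)
await index.updateSettings({
  localizedAttributes: [
    { attributePatterns: ['*_en'], locales: ['eng'] },
    { attributePatterns: ['*_ja'], locales: ['jpn'] },
  ],
})

// Declare the query language at search time so the query is
// tokenized the same way as the documents it should match.
const results = await index.search('進撃の巨人', { locales: ['jpn'] })
console.log(results.hits)
```

Keeping the indexing-side `locales` and the search-time `locales` aligned is what makes the two pipelines agree.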
## Tokenizers and language differences

Search quality in Meilisearch depends heavily on how text is broken down into tokens. Because each language has its own writing system and rules, different languages call for different tokenization strategies:

- **Space-separated languages** (English, French, Spanish): Words are clearly separated by spaces, making them straightforward to tokenize.
- **Non-space-separated languages** (Chinese, Japanese): Words are written continuously without spaces. These languages require specialized tokenizers to correctly split text into searchable units (see the sketch after this list).
- **Languages with compound words** (German, Swedish): Words can be combined to form long terms, such as _Donaudampfschifffahrtsgesellschaft_ (German for Danube Steamship Company). Meilisearch provides specialized tokenizers to process them correctly.
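For non-space-separated languages, automatic language detection can struggle with short queries, so it often helps to state the locale explicitly. A minimal sketch, again with a hypothetical `movies` index, host, and API key; the ISO-639-3 locale codes (`jpn`, `cmn`) assume a Meilisearch version that accepts them:

```ts
import { MeiliSearch } from 'meilisearch'

const client = new MeiliSearch({ host: 'http://localhost:7700', apiKey: 'aSampleMasterKey' })
const index = client.index('movies')

// The same CJK characters can plausibly be Japanese or Chinese; an explicit
// locale tells Meilisearch which tokenizer pipeline to apply to the query.
const japanese = await index.search('東京', { locales: ['jpn'] })
const chinese = await index.search('东京', { locales: ['cmn'] })
```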
### Normalization differences

Normalization ensures that different spellings or character variations (like accents or case differences) are treated consistently during search.

- **Accents and diacritics**: In many languages, accents can often be ignored without losing meaning (e.g., _éléphant_ vs _elephant_). In other languages, such as Swedish, diacritics may represent entirely different letters, so they must be preserved.
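To illustrate the first case, here is a minimal sketch, assuming a hypothetical `articles` index where a document such as `{ "id": 1, "title": "Un éléphant dans la pièce" }` has already been indexed:

```ts
import { MeiliSearch } from 'meilisearch'

const client = new MeiliSearch({ host: 'http://localhost:7700', apiKey: 'aSampleMasterKey' })

// The unaccented query still matches the accented document, because
// default normalization maps 'éléphant' and 'elephant' to the same tokens.
const { hits } = await client.index('articles').search('elephant')
console.log(hits)
```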
This content is already available in the tokenization article and is to some extent theoretical knowledge. Given that we generally want guides to be more oriented towards solving problems, and that this article is pretty long, I think we could skip these two sections and perhaps add links to the tokenization article in the middle of the body copy.
Pull Request

## Related issue

Fixes #<issue_number>

## What does this PR do?

## PR checklist

Please check if your PR fulfills the following requirements:

Thank you so much for contributing to Meilisearch!